Blog

Technical articles on AI evaluation, failure modes, and what we're learning from production deployments.

All posts

A Living Map of LLM Failure Modes: Dominant, Emerging, Squashed

Most failure-mode analysis treats failures as a static taxonomy. In production they're a living distribution: some dominant, some emerging, some being squashed by recent fixes. Here's a clustering pipeline that tracks all three.

From one judge to a learning system

Build or buy eval infrastructure? The six layers after a baseline judge (ensemble through meta-eval), why a first slice of scoring is not the same milestone as release-grade depth, and how we benchmark quality (71% baseline judge to 83.6% on our internal set).

What We Found Inside Clinical AI Systems That Were Passing Every Eval

Findings from clinical AI engagements: actual failure patterns from production clinical AI, categorised by type, with real examples. Covers discussions becoming decisions, dangerous omissions, dosage errors, and diagnostic leaps.

Improving LLM Judges With Experiments, Not Vibes

Our open-source research on RewardBench 2 shows that two simple techniques — task-specific criteria injection and ensembling — improve LLM judge accuracy by up to 13.5pp (71.7% baseline → 85.8%).

An Ontology of LLM Failure Modes

A structured taxonomy of 60+ failure modes across eight categories, synthesizing recent research into a practical framework for understanding how and why large language models fail.

Composo Align Platform Release

Introducing Composo: AI evaluation that learns your standards. Not manual review, not LLM-as-judge, but a third option that gets better the more you use it.

How Composo Works Under The Hood

A deep dive into Composo's generative reward model architecture that achieves 95% agreement with expert evaluators, compared to ~70% for LLM-as-judge approaches.

Guide To Evaluating LangGraph Agents

A practical guide to evaluating LangGraph multi-agent workflows using Composo's agent evaluation framework with quantitative scoring across 5 key dimensions.

LLMs: Great Witnesses, Terrible Judges

LLM-as-a-judge consistency is largely illusory. The same hallucination produces wildly different scores depending on scale configuration, undermining evaluation trust.

AI Scribe Failures: The Lawsuits, the Patterns, and What Evaluation Should Catch

Published AI scribe failure rates are higher than vendor marketing suggests. Real lawsuits have started. Here is what the specific failure patterns look like and what clinical evaluation needs to catch.

The Complete Guide to Evaluating Tools & Agents

A component-based evaluation framework for agentic LLM systems covering tool call formulation, tool choice, response integration, reasoning evaluation, and system-level analysis.

Evaluating Clinical AI: A Practical Guide

A practical guide to evaluating clinical AI in production: specific failure modes to catch, why generic evaluation misses them, regulatory context, and what good quality infrastructure looks like.

Evaluating LLMs on Structured Classification Tasks

A comprehensive guide to evaluating LLM classification quality, covering supervised metrics, generative reward models, and LLM-as-judge approaches.

The Complete Guide to RAG Evaluation

A comprehensive guide to evaluating RAG applications, covering generation metrics, retrieval assessment, and advanced CAG-based oracle evaluation techniques.

Composo Align achieves state-of-the-art performance in evals

Composo Align achieves 95% agreement with expert preferences vs 72% for LLM-as-judge, with 100% score consistency through its deterministic generative reward model.

Introducing Composo Align

Composo Align uses a generative reward model architecture to provide deterministic, consistent scoring for LLM evaluation, achieving 95% agreement with expert preferences.

What Is LLM Evaluation? A Practical Explanation

A plain-language explanation of LLM evaluation: what it is, why LLM-as-judge plateaus, what production AI quality actually requires, and how to think about build vs buy.

Eval Drift: Why Your LLM Evaluations Stop Working (And How to Detect It)

LLM evaluations that worked last quarter can silently stop working this quarter. Evaluation drift is a first-class failure mode of production AI quality systems. Here is how to detect it and what to do about it.

Build vs Buy: Should You Build Your Own LLM Evaluation Pipeline?

An honest analysis of the real cost of building an LLM evaluation pipeline internally vs buying a deployed quality layer. Real numbers, real timelines, real trade-offs.

The Ultimate Guide to LLM App Evaluation

A structured guide to evaluating LLM applications, covering common challenges with human vibe checks and LLM-as-judge, and key steps to building a reliable evaluation framework.

AI Guardrails: How to Block Bad LLM Outputs in Production

A practical guide to AI guardrails for production LLM systems. What they catch, what they miss, and how to deploy domain-specific guardrails that actually work at inference time.