
Blog

Technical articles on AI evaluation, failure modes, and what we're learning from production deployments.

All posts

From One Judge to a Learning System

How to go from one judge to a full eval loop: the layers involved, realistic timelines (useful signal in ~8-10 weeks, i.e. 4-5 two-week sprints; a mature in-house stack often takes 6+ months), build-vs-integrate decisions, and how we measure quality (~71% accuracy for a baseline judge, up to 83.6% on our benchmark).

What We Found Inside Clinical AI Systems That Were Passing Every Eval

Findings from clinical AI engagements: actual failure patterns from production clinical AI, categorised by type and illustrated with real examples, including discussions becoming decisions, dangerous omissions, dosage errors, and diagnostic leaps.

Improving LLM Judges With Experiments, Not Vibes

Our open-source research on RewardBench 2 shows that three simple techniques — ensembling, mini models, and task-specific criteria — improve LLM judge accuracy from 71.7% to 83.6%.

An Ontology of LLM Failure Modes

A structured taxonomy of 60+ failure modes across eight categories, synthesizing recent research into a practical framework for understanding how and why large language models fail.

Composo Align Platform Release

Introducing Composo: AI evaluation that learns your standards. Not manual review, not LLM-as-judge, but a third option that gets better the more you use it.

How Composo Works Under The Hood

A deep dive into Composo's generative reward model architecture that achieves 95% agreement with expert evaluators, compared to ~70% for LLM-as-judge approaches.

Guide To Evaluating LangGraph Agents

A practical guide to evaluating LangGraph multi-agent workflows using Composo's agent evaluation framework with quantitative scoring across 5 key dimensions.

LLMs: Great Witnesses, Terrible Judges

LLM-as-a-judge consistency is largely illusory. The same hallucination produces wildly different scores depending on scale configuration, undermining evaluation trust.

The Complete Guide to Evaluating Tools & Agents

A component-based evaluation framework for agentic LLM systems covering tool call formulation, tool choice, response integration, reasoning evaluation, and system-level analysis.

Evaluating LLMs on Structured Classification Tasks

A comprehensive guide to evaluating LLM classification quality, covering supervised metrics, generative reward models, and LLM-as-judge approaches.

The Complete Guide to RAG Evaluation

A comprehensive guide to evaluating RAG applications, covering generation metrics, retrieval assessment, and advanced CAG-based oracle evaluation techniques.

Composo Align achieves state-of-the-art performance in evals

Composo Align achieves 95% agreement with expert preferences vs 72% for LLM-as-judge, with 100% score consistency through its deterministic generative reward model.

Introducing Composo Align

Composo Align uses a generative reward model architecture to provide deterministic, consistent scoring for LLM evaluation, achieving 95% agreement with expert preferences.

The Ultimate Guide to LLM App Evaluation

A structured guide to evaluating LLM applications, covering common challenges with human vibe checks and LLM-as-judge, and key steps to building a reliable evaluation framework.