
Blog

Technical articles on AI evaluation, failure modes, and what we're learning from production deployments.

All posts

From One Judge to a Learning System

How to go from one judge to a full eval loop: the layers involved, realistic timelines (useful signal in ~8-10 weeks, i.e. 4-5 two-week sprints; a mature in-house stack often takes 6+ months), build-vs-integrate decisions, and how we measure quality (~71% accuracy for a baseline judge, up to 83.6% on our benchmark).

What We Found Inside Clinical AI Systems That Were Passing Every Eval

Findings from clinical AI engagements: actual failure patterns from production clinical AI, categorised by type and illustrated with real examples, including discussions becoming decisions, dangerous omissions, dosage errors, and diagnostic leaps.

Improving LLM Judges With Experiments, Not Vibes

Our open-source research on RewardBench 2 shows that three simple techniques — ensembling, mini models, and task-specific criteria — improve LLM judge accuracy from 71.7% to 83.6%.

An Ontology of LLM Failure Modes

A structured taxonomy of 60+ failure modes across eight categories, synthesizing recent research into a practical framework for understanding how and why large language models fail.

Composo Align Platform Release

Introducing Composo: AI evaluation that learns your standards. Not manual review, not LLM-as-judge, but a third option that gets better the more you use it.

How Composo Works Under The Hood

A deep dive into Composo's generative reward model architecture that achieves 95% agreement with expert evaluators, compared to ~70% for LLM-as-judge approaches.

Guide To Evaluating LangGraph Agents

A practical guide to evaluating LangGraph multi-agent workflows using Composo's agent evaluation framework with quantitative scoring across 5 key dimensions.

LLMs: Great Witnesses, Terrible Judges

LLM-as-a-judge consistency is largely illusory. The same hallucination produces wildly different scores depending on scale configuration, undermining evaluation trust.

The Complete Guide to Evaluating Tools & Agents

A component-based evaluation framework for agentic LLM systems covering tool call formulation, tool choice, response integration, reasoning evaluation, and system-level analysis.

Evaluating LLMs on Structured Classification Tasks

A comprehensive guide to evaluating LLM classification quality, covering supervised metrics, generative reward models, and LLM-as-judge approaches.

The Complete Guide to RAG Evaluation

A comprehensive guide to evaluating RAG applications, covering generation metrics, retrieval assessment, and advanced CAG-based oracle evaluation techniques.

Composo Align achieves state-of-the-art performance in evals

Composo Align achieves 95% agreement with expert preferences vs 72% for LLM-as-judge, with 100% score consistency through its deterministic generative reward model.

Introducing Composo Align

Composo Align uses a generative reward model architecture to provide deterministic, consistent scoring for LLM evaluation, achieving 95% agreement with expert preferences.

The Ultimate Guide to LLM App Evaluation

A structured guide to evaluating LLM applications, covering common challenges with human vibe checks and LLM-as-judge, and key steps to building a reliable evaluation framework.