Over the coming weeks we'll be sharing the technical details behind our evals engine. This is the first instalment in that series, covering the motivation, the architecture, and the high-level benchmarking work we've done. Subsequent posts will dive into each of these areas in much more detail, along with experiment findings, implementation details, and code for reproducing our results.
Most teams evaluating LLM outputs face an uncomfortable choice. Numeric scores from LLM-as-judge are unstable, poorly calibrated, and often meaningless. Boolean assertions are reliable but lose granularity—you can check if a response was hallucinated, but not how badly. Neither approach scales to the nuanced quality judgments that production systems require.
Composo provides an alternative. The system uses a generative reward model architecture that produces deterministic, well-calibrated scores from single-sentence natural language criteria. It works out of the box without labelled data, learns from each customer's evaluation history automatically, and can optionally incorporate ground truth annotations and external data sources.
Evaluation isn't just a testing concern—it's the feedback loop that shapes your entire LLM application. When that feedback loop is noisy, the costs compound.
Bad prompts ship to production. If your evaluation can't distinguish between a 0.6 and a 0.8 response, you can't tell whether your prompt changes actually improved anything. Teams end up shipping changes based on noise, or worse, avoiding changes because they can't measure impact.
Regressions slip through. A model update or prompt tweak that degrades 15% of responses may not move your average score if that score was already bouncing around by ±20% due to evaluation variance. You discover the regression when users complain, not when you could have caught it in CI.
A/B tests mislead. When your evaluation metric has high variance, you need dramatically more samples to detect real differences. Teams either run underpowered tests and draw false conclusions, or give up on quantitative comparison entirely.
You can't automate quality gates. If you don't trust your scores, you can't use them to block deployments or trigger alerts. Every release requires manual review, which doesn't scale.
The irony is that teams adopt LLM-as-judge specifically to automate evaluation—and then can't trust the automation enough to act on it.
LLMs face significant challenges when used for evaluation tasks. Recent research systematically testing LLM judges across multiple models and scoring formats has confirmed patterns we've observed in production: numeric scores are fragile, poorly calibrated, and often misleading.
LLM judges don't degrade gracefully as output quality worsens. Instead, scores saturate quickly and then stop distinguishing between moderate and severe problems. A response with 20% hallucinated content and one with 60% hallucination may receive identical scores because the judge's output has already hit a ceiling.
Controlled experiments demonstrate this clearly: when researchers introduced progressively more errors into text passages, scores plateaued early and collapsed into narrow bands. Light and heavy corruption became indistinguishable. The expected linear relationship between quality and score simply doesn't exist—instead, you get discontinuous jumps, long flat stretches, and clustering around arbitrary values.
The same response with an obvious hallucination can receive 60% on a 1-5 scale, 30% on a 1-10 scale, and 85% on a 1-100 scale. These scores reflect the model's sensitivity to scale configuration rather than any calibrated understanding of quality. Without explicit training on what different score values mean, the mapping from quality assessment to numerical output is essentially arbitrary.
Testing across multiple formats—1-10, 0-1, -1 to 1, and letter grades (A-E)—shows that none produce a smooth correlation with actual quality. Changing the numeric range shifts the appearance of curves but doesn't address the underlying instability. Letter grades reduce variance but collapse fine distinctions, functioning more like categorical bins than calibrated numbers: most passages cluster into A-C, with D and E only appearing at extreme degradation levels.
Even with temperature set to zero, LLM scoring exhibits instability. While reasoning and judgments have become more consistent in recent models, final numerical scores still vary across runs. If you receive a score of 0.4, the true mean might actually be 0.6—you won't know without repeated sampling.
This means you're not getting a point estimate; you're sampling from a distribution. Distribution plots make this visible: what looks like a single score is actually a spread, and neighbouring quality levels often overlap entirely. This makes it harder to distinguish genuine regressions from evaluation noise in CI pipelines, and complicates detection of meaningful differences in A/B tests.
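The practical consequence: if you do use an LLM judge, treat each score as a draw from a distribution and estimate its spread rather than trusting a single call. The sketch below is illustrative only; judge_score is a simulated stand-in for whatever judge prompt and model you actually use.

```python
import random
import statistics

def judge_score(response: str) -> float:
    # Stand-in for a real LLM-as-judge call; simulated here as a noisy draw
    # to mimic the run-to-run variance described above.
    return min(1.0, max(0.0, random.gauss(0.6, 0.1)))

def score_with_uncertainty(response: str, n: int = 20) -> tuple[float, float]:
    # Sample the judge repeatedly and report mean and spread,
    # rather than trusting any single draw.
    samples = [judge_score(response) for _ in range(n)]
    return statistics.mean(samples), statistics.stdev(samples)

mean, spread = score_with_uncertainty("some model output")
print(f"score = {mean:.2f} ± {spread:.2f}")
```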
Scores from different LLM judges cannot be compared. GPT and Claude may both produce medians that rise with corruption density, but they differ in scale and behaviour. The same passage scored lower by one judge may be scored higher by another. This makes benchmarking across judge models unreliable and means any score is only meaningful relative to a specific, fixed judge configuration.
Research across model families—including OpenAI, Anthropic, and open-source models like Qwen—confirms this pattern holds regardless of model capability. Even reasoning-optimised models like o3 reduce variance but don't solve the fundamental calibration problem: outputs remain discrete tokens without grounded meaning.
Getting reliable results from LLM-as-judge requires extensive prompt engineering: detailed rubrics, few-shot examples, careful scale definitions. This optimisation work is domain-specific and often needs to be repeated for each new evaluation criterion. Even then, the instabilities described above persist.
A common response to these problems is to abandon numerical scores entirely and use boolean assertions: specific pass/fail checks for each test case. Research supports this approach for stability—binary judgments consistently separate clean from corrupted passages with low variance across runs.
However, there's a fundamental information loss. To capture the same granularity as a 10-point scale, you would need a separate boolean assertion for each threshold on that scale: nine pass/fail checks just to reproduce ten quality levels. In practice, teams end up with a sparse set of assertions that collapse meaningful quality distinctions into binary outcomes.
The problems with LLM-as-judge stem from a fundamental mismatch: language models output discrete tokens, not calibrated measurements. Asking a model to "rate this response from 1-10" forces it to map a complex quality judgment onto an arbitrary numeric scale it was never trained to use meaningfully.
Composo takes a different approach. Rather than asking models to generate scores directly, we use them for what they're good at—reasoning about quality in natural language—and handle scoring through a purpose-built reward model trained specifically on quality distributions. This separates the judgment from the measurement.
Composo uses a Generative Reward Model architecture that combines post-trained frontier models (up to 8 models, ensembled) with a custom reward head.

The reward head is trained on 1M+ expert-labelled data points from three sources:
When a trace and evaluation criteria are submitted, the system first assesses the input's ambiguity and complexity. Straightforward inputs with clear criteria can be processed with lower latency through a faster evaluation path. More complex or ambiguous inputs are routed to more extensive processing in the next stage. This allows the system to deliver faster results on simpler cases while maintaining full accuracy on harder ones.
Based on complexity routing, Composo generates multiple reasoning traces using an ensemble of up to 8 frontier-class models. These models are post-trained specifically for critical analysis rather than helpfulness - they focus on criteria adherence and are trained to surface issues rather than optimistically overlook them. Each model produces its own reasoning trace, creating a distribution of analytical perspectives.
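To make the pipeline shape concrete, here is a minimal sketch. Everything below is illustrative: the routing rule, ensemble sizes, and function names are assumptions rather than Composo's actual internals, and the reward head and median aggregation it references are described in the following sections.

```python
import statistics
from typing import Callable

def evaluate(trace: str, criterion: str,
             ensemble: list[Callable[[str, str], str]],   # post-trained critic models
             reward_head: Callable[[str], float],          # maps a reasoning trace to a 0-1 score
             is_complex: Callable[[str, str], bool]) -> float:
    # Stage 1: complexity routing - straightforward inputs take a faster path
    # (assumed here to mean a smaller ensemble)
    critics = ensemble if is_complex(trace, criterion) else ensemble[:2]

    # Stage 2: each model writes its own critical reasoning trace
    reasoning_traces = [critic(trace, criterion) for critic in critics]

    # Stage 3: the reward head scores each trace (see next section)
    scores = [reward_head(r) for r in reasoning_traces]

    # Stage 4: aggregate by median, reported to 2 decimal places
    return round(statistics.median(scores), 2)
```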
Rather than asking models to generate numerical scores, Composo uses a custom-trained regression-based reward model that understands quality through learned distributions.
This directly addresses the calibration problem. Humans struggle to assign meaningful absolute numbers—is this response a 7 or an 8?—but reliably judge relative quality: "A is better than B." The same is true of language models. By training on comparisons rather than absolute scores, we sidestep the arbitrary numeric mapping that makes LLM-as-judge unstable.
The reward model is trained using the Bradley-Terry framework, based on pairwise preference comparisons (similar to Elo rankings in chess). During training, expert labellers review outputs across domains and indicate which of two outputs is better. Through tens of thousands of these comparisons, the model learns the quality spectrum without arbitrary numerical anchors. Rather than being told "this is 7/10 because it has minor errors", the model learns through comparisons: "Output A is better than B because it's more accurate, but C is better than A because it's more complete."
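In its standard form (Composo's exact training objective isn't spelled out here), the Bradley-Terry model assigns each output a latent reward and models the probability that one output is preferred over another:

$$P(A \succ B) = \frac{e^{r_A}}{e^{r_A} + e^{r_B}} = \sigma(r_A - r_B)$$

The reward model parameters $\theta$ are then fit by minimising the negative log-likelihood over labelled preference pairs,

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr],$$

where $y_w$ is the preferred output and $y_l$ the rejected one for input $x$. Only relative judgments enter the loss; the absolute scale of $r_\theta$ emerges from the comparisons rather than being imposed by a prompt.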
This builds a well-calibrated understanding of quality, represented on a 0-1 scale to 2 decimal places. The reward-based approach produces stable scores even when the underlying reasoning traces vary; the same input produces the same score. And because the scale is grounded in learned quality distributions rather than arbitrary prompting, scores at different points on the scale have consistent meaning: a 0.3 is always meaningfully worse than a 0.7, regardless of the day or the prompt.
Every evaluation query passes through a calibration layer that grounds scoring in reference data. The calibration store operates as a per-customer self-calibrating loop—and this is where Composo's scores become meaningful for your specific use case, not just in the abstract.
Here's how it works in practice. When a new customer runs their first evaluations, the system scores them using the base reward model trained on our general corpus. These early scores are accurate but generic—they reflect what "good" means across all domains.
As the customer runs more evaluations, the system builds an internal representation of what scores mean in their specific context. By evaluation 500, the calibration store has learned patterns: this customer's responses that score 0.8+ typically cite sources inline; their 0.4-0.6 range usually indicates correct information with formatting issues; below 0.4 means hallucination or off-topic responses. The system isn't told these patterns—it learns them from the distribution of the customer's actual outputs.
This means scores become increasingly calibrated to your quality bar, not a generic one. A legal tech company and a customer support bot might both care about "faithfulness," but what constitutes a 0.7 looks different in each context. Self-calibration handles this automatically.
When evaluating a new trace, the system performs adaptive retrieval from the calibration store, pulling the most similar prior evaluations and using them to anchor the new score.
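The exact retrieval and weighting logic is Composo's own. As a rough illustration of the behaviour described here and below (exact matches fully anchor the score, similar examples inform it proportionally), a similarity-weighted blend might look like the following; the embedding representation, cosine similarity, and blending rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class CalibrationEntry:
    embedding: list[float]   # embedded trace + criterion from a past evaluation
    score: float             # score that evaluation received

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def calibrated_score(raw_score: float, query_emb: list[float],
                     store: list[CalibrationEntry], k: int = 5) -> float:
    # Retrieve the k most similar past evaluations from the calibration store
    neighbours = sorted(store, key=lambda e: cosine(query_emb, e.embedding), reverse=True)[:k]
    if not neighbours:
        return raw_score

    # Similarity-weighted anchor: near-exact matches dominate,
    # loosely related examples barely move the score
    weights = [max(0.0, cosine(query_emb, e.embedding)) for e in neighbours]
    total = sum(weights)
    if total == 0:
        return raw_score
    anchor = sum(w * e.score for w, e in zip(weights, neighbours)) / total

    # Blend toward the anchor in proportion to how close the best match is;
    # an exact match (similarity 1.0) fully anchors the score
    blend = min(1.0, max(weights))
    return round(blend * anchor + (1 - blend) * raw_score, 2)
```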
No ground truth data or manual labelling is required for self-calibration to function. It happens automatically from evaluation history alone.
The system collects scores from all reasoning traces and takes the median. The output includes:
The self-calibration loop works without any customer-provided data. However, for use cases requiring explicit domain grounding, additional data can be added to the knowledge store to augment the calibration.
This data is integrated into the calibration store and influences retrieval in the same way as the self-calibrated data. Exact matches in golden data fully anchor the score; similar examples inform proportionally.
Academic benchmarks often focus on artificial tasks that don't translate to business needs. We validated Composo using PrimeBench, a benchmark constructed from diverse domain-specific challenges in real-world industry tasks. More detail on this work is available here.
PrimeBench integrates data from multiple sources requiring different capabilities:
We compared Composo Align against state-of-the-art LLM-as-judge implementations (Claude Sonnet, GPT-4.1), G-Eval, and RAGAS. For each test case, we conducted pairwise comparisons measuring agreement with expert human evaluators.

Composo achieves 95% agreement with expert evaluators, compared to approximately 70% for the LLM-as-judge approaches (71.5% Claude Sonnet, 71.0% GPT-4.1) and lower still for G-Eval (53.5%) and RAGAS (47.7%). Relative to the best LLM-as-judge baseline, this cuts the error rate from roughly 30% to 5%.
Accurate evaluation requires visibility into what actually happened. You can't score an agent's performance if you can't see the decisions it made, the tools it called, or the intermediate reasoning that led to its final output.
Many agent frameworks abstract away underlying LLM calls, making it difficult to understand behaviour and evaluate performance. An orchestrator calls a planner, which calls a researcher, which makes three LLM calls—and all you see is the final response. When something goes wrong, you can't tell which component failed.
Composo provides a tracing SDK that instruments LLM provider APIs (OpenAI, Anthropic, Google GenAI) and allows marking of agent boundaries. This gives you the visibility needed for meaningful evaluation.
A key difference from other tracing systems is that Composo tracing returns a trace object directly in your code, via the variable bound by the with statement. This enables immediate local evaluation without exporting data to an external platform:
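A minimal sketch of that pattern follows. The import path and exact call signatures are assumptions based on the names referenced in this post (AgentTracer, tracer.trace, evaluate_trace), and the criterion wording is illustrative.

```python
from openai import OpenAI
from composo import AgentTracer, evaluate_trace  # import path assumed

client = OpenAI()

with AgentTracer(name="support_agent") as tracer:
    # Instrumented provider calls are captured into the trace automatically
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "How do I reset my password?"}],
    )

# tracer.trace is available locally as soon as the context manager exits
results = evaluate_trace(
    tracer.trace,
    criteria=["Reward responses that only state facts supported by the provided context."],
)
print(results)  # act on the scores in-process: gate a deployment, trigger an alert, etc.
```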
The tracer.trace object is available immediately after the context manager exits. You can evaluate it locally, make decisions based on the scores, or take corrective action - all within the same execution context. There is no requirement to push traces to a remote system and wait for async results. This makes it practical to use evaluation results to influence runtime behaviour, gate deployments, or trigger alerts without external dependencies.
The tracing system supports nested agents for hierarchical multi-agent architectures. Parent-child relationships are captured automatically when AgentTracer contexts are nested. The compiled trace includes all LLM calls made by each agent, and evaluate_trace() returns per-agent evaluation results with summary statistics (mean, min, max, standard deviation) for each agent independently.
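Continuing the sketch above (same imports, and the same caveats about assumed signatures), nesting might look like this:

```python
with AgentTracer(name="orchestrator") as orchestrator:
    with AgentTracer(name="planner"):
        plan = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "Plan the research steps."}],
        )
    with AgentTracer(name="researcher"):
        findings = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "Carry out the first research step."}],
        )

# Per-agent results with summary statistics (mean, min, max, standard deviation)
# for the orchestrator, planner, and researcher independently
per_agent = evaluate_trace(
    orchestrator.trace,
    criteria=["Penalize responses that ignore the user's original request."],
)
```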
Any natural language criterion can be used (single sentence, starting with 'Reward...' or 'Penalize...'). For common use cases, Composo provides pre-built criteria frameworks.
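For example (wording illustrative), criteria passed to the evaluator are single sentences of this shape:

```python
criteria = [
    "Reward responses that answer the question using only information from the retrieved documents.",
    "Penalize responses that recommend actions outside the agent's stated permissions.",
]
```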
