How Composo Works Under The Hood

Composo Team

Over the coming weeks we are going to be sharing all of the technical details behind our evals engine. This is the first instalment in that series, covering the motivation, architecture and high-level benchmarking work we have done. Subsequent posts will dive into different aspects of this in much more detail, along with experiment findings, implementation details and code for reproducing our results.

Introduction

Most teams evaluating LLM outputs face an uncomfortable choice. Numeric scores from LLM-as-judge are unstable, poorly calibrated, and often meaningless. Boolean assertions are reliable but lose granularity—you can check if a response was hallucinated, but not how badly. Neither approach scales to the nuanced quality judgments that production systems require.

Composo provides an alternative. The system uses a generative reward model architecture that produces deterministic, well-calibrated scores from single-sentence natural language criteria. It works out of the box without labelled data, learns from each customer's evaluation history automatically, and can optionally incorporate ground truth annotations and external data sources.

What unreliable evaluation costs you

Evaluation isn't just a testing concern—it's the feedback loop that shapes your entire LLM application. When that feedback loop is noisy, the costs compound.

Bad prompts ship to production. If your evaluation can't distinguish between a 0.6 and a 0.8 response, you can't tell whether your prompt changes actually improved anything. Teams end up shipping changes based on noise, or worse, avoiding changes because they can't measure impact.

Regressions slip through. A model update or prompt tweak that degrades 15% of responses may not move your average score if that score was already bouncing around by ±20% due to evaluation variance. You discover the regression when users complain, not when you could have caught it in CI.

A/B tests mislead. When your evaluation metric has high variance, you need dramatically more samples to detect real differences. Teams either run underpowered tests and draw false conclusions, or give up on quantitative comparison entirely.

You can't automate quality gates. If you don't trust your scores, you can't use them to block deployments or trigger alerts. Every release requires manual review, which doesn't scale.

The irony is that teams adopt LLM-as-judge specifically to automate evaluation—and then can't trust the automation enough to act on it.

Section 1 - Why LLM-as-judge struggles

LLMs face significant challenges when used for evaluation tasks. Recent research systematically testing LLM judges across multiple models and scoring formats has confirmed patterns we've observed in production: numeric scores are fragile, poorly calibrated, and often misleading.

Score saturation and collapse

LLM judges don't degrade gracefully as output quality worsens. Instead, scores saturate quickly and then stop distinguishing between moderate and severe problems. A response with 20% hallucinated content and one with 60% hallucination may receive identical scores because the judge's output has already hit a ceiling.

Controlled experiments demonstrate this clearly: when researchers introduced progressively more errors into text passages, scores plateaued early and collapsed into narrow bands.  Light and heavy corruption became indistinguishable. The expected linear relationship between quality and score simply doesn't exist—instead, you get discontinuous jumps, long flat stretches, and clustering around arbitrary values.

Numerical instability across scales

The same response with an obvious hallucination can receive 60% on a 1-5 scale, 30% on a 1-10 scale, and 85% on a 1-100 scale. These scores reflect the model's sensitivity to scale configuration rather than any calibrated understanding of quality. Without explicit training on what different score values mean, the mapping from quality assessment to numerical output is essentially arbitrary.

Testing across multiple formats—1-10, 0-1, -1 to 1, and letter grades (A-E)—shows that none produce a smooth correlation with actual quality. Changing the numeric range shifts the appearance of curves but doesn't address the underlying instability. Letter grades reduce variance but collapse fine distinctions, functioning more like categorical bins than calibrated numbers: most passages cluster into A-C, with D and E only appearing at extreme degradation levels.

Score variance across runs

Even with temperature set to zero, LLM scoring exhibits instability. While reasoning and judgments have become more consistent in recent models, final numerical scores still vary across runs. If you receive a score of 0.4, the true mean might actually be 0.6—you won't know without repeated sampling.

This means you're not getting a point estimate; you're sampling from a distribution. Distribution plots make this visible: what looks like a single score is actually a spread, and neighbouring quality levels often overlap entirely. This makes it harder to distinguish genuine regressions from evaluation noise in CI pipelines, and complicates detection of meaningful differences in A/B tests.
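
To make the "sampling from a distribution" point concrete, here is a minimal Python sketch of what repeated sampling looks like. The judge_score function below is a simulated stand-in for a real LLM-as-judge call rather than any specific API; the run-to-run noise is faked with a random draw purely for illustration.

import random
import statistics

def judge_score(response: str) -> float:
    # Simulated stand-in for a single LLM-as-judge call. A real judge would
    # call an LLM; here we fake the run-to-run noise around a "true" quality
    # of 0.6 to illustrate the spread.
    return min(1.0, max(0.0, random.gauss(0.6, 0.15)))

def score_distribution(response: str, n_samples: int = 20) -> dict:
    # A single call is one draw from a distribution; summarising many draws
    # is what you would need before treating the score as a point estimate.
    scores = [judge_score(response) for _ in range(n_samples)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

print(score_distribution("example response"))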

Cross-model incomparability

Scores from different LLM judges cannot be compared. GPT and Claude may both produce medians that rise with corruption density, but they differ in scale and behaviour. The same passage scored lower by one judge may be scored higher by another. This makes benchmarking across judge models unreliable and means any score is only meaningful relative to a specific, fixed judge configuration.

Research across model families—including OpenAI, Anthropic, and open-source models like Qwen—confirms this pattern holds regardless of model capability. Even reasoning-optimised models like o3 reduce variance but don't solve the fundamental calibration problem: outputs remain discrete tokens without grounded meaning.

Setup overhead

Getting reliable results from LLM-as-judge requires extensive prompt engineering: detailed rubrics, few-shot examples, careful scale definitions. This optimisation work is domain-specific and often needs to be repeated for each new evaluation criterion. Even then, the instabilities described above persist.

The boolean assertion workaround

A common response to these problems is to abandon numerical scores entirely and use boolean assertions: specific pass/fail checks for each test case. Research supports this approach for stability—binary judgments consistently separate clean from corrupted passages with low variance across runs.

However, there's a fundamental information loss. To capture the same granularity as a 10-point scale without losing information, you would need a separate boolean assertion for each point on that scale. In practice, teams end up with a sparse set of assertions that collapse meaningful quality distinctions into binary outcomes.

Section 2 - A different approach to evaluation

The problems with LLM-as-judge stem from a fundamental mismatch: language models output discrete tokens, not calibrated measurements. Asking a model to "rate this response from 1-10" forces it to map a complex quality judgment onto an arbitrary numeric scale it was never trained to use meaningfully.

Composo takes a different approach. Rather than asking models to generate scores directly, we use them for what they're good at—reasoning about quality in natural language—and handle scoring through a purpose-built reward model trained specifically on quality distributions. This separates the judgment from the measurement.

Composo’s architecture overview

Composo uses a Generative Reward Model architecture that combines post-trained frontier models (up to 8 models, ensembled) with a custom reward head.

Figure 1: Generative reward model architecture

The reward head is trained on 1M+ expert-labelled data points from three sources:

  • Customer data: From customers who opt in to data sharing.
  • Open-source datasets: FinQA, TechQA, PubMedQA, and similar domain-specific corpora.
  • Synthetic data: Including, for example, PrimeBench, our open-source approach to synthetic data generation.

Stage 1: Complexity routing

When a trace and evaluation criteria are submitted, the system first assesses the input's ambiguity and complexity. Straightforward inputs with clear criteria can be processed with lower latency through a faster evaluation path. More complex or ambiguous inputs are routed to more extensive processing in the next stage. This allows the system to deliver faster results on simpler cases while maintaining full accuracy on harder ones.
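
As an illustration only, the routing decision can be thought of as a lightweight classification step in front of the heavier ensemble. The threshold and the estimate_complexity heuristic below are assumptions made for this sketch, not Composo's actual routing logic.

def estimate_complexity(trace: str, criterion: str) -> float:
    # Toy heuristic: longer traces and more elaborate criteria tend to need
    # deeper analysis. A real router would use far richer signals.
    return min(1.0, (len(trace) + len(criterion)) / 10_000)

def route(trace: str, criterion: str) -> str:
    # Simple inputs take the low-latency path; ambiguous or complex inputs
    # are sent to the full ensemble described in Stage 2.
    return "fast_path" if estimate_complexity(trace, criterion) < 0.3 else "full_ensemble"

print(route("Short answer about a return policy.", "Reward responses that cite the policy document."))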

Stage 2: Ensemble reasoning

Based on complexity routing, Composo generates multiple reasoning traces using an ensemble of up to 8 frontier-class models. These models are post-trained specifically for critical analysis rather than helpfulness - they focus on criteria adherence and are trained to surface issues rather than optimistically overlook them. Each model produces its own reasoning trace, creating a distribution of analytical perspectives.

Stage 3: Reward model scoring

Rather than asking models to generate numerical scores, Composo uses a custom-trained regression-based reward model that understands quality through learned distributions.

This directly addresses the calibration problem. Humans struggle to assign meaningful absolute numbers—is this response a 7 or an 8?—but reliably judge relative quality: "A is better than B." The same is true of language models. By training on comparisons rather than absolute scores, we sidestep the arbitrary numeric mapping that makes LLM-as-judge unstable.

The reward model is trained using the Bradley-Terry framework, based on pairwise preference comparisons (similar to Elo rankings in chess). During training, expert labellers review outputs across domains and indicate which of two outputs is better. Through tens of thousands of these comparisons, the model learns the quality spectrum without arbitrary numerical anchors. Rather than being told "this is 7/10 because it has minor errors", the model learns through comparisons: "Output A is better than B because it's more accurate, but C is better than A because it's more complete."

This builds a well-conditioned understanding of quality represented on a 0-1 scale to 2 decimal places. The reward-based approach produces stable scores even when underlying reasoning traces vary—the same input produces the same score. And because the scale is grounded in learned quality distributions rather than arbitrary prompting, scores at different points on the scale have consistent meaning: a 0.3 is always meaningfully worse than a 0.7, regardless of which day the evaluation runs or how the criterion is phrased.
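
For reference, the Bradley-Terry objective described above has a compact standard form: the probability that output A is preferred over output B is modelled as sigmoid(r_A - r_B), and the reward model is trained to maximise the log-likelihood of the expert preferences. The PyTorch sketch below shows that loss on dummy scalar rewards; it illustrates the general framework, not Composo's implementation.

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_preferred: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    # P(preferred beats rejected) = sigmoid(r_preferred - r_rejected);
    # minimising the negative log-likelihood pushes preferred rewards up
    # and rejected rewards down, with no absolute anchor required.
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Dummy rewards for a batch of three expert-labelled comparisons.
r_preferred = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.6, 0.1])
print(bradley_terry_loss(r_preferred, r_rejected))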

Self-calibration

Every evaluation query passes through a calibration layer that grounds scoring in reference data. The calibration store operates as a per-customer self-calibrating loop—and this is where Composo's scores become meaningful for your specific use case, not just in the abstract.

Here's how it works in practice. When a new customer runs their first evaluations, the system scores them using the base reward model trained on our general corpus. These early scores are accurate but generic—they reflect what "good" means across all domains.

As the customer runs more evaluations, the system builds an internal representation of what scores mean in their specific context. By evaluation 500, the calibration store has learned patterns: this customer's responses that score 0.8+ typically cite sources inline; their 0.4-0.6 range usually indicates correct information with formatting issues; below 0.4 means hallucination or off-topic responses. The system isn't told these patterns—it learns them from the distribution of the customer's actual outputs.

This means scores become increasingly calibrated to your quality bar, not a generic one. A legal tech company and a customer support bot might both care about "faithfulness," but what constitutes a 0.7 looks different in each context. Self-calibration handles this automatically.

When evaluating a new trace, the system performs adaptive retrieval from the calibration store:

  • Exact matches: If the trace matches an entry exactly, the system anchors fully on that example's score. This guarantees determinism—identical inputs always produce identical outputs.
  • Similar examples: For traces that are semantically similar but not exact matches, the system retrieves relevant calibration examples to inform scoring. A new response about return policies will be scored in the context of how similar return policy responses were scored previously.
  • Spectrum coverage: The retrieval mechanism selects calibration examples that span the full 0-1 scoring range. If a retrieval would only surface high-scoring examples, the system deliberately includes lower-scoring calibration data. This prevents score distribution collapse—a common failure mode where all scores drift toward the mean—and keeps the reward model well-conditioned across the entire range.

No ground truth data or manual labelling is required for self-calibration to function. It happens automatically from evaluation history alone.
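
A rough sketch of what adaptive retrieval with spectrum coverage could look like is below. The data shapes, the similarity function, and the score bands are all assumptions made for illustration; they are not Composo's internal implementation.

from dataclasses import dataclass

@dataclass
class CalibrationExample:
    trace: str
    score: float  # 0-1 score previously assigned to this trace

def similarity(a: str, b: str) -> float:
    # Placeholder word-overlap similarity; a real system would use embeddings.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / max(len(words_a | words_b), 1)

def retrieve(trace: str, store: list[CalibrationExample], k: int = 6) -> list[CalibrationExample]:
    # 1. Exact match: anchor fully on the stored example (guarantees determinism).
    exact = [ex for ex in store if ex.trace == trace]
    if exact:
        return exact
    # 2. Otherwise take the most similar calibration examples.
    ranked = sorted(store, key=lambda ex: similarity(trace, ex.trace), reverse=True)
    selected = ranked[:k]
    # 3. Spectrum coverage: make sure low, mid and high score bands are all
    #    represented so scoring stays well-conditioned across the full range.
    for low, high in [(0.0, 0.33), (0.33, 0.66), (0.66, 1.01)]:
        if not any(low <= ex.score < high for ex in selected):
            band = [ex for ex in ranked if low <= ex.score < high]
            if band:
                selected.append(band[0])
    return selected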

Stage 4: Aggregation and output

The system collects scores from all reasoning traces and takes the median. The output includes:

  • A score between 0 and 1 (to 2 decimal places)
  • Detailed analysis explaining the reasoning
  • Specific claims or issues identified in the output
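
As a toy sketch of this final step (the output field names here are illustrative, not the API's actual response schema):

import statistics

def aggregate(trace_scores: list[float], analyses: list[str]) -> dict:
    # Median across the ensemble's reasoning traces, reported to 2 decimal places.
    return {
        "score": round(statistics.median(trace_scores), 2),
        "analysis": analyses,  # per-trace reasoning that explains the score
    }

print(aggregate([0.62, 0.58, 0.66, 0.61], ["trace 1 analysis", "trace 2 analysis"]))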

Adding your own data

The self-calibration loop works without any customer-provided data. However, for use cases requiring explicit domain grounding, additional data can be added to the knowledge store to augment the calibration.

Types of data that can be added

  • Golden data and annotations: Pre-labelled examples with known scores from expert review. When you have domain experts who can label what good and bad looks like, these annotations provide direct grounding.
  • Third-party data sources: External reference material such as medical guidelines, regulatory documents, or industry standards.
  • Internal knowledge stores: Customer documentation, style guides, accumulated documents, or any corpus that defines domain-specific quality.

This data is integrated into the calibration store and influences retrieval in the same way as the self-calibrated data. Exact matches in golden data fully anchor the score; similar examples inform proportionally.

Section 3 - Performance validation

Academic benchmarks often focus on artificial tasks that don't translate to business needs. We validated Composo using PrimeBench, a benchmark constructed from diverse domain-specific challenges in real-world industry tasks. More detail on this work is available here.

PrimeBench dataset

PrimeBench integrates data from multiple sources requiring different capabilities:

  • FinQA: Financial corpus requiring numerical reasoning and domain knowledge.
  • TechQA: Technical support inquiries with specialised terminology.
  • PubMedQA: Medical information assessment demanding scientific accuracy.
  • XSum: Text summarisation evaluating information preservation and conciseness.

Methodology

We compared Composo Align against state-of-the-art LLM-as-judge implementations (Claude Sonnet, GPT-4.1), G-Eval, and RAGAS. For each test case, we conducted pairwise comparisons measuring agreement with expert human evaluators.

Figure 2: Agreement with expert human evaluators across PrimeBench

Composo achieves 95% agreement with expert evaluators compared to approximately 70% for LLM-as-judge approaches (71.5% Claude Sonnet, 71.0% GPT-4.1, 53.5% G-Eval, 47.7% RAGAS). Relative to the strongest LLM-as-judge baseline, this is more than an 80% reduction in error rate (from 28.5% disagreement with experts to 5%).
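
The agreement metric itself is straightforward: for each pairwise comparison, check whether the evaluator prefers the same output as the human expert. The data layout below is an assumption for illustration.

def agreement_rate(comparisons: list[dict]) -> float:
    # Fraction of pairwise comparisons where the automated evaluator picked
    # the same winner as the expert human evaluator.
    matches = sum(1 for c in comparisons if c["expert_preference"] == c["evaluator_preference"])
    return matches / len(comparisons)

example = [
    {"expert_preference": "A", "evaluator_preference": "A"},
    {"expert_preference": "B", "evaluator_preference": "A"},
]
print(agreement_rate(example))  # 0.5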

Section 4 - Additional capabilities

Multi-agent tracing

Accurate evaluation requires visibility into what actually happened. You can't score an agent's performance if you can't see the decisions it made, the tools it called, or the intermediate reasoning that led to its final output.

Many agent frameworks abstract away underlying LLM calls, making it difficult to understand behaviour and evaluate performance. An orchestrator calls a planner, which calls a researcher, which makes three LLM calls—and all you see is the final response. When something goes wrong, you can't tell which component failed.

Composo provides a tracing SDK that instruments LLM provider APIs (OpenAI, Anthropic, Google GenAI) and allows marking of agent boundaries. This gives you the visibility needed for meaningful evaluation.

Local evaluation without export

A key difference from other tracing systems is that Composo tracing returns a trace object directly in your code, via the variable bound in the with statement. This enables immediate local evaluation without exporting data to an external platform:


with AgentTracer('orchestrator') as tracer:
    # agent logic here

results = client.evaluate_trace(tracer.trace, criteria=criteria.agent)

The tracer.trace object is available immediately after the context manager exits. You can evaluate it locally, make decisions based on the scores, or take corrective action - all within the same execution context. There is no requirement to push traces to a remote system and wait for async results. This makes it practical to use evaluation results to influence runtime behaviour, gate deployments, or trigger alerts without external dependencies.

Hierarchical agent support

The tracing system supports nested agents for hierarchical multi-agent architectures. Parent-child relationships are captured automatically when AgentTracer contexts are nested. The compiled trace includes all LLM calls made by each agent, and evaluate_trace() returns per-agent evaluation results with summary statistics (mean, min, max, standard deviation) for each agent independently.
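
As a sketch of what nested usage looks like, extending the snippet above (exact import paths and result fields may differ from the SDK; treat the shape below as illustrative):

with AgentTracer('orchestrator') as orchestrator:
    with AgentTracer('researcher') as researcher:
        ...  # researcher's LLM calls are attributed to the 'researcher' agent
    with AgentTracer('writer') as writer:
        ...  # writer's LLM calls are attributed to the 'writer' agent
    ...  # remaining LLM calls here are attributed to 'orchestrator'

# The compiled trace captures the parent-child structure; per-agent results
# and summary statistics come back from a single evaluate_trace call.
results = client.evaluate_trace(orchestrator.trace, criteria=criteria.agent)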

Pre-built evaluation frameworks

Any natural language criterion can be used (single sentence, starting with 'Reward...' or 'Penalize...'). For common use cases, Composo provides pre-built criteria frameworks.

RAG framework

  • rag_faithfulness: Reward responses that make only claims directly supported by provided source material without hallucination.
  • rag_completeness: Reward responses that include all relevant information from source material needed to answer the question.
  • rag_precision: Reward responses that include only information necessary to answer the question.
  • rag_relevance: Reward responses where all content directly addresses the user's question.

Agent framework

  • agent_exploration: Reward agents that explore new information and investigate unknowns despite uncertainty.
  • agent_exploitation: Reward agents that exploit existing knowledge to create reliable plans.
  • agent_tool_use: Reward agents that operate tools correctly in accordance with tool definitions.
  • agent_goal_pursuit: Reward agents that work towards the goal specified by the user.
  • agent_faithfulness: Reward agents that only make claims supported by source material or tool returns.
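
Custom criteria follow the same single-sentence 'Reward...' / 'Penalize...' form as the pre-built ones above. As an illustration of how they might be passed (the list-of-strings shape below is an assumption based on the evaluate_trace example earlier, so check the SDK documentation for the exact format):

custom_criteria = [
    "Reward responses that cite the provided source documents inline.",
    "Penalize responses that recommend actions outside the user's stated budget.",
]

results = client.evaluate_trace(tracer.trace, criteria=custom_criteria)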

Summary

  • Simpler to use: Single sentence natural language criteria. No rubrics, few-shot examples, or prompt engineering.
  • More accurate: 95% agreement with expert evaluators vs ~70% for LLM-as-judge.
  • Well calibrated: Scores are grounded in learned quality distributions via self-calibration, not arbitrary scales.
  • Works immediately: No ground truth data required. Self-calibration learns from evaluation history automatically.
  • Learns from your data: Optionally incorporate golden annotations, external guidelines, or internal knowledge stores.
  • Stable: Deterministic scoring. Same input produces same output. Similar inputs cluster tightly.
  • Faster and cheaper: Options from 100ms to 15s depending on accuracy/speed tradeoff. No repeated sampling needed.