
Improving LLM Judges With Experiments, Not Vibes

Ryan Lail

For the full methodology and analysis, see the paper or technical report.

In 2026, we’re seeing AI companies make a decisive shift — from demos, betas, and trials to running in production at scale. As that happens, output quality is becoming the clear bottleneck to shipping new features, especially for our customers in high-stakes domains with domain experts in the loop. One core part of that problem is the unreliability of LLM judges — the models teams rely on to test, monitor, and guard their AI applications.

At Composo, we’ve been building state-of-the-art eval approaches that help our customers scale their quality layer across the full unit test, monitoring, annotation, and improvement cycle. One part of this process is the LLM judge, which has a mixed reputation among people working with AI apps. Prompted effectively, judges can be extremely powerful quality signals that drive real improvements. Used incorrectly, they’ll have you spending weeks optimising for the wrong things while real problems go unnoticed.

We systematically tested five techniques for improving LLM judge accuracy on RewardBench 2 and found that three simple, drop-in changes improve accuracy from 71.7% to 83.6%. The full code and technical report are open source on GitHub.

What Works

We evaluated these techniques using GPT-5.4, GPT-5.4 mini, and GPT-5.4 nano on 1,753 examples across five categories. Here’s what worked.

1. Ask more than once. LLM judges give different scores on every call. Ensemble scoring turns that from a bug into a feature: request k=8 independent scores and average them. The noise cancels out. Result: +9.8pp at 5x cost. Most of the gain comes by k=3. (A sketch of the scoring loop follows the combined result below.)

2. Try mini models. GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost — nearly matching the full-model ensemble at roughly one-quarter the price. Add criteria and it hits 81.5%, tying the full-model ensemble outright. For real-time guardrails where you need to score every request, this is the operating point that matters. And if even mini is too expensive, GPT-5.4 nano with k=8 reaches 71.4% at just 0.4x baseline cost — the cheapest path to baseline-level accuracy.

3. Be specific. The standard judge prompt asks for generic qualities like “helpfulness, relevance, accuracy.” We added a single sentence specifying what actually matters for each task. For Math: “Focus on whether the mathematical reasoning is logically valid, the steps are correct, and the final answer is accurate.” Result: +3.0pp at near-zero cost. The criteria were pre-registered; no post-hoc tuning. (A sketch of this prompt addition follows right after this list.)
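To make the criteria change concrete, here is a minimal sketch of that single-sentence addition. Only the Math sentence is quoted from above; the generic prompt wording and the idea of a per-category lookup are illustrative assumptions, not the exact pre-registered prompts from the report.

```python
# Sketch of per-task criteria injection. Only the "math" sentence is quoted from
# the post; the generic prompt text is an illustrative placeholder.
CRITERIA = {
    "math": (
        "Focus on whether the mathematical reasoning is logically valid, "
        "the steps are correct, and the final answer is accurate."
    ),
    # Other categories would get their own single pre-registered sentence.
}

GENERIC_JUDGE_PROMPT = (
    "Rate the assistant's response for helpfulness, relevance, and accuracy "
    "on a scale of 1 to 10.\n\nQuestion:\n{question}\n\nResponse:\n{response}"
)

def build_judge_prompt(question: str, response: str, category: str | None = None) -> str:
    """Generic judge prompt plus one task-specific criteria sentence, if we have one."""
    prompt = GENERIC_JUDGE_PROMPT.format(question=question, response=response)
    if category in CRITERIA:
        prompt += "\n\n" + CRITERIA[category]
    return prompt
```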

Combined, criteria + ensembling reach 83.6% accuracy at 5.3x baseline cost — no fine-tuning required.
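And here is a minimal sketch of ensemble scoring layered on top of that prompt, assuming an OpenAI-style chat completions client. The model name follows the post's naming, and the single-number answer format and regex parsing are our own simplifications rather than the exact setup from the report.

```python
import re
import statistics

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completions client works

client = OpenAI()

def judge_once(prompt: str, model: str = "gpt-5.4") -> float | None:
    """One independent judge call; returns a 1-10 score parsed from the reply, or None."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": prompt + "\n\nAnswer with a single number from 1 to 10.",
        }],
    )
    match = re.search(r"\d+(?:\.\d+)?", reply.choices[0].message.content or "")
    return float(match.group()) if match else None

def judge_ensemble(prompt: str, k: int = 8, model: str = "gpt-5.4") -> float:
    """Ensemble scoring: k independent calls, averaged, so sampling noise cancels out."""
    scores = [s for s in (judge_once(prompt, model) for _ in range(k)) if s is not None]
    if not scores:
        raise ValueError("No parseable scores returned by the judge")
    return statistics.mean(scores)

# Example usage (swap the model for the mini or nano variant to hit the cheaper
# operating points described above):
# score = judge_ensemble(build_judge_prompt(question, response, category="math"), k=8)
```

Running the k calls concurrently keeps latency close to a single call, and dropping to k=3 captures most of the accuracy gain at a fraction of the cost.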

| Condition | Accuracy (95% CI) | Cost | vs Baseline |
|---|---|---|---|
| Nano (k=8) | 71.4% (±2.1pp) | $0.006 | 0.4x |
| Baseline (full k=1) | 71.7% (±2.1pp) | $0.013 | 1.0x |
| Criteria (full k=1) | 74.7% (±2.1pp) | $0.014 | 1.1x |
| Criteria (mini k=8) | 81.5% (±1.9pp) | $0.016 | 1.2x |
| Ensemble (full k=8) | 81.5% (±1.8pp) | $0.066 | 5.0x |
| Criteria + ensemble (full k=8) | 83.6% (±1.7pp) | $0.070 | 5.3x |

Figure: Accuracy by condition and category.

Figure: Cost vs accuracy tradeoff; criteria + ensembling dominates the Pareto frontier.

Figure: Diminishing returns; most gains are captured by k=3.

Open Source

The full experiment is at github.com/composo-ai/llm-judge-criteria-ensembling, including the code and the full technical report.

At Composo, our Align platform goes further than LLM-as-judge alone. But for teams running standard LLM judge setups today, these are the highest-impact improvements available, and we wanted to make the research accessible.