Improving LLM Judges With Experiments, Not Vibes
For the full methodology and analysis, see the technical report.
In 2026, we’re seeing AI companies make a decisive shift — from demos, betas, and trials to running in production at scale. As that happens, output quality is becoming the clear bottleneck to shipping new features, especially for our customers in high-stakes domains with domain experts in the loop. One core part of that problem is the unreliability of LLM judges — the models teams rely on to test, monitor, and guard their AI applications.
At Composo, we’ve been building state-of-the-art eval approaches that help our customers scale their quality layer across the full unit test, monitoring, annotation, and improvement cycle. One part of this process is the LLM judge, a technique with a mixed reputation amongst people working with AI apps. Prompted effectively, LLM judges can be powerful quality signals that drive real improvements. Used incorrectly, they’ll have you spending weeks optimising for the wrong things while real problems go unnoticed.
We systematically tested five techniques for improving LLM judge accuracy on RewardBench 2 and found that three simple, drop-in changes improve accuracy from 71.7% to 83.6%. The full code and technical report are open source on GitHub.
What Works
We evaluated several techniques using GPT-5.4 and GPT-5.4 mini on 1,753 examples across five categories. Here’s what worked.
1. Ask more than once. LLM judges give different scores on every call. Ensemble scoring turns that from a bug into a feature: request k=8 independent scores and average them. The noise cancels out. Result: +9.8pp at 5x cost. Most of the gain comes by k=3.
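A minimal sketch of ensemble scoring. The `judge_once` function is a placeholder for a single LLM judge call (swap in your provider’s API); here it is stubbed with a fixed sequence of scores so the sketch runs standalone and illustrates how averaging smooths per-call variance.

```python
import statistics

# Stub standing in for one LLM judge call. Real calls return noisy scores;
# this deterministic sequence mimics that call-to-call variance.
_SCORES = [3.0, 5.0, 4.0, 4.0, 3.0, 5.0, 4.0, 4.0]
_calls = 0

def judge_once(prompt: str, response: str) -> float:
    global _calls
    score = _SCORES[_calls % len(_SCORES)]
    _calls += 1
    return score

def ensemble_score(prompt: str, response: str, k: int = 8) -> float:
    """Request k independent judge scores and average them.

    Individual calls disagree; the mean is far more stable, at roughly
    k times the cost of a single call."""
    scores = [judge_once(prompt, response) for _ in range(k)]
    return statistics.mean(scores)
```

With many providers you can also request multiple samples in one API call (e.g. a sampling count parameter), which amortises the prompt tokens across the k scores.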
2. Try mini models. GPT-5.4 mini with k=8 achieves 79.2% at just 0.4x baseline cost — better than the full model baseline at less than half the price. Add criteria and it hits 81.5%. For real-time guardrails where you need to score every request, this is the operating point that matters.
3. Be specific. The standard judge prompt asks for generic qualities like “helpfulness, relevance, accuracy.” We added a single sentence specifying what actually matters for each task. For Math: “Focus on whether the mathematical reasoning is logically valid, the steps are correct, and the final answer is accurate.” Result: +3.0pp at near-zero cost. The criteria were pre-registered — no post-hoc tuning.
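This amounts to appending one task-specific sentence to the judge prompt. In the sketch below, the Math sentence is the one quoted above; the prompt template and the `build_judge_prompt` helper are illustrative, not the exact prompts from the experiment.

```python
BASE_JUDGE_PROMPT = (
    "Evaluate the assistant's response for helpfulness, relevance, "
    "and accuracy. Reply with a score from 1 to 5."
)

# Pre-registered, task-specific criteria: one sentence per category.
# Only the Math sentence is from the post; add your own per task.
CRITERIA = {
    "math": (
        "Focus on whether the mathematical reasoning is logically valid, "
        "the steps are correct, and the final answer is accurate."
    ),
}

def build_judge_prompt(task: str) -> str:
    """Append the task's criteria sentence to the generic judge prompt."""
    extra = CRITERIA.get(task, "")
    return BASE_JUDGE_PROMPT + (" " + extra if extra else "")
```

Because the extra sentence adds only a handful of tokens, the cost impact is negligible (1.1x in the table below).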
Combined, criteria + ensembling reach 83.6% accuracy at 5.3x baseline cost — no fine-tuning required.
| Condition | Accuracy (95% CI) | Cost | vs Baseline |
|---|---|---|---|
| Baseline (full k=1) | 71.7% (±2.0pp) | $0.0133 | 1.0x |
| Criteria (full k=1) | 74.7% (±1.9pp) | $0.0140 | 1.1x |
| Ensemble (full k=8) | 81.5% (±1.8pp) | $0.0663 | 5.0x |
| Mini + criteria (k=8) | 81.5% (±1.7pp) | $0.0053 | 0.4x |
| Criteria + ensemble (full k=8) | 83.6% (±1.6pp) | $0.0702 | 5.3x |
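Putting the two together is straightforward: build the criteria-augmented prompt, ensemble k scores per candidate, and pick the candidate with the highest mean, as in a RewardBench-style best-of-n comparison. This is a hedged sketch: `judge_once` is again a stub for a real API call, and the prompt template is illustrative.

```python
import statistics

JUDGE_PROMPT = (
    "Evaluate the assistant's response for helpfulness, relevance, and "
    "accuracy. {criteria} Reply with a score from 1 to 5."
)

def judge_once(prompt: str, response: str) -> float:
    # Stub for a single judge call; swap in your provider's API.
    # For this demo it simply favours responses that state an answer.
    return 5.0 if "answer" in response.lower() else 2.0

def ensemble_score(prompt: str, response: str, k: int = 8) -> float:
    """Mean of k independent judge scores for one candidate response."""
    return statistics.mean(judge_once(prompt, response) for _ in range(k))

def pick_best(question: str, candidates: list[str],
              criteria: str, k: int = 8) -> str:
    """Criteria + ensemble: score each candidate k times against the
    criteria-augmented prompt and return the highest-scoring one."""
    prompt = JUDGE_PROMPT.format(criteria=criteria)
    return max(candidates,
               key=lambda c: ensemble_score(prompt + "\n\n" + question, c, k))
```

For example, with the Math criteria sentence, `pick_best("What is 2+2?", ["It's complicated.", "The answer is 4."], math_criteria)` selects the second candidate.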
Open Source
The full experiment is at github.com/composo-ai/llm-judge-criteria-ensembling — code and the full technical report.
At Composo, our Align platform goes further than just LLM-as-judge. But for teams using LLM-as-judge today, these are the highest-impact improvements available to standard LLM judge setups, and we wanted to make the research accessible.