From One Judge to a Learning System
How we built an agent evaluation engine that gets better the more you use it, and what it really takes to rebuild something like that in-house.
Why we’re publishing this
We sell evaluation infrastructure. This isn’t a feature brochure or a “just trust us” story. It’s a walkthrough of:
- the problems teams hit in order when they build alone,
- how we check that our own system still works,
- and what else you need (beyond scoring) to run evals as a real product.
If you’re deciding build vs buy, the question usually isn’t “can we prompt an LLM to score things?”
The harder question is what happens when you need the full loop (memory, review habit, regression + meta-eval, multi-tenant product surface) to behave like infrastructure. That depth is where the calendar really bites: a mature in-house stack is often 6+ months of focused work, not because teams are slow, but because the layers are sequential and compounding. Integrating an operated platform is the other path: same clock, less build, but you still own criteria, labels, and release decisions.
Still, a competent team can often get useful scoring and wiring in roughly 8-10 weeks (think 4-5 two-week sprints, sometimes faster): enough to ship a judge, pipe traffic, and learn.
Either way, you need something you can inspect and measure (accuracy on a fixed test set, false passes, reruns after engine changes). A single headline number can hide false passes. Track false-pass rate alongside accuracy, or you get false confidence in a degrading agent.
You need several pieces working together: multiple judges and tuning, remembering past evals, humans in the loop, comparing versions on a fixed test set, and re-checking the scorer itself, plus normal product work: users, APIs, rate limits and scale.
TL;DR
If your core product isn’t evaluation infrastructure, you are choosing where the months go. In-house you can get something useful in weeks, but trusting the full loop as infrastructure usually takes many months of build. The other path is to integrate an operated platform: that depth already exists on the other side of the API—you wire traffic and criteria, hold the same bar on proof, and keep your engineers on the product you actually sell. This post walks both sides; the ladder below is what “build it all” really means.
| | Build in-house | Buy (integrate Composo) |
|---|---|---|
| Useful early (typical) | 8-10 weeks can get you a judge, basic plumbing, maybe a dashboard: enough to learn and gate obvious failures. Not the full ladder below. | Integration work, not a blank slate: you connect traffic and criteria; we run the platform mechanics. |
| Mature loop + product surface | Everything in the ladder, plus criteria/tenancy, stable APIs, reviewer UX, gold-set hygiene. Often 6+ months before we’d treat it like a dependency we stake releases on. | You don’t implement every layer from scratch; you still own criteria, labels, ship calls, and process. Measured depth on our side (~71% baseline judge → up to 83.6% on our internal benchmark). |
| Trust / proof | You build frozen gold, ablations, false-pass tracking yourself, or you don’t have them. | Inspectable mechanics + reruns on frozen data; hold us to the same standard you’d use in-house. |
Technical ladder (what “build” actually means): Most teams start with one LLM “judge,” then keep adding pieces as scores get noisy, week-to-week numbers stop being comparable, the system looks right but isn’t, ship vs traffic gets mixed up, and you change the engine and nobody notices. The next table is only the scoring side; it doesn’t include how you model criteria, separate customers, or wire callers (see Product realities).
Numbers here come from our own internal runs (your product will differ), but the order of problems and the calendar cost recur. We score against expert-labelled examples on a fixed benchmark. A single baseline judge is about ~71% on that; our strongest setup gets up to 83.6%.
| Step | What goes wrong | What you’d have to build | Rough effort in-house |
|---|---|---|---|
| Starting point | You can’t read every conversation | One prompt + one model score | Days; feels “done,” usually isn’t |
| 1 · Ensemble | Same case gets different scores; trend charts lie | Multiple judges, calibration, ways to cut cost (tiering, cheaper models, sampling) | Weeks to months, plus upkeep when judge models change |
| 2 · Memory | A score this week doesn’t mean the same as last week | Store past evals, search similar ones, wire them into prompts (per criterion) | Months: search, storage, ops |
| 3 · Human review | The system looks consistent but is systematically wrong | Review UI, what to label first, use labels to fix retrieval | Product build + ongoing reviewer time |
| 4 · Regression checks | You can’t tell “bad deploy” from “harder users” | Frozen test sets, compare releases, break down by criterion | Weeks for a first version, then versioning over time |
| 5 · Test the eval itself | You don’t know if your scorer got worse | Fixed gold set in version control, rerun when the engine changes | Small team ongoing; dataset and hygiene |
Calendar: don’t conflate first useful signal (weeks) with infrastructure you trust (often 6+ months for a mature in-house loop like the one we run, with some teams seeing a broader first cut in 3-6 months and full tenancy/product surface longer). Steps depend on each other, humans have to review, and you still wrap the scorer in real product (criteria, tenants, callers). If you integrate instead of building every layer, the same months can still accumulate eval data through an operated path while your engineers focus on the agent.
Where builds often stall: Teams commonly ship a judge, then an ensemble, then some memory, and lose steam on sustained human review (queues, prioritisation, habit). That layer is as much process and ownership as code.
Calendar trap: many roadmaps assume “a few weeks” for scoring; months later they’re still mid-stack while the agent team waits.
Compounding: memory and labels improve with volume; a stack that has run for a year is not something you reproduce on day one by shipping the same code paths.
Each row in the ladder is a different kind of work (models, search, reviewer tools, monitoring, tests of the eval itself). Doing all of it yourself is what competes with your product roadmap; the table above separates early value from maturity.
For ensembles, experiments, and more technical detail, see Improving LLM Judges With Experiments, Not Vibes.
Starting point: one judge
You get too much traffic to read everything. You need a signal: is the agent doing its job? The usual first step is a judge prompt: a rule (e.g. “did we fix the issue without escalating for no reason?”) plus the conversation, returning a score.
# First judge (pseudocode)
JUDGE_PROMPT = """Criteria: {criteria} ... Interaction: {interaction} ...
Score 0.0-1.0; JSON: score, explanation."""
def evaluate(interaction, criteria):
return parse_json(llm.complete(JUDGE_PROMPT.format(...), temperature=0))
That’s fine for obvious cases and low volume. It breaks down when people actually use the scores for decisions and trends. Small shifts in wording, models, or criteria add up. It’s still measurement: it has noise and bias like any other metric, not magic.
Layer 1: Inconsistency and cost
The problem: run the same conversation twice and you can get different scores (e.g. 0.74 vs 0.71). Turning temperature down helps, but you still get noise from the hardware and steady biases in the model (favouring long answers, confident tone, etc.). Tweaking the prompt helps a little; it doesn’t remove that.
Why trends break: if repeat runs wobble by about 0.03, you can’t trust small moves on a chart, so a drop from 0.82 to 0.76 might be noise, not a real regression.
How we solve it: we run an ensemble (several prompts or models), then combine scores. Same idea as elsewhere: multiple independent raters (clinical scales, peer review, juries) average out individual bias. If judges agree, you can trust the mean more; if they spread apart, the case is ambiguous, the rubric is fuzzy, or we route that case to a human instead of trusting a single number.
# Repeat variance (pseudocode)
scores = [evaluate(same_interaction, criteria)["score"] for _ in range(10)]
# Ensemble (pseudocode)
def ensemble_evaluate(interaction, criteria):
scores = [judge(v, interaction, criteria) for v in JUDGE_VARIANTS]
return mean(scores), std(scores), scores
Tuning: a plain average is a start; in production we align judges to the same scale using labelled data, and redo that whenever we change judge models.
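As a minimal sketch of that alignment step (hypothetical helper names; an ordinary least-squares rescale of one judge's raw scores against human labels, not our production calibration):

```python
def fit_linear_calibration(raw_scores, human_labels):
    """Least-squares fit of human ≈ a * raw + b for a single judge."""
    n = len(raw_scores)
    mean_r = sum(raw_scores) / n
    mean_h = sum(human_labels) / n
    cov = sum((r - mean_r) * (h - mean_h) for r, h in zip(raw_scores, human_labels))
    var = sum((r - mean_r) ** 2 for r in raw_scores)
    a = cov / var
    return a, mean_h - a * mean_r

def calibrate(score, a, b):
    # clamp back onto the 0-1 scale the judges report on
    return min(1.0, max(0.0, a * score + b))
```

Refitting whenever the judge model changes is the upkeep cost mentioned above.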
Cost: eight full calls can cost ~8× one call. We combine the usual mitigations:
# Tiering (pseudocode)
def tiered_evaluate(interaction, criteria):
t = fast_judge.evaluate(interaction, criteria)
if t.score > 0.88 or t.score < 0.32:
if random.random() < 0.05:
return ensemble_evaluate(interaction, criteria) # spot-check easy buckets
return t
return ensemble_evaluate(interaction, criteria)
# Who to score at volume (pseudocode)
def should_evaluate(meta) -> bool:
if meta.new_version or meta.flagged or meta.in_reference_set: return True
if stratified_need_coverage(meta.category): return True
return random.random() < BASE_RATE
We mix big and small models (and sometimes fine-tuned small ones on your labels). Criteria text changes rarely compared to traffic. We cache the stable parts of prompts and shave a large share of tokens on every call. At high volume we don’t score every conversation: we always cover new releases and flagged traffic, then sample the rest so trends stay meaningful without scoring 100% of rows.
Altogether we often land around ~2-3× the cost of one naive judge instead of 8×, with most of the quality gain.
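As back-of-envelope arithmetic (illustrative numbers, not our production figures; `expected_cost` and its defaults are invented for this sketch), tiering plus a spot-check is why the multiplier stays low:

```python
def expected_cost(p_easy, ensemble_calls=8, spot_rate=0.05):
    """Expected model calls per scored item under the tiered scheme."""
    # easy bucket: 1 fast call, plus a spot-check ensemble 5% of the time
    easy = 1 + spot_rate * ensemble_calls
    # hard bucket: 1 fast call that triaged it, then the full ensemble
    hard = 1 + ensemble_calls
    return p_easy * easy + (1 - p_easy) * hard
```

With 80% of traffic landing in the confident buckets this comes out near 2.9 calls per item, in the ~2-3× range above, before prompt caching reduces per-call cost further.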
Layer 2: Memory and context
The problem: scores look stable one run at a time, but you can’t compare week to week. You change criteria text, traffic changes, or the judge model updates. Humans compare to what they’ve seen before; a judge with no memory doesn’t.
How we solve it: we store past evals (searchable by criterion, agent, time) and retrieve similar cases into the prompt so scoring is relative to examples.
def memory_evaluate(interaction, criteria, store):
    similar = store.retrieve(interaction, criteria=criteria, top_k=3, min_similarity=0.75)
    # fold retrieved neighbours into the prompt, then score with the ensemble
    return ensemble_evaluate(few_shot_prompt(criteria, similar, interaction), criteria)
# Stored row (pseudocode)
# EvaluationMemory: id, agent, criteria_hash, text, embedding,
# ensemble_score, spread, human_label?, agent_version, judge_model_version
We keep retrieval within each criterion. No cross-criteria mixing. New products need seed examples at the start. The more history you have, the better neighbours you get. You can’t substitute that with a bigger model; it’s calendar and data. Whatever you build (or buy), time in production is part of the asset.
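A minimal sketch of that criterion-scoped retrieval (hypothetical in-memory store of rows shaped like the `EvaluationMemory` record above; cosine similarity over precomputed embeddings stands in for a real vector index):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(store, query_emb, criteria_hash, top_k=3, min_similarity=0.75):
    # only neighbours scored under the same criterion are comparable
    cands = [row for row in store if row["criteria_hash"] == criteria_hash]
    scored = [(cosine(query_emb, row["embedding"]), row) for row in cands]
    scored = [(s, row) for s, row in scored if s >= min_similarity]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [row for _, row in scored[:top_k]]
```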
Layer 3: Ground truth and humans
The problem: everything can look internally consistent and still be wrong in the same direction: bad scores get saved and reused.
How we solve it: human review with full context (conversation, rubric, machine scores, similar cases). We triage what to review first (big disagreement, borderline scores, new releases, thin coverage). If reviewers often disagree on the same rubric, we fix the rubric, not only “more labels.”
This is the step many in-house efforts under-invest in: the UI and workflow have to be low-friction, and someone has to own the queue, or labels dry up and the loop stops.
def retrieve_weighted(query, store):
cands = vector_search(query, k=15)
for c in cands:
if c.human_label == "correct": c.weight *= 1.4
elif c.human_label == "incorrect": c.weight *= 0.3
return top_k_by_weight(cands, k=5)
We track accuracy vs experts by slice (criterion, agent, score band). False passes (machine says OK, expert says not) are usually the dangerous ones: they look like wins on a dashboard and hide real regressions. A high top-line accuracy with a bad false-pass rate is worse than a modest accuracy you distrust, because it drives false confidence.
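As a sketch (hypothetical field names and pass threshold), the slice metrics are a simple group-by over human-labelled rows:

```python
from collections import defaultdict

def slice_metrics(rows, pass_threshold=0.7):
    """Accuracy and false-pass rate per criterion slice.

    Each row carries: criterion, machine_score, expert_pass (bool)."""
    by_slice = defaultdict(list)
    for r in rows:
        by_slice[r["criterion"]].append(r)
    out = {}
    for crit, items in by_slice.items():
        machine_pass = [r["machine_score"] >= pass_threshold for r in items]
        correct = sum(mp == r["expert_pass"] for mp, r in zip(machine_pass, items))
        # false pass: machine says OK, expert says not
        false_pass = sum(mp and not r["expert_pass"] for mp, r in zip(machine_pass, items))
        out[crit] = {
            "accuracy": correct / len(items),
            "false_pass_rate": false_pass / len(items),
        }
    return out
```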
Layer 4: Drift and regression
The problem: you change a prompt, tool, or model; behaviour shifts without a hard error. Live traffic also shifts. Raw score trends mix up “we shipped something worse” with “users got harder questions.”
How we solve it: we keep a fixed basket of test conversations per agent (labelled, versioned), compare new release vs old on the same items, and split results by criterion so “overall down 0.08” becomes actionable.
def compare_versions(agent_a, agent_b, ref: ReferenceSet) -> RegressionReport:
for item, crit in ref.items:
...
return RegressionReport(overall=..., by_criterion=..., worst_examples=...)
Example report (illustrative):
- agent: support_bot
- baseline → current: v1.2.0 → v1.3.0
- reference set: n = 200
- overall change: −0.06
- by criterion: resolution −0.13, tone +0.04
Why this matters beyond the headline: a useful regression view also surfaces the worst individual drops (which conversations broke?) and often a pattern (e.g. failures cluster on billing after a tool change), so engineering has a hypothesis, not only a red number.
If you alert on live scores, you may also want stats that don’t cry wolf on normal noise (we use standard process-control style checks where it matters).
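A minimal sketch of such a check (classic 3-sigma control limits around a baseline window; hypothetical function, and real deployments would pick limits per metric):

```python
import statistics

def out_of_control(baseline, latest, k=3.0):
    """Flag `latest` if it falls outside mean ± k·stdev of the baseline window."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample stdev of recent daily means
    return abs(latest - mu) > k * sigma
```

A drop inside the limits is treated as normal wobble; only excursions beyond them page anyone.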
The full stack
Each layer addresses what the last one did not. There isn’t one model that replaces the whole thing: it is measurement, memory, and feedback loops, in the order you already read. The five-step ladder in TL;DR is the same path in table form if you want a single bookmark.
You send in criteria, conversations (lab or production traces when live traffic is wired in), and you get back more than a single number: score, spread (how much to trust it and when to escalate), explanation, provenance, and comparisons across versions so product and engineering can act.
The system gets better with use: more evals improve search; more labels improve weighting and training; more releases improve baselines.
Making sure the scorer still works
When you change judges, retrieval, or scoring code, you need a fixed labelled set you re-run after each real change. You track accuracy vs experts, and especially how often you wrongly pass a bad answer.
def run_meta_eval(engine, frozen_examples) -> Metrics:
    # compare engine labels to expert labels on the frozen gold set
    pairs = [(engine.label(ex), ex.expert_label) for ex in frozen_examples]
    accuracy = sum(p == e for p, e in pairs) / len(pairs)
    false_pass = sum(p == "pass" and e == "fail" for p, e in pairs) / len(pairs)
    return Metrics(accuracy=accuracy, false_pass_rate=false_pass)
Ablations, not vibes: turn one layer off at a time on the same gold set (baseline judge → +ensemble → +memory → +human-weighted retrieval, etc.). Each layer targets a different failure mode; the point is to see what actually moved accuracy and false passes, not a one-off lucky run. Our internal benchmark (~71% baseline single judge → up to 83.6% full stack) is the sort of claim we only treat seriously when it’s tied to reruns on frozen examples after each substantive change.
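The toggling itself is trivial; the discipline is running the same gold set every time. A sketch (the `build_engine` flags and callables are hypothetical, injected so the harness stays testable):

```python
def ablation_report(gold, build_engine, run_meta_eval):
    """Run the meta-eval once per configuration, adding one layer at a time."""
    configs = [
        ("baseline", {}),
        ("+ensemble", {"ensemble": True}),
        ("+memory", {"ensemble": True, "memory": True}),
        ("+human-weighted", {"ensemble": True, "memory": True, "human_weights": True}),
    ]
    return {name: run_meta_eval(build_engine(**flags), gold) for name, flags in configs}
```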
You should be able to ask any vendor “did you rerun the gold set after that change?”, including us.
Product realities: what the layers imply
The sections above are the technical path in order. Moving from “we have a scorer” to eval infrastructure is not a late add-on: once more than one team or customer depends on the path, you need the wrapper around it:
- versioned criteria,
- tenancy and access control (who defines rules, who labels, who sees production vs test),
- throughput and fairness, so one tenant does not starve another,
- review habits and prioritisation (roles and queues, not only a function in code),
- jobs and storage for regression at ship cadence,
- and CI or governance so the gold set reruns when the eval engine changes.
The moment something other than a notebook calls your eval path, you inherit the usual properties of a dependency:
- a stable HTTP surface (versioning, auth, rate limits),
- SDKs if you want consistent retries and errors,
- exports or webhooks if scores need to reach Slack, a warehouse, or alerting,
- and monitoring (latency, cost, failures, queue depth), because other teams will treat this like a database, not a script.
Conclusion
The build vs buy matrix in TL;DR is not “you have nothing until month six.” It separates useful early work (weeks) from a mature, infrastructure-grade loop (often many months in-house). Here is that split in prose.
- Build: the ladder above plus criteria and tenancy plus callers (API / SDK / UI as needed) plus ongoing tuning, search upkeep, reviewer habit, and gold-set hygiene. You can ship something fast; trusting the full stack for releases is what consumes many engineer-months. Teams often decide implicitly: they budget a short spike, then pay in calendar while the agent ships anyway.
- Buy (Composo): you still own criteria, labels, and ship decisions; we operate the platform depth and publish measured quality on a frozen set (and false-pass tracking). You trade platform build for integration and process on your side.
Eval with a vendor still takes your time on criteria, labels, and process; that should be expected. The point is the maturity gap: the same months can go into scaffolding every layer yourself, or into agent work while eval data accrues through an operated stack you inspect and rerun against gold.
Numbers from our internal studies. Example benchmark: 200 expert-labelled conversations across 8 agent types and 12 criteria. Best accuracy we’ve seen vs a baseline single judge: 83.6% (baseline ~71% on the same measure).