
From one judge to a learning system

Michael Karotsieris · Founding Engineer

Summary

A baseline judge gives you a score on traffic. It is not the same thing as eval infrastructure. The mature path adds six layers on top of that judge (ensemble, memory, human review, drift monitoring, regression detection, meta-eval), plus the product wrapper around them: APIs, tenancy, reviewer UX, and the ongoing work of keeping judges and embeddings current.

After a baseline judge, the mature path adds six layers (ensemble through meta-eval; the introduction lists them), plus APIs, reviewer UX, tenancy, and ongoing maintenance.

If evals are not your product, decide whether you implement every layer yourself or integrate so your team stays more focused on the agent.

Figure: Learning loop. Production traffic passes through six layers (ensemble, memory, review, drift, regression, meta-eval), and labels and baselines close the loop.

Each new eval improves search; each label improves weighting; each release improves baselines. The system improves as you use it.

Introduction

Disclaimer: we sell evaluation infrastructure. If you are building a company or a product around AI agents, you need evals. Without them, broken behavior shows up as churn.

If you run heavier agentic workloads on production traffic, a minimal setup is not enough. You hit problems with score interpretation and thresholds. You lack a durable memory of past evals. You see medium- to long-term regressions and judgments that are not grounded in examples. You also carry normal engineering load, such as rate limits and API failures. You can address each issue, but every fix trades against latency. The system needs ongoing maintenance, like any other product, and that includes keeping judge models and embedding models current.

Eval infrastructure is not one prompt. If you ship a very simple agentic app, a single eval skill or prompt might be enough. Otherwise, you need a stack of six layers on top of a baseline judge:

  1. Ensemble: inconsistency and cost.
  2. Memory: comparability over time.
  3. Human review: ground truth and labels.
  4. Drift monitoring: live signal and noisy attribution.
  5. Regression detection: fixed reference sets and release proof.
  6. Meta-eval: frozen gold, ablations, and false-pass tracking so that the scorer stays calibrated.

You still need tenancy, APIs, rate limits, and reviewer UX around that stack.

Next is a build vs buy table (timelines, trust, where work lands). After that, the post goes from a single judge through six layers in order, then a short recap figure and conclusion.

Build vs buy

| | Build in-house | Buy (integrate Composo) |
| --- | --- | --- |
| Useful early | 4-8 weeks for useful scoring: a judge, basic plumbing, and a first dashboard, enough to learn and to block obvious failures. This is not the full six-layer stack or product surface. | A couple of days to integrate. You connect traffic and criteria, then start using the learning system on production traffic. |
| Production-ready loop | 3-6 months for infrastructure that you can stake releases on: all six layers, criteria, tenancy, stable APIs, reviewer UX, gold-set hygiene, and the operational load from the opening paragraphs (latency trade-offs, rate limits, resilience, judge and embedding upgrades). | You do not implement every layer from scratch. You still own criteria, labels, ship calls, and process. Measured depth on our side: 71% baseline judge to 83.6% on our internal benchmark. |
| Trust and proof | You build frozen gold, ablations, and false-pass tracking yourself, or you do without them. | You get inspectable mechanics and reruns on frozen data. Hold us to the same standard that you would use in-house. |
Figure: Build in-house versus integrate Composo. Two timelines over the same calendar, with a different allocation of work.

Same calendar. Different allocation.

Starting point: a single judge

You receive too much traffic to read everything. You need a signal: is the agent doing its job? The first step is a judge prompt. You combine a rule (for example, “did we fix the issue without escalating for no reason?”) with the conversation, and you return a score.

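# First judge (pseudocode; `llm` stands in for your model client)
import json

JUDGE_PROMPT = """Criteria: {criteria} ... Interaction: {interaction} ...
Score 0.0-1.0; JSON: score, explanation."""

def evaluate(interaction, criteria):
    # Fill both placeholders, request a deterministic completion, and parse the JSON reply.
    prompt = JUDGE_PROMPT.format(criteria=criteria, interaction=interaction)
    return json.loads(llm.complete(prompt, temperature=0))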

This pattern works for obvious cases and low volume. When you use scores for decisions and trends, small shifts in wording, models, or criteria add up. You are running a measurement pipeline. You see variance across runs, systematic bias, and drift when you change the prompt, model, or rubric. These are the same failure modes as any metric that you run in production.
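
Because the judge replies in free text, it helps to validate each reply before its score enters a trend. The following is a minimal sketch, assuming the reply is a JSON object with the `score` and `explanation` keys that the prompt asks for. `parse_judge_reply` is a hypothetical helper name, not part of any library.

```python
import json

def parse_judge_reply(raw: str) -> dict:
    """Parse a judge reply and reject anything that is not a usable score."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0.0-1.0 range")
    return {"score": float(score), "explanation": str(data.get("explanation", ""))}
```

Rejected replies can be retried or counted as failures. Either way, they stay out of the averages.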

Layer 1: inconsistency and cost

The problem: if you run the same conversation twice, you can get different scores (for example, 0.74 versus 0.71). Lower temperature helps, but you still get noise from the stack and steady bias in the model (for example, favoring long answers or a confident tone). Prompt tweaks help a little, but they do not remove the issue.

If repeat runs differ by 0.03, you cannot trust small moves on a chart. When scores drop from 0.82 to 0.76, you cannot tell from the chart alone whether that is noise or a regression.
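
One rough way to separate noise from a real move is to compare the gap between two batches of scores with their combined standard error. This is a sketch under assumptions: the two-standard-error cutoff and the sample scores are illustrative, and `beyond_noise` is a hypothetical helper, not a full significance test.

```python
from math import sqrt
from statistics import mean, stdev

def beyond_noise(before: list[float], after: list[float], z: float = 2.0) -> bool:
    """True if the gap between the two means exceeds z standard errors."""
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    return abs(mean(after) - mean(before)) > z * se

# Same 0.06 drop in the mean, two different noise levels (made-up scores).
tight_before, tight_after = [0.82, 0.85, 0.79, 0.83, 0.81], [0.76, 0.79, 0.74, 0.78, 0.73]
noisy_before, noisy_after = [0.82, 0.70, 0.94], [0.76, 0.64, 0.88]

print(beyond_noise(tight_before, tight_after))  # True
print(beyond_noise(noisy_before, noisy_after))  # False
```

The same drop in the mean clears the bar when scores are tight and many, and fails it when scores are few and spread out.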

Our solution: we run an ensemble (several prompts or models), then we combine scores. Independent raters average out individual bias. When scores cluster, the mean is a stronger signal. When they spread out, we do not pick one model and call it final: we hand off to a human, or we keep the aggregate but mark it as low confidence. Wide spread can mean a borderline interaction, a rubric that does not pin down scores, or both.

# Repeat variance (pseudocode)
scores = [evaluate(same_interaction, criteria)["score"] for _ in range(10)]

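# Ensemble (pseudocode)
from statistics import mean, pstdev

def ensemble_evaluate(interaction, criteria):
    # One score per judge variant (different prompts or models).
    scores = [judge(v, interaction, criteria) for v in JUDGE_VARIANTS]
    return mean(scores), pstdev(scores), scores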
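
The hand-off rule on wide spread can be sketched as a small routing step. The `SPREAD_LIMIT` value and the `route` helper are assumptions for illustration. In practice, the threshold would be tuned against labeled data.

```python
from statistics import mean, pstdev

SPREAD_LIMIT = 0.10  # assumed threshold; tune it against labeled data

def route(scores: list[float]) -> dict:
    """Aggregate ensemble scores and decide whether a human should look."""
    spread = pstdev(scores)
    return {
        "score": mean(scores),
        "spread": spread,
        # Wide spread: keep the aggregate but flag it for review or low confidence.
        "needs_human": spread > SPREAD_LIMIT,
    }
```

For example, `route([0.80, 0.82, 0.78])` keeps the aggregate without a flag, while `route([0.9, 0.4, 0.7])` flags the item for review.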

A plain average is a start. In production, we align judges to the same scale with labeled data, and we repeat that step when we change judge models.
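
As one possible way to put judges on the same scale, the sketch below fits a least-squares line from each judge's raw scores to human labels. Linear calibration is an assumption chosen for brevity, not a description of our production method, and `fit_linear` and `calibrate` are hypothetical names.

```python
def fit_linear(raw: list[float], human: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept mapping raw judge scores to human labels."""
    n = len(raw)
    mx, my = sum(raw) / n, sum(human) / n
    sxx = sum((x - mx) ** 2 for x in raw)
    if sxx == 0:
        raise ValueError("raw scores must vary to fit a calibration line")
    slope = sum((x - mx) * (y - my) for x, y in zip(raw, human)) / sxx
    return slope, my - slope * mx

def calibrate(score: float, slope: float, intercept: float) -> float:
    """Apply the fitted line and clamp to the 0.0-1.0 score range."""
    return min(1.0, max(0.0, slope * score + intercept))

# A judge that reads about 0.1 too high on a small labeled set (made-up data).
slope, intercept = fit_linear([0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7])
```

Each judge gets its own fitted line, and the fit is repeated whenever a judge model changes.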

Eight full calls cost eight times one call. We combine tiering, sampling, and a mix of model sizes:

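# Tiering (pseudocode)
import random

def tiered_evaluate(interaction, criteria):
    t = fast_judge.evaluate(interaction, criteria)
    if t.score > 0.88 or t.score < 0.32:
        # Confident bucket: trust the fast judge, but spot-check 5% with the ensemble.
        if random.random() < 0.05:
            return ensemble_evaluate(interaction, criteria)
        return t
    # Uncertain middle band: always escalate to the ensemble.
    return ensemble_evaluate(interaction, criteria)

# Who to score at volume (pseudocode)
def should_evaluate(meta) -> bool:
    # Always score new releases, flagged traffic, and reference-set items.
    if meta.new_version or meta.flagged or meta.in_reference_set:
        return True
    # Top up categories that are still under-covered.
    if stratified_need_coverage(meta.category):
        return True
    # Sample the remainder at a fixed base rate.
    return random.random() < BASE_RATE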

We mix large and small models. Criteria text changes rarely compared to traffic. We cache the stable parts of prompts and cut a large share of tokens on every call. At high volume, we do not score every conversation. We always cover new releases and flagged traffic, then we sample the rest so that trends stay meaningful without scoring 100% of rows.

In total, we land at about two to three times the cost of one naive judge instead of eight times, with most of the quality gain.
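
The arithmetic behind a figure in that range can be sketched as an expected cost per scored interaction. The inputs are assumptions for illustration only: 75% of traffic lands in the confident buckets, and the fast judge costs 0.3 of one full call. The eight-call ensemble and the 5% spot-check rate come from the text above.

```python
def expected_cost(easy_share: float, fast_cost: float,
                  ensemble_cost: float = 8.0, spot_check: float = 0.05) -> float:
    """Expected cost per scored interaction, in units of one full judge call."""
    # Every interaction pays for the fast judge first.
    # Confident buckets escalate only on spot checks; the rest always escalate.
    escalation_rate = easy_share * spot_check + (1.0 - easy_share)
    return fast_cost + escalation_rate * ensemble_cost

cost = expected_cost(easy_share=0.75, fast_cost=0.3)
print(f"{cost:.2f}")  # 2.60
```

Sampling and prompt caching reduce the total further. This sketch covers tiering only.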

Layer 2: memory and context

The problem: scores look stable for one run, but you cannot compare week to week. You change criteria text, traffic changes, or the judge model updates. Humans compare to what they have seen before. A judge with no memory cannot.

Our solution: we store past evals (searchable by criterion, agent, and time). We retrieve similar cases into the prompt so that scoring is relative to examples.

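# Memory (pseudocode)
def memory_evaluate(interaction, criteria, store):
    # Fetch close neighbors that were scored under the same criterion.
    similar = store.retrieve(interaction, criteria=criteria, top_k=3, min_similarity=0.75)
    # Score the interaction with the retrieved examples as few-shot context.
    return ensemble_evaluate(few_shot_prompt(criteria, similar, interaction), criteria)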

Retrieval runs inside one criterion at a time. We never use neighbors from another criterion to score a case.
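
A minimal in-memory sketch of that rule follows. It filters by criterion before it ranks by cosine similarity. `StoredEval` and this `retrieve` function are hypothetical, and a production store would use a vector index rather than a linear scan.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class StoredEval:
    criterion: str
    embedding: list[float]
    score: float

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(store: list[StoredEval], query: list[float], criterion: str,
             top_k: int = 3, min_similarity: float = 0.75) -> list[StoredEval]:
    """Return the closest stored evals for one criterion only."""
    # Filter by criterion first, so other criteria can never supply neighbors.
    scored = [(cosine(query, e.embedding), e) for e in store if e.criterion == criterion]
    kept = [(s, e) for s, e in scored if s >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in kept[:top_k]]
```

Because the filter runs before the ranking, an identical embedding stored under another criterion is never returned.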

New products start with a small set of seed examples. Neighbor quality then tracks how many evals you store for that criterion. A bigger model does not replace that history.

When eval is a service, isolate data and quotas per customer. If you mix tenants in one index or one review queue, you create a privacy and measurement failure.
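
One simple way to make that isolation structural is to key every store by tenant, so no lookup can run without naming one. `TenantStores` is a hypothetical sketch of that idea. Quotas and access control are out of scope here.

```python
from collections import defaultdict

class TenantStores:
    """Keeps a separate eval list per tenant so lookups cannot cross customers."""

    def __init__(self) -> None:
        self._by_tenant: dict[str, list[dict]] = defaultdict(list)

    def add(self, tenant_id: str, record: dict) -> None:
        self._by_tenant[tenant_id].append(record)

    def records(self, tenant_id: str) -> list[dict]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._by_tenant.get(tenant_id, []))
```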

Layer 3: ground truth and humans

The problem: everything can look internally consistent and still be wrong in the same direction. Bad scores get saved and reused.

Our solution: we use human review with full context (conversation, rubric, machine scores, and similar cases). We triage what to review first (large disagreement, borderline scores, new releases, thin coverage). If reviewers repeatedly disagree on the same rubric, we fix the rubric instead of only collecting more labels.
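
The triage order can be sketched as a priority score over the four signals named above. Every weight and cutoff here is illustrative, including the assumption that 0.5 is the pass line and that fewer than 20 labels counts as thin coverage. `Candidate` and `review_priority` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    score: float          # aggregate machine score, 0.0-1.0
    spread: float         # ensemble disagreement (standard deviation)
    new_release: bool     # produced by a newly shipped agent version
    labels_in_slice: int  # human labels already collected for this slice

def review_priority(c: Candidate) -> float:
    """Higher values are reviewed first. All weights are illustrative."""
    disagreement = min(c.spread / 0.2, 1.0)
    borderline = 1.0 - min(abs(c.score - 0.5) / 0.5, 1.0)
    thin = 1.0 if c.labels_in_slice < 20 else 0.0
    return 2.0 * disagreement + 1.5 * borderline + 1.0 * float(c.new_release) + 1.0 * thin

queue = sorted(
    [Candidate(0.95, 0.01, False, 300), Candidate(0.52, 0.18, True, 4)],
    key=review_priority,
    reverse=True,
)
```

In this example, the contested, borderline item from a new release sorts ahead of the confident one.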

Most internal builds under-invest in this step. The UI and workflow must stay low-friction, and someone must own the queue.

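def retrieve_weighted(query, store):
    # Over-fetch candidates, then re-rank them by human verdicts.
    cands = store.vector_search(query, k=15)
    for c in cands:
        if c.human_label == "correct":
            c.weight *= 1.4  # boost examples that a reviewer confirmed
        elif c.human_label == "incorrect":
            c.weight *= 0.3  # demote examples that a reviewer rejected
    return top_k_by_weight(cands, k=5)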

We track accuracy versus experts by slice (criterion, agent, score band). We report the false-pass rate (machine marks acceptable, expert marks not) next to headline accuracy. False passes barely move a coarse average, yet each one still counts as a pass in rollups, so a decision that watches only the average can miss them. A strong top-line score can hide a weak false-pass slice and read as "healthy" next to a lower score that you already treat as noisy. Inspect both.
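
A minimal sketch of that report follows, assuming each labeled row carries a slice name, a machine pass/fail, and an expert pass/fail. The row shape and the `slice_metrics` name are assumptions. Here the false-pass rate is the share of all rows in the slice, which is one of several reasonable denominators.

```python
from collections import defaultdict

def slice_metrics(rows: list[dict]) -> dict[str, dict[str, float]]:
    """rows: {"slice": str, "machine_pass": bool, "expert_pass": bool}."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["slice"]].append(row)
    out = {}
    for name, items in grouped.items():
        agree = sum(r["machine_pass"] == r["expert_pass"] for r in items)
        # False pass: the machine accepts what the expert rejects.
        false_pass = sum(r["machine_pass"] and not r["expert_pass"] for r in items)
        out[name] = {
            "accuracy": agree / len(items),
            "false_pass_rate": false_pass / len(items),
        }
    return out
```

Reporting both numbers per slice keeps a weak false-pass slice visible even when overall accuracy looks strong.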

Layer 4: drift (live signal)

The problem: you change a prompt, tool, or model, and behavior shifts without a hard error. Live traffic also shifts: harder questions, new intents, seasonality.

Our solution: we still monitor live scores, but we treat them as noisy attribution, not proof of a release regression. Stratify where you can. If you alert on live aggregates, use statistics that limit false alarms on ordinary noise. We use process-control-style thresholds on those paths. When you ask whether a build made things worse, you need controlled regression detection on a fixed set (layer 5).
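
As an illustration of a process-control-style threshold, the sketch below derives three-sigma limits from a baseline window of daily mean scores. The three-sigma rule, the window, and the sample values are assumptions. The text above does not specify which control rules we use.

```python
from statistics import mean, stdev

def control_limits(baseline: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Lower and upper limits from a baseline window of daily mean scores."""
    center, spread = mean(baseline), stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(value: float, limits: tuple[float, float]) -> bool:
    low, high = limits
    return value < low or value > high

# One week of daily means for a single stratum (made-up values).
limits = control_limits([0.80, 0.82, 0.81, 0.79, 0.83, 0.80, 0.81])
print(out_of_control(0.80, limits))  # False
print(out_of_control(0.70, limits))  # True
```

An alert from a check like this is a prompt to investigate, not proof that a release caused the shift.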

Layer 5: regression detection

The problem: if you do not re-score the same labeled conversations after each material change, you cannot turn a vague trend into a clear yes or no on quality. You need a repeatable basket that stays stable across versions.

Our solution: we keep a fixed basket of test conversations per agent (labeled, versioned). We compare the new release to the old one on the same items. We split results by criterion so that “overall down 0.08” becomes actionable.

def compare_versions(agent_a, agent_b, ref: ReferenceSet) -> RegressionReport:
    for item, crit in ref.items:
        ...
    return RegressionReport(overall=..., by_criterion=..., worst_examples=...)

Example report (illustrative):

agent: support_bot
baseline → current: v1.2.0 → v1.3.0
reference set: n = 200
overall change: −0.06
by criterion:
  resolution −0.13
  tone +0.04

A strong regression view shows the largest individual drops: which conversations broke, and which clusters point to a root cause. For example, billing after a tool change. Engineering gets a hypothesis, not only a red number.
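
The per-criterion split and the list of largest drops can be sketched as follows, assuming each reference item was scored under both versions. The row shape and the `regression_report` name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def regression_report(rows: list[dict], worst_n: int = 3) -> dict:
    """rows: {"item": str, "criterion": str, "old": float, "new": float}."""
    by_crit: dict[str, list[float]] = defaultdict(list)
    for r in rows:
        by_crit[r["criterion"]].append(r["new"] - r["old"])
    # Most negative change first: these are the conversations to open.
    worst = sorted(rows, key=lambda r: r["new"] - r["old"])[:worst_n]
    return {
        "overall": mean(r["new"] - r["old"] for r in rows),
        "by_criterion": {c: mean(d) for c, d in by_crit.items()},
        "worst_examples": [r["item"] for r in worst],
    }
```

Clustering the worst examples by intent or tool is the step that turns this list into a root-cause hypothesis.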

Layer 6: meta-eval

The problem: when you change judges, retrieval, weighting, or scoring code, the pipeline can look healthy while it drifts away from expert judgment. Headline accuracy stays flat while the false-pass rate creeps up. If you do not rerun the same frozen gold after each substantive change, you cannot tell whether the eval regressed.

Our solution: we keep a fixed labeled set. We rerun it whenever the scoring stack changes in a meaningful way. We track accuracy versus experts and the rate of false passes (machine OK, expert not). We run ablations on that same gold: we build the stack up one layer at a time (baseline judge, then +ensemble, +memory, +human-weighted retrieval, +reference-set regression, and so on) and record the metrics at each step. Each layer targets a different failure mode. You need to see what moved accuracy and false passes, not a one-off lucky run.

def run_meta_eval(engine, frozen_examples) -> Metrics:
    # compare engine labels to expert labels; false pass rate, ...
    ...

We treat headline benchmark numbers as real only when every substantive stack change is followed by those reruns and ablations, not a one-off run.
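
An ablation rerun can be sketched as scoring the same frozen set under each scorer configuration and tabulating both metrics. The example shape, the toy threshold scorers, and the `ablation_table` name are assumptions for illustration. Real configurations would be the stack with and without each layer.

```python
from typing import Callable

Example = dict  # {"input": ..., "expert_pass": bool}; this shape is an assumption

def ablation_table(configs: dict[str, Callable[[Example], bool]],
                   gold: list[Example]) -> dict[str, dict[str, float]]:
    """Score the same frozen gold set with each scorer configuration."""
    table = {}
    for name, scorer in configs.items():
        preds = [scorer(ex) for ex in gold]
        agree = sum(p == ex["expert_pass"] for p, ex in zip(preds, gold))
        false_pass = sum(p and not ex["expert_pass"] for p, ex in zip(preds, gold))
        table[name] = {
            "accuracy": agree / len(gold),
            "false_pass_rate": false_pass / len(gold),
        }
    return table

gold = [
    {"input": 0.9, "expert_pass": True},
    {"input": 0.6, "expert_pass": False},
    {"input": 0.3, "expert_pass": False},
]
table = ablation_table(
    {"baseline": lambda ex: ex["input"] > 0.5, "stricter": lambda ex: ex["input"] > 0.7},
    gold,
)
```

Because every configuration sees the same gold items, a change in either metric can be attributed to the configuration rather than to the data.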

Ask any vendor, including us: “Did you rerun the gold set after that change?”

For ensembles, experiments, and more technical detail, see Improving LLM Judges With Experiments, Not Vibes.

The six layers at a glance

The figure shows the six-layer scoring path after the base judge. Criteria modeling and API callers sit outside it. Accuracy figures sit in the build vs buy table and the footnote. Your numbers will differ.

Figure: Composo evaluation architecture. A base judge, then six layers (ensemble, memory, human review, drift monitoring, regression detection, meta-eval), with a compounding feedback loop.

After the base judge, six layers stack: (1) ensemble, (2) memory, (3) review, (4) drift monitoring, (5) regression detection, (6) meta-eval. The flywheel shows how use compounds (more evals, labels, and releases).

Compounding: more evals improve retrieval; more labels improve weighting; a stack that has run for a year is not something that you reproduce on day one by shipping the same code paths. The six layers map to different work: models (1-2), search (2), reviewer tools (3), live monitoring (4), regression harnesses (5), tests of the eval (6). Doing all of it in-house competes with your product roadmap.

Conclusion

The build vs buy table captures timelines, trust, and where work lands.

  • Build: the six layers, criteria, tenancy, callers (API, SDK, UI), tuning, search upkeep, reviewer habit, and gold-set hygiene. You can ship a thin slice quickly. Release-grade trust costs multiple engineer-years.
  • Buy (Composo): you keep criteria, labels, and ship decisions. We run platform depth and publish measured quality on frozen data (see the table and footnote).

Integration still costs your time on criteria, labels, and process. The trade is building every layer versus accruing eval data through an operated stack that you can inspect and rerun against gold.


Footnote: these numbers come from our internal studies. Example benchmark: 200 expert-labeled conversations across 8 agent types and 12 criteria. Best accuracy we have seen versus a baseline single judge: 83.6% (baseline 71% on the same measure).