Eval Drift: Why Your LLM Evaluations Stop Working (And How to Detect It)
Your evaluations worked last quarter. They might not work this quarter. Evaluation drift is the quiet failure mode of production LLM quality systems.
We run into this pattern across almost every Composo deployment. A team set up LLM-as-judge or an eval framework six or twelve months ago. It worked. The dashboards were green. Then the team swapped models, or shipped a product update, or customer behaviour shifted, and without anyone noticing, the evaluations stopped catching real failures.
This post covers what evaluation drift actually is, why LLM evaluations are unusually prone to it, how to detect it, and what to do about it.
What evaluation drift is
Evaluation drift is the silent decoupling of your evaluation system from reality. The eval continues to run. It continues to emit scores. But the scores no longer track the actual quality of your AI.
Three underlying causes:
- Generator model drift. The model producing the outputs changed. Failures that used to manifest one way now manifest another way. Your evaluation criteria, written for the old manifestation, miss the new one.
- Product surface drift. Your team added features, changed prompts, or expanded the use cases your AI handles. New failure modes appear that the evaluation does not cover.
- Distribution drift. The customers or traffic patterns using your AI change. New question types, new edge cases, new context shapes that the evaluation was not calibrated for.
All three look the same from the outside: the dashboard is green, but customers are finding failures.
Why LLM evaluations drift more than traditional ML evaluations
Traditional ML evaluation is usually a fixed objective function applied to a fixed test set. Accuracy, F1, AUC. The scoring definition does not depend on interpretation.
LLM evaluation is different. Most production LLM evaluations encode failure modes through natural-language criteria. “Does the output contain unsupported claims?” “Does the output omit the required disclosure?” “Is the tool call consistent with the user’s intent?”
These criteria are interpretation-heavy. How an LLM-as-judge answers them depends on:
- Which foundation model is running the evaluation
- Which foundation model produced the output being evaluated
- The exact wording of the evaluation criterion
- The distribution of training data underlying both models
Change any of those, and the evaluation output can change without the underlying quality changing. The inverse also holds: real quality can degrade without the evaluation output changing.
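One way to make those dependencies visible is to stamp every score with a fingerprint of the setup that produced it. The sketch below is a minimal illustration, not a Composo API: the function name, the field names, and the model identifiers are all hypothetical. It covers the three inputs you can observe directly. The training data behind each model cannot be hashed, so a changed model version string is the closest available proxy.

```python
import hashlib
import json


def eval_fingerprint(judge_model: str, generator_model: str, criterion: str) -> str:
    """Short, stable hash of the observable inputs that shape an LLM-as-judge score."""
    payload = json.dumps(
        {"judge": judge_model, "generator": generator_model, "criterion": criterion},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical model identifiers, for illustration only.
criterion = "Does the output contain unsupported claims?"
before = eval_fingerprint("judge-v1", "generator-v1", criterion)
after = eval_fingerprint("judge-v1", "generator-v2", criterion)

if before != after:
    print("Evaluation setup changed: scores before and after are not directly comparable.")
```

Storing the fingerprint next to each score means a later jump in the pass rate can be lined up against the moment the setup changed.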
How to detect drift
Signal 1: The pass rate moves without a product change
If your evaluation pass rate jumped from 85% to 91% last week and nothing about your product changed, treat it as drift until proven otherwise. The generator model might have updated, the evaluator might have updated, or the prompt for one of them shifted. Investigate before celebrating.
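Before investigating, it helps to rule out plain sampling noise. A rough way to do that is a two-proportion z-test on the two weeks' pass counts. The sketch below assumes each week is an independent sample of at least a few hundred scored outputs; the counts are hypothetical and chosen to match the 85% and 91% in the example.

```python
import math


def pass_rate_shift(passed_before: int, total_before: int,
                    passed_after: int, total_after: int) -> tuple:
    """Two-proportion z-test. Returns (z, two-sided p-value) for a pass-rate change."""
    rate_before = passed_before / total_before
    rate_after = passed_after / total_after
    pooled = (passed_before + passed_after) / (total_before + total_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_before + 1 / total_after))
    if std_err == 0:
        # Every output passed (or failed) in both periods: no detectable shift.
        return 0.0, 1.0
    z = (rate_after - rate_before) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value


# Hypothetical counts: 850/1000 passed last week, 910/1000 this week.
z, p = pass_rate_shift(850, 1000, 910, 1000)
if p < 0.01:
    print(f"Pass rate moved by more than noise explains (z={z:.2f}). Check for drift.")
```

A small p-value only says the move is real. It does not say whether quality changed or the evaluation did, which is why the next step is to look at what changed in the generator, the evaluator, or the prompts.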
Signal 2: The pass rate does not move despite real product changes
The opposite case. You shipped a major feature, changed the prompt, rolled out to new customers. Pass rate did not budge. That is not evidence of quality stability; it is evidence your evaluation is not sensitive to the change.
Signal 3: Production incidents the evaluation missed
The highest-signal detector. When a real customer issue surfaces and you run the failing trace through your evaluation, the eval should flag it. If it does not, you have a drift problem.
We heard this from a head of AI at an enterprise SaaS platform:
“When someone on my team is doing a peer assessment and they say ‘oh, I spotted that the AI did something weird’ - that’s really a byproduct rather than the deliberate process. We don’t do a very good job of that today. It’s more ad hoc and reactive than proactive.”
That is drift showing up as an operational problem. The eval was supposed to catch it. It did not.
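The replay check can be made deliberate rather than ad hoc. The sketch below assumes you keep the traces behind confirmed incidents and that your evaluator can be called as a function returning a score where higher means better. `incident_recall`, `stub_evaluate`, and the 0.5 threshold are hypothetical stand-ins, not part of any real library.

```python
from typing import Callable, List, Sequence, Tuple


def incident_recall(
    incident_traces: Sequence[str],
    evaluate: Callable[[str], float],
    pass_threshold: float = 0.5,
) -> Tuple[float, List[str]]:
    """Share of known-bad traces the evaluator flags, plus the ones it let through."""
    if not incident_traces:
        raise ValueError("need at least one incident trace")
    # Every trace here is a confirmed failure, so a passing score is a miss.
    missed = [trace for trace in incident_traces if evaluate(trace) >= pass_threshold]
    recall = 1 - len(missed) / len(incident_traces)
    return recall, missed


def stub_evaluate(trace: str) -> float:
    """Stand-in for a real LLM-as-judge call; only recognises refund problems."""
    return 0.2 if "refund" in trace else 0.9


incidents = [
    "refund promised without policy basis",
    "wrong currency in quote",
    "refund issued twice",
]
recall, missed = incident_recall(incidents, stub_evaluate)
print(f"Evaluator caught {recall:.0%} of known incidents; missed: {missed}")
```

Each missed trace is a concrete example of a failure mode the current criteria do not cover.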
Signal 4: Evaluator variance increases
If you run the same output through the evaluator multiple times, you should get similar scores. When variance goes up - the same output gets scored 0.6, 0.8, 0.5 on three runs - the evaluator has become less certain. That is often an early signal that the evaluation criteria no longer map cleanly to the current data distribution.
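A minimal way to measure this is to score each output several times and look at the spread. The sketch below uses the sample standard deviation from Python's standard library. The 0.1 threshold and the output IDs are assumptions for illustration, and the right threshold depends on your score scale.

```python
import statistics
from typing import Dict, List


def unstable_outputs(
    repeated_scores: Dict[str, List[float]], max_stdev: float = 0.1
) -> Dict[str, float]:
    """Return outputs whose repeated evaluator scores spread wider than max_stdev."""
    flagged = {}
    for output_id, scores in repeated_scores.items():
        if len(scores) < 2:
            continue  # a spread needs at least two runs
        spread = statistics.stdev(scores)
        if spread > max_stdev:
            flagged[output_id] = spread
    return flagged


# Three evaluator runs per output; "out-1" mirrors the 0.6 / 0.8 / 0.5 example.
runs = {
    "out-1": [0.6, 0.8, 0.5],
    "out-2": [0.70, 0.70, 0.72],
}
flagged = unstable_outputs(runs)
for output_id, spread in flagged.items():
    print(f"{output_id}: score spread {spread:.2f} exceeds threshold")
```

Tracking the share of flagged outputs per criterion over time turns a one-off check into a trend you can alert on.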
Signal 5: The eval stops firing entirely
A criterion that used to flag 3% of outputs now flags 0.1%. Either the problem got much better (possible), or the eval stopped detecting it (more likely).
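To check that a collapse like this is not just a quiet week, you can ask how likely the observed count would be if the historical firing rate still held. The sketch below computes a binomial tail probability with the standard library only. The function name and the counts are hypothetical, and it treats outputs as independent, which is a simplification.

```python
import math


def prob_at_most(fired: int, total: int, baseline_rate: float) -> float:
    """P(X <= fired) for X ~ Binomial(total, baseline_rate), summed in log space."""
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    cumulative = 0.0
    for k in range(fired + 1):
        log_pmf = (
            math.lgamma(total + 1) - math.lgamma(k + 1) - math.lgamma(total - k + 1)
            + k * math.log(baseline_rate)
            + (total - k) * math.log1p(-baseline_rate)
        )
        cumulative += math.exp(log_pmf)
    return min(cumulative, 1.0)


# Hypothetical numbers: a criterion with a 3% historical firing rate
# that fired once in the last 1,000 outputs (0.1%).
p_collapse = prob_at_most(fired=1, total=1000, baseline_rate=0.03)
if p_collapse < 0.001:
    print("Firing rate is far below baseline: sample the unflagged outputs by hand.")
```

The probability says the drop is not chance. Telling a real improvement apart from a blind evaluator still takes a human reading a sample of the outputs that were not flagged.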
What to do about drift
1. Re-calibrate on model swaps
Every time you change the generator model or the evaluator model, re-run the evaluation against a golden set of domain-expert-labelled examples. If the scores shift meaningfully, re-tune.
This is the single most important drift-mitigation action. Model swaps are the most common drift cause.
2. Keep a golden evaluation set that predates the current models
Maintain a set of hand-labelled examples that covers your important failure modes. Run it through each new model and each evaluation version. Track the evaluation’s accuracy on the golden set over time.
If accuracy on the golden set drops, the evaluation has drifted, regardless of what the live pass rate looks like.
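As a rough sketch of that check: score every golden example, compare the evaluator's pass/fail verdict with the expert label, and compare the resulting accuracy with the figure recorded at the last calibration. The data layout, the two stub judges, and the 0.05 tolerance are assumptions for illustration. A real evaluator would be an LLM-as-judge call.

```python
from typing import Callable, Dict, List


def golden_set_accuracy(
    golden_set: List[Dict],
    evaluate: Callable[[str], float],
    pass_threshold: float = 0.5,
) -> float:
    """Share of golden examples where the evaluator's verdict matches the expert label."""
    if not golden_set:
        raise ValueError("golden_set is empty")
    agreements = sum(
        (evaluate(example["output"]) >= pass_threshold) == example["expert_pass"]
        for example in golden_set
    )
    return agreements / len(golden_set)


def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when accuracy fell by more than the tolerated amount."""
    return baseline - current > tolerance


golden = [
    {"output": "cites policy correctly", "expert_pass": True},
    {"output": "invents a discount", "expert_pass": False},
    {"output": "omits required disclosure", "expert_pass": False},
    {"output": "answers with disclosure", "expert_pass": True},
]


def old_judge(text: str) -> float:  # stub for the evaluator at calibration time
    return 0.1 if ("invents" in text or "omits" in text) else 0.9


def new_judge(text: str) -> float:  # stub for the evaluator after a model swap
    return 0.1 if "invents" in text else 0.9


baseline_accuracy = golden_set_accuracy(golden, old_judge)
current_accuracy = golden_set_accuracy(golden, new_judge)
if has_drifted(current_accuracy, baseline_accuracy):
    print(f"Golden-set accuracy fell from {baseline_accuracy:.0%} to {current_accuracy:.0%}")
```

The same function serves both the model-swap check and the ongoing tracking: run it on every change and store the result with a date.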
3. Monitor evaluator variance as a first-class signal
We covered this in detail in a recent paper on LLM-as-judge variance. Variance is a leading indicator. If evaluator variance starts increasing on a specific criterion, expect that criterion to lose accuracy soon.
4. Make corrections compound, not vanish
When a human reviewer fixes an evaluation output - “the eval said this was fine, it was not fine, here’s why” - that correction is extremely valuable. Most eval systems throw these corrections away. A good system adds them to the calibration set so the evaluator learns from them.
Karl Wiseman, who works on AI at Pigment, articulated the compounding version of this:
“I would like to have something that is constantly monitoring any AI within our process… it’s the over time thing that I think is important for me. What works today doesn’t work tomorrow. Like, why? What’s changed? How can we better handle those things? Because that feels like the stressful part of never being able to get on top of it.”
That is the problem drift handling is trying to solve. The eval should be getting better over time, not stale.
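In its simplest form, keeping corrections means writing every human disagreement into the calibration set instead of discarding it. The sketch below is an illustration under assumed names: the `Correction` fields and the dictionary keyed by trace ID are hypothetical, not a description of how any particular product stores this.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Correction:
    trace_id: str
    output: str
    eval_pass: bool    # what the evaluator said
    expert_pass: bool  # what the human reviewer decided
    reason: str        # the reviewer's explanation


def absorb_corrections(calibration_set: Dict[str, dict],
                       corrections: Iterable[Correction]) -> int:
    """Add reviewer disagreements to the calibration set; return how many were added."""
    added = 0
    for correction in corrections:
        if correction.eval_pass == correction.expert_pass:
            continue  # the reviewer agreed with the eval, nothing to learn
        # Keyed by trace ID so a re-reviewed trace overwrites its older entry.
        calibration_set[correction.trace_id] = {
            "output": correction.output,
            "expert_pass": correction.expert_pass,
            "reason": correction.reason,
        }
        added += 1
    return added


calibration: Dict[str, dict] = {}
reviews = [
    Correction("t-1", "quoted a retired plan", True, False, "plan was discontinued"),
    Correction("t-2", "answered correctly", True, True, "no issue"),
]
added = absorb_corrections(calibration, reviews)
print(f"{added} correction(s) added; calibration set now has {len(calibration)} example(s)")
```

Storing the reviewer's reason alongside the label matters: it is what lets a later rewrite of the criterion target the actual gap.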
5. Re-run the failure taxonomy quarterly
The failure modes your AI was exhibiting six months ago are not the failure modes it is exhibiting now. Re-surface the taxonomy from fresh production traces. Check that your evaluation criteria still map to the failures that matter.
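The mapping check at the end of that process is mostly bookkeeping. Assuming a reviewer has labelled a fresh sample of failing traces with failure-mode names, and you keep a record of which modes each criterion is meant to catch, the gaps fall out of a set difference. All labels and criterion names below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def uncovered_failure_modes(
    observed_labels: Iterable[str],
    criteria_coverage: Dict[str, Set[str]],
) -> List[Tuple[str, int]]:
    """Failure modes seen in fresh traces that no criterion claims, most frequent first."""
    covered = set().union(*criteria_coverage.values())
    counts = Counter(observed_labels)
    return [(mode, n) for mode, n in counts.most_common() if mode not in covered]


# Hypothetical labels from a fresh sample of failing production traces.
observed = (
    ["hallucinated_citation"] * 3
    + ["wrong_tool_call"] * 2
    + ["missing_disclosure"]
)
# Hypothetical record of which failure modes each criterion is meant to catch.
coverage = {
    "unsupported_claims": {"hallucinated_citation"},
    "required_disclosure": {"missing_disclosure"},
}

gaps = uncovered_failure_modes(observed, coverage)
for mode, count in gaps:
    print(f"No criterion covers '{mode}' ({count} occurrences in the sample)")
```

A gap here only shows that nothing is intended to catch the failure. Whether the existing criteria still catch what they claim to is a separate question, answered by the golden set.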
How Composo handles drift
Drift is handled as a first-class feature in Composo’s evaluation system, not a bolt-on.
- Golden set maintenance. The initial deployment produces a golden evaluation set labelled by the customer’s domain experts. Composo tracks evaluator accuracy on the golden set over time and alerts when it drops.
- Variance monitoring. Per-criterion evaluator variance is tracked and surfaced. An increase in variance triggers review.
- Corrections compound. When a domain expert reviews an evaluation output and disagrees, that correction goes back into the calibration data. The evaluation model learns from it. The next version is better.
- Re-calibration on model swaps. When the generator model or evaluator model is upgraded, Composo automatically re-scores the golden set and flags any shift.
- Quarterly failure-taxonomy refresh. Every three months, we pull a fresh sample of production traces and re-surface the current failure taxonomy. New failure modes that have emerged get added as evaluation criteria.
The effect is that Composo’s evaluation quality at month 18 of a deployment is higher than at month 3. That is the opposite of what happens with a homegrown system left to drift.
The shortest version of this post
LLM evaluations drift. Silently, over weeks. Every model swap is a drift event. Every product change is a drift event. The fix is to track evaluator accuracy against a golden set, monitor variance, absorb human corrections back into calibration, and re-run the failure taxonomy periodically.
If you have not looked at your eval system’s accuracy on a held-out golden set in the last three months, it has probably drifted.
For a read on what your production AI is actually doing right now (regardless of what your evals are saying), book a diagnostic.
Frequently asked questions
What is evaluation drift?
Evaluation drift is when your LLM evaluation system stops catching the failures it was designed to catch. It happens when the underlying model changes, the product surface area changes, or the domain distribution shifts - and the evaluation criteria written for the old state are no longer accurate for the new state.
How do I know if my evals are drifting?
Symptoms include: the eval pass rate suddenly improves or deteriorates without a corresponding product change; production incidents that the eval system did not flag; customer-reported failures that evals said were fine. If you rely on a single evaluation metric and it has not moved in months despite real product changes, that is itself a drift signal.
Why do LLM evaluations drift more than traditional ML evaluations?
Because LLM outputs are open-ended natural language and evaluations are often built on LLM-as-judge with prompts that encode specific failure-mode definitions. When the generator model changes (a model swap from GPT-4.1 to GPT-5, for example), the way failures manifest changes, and evaluation criteria written for the old manifestation miss the new one.
Should I re-run calibration every time I swap a model?
Yes. Model swaps are one of the most common causes of eval drift. A 30-minute recalibration run comparing evaluation scores against a golden set of domain-expert-labelled examples catches most drift introduced by model changes before it matters in production.
How does Composo handle evaluation drift?
Composo monitors the distribution of evaluation scores over time, flags statistically significant shifts, and re-calibrates the evaluation model when corrections from domain experts indicate that the old criteria are no longer accurate. Drift is handled as a first-class feature of the system, not a bolt-on.