What Is LLM Evaluation? A Practical Explanation
LLM evaluation is how you know whether your AI is working. For production AI, this is not optional.
This post is the plain-language introduction: what LLM evaluation is, how it is typically done, where common methods break down, and what production AI quality actually requires.
The problem LLM evaluation solves
When you ship an AI feature, there is no unit test that tells you whether it is working. The output is open-ended natural language. “Did this work?” is a judgement call.
If you have ten outputs a day, you can read them. If you have ten thousand outputs a day, you cannot. You need a system that evaluates them for you.
That system is LLM evaluation.
LLM evaluation answers questions like:
- Did this response answer the user’s actual question?
- Did this output contain unsupported claims?
- Did this AI scribe record what the patient actually said?
- Did this agent’s tool call make sense given the context?
- Is this output dramatically different from last week’s output for the same input?
The harder question - “is this output good?” - is domain-specific and requires domain-specific definitions of “good.”
The three layers of LLM evaluation
In practice, production AI evaluation breaks into three distinct layers:
1. Offline evaluation
Also called eval sets, regression testing, or benchmarking. You have a fixed dataset of inputs with known-correct outputs (or known evaluation criteria). You run your AI against the dataset and score the outputs.
Typical use cases:
- Testing a prompt change before shipping
- Comparing two models (GPT-5 vs Claude 4.6)
- Regression testing in CI/CD
- Benchmarking against a published leaderboard
Offline evaluation is the most mature part of the stack. Open-source frameworks like Promptfoo, DeepEval, and Ragas cover it well.
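To make the shape of an offline eval concrete, here is a minimal sketch in plain Python. It does not use Promptfoo, DeepEval, or Ragas; `EvalCase`, `exact_match`, `toy_model`, and the threshold value are all hypothetical names and numbers chosen for illustration, and a real eval set would use a richer scorer than exact string match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # known-correct output for this input


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalised strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run the model over a fixed dataset and return the mean score."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


EVAL_SET = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("Capital of Peru?", "Lima"),
]

pass_rate = run_offline_eval(toy_model, EVAL_SET)

# Assumed threshold: in CI, fail the build when the score drops below it.
REGRESSION_THRESHOLD = 0.6
regressed = pass_rate < REGRESSION_THRESHOLD
```

Running the same fixed dataset before and after a prompt change gives two comparable pass rates, which is the whole point of the offline layer.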
2. Online evaluation
Scoring production outputs continuously. Not testing a fixed dataset; measuring what the system is actually doing in production, at scale.
Typical use cases:
- Detecting quality drops after a deployment
- Tracking per-customer or per-tenant quality
- Surfacing failure patterns you did not anticipate
- Flagging traces for human review
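As a rough sketch of what online evaluation can look like in code, the class below keeps a rolling window of scores per tenant and lists tenants whose average has dropped below a threshold. `RollingQuality`, the window sizes, and the 0.7 threshold are assumptions for illustration; the scores are assumed to come from whatever evaluator already runs on production traces.

```python
from collections import defaultdict, deque
from typing import Optional


class RollingQuality:
    """Track a rolling mean score per tenant over the last `window` traces."""

    def __init__(self, window: int = 100, min_samples: int = 5):
        self.min_samples = min_samples
        # Each tenant gets a bounded deque, so old scores fall out automatically.
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, tenant: str, score: float) -> None:
        self._scores[tenant].append(score)

    def mean(self, tenant: str) -> Optional[float]:
        """Return the rolling mean, or None when there is too little data."""
        scores = self._scores.get(tenant)
        if not scores or len(scores) < self.min_samples:
            return None
        return sum(scores) / len(scores)

    def tenants_below(self, threshold: float) -> list[str]:
        """List tenants whose rolling mean is under the threshold."""
        flagged = []
        for tenant in self._scores:
            value = self.mean(tenant)
            if value is not None and value < threshold:
                flagged.append(tenant)
        return sorted(flagged)


monitor = RollingQuality(window=5, min_samples=3)
for score in [0.9, 0.95, 0.9]:
    monitor.record("tenant-a", score)
for score in [0.9, 0.4, 0.3, 0.2]:
    monitor.record("tenant-b", score)

flagged = monitor.tenants_below(0.7)
```

The `min_samples` guard is a deliberate choice: a single bad trace from a low-volume tenant should not raise an alert on its own.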
3. Runtime guardrails
Evaluating outputs at inference time, before they reach the user, and blocking or modifying those that fail.
Typical use cases:
- Blocking AI outputs that violate policy
- Gating tool calls in agent systems
- Rewriting outputs that fail a quality check
- Escalating uncertain cases to human review
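A guardrail is essentially a decision function wrapped around the output. The sketch below shows one possible shape: block low scores, escalate the uncertain middle band, allow the rest. The thresholds, the fallback message, and `toy_policy_score` are hypothetical; in practice the evaluator would be a real policy or quality check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass
class GuardResult:
    action: Action
    output: str


def guard(
    output: str,
    evaluator: Callable[[str], float],
    block_below: float = 0.3,
    allow_from: float = 0.8,
    fallback: str = "Sorry, I can't help with that.",
) -> GuardResult:
    """Decide what happens to an output before it reaches the user."""
    score = evaluator(output)
    if score < block_below:
        # Clear failure: replace the output with a safe fallback.
        return GuardResult(Action.BLOCK, fallback)
    if score < allow_from:
        # Uncertain: the caller is expected to route this to human review.
        return GuardResult(Action.ESCALATE, output)
    return GuardResult(Action.ALLOW, output)


def toy_policy_score(text: str) -> float:
    """Hypothetical keyword-based evaluator, for illustration only."""
    lowered = text.lower()
    if "guaranteed returns" in lowered:
        return 0.0
    if "probably" in lowered:
        return 0.5
    return 1.0


blocked = guard("This fund has guaranteed returns.", toy_policy_score)
escalated = guard("This is probably fine to take with food.", toy_policy_score)
allowed = guard("Your appointment is at 3pm.", toy_policy_score)
```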
Complete production AI quality typically uses all three. Most teams have one (offline evaluation). Fewer have all three working well.
How LLM evaluation is typically done
LLM-as-judge
The most common approach. You write a prompt that defines the evaluation criterion and ask an LLM to score outputs against it. A typical judge prompt looks like this:
Evaluate the following response for factual accuracy.
Score 1-5 where 5 is fully accurate and 1 is fully inaccurate.
Response: [output]
Reference: [source]
LLM-as-judge is cheap, flexible, and easy to set up. It is also the approach most production teams are using today.
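Wiring that prompt into code might look like the sketch below. The model call is deliberately left as a `call_llm` callable you supply, so no particular provider API is assumed. The final "Reply with the score only." line and the `parse_score` helper are additions for illustration, not part of the prompt above.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = """Evaluate the following response for factual accuracy.
Score 1-5 where 5 is fully accurate and 1 is fully inaccurate.
Response: {output}
Reference: {source}
Reply with the score only."""


def parse_score(reply: str, low: int = 1, high: int = 5) -> int:
    """Extract the first integer from the judge reply and range-check it."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {low}-{high}")
    return score


def judge(output: str, source: str, call_llm: Callable[[str], str]) -> int:
    """Build the judge prompt, call the model, and return the parsed score."""
    prompt = JUDGE_TEMPLATE.format(output=output, source=source)
    return parse_score(call_llm(prompt))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in so the example runs without a model."""
    return "Score: 4"


score = judge(
    "Paris is the capital of France.",
    "France's capital city is Paris.",
    fake_llm,
)
```

Rejecting replies that cannot be parsed or that fall outside the scale is worth the few extra lines: a silently mis-parsed score is worse than a missing one.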
The limitations
LLM-as-judge has three well-documented problems:
1. High variance, even at temperature 0. The same output evaluated by the same model with the same prompt produces different scores across multiple runs. We covered this in detail in LLMs: Great Witnesses, Terrible Judges. The practical implication is that A/B testing based on LLM-as-judge can fail to detect real quality differences because evaluator noise drowns out the signal.
2. Scale-dependent bias. The same output scored on a 1-5 scale might get 3 (60%). On a 1-10 scale, it might get 3 (30%). On a 1-100 scale, it might get 85 (85%). The scale configuration introduces systematic bias unrelated to the actual quality of the output.
3. Human-expert alignment plateau. Basic LLM-as-judge reaches around 70% alignment with human domain experts on most tasks. For production gating - where the cost of a false positive or false negative is meaningful - 70% is usually not good enough.
Crossing above that plateau requires techniques beyond prompt engineering:
- Criteria ensembling. Using multiple specialised evaluation criteria rather than a single monolithic judge prompt.
- Variance-informed calibration. Treating high-variance evaluations as low-confidence and aggregating across runs.
- Reward modelling. Training a model specifically for evaluation, rather than prompting a general-purpose LLM.
With these techniques layered, Composo’s evaluation reaches 90%+ alignment with human experts across most production contexts. That is the operational bar for catching domain-specific failures reliably.
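To illustrate the first two techniques, here is a simplified sketch. It is not Composo's implementation: it only shows the general idea of running several narrow criteria, repeating each one, and treating a wide spread across runs as low confidence. The function names, the `max_std` cut-off, and the fake judges are all assumptions.

```python
from itertools import cycle
from statistics import mean, pstdev
from typing import Callable


def aggregate_runs(scores: list[float], max_std: float = 0.5) -> dict:
    """Average repeated judge runs and flag noisy results as low confidence."""
    spread = pstdev(scores)
    return {"score": mean(scores), "std": spread, "confident": spread <= max_std}


def ensemble(
    output: str,
    criteria: dict[str, Callable[[str], float]],
    runs: int = 3,
) -> dict[str, dict]:
    """Score one output against several narrow criteria, several runs each."""
    return {
        name: aggregate_runs([judge(output) for _ in range(runs)])
        for name, judge in criteria.items()
    }


# Fake judges for illustration: one stable, one that varies between runs.
_noisy_scores = cycle([1.0, 5.0, 3.0])
criteria = {
    "grounded_in_source": lambda out: 4.0,
    "answers_question": lambda out: next(_noisy_scores),
}

report = ensemble("some model output", criteria, runs=3)
```

Here the noisy criterion still produces a mean score, but it is marked as not confident, so a downstream gate can choose to escalate rather than act on it.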
Domain-specific vs generic evaluation
A central distinction in LLM evaluation is whether the evaluation is generic or domain-specific.
Generic evaluation asks questions like “is this grounded in the source?” or “is this factually consistent?” or “is this coherent?” These are general-purpose criteria that apply to most LLM outputs.
Domain-specific evaluation asks questions like “did the AI scribe record the red-flag symptom the patient mentioned?” or “is the FX rate in this financial advice from today’s market data?” or “did this legal memo cite real cases?”
Generic evaluation plateaus. It catches generic failures and misses specific ones.
Domain-specific evaluation requires:
- A failure taxonomy - the specific ways your AI fails in your specific domain
- Evaluation criteria written for each failure mode
- Calibration against domain-expert-labelled examples
- Ongoing drift handling as the domain evolves
This is what makes domain-specific evaluation harder to build and more valuable when it works. A generic evaluator is the same across every customer. A domain-specific evaluator is unique per deployment.
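Calibration against expert labels starts with measuring agreement. The sketch below assumes each example has a boolean label meaning "this output is a failure", one from the evaluator and one from a domain expert. `alignment_report` and the sample labels are hypothetical.

```python
def alignment_report(judge_labels: list[bool], expert_labels: list[bool]) -> dict:
    """Compare evaluator verdicts with expert verdicts on the same examples.

    A label of True means "this output is a failure".
    """
    if len(judge_labels) != len(expert_labels) or not expert_labels:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(judge_labels, expert_labels))
    agreements = sum(j == e for j, e in pairs)
    # False positive: evaluator flagged a failure the expert did not see.
    false_positives = sum(j and not e for j, e in pairs)
    # False negative: evaluator missed a failure the expert flagged.
    false_negatives = sum(e and not j for j, e in pairs)
    return {
        "alignment": agreements / len(pairs),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


judge_labels = [True, True, False, False, True, False, False, False, True, False]
expert_labels = [True, False, False, False, True, False, True, False, True, False]

calibration = alignment_report(judge_labels, expert_labels)
```

Splitting disagreements into false positives and false negatives matters because the two usually carry different costs in a given domain.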
What production AI quality actually requires
In our experience across 30+ production deployments, production AI quality needs:
- A failure taxonomy specific to your AI. Built from actual production traces, not a template.
- 90%+ alignment with human domain experts. Below the 80% mark, false-positive rates make the system unusable.
- Runtime guardrails for high-stakes outputs. Anything where a silent failure creates customer harm needs to be gated, not just observed.
- Drift detection. Models change, product changes, domains drift. The evaluation system needs to track and respond.
- Audit trails. Especially for regulated industries, every evaluation decision needs to be reproducible and reviewable.
- A calibration loop. Domain experts should be able to correct evaluation mistakes and have those corrections improve the system.
Most homegrown LLM-as-judge setups cover the first two items (a failure taxonomy and expert alignment) to some extent and miss runtime guardrails, drift detection, and the calibration loop entirely. Getting all six right is roughly a 6 to 12 month internal engineering commitment, or a 2 to 4 week deployment with a platform built for it.
See Build vs Buy LLM Evaluation for the honest numbers on that decision.
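Of the six requirements, audit trails are the easiest to sketch in code. One possible shape, using only the Python standard library, is a JSON record per evaluation decision. The field names here are assumptions, not a standard schema; the evaluator version and a hash of the output are included so that a decision can be tied back to exactly what was evaluated and by what.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(
    trace_id: str,
    output: str,
    criterion: str,
    evaluator_version: str,
    score: float,
    decision: str,
) -> str:
    """Serialise one evaluation decision as a single JSON line."""
    record = {
        "trace_id": trace_id,
        # Hash rather than store the raw output, in case it is sensitive.
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "criterion": criterion,
        "evaluator_version": evaluator_version,
        "score": score,
        "decision": decision,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    trace_id="trace-001",
    output="This fund has guaranteed returns.",
    criterion="no_unsupported_financial_claims",
    evaluator_version="v1",
    score=0.0,
    decision="block",
)
parsed = json.loads(line)
```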
Further reading
- LLM-as-judge specifically: LLMs: Great Witnesses, Terrible Judges
- Improving LLM-as-judge accuracy: Improving LLM Judges With Experiments, Not Vibes
- RAG evaluation: The Complete Guide to RAG Evaluation
- Agent evaluation: Guide to Evaluating LangGraph Agents
- The failure taxonomy approach: An Ontology of LLM Failure Modes
- Production guardrails: AI Guardrails: How to Block Bad LLM Outputs in Production
- Evaluation drift: Eval Drift: Why Your LLM Evaluations Stop Working
The shortest version of this post
LLM evaluation is the practice of measuring whether AI outputs meet a quality bar. It breaks into offline evaluation (before shipping), online evaluation (monitoring production), and runtime guardrails (blocking bad outputs). The most common method is LLM-as-judge, which typically plateaus around 70% alignment with human domain experts. Crossing above that (to the 90%+ range) requires techniques like criteria ensembling, reward modelling, and domain-specific calibration. Production AI quality typically uses all three evaluation layers.
For a read on what your production AI is actually doing right now, book a diagnostic.
Frequently asked questions
What is LLM evaluation?
LLM evaluation is the practice of measuring whether an LLM's outputs meet a required quality standard. It covers offline evaluation (scoring outputs against reference data for regression testing and model comparison), online evaluation (scoring production outputs for quality tracking), and runtime guardrails (blocking outputs that fail quality checks before they reach users).
What is LLM-as-judge?
LLM-as-judge is using one LLM to score the outputs of another LLM. A prompt defines the evaluation criterion; the judge LLM returns a score. It is the most common approach to LLM evaluation today. It is cheap to set up but typically only reaches around 70% alignment with human domain experts, which is often not sufficient for production gating.
What are the limitations of LLM-as-judge?
High variance even at temperature 0, scale-dependent bias (the same output gets different scores on a 1-5 vs a 1-10 scale), inability to catch domain-specific failures without extensive prompt engineering, and a plateau in accuracy that requires techniques beyond prompt engineering to exceed.
What is a reward model?
A reward model is a model trained specifically to evaluate outputs, rather than a general-purpose LLM prompted to evaluate. Reward models can be calibrated to domain-specific failure modes through supervised training on labelled examples. They typically reach higher accuracy than LLM-as-judge on the same evaluation task.
What is the difference between online evaluation and runtime guardrails?
Online evaluation runs after the fact - scoring production outputs to track quality, detect regressions, and surface failure patterns. Runtime guardrails run inline - evaluating each output at inference time and blocking or modifying those that fail a quality check. A complete production AI quality system typically uses both.