While LLM-as-a-judge (LLMaaJ) appears attractive due to its perceived consistency compared to human evaluators and its seemingly reproducible results, that consistency is largely illusory, and relying on it fundamentally undermines your ability to test and monitor your AI application. When you can't trust your evaluation system, you lose visibility into your application's actual performance - making it impossible to know whether your system is working correctly or failing catastrophically.
Even with temperature settings locked to near-zero values, LLMs exhibit measurable variability in their judgments - meaning the same model output can receive different evaluation scores across multiple runs. This inconsistency directly translates to blindness about your application's true performance.
Take the following simple example, extracted from PrimeBench, where we have added a clear hallucination, "Hope Street in the Republic of Ireland", into the text (the source text clearly states that Hope Street is in Northern Ireland):
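As an illustration of the setup (this is not our exact prompt, and the source passage and generated answer below are simplified placeholders rather than the actual PrimeBench text), a judge call of this kind using the Anthropic Python SDK might look roughly like:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Simplified placeholders, not the actual PrimeBench passage or the output under test.
source_text = "Hope Street is located in Belfast, Northern Ireland. [...]"
generated_answer = "Hope Street is located in the Republic of Ireland. [...]"

judge_prompt = f"""You are evaluating a generated answer against a source text.

Source text:
{source_text}

Generated answer:
{generated_answer}

Rate the factual accuracy of the generated answer on a scale of 1 to 5."""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    temperature=0,
    max_tokens=256,
    messages=[{"role": "user", "content": judge_prompt}],
)
print(response.content[0].text)
```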
claude-sonnet-4-20250514, with temperature set to 0, responds with 3/5 (60%).
Now let’s change the last line so it asks for a score out of 10, as we want some better granularity in our evals:
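Hypothetically, that final instruction line in the sketch above would become something like:

```python
# Hypothetical 1-10 variant of the judge prompt's final instruction line:
final_instruction = "Rate the factual accuracy of the generated answer on a scale of 1 to 10."
```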
Interestingly, we still get a response of 3, but this time it is out of 10! That halves the implied score to just 30%. Now let’s try to get some more detail; we change the last line to ask for a percentage:
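Again hypothetically, the final instruction line now asks for a percentage:

```python
# Hypothetical percentage variant of the final instruction line:
final_instruction = "Rate the factual accuracy of the generated answer as a percentage from 0% to 100%."
```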
And we now get a score of 85%.
If you try reproducing our results with this prompt you might see different values as well: even with the same prompt at temperature 0, results are far from deterministic. Here we are essentially asking the LLM to perform Likert scoring, and the LLM fails to place the score consistently on an arbitrarily sized number line.
This example reveals a fundamental problem: you cannot trust your application's performance metrics when they are built on LLM-as-a-judge. The identical factual hallucination receives wildly different scores (60%, 30%, 85%) purely because of the scoring scale we chose, not because of any change in objective quality. This variance compromises the validity of your performance evaluation, introducing noise that obscures true system behaviour.
At Composo, we have spent a lot of time looking at LLM-as-a-judge scoring. In one experiment, we called Sonnet 3.7 with the same input 20 times for a complex task one of our customers had, and the scores it returned varied noticeably from run to run.
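If you want to reproduce this kind of repeated-sampling check yourself, a rough sketch looks like the following (assuming the Anthropic Python SDK; the model ID, judge prompt, and score-parsing logic here are illustrative placeholders rather than our customer's actual setup):

```python
import re
import statistics

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

judge_prompt = "..."  # the customer's judge prompt is not reproduced here

def judge_once(prompt: str) -> float:
    """Run the judge once and naively parse the first number in the reply as the score."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # Claude Sonnet 3.7
        temperature=0,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group())

scores = [judge_once(judge_prompt) for _ in range(20)]
print(f"mean={statistics.mean(scores):.2f}  stdev={statistics.stdev(scores):.2f}")
print(f"min={min(scores):.2f}  max={max(scores):.2f}")
```

Even at temperature 0, the spread between the minimum and maximum in a run like this is exactly the unreliability the rest of this section is about.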
This measurement unreliability cripples your ability to understand and improve your application's performance. In the example above, if you happened to get a score of 0.4 from the LLM judge, you would have no way of knowing that the mean score was actually 0.61 without taking more samples, which quickly gets expensive. In continuous integration pipelines, you cannot tell whether a failing test indicates a genuine regression in your application or simply reflects the judge model's inconsistency, leaving you chasing false alarms while real degradations go undetected. When running A/B tests to compare different versions of your application, the evaluation noise can swamp the differences you are trying to detect, so you never learn which version actually works better for your users.
Production monitoring is equally compromised. When evaluation metrics exhibit random variance, threshold-based alerting loses its discriminative power, producing both false negatives (missed critical failures) and false positive alerts. Trend analysis is swamped by measurement noise, preventing reliable detection of genuine changes in system behaviour over time. And quality assurance breaks down when the same system output can receive both high and low quality scores depending purely on run-to-run variation in the evaluation process.
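To make the alerting failure mode concrete, here is a toy simulation; every number in it (true quality levels, noise magnitude, threshold) is hypothetical and chosen purely for illustration. A healthy system sitting near the alert threshold trips false alarms from judge noise alone, while the same noise lets a genuinely degraded system slip past the threshold.

```python
import random

random.seed(0)

# All numbers below are hypothetical, chosen only to illustrate the failure mode.
THRESHOLD = 0.70     # alert if the judged quality drops below this
JUDGE_NOISE = 0.15   # +/- spread attributed to judge inconsistency
RUNS = 1_000

def observed(true_quality: float) -> float:
    """True quality plus uniform judge noise."""
    return true_quality + random.uniform(-JUDGE_NOISE, JUDGE_NOISE)

healthy, degraded = 0.75, 0.65
false_alarms = sum(observed(healthy) < THRESHOLD for _ in range(RUNS))
missed_regressions = sum(observed(degraded) >= THRESHOLD for _ in range(RUNS))

print(f"False alarms on a healthy system:   {false_alarms}/{RUNS}")
print(f"Missed alerts on a degraded system: {missed_regressions}/{RUNS}")
```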
The core issue is that LLM-as-a-judge transforms what should be a reliable measurement system into a source of uncertainty that obscures your application's true performance. Without trustworthy evaluation, you're essentially flying blind - unable to test effectively, monitor confidently, or optimize meaningfully. You may think you're rigorously evaluating your AI application, but you're actually just measuring the inconsistency of your evaluation system itself.