Evaluating Clinical AI: A Practical Guide
Clinical AI is producing specific, documented harms in production today. Generic evaluation tools do not catch those harms. This is a practical guide to evaluating clinical AI properly.
The landscape has changed in the last 12 months. ECRI named AI chatbot misuse the #1 health technology hazard - the first time an AI application has topped their annual list. Specific lawsuits are now working through US courts. Published research shows failure rates far above what most teams believed. Clinical AI is no longer a future-risk conversation; it is a current-harm conversation.
If you are building or deploying clinical AI, this post covers what you actually need to evaluate, why standard evaluation methods fall short, and what a defensible quality infrastructure looks like today.
The failure rate is higher than the marketing says
Three published findings from the past 12 months have reframed the clinical AI quality conversation:
Mayo Clinic Proceedings. A systematic study of ambient AI scribes found a mean 26.3% error rate across recorded patient encounters, with a mean of 13.9 errors per transcript and 3.0 errors per case at moderate-to-severe harm potential.
JAMA Network Open (Harvard-led). Leading AI models, when evaluated against a standardised clinical reasoning benchmark, failed to produce appropriate differential diagnoses more than 80% of the time.
ECRI, Top 10 Health Technology Hazards. AI chatbot misuse ranked #1, the first time an AI application has topped ECRI’s annual list.
These are not fringe findings. They are the current published consensus on what clinical AI is doing in production.
The lawsuits have started
The legal and regulatory environment has moved:
Sharp HealthCare (November 2025). Sued over AI scribes that auto-inserted fabricated consent statements into patient records - claiming patients “were advised” and “consented” when they had not been. 100,000+ encounters affected. The AI vendor named in the filing was Abridge.
Sutter Health and MemorialCare. A class action has since followed, citing violations of California's CIPA and CMIA and of the federal Wiretap Act. Same vendor, same pattern.
The specific mechanism in both cases was not exotic. It was the AI scribe adding language that was not said by either the clinician or the patient - a failure mode a good clinical AI evaluation should catch.
Why generic evaluation fails in clinical contexts
Most off-the-shelf LLM evaluation does one of three things:
- Text similarity to a reference. BLEU, ROUGE, embedding similarity. Useless for clinical AI: a note can be semantically similar to a reference and still be clinically wrong.
- Generic hallucination checks. Looking for claims not grounded in the input. Partially useful, but blind to the specific ways clinical hallucination manifests (adding medications, inferring diagnoses, fabricating consent).
- LLM-as-judge with a generic quality prompt. “Rate this clinical note 1 to 5 on quality.” Results are noisy, miss specifics, and plateau around 72% accuracy on benchmarks.
None of these reliably catch a hallucinated medication, an omitted differential, or a fabricated consent statement. That is the entire problem.
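To make the text-similarity point concrete, here is a minimal sketch. It uses a unigram-overlap F1, which is a simplified stand-in for ROUGE-1 rather than the official metric, and the two notes are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented notes: the candidate differs from the reference by one token, the dose.
reference_note = "Start lisinopril 5 mg once daily for hypertension"
candidate_note = "Start lisinopril 50 mg once daily for hypertension"

score = unigram_f1(reference_note, candidate_note)
print(f"unigram F1 = {score:.3f}")
```

Seven of the eight tokens match, so the score is 0.875 even though the documented dose is ten times the reference dose. A similarity threshold would pass this note.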
What clinical AI evaluation actually needs to catch
Based on the failure patterns we have seen across clinical AI deployments, these are the failure categories any clinical evaluation system should cover:
1. Hallucinated medications
The AI documents a prescription, dosage, or medication class that was not discussed in the encounter. This is the Sharp/Sutter failure at the medication level. Detection requires comparing the note against the source transcript at medication granularity, not whole-note similarity.
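A medication-level comparison can be sketched as a set difference between what the note documents and what the transcript mentions. The lexicon, function names, and sample texts below are hypothetical; a production system would more likely use a maintained drug vocabulary or a clinical entity-recognition model than a hard-coded set.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
MEDICATION_LEXICON = {"lisinopril", "metformin", "amoxicillin", "ibuprofen"}


def medications_mentioned(text: str) -> set[str]:
    """Return lexicon medications that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return MEDICATION_LEXICON & words


def hallucinated_medications(transcript: str, note: str) -> set[str]:
    """Medications documented in the note but never mentioned in the transcript."""
    return medications_mentioned(note) - medications_mentioned(transcript)


# Invented encounter: ibuprofen appears in the note but was never discussed.
transcript = "Doctor: Keep taking the metformin. Patient: OK, and my knee still hurts."
note = "Continue metformin. Started ibuprofen for knee pain."

print(hallucinated_medications(transcript, note))
```

The check works per medication, so one added drug is surfaced even when the rest of the note matches the encounter closely.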
2. Omitted red-flag findings
The patient mentioned a symptom with diagnostic importance. The note did not record it. This is where omission failures create real clinical harm - the downstream clinician reading the note has no signal that the symptom exists.
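An omission check runs in the opposite direction from a hallucination check: it starts from the transcript and asks whether the note carried the finding forward. The sketch below assumes a hypothetical list of red-flag phrases; in practice that list would be defined and signed off by clinicians.

```python
# Hypothetical red-flag phrases, for illustration only.
RED_FLAG_PHRASES = ["chest pain", "blood in stool", "worst headache", "weight loss"]


def omitted_red_flags(transcript: str, note: str) -> list[str]:
    """Red-flag phrases present in the transcript but absent from the note."""
    transcript_lc = transcript.lower()
    note_lc = note.lower()
    return [p for p in RED_FLAG_PHRASES if p in transcript_lc and p not in note_lc]


# Invented encounter: the patient mentions chest pain, the note does not.
transcript = "Patient: I've had a cough for a week, and some chest pain when I climb stairs."
note = "One-week history of cough. No other complaints."

print(omitted_red_flags(transcript, note))
```

Exact phrase matching will miss paraphrases such as "tightness in my chest", so treat this as an illustration of the comparison direction rather than a usable detector.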
3. Diagnostic leaps
The note states a diagnosis or narrows a differential without the documented reasoning a clinician would need to verify. The AI “decided” something the human did not.
4. Unsupported inferences
The patient mentioned headaches for two weeks. The note records “chronic migraines.” The AI turned a symptom description into a named diagnosis without clinical justification.
5. Dosage and route errors
Wrong strength, wrong frequency, wrong administration route. This is one of the highest-severity failure categories for any clinical AI with prescribing involvement.
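One way to check this category is to parse each order into fields and compare field by field, so the evaluation reports which part of the order is wrong. The pattern below is an assumption that only handles orders written like "amoxicillin 500 mg oral three times daily"; real notes need a far more tolerant parser, and the example order is invented.

```python
import re
from typing import NamedTuple, Optional


class Order(NamedTuple):
    drug: str
    strength: str
    route: str
    frequency: str


# Assumed, deliberately narrow pattern for illustration.
ORDER_PATTERN = re.compile(
    r"(?P<drug>[a-z]+)\s+(?P<strength>\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml))\s+"
    r"(?P<route>oral|iv|im|topical)\s+"
    r"(?P<frequency>once daily|twice daily|three times daily)",
    re.IGNORECASE,
)


def parse_order(text: str) -> Optional[Order]:
    """Extract the first order in the text, lowercased, or None if nothing matches."""
    match = ORDER_PATTERN.search(text)
    if match is None:
        return None
    return Order(*(match.group(f).lower() for f in Order._fields))


def order_mismatches(source: str, note: str) -> list[str]:
    """Names of the fields where the note's order differs from the source's order."""
    src, doc = parse_order(source), parse_order(note)
    if src is None or doc is None:
        return ["unparsed"]  # Surface parse failures instead of silently passing.
    return [f for f in Order._fields if getattr(src, f) != getattr(doc, f)]


source_order = "amoxicillin 500 mg oral three times daily"
note_order = "Amoxicillin 500 mg IV twice daily"
print(order_mismatches(source_order, note_order))
```

Here the drug and strength agree while the route and frequency do not, so the result names exactly those two fields. Returning "unparsed" when either side fails to parse is a design choice: in a high-severity category, an order the checker cannot read should be reviewed, not waved through.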
6. Contraindication misses
The patient history contains something that should flag a contraindication to a recommended medication. The AI either does not flag it or actively recommends something that should not be given.
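At its simplest, a contraindication check is a lookup from history items to medications that should not be recommended alongside them. The table below holds two illustrative pairs only and is not a clinical reference; the names and structure are assumptions, and a real table would come from a maintained drug-safety source reviewed by clinicians.

```python
# Illustrative pairs only, not a clinical reference.
CONTRAINDICATIONS = {
    "penicillin allergy": {"amoxicillin"},
    "pregnancy": {"isotretinoin"},
}


def contraindication_misses(
    history: list[str], recommended: list[str]
) -> list[tuple[str, str]]:
    """Return (history item, medication) pairs that should have been flagged."""
    misses = []
    for item in history:
        blocked = CONTRAINDICATIONS.get(item.lower(), set())
        for med in recommended:
            if med.lower() in blocked:
                misses.append((item, med))
    return misses


# Invented case: the history lists a penicillin allergy, the AI recommends amoxicillin.
print(contraindication_misses(["Penicillin allergy", "asthma"], ["Amoxicillin"]))
```

The evaluation then checks whether the AI output either avoided the medication or explicitly flagged the conflict.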
7. Fabricated consent or procedure statements
The Sharp/Sutter failure pattern. The AI adds language that asserts consent, advising, or patient agreement that was not actually said.
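A rough sketch of a check for this pattern: find sentences in the note that assert consent, then confirm the transcript contains any consent discussion at all. Both phrase lists are assumptions made for illustration (the "were advised" and "consented" wording echoes the Sharp filing described above), and the encounter is invented.

```python
import re

# Assumed phrasings of a consent assertion in a note.
CONSENT_ASSERTION = re.compile(
    r"\b(was advised|were advised|consented|verbal consent)\b", re.IGNORECASE
)
# Hypothetical cues that consent was actually discussed in the encounter.
CONSENT_EVIDENCE = re.compile(
    r"\b(consent|recording|is it ok|do you agree)\b", re.IGNORECASE
)


def fabricated_consent_statements(transcript: str, note: str) -> list[str]:
    """Note sentences asserting consent when the transcript shows no consent discussion."""
    if CONSENT_EVIDENCE.search(transcript):
        return []
    sentences = re.split(r"(?<=[.!?])\s+", note)
    return [s for s in sentences if CONSENT_ASSERTION.search(s)]


transcript = "Doctor: What brings you in today? Patient: My ankle has been swollen since Saturday."
note = (
    "Patient was advised that the visit is being recorded and consented. "
    "Left ankle swelling since Saturday."
)

print(fabricated_consent_statements(transcript, note))
```

Only the first sentence of the note is returned. Keyword cues are crude evidence, so a flagged sentence should go to human review, not be treated as proof of fabrication.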
For any clinical AI deployed in production, your evaluation system should at minimum be able to catch instances of each of these. Most cannot.
The accuracy bar for clinical evaluation
A clinical evaluation system has to be accurate enough that false positives do not drown out real signals, and sensitive enough that real clinical failures do not slip past.
In practice that means:
- Above 80% accuracy on domain-specific failure classification. Below that, the false-positive rate makes the system unusable; clinicians stop trusting it.
- Sensitivity tuned by failure category. A missed medication has higher cost than a missed stylistic issue. Thresholds should reflect that.
- Sub-second latency for runtime guardrails. A guardrail that takes 10 seconds is not a clinical-AI guardrail. At-the-encounter checks have to be fast.
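Category-specific sensitivity can be expressed as a per-category flagging threshold. The values, category names, and the 0-to-1 failure score in this sketch are all hypothetical; real thresholds would be set from clinician-labelled data.

```python
# Hypothetical thresholds: a lower value flags more aggressively (higher sensitivity).
CATEGORY_THRESHOLDS = {
    "hallucinated_medication": 0.20,
    "dosage_route_error": 0.20,
    "fabricated_consent": 0.30,
    "stylistic_issue": 0.80,
}
DEFAULT_THRESHOLD = 0.50


def should_flag(category: str, failure_score: float) -> bool:
    """Flag when the evaluator's failure score (0 to 1) meets the category threshold."""
    threshold = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    return failure_score >= threshold


# The same score is flagged for a medication issue but not for a stylistic one.
print(should_flag("hallucinated_medication", 0.35))
print(should_flag("stylistic_issue", 0.35))
```

With these values, the same evaluator score of 0.35 raises a flag for a possible hallucinated medication and is ignored for a stylistic issue, which is the asymmetry described in the list above.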
Standard LLM-as-judge methods plateau below this bar. Reaching it requires techniques like criteria ensembling, variance-informed calibration, and reward-model-based evaluation. The operational bar that matters in production is alignment with human domain experts - Composo reaches 90%+ alignment with human experts in most clinical contexts, which is what makes domain-specific failure detection reliable.
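Alignment with human experts can be measured by scoring the evaluator's verdicts against clinician labels on the same cases. The sketch below is one common way to do that, using raw agreement, sensitivity, false-positive rate, and Cohen's kappa. It is not a description of how Composo computes its figure, and the labels are invented.

```python
def alignment_metrics(expert: list[bool], evaluator: list[bool]) -> dict[str, float]:
    """Compare evaluator verdicts with clinician labels (True = failure present)."""
    if len(expert) != len(evaluator) or not expert:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(expert, evaluator))
    tp = sum(e and m for e, m in pairs)
    tn = sum(not e and not m for e, m in pairs)
    fp = sum(not e and m for e, m in pairs)
    fn = sum(e and not m for e, m in pairs)
    n = len(pairs)
    agreement = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (agreement - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "agreement": agreement,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "cohens_kappa": kappa,
    }


# Invented labels for ten cases: three true failures, seven clean notes.
expert_labels = [True, True, True, False, False, False, False, False, False, False]
evaluator_labels = [True, True, False, False, False, False, False, False, False, True]

metrics = alignment_metrics(expert_labels, evaluator_labels)
print(metrics)
```

In this toy sample raw agreement is 0.8, but one of the three real failures is missed and kappa is about 0.52. Failures are usually rare, so raw agreement on its own can overstate how well an evaluator tracks clinicians.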
What the regulatory and procurement environment now demands
Healthcare AI vendors selling into hospitals, payers, or clinical groups are now facing procurement requirements that were uncommon two years ago:
- A documented failure taxonomy. What specifically does the AI fail on? At what rate?
- Audit-ready evaluation logs. Every evaluation decision with trace, criteria, rationale.
- Clinician sign-off on evaluation criteria. Not engineer-written criteria; criteria reviewed by a clinical SME.
- SOC 2 Type II and BAA. Non-negotiable at most health system procurement.
- A drift-handling plan. Models change. Evaluation has to change with them.
- An independent evaluation layer. Vendors self-attesting to quality is no longer sufficient at enterprise healthcare procurement.
If you are building clinical AI, you need these. If you are buying or validating clinical AI, you should be demanding them.
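As an illustration of what "every evaluation decision with trace, criteria, rationale" might look like in practice, here is a minimal sketch of one audit record written as a JSON line. The field names and sample values are hypothetical and do not follow any regulatory schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EvaluationRecord:
    trace_id: str            # Identifier of the AI output that was evaluated.
    criterion: str           # The failure criterion that was applied.
    criteria_version: str    # Which clinician-approved version of the criteria.
    verdict: str             # For example "pass" or "fail".
    rationale: str           # Why the evaluator reached that verdict.
    reviewer_correction: Optional[str] = None  # Filled in if a clinician overrides.
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: EvaluationRecord) -> str:
    """Serialize one evaluation decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = EvaluationRecord(
    trace_id="trace-0001",
    criterion="hallucinated_medication",
    criteria_version="v3",
    verdict="fail",
    rationale="Note documents ibuprofen; no medication change appears in the transcript.",
)
line = to_log_line(record)
print(line)
```

Recording the criteria version next to each verdict lets an auditor reconstruct which clinician-approved definition was in force when the decision was made.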
What Composo deploys for clinical AI
Composo deploys a clinical-AI evaluation layer in 2 to 4 weeks. What that includes:
- Clinical failure taxonomy. Week 1: we pull a representative sample of production traces and surface the specific failure modes occurring in the specific AI system. This gets reviewed and signed off by the customer’s clinical team.
- Calibrated evaluation. Weeks 2 to 3: evaluation criteria written for each failure mode, calibrated against clinician-labelled examples. The evaluation model learns the domain’s specific definition of “wrong”.
- Runtime guardrails. Week 4: for customers who need them, the evaluation model runs inline at inference time, blocking or flagging outputs that fail.
- Ongoing operation. Clinician corrections feed back into calibration. Drift is tracked. The evaluation model gets better over time.
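Drift tracking can start as a comparison of the flagged-failure rate between a baseline window and the current window. The sketch below uses a standard two-proportion z-statistic. The counts and the 1.96 alert threshold are assumptions for illustration, and this is not a description of Composo's drift method.

```python
from math import sqrt


def failure_rate_drift(
    baseline_failures: int,
    baseline_total: int,
    current_failures: int,
    current_total: int,
) -> float:
    """Two-proportion z-statistic for the change in failure rate (positive = worse)."""
    p_baseline = baseline_failures / baseline_total
    p_current = current_failures / current_total
    pooled = (baseline_failures + current_failures) / (baseline_total + current_total)
    std_err = sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total))
    return (p_current - p_baseline) / std_err if std_err > 0 else 0.0


# Invented counts: 40 failures in 1,000 baseline notes, 70 in 1,000 current notes.
z = failure_rate_drift(40, 1000, 70, 1000)
ALERT_THRESHOLD = 1.96  # Assumed cut-off, roughly a 5% two-sided significance level.
print(f"z = {z:.2f}, alert = {z > ALERT_THRESHOLD}")
```

With these counts the statistic is a little under 3, which clears the assumed threshold and would prompt a review of recent model or prompt changes.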
Composo is SOC 2 Type II, signs BAAs, offers EU data residency, and keeps complete audit trails suitable for model risk management and regulatory review.
The shortest version of this post
The evidence is clear: clinical AI is producing specific, documented harms in production. Generic evaluation does not catch them. A defensible clinical AI quality infrastructure needs a clinician-signed failure taxonomy, evaluation criteria that reach 90%+ alignment with human domain experts, and audit trails sufficient for procurement and regulatory review.
For a failure report on your clinical AI delivered in under a week, book a diagnostic. For more on clinical failure modes specifically, see Clinical AI Failure Modes.
Frequently asked questions
What makes clinical AI evaluation different from general LLM evaluation?
Clinical AI failures are specific and consequential: a hallucinated medication, an omitted red-flag symptom, a fabricated diagnosis. Generic text-similarity or hallucination checks do not catch these. Clinical evaluation requires failure criteria defined by clinicians, calibrated to what “wrong” means in a clinical context.
What failure modes should clinical AI evaluation cover?
At minimum: hallucinated medications (prescriptions not discussed), omitted findings (red-flag symptoms the patient mentioned that were not documented), diagnostic leaps (conclusions without supporting reasoning), unsupported inferences (assertions not present in source material), dosage and route errors, contraindication misses, and fabricated consent or procedure statements.
Are there published benchmarks for clinical AI failure rates?
Yes. A Mayo Clinic Proceedings study on ambient AI scribes documented a mean 26.3% error rate, with 3.0 errors per case at moderate-to-severe harm potential. A Harvard-led study in JAMA Network Open found leading AI models fail to produce appropriate differential diagnoses more than 80% of the time.
Why has clinical AI been named a top health technology hazard by ECRI?
ECRI named AI chatbot misuse the top health technology hazard - the first time an AI application has topped their annual list. The rationale is real-world harm: specific documented cases of AI scribes inserting fabricated consent statements, missed differential diagnoses, and medication errors affecting patient records at scale.
What does regulatory-ready clinical AI evaluation look like?
Complete audit trails for every evaluation (trace, criteria, rationale, reviewer corrections). SOC 2 Type II compliance. BAA capability for HIPAA-regulated data. A failure taxonomy signed off by clinical reviewers. Documented calibration methodology. These support internal model risk management and external audit.