
What We Found Inside Clinical AI Systems That Were Passing Every Eval

Seb Fox

Findings from clinical AI engagements at Composo


We’ve been evaluating clinical AI systems in production - ambient scribes, clinical decision support, patient-facing content generators. The teams building them are good. Their evals are passing. Their clinical teams are spot-checking samples weekly.

And their systems are failing in ways nobody is catching.

This is what we’ve found. Not theoretical risks - actual failure patterns from production clinical AI, categorised by type, with real examples.


1. Discussions become decisions

This is the most dangerous pattern we see, and the hardest to detect.

A clinician discusses whether antibiotics are warranted. The patient agrees to a delayed prescribing strategy - collect the prescription only if symptoms worsen after 48 hours. The AI-generated note reads: “Start amoxicillin 500mg TDS for 7 days.”

A pharmacist reading that note would dispense immediately. The actual clinical decision was watchful waiting.

In one engagement, we found 19 instances of this across 847 notes. 11 were severity-critical. In one case, a clinician specifically advised a patient to stop ibuprofen due to NSAID-related gastritis. The note recorded ibuprofen as a recommendation.

Standard hallucination detection won’t catch this. The medication name is correct. The dose is correct. The context is correct. What’s wrong is the decision status - “discussed” has become “decided.” An LLM-as-judge checking faithfulness to the transcript passes it, because the information is there. The transformation from discussion to directive is invisible to it.

Faithfulness to content passed. Faithfulness to intent failed.

This distinction matters. Most evaluation approaches test whether the note contains the right information. The failure here isn’t missing information - it’s the wrong clinical status attached to correct information.
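
To make that distinction testable, decision status has to be evaluated as its own dimension, separate from content faithfulness. Below is a minimal sketch, assuming a generic `llm` callable that returns JSON and hypothetical helper names - illustrative only, not our production check:

```python
import json

STATUS_PROMPT = """For each medication mentioned in the text below, return a JSON list
of objects with the fields "medication" and "status", where status is one of:
"prescribed", "delayed_prescription", "discussed_only", "stopped".

Text:
{text}
"""

def extract_medication_statuses(llm, text):
    """Ask a judge model to label the decision status of every medication mention."""
    raw = llm(STATUS_PROMPT.format(text=text))
    return {item["medication"].lower(): item["status"] for item in json.loads(raw)}

def decision_status_mismatches(llm, transcript, note):
    """Return medications whose status in the note differs from the transcript.

    A note can be faithful on content (drug, dose, duration) and still fail
    here - e.g. the transcript supports "delayed_prescription" while the note
    reads as "prescribed".
    """
    source = extract_medication_statuses(llm, transcript)
    output = extract_medication_statuses(llm, note)
    return [
        {"medication": med, "transcript": source[med], "note": status}
        for med, status in output.items()
        if med in source and source[med] != status
    ]
```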


2. Dangerous omissions hiding behind complete-looking notes

A patient mentions 8kg of unintentional weight loss over 3 weeks. This is a red flag finding - it requires urgent investigation to exclude malignancy. The note attributes the presentation to “work stress and poor sleep hygiene.”

The note looks complete. It has paragraphs, structure, clinical language. Without reading the source alongside the output, the gap is invisible. A covering clinician has no reason to expedite investigations.

We found 34 omitted findings in the same 847-note analysis. Not random omissions - systematic ones. The AI drops information when it has low confidence about where to place it in structured notes. For notes with 50+ sections (common in European primary care templates), the omission rate increases in sections where the AI is uncertain about categorisation.

The clinical risk isn’t about completeness. It’s about whether the omission breaks the safety net. Missing that the patient mentioned their daughter’s wedding is different from missing exertional chest tightness. Omission severity needs to be weighted by clinical safety-netting impact, not just information completeness.
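
A minimal sketch of what that weighting could look like - the red-flag phrases and weights below are illustrative placeholders, not a clinical taxonomy, and a real system would not rely on keyword matching:

```python
# Placeholder red-flag phrases and weights for illustration only.
RED_FLAG_WEIGHTS = {
    "unintentional weight loss": 10,
    "exertional chest tightness": 10,
    "haemoptysis": 10,
}
DEFAULT_WEIGHT = 1  # e.g. social details with no safety-netting impact

def omission_severity(omitted_findings):
    """Score findings present in the transcript but absent from the note,
    weighted by safety-netting impact rather than raw completeness."""
    scored = []
    for finding in omitted_findings:
        weight = next(
            (w for phrase, w in RED_FLAG_WEIGHTS.items() if phrase in finding.lower()),
            DEFAULT_WEIGHT,
        )
        scored.append({"finding": finding, "weight": weight})
    return sorted(scored, key=lambda item: item["weight"], reverse=True)
```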

And here’s the pattern that makes omission detection genuinely hard: the AI documents the ECG and troponin results but omits the symptom description that prompted the tests. The investigation results are there, but the clinical indication is missing. A covering clinician sees normal results with no context for why they were done. The follow-up plan makes no sense without the symptom.


3. Steroid tapering errors and dosage truncation

The AI consistently truncated steroid tapering regimens to flat courses followed by abrupt stops. One note read "prednisolone 40mg for 5 days then stop" when the clinician had actually prescribed a reducing course over two weeks. Abrupt cessation of steroids risks adrenal crisis.

We found 12 wrong dosages in the same engagement. The pattern was consistent - the AI captured the starting dose and duration correctly but dropped the tapering schedule. A GP reviewing the note would see a reasonable-looking prescription with a dangerous omission.

This pattern extends beyond steroids. Any medication with a complex dosing schedule - loading doses, titration protocols, stepped reductions - is at risk. The AI defaults to the simplest representation of the prescribing decision.
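
One cheap structural check is to compare the number of dose steps documented in the transcript against the note. The `(dose_mg, days)` step representation and the `tapering_truncated` helper below are assumptions for illustration, not how our extraction works:

```python
def tapering_truncated(transcript_steps, note_steps):
    """True when the transcript documents a multi-step dosing schedule but the
    note records fewer steps - e.g. a two-week reducing course collapsed to a
    flat 5-day course with an abrupt stop."""
    return len(transcript_steps) > 1 and len(note_steps) < len(transcript_steps)

# Hypothetical example: a prednisolone taper captured from the transcript as
# (dose_mg, days) steps versus the flat course recorded in the note.
transcript_steps = [(40, 5), (30, 3), (20, 3), (10, 3)]
note_steps = [(40, 5)]
assert tapering_truncated(transcript_steps, note_steps)
```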


4. Diagnostic leaps across sections

The note jumps from symptoms to a specific diagnosis without documenting the supporting examination findings. “Cough and fever” becomes “community-acquired pneumonia” - but the chest is clear on auscultation and oxygen saturations are normal. The diagnosis may be correct, but the note doesn’t contain the evidence to support it.

This matters for three reasons. First, medicolegal - the documented diagnosis has no documented basis. Second, referral quality - the receiving clinician expects examination findings that aren’t there. Third, it creates a training problem - junior clinicians reviewing AI-generated notes learn that diagnostic shortcuts are acceptable documentation.

In structured notes where information from the consultation is spread across multiple sections, diagnostic leaps become more common. The AI synthesises a conclusion from scattered inputs without documenting the chain of reasoning. The conclusion might be right. The documentation doesn’t show why.
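
A hedged sketch of a support check: ask a judge model whether the documented findings actually back the stated diagnosis. The prompt wording and the yes/no parsing are assumptions, not a production implementation:

```python
SUPPORT_PROMPT = """Note sections:
Examination: {examination}
Diagnosis: {diagnosis}

Does the examination section contain positive findings that support the stated
diagnosis? Answer "yes" or "no" on the first line, then one sentence of reasoning.
"""

def diagnosis_unsupported(llm, examination, diagnosis):
    """Flag notes that state a diagnosis without documenting its basis -
    e.g. "community-acquired pneumonia" alongside a clear chest and normal
    oxygen saturations."""
    answer = llm(SUPPORT_PROMPT.format(examination=examination, diagnosis=diagnosis))
    return answer.strip().lower().startswith("no")
```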


5. The eval gap underneath all of this

In every clinical AI engagement we’ve run, the team’s existing evaluation was giving them a clean bill of health. Eval pass rates were healthy. Spot checks weren’t surfacing anything alarming. Teams were planning expansions - new markets, new specialties, broader coverage.

The gap isn’t between good evals and bad evals. It’s between what evals measure and what actually matters in clinical practice. Standard faithfulness checks test whether the information in the note matches the source. They don’t test whether a discussion has been transformed into a directive. They don’t test whether a missing red flag symptom changes the safety net for a covering clinician. They don’t test whether a tapering schedule has been truncated to a flat course.

The 15% failure rate we found in one engagement wasn’t the scary part. The scary part was that the existing evals were showing 0% failures while 23 severity-critical findings went through to the clinical record every two weeks.


What we’ve learned about fixing this

Three things from running these engagements:

Clinician corrections compound. 50 corrections is the typical turning point. Each correction doesn't just fix one case - it improves detection for every similar case. A correction noting that the model confused elective and emergency surgery improves future scoring on all surgical classification traces. By month 2, the system catches failure types it missed in month 1.

Fix, don’t flag. One clinical team told us something that changed how we think about guardrails: if you start showing errors inline in notes, clinicians anchor on the flags rather than reading the clinical content. They stop looking for the fourth error after you’ve flagged three. The guardrail system should fix the output silently or route it for human review - never present an error inline and expect the clinician to catch it.

Cheap guardrails for known patterns, expensive evaluation for discovery. You don’t need to evaluate every note with a frontier model. Run targeted, fast guardrails on everything to catch the known high-severity patterns (hallucinated prescribing decisions, omitted red flags, dosage truncations). Run the comprehensive evaluation on a sample to discover new failure patterns. That gets you broad coverage at a fraction of the cost of evaluating 100% of notes at full depth.
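
As a sketch of how that routing could be wired together - the guardrail registry, the 5% sample rate, and the function names are all illustrative assumptions, not our production pipeline:

```python
import random

# Each guardrail is a cheap, targeted check (like the sketches above) that
# returns a list of findings for one note.
GUARDRAILS = []  # e.g. [("decision_status", decision_status_check), ...]
SAMPLE_RATE = 0.05  # fraction of clean notes that also get full-depth evaluation

def route_note(transcript, note, comprehensive_eval):
    findings = []
    for name, check in GUARDRAILS:
        findings.extend({"guardrail": name, **f} for f in check(transcript, note))
    if findings:
        # Known high-severity patterns: fix silently or route for human review,
        # never surface inline and hope the clinician catches it.
        return {"action": "hold_for_review", "findings": findings}
    if random.random() < SAMPLE_RATE:
        # Expensive frontier-model evaluation, used to discover new failure
        # patterns rather than to block individual notes.
        comprehensive_eval(transcript, note)
    return {"action": "release", "findings": []}
```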


The regulatory dimension

This isn’t just a quality problem any more. The EU MDR and AI Act now overlap for clinical AI. Ambient scribes and clinical documentation AI are increasingly classified as medical devices, and the AI Act classifies them as high-risk. That means two regulatory frameworks, each with its own post-market surveillance requirements.

By August 2026, providers and deployers of high-risk AI systems must have continuous monitoring programs in place with real-world performance tracking and strict incident reporting timeframes.

The regulator’s question isn’t “what’s your accuracy?” It’s “what’s your failure taxonomy, what are your detection rates per category, and what’s your process for improvement over time?” A benchmark number - 98% accuracy - is one data point. A structured failure ontology with severity classifications, detection rates, and trend data over time is a regulatory narrative.

The failure taxonomy becomes the regulatory asset. If you can show regulators: here are the 6 categories of failure in our clinical notes, here’s how we detect each one, here’s the detection rate per category, and here’s the improvement curve over 6 months - that’s a fundamentally stronger submission than a single accuracy figure.
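
A sketch of what such a taxonomy might look like as a data structure - the field names are illustrative, the placeholder values are not measured results, and nothing here is a mandated schema:

```python
from dataclasses import dataclass, field

@dataclass
class FailureCategory:
    name: str                 # e.g. "decision_status_change"
    severity: str             # "critical" / "major" / "minor"
    detection_method: str     # which guardrail or evaluation catches it
    detection_rate: float     # measured against clinician-labelled traces
    incidence_by_month: dict = field(default_factory=dict)  # trend data

# Placeholder entries - detection rates and counts come from your own
# labelled production traces, not from this sketch.
taxonomy = [
    FailureCategory("decision_status_change", "critical", "decision-status guardrail", 0.0),
    FailureCategory("omitted_red_flag", "critical", "omission severity check", 0.0),
]
```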


These findings are from production clinical AI systems. The teams building them are talented. The failures aren’t from negligence - they’re from evaluating the wrong dimensions. If you’re shipping clinical AI and want to know what your evals are missing, we can run a diagnostic on your production traces in under a week.

Our evaluation techniques are backed by an ablation study on RewardBench 2 (1,753 examples) - vanilla LLM-as-judge: 72.1%, our combined techniques: 85.4%. Full study on GitHub.