Healthcare: The Notes That Looked Fine
Vertical: Clinical AI / Ambient Scribe
Stage: Series B, European expansion
What we analysed: 847 AI-generated clinical notes from a primary care ambient scribe system, 14 days of production output
A Series B clinical AI company was generating consultation notes from doctor-patient conversations. Their ambient scribe had been in production for months. Internal evals were passing. The clinical team reviewed a sample each week and things looked good.
We analysed 847 notes. Found 127 failures across 6 categories. 23 were severity-critical - the kind that could directly alter a clinical decision.
What they thought was happening
The system was working. Eval pass rates were healthy. Weekly spot checks by their clinical team surfaced the occasional formatting issue but nothing alarming. They were preparing to expand into three new European markets.
What was actually happening
The AI was converting medication discussions into medication decisions. A clinician discusses whether antibiotics are warranted, and the patient agrees to a delayed prescribing strategy: collect the prescription only if symptoms worsen after 48 hours. The note reads: “Start amoxicillin 500mg TDS for 7 days.” A pharmacist reading that note would dispense immediately.
19 hallucinated medication entries. 11 severity-critical. In one case, a clinician specifically advised a patient to stop ibuprofen due to NSAID-related gastritis. The note recorded ibuprofen as a recommendation.
34 omitted findings. A patient mentioned 8kg of unintentional weight loss over 3 weeks - a red flag requiring urgent investigation to exclude malignancy. The note attributed the presentation to “work stress and poor sleep hygiene.” A covering clinician would have no reason to expedite investigations.
12 wrong dosages. The AI consistently truncated steroid tapering regimens to flat courses followed by abrupt stops. One note read “Prednisolone 40mg for 5 days then stop” when the clinician had prescribed a reducing course over two weeks. Abrupt cessation risks adrenal crisis.
Their LLM-as-judge wasn’t catching any of this. The judge’s prompt checked: “Does the note accurately capture the key information discussed in the consultation?” The answer was yes - because the information was discussed. What the judge didn’t check was whether a discussion had been transformed into a directive. The eval was testing faithfulness to content. The failure was in faithfulness to intent.
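The gap between the two checks can be sketched as a decision-status comparison. Everything below - the keyword cues, the function names - is an illustrative assumption, not the company's actual judge; a production system would prompt an LLM to label decision status rather than match keywords:

```python
import re

# Hypothetical keyword heuristics standing in for an LLM judge that labels
# whether a medication mention is an instruction or merely a discussion.
DIRECTIVE_CUES = re.compile(r"\b(start|take|dispense|prescribed?)\b", re.I)
CONDITIONAL_CUES = re.compile(r"\b(if|delayed|consider|discussed|only if)\b", re.I)

def decision_status(sentence: str) -> str:
    """Label a medication mention as a 'directive' or a 'discussion'."""
    if CONDITIONAL_CUES.search(sentence):
        return "discussion"
    if DIRECTIVE_CUES.search(sentence):
        return "directive"
    return "discussion"

def flag_upgraded_decision(transcript_mention: str, note_mention: str) -> bool:
    """Flag when an option discussed in the transcript becomes an
    instruction in the note - the failure the original judge missed."""
    return (decision_status(transcript_mention) == "discussion"
            and decision_status(note_mention) == "directive")

flag_upgraded_decision(
    "Agreed a delayed prescription - collect only if symptoms worsen after 48 hours.",
    "Start amoxicillin 500mg TDS for 7 days.",
)  # flags the mismatch: discussed in the transcript, directive in the note
```

A content-faithfulness judge passes both strings because the drug, dose, and duration were genuinely discussed; only the comparison of decision status catches the upgrade.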
What changed
We configured guardrails for the three highest-risk patterns: hallucinated prescribing decisions, omitted red flag symptoms, and steroid tapering errors. Their clinical team reviewed the 23 critical findings in about 2 hours. Guardrails ran in shadow mode for two weeks, then went live.
During shadow mode, the guardrails initially over-flagged steroid tapering - catching legitimate short courses alongside the genuine truncation errors. Their lead GP spent an afternoon reviewing the false positives and we recalibrated the threshold. That correction also improved detection for the antibiotic pattern because the same decision-status logic applied. This is the part that doesn’t show up in a summary: the first configuration is wrong in predictable ways, and the calibration with the clinical team is what makes it work.
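The shadow-then-live rollout can be sketched roughly like this; the `Guardrail` class and its fields are hypothetical, not the deployed system:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: in shadow mode a guardrail logs what it would
# have blocked, so thresholds can be calibrated against clinician review
# before any output is actually held back.
@dataclass
class Guardrail:
    name: str
    check: callable          # returns True when the note should be flagged
    live: bool = False       # False = shadow mode: log only, never block
    log: list = field(default_factory=list)

    def apply(self, note: str) -> bool:
        """Return True if the note may pass through to the record."""
        if self.check(note):
            self.log.append(note)   # audit trail in both modes
            return not self.live    # shadow: pass through; live: block
        return True

steroid_rule = Guardrail("steroid-taper", lambda n: "then stop" in n.lower())
steroid_rule.apply("Prednisolone 40mg for 5 days then stop")  # logged, not blocked
steroid_rule.live = True
steroid_rule.apply("Prednisolone 40mg for 5 days then stop")  # now blocked
```

The over-flagging the lead GP found lives in the `check` callable: the two-week shadow log is exactly the dataset you review to tighten it before flipping `live` to True.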
The system catches hallucinated medication decisions before they reach the clinical record. It flags when red flag symptoms appear in the transcript but not the note. It checks laterality consistency - left knee examined, right knee in the plan - across all sections.
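The laterality check in particular is easy to picture as a cross-section comparison. The section names, body-site list, and function below are illustrative assumptions, not the company's implementation:

```python
import re

# Minimal sketch: collect every sided body-site mention per note section,
# then flag any site that appears with different sides in different sections
# (e.g. left knee examined, right knee in the plan).
SIDE_RE = re.compile(r"\b(left|right)\s+(knee|hip|shoulder|ankle|eye|ear)\b", re.I)

def laterality_mismatches(sections: dict) -> list:
    """Return (body_site, section, side) tuples wherever one body site
    is recorded with conflicting sides across sections."""
    seen = {}
    for section_name, text in sections.items():
        for side, site in SIDE_RE.findall(text):
            seen.setdefault(site.lower(), set()).add((section_name, side.lower()))
    mismatches = []
    for site, entries in seen.items():
        if len({side for _, side in entries}) > 1:
            mismatches.extend((site, name, side) for name, side in sorted(entries))
    return mismatches

note = {
    "examination": "Left knee examined: mild effusion, full range of movement.",
    "plan": "Physiotherapy referral for the right knee.",
}
laterality_mismatches(note)  # flags 'knee' in both sections
```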
What they own now
A calibrated evaluation engine running on their infrastructure alongside their existing EHR integration. Guardrails blocking the highest-severity failure patterns. A failure taxonomy with 6 categories specific to their primary care notes. An audit trail of every flagged output and correction - which their clinical governance lead confirmed meets their internal audit requirements.
Their clinical team’s total time: roughly 2 hours reviewing the initial findings, 4 hours calibrating thresholds during shadow mode (spread across the lead GP and two other clinicians), and 4 hours validating the live system in the first week. About 10 hours total over three weeks - and decreasing, because the system learns from each correction.
The 15% failure rate wasn’t the scary part. The scary part was that their existing evals were giving them a clean bill of health while 23 severity-critical failures went through to the clinical record every two weeks.