AI Scribe Failures: The Lawsuits, the Patterns, and What Evaluation Should Catch

Seb Fox · CEO & Co-founder

AI scribes are fabricating consent statements, adding medications that were never discussed, and missing red-flag symptoms. Errors appear in roughly 26% of encounters. The lawsuits have started. Here is what the specific failure patterns look like.

This is not a generic “AI is scary” post. It is a specific catalogue of the documented failures of production AI scribes, with citations to peer-reviewed studies and filed lawsuits, and a practical description of what clinical AI evaluation needs to catch.

If you are deploying, buying, or validating an AI scribe, these are the specific failure categories you need to have opinions about.

The published evidence base

Three published sources frame where the industry actually is:

Mayo Clinic Proceedings

A systematic study of ambient AI scribes across major US vendors, published in Mayo Clinic Proceedings. Key findings:

  • Mean error rate: 26.3% of encounters
  • Mean errors per transcript: 13.9
  • Mean errors per case at moderate-to-severe harm potential: 3.0

This is not a stress test on an edge case. It is an averaged figure across typical clinical encounters using production AI scribe products.

JAMA Network Open (Harvard-led)

A Harvard-led study evaluating leading AI models on clinical reasoning and differential diagnosis generation. Finding: leading AI models failed to produce appropriate differential diagnoses more than 80% of the time when evaluated against standardised clinical cases.

The implication for AI scribes is that the underlying model is not necessarily clinically reliable in the first place. If the generator model is failing diagnostic reasoning 80%+ of the time, every downstream use depends on how well the scribe pipeline constrains the model’s outputs to what was actually said in the encounter.

ECRI Top 10 Health Technology Hazards

ECRI’s annual list ranked AI chatbot misuse at #1 - the first time an AI application has topped the hazards list. The rationale specifically cited real-world patient-safety events from AI documentation and advisory tools.

The lawsuits

US health systems have been sued over AI scribe deployments.

Sharp HealthCare

Sharp HealthCare was sued over its AI scribe deployment. The specific allegation: the AI scribe auto-inserted fabricated consent statements into patient records. Language asserting that patients “were advised” and “consented” appeared in notes covering encounters where neither the clinician nor the patient had said those words.

Scale of alleged impact: more than 100,000 encounters affected.

Sutter Health and MemorialCare

A class action followed against Sutter Health and MemorialCare. Same vendor pattern, same failure mode - fabricated consent and advising statements inserted into patient records without the encounter supporting them.

Legal theories cited: California Invasion of Privacy Act (CIPA), California Confidentiality of Medical Information Act (CMIA), and the Federal Wiretap Act.

These are early cases. More are likely.

The specific failure patterns

Based on the Mayo Clinic study, the lawsuits, and what we see across deployments, these are the specific failure categories appearing in production AI scribes:

Hallucinated medications

The note records a prescription, dosage, or medication class that was not mentioned in the encounter.

Example: A patient discusses chronic pain management with their GP. The note records a new prescription for a specific NSAID at a specific dosage. Neither the drug nor the dosage was said in the consultation.

Omitted red-flag symptoms

The patient mentioned a symptom that carries diagnostic significance. The note did not record it.

Example: A patient with a headache mentions visual disturbance and neck stiffness. The note records “patient presents with headache.” The red flags are gone.
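A minimal sketch of how an omission check for this example could look. The phrase list and function name are hypothetical, and plain substring matching stands in for the clinical language understanding a real evaluator would need; a clinician-authored list per presentation is assumed.

```python
# Hypothetical red-flag phrases for a headache presentation.
# In practice this list would be written and signed off by clinicians.
HEADACHE_RED_FLAGS = ("visual disturbance", "neck stiffness", "worst headache", "fever")


def omitted_red_flags(transcript: str, note: str, red_flags=HEADACHE_RED_FLAGS) -> list[str]:
    """Return red-flag phrases that occur in the transcript but not in the note."""
    t, n = transcript.lower(), note.lower()
    return [flag for flag in red_flags if flag in t and flag not in n]


transcript = (
    "I've had this headache for two days, with some visual disturbance and neck stiffness."
)
note = "Patient presents with headache."

print(omitted_red_flags(transcript, note))
```

Exact phrase matching will miss paraphrases ("my vision goes blurry"), which is one reason the taxonomy and criteria need clinical input rather than a keyword list alone.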

Diagnostic leaps

The note states a diagnosis or narrows a differential that the clinician did not actually reach, or records the conclusion without the supporting reasoning that was discussed.

Example: The clinician is considering a broad differential including GI and cardiac causes. The note records “Diagnosis: GERD” with no supporting reasoning. The differential narrowing did not happen in the conversation.

Unsupported inferences

The patient described a symptom. The AI translated the description into a named diagnosis without clinical justification.

Example: The patient describes intermittent headaches over several weeks. The note records “chronic migraine” as a diagnosis. The patient did not say “migraine.” The clinician did not diagnose one.

Fabricated consent statements

The Sharp/Sutter failure pattern. The note asserts that the patient “was advised,” “consented,” or “agreed to” something that was not said by either party.

Example: The note records “Patient was advised of the risks of the procedure and consented to proceed.” Neither the clinician nor the patient said those words in the encounter.
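One way to screen for this pattern, sketched under stated assumptions: the regular expressions and the transcript cue list below are illustrative, not a validated rule set. The check flags note sentences that assert consent when nothing consent-like appears in the transcript.

```python
import re

# Hypothetical patterns for assertion-style consent language in a note.
CONSENT_ASSERTIONS = [
    re.compile(r"\b(?:was|were) advised\b", re.I),
    re.compile(r"\bconsent(?:ed|s)?\b", re.I),
    re.compile(r"\bagreed to\b", re.I),
]

# Hypothetical cues that a consent conversation actually took place.
TRANSCRIPT_CUES = re.compile(r"\b(?:risks?|consent\w*|agree\w*|okay to proceed)\b", re.I)


def unsupported_consent_sentences(transcript: str, note: str) -> list[str]:
    """Return note sentences asserting consent with no consent-like talk in the transcript."""
    if TRANSCRIPT_CUES.search(transcript):
        return []  # something consent-like was said; leave the judgement to review
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return [s for s in sentences if any(p.search(s) for p in CONSENT_ASSERTIONS)]


transcript = "We'll book the procedure for next week. Any questions about timing?"
note = (
    "Procedure scheduled for next week. "
    "Patient was advised of the risks of the procedure and consented to proceed."
)

for sentence in unsupported_consent_sentences(transcript, note):
    print("UNSUPPORTED:", sentence)
```

The check is deliberately conservative in one direction only: it stays silent whenever the transcript contains any cue, so it would need a stronger evaluator behind it to judge whether what was said actually supports what the note claims.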

Dosage and route errors

Wrong strength, wrong frequency, wrong administration route. High-severity when involving medications with narrow therapeutic windows.

Contraindication misses

The patient’s history includes a known contraindication. The AI does not flag it, or actively recommends the contraindicated intervention.

What generic evaluation misses

Most AI scribe vendors evaluate their outputs using:

  • Text similarity to a reference note (BLEU, ROUGE, embedding similarity)
  • Generic LLM-as-judge with a “rate note quality” prompt
  • Sampling-based human review at a low sampling rate

Each of these misses the specific failure categories above.

Text similarity misses hallucinated content that is “similar” to the source (same style, same length, same section structure) but adds medications or consent statements that are not in the source.
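To make this concrete, here is a small sketch using a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation, and the notes are invented for illustration. A note that reproduces the reference and then adds an unsupported prescription still scores highly.

```python
import re
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified unigram-overlap F1 (ROUGE-1-like, no stemming or stopword handling)."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = (
    "Patient reports chronic lower back pain for six months. "
    "Discussed physiotherapy and sleep hygiene. Follow up in four weeks."
)
# Same note plus a prescription that was never discussed.
hallucinated = reference + " Start naproxen 500 mg twice daily."

print(f"similarity with hallucinated medication: {unigram_f1(reference, hallucinated):.2f}")
```

Recall stays perfect because nothing from the reference is missing, and precision drops only in proportion to the handful of added tokens. The most dangerous sentence in the note barely moves the score.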

Generic LLM-as-judge plateaus around 72% accuracy on benchmarks and does not know what “omitted red-flag symptom” means for a specific specialty.

Human review of a 1-5% sample of encounters, even by clinicians, misses systematic failures: errors occur in roughly 26% of encounters overall, but they cluster in specific case types that a small random sample barely covers.
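The arithmetic behind the sampling problem can be sketched as follows. Every number in the usage example is hypothetical, and the model assumes each encounter is sampled independently at the stated rate.

```python
def sampling_outlook(
    total_encounters: int,
    case_type_share: float,
    failure_rate_in_type: float,
    sampling_rate: float,
) -> tuple[float, float]:
    """Return (expected failing encounters reviewed, P(at least one is reviewed))
    for a failure that clusters in a single case type."""
    failing = total_encounters * case_type_share * failure_rate_in_type
    expected_reviewed = failing * sampling_rate
    p_at_least_one = 1 - (1 - sampling_rate) ** failing
    return expected_reviewed, p_at_least_one


# Hypothetical month: 2,000 encounters, 2% of them in a case type
# where the scribe fails 90% of the time, reviewed at a 1% sample.
expected, p_seen = sampling_outlook(2000, 0.02, 0.9, 0.01)
print(f"expected failing encounters reviewed: {expected:.2f}")
print(f"chance reviewers see even one: {p_seen:.0%}")
```

Under these assumptions reviewers expect to see well under one failing encounter from the affected case type per month, and a single sighting would not be enough to recognise the failure as systematic.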

Catching the specific failure categories requires:

  1. A failure taxonomy written with clinicians, not engineers
  2. Evaluation criteria calibrated to domain-specific definitions
  3. An evaluation model that reaches above 80% accuracy on domain-specific failure classification (reachable with techniques like criteria ensembling and reward modelling)
  4. Evaluation of 100% of encounters, not 1%
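As a rough sketch of how these four requirements fit together in code: the category names below come from this post, while the `Check` signature, the toy checks, and the majority vote are assumptions. In particular, the vote is only one simple reading of "criteria ensembling", not a description of any specific product's method.

```python
from enum import Enum
from typing import Callable, Sequence


class FailureCategory(Enum):
    HALLUCINATED_MEDICATION = "hallucinated_medication"
    OMITTED_FINDING = "omitted_finding"
    DIAGNOSTIC_LEAP = "diagnostic_leap"
    FABRICATED_CONSENT = "fabricated_consent"
    DOSAGE_ERROR = "dosage_error"
    CONTRAINDICATION_MISS = "contraindication_miss"


# A check takes (transcript, note) and returns True if it detects the failure.
Check = Callable[[str, str], bool]


def majority_vote(checks: Sequence[Check], transcript: str, note: str) -> bool:
    """Flag the failure only if more than half of the checks agree."""
    votes = [check(transcript, note) for check in checks]
    return sum(votes) * 2 > len(votes)


def evaluate(encounters, checks_by_category):
    """Run every category's checks on every encounter (no sampling)."""
    results = {}
    for encounter_id, (transcript, note) in encounters.items():
        results[encounter_id] = [
            category
            for category, checks in checks_by_category.items()
            if majority_vote(checks, transcript, note)
        ]
    return results


# Toy stand-ins for calibrated, clinician-reviewed criteria.
consent_checks = [
    lambda t, n: "consented" in n.lower() and "consent" not in t.lower(),
    lambda t, n: "was advised" in n.lower() and "risk" not in t.lower(),
    lambda t, n: False,  # a stand-in judge that never fires
]
encounters = {
    "enc-1": ("We will book the procedure for next week.",
              "Patient was advised of the risks and consented to proceed."),
    "enc-2": ("We went through the risks and you said you consent.",
              "Patient was advised of the risks and consented to proceed."),
}
results = evaluate(encounters, {FailureCategory.FABRICATED_CONSENT: consent_checks})
print(results)
```

Keeping the taxonomy as an explicit enumeration means the clinical team signs off on a fixed, reviewable list, and every encounter is scored against the same categories.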

What AI scribe deployments need

Given where the regulatory, legal, and evidentiary environment is, a defensible AI scribe deployment now needs:

  • A documented failure taxonomy, reviewed and signed off by the deploying clinical team, covering at minimum: hallucinated medications, omitted findings, diagnostic leaps, fabricated consent statements, dosage errors, contraindication misses.
  • An evaluation layer that runs on 100% of encounters, not a sample. Reviewing only 1% leaves the other 99% unchecked, which is where systematic failures of the kind ECRI flags go unnoticed.
  • Complete audit trails per encounter: trace, evaluation criteria, rationale, any reviewer corrections.
  • Independent evaluation, not only vendor self-attestation. Healthcare procurement is starting to require this.
  • SOC 2 Type II and BAA capability from the evaluation vendor.
  • A drift-handling plan, because AI scribe underlying models are updated frequently.
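A per-encounter audit record covering the trace, criteria, rationale, and reviewer corrections listed above might look something like this. The field names and example values are illustrative assumptions; recording the scribe model version alongside each verdict is one simple way to support drift analysis when the underlying model changes.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    encounter_id: str
    trace_id: str              # link back to the scribe's generation trace
    criterion: str             # which evaluation criterion was applied
    verdict: str               # e.g. "pass" or "fail"
    rationale: str             # why the evaluator reached the verdict
    scribe_model_version: str  # lets verdicts be compared across model updates
    reviewer_corrections: list[str] = field(default_factory=list)
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord(
    encounter_id="enc-1",
    trace_id="trace-abc",
    criterion="fabricated_consent",
    verdict="fail",
    rationale="Note asserts consent; no consent discussion found in transcript.",
    scribe_model_version="example-model-v1",  # hypothetical identifier
)
record.reviewer_corrections.append("Removed consent sentence from note.")
print(record.to_json())
```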

What Composo deploys for AI scribes

Composo deploys clinical AI evaluation infrastructure in 2 to 4 weeks. The failure taxonomy is reviewed and signed off by the customer’s clinical team. The evaluation model calibrates to the specific ways the customer’s AI scribe (regardless of vendor) fails. Evaluation runs on 100% of encounters with complete audit trails.

If you are deploying, validating, or buying AI scribes, book a diagnostic and see the specific failures Composo would catch on your actual production transcripts.

Further reading: Clinical AI Failure Modes, Evaluating Clinical AI: A Practical Guide.

Frequently asked questions

What is the documented error rate of AI scribes in production?

A Mayo Clinic Proceedings study reported a mean 26.3% error rate across recorded encounters, 13.9 errors per transcript, and 3.0 errors per case at moderate-to-severe harm potential. This is an averaged figure across major AI scribe vendors in the US market.

Have AI scribe lawsuits been filed?

Yes. US health systems including Sharp HealthCare, Sutter Health, and MemorialCare have been sued over AI scribes that auto-inserted fabricated consent statements into patient records. One reported case affected over 100,000 encounters. Legal theories cited include California CIPA, CMIA, and Federal Wiretap violations.

What specific failure modes should AI scribe evaluation catch?

Hallucinated medications (prescriptions not discussed), omitted findings (red-flag symptoms mentioned but not recorded), diagnostic leaps (conclusions without documented reasoning), fabricated consent or procedure statements, dosage and route errors, and contraindication misses.

Why has ECRI named AI chatbot misuse a top health technology hazard?

Documented real-world harms, including fabricated consent statements, medication errors, and missed differential diagnoses at scale. This is the first time an AI application has topped ECRI's annual health technology hazards list.

Is Composo specifically designed to catch AI scribe failures?

Yes. Composo's clinical failure taxonomy covers the specific categories of AI scribe failures documented in the clinical research literature and in recent lawsuits, including hallucinated medications, omitted red-flag findings, diagnostic leaps, and fabricated procedure or consent statements. Evaluation criteria are reviewed and signed off by the customer's clinical team.