Skip to content
Read our latest publication on optimal methods for LLM evaluation here
← Back to Blog

Composo Clinical Guardrails: real-time hallucination blocking for AI scribes

Ryan Lail · Founding Engineer ·

Your AI scribe is hallucinating into patient charts

Across ambient AI scribe encounters in production, 26.3% contain a clinical error – hallucinated medications, omitted findings, diagnostic leaps (Mayo Clinic Proceedings, 2026). The standard responses don’t work in the inference path:

  • Manual review is accurate, but doesn’t scale to every note.
  • Frontier LLM-as-judge is fast enough offline, but at tens of seconds and cents per note it can’t sit between the scribe and the clinician.
  • Generic safety guardrails catch profanity and PII, not a fabricated lab value or a flipped negation.

So most teams ship hoping the failure rate is low enough.

Introducing Composo Clinical Guardrails

Domain-calibrated hallucination detection for AI scribes, in the inference path.

A small, fast classifier sitting between the scribe model and the clinician. Trained on your specific labelled error set, calibrated to the failure modes that matter in your vertical, returning a YES/NO call in around two seconds.

Not a generic safety filter. Not a frontier-LLM judge. A purpose-built guardrail that learns what counts as a violation in your clinical setting.

What you get on day one

A tuned classifier deployed at the scribe → clinician boundary. Specifically:

  • A YES/NO call per generated note, on whether it contains a clinical hallucination, omission, or unsupported inference.
  • Sub-second-to-couple-of-seconds latency. Roughly six times faster than running a frontier reasoner in the same path.
  • Calibrated to your label set. The classifier is fine-tuned on synthetic data generated from constitutions that were themselves evolved against your labelled error corpus. The decision boundary matches the failure modes your team flags.
  • Deployable wherever you already run. Currently shipped as a Gemini 2.5 Flash fine-tune on Vertex AI; the same recipe ports to any small model with supervised-tuning support.

How we built it

Three stages. Each is reusable, calibrated to the customer’s labelled error set, and runs on Composo’s existing in-house tooling.

GEPA-evolved constitutions

A constitution in this setting is a short, human-readable rulebook – what counts as a violation, and what is permitted clinical synthesis. We write two: a harmful constitution listing violation categories (fabricated values, direction flips, unsupported clinical inference), and a harmless constitution cataloguing the permitted transformations a faithful scribe will routinely apply (paraphrase, abbreviation, unit normalisation, the routine inference any clinician would draw). Neither ever appears in the deployed classifier’s prompt – they shape the training data only.

Crucially, the constitutions are evolved, not written. We treat each constitution as a prompt under optimisation and use GEPA – reflective prompt evolution – to fit it to the labelled error corpus. The constitution gets wrapped in a single-call binary classifier over the corpus; per-claim accuracy with per-error-type feedback is the optimisation metric; GEPA’s reflection model reads execution traces, diagnoses failure modes, and proposes mutations to the constitution text. A Pareto frontier of diverse strong candidates is maintained. The output is a structured, human-auditable rulebook calibrated by the data rather than by hand-tuning.

Constitution-audited synthetic data

Two-step pipeline. Step 1 evolves two constitutions: a labelled-error corpus feeds GEPA reflective prompt evolution (DSPy, per-error-type feedback, Pareto-frontier candidates) to produce a harmful constitution (what counts as a violation) and a harmless constitution (what counts as a legitimate clinical paraphrase). Step 2 then uses both: real transcripts feed an LLM whose context contains both constitutions, generating a faithful SOAP note grounded only in the transcript; each sentence of that note is audited per-sentence against the harmful constitution, and any note with zero flagged sentences is kept as a verified-clean negative; a verified-clean note is then passed to a second LLM call that rewrites exactly one sentence into a hallucination of a chosen error class (alternating Hallucination + Misunderstanding and Inference for balance); the corrupted note is re-audited to confirm the injection actually landed; surviving notes form a balanced SFT corpus used to fine-tune Gemini 2.5 Flash on Vertex AI; the held-out test uses real notes from transcripts entirely withheld from synthetic generation.

Both constitutions go into the generator’s prompt. For each training transcript, an LLM writes a faithful SOAP note grounded only in the transcript; each sentence is then audited per-sentence against the harmful constitution; notes with zero flagged sentences are kept as verified-clean negatives. The clean audit is non-trivial – about 7% of “faithful” generations sneak in an accidental hallucination, and would otherwise poison the negative class without ever being noticed by inspection.

For positives, a verified-clean note is passed through a second LLM call that rewrites exactly one sentence into a hallucination of a target error class, alternating between Hallucination + Misunderstanding and Inference for class balance. The corrupted note is then re-run through the same per-sentence classifier: if any sentence now flags under the constitution, the injection has landed and the note is kept as a positive.

Concrete example of one synthetic violation kept in the training set. The original faithful sentence reads "Hill is a 41-year-old female presenting with pain at the end of her right middle finger." The injected version reads "Hill is a 42-year-old female presenting with pain at the end of her right middle finger." with the changed value highlighted in orange. This is a single-token wrong-value hallucination labelled as Hallucination + Misunderstanding.

Fine-tuned small model

The audited corpus – balanced positives and negatives, both classes spread across the training transcripts – is used to fine-tune Gemini 2.5 Flash on Vertex AI. The deployed system prompt is a short generic classifier instruction; the constitutions never appear at inference. The result is a small, fast model that has learned the constitutions’ decision boundary implicitly.

Why this beats the alternatives

Generic guardrails miss what’s specific. Off-the-shelf safety classifiers – including Anthropic’s published constitutional classifiers – are calibrated for jailbreak detection on a safety policy. They’re not trained to spot a fabricated lab value or a flipped negation in a clinical summary, because that’s not what they were built for. Domain-specific failures need domain-calibrated classifiers, and that calibration has to come from your labelled errors, not someone else’s policy.

Frontier LLM-as-judge is too expensive to sit in the path. A frontier reasoning model gives reliable answers – see the table below – but at thirteen seconds and several cents per note, it can’t sit between the scribe and the clinician. The Composo guardrail matches the same agreement at around two seconds and at a fraction of the cost.

DIY costs months that aren’t on the schedule. The GEPA optimisation loop, the constitution-audit pipeline, the supervised-tuning plumbing, and the evaluation harness are all in-house at Composo and reusable across customers and verticals. The expensive part – the part that cannot be synthesised away – is collecting and labelling the errors in the first place. Everything downstream is the same code path, the same tuning job, the same evaluation harness.

Proof

We score the tuned guardrail against held-out real ACI-Bench notes whose transcripts were never touched by the synthesis pipeline, comparing against gpt-5.5 with xhigh reasoning effort. Both models see the same short, generic deployment prompt; the constitutions are nowhere in the inference path. Agreement is reported as Cohen’s κ to control for class skew on the held-out set.

ModelMedian latencyκ
gpt-5.5 (xhigh reasoning)13.0 s1.000
Composo Clinical Guardrail (Gemini 2.5 Flash, fine-tuned)~2 s*1.000

* We’ve reached around 1 s on some production deployments.

Perfect agreement with the gold labels – every dirty note correctly flagged, every clean note correctly cleared. The same agreement as a frontier reasoner, at roughly six times lower latency.

Get started

Book a Diagnostic → – a clinical failure report on your AI scribe, categorised by type, severity, and frequency. Delivered in under a week, and shows you exactly which failure modes a Composo guardrail would catch.