Composo Clinical Guardrails: real-time hallucination blocking for AI scribes

Your AI scribe is hallucinating into patient charts

Across ambient AI scribe encounters in production, 26.3% contain a clinical error – hallucinated medications, omitted findings, diagnostic leaps (Mayo Clinic Proceedings, 2026). The standard responses don’t work in the inference path:

Manual review is accurate, but doesn’t scale to every note.
Frontier LLM-as-judge is fast enough offline, but at tens of seconds and cents per note it can’t sit between the scribe and the clinician.
Generic safety guardrails catch profanity and PII, not a fabricated lab value or a flipped negation.

So most teams ship hoping the failure rate is low enough.

Introducing Composo Clinical Guardrails

Domain-calibrated hallucination detection for AI scribes, in the inference path.

A small, fast classifier sitting between the scribe model and the clinician. Trained on your specific labelled error set, calibrated to the failure modes that matter in your vertical, returning a YES/NO call in around two seconds.

Not a generic safety filter. Not a frontier-LLM judge. A purpose-built guardrail that learns what counts as a violation in your clinical setting.

What you get on day one

A tuned classifier deployed at the scribe → clinician boundary. Specifically:

A YES/NO call per generated note, on whether it contains a clinical hallucination, omission, or unsupported inference.
Sub-second-to-couple-of-seconds latency. Roughly six times faster than running a frontier reasoner in the same path.
Calibrated to your label set. The classifier is fine-tuned on synthetic data generated from constitutions that were themselves evolved against your labelled error corpus. The decision boundary matches the failure modes your team flags.
Deployable wherever you already run. Currently shipped as a Gemini 2.5 Flash fine-tune on Vertex AI; the same recipe ports to any small model with supervised-tuning support.

How we built it

Three stages. Each is reusable, calibrated to the customer’s labelled error set, and runs on Composo’s existing in-house tooling.

GEPA-evolved constitutions

A constitution in this setting is a short, human-readable rulebook – what counts as a violation, and what is permitted clinical synthesis. We write two: a harmful constitution listing violation categories (fabricated values, direction flips, unsupported clinical inference), and a harmless constitution cataloguing the permitted transformations a faithful scribe will routinely apply (paraphrase, abbreviation, unit normalisation, the routine inference any clinician would draw). Neither ever appears in the deployed classifier’s prompt – they shape the training data only.

Crucially, the constitutions are evolved, not written. We treat each constitution as a prompt under optimisation and use GEPA – reflective prompt evolution – to fit it to the labelled error corpus. The constitution gets wrapped in a single-call binary classifier over the corpus; per-claim accuracy with per-error-type feedback is the optimisation metric; GEPA’s reflection model reads execution traces, diagnoses failure modes, and proposes mutations to the constitution text. A Pareto frontier of diverse strong candidates is maintained. The output is a structured, human-auditable rulebook calibrated by the data rather than by hand-tuning.

Constitution-audited synthetic data

Both constitutions go into the generator’s prompt. For each training transcript, an LLM writes a faithful SOAP note grounded only in the transcript; each sentence is then audited per-sentence against the harmful constitution; notes with zero flagged sentences are kept as verified-clean negatives. The clean audit is non-trivial – about 7% of “faithful” generations sneak in an accidental hallucination, and would otherwise poison the negative class without ever being noticed by inspection.

For positives, a verified-clean note is passed through a second LLM call that rewrites exactly one sentence into a hallucination of a target error class, alternating between Hallucination + Misunderstanding and Inference for class balance. The corrupted note is then re-run through the same per-sentence classifier: if any sentence now flags under the constitution, the injection has landed and the note is kept as a positive.

Fine-tuned small model

The audited corpus – balanced positives and negatives, both classes spread across the training transcripts – is used to fine-tune Gemini 2.5 Flash on Vertex AI. The deployed system prompt is a short generic classifier instruction; the constitutions never appear at inference. The result is a small, fast model that has learned the constitutions’ decision boundary implicitly.

Why this beats the alternatives

Generic guardrails miss what’s specific. Off-the-shelf safety classifiers – including Anthropic’s published constitutional classifiers – are calibrated for jailbreak detection on a safety policy. They’re not trained to spot a fabricated lab value or a flipped negation in a clinical summary, because that’s not what they were built for. Domain-specific failures need domain-calibrated classifiers, and that calibration has to come from your labelled errors, not someone else’s policy.

Frontier LLM-as-judge is too expensive to sit in the path. A frontier reasoning model gives reliable answers – see the table below – but at thirteen seconds and several cents per note, it can’t sit between the scribe and the clinician. The Composo guardrail matches the same agreement at around two seconds and at a fraction of the cost.

DIY costs months that aren’t on the schedule. The GEPA optimisation loop, the constitution-audit pipeline, the supervised-tuning plumbing, and the evaluation harness are all in-house at Composo and reusable across customers and verticals. The expensive part – the part that cannot be synthesised away – is collecting and labelling the errors in the first place. Everything downstream is the same code path, the same tuning job, the same evaluation harness.

Proof

We score the tuned guardrail against held-out real ACI-Bench notes whose transcripts were never touched by the synthesis pipeline, comparing against gpt-5.5 with xhigh reasoning effort. Both models see the same short, generic deployment prompt; the constitutions are nowhere in the inference path. Agreement is reported as Cohen’s κ to control for class skew on the held-out set.

Model	Median latency	κ
gpt-5.5 (xhigh reasoning)	13.0 s	1.000
Composo Clinical Guardrail (Gemini 2.5 Flash, fine-tuned)	~2 s*	1.000

* We’ve reached around 1 s on some production deployments.

Perfect agreement with the gold labels – every dirty note correctly flagged, every clean note correctly cleared. The same agreement as a frontier reasoner, at roughly six times lower latency.

Get started

Book a Diagnostic → – a clinical failure report on your AI scribe, categorised by type, severity, and frequency. Delivered in under a week, and shows you exactly which failure modes a Composo guardrail would catch.