Your clinical AI is getting things wrong in production right now.
Hallucinated medications. Omitted findings. Diagnostic leaps. Standard evals miss all of them. We catch these failures before they reach patients.
A clinical failure report on your AI - categorised by type, severity, and frequency. Delivered in under a week.
Evaluations > Run #2847
"...recommend starting lisinopril 40mg daily alongside lifestyle modifications..."
Medication not discussed in consultation. No evidence in transcript.
"Patient reported intermittent chest pain on exertion - absent from clinical note"
Clinically significant symptom mentioned at 04:32 but not documented.
"...consistent with chronic migraine pattern..."
Patient described "occasional headaches" - diagnostic leap to chronic migraine unsupported.
Why us
Built by doctors
Seb Fox trained in medicine before leading AI teams at McKinsey and QuantumBlack for 6 years. Clinical quality evaluation isn't text similarity - it's knowing whether an omitted finding changes a clinical decision, whether a hallucinated medication could cause harm, whether a diagnostic leap skips reasoning a clinician needs to verify.
Generic eval tools miss all of it. That's why we built Composo.
Use cases
For any clinical AI that generates text a clinician or patient reads.
Clinical note generation
Ambient scribes, clinical documentation AI, encounter summaries. We catch hallucinated medications, omitted findings, and diagnostic leaps that text similarity metrics miss entirely.
Clinical decision support
Diagnostic recommendations, treatment suggestions, risk scores with explanations. We evaluate whether the reasoning is clinically sound - not just whether it's grammatically correct.
Patient-facing content
After-visit summaries, discharge instructions, patient education materials. We catch inaccuracies, inappropriate clinical language, and missing safety information before patients read it.
Radiology AI
AI-generated radiology reports, finding summaries, and impression text. We catch omitted findings, unsupported conclusions, and laterality errors - the failures that matter when a clinician is reading 50 reports before lunch.
Pathology AI
AI-assisted pathology reports, diagnostic classifications, and specimen summaries. We evaluate whether findings are consistent, grading is accurate, and critical results are flagged - not just whether the text is coherent.
What we catch
Six failure types standard evals miss entirely
Hallucinated medications
AI documents a prescription that was never discussed. The clinician may act on fabricated information.
Omitted findings
Red flag symptoms mentioned by the patient but absent from the clinical note. The next clinician won't know to follow up.
Diagnostic leaps
Conclusions without documented supporting evidence. "Likely pneumonia" without exam findings bypasses clinical reasoning.
Unsupported inferences
Patient mentions occasional headaches, note records "chronic migraines". Diagnostic escalation without evidence.
Dosage errors
Wrong dose, frequency, or route. Particularly dangerous when the error is plausible and passes surface-level review.
Contraindication misses
Drug interactions present in patient history but not flagged by AI. The information is there - it's just not being used.
The engagement
4-8 weeks. Your clinical team shapes every decision.
Pre-engagement
BAA/DPA
Security review, BAA execution, DPA signing. We don't touch any data until the legal work is complete. This typically takes 2-4 weeks. We'll send you a security pack (SOC 2 attestation, pen test summary, DPA/BAA templates, data flow diagram) on first contact to speed this up.
Weeks 1-2
The clinical failure report
We connect to your AI traces and run our evaluation engine. You get a failure report - every clinical failure categorised by type, severity, and frequency. This is usually the moment teams realise what's been reaching patients.
Weeks 3-5
Clinicians calibrate
Your pharmacists, clinicians, or clinical informaticists review what we flagged and correct where we're wrong. 50 corrections is the typical turning point. We build out the clinical failure taxonomy and set guardrail thresholds for the highest-severity patterns.
Weeks 6-8
Handover
You own everything. Clinical evaluation criteria, failure taxonomy mapped to patient safety categories, guardrail rules, all clinician correction data. Additional deliverables: clinical safety summary, incident response protocol, clinician-facing correction guide, and a full audit trail.
In production
When a guardrail blocks an output
Every blocked output is logged with a full audit trail: the original AI output, the failure type, the severity classification, and the guardrail rule that triggered. Your clinical team gets notified through your existing alerting (Slack, PagerDuty, email - whatever you use). The blocked output never reaches the patient or clinician.
For high-severity blocks (hallucinated medications, dangerous dosage errors), the incident response protocol defines escalation paths - who reviews, how fast, and what gets documented. Every block, every review, every escalation is logged and exportable for compliance reporting.
No silent failures. No outputs slipping through between quarterly audits. Every decision is traceable.
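To make that concrete, here's a minimal sketch of what a blocked-output audit record could contain. This is an illustration only - the field names, rule name, and structure are assumptions, not Composo's actual log schema.

```python
# Illustrative sketch only - field and rule names are assumptions,
# not Composo's actual log schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BlockedOutputRecord:
    output_text: str       # the original AI output that was blocked
    failure_type: str      # e.g. "hallucinated_medication"
    severity: str          # e.g. "high" - drives the escalation path
    guardrail_rule: str    # the guardrail rule that triggered the block
    criteria_version: str  # evaluation criteria version in force
    confidence: float      # evaluator confidence score
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = BlockedOutputRecord(
    output_text="...recommend starting lisinopril 40mg daily...",
    failure_type="hallucinated_medication",
    severity="high",
    guardrail_rule="block_undiscussed_medications",
    criteria_version="clinical-v12",
    confidence=0.97,
)
```

Every field above maps to something your compliance team can export and your clinicians can act on.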
Deployment
Self-hosted options. Patient data never leaves your infrastructure.
We evaluate your AI's outputs, which means we see production data and process PHI under a BAA. With a self-hosted deployment, patient data stays in your environment; alternatively, we host with EU or US data residency.
Data retention: Configurable per your policy. Default 90 days for evaluation logs. Deletion on request within 10 business days. For self-hosted deployments, retention is entirely under your control - we never hold a copy.
Your environment
Runs in your Azure or AWS tenant. No cloud dependency on Composo.
No EMR access needed
We evaluate AI outputs, not the systems that produce them. We ingest traces from your AI layer.
Audit-ready artefacts
Every evaluation logged with timestamp, criteria version, confidence score. Evidence trail for certification reviewers.
Integration
How Composo connects to your stack
Composo connects to your AI's production traces - not to your EHR. We work with whatever logging you already have: Langfuse, Azure AI logs, custom JSON, or our Python SDK (4 lines of code).
We sit between your AI's output and the clinician's review. We never touch Epic, Cerner, or any EHR system directly. No HL7, no FHIR integration, no EHR access needed.
If your AI pulls from an EHR and generates clinical text, we evaluate that text. The AI-to-EHR pipeline is yours. The quality layer is ours.
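As an illustration, sending a trace as custom JSON might look like the sketch below. The endpoint path, payload fields, and criteria name are assumptions for illustration - see the SDK docs for the actual call.

```python
# Illustrative sketch only - the endpoint path, payload fields, and
# criteria name are assumptions, not the actual Composo API.
import requests

trace = {
    "input": "Patient reports occasional headaches, no aura...",  # what the AI saw
    "output": "Assessment: consistent with chronic migraine...",  # what it produced
    "criteria": "clinical-note-v1",                               # hypothetical criteria id
}

response = requests.post(
    "https://api.composo.ai/v1/evaluate",  # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=trace,
    timeout=30,
)
print(response.json())  # flagged failures with type, severity, and confidence
```

If you already log to Langfuse or Azure AI, no code changes are needed - we read the traces you have.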
Compliance
Built for regulated environments
| Certification | Status |
|---|---|
| SOC 2 Type II | Certified (Dec 2024). Full report available. |
| Penetration test | Complete (Aug 2025). Executive summary available. |
| DPA | Template ready. GDPR-compliant. |
| BAA | Template ready. HIPAA-compliant. |
| Security policies | 26 policies documented and current. |
Where we sit in the regulatory picture
Composo is not a medical device. We don't generate clinical decisions - we evaluate AI outputs that do. We sit between your AI's output and the clinician's review, flagging failures before they're acted on.
On FDA clearance: We're monitoring the FDA's evolving framework for AI/ML-enabled software, including the finalised Predetermined Change Control Plan (PCCP) guidance. Our current product - evaluating and flagging AI outputs for clinical review - does not make autonomous clinical decisions and is designed to support, not replace, clinician judgment. If the regulatory landscape changes, we'll pursue clearance. Today, we're focused on giving clinical teams the tools to catch what their AI gets wrong.
What we replace
The alternative is a clinical review board that checks 2% of outputs once a quarter.
Without Composo
Your clinical team manually reviews a small sample of AI outputs every quarter
Reviews catch what they catch - but miss everything between audits
When the AI model updates, nobody re-validates clinical accuracy
The clinical expert who designed the review process leaves. The review process decays
Failures accumulate silently between review cycles
No audit trail. When something goes wrong, you can't trace what the AI said, when it said it, or who reviewed it
With Composo
Every AI output evaluated against clinical criteria - not a sample
50 clinician corrections is the turning point. By month 2, the system catches failure types it missed in month 1
Guardrails block clinically dangerous outputs before they reach patients
Full audit trail on every evaluated output - exportable for compliance reporting
Deployed in your environment in 4-8 weeks
You own the evaluation criteria, the failure taxonomy, all correction data
See it in action
See a hallucinated medication caught in real time
See how Composo evaluates a real clinical AI output - catching citation errors, omitted findings, and unsupported inferences that generic evals miss.
From our customers
Trusted by teams where quality isn't optional
We embedded Composo into our AI Workers from day one - best decision we've made on testing. They provide peace of mind for us and our customers. No brainer.
Fehmi Sener
CTO, 5u.ai
We cut our QA cycle time by 70%. Instead of relying purely on human review, now we instantly know which prompts are failing and why.
Head of AI Engineering
Enterprise SaaS platform
For the first time, we can ship with complete confidence knowing exactly what our AI quality looks like at scale.
Senior Software Engineer
Instrumentl
LLM as a Judge was far too unreliable. Composo gave us the deterministic scoring we needed to actually track improvements.
Senior ML Engineer
Fortune 500 Financial Services
Your clinical AI is getting things wrong right now. We catch those failures before they reach patients.
A clinical failure report on your AI - categorised by type, severity, and frequency. Delivered in under a week. Deployed inside your environment.