Your clinical AI is getting things wrong in production right now.
Hallucinated medications. Omitted findings. Diagnostic leaps. Standard evals miss all of them. We catch these failures before they reach patients.
A clinical failure report on your AI - categorised by type, severity, and frequency. Delivered in under a week.
Evaluations > Run #2847
"...recommend starting lisinopril 40mg daily alongside lifestyle modifications..."
Medication not discussed in consultation. No evidence in transcript.
"Patient reported intermittent chest pain on exertion - absent from clinical note"
Clinically significant symptom mentioned at 04:32 but not documented.
"...consistent with chronic migraine pattern..."
Patient described "occasional headaches" - diagnostic leap to chronic migraine unsupported.
Why us
Built by doctors
Seb Fox trained in medicine before leading AI teams at McKinsey and QuantumBlack for 6 years. Clinical quality evaluation isn't text similarity - it's knowing whether an omitted finding changes a clinical decision, whether a hallucinated medication could cause harm, whether a diagnostic leap skips reasoning a clinician needs to verify.
Generic eval tools miss all of it. That's why we built Composo.
Use cases
For any clinical AI that generates text a clinician or patient reads.
Clinical note generation
Ambient scribes, clinical documentation AI, encounter summaries. We catch hallucinated medications, omitted findings, and diagnostic leaps that text similarity metrics miss entirely.
Clinical decision support
Diagnostic recommendations, treatment suggestions, risk scores with explanations. We evaluate whether the reasoning is clinically sound - not just whether it's grammatically correct.
Patient-facing content
After-visit summaries, discharge instructions, patient education materials. We catch inaccuracies, inappropriate clinical language, and missing safety information before patients read it.
Radiology AI
AI-generated radiology reports, finding summaries, and impression text. We catch omitted findings, unsupported conclusions, and laterality errors - the failures that matter when a clinician is reading 50 reports before lunch.
Pathology AI
AI-assisted pathology reports, diagnostic classifications, and specimen summaries. We evaluate whether findings are consistent, grading is accurate, and critical results are flagged - not just whether the text is coherent.
What we catch
Six failure types standard evals miss entirely
Hallucinated medications
AI documents a prescription that was never discussed. The clinician may act on fabricated information.
Omitted findings
Red flag symptoms mentioned by the patient but absent from the clinical note. The next clinician won't know to follow up.
Diagnostic leaps
Conclusions without documented supporting evidence. "Likely pneumonia" without exam findings bypasses clinical reasoning.
Unsupported inferences
Patient mentions occasional headaches, note records "chronic migraines". Diagnostic escalation without evidence.
Dosage errors
Wrong dose, frequency, or route. Particularly dangerous when the error is plausible and passes surface-level review.
Contraindication misses
Drug interactions present in patient history but not flagged by AI. The information is there - it's just not being used.
The engagement
4-8 weeks. Your clinical team shapes every decision.
Pre-engagement
BAA/DPA
Security review, BAA execution, DPA signing. We don't touch any data until the legal work is complete. This typically takes 2-4 weeks. We'll send you a security pack (SOC 2 attestation, pen test summary, DPA/BAA templates, data flow diagram) on first contact to speed this up.
Weeks 1-2
The clinical failure report
We connect to your AI traces and run our evaluation engine. You get a failure report - every clinical failure categorised by type, severity, and frequency. This is usually the moment teams realise what's been reaching patients.
Weeks 3-5
Clinicians calibrate
Your pharmacists, clinicians, or clinical informaticists review what we flagged and correct where we're wrong. 50 corrections is the typical turning point. We build out the clinical failure taxonomy and set guardrail thresholds for the highest-severity patterns.
Weeks 6-8
Handover
You own everything. Clinical evaluation criteria, failure taxonomy mapped to patient safety categories, guardrail rules, all clinician correction data. Additional deliverables: clinical safety summary, incident response protocol, clinician-facing correction guide, and a full audit trail.
In production
When a guardrail blocks an output
Every blocked output is logged with a full audit trail: the original AI output, the failure type, the severity classification, and the guardrail rule that triggered. Your clinical team gets notified through your existing alerting (Slack, PagerDuty, email - whatever you use). The blocked output never reaches the patient or clinician.
For high-severity blocks (hallucinated medications, dangerous dosage errors), the incident response protocol defines escalation paths - who reviews, how fast, and what gets documented. Every block, every review, every escalation is logged and exportable for compliance reporting.
No silent failures. No outputs slipping through between quarterly audits. Every decision is traceable.
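To make that concrete, here's a minimal sketch of what a blocked-output audit record could contain. This is an illustration only - the field names, rule name, and structure are assumptions, not Composo's actual log schema.

```python
# Illustrative sketch only - field and rule names are assumptions,
# not Composo's actual log schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BlockedOutputRecord:
    output_text: str       # the original AI output that was blocked
    failure_type: str      # e.g. "hallucinated_medication"
    severity: str          # e.g. "high" - drives the escalation path
    guardrail_rule: str    # the guardrail rule that triggered the block
    criteria_version: str  # evaluation criteria version in force
    confidence: float      # evaluator confidence score
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = BlockedOutputRecord(
    output_text="...recommend starting lisinopril 40mg daily...",
    failure_type="hallucinated_medication",
    severity="high",
    guardrail_rule="block_undiscussed_medications",
    criteria_version="clinical-v12",
    confidence=0.97,
)
```

Every field above maps to something your compliance team can export and your clinicians can act on.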
Deployment
Self-hosted options. Patient data never leaves your infrastructure.
We evaluate your AI's outputs, which means we see production data and process PHI under a BAA. With a self-hosted deployment, patient data stays in your environment; alternatively, we host with EU or US data residency.
Data retention: Configurable per your policy. Default 90 days for evaluation logs. Deletion on request within 10 business days. For self-hosted deployments, retention is entirely under your control - we never hold a copy.
Your environment
Runs in your Azure or AWS tenant. No cloud dependency on Composo.
No EMR access needed
We evaluate AI outputs, not the systems that produce them. We ingest traces from your AI layer.
Audit-ready artefacts
Every evaluation logged with timestamp, criteria version, confidence score. Evidence trail for certification reviewers.
Integration
How Composo connects to your stack
Composo connects to your AI's production traces - not to your EHR. We work with whatever logging you already have: Langfuse, Azure AI logs, custom JSON, or our Python SDK (4 lines of code).
We sit between your AI's output and the clinician's review. We never touch Epic, Cerner, or any EHR system directly. No HL7, no FHIR integration, no EHR access needed.
If your AI pulls from an EHR and generates clinical text, we evaluate that text. The AI-to-EHR pipeline is yours. The quality layer is ours.
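As an illustration, sending a trace as custom JSON might look like the sketch below. The endpoint path, payload fields, and criteria name are assumptions for illustration - see the SDK docs for the actual call.

```python
# Illustrative sketch only - the endpoint path, payload fields, and
# criteria name are assumptions, not the actual Composo API.
import requests

trace = {
    "input": "Patient reports occasional headaches, no aura...",  # what the AI saw
    "output": "Assessment: consistent with chronic migraine...",  # what it produced
    "criteria": "clinical-note-v1",                               # hypothetical criteria id
}

response = requests.post(
    "https://api.composo.ai/v1/evaluate",  # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=trace,
    timeout=30,
)
print(response.json())  # flagged failures with type, severity, and confidence
```

If you already log to Langfuse or Azure AI, no code changes are needed - we read the traces you have.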
Compliance
Built for regulated environments
| Certification | Status |
|---|---|
| SOC 2 Type II | Certified (Dec 2024). Full report available. |
| Penetration test | Complete (Aug 2025). Executive summary available. |
| DPA | Template ready. GDPR-compliant. |
| BAA | Template ready. HIPAA-compliant. |
| Security policies | 26 policies documented and current. |
Where we sit in the regulatory picture
Composo is not a medical device. We don't generate clinical decisions - we evaluate AI outputs that do. We sit between your AI's output and the clinician's review, flagging failures before they're acted on.
On FDA clearance: We're monitoring the FDA's evolving framework for AI/ML-enabled software, including the finalised Predetermined Change Control Plan (PCCP) guidance. Our current product - evaluating and flagging AI outputs for clinical review - does not make autonomous clinical decisions and is designed to support, not replace, clinician judgment. If the regulatory landscape changes, we'll pursue clearance. Today, we're focused on giving clinical teams the tools to catch what their AI gets wrong.
What we replace
The alternative is a clinical review board that checks 2% of outputs once a quarter.
Without Composo
Your clinical team manually reviews a small sample of AI outputs every quarter
Reviews catch what they catch - but miss everything between audits
When the AI model updates, nobody re-validates clinical accuracy
The clinical expert who designed the review process leaves. The review process decays
Failures accumulate silently between review cycles
No audit trail. When something goes wrong, you can't trace what the AI said, when it said it, or who reviewed it
With Composo
Every AI output evaluated against clinical criteria - not a sample
50 clinician corrections is the turning point. By month 2, the system catches failure types it missed in month 1
Guardrails block clinically dangerous outputs before they reach patients
Full audit trail on every evaluated output - exportable for compliance reporting
Deployed in your environment in 4-8 weeks
You own the evaluation criteria, the failure taxonomy, all correction data
See it in action
See a hallucinated medication caught in real time
See how Composo evaluates a real clinical AI output - catching citation errors, omitted findings, and unsupported inferences that generic evals miss.
From our customers
Trusted by teams where quality isn't optional
We embedded Composo into our AI Workers from day one - best decision we've made on testing. They provide peace of mind for us and our customers. No brainer.
Fehmi Sener
CTO, 5u.ai
We cut our QA cycle time by 70%. Instead of relying purely on human review, now we instantly know which prompts are failing and why.
Head of AI Engineering
Enterprise SaaS platform
For the first time, we can ship with complete confidence knowing exactly what our AI quality looks like at scale.
Senior Software Engineer
Instrumentl
LLM as a Judge was far too unreliable. Composo gave us the deterministic scoring we needed to actually track improvements.
Senior ML Engineer
Fortune 500 Financial Services
Your clinical AI is getting things wrong right now. We catch those failures before they reach patients.
A clinical failure report on your AI - categorised by type, severity, and frequency. Delivered in under a week. Deployed inside your environment.