Composo Align Platform Release

Luke Markham | CTO Composo

Your AI is making mistakes you can't see

You don't really know how your AI app is performing.

Push a change and it's hard to tell quickly whether quality improved or regressed. Run in production and it's hard to spot where things are quietly failing. The feedback loop is too slow, too manual, or too unreliable to act on.

So teams choose between two bad options. Manual human review is accurate, but slow and doesn't scale - you're sampling and hoping. LLM-as-judge is fast and cheap, but evaluates blind - no domain context, 30%+ score variance, roughly 70% agreement with your actual experts. You end up not trusting the scores anyway.

The result: most teams ship on vibes.

Introducing Composo

AI evaluation that learns your standards.

Not manual review. Not LLM-as-judge. A third option - an evaluation engine that actually gets better the more you use it.

Think of it like hiring a smart reviewer onto your team. Day one, they catch real issues - already better than a generic tool. But they don't know your specific standards yet. Over weeks, they see more examples, your team corrects them, they read the guidelines. By month two, they evaluate like your experts.

LLM-as-judge is the reviewer who never learns. Composo is the one who does.

What you get on day one

Send traces with simple criteria, get evaluations back immediately. No training period, no setup.

Scores you can trust. Every evaluation returns a score on a 0–1 scale with less than 1% variance - same input, same score, every time. Compare that to 30%+ variance with LLM-as-judge.
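One way to sanity-check that variance claim yourself is to score the same input repeatedly and measure the spread. A minimal sketch of that measurement - the score lists below are made-up illustrative numbers, not measured data:

```python
# Illustrative sketch: quantify scorer variance by re-running the same input
# and comparing the spread of returned scores. The lists below are made-up
# numbers for illustration, not measured data.
from statistics import mean, pstdev

def relative_spread(scores: list[float]) -> float:
    """Population standard deviation as a fraction of the mean score."""
    return pstdev(scores) / mean(scores)

# Hypothetical: the same input scored five times by each approach.
stable_scores = [0.84, 0.84, 0.84, 0.84, 0.84]     # deterministic scorer
llm_judge_scores = [0.60, 0.90, 0.75, 0.55, 0.85]  # high-variance judge

print(f"stable: {relative_spread(stable_scores):.1%}")
print(f"judge:  {relative_spread(llm_judge_scores):.1%}")
```

The same harness works against any scorer: if re-running an identical trace moves the score by tens of percent, diffs between app versions are mostly noise.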

Not just a number. Every evaluation returns a detailed written analysis: the specific reasoning behind the score, citations pointing to exactly what in the output caused it, and actionable recommendations on what to fix.

Frameworks you already need. Pre-built evaluation for the patterns you care about - RAG (faithfulness, completeness, precision, relevance), agents (goal pursuit, tool use, exploration/exploitation, faithfulness), and response quality. Or write your own criteria in plain English: "Reward responses that use precise medical terminology appropriate for the audience."

Already more reliable than LLM-as-judge, before you give it any domain context. This isn't an LLM guessing at a number. It's a purpose-built scoring engine - an ensemble of analysis models feeding a trained reward model.

What you can see - and why you can trust it

Every evaluation is transparent and auditable. Not a black box.

Each score shows which sources informed the judgment: which of your uploaded documents were used, which memories from previous evaluations contributed. You can trace any score back to the knowledge that produced it.

The analysis is specific and domain-aware. In a healthcare example: not "this response seems accurate" but "NICE NG226 cited accurately - arthroscopic lavage recommendation correctly characterised. Two precision issues reduce the score: Cochrane review attributed to 2017 instead of the current 2022 update, and CG147 cited alongside NG226 without noting its superseded status." Every claim grounded in the clinical guidelines you uploaded.

Your domain experts review evaluations and annotate them - good trace, bad trace - directly in the platform. That feedback gets encoded, and one correction improves every evaluation that follows.

For regulated domains - healthcare, finance, legal - this auditability isn't a nice-to-have. It's a requirement. You need to show why a score was given, not just that it was.

How it gets smarter

Three things make it compound over time.

Evaluation memory builds automatically from every evaluation. The system learns your full range of cases - where the edges are, what patterns matter - with zero effort from you.

Expert feedback encodes your team's corrections into future evaluations. When someone flags where it's wrong, that correction doesn't just fix one case. Similar cases don't need re-review, and the fix improves judgment on different cases too. One expert review ripples outward.

Domain knowledge turns your guidelines, SOPs, clinical protocols - whatever defines "correct" in your world - into evaluation context. Composo processes them and surfaces the right material automatically at eval time. You write short criteria. Your documents do the heavy lifting.

The progression:

  • Day 1: Reliable scores and analysis, out of the box.
  • Weeks 1–2: Domain docs ingested, evaluation becomes specific to your world.
  • Month 1+: Expert feedback calibrates, memory builds - the system matches your experts.

What you provide makes it compound. It's not a prerequisite for it to work.

What this looks like in practice

In a recent customer deployment - a UK healthcare AI company scaling an AI medical scribe across NHS regions and international markets - hand-tuned LLM judges required 60 hours of initial setup and 45 hours for each subsequent iteration: 150 hours across three iterations. With Composo, setup took 8 hours and each subsequent iteration 6 hours: 20 hours across the same three iterations. An 85% reduction in expert time.

The difference isn't just speed. When quality definitions shifted, the team provided new examples instead of re-engineering prompts. They iterated weekly instead of quarterly.

How to start

Python SDK, drop-in replacement for LLM-as-judge. Three lines:

result = composo.evaluate(
    messages=conversation,
    criteria="Reward responses that are faithful to the source material")

Native support for OpenAI, Anthropic, and Gemini response formats - pass them straight in. Auto-instrumentation for multi-agent traces: orchestrator, sub-agents, tool calls - one line to set up. Non-blocking in production, zero latency on your inference path.
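As a sketch of that non-blocking pattern - a background worker that scores traces off the request path. `fake_evaluate` here is a stand-in stub for the real scoring call, and the SDK's actual production mode may be configured differently:

```python
# Sketch of keeping evaluation off the inference path: traces are scored on a
# background worker so the user-facing response returns immediately.
# `fake_evaluate` is a stand-in for the real scoring call; the actual SDK's
# non-blocking mode may differ.
import queue
import threading

def fake_evaluate(messages, criteria):
    # Placeholder for the hosted scoring call.
    return {"score": 0.9, "analysis": "(reasoning here)"}

eval_queue: queue.Queue = queue.Queue()
results = []

def worker():
    while True:
        item = eval_queue.get()
        if item is None:  # shutdown signal
            break
        results.append(fake_evaluate(**item))
        eval_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(conversation):
    answer = conversation[-1]["content"]       # your normal inference result
    eval_queue.put({"messages": conversation,  # enqueue; does not block
                    "criteria": "Reward faithful responses"})
    return answer                              # user gets the answer immediately
```

The request handler only pays the cost of a queue put; scoring happens asynchronously, which is what "zero latency on your inference path" implies.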

Composo is not a platform you migrate to. It's the scoring engine underneath whatever you already use - Langfuse, Arize, LangSmith, your own dashboards. Build the cockpit. We'll build the engine.

Proof

In a financial services production deployment, Composo achieved 92% agreement with senior compliance reviewers - compared to 78% for the existing LLM judge. The inter-annotator agreement rate among the human experts themselves was 90%. Composo effectively matched expert-level agreement. (The 2-point difference between 92% and 90% is within the natural variance you'd expect when testing on real-world data - we are not claiming to outperform human experts!)
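The agreement figures above reduce to a simple metric: the fraction of items where two raters assign the same label. A minimal sketch, with illustrative labels rather than the deployment data:

```python
# Percent agreement between two raters over the same set of judgments.
# The label lists below are illustrative stand-ins, not the deployment data.
def agreement(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

expert = ["pass", "fail", "pass", "pass", "fail"]
judge  = ["pass", "pass", "pass", "pass", "fail"]
print(f"{agreement(expert, judge):.0%}")  # fraction of matching labels → 80%
```

The same function gives both numbers: run it judge-vs-expert for the 92%/78% comparison, and expert-vs-expert for the 90% inter-annotator baseline.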

The numbers across deployments:

  • 92% agreement with domain experts vs 78% for LLM-as-judge - matching the 90% human expert inter-annotator rate
  • 85% reduction in evaluator configuration time - from 40–60 hours to 6–8 hours
  • £650K in avoided staffing costs for one healthcare deployment - a 3-person team scaling across 6 markets
  • Evaluation engine in production for two years
  • SOC2 Type II compliant, EU data residency by default

Get started

Book a demo → See it working on your data.

Join the startup program → Early-stage teams get 3 months of free access and dedicated support.