AI evaluation that learns from your judgement

Composo learns your quality bar and scores like your best reviewers. 95% accuracy vs 70% for LLM as Judge. 4 lines of code.

Trusted by the best AI teams

Works immediately. Gets smarter with your data.

No rubrics. No prompt engineering. Just reliable scoring that matches your experts - and improves as you add feedback.


from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

result = client.evaluate(
    messages=[
        {"role": "user", "content": """What's the return policy?

Context: [Retrieved returns policy document...]"""},
        {"role": "assistant", "content": "You can return items within 30 days of purchase. You'll receive store credit for the full amount. Contact our support team to initiate the return."}
    ],
    criteria="Reward agents that accurately reflect the provided context"
)
  

import composo
composo.init()

# Decorate your agent so Composo captures a trace of its LLM and tool calls
@composo.agent_tracer("your_agent")
def your_agent(...):
    ...

criteria = ["Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls"]

# Score the captured trace against your criteria
composo.evaluate_trace(trace, criteria)
  
Response

{
  "score": 0.68,
  "explanation": "
  - The response correctly states the 30-day return window and store credit policy, accurately conveying the core return terms
  - The timeframe 'within 30 days of purchase' and refund type are both factually accurate, providing reliable information to customers
  - The response fails to direct customers to the specific returns portal (returns.store.com), instead giving vague instructions to 'contact support team' which could cause confusion"
}
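
Continuing from the first snippet, acting on the result is a couple of lines. A minimal sketch, assuming the returned object exposes the score and explanation fields shown in this Response (attribute access and the 0.8 threshold are illustrative assumptions):

# Continuing from the evaluate() call in the first snippet
if result.score < 0.8:           # illustrative quality bar - pick your own
    print(result.explanation)    # bullet-point breakdown of what failed and why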
  
Know instantly if changes helped

Test any prompt or model change in minutes, not weeks of manual review.
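
As a sketch of that workflow, two candidate answers (say, produced by an old and a new prompt) can be scored against the same criterion and compared directly. Only client.evaluate comes from the snippet above; the answer strings and the score attribute on the result are illustrative assumptions.

from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

QUESTION = "What's the return policy?\n\nContext: [Retrieved returns policy document...]"
CRITERIA = "Reward agents that accurately reflect the provided context"

def score(answer: str) -> float:
    """Score one candidate answer against the shared criterion."""
    result = client.evaluate(
        messages=[
            {"role": "user", "content": QUESTION},
            {"role": "assistant", "content": answer},
        ],
        criteria=CRITERIA,
    )
    return result.score

# Illustrative answers from the old and new prompt versions
old_answer = "You can return items within 30 days for store credit. Contact support to start a return."
new_answer = "Returns are accepted within 30 days of purchase for store credit via the returns portal."

print("old prompt:", score(old_answer))
print("new prompt:", score(new_answer))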

Scoring that matches your experts

Purpose-built evaluation engine trained on 1M+ expert comparisons. Calibrates to your standards, not generic rubrics.

4 lines of code to integrate

Native tracing for OpenAI, Anthropic, LangChain and more. Single-sentence criteria - no complex rubrics needed.
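
As a minimal sketch of what that looks like for an OpenAI-backed agent, reusing only the composo.init() / @composo.agent_tracer pattern shown earlier (the agent name, model name and question are placeholders):

import composo
from openai import OpenAI

composo.init()
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

@composo.agent_tracer("support_agent")
def support_agent(question: str) -> str:
    # A normal OpenAI call; the decorator captures it as part of the agent trace
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

support_agent("What's the return policy?")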

Works with your existing stack

Drop into any pipeline, CI/CD or observability tool. Or use our dashboards.
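
For CI/CD, one pattern is a test that fails the build when a score drops below your bar. A pytest sketch under the same assumptions as the snippets above (the test case and the 0.8 threshold are illustrative):

import pytest
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

CASES = [
    ("What's the return policy?\n\nContext: [Retrieved returns policy document...]",
     "You can return items within 30 days of purchase for store credit."),
]

@pytest.mark.parametrize("question,answer", CASES)
def test_meets_quality_bar(question, answer):
    result = client.evaluate(
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        criteria="Reward agents that accurately reflect the provided context",
    )
    # Fail the pipeline if quality regresses below the chosen threshold
    assert result.score >= 0.8, result.explanation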

The evaluation layer for teams who can't afford to guess

Used by leading AI teams for offline testing, production monitoring & real-time guardrails

Ship 10x faster

Instant scores on every change. No more waiting weeks for manual review, or hoping your LLM as Judge is right.


Catch issues before users complain

Real-time monitoring finds problems in minutes. Teams without this find out from support tickets.
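
A minimal monitoring sketch under the same assumptions as the snippets above (the threshold and the print-based alert are illustrative stand-ins for your own alerting):

from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

ALERT_THRESHOLD = 0.7  # illustrative bar - tune to your own quality standard

def check_live_response(question: str, answer: str) -> None:
    """Score a production response and flag it if it falls below the bar."""
    result = client.evaluate(
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        criteria="Reward agents that accurately reflect the provided context",
    )
    if result.score < ALERT_THRESHOLD:
        # Replace with Slack / PagerDuty / logging in a real pipeline
        print(f"Low-scoring response ({result.score:.2f}): {result.explanation}")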


See what's broken and why

Not just a score - detailed analysis showing exactly what failed, with citations and improvement recommendations.

Your experts define quality. But they don't scale.

Manual review takes weeks and only covers a fraction of outputs. LLM as Judge is inconsistent and doesn't match your standards. Teams shipping to production need both speed & accuracy.

Finally, evaluation that matches your experts

95% agreement with expert judgement vs 70% for LLM as Judge - cutting the error rate from 30% to 5%, a 6x reduction.

Trusted by teams where quality isn't optional

From early-stage startups to enterprise AI teams in healthcare, legal, finance & other complex domains.

Our Blog

See how teams ship faster with confidence

Our Pricing

Starter

100 evaluations/month
Full access to evaluation engine
Direct API access
5 requests/min rate limit
Support for all evaluation types (agents, tool calls, RAG)

Best for individuals & testing

Professional

50k to 1M+ evaluations/month
Priority processing & high throughput
Analytics dashboard + API
Direct support from founders
Calibration to your quality standards
Enterprise features available (on-prem, SLAs, DPAs) 

Best for teams shipping to production

Scale your experts with confidence.

Get evaluation that matches your quality bar - in 4 lines of code.