The testing & optimization engine for AI applications.

Deterministic scoring that's 3x more accurate than LLM as Judge.
10x faster. 4 lines of code.

Trusted by the best AI teams

Elegantly simple to use

No rubrics. No prompt engineering. Just deterministic scoring that works.


from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

result = client.evaluate(
    messages=[
        {"role": "user", "content": """What's the return policy?

Context: [Retrieved returns policy document...]"""},
        {"role": "assistant", "content": "You can return items within 30 days of purchase. You'll receive store credit for the full amount. Contact our support team to initiate the return."}
    ],
    criteria="Reward agents that accurately reflect the provided context"
)
  

import composo
composo.init()

# Decorate your agent so its calls are traced
@composo.agent_tracer("your_agent")
def your_agent(*args, **kwargs):
    ...

criteria = ["Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls"]

# Score the captured trace against your criteria
composo.evaluate_trace(trace, criteria)
  
Response

{
  "score": 0.68,
  "explanation": "
  - The response correctly states the 30-day return window and store credit policy, accurately conveying the core return terms
  - The timeframe 'within 30 days of purchase' and refund type are both factually accurate, providing reliable information to customers
  - The response fails to direct customers to the specific returns portal (returns.store.com), instead giving vague instructions to 'contact support team' which could cause confusion"
}
  
Know instantly if changes helped

Test any prompt or model change in minutes, not weeks

Deterministic scoring you can trust

Combines trained reasoning models with deterministic reward models. Fully customisable to your domain.

4 lines of code to integrate

Native tracing for OpenAI, Anthropic, LangChain and more. Single-sentence criteria.

Works with your existing stack

Drop into any pipeline, CI/CD or observability tool, or use our dashboards.
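
As an illustration, a quality gate in CI can be a plain unit test around the evaluate call. The sketch below is hypothetical: it assumes the Python client returns the score and explanation fields shown in the response above, and the 0.8 threshold is just an example.

# Illustrative CI unit test (run with pytest).
# Assumes the evaluate response exposes .score and .explanation as in the JSON above.
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

def test_return_policy_answer():
    # In practice, call your application here to produce the reply under test
    reply = "You can return items within 30 days of purchase for store credit."
    result = client.evaluate(
        messages=[
            {"role": "user", "content": "What's the return policy?\n\nContext: [Retrieved returns policy document...]"},
            {"role": "assistant", "content": reply},
        ],
        criteria="Reward agents that accurately reflect the provided context",
    )
    # Fail the build if quality drops below the agreed threshold
    assert result.score >= 0.8, result.explanation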

The testing layer for teams shipping AI to production

Used by leading AI teams to power unit tests, benchmarking, monitoring & real-time guardrails
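
As one sketch of the guardrail pattern, a draft response can be scored before it is served and swapped for a fallback when it misses a threshold. The response shape, the threshold and the fallback_response helper below are illustrative assumptions, not part of the SDK.

# Illustrative real-time guardrail; THRESHOLD and fallback_response() are placeholders.
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")
THRESHOLD = 0.7

def fallback_response(user_message: str) -> str:
    return "Let me connect you with our support team for that one."

def guarded_reply(user_message: str, draft_reply: str) -> str:
    result = client.evaluate(
        messages=[
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": draft_reply},
        ],
        criteria="Reward responses that accurately reflect the provided context",
    )
    # Only serve the draft if it clears the quality bar; otherwise fall back
    return draft_reply if result.score >= THRESHOLD else fallback_response(user_message)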

Ship 10x faster

Instant scores on every change, no more 2-week manual review cycles


Catch issues before customers do

Real-time monitoring finds problems in minutes. Teams without this find out from complaints.


See what's broken and why

Not just a score - detailed analysis showing exactly what failed and how to fix it.

Agents can't work without deterministic evals

Manual checks take weeks. LLM as Judge is slow, expensive and only 70% accurate. Teams shipping to production need instant, deterministic scoring at 95% accuracy.

Finally, a reliable signal on AI quality

95% accuracy vs 70% for LLM as Judge - cutting the error rate from 30% to 5%.

We're trusted by the best teams in AI

Proven results across startups & enterprises in the most complex verticals

Our Blog

Case Studies

Our Pricing

Starter

100 evaluations/month
Access to our best evaluation models
Direct API access
5 requests/min rate limit
Support for all evaluation types (agents, tool calls, RAG)

Best for individuals & testing

Professional

50k to 1m+ evaluations/month
High throughput rate limits & priority processing
Insights & analytics platform (as well as direct API)
Direct 1-1 support from founders
Dedicated onboarding assistance
Enterprise features available (on-prem, SLAs, DPAs) 

Best for teams shipping AI to production - most teams start here

Stop guessing. Start knowing.

Get instant, deterministic scoring in 4 lines of code.