The testing & optimization engine for AI applications.

Deterministic scoring that's 3x more accurate than LLM as Judge.
10x faster. 4 lines of code.

Trusted by the best AI teams

Elegantly simple to use

No rubrics. No prompt engineering. Just deterministic scoring that works.


from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

result = client.evaluate(
    messages=[
        {"role": "user", "content": """What's the return policy?

Context: [Retrieved returns policy document...]"""},
        {"role": "assistant", "content": "You can return items within 30 days of purchase. You'll receive store credit for the full amount. Contact our support team to initiate the return."}
    ],
    criteria="Reward agents that accurately reflect the provided context"
)
  

import composo
composo.init()

# Decorate your agent so its calls are traced
@composo.agent_tracer("your_agent")
def your_agent(*args, **kwargs):
    ...

criteria = ["Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls"]

# Score the captured trace against your criteria
composo.evaluate_trace(trace, criteria)
  
Response

{
  "score": 0.68,
  "explanation": "
  - The response correctly states the 30-day return window and store credit policy, accurately conveying the core return terms
  - The timeframe 'within 30 days of purchase' and refund type are both factually accurate, providing reliable information to customers
  - The response fails to direct customers to the specific returns portal (returns.store.com), instead giving vague instructions to 'contact support team' which could cause confusion"
}
  
Know instantly if changes helped

Test any prompt or model change in minutes, not weeks

Deterministic scoring you can trust

Combines trained reasoning models with deterministic reward models. Fully customisable to your domain.

4 lines of code to integrate

Native tracing for OpenAI, Anthropic, LangChain and more. Single-sentence criteria.

Works with your existing stack

Drop into any pipeline, CI/CD or observability tool, or use our dashboards.
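
As an illustration, a quality gate in CI can be a plain unit test around the evaluate call. The sketch below is hypothetical: it assumes the Python client returns the score and explanation fields shown in the response above, and the 0.8 threshold is just an example.

# Illustrative CI unit test (run with pytest).
# Assumes the evaluate response exposes .score and .explanation as in the JSON above.
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

def test_return_policy_answer():
    # In practice, call your application here to produce the reply under test
    reply = "You can return items within 30 days of purchase for store credit."
    result = client.evaluate(
        messages=[
            {"role": "user", "content": "What's the return policy?\n\nContext: [Retrieved returns policy document...]"},
            {"role": "assistant", "content": reply},
        ],
        criteria="Reward agents that accurately reflect the provided context",
    )
    # Fail the build if quality drops below the agreed threshold
    assert result.score >= 0.8, result.explanation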

The testing layer for teams shipping AI to production

Used by leading AI teams to power unit tests, benchmarking, monitoring & real-time guardrails
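
As one sketch of the guardrail pattern, a draft response can be scored before it is served and swapped for a fallback when it misses a threshold. The response shape, the threshold and the fallback_response helper below are illustrative assumptions, not part of the SDK.

# Illustrative real-time guardrail; THRESHOLD and fallback_response() are placeholders.
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")
THRESHOLD = 0.7

def fallback_response(user_message: str) -> str:
    return "Let me connect you with our support team for that one."

def guarded_reply(user_message: str, draft_reply: str) -> str:
    result = client.evaluate(
        messages=[
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": draft_reply},
        ],
        criteria="Reward responses that accurately reflect the provided context",
    )
    # Only serve the draft if it clears the quality bar; otherwise fall back
    return draft_reply if result.score >= THRESHOLD else fallback_response(user_message)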

Ship 10x faster

Instant scores on every change, no more 2-week manual review cycles


Catch issues before customers do

Real-time monitoring finds problems in minutes. Teams without this find out from complaints.


See what's broken and why

Not just a score - detailed analysis showing exactly what failed and how to fix it.

Agents can't work without deterministic evals

Manual checks take weeks. LLM as Judge is slow, expensive and only 70% accurate. Teams shipping to production need instant, deterministic scoring at 95% accuracy.

Finally, a reliable signal on AI quality

95% accuracy vs 70% for LLM as Judge - cutting the error rate from 30% to 5%.

We're trusted by the best teams in AI

Proven results across startups & enterprises in the most complex verticals

Our Blog

Case Studies

Our Pricing

Starter

100 evaluations/month
Access to our best evaluation models
Direct API access
5 requests/min rate limit
Support for all evaluation types (agents, tool calls, RAG)

Best for individuals & testing

Professional

50k to 1m+ evaluations/month
High throughput rate limits & priority processing
Insights & analytics platform (as well as direct API)
Direct 1-1 support from founders
Dedicated onboarding assistance
Enterprise features available (on-prem, SLAs, DPAs) 

Best for teams shipping AI to production - most teams start here

Stop guessing. Start knowing.

Get instant, deterministic scoring in 4 lines of code.