Legal Tech Startup Ships MVP in 4 Weeks Using Composo

Composo Team

Industry: Legal Tech
Company Stage: Seed, 12 employees
Use Case: AI-powered commercial lease review for SMBs

The Challenge

A legal tech startup had just closed a $4.5M seed round with commitments from three enterprise pilots, contingent on demonstrating reliable AI performance within six weeks. Their AI assistant for commercial lease review was functionally complete, but they had no evaluation framework in place.

The evaluation bottleneck was killing them:

  • Their legal advisor spent 15 hours per week manually reviewing outputs at $300/hour
  • Manual review was their single largest operating expense after salaries
  • Each product iteration took 3-4 days to validate, making rapid improvement impossible
  • Enterprise prospects explicitly required quality metrics and evaluation processes for compliance
  • They had no historical data or labeled examples to build traditional ML evaluation systems

The team had tried LLM-as-judge evaluation but abandoned it after scores varied by 20% between identical runs and crucial errors went undetected. They considered building their own evaluation infrastructure, but estimated it would take 2-3 months of engineering time they didn't have.

Implementation with Composo

The startup implemented Composo in one afternoon. They converted their legal team's expertise into evaluation criteria:

import composo

# Initialize Composo - that's it, no custom models needed
composo.init()

# Define evaluation criteria in plain English
criteria = [
    "Reward responses that identify non-standard payment terms that could impact cash flow",
    "Reward responses that flag ambiguous jurisdiction clauses for legal review",
    "Reward responses that detect missing tenant protection clauses required in commercial leases",
    "Penalize responses that provide definitive legal interpretations without appropriate disclaimers",
    "Penalize responses that fabricate standard terms not actually present in the document"
]

# Run evaluation on any LLM output ("trace" is the captured
# request/response interaction you want to score)
results = composo.evaluate_trace(trace, criteria)

# Returns deterministic scores (0-1) with detailed explanations

Full implementation examples at docs.composo.ai

Pre-Deployment Testing and Iteration

In the first week, the startup tested 150+ prompt variations and 5 different models. What previously required 3 weeks of manual review took 1 day. They discovered:

  • Claude Sonnet outperformed GPT-5 for their use case with a 20% increase in accuracy
  • Structured output formats improved accuracy by 23%
  • The model consistently failed on multi-party leases (requiring targeted prompts)
  • Tool calling failures were traced to a poorly defined request schema in their legal database queries
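
A sweep like this can be sketched as a loop over prompt variants and models, ranking each combination by its mean evaluation score. Everything below is illustrative: `run_model` and `score` are hypothetical stand-ins, and in a real pipeline `score` would call `composo.evaluate_trace` against the plain-English criteria rather than fake a number.

```python
from itertools import product
from statistics import mean

# Hypothetical prompt variants and model names, for illustration only
PROMPTS = {"v1": "Review this lease.", "v2": "Review step by step; flag risks."}
MODELS = ["model-a", "model-b"]

def run_model(model: str, prompt: str, lease_text: str) -> str:
    # Stand-in for a real LLM call
    return f"[{model}] {prompt} -> analysis of {lease_text}"

def score(output: str) -> float:
    # Stand-in for composo.evaluate_trace; fakes a deterministic 0-1 score
    return (len(output) % 50) / 50.0

def sweep(lease_texts: list[str]) -> list[tuple[float, str, str]]:
    """Score every (model, prompt) combination and rank best-first."""
    ranked = []
    for model, (name, prompt) in product(MODELS, PROMPTS.items()):
        avg = mean(score(run_model(model, prompt, t)) for t in lease_texts)
        ranked.append((avg, model, name))
    ranked.sort(reverse=True)
    return ranked

ranking = sweep(["lease A", "lease B"])
```

With 5 models and 150+ prompt variants the same loop simply enumerates more combinations; the ranking logic is unchanged.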

Every code change now triggered automated evaluations in CI/CD, blocking deployments that degraded quality.
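
A minimal sketch of what such a CI gate might look like, assuming the evaluator returns a 0-1 score per criterion. `evaluate` here is a hypothetical stand-in for `composo.evaluate_trace`; the non-zero return value is what a CI job would use as its exit code to block the deployment.

```python
THRESHOLD = 0.75  # minimum acceptable per-criterion score

def evaluate(trace: str, criteria: list[str]) -> dict[str, float]:
    # Hypothetical stand-in for composo.evaluate_trace: in CI this would
    # call the real evaluator; here it fakes deterministic scores
    return {c: 0.8 for c in criteria}

def quality_gate(traces: list[str], criteria: list[str]) -> int:
    """Return 0 (pass) or 1 (fail), suitable as a CI job's exit code."""
    for trace in traces:
        scores = evaluate(trace, criteria)
        failing = {c: s for c, s in scores.items() if s < THRESHOLD}
        if failing:
            print(f"quality gate failed on criteria: {sorted(failing)}")
            return 1
    return 0

exit_code = quality_gate(
    ["example lease review output"],
    ["flags ambiguous jurisdiction clauses",
     "avoids fabricated standard terms"],
)
```

Wired into CI, the job runs the evaluation suite on a fixed set of test traces and exits non-zero on any regression, so degraded builds never reach production.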

Production Monitoring and Continuous Improvement

The startup deployed with Composo monitoring every production response. Production insights from the first month:

  • Drift detection caught a model update that degraded sublease clause detection from 91% to 67%
  • Error patterns revealed systematic failures on specific clause types (indemnification: 62% accuracy)
  • Quality distribution showed 94% of responses met their 0.75 threshold, with clear data on the failing 6%
  • Customer-specific analysis revealed that one pilot's non-standard lease format caused consistent failures

The team used this data to prioritize improvements. They knew exactly which document types and clause categories needed attention, turning vague quality concerns into specific engineering tasks.
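
One way to derive that prioritization is to aggregate per-clause-type pass rates from scored production responses and rank the worst performers first. The clause labels and scores below are illustrative, not the startup's real data:

```python
from collections import defaultdict

# Each record: (clause_type, evaluation score in [0, 1]) — illustrative data
scored_responses = [
    ("indemnification", 0.55), ("indemnification", 0.70),
    ("sublease", 0.90), ("sublease", 0.95),
    ("payment_terms", 0.80),
]

def accuracy_by_clause(records, threshold=0.75):
    """Fraction of responses per clause type meeting the quality threshold."""
    passed, total = defaultdict(int), defaultdict(int)
    for clause, score in records:
        total[clause] += 1
        if score >= threshold:
            passed[clause] += 1
    return {c: passed[c] / total[c] for c in total}

def worst_first(records, threshold=0.75):
    """Clause types ranked from lowest to highest pass rate."""
    acc = accuracy_by_clause(records, threshold)
    return sorted(acc.items(), key=lambda kv: kv[1])

priorities = worst_first(scored_responses)
```

The head of the ranked list is the engineering backlog: each entry names a clause category and quantifies how far it sits below the quality bar.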

Results

First 6 Weeks:

  • Passed all three enterprise pilot evaluations with documented quality metrics
  • Reduced evaluation costs from $4,500/week to $200/week
  • Decreased iteration cycle from 3-4 days to 2-3 hours
  • Achieved 92% accuracy on lease review tasks (measured consistently)

Six Months Later:

  • 24 paying customers including 2 enterprise accounts
  • 8,000+ automated evaluations daily across 45 criteria
  • Quality improvements driven by production data (accuracy increased from 92% to 97%)
  • Successfully passed SOC 2 audit with comprehensive evaluation documentation

The CTO noted that Composo solved their immediate crisis: they needed quality metrics for enterprise sales but had no data, no time, and no budget to build a traditional evaluation system. More importantly, production monitoring revealed exactly where and how their product failed, transforming quality from a vague concern into measurable, actionable engineering work.

Implementation details and documentation available at docs.composo.ai