Question 1

How much does it actually cost to build an internal LLM evaluation pipeline?

Accepted Answer

A minimum viable setup takes one senior ML engineer about 3 months (~£45-60k fully loaded) and gets you a working LLM-as-judge integrated into CI/CD. A production-grade system calibrated to your domain takes closer to 6 months, with one full-time engineer plus a part-time domain expert. For ongoing maintenance, budget ~2 engineers once you are running at scale. The hidden cost is opportunity cost: those engineers are not shipping product.
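The arithmetic behind these figures can be sketched as follows. This is a rough estimate, not a quote: the monthly rate is simply back-derived from the £45-60k over 3 months figure above, `build_cost` is a hypothetical helper, and applying the same rate to the two maintenance engineers is an assumption. The part-time domain expert is not costed.

```python
# Fully loaded monthly cost range for one senior engineer, back-derived from
# the figure in the answer above (£45-60k over 3 months). Assumption: the
# same rate applies to every engineer on the project.
MONTHLY_COST_GBP = (45_000 / 3, 60_000 / 3)  # (low, high)


def build_cost(engineers: float, months: float) -> tuple[float, float]:
    """Return the (low, high) fully loaded cost in GBP."""
    low, high = MONTHLY_COST_GBP
    return engineers * months * low, engineers * months * high


# Minimum viable setup: one engineer for 3 months.
mvp = build_cost(engineers=1, months=3)

# Production-grade system: one engineer for 6 months (domain expert excluded).
production = build_cost(engineers=1, months=6)

# Ongoing maintenance: ~2 engineers, costed per year (assumed same rate).
maintenance_per_year = build_cost(engineers=2, months=12)

for label, (low, high) in [
    ("MVP", mvp),
    ("Production build", production),
    ("Maintenance per year", maintenance_per_year),
]:
    print(f"{label}: £{low:,.0f} to £{high:,.0f}")
```

Under these assumptions the maintenance line quickly outweighs the build itself, which is why the opportunity cost matters more than the upfront estimate.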

Question 2

Why does accuracy plateau when you build your own?

Accepted Answer

Most internal builds use LLM-as-judge with a single evaluator prompt. That typically reaches around 70% alignment with human domain experts. Getting to the 90%+ alignment bar needed for production gating requires techniques like criteria ensembling, variance-informed calibration, and domain-specific reward modelling. These are real research contributions, not prompt engineering, and they take months to develop internally.
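To make "alignment" concrete, here is a minimal sketch of how it could be measured and gated. Everything in it is an assumption for illustration: alignment is taken to be the simple agreement rate between judge verdicts and expert labels, the majority vote is a stand-in for ensembling (the answer above does not describe how criteria ensembling actually works), and the verdict lists are made-up toy data that only exercise the functions. They are not results.

```python
def alignment(judge: list[bool], human: list[bool]) -> float:
    """Fraction of samples where the judge verdict matches the expert label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def majority_vote(judges: list[list[bool]]) -> list[bool]:
    """Combine several judges' pass/fail verdicts per sample by strict majority."""
    n = len(judges)
    return [2 * sum(votes) > n for votes in zip(*judges)]


# Toy data (hypothetical): expert pass/fail labels for 10 samples.
human = [True, True, False, False, True, False, True, True, False, True]

# Three hypothetical single-criterion judges, each wrong on different samples.
judge_a = [True, True, False, True, True, False, False, True, True, True]
judge_b = [True, False, False, False, True, False, True, True, False, True]
judge_c = [True, True, True, False, True, False, True, False, False, True]

GATE = 0.90  # the 90%+ bar mentioned above for production gating

single_score = alignment(judge_a, human)
ensemble_score = alignment(majority_vote([judge_a, judge_b, judge_c]), human)

print(f"single judge: {single_score:.0%}, passes gate: {single_score >= GATE}")
print(f"ensemble:     {ensemble_score:.0%}, passes gate: {ensemble_score >= GATE}")
```

The vote only helps here because the toy judges err on different samples. Judges that share the same blind spot would agree with each other and still miss the expert label, which is one reason a plain agreement rate on a small sample should be read with caution.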

Question 3

What about using open-source evaluation frameworks?

Accepted Answer

Open-source options (Promptfoo, DeepEval, Phoenix, Ragas) get you started quickly and are excellent for CI/CD-style offline evaluation. They do not solve the harder problems: catching domain-specific failures that generic evaluators miss, calibrating to your failure taxonomy, running guardrails at production scale, and handling drift. You can run open-source tools alongside Composo or build your own layer on top of them; we see both.

Question 4

Does Composo lock us in? Can we export the evaluation logic if we part ways?

Accepted Answer

The evaluation criteria, failure taxonomy, and annotated calibration data are yours. The underlying scoring model is proprietary. Customers who leave Composo typically take the criteria and taxonomy and re-implement scoring in-house if they choose to. The lock-in is closer to 'losing institutional knowledge' than 'data hostage'.

Question 5

Should a small team (1-3 engineers) ever build this internally?

Accepted Answer

Rarely. Small teams have the least slack to absorb a 6-month internal build plus ongoing maintenance. The trade-off is explicit: build a homegrown eval pipeline, or ship your actual product faster? For most small teams, Composo is the shortcut to a quality layer, so the team stays focused on its real work.

Dimension	Composo	Building it yourself
Time to first value	1 week (initial failure report)	3 to 6 months (end-to-end pipeline)
Time to production	2 to 4 weeks	6 to 12 months including first production iteration
Upfront engineering cost	Contract fee, scoped per deployment	1 senior engineer × 6 months = £90-120k fully loaded
Ongoing maintenance	Included in licence	~2 engineers ongoing to keep evals current as models and product evolve
Quality at launch	Ships with failure taxonomy from 30+ prior deployments	Ships with what you remembered to evaluate in week one
Model drift handling	Built-in; corrections generalise across traces	Manual: someone must re-tune scoring after each model update
Domain-expert workload	~10 hours over 4 weeks for calibration	Ongoing weekly review forever
Alignment with human experts	90%+ in most production contexts (domain-calibrated)	~70% (basic homegrown LLM-as-judge)

Composo vs building it yourself

At a glance

Where building it yourself is strong

Where Composo is different

When to pick which

Pick building it yourself if

Pick Composo if

See what Composo catches on your own AI.