Build vs Buy: Should You Build Your Own LLM Evaluation Pipeline?

Seb Fox · CEO & Co-founder

You can build your own LLM evaluation pipeline. It will take your best engineer six months, they will hate every minute of it, and they will miss patterns that evaluation vendors have already catalogued. For most teams, that trade is not worth it.

This post is the honest version of the build-vs-buy conversation for LLM evaluation. No vendor marketing, just the numbers.

The cheap part is starting. The expensive part is everything after.

A senior ML engineer can vibe-code a basic LLM-as-judge evaluator in a day. Prompt a model to score outputs 1 to 5. Run it in CI/CD. Log the results. Shipped.
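To make the "one day" claim concrete, here is a minimal sketch of that kind of evaluator. Everything in it is an assumption for illustration: the prompt wording, the `judge` and `run_suite` names, the 4.0 pass threshold, and the `fake_model` stand-in, which you would swap for a call to whichever model provider you use.

```python
import json
import re
from typing import Callable

# Hypothetical judge prompt; the wording is illustrative, not a recommendation.
JUDGE_PROMPT = (
    "Rate the answer to the question on a scale of 1 (poor) to 5 (excellent).\n"
    "Reply with a single digit.\n\n"
    "Question: {question}\nAnswer: {answer}\n"
)


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> int:
    """Ask the model for a score and parse the first digit from 1 to 5 in its reply."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"judge returned no score: {reply!r}")
    return int(match.group())


def run_suite(cases, call_model, min_mean=4.0):
    """Score every case, log one JSON line per case, return (mean score, passed)."""
    scores = []
    for case in cases:
        score = judge(case["question"], case["answer"], call_model)
        print(json.dumps({"question": case["question"], "score": score}))
        scores.append(score)
    mean = sum(scores) / len(scores)
    return mean, mean >= min_mean


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Score: 4"


cases = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "What is 2 + 2?", "answer": "4"},
]
mean_score, passed = run_suite(cases, fake_model)
# In CI, a failing suite would typically end the job with a non-zero exit code.
```

Note what is absent: nothing here checks whether the judge's scores agree with a human expert's.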

This is the trap.

The easy part creates the impression that LLM evaluation is a weekend project. It is not. The hard parts are:

  1. Alignment with human experts. A basic homegrown LLM-as-judge plateaus around 70% alignment with human domain experts. That is not sufficient for production gating. Reaching 90%+ alignment requires real research: criteria ensembling, reward modelling, variance-informed calibration. Months of work, not weeks.
  2. Domain calibration. Generic evaluators miss domain-specific failures. Catching a missed medication, a stale FX rate, or a fabricated legal citation requires a failure taxonomy built from your actual production traces and encoded into evaluation criteria. This is domain-expert time, not engineering time.
  3. Maintenance. Your product evolves. Your models change. Your domain drifts. An evaluation system written in week one is already stale by month three. Keeping it current takes roughly two engineers on an ongoing basis.
  4. Production guardrails. Moving from offline scoring to inline guardrails introduces latency and infrastructure concerns that most internal builds never address.
  5. Drift detection. When the model you are evaluating gets swapped (GPT-4.1 → GPT-5, Claude 4.5 → Claude 4.6), evaluation quality often shifts. Homegrown systems typically do not notice.
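The alignment figures in point 1 only mean something once you fix how alignment is measured. As a sketch, assume it means the share of cases where the judge's pass/fail verdict matches a human expert's verdict on the same output. That definition is an assumption, and so are the function names and the sample labels below. Cohen's kappa is included as a standard chance-corrected companion to raw agreement.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for ten outputs, purely for illustration.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
model = ["pass", "pass", "pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]

alignment = agreement_rate(model, human)
kappa = cohens_kappa(model, human)
print(f"agreement={alignment:.2f} kappa={kappa:.2f}")
```

Raw agreement flatters a judge when most outputs pass anyway, which is why the chance-corrected number is worth tracking alongside it.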

The real numbers

Based on patterns from 30+ production deployments, here is what a realistic build looks like:

Minimum viable internal eval (3 months)

  • 1 senior ML engineer × 3 months, fully loaded: £45,000 - £60,000
  • What you get: LLM-as-judge in CI/CD, basic dashboards, ~72% accuracy
  • What you do not get: domain calibration, production guardrails, drift handling

Production-grade internal eval (6 to 12 months)

  • 1 senior ML engineer × 6 months (the low end of that range): £90,000 - £120,000
  • Part-time domain expert: equivalent of £40,000 - £60,000 of their time
  • Total phase 1 cost: £130,000 - £180,000
  • What you get: domain-calibrated evaluation at production scale, failure taxonomy, maybe runtime guardrails
  • What you do not get: maintenance. That is the next budget.

Ongoing maintenance (once at scale)

  • ~2 engineers keeping evaluations current as models, product, and domain drift: £300,000 - £400,000 per year fully loaded
  • Ongoing domain-expert review: ~0.2 - 0.5 FTE
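The figures above can be added up as a rough worked example. The cost ranges come straight from this post; the two-year maintenance horizon is an assumption chosen for illustration, and the sum leaves out the ongoing domain-expert review (not priced above) and opportunity cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRange:
    """A low/high cost estimate in pounds."""
    low: int
    high: int

    def __add__(self, other: "CostRange") -> "CostRange":
        return CostRange(self.low + other.low, self.high + other.high)

    def times(self, n: int) -> "CostRange":
        return CostRange(self.low * n, self.high * n)

    def __str__(self) -> str:
        return f"£{self.low:,} - £{self.high:,}"


# Figures quoted in this post.
engineer_six_months = CostRange(90_000, 120_000)
domain_expert = CostRange(40_000, 60_000)
maintenance_per_year = CostRange(300_000, 400_000)

# Assumption: two years of maintenance after the build. Adjust to your own horizon.
MAINTENANCE_YEARS = 2

phase_one = engineer_six_months + domain_expert
total = phase_one + maintenance_per_year.times(MAINTENANCE_YEARS)

print(f"Phase 1: {phase_one}")
print(f"Phase 1 plus {MAINTENANCE_YEARS} years of maintenance: {total}")
```

Changing `MAINTENANCE_YEARS` shows how quickly maintenance dominates the one-off build cost.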

Opportunity cost

This is the line most build-vs-buy calculations miss. Your best ML engineer spending 6 months on evaluation infrastructure is your best ML engineer not shipping product features, not working on your core model, not doing the thing that actually differentiates your company.

For most teams whose product is not LLM evaluation, that opportunity cost is larger than the vendor fee.

When building is genuinely the right call

Building LLM evaluation internally is the right call in specific cases:

1. Evaluation IS your product

Research organisations, evaluation vendors, AI labs whose core output is methodology. If you are Anthropic, OpenAI, Scale, or an academic lab, you have to build this. It is the work.

2. You have very strong internal research capability

A handful of companies have world-class evaluation research teams already. Meta’s FAIR, DeepMind, Anthropic, Microsoft Research. If you are one of them, internal build is strategically obvious.

3. You have a 6 to 12 month runway before needing production-grade quality

Startups with 12 months of cash and no AI shipping yet can afford to spend 6 of those months on evaluation. Most startups cannot.

4. You have senior engineers you can dedicate long-term

A one-time build with no maintenance plan will be outdated within 6 months. The ongoing commitment has to be realistic.

For everyone else, which is most teams building on top of LLMs, buying is usually right.

What open-source covers (and does not)

Open-source evaluation frameworks exist and are good. The honest answer on each:

Promptfoo, DeepEval, Ragas. Excellent for offline CI/CD evaluation, prompt regression testing, and simple LLM-as-judge setups. Use them. They solve a real problem.

What they do not solve:

  • Domain-specific failure detection beyond generic categories
  • Production-scale guardrails at sub-second latency
  • Learning from domain-expert corrections
  • Drift detection when your underlying model changes
  • The calibration work that takes you from 72% to 85% accuracy
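The drift-detection gap in the list above can be sketched as a simple check. Assume you keep a small human-labelled sample of outputs from the old model and collect another from the new model after a swap, then compare how often the judge agrees with the humans on each. The function names, the 0.05 tolerance, and the sample labels are all hypothetical.

```python
def agreement(judge_labels, human_labels):
    """Fraction of cases where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)


def alignment_drop(old_sample, new_sample):
    """Each sample is a (judge_labels, human_labels) pair; returns old minus new agreement."""
    return agreement(*old_sample) - agreement(*new_sample)


def evaluator_drifted(old_sample, new_sample, max_drop=0.05):
    """Flag drift when judge-human agreement falls by more than max_drop."""
    return alignment_drop(old_sample, new_sample) > max_drop


# Made-up labels: outputs from the old model, then from the model that replaced it.
old_sample = (
    ["pass", "fail", "pass", "pass", "fail"],  # judge verdicts
    ["pass", "fail", "pass", "pass", "fail"],  # human verdicts
)
new_sample = (
    ["pass", "pass", "pass", "fail", "pass"],  # judge verdicts
    ["pass", "fail", "pass", "fail", "fail"],  # human verdicts
)

drifted = evaluator_drifted(old_sample, new_sample)
print(f"drop={alignment_drop(old_sample, new_sample):.2f} drifted={drifted}")
```

Five labels per sample is far too few to trust in practice; the point is only that the check needs fresh human labels after every model swap, which is the part homegrown systems tend to skip.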

You can combine open-source tooling for offline eval with a commercial quality layer for production. Many Composo customers do exactly this.

Arize Phoenix. Strong open-source observability and tracing. Use it if you need visibility. The eval module is useful for simple cases. For domain-specific failure detection at production scale, it is not the same as a calibrated evaluation system.

The FDE (forward-deployed engineer) model

Composo deploys with a forward-deployed engineer who embeds with the customer for 2 to 4 weeks. The deployment produces:

  • A failure taxonomy built from the customer’s actual production traces
  • A domain-calibrated evaluation model
  • Runtime guardrails configured at whatever boundary makes sense
  • An operational handover so the customer’s own team runs the system

Total cost: depends on scope, but is usually meaningfully less than a 6-month internal build. Total time: 2 to 4 weeks, not 6 to 12 months. Total ongoing commitment from the customer: roughly 10 hours of domain-expert time during calibration, then light-touch operation afterwards.

The comparison is not “pay a vendor fee forever vs spend a one-time engineering cost.” It is “get a production-grade quality layer in weeks, or build it yourself over a year.”

The framing that clarifies the decision

Dave Holmes-Kinsella, who runs AI at Infinitus, put the build-vs-buy decision as well as anyone we’ve spoken to:

“We don’t need to be best in class in evaluating phone calls. Just know. That’s what you will do. Opportunity cost for some very scarce engineering resources.”

“We want the dashboard with the dials. That tells us stuff. We don’t wanna have to build it.”

That points to the right question: is “world-class internal evaluation infrastructure” actually strategic for your company? For a tiny minority of companies, yes. For most, it is the thing you pay for so your best engineers can work on your actual product.

The shortest version of this post

Building your own evaluation is cheap to start and expensive to maintain. Most homegrown evaluation plateaus at an accuracy that is not sufficient for production gating. Generic LLM-as-judge misses domain-specific failures. The 6 to 12 months of engineering time costs more than the vendor fee, and the opportunity cost exceeds both.

Unless evaluation is your product or you have genuine research capability, buying is usually right.

For a diagnostic that shows you what failures Composo would catch on your specific traces in under a week, book here. Or read the Composo vs building it yourself comparison for a side-by-side.

Frequently asked questions

How much does it cost to build an internal LLM evaluation pipeline?

A minimum viable internal evaluation setup costs around £45,000 - £60,000 fully loaded (one senior ML engineer over 3 months). A production-grade, domain-calibrated evaluation system typically requires 6 to 12 months of engineering time plus ongoing maintenance of roughly two engineers once at scale.

When does it make sense to build LLM evaluation internally?

Building makes sense when evaluation quality IS the product (for research orgs, eval vendors, or AI labs), when you already have very strong internal research capability, when you have a 6 to 12 month runway before needing production-grade quality, and when you have senior engineers you can dedicate to long-term maintenance. For most teams using AI as a component of their product, it does not make sense.

What is the human-expert alignment ceiling of a homegrown LLM-as-judge?

Most homegrown LLM-as-judge systems plateau around 70% alignment with human domain experts. A production-grade evaluation platform like Composo reaches 90%+ alignment with human experts in most contexts, but the techniques required (criteria ensembling, reward modelling, domain calibration) take months of specialised research to implement internally.

What do open-source evaluation frameworks cover, and what do they miss?

Open-source options (Promptfoo, DeepEval, Phoenix, Ragas) are excellent for offline CI/CD-style evaluation. They do not solve the harder problems: catching domain-specific failures that generic evaluators miss, calibrating to your failure taxonomy, running production-scale guardrails, and handling drift.

Is a bought LLM evaluation platform locked in?

Partially. Your evaluation criteria, failure taxonomy, and annotated calibration data are your property and exportable. The underlying scoring models are usually proprietary. The practical lock-in is the institutional knowledge of how to evaluate your domain, not data hostage-taking.