
AI Guardrails: How to Block Bad LLM Outputs in Production

Seb Fox · CEO & Co-founder

Generic AI guardrails catch generic failures. The failures that hurt your business are specific.

A hallucinated medication in a clinical AI. A missing risk disclosure in a financial recommendation. A fabricated legal citation in a contract review. These are the failures that create real exposure. None of them match a generic hallucination pattern. None of them fire a toxicity alert. None of them look wrong to an off-the-shelf guardrail.

This guide covers what AI guardrails actually are, where generic guardrails break down, and how to deploy domain-specific guardrails that work at production latency.

What is an AI guardrail?

An AI guardrail is a runtime check on an LLM output. It runs inline, at inference time, before the output reaches a user or triggers an action. Based on a quality criterion, the guardrail allows the output, blocks it, rewrites it, or escalates to human review.
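To make the definition concrete, here is a minimal sketch of the four outcomes as a data type, with a wrapper that runs a check inline before anything is delivered. The names (`Action`, `Decision`, `apply_guardrail`) and the toy refund criterion are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REWRITE = "rewrite"
    ESCALATE = "escalate"


@dataclass
class Decision:
    action: Action
    output: Optional[str]  # text that may be delivered; None if withheld
    reason: str = ""


def apply_guardrail(output: str, check: Callable[[str], Decision]) -> Optional[str]:
    """Run the check inline and return only what is safe to deliver."""
    decision = check(output)
    if decision.action in (Action.ALLOW, Action.REWRITE):
        return decision.output
    # BLOCK and ESCALATE both withhold the output. A real system would also
    # queue escalated outputs for human review (not shown here).
    return None


def no_refund_promises(output: str) -> Decision:
    # Hypothetical quality criterion: never promise a refund.
    if "refund" in output.lower():
        return Decision(Action.BLOCK, None, "refund commitment")
    return Decision(Action.ALLOW, output)
```

The caller only ever sees the return value of `apply_guardrail`, which is what makes the check binding rather than advisory.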

Guardrails differ from offline evaluation in three important ways:

  1. Synchronous. They run before the output is delivered. Latency is a first-class concern.
  2. Binding. They change what the system does. A blocked output never reaches the user.
  3. Trace-level. They operate on individual outputs, not aggregate metrics.

A complete AI quality layer typically includes both offline evaluation (for monitoring, regression testing, and surfacing failure patterns) and runtime guardrails (for blocking specific failures in production).

What generic guardrails catch, and what they miss

Off-the-shelf guardrail libraries and services focus on a common set of categories:

  • Toxicity and profanity. Text-level classifiers trained on generic datasets.
  • PII exposure. Pattern matching and named-entity classifiers for common PII fields.
  • Prompt injection. Signature and behavioural detection of known injection patterns.
  • Generic hallucination. Citation-to-context checks, broad groundedness scoring.
  • Banned topics and keywords. Policy lists encoded as filters.

These matter. A production AI system without basic toxicity, PII, and prompt injection protection is exposed. But the failures that generic guardrails catch are not usually the failures that matter most to a specific business.

Consider the failure modes we have seen in real production deployments:

| Domain | Failure | Caught by generic guardrails? |
| --- | --- | --- |
| Clinical AI | AI scribe inserts a medication not mentioned in the consultation | No |
| Clinical AI | AI scribe omits a red-flag symptom the patient mentioned | No |
| Financial AI | Advice recommends a product without required risk disclosure | No |
| Financial AI | FX agent uses yesterday’s rate on today’s quote | No |
| Legal AI | Contract review miscategorises a change-of-control clause | No |
| Legal AI | Legal research memo cites a case that does not exist | Sometimes (if the citation is obvious enough) |
| Customer service AI | Agent commits to a refund outside policy | No |
| Customer service AI | Agent misses an escalation trigger | No |

The pattern is consistent: the failure is domain-specific and invisible to a generic classifier.
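A toy illustration of that gap, using the clinical scribe case: a generic filter (email-style PII plus a keyword list) passes a note that a simple domain check flags. The medication vocabulary, keyword list, and sample texts are made up for the example. A real domain check would need far more than word matching.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BANNED = {"damn", "idiot"}  # placeholder keyword list
KNOWN_MEDICATIONS = {"metformin", "lisinopril", "warfarin"}  # toy vocabulary


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def generic_check(text: str) -> bool:
    """Pass if the text has no email-like PII and no banned keyword."""
    return not EMAIL.search(text) and not (words(text) & BANNED)


def unsupported_medications(transcript: str, note: str) -> set[str]:
    """Medications named in the note that never appear in the consultation."""
    mentioned = words(transcript) & KNOWN_MEDICATIONS
    written = words(note) & KNOWN_MEDICATIONS
    return written - mentioned


transcript = "Patient reports taking metformin daily. No other medication."
note = "Patient on metformin and warfarin."

print(generic_check(note))                        # the generic filter passes the note
print(unsupported_medications(transcript, note))  # the domain check flags warfarin
```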

Domain-specific guardrails: the principle

A domain-specific guardrail is a quality check calibrated to the specific ways your AI fails. That means three things:

1. A failure taxonomy built from your traces

Before you can catch a failure, you have to know what it looks like. The first step in any production guardrail deployment is surfacing the failure taxonomy from the AI’s actual production outputs. What is it doing wrong? Which failures matter? Which are frequent, and which are rare but high-severity?

At Composo, we typically run this in the first week of a deployment: pull a representative sample of production traces, surface the 5 to 15 most meaningful failure patterns, and let the customer’s domain experts rank them by business impact.
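One way to prepare that ranking for the domain experts is to score each failure mode by how often it appears in the sample, weighted by severity. The sketch below assumes traces have already been labelled during review. The labels, counts, and severity scale are invented for illustration, and the frequency-times-severity score is only a starting order for the experts to adjust.

```python
from collections import Counter

# Hypothetical review labels for a sample of 10 production traces.
trace_labels = (
    ["ok"] * 6
    + ["fabricated_medication"]
    + ["missed_red_flag"]
    + ["formatting_error"] * 2
)

# Hypothetical expert-assigned severity on a 1 to 5 scale.
severity = {"fabricated_medication": 5, "missed_red_flag": 4, "formatting_error": 1}


def rank_failures(labels, severity):
    """Return (failure mode, score) pairs, highest score first."""
    total = len(labels)
    counts = Counter(label for label in labels if label != "ok")
    scored = [(mode, n / total * severity[mode]) for mode, n in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for mode, score in rank_failures(trace_labels, severity):
    print(f"{mode}: {score:.2f}")
```

In this sample the formatting error is the most frequent failure but ranks last, which is the point of weighting by severity.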

2. An evaluation model that can detect those failures

Once the failure modes are known, the guardrail needs to be able to detect them. This is where generic LLM-as-judge plateaus. Basic LLM-as-judge typically reaches around 70% alignment with human domain experts. That is not good enough for a production guardrail that gates customer-facing outputs.

Techniques that materially move accuracy forward include:

  • Criteria ensembling. Multiple specialised criteria rather than one monolithic judge prompt.
  • Variance-informed calibration. Using the distribution of scores across runs to identify low-confidence decisions.
  • Reward modelling. A model trained specifically on evaluation, rather than a general-purpose LLM prompted to evaluate.
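The first two techniques can be sketched together. Each criterion is scored separately over several runs, and the spread of those scores decides whether the verdict is trusted or routed to review. This is a simplified illustration, not Composo's implementation: the thresholds, criterion names, and scores are assumptions, and the scores stand in for whatever evaluator produces them.

```python
from statistics import mean, pstdev


def decide(scores_by_criterion, pass_threshold=0.7, max_spread=0.15):
    """Combine per-criterion scores from repeated runs into one decision."""
    verdicts = {}
    for criterion, runs in scores_by_criterion.items():
        if pstdev(runs) > max_spread:
            verdicts[criterion] = "uncertain"  # runs disagree: low confidence
        elif mean(runs) >= pass_threshold:
            verdicts[criterion] = "pass"
        else:
            verdicts[criterion] = "fail"

    if "fail" in verdicts.values():
        overall = "block"
    elif "uncertain" in verdicts.values():
        overall = "escalate"
    else:
        overall = "allow"
    return overall, verdicts


# Hypothetical scores for one clinical note, three runs per criterion.
scores = {
    "medication_supported": [0.90, 0.95, 0.90],
    "red_flags_captured": [0.90, 0.40, 0.85],
}
print(decide(scores))
```

Here the first criterion passes cleanly, while the second has scores that disagree across runs, so the output is escalated rather than allowed on an unreliable average.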

The measure that matters is alignment with human domain experts. Using these techniques, Composo’s evaluation reaches 90%+ alignment with human experts across most production contexts. Benchmark studies on public datasets like RewardBench 2 show a similar gap over baseline LLM-as-judge (roughly 13 percentage points), but the operative number for deployment is the human-expert alignment rate on your actual domain.
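Alignment here can be measured as simple agreement between guardrail verdicts and expert verdicts on the same traces. A minimal sketch with made-up labels follows. Splitting the disagreements into false blocks and missed failures is an addition for illustration, since the two kinds of error have different costs.

```python
def alignment_report(guardrail_verdicts, expert_verdicts):
    """Compare guardrail verdicts ("pass"/"fail") with expert verdicts."""
    if len(guardrail_verdicts) != len(expert_verdicts) or not expert_verdicts:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(guardrail_verdicts, expert_verdicts))
    return {
        "alignment": sum(g == e for g, e in pairs) / len(pairs),
        # Guardrail failed an output the expert would have passed.
        "false_blocks": sum(g == "fail" and e == "pass" for g, e in pairs),
        # Guardrail passed an output the expert would have failed.
        "missed_failures": sum(g == "pass" and e == "fail" for g, e in pairs),
    }


# Hypothetical verdicts on ten traces from one domain.
expert = ["pass"] * 6 + ["fail"] * 4
guardrail = ["pass"] * 6 + ["fail"] * 3 + ["pass"]
print(alignment_report(guardrail, expert))
```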

3. Inline latency

A guardrail that takes 10 seconds is not a production guardrail. It is an async check you are pretending is inline.

For interactive chat and agent systems, the practical latency budget is in the 200 to 800 millisecond range. This requires:

  • Efficient model inference (small, fast evaluators or batched calls to frontier models)
  • Parallel criterion evaluation (not sequential)
  • Fast failure signalling (early exit when a clear pass or fail is detected)
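The second and third points can be sketched with `asyncio`: all criteria start at once, the first failure ends the evaluation, and a timeout enforces the budget. The two criteria below are hypothetical stand-ins with simulated delays. Escalating on timeout is one possible policy, assumed here for illustration.

```python
import asyncio


async def run_criteria(output, criteria, budget_s=0.8):
    """Evaluate all criteria concurrently and stop at the first failure."""
    tasks = [asyncio.create_task(criterion(output)) for criterion in criteria]
    try:
        # Results arrive in completion order, so a fast failing criterion
        # ends the evaluation without waiting for slower ones.
        for finished in asyncio.as_completed(tasks, timeout=budget_s):
            if not await finished:
                return "block"
        return "allow"
    except asyncio.TimeoutError:
        return "escalate"  # latency budget exceeded before all criteria finished
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already completed


async def fast_policy_check(output):
    await asyncio.sleep(0.01)  # simulated evaluator latency
    return "refund" not in output.lower()


async def slow_tone_check(output):
    await asyncio.sleep(0.2)  # simulated evaluator latency
    return True


checks = [fast_policy_check, slow_tone_check]
print(asyncio.run(run_criteria("I will refund you today.", checks)))
print(asyncio.run(run_criteria("Thanks for your patience.", checks)))
```

With sequential evaluation the total latency is the sum of the criteria. Run concurrently, it is bounded by the slowest criterion or the budget, whichever comes first.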

We run Composo’s guardrails on every production call for the customers who need them, typically in 200 to 600 milliseconds end to end depending on evaluation complexity.

Guardrails at the tool-call boundary

For agent systems - anything built on LangGraph, custom agent frameworks, or multi-step tool use - the most valuable place for a guardrail is often at the tool-call boundary, not at the final response.

Example: an agent is processing a customer service request. It generates a tool call to issue a refund. Before that tool call executes, a guardrail evaluates whether the refund is within policy, whether the customer is eligible, and whether the reason given is consistent with the refund policy.

If it fails, the tool call is blocked. The agent replans.
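A minimal sketch of that flow, independent of any agent framework. The refund limit, eligible reasons, tool name, and argument names are all hypothetical. A blocked call returns a structured result that the agent can use to replan.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


REFUND_LIMIT = 100.0                            # hypothetical policy limit
ELIGIBLE_REASONS = {"damaged", "not_delivered"}  # hypothetical policy reasons


def check_refund(call: ToolCall) -> tuple[bool, str]:
    """Return (ok, reason) for a proposed refund."""
    if call.args["amount"] > REFUND_LIMIT:
        return False, "amount exceeds refund limit"
    if call.args["reason"] not in ELIGIBLE_REASONS:
        return False, "reason not covered by refund policy"
    return True, ""


def execute_with_guardrail(call: ToolCall, tools: dict, guards: dict) -> dict:
    """Run the guard for this tool, if any, before executing the tool."""
    guard = guards.get(call.name)
    if guard is not None:
        ok, reason = guard(call)
        if not ok:
            # The tool never runs; the agent sees why and can replan.
            return {"status": "blocked", "reason": reason}
    return {"status": "ok", "result": tools[call.name](**call.args)}


issued = []  # records refunds that actually executed


def issue_refund(amount: float, reason: str) -> str:
    issued.append(amount)
    return f"refunded {amount:.2f}"


tools = {"issue_refund": issue_refund}
guards = {"issue_refund": check_refund}

blocked = execute_with_guardrail(
    ToolCall("issue_refund", {"amount": 500.0, "reason": "damaged"}), tools, guards
)
allowed = execute_with_guardrail(
    ToolCall("issue_refund", {"amount": 20.0, "reason": "damaged"}), tools, guards
)
```

Only the second call reaches `issue_refund`, so `issued` ends up containing a single refund.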

This is what 5u, a healthcare voice AI company, does with Composo in production. Their CTO Fehmi Sener:

“We embedded Composo into our AI Workers from day one. Best decision we’ve made on testing. They provide peace of mind for us and our customers.”

Roughly 50% of tool calls in their system fail the domain-specific quality bar and get blocked before they execute. That is not a failure of the agent - it is the guardrail working as designed, catching specific tool calls that would have produced the wrong outcome.

When offline evaluation is not enough

Some teams default to offline evaluation as their entire quality story. They run evals in CI/CD, they sample production traces for review, they fix what they find. This works when failures are rare, low-severity, or easily caught in batch review.

It stops working when:

  • Failures have customer-facing consequences. A wrong medical instruction, a wrong financial recommendation, or a policy-breaching commitment cannot be undone by catching it after the fact.
  • The domain is regulated. Compliance rules often require that specific outputs cannot leave the system. Offline monitoring catches violations; it does not prevent them.
  • The failure rate is meaningful. If 10% of outputs fail on a specific dimension, letting them ship and catching them later is too expensive.

For those cases, runtime guardrails are not optional. They are the difference between “we know our AI fails sometimes” and “we prevent our AI from failing where it matters.”

How to deploy AI guardrails: a checklist

  1. Surface the failure taxonomy. Before choosing or building guardrails, get the list of failure modes from production traces. Rank by business impact.
  2. Define the scope of the guardrail. Final response only? Every tool call? Specific high-risk tool calls? The right scope depends on where failures actually occur.
  3. Choose an evaluation approach that can reach production accuracy. Generic LLM-as-judge is usually insufficient for production gating. Reward models, criteria ensembling, and domain-calibrated evaluators are the approaches that can cross the threshold.
  4. Validate latency end to end. Measure p50 and p99 latency of the guardrail under production load, not just average on a test set.
  5. Plan for failures of the guardrail itself. A guardrail is a model. It will sometimes be wrong. Have a path for false positives (allow override with reason) and false negatives (periodic sampling to detect missed failures).
  6. Measure business impact. Track how often the guardrail fires, what it catches, and what customer outcomes change. A guardrail that never fires is not helping. A guardrail that fires constantly is probably over-triggering.
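For step 4, the reason to report percentiles rather than an average can be seen in a few lines. The sketch below uses a nearest-rank percentile over invented latency samples, where a small share of slow calls barely moves the mean but dominates p99.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical guardrail latencies in milliseconds: 98 fast calls, 2 slow ones.
latencies_ms = [300] * 98 + [2500] * 2

print("mean:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

In this sample the mean and p50 both look comfortable, while p99 shows that some users wait several seconds.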

What Composo deploys

Composo’s runtime guardrails are calibrated to the customer’s failure taxonomy during a 2 to 4 week deployment. After that, the same evaluation model that scores traces offline runs inline as a runtime gate. Latency is sub-second. Integration is at whatever boundary makes sense for the customer: final response, tool call, intermediate step.

If your AI is in production and you need to move from “we know it sometimes fails” to “we prevent the failures that matter,” a diagnostic call is the quickest way to see what Composo would catch on your specific traces.

See how Composo deploys runtime guardrails or book a diagnostic.

Frequently asked questions

What is an AI guardrail?

An AI guardrail is a runtime check that inspects an LLM output at inference time and blocks, modifies, or allows it based on a quality criterion. Unlike offline evaluation, guardrails run inline on the live system and affect what the user sees.

What is the difference between AI guardrails and AI evaluation?

Evaluation is typically offline and asynchronous - scoring outputs after the fact, usually for monitoring or regression testing. Guardrails are online and synchronous - blocking or modifying outputs before they reach a user. Both are part of a complete AI quality layer.

Do generic AI guardrails work for regulated industries?

Rarely. Generic guardrails catch toxicity, PII, and obvious hallucination patterns. They do not catch domain-specific failures like a wrong medication dosage, a missing risk disclosure, or a fabricated legal citation. Regulated industries need guardrails calibrated to their specific failure taxonomy.

How fast can AI guardrails run in production?

A well-optimised domain-specific guardrail can run in 200 to 600 milliseconds, which is fast enough for interactive chat, agent tool calls, and most synchronous AI use cases.

Should I build or buy AI guardrails?

A basic guardrail (regex, LLM-as-judge on a single dimension) takes a day to build. A production-grade guardrail that catches domain-specific failures at low latency typically takes 3 to 6 months to build internally, plus ongoing maintenance. Buying makes sense when time to production matters more than full stack ownership.