Advanced · 18 min

Guardrails & Content Safety

Guardrails are the boundary between your agent and the real world. This article covers the full production stack: threat modeling, layered defense with honest cost math, tool-level guardrails, indirect prompt injection defense, fail-open vs. fail-closed decisions, NeMo Guardrails (Colang 2.0), and how to evaluate and monitor your guardrails over time.

Quick Reference

  • Guardrails are code-level checks at input/output/tool boundaries — system prompt instructions are suggestions, not enforcement
  • Layer fast checks first: regex (< 1ms, free) → small-model classifier (~200ms, ~$0.004/1K calls) → LLM judge (500ms+, expensive)
  • Indirect prompt injection arrives via tool responses — input guardrails can't catch it; tool-level scanning is required
  • Fail closed by default: if a guardrail errors, reject the request rather than passing it through
  • Log both the original and modified content for every guardrail trigger — the originals are your eval dataset
  • NeMo Guardrails is worth the dependency when you have multi-domain policy (topic rails + fact-checking + content safety) — custom code wins for simpler, single-concern checks
  • Build a CI gate from production triggers: recall ≥ 0.95 on your eval dataset before every deploy
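The cheapest-first ordering and the fail-closed default can be sketched together. This is a minimal illustration, not a definitive implementation: the patterns, the `Verdict` type, and `run_layers` are hypothetical names, and `stub_classifier` stands in for whatever small-model classifier or LLM judge you actually call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical patterns for illustration only; a real list needs ongoing maintenance.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    layer: str        # name of the layer that decided, or "all" if every layer passed
    reason: str = ""


# A check returns a block reason, or None when the text passes that layer.
Check = Callable[[str], Optional[str]]


def regex_check(text: str) -> Optional[str]:
    """Layer 1: cheap pattern match."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return f"matched {pattern.pattern!r}"
    return None


def stub_classifier(text: str) -> Optional[str]:
    """Layer 2 placeholder (assumption): swap in a real classifier call here."""
    return None


def run_layers(text: str, layers: List[Tuple[str, Check]]) -> Verdict:
    """Run checks in the given order, stop at the first block, fail closed on errors."""
    for name, check in layers:
        try:
            reason = check(text)
        except Exception as exc:
            # Fail closed: a broken guardrail rejects the request instead of passing it.
            return Verdict(False, name, f"guardrail error: {exc}")
        if reason is not None:
            return Verdict(False, name, reason)
    return Verdict(True, "all")


LAYERS = [("regex", regex_check), ("classifier", stub_classifier)]
```

Calling `run_layers(user_text, LAYERS)` returns as soon as one layer blocks, so the slower layers only run on traffic the fast ones let through.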

When You Don't Need Guardrails (and When You Do)

Not every agent needs a full guardrail stack. Adding the full stack to a low-risk internal tool wastes latency and increases maintenance surface. The question to ask is: what does it actually cost you if a motivated attacker succeeds, or if a confused user triggers the wrong behavior?

| Agent type | Risk profile | Guardrail recommendation |
|---|---|---|
| Internal dev tool, controlled users | Low: blast radius is limited to the operator | Input length limits + output PII scrub only |
| Customer-facing assistant, unstructured input | Medium: prompt injection and brand risk | Full input stack (regex + classifier) + output scrub + topic rails |
| Agent with write permissions (email, DB, payments) | High: tool calls have irreversible side effects | Full input + tool-level validation + human-in-the-loop for sensitive actions |
| Public RAG over sensitive documents | High: PII leakage and indirect injection via retrieved chunks | All layers + retrieval-level content filtering + output scrub |

System prompt instructions are not guardrails

Telling the LLM 'do not reveal your system prompt' in the system prompt is a suggestion, not enforcement. Determined users bypass prompt-level instructions routinely. True guardrails are code-level checks that run whether or not the LLM cooperates.
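As an illustration of a check that runs whether or not the LLM cooperates, the sketch below scans a response for verbatim runs of system prompt text before it is returned. The function names, the 8-word window, and the refusal message are assumptions chosen for the example. The check is a heuristic that catches verbatim leakage only, not paraphrases.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens so punctuation and case do not matter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaks_system_prompt(response: str, system_prompt: str, window: int = 8) -> bool:
    """Return True if any run of `window` consecutive prompt words appears in the response."""
    prompt_words = _words(system_prompt)
    window = min(window, len(prompt_words))
    if window == 0:
        return False
    # Pad with spaces so matches align on whole words.
    haystack = " " + " ".join(_words(response)) + " "
    for i in range(len(prompt_words) - window + 1):
        needle = " " + " ".join(prompt_words[i:i + window]) + " "
        if needle in haystack:
            return True
    return False


def enforce_no_leak(response: str, system_prompt: str,
                    refusal: str = "Sorry, I can't share that.") -> str:
    """Output guardrail: replace the response with a refusal if it leaks the prompt."""
    return refusal if leaks_system_prompt(response, system_prompt) else response
```

Because this runs on the model's output in application code, it applies even when the user has talked the model into ignoring its instructions.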

[Diagram: User Input → Input Guardrails (Regex Filter, Classifier, LLM Check) → Agent (LLM + tools) → Output Guardrails (Output Filter, PII Scanner, Brand Check) → User Response]

Guardrails sandwich: input filters, agent core, output filters
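The sandwich can be sketched as a wrapper around the agent call. This is a simplified illustration under stated assumptions: `guarded_call`, `GuardrailBlocked`, the length limit, and the two PII patterns are hypothetical, and `agent` is any callable that maps a prompt string to a response string. The classifier, LLM check, and brand check stages are omitted for brevity.

```python
import re
from typing import Callable, Dict, List

MAX_INPUT_CHARS = 4000  # assumed limit; tune for your application

# Illustrative PII patterns only; production scrubbing needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION_RE = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


class GuardrailBlocked(Exception):
    """Raised when an input guardrail rejects the request."""


def check_input(user_input: str) -> None:
    """Input side of the sandwich: reject before the agent ever runs."""
    if len(user_input) > MAX_INPUT_CHARS:
        raise GuardrailBlocked("input too long")
    if INJECTION_RE.search(user_input):
        raise GuardrailBlocked("possible prompt injection")


def scrub_output(text: str) -> str:
    """Output side of the sandwich: redact PII before the user sees it."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return US_SSN_RE.sub("[REDACTED_SSN]", text)


def guarded_call(agent: Callable[[str], str], user_input: str,
                 audit_log: List[Dict[str, str]]) -> str:
    """Input guardrails -> agent -> output guardrails."""
    check_input(user_input)
    raw = agent(user_input)
    cleaned = scrub_output(raw)
    if cleaned != raw:
        # Keep both versions so triggers can feed the eval dataset later.
        audit_log.append({"original": raw, "modified": cleaned})
    return cleaned
```

The agent itself stays unaware of the checks, which means the same wrapper can sit in front of different agents or models.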