Guardrails & Content Safety
Guardrails are the boundary between your agent and the real world. This article covers the full production stack: threat modeling, layered defense with honest cost math, tool-level guardrails, indirect prompt injection defense, fail-open vs. fail-closed decisions, NeMo Guardrails (Colang 2.0), and how to evaluate and monitor your guardrails over time.
Quick Reference
- Guardrails are code-level checks at input/output/tool boundaries; system prompt instructions are suggestions, not enforcement
- Layer fast checks first: regex (< 1 ms, free) → small-model classifier (~200 ms, ~$0.004/1K calls) → LLM judge (500 ms+, expensive)
- Indirect prompt injection arrives via tool responses, so input guardrails can't catch it; tool-level scanning is required
- Fail closed by default: if a guardrail errors, reject the request rather than passing it through
- Log both the original and modified content for every guardrail trigger; the originals are your eval dataset
- NeMo Guardrails is worth the dependency when you have a multi-domain policy (topic rails + fact-checking + content safety); custom code wins for simpler, single-concern checks
- Build a CI gate from production triggers: require recall ≥ 0.95 on your eval dataset before every deploy
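The layering and fail-closed points above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the pattern list, the `Verdict` type, the `check_input` function, and the `classifier` callable (standing in for whatever small-model classifier you deploy) are all hypothetical names introduced here.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns for illustration only; a real deployment
# would maintain and evaluate its own list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    layer: str   # which layer made the decision
    reason: str


def regex_layer(text: str) -> bool:
    """Return True if the text matches a known-bad pattern."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def check_input(
    text: str,
    classifier: Callable[[str], float],  # assumed to return a risk score in [0, 1]
    threshold: float = 0.5,
) -> Verdict:
    # Layer 1: regex. Cheapest check, so it runs first and short-circuits.
    if regex_layer(text):
        return Verdict(False, "regex", "matched known injection pattern")

    # Layer 2: classifier. Only reached when the regex layer passes.
    try:
        score = classifier(text)
    except Exception as exc:
        # Fail closed: a guardrail error rejects the request
        # instead of letting it through unchecked.
        return Verdict(False, "classifier", f"guardrail error: {type(exc).__name__}")

    if score >= threshold:
        return Verdict(False, "classifier", f"score {score:.2f} >= {threshold}")
    return Verdict(True, "classifier", "passed all layers")
```

An LLM judge would slot in as a third layer after the classifier, following the same pattern. Returning the deciding layer in the verdict makes it straightforward to log which check fired alongside the original input.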
When You Don't Need Guardrails (and When You Do)
Not every agent needs a full guardrail stack. Adding guardrails to a low-risk internal tool costs latency and increases maintenance surface for little benefit. The question to ask is: what does it actually cost you when a motivated attacker succeeds, or when a confused user misuses the agent?
| Agent type | Risk profile | Guardrail recommendation |
|---|---|---|
| Internal dev tool, controlled users | Low — blast radius is limited to the operator | Input length limits + output PII scrub only |
| Customer-facing assistant, unstructured input | Medium — prompt injection and brand risk | Full input stack (regex + classifier) + output scrub + topic rails |
| Agent with write permissions (email, DB, payments) | High — tool calls have irreversible side effects | Full input + tool-level validation + human-in-the-loop for sensitive actions |
| Public RAG over sensitive documents | High — PII leakage and indirect injection via retrieved chunks | All layers + retrieval-level content filtering + output scrub |
Telling the LLM 'do not reveal your system prompt' in the system prompt is a suggestion, not enforcement. Determined users bypass prompt-level instructions routinely. True guardrails are code-level checks that run whether or not the LLM cooperates.
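As one illustration of a code-level check, the sketch below scans the model's output for verbatim runs of words from the system prompt and replaces the response if it finds one. The function names, the six-word window, and the refusal string are assumptions made for this example. Exact-match scanning will miss paraphrased or encoded leaks, so treat it as a cheap first layer rather than a complete defense.

```python
import re

REFUSAL = "Sorry, I can't share that."  # hypothetical fallback message


def _words(text: str) -> list[str]:
    """Lowercase and split into word tokens so punctuation and case don't matter."""
    return re.findall(r"[a-z0-9']+", text.lower())


def leaks_system_prompt(output: str, system_prompt: str, window: int = 6) -> bool:
    """Return True if any `window` consecutive prompt words appear in the output."""
    prompt_words = _words(system_prompt)
    if len(prompt_words) < window:
        # Very short prompt: compare it as a single run.
        window = max(len(prompt_words), 1)

    # Pad with spaces so matches align on word boundaries.
    haystack = " " + " ".join(_words(output)) + " "
    for i in range(len(prompt_words) - window + 1):
        needle = " " + " ".join(prompt_words[i:i + window]) + " "
        if needle in haystack:
            return True
    return False


def guard_output(output: str, system_prompt: str) -> str:
    """Runs on every response, whether or not the LLM followed its instructions."""
    if leaks_system_prompt(output, system_prompt):
        return REFUSAL
    return output
```

Because `guard_output` runs in application code after generation, it holds even when the model has been talked out of its instructions.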
Figure: The guardrails sandwich (input filters, agent core, output filters).