Prompt Injection Defense
Prompt injection is OWASP LLM01 for the third year running, and it is still not solved. This article gives you a threat model for deciding how much defense you need, five production layers with real cost math, an eval framework to measure whether they work, and a 30-day runbook to ship it.
Quick Reference
- Prompt injection = attacker-crafted content that overrides the agent's system instructions; LLMs cannot reliably distinguish instructions from data
- Rule of Two (Meta, 2025): an agent is vulnerable when it has all three of the following: sensitive data access, exposure to untrusted content, and the ability to change state. Remove one property to break the chain.
- Five defense layers in cost order: regex (free) → LLM classifier (~$0.0003/req) → instruction hierarchy (free) → tool result sanitization (free) → LLM output validator (~$0.0003/req)
- Use forced tool calls (tool_choice: required) for the classifier and the output validator; structured output is more reliable than parsing 'SAFE or INJECTION' strings
- Indirect injection via RAG docs, tool results, MCP servers, and API responses is more dangerous than direct user injection; all external data is a potential attack surface
- Human-in-the-loop is the single most effective defense against tool abuse from injection: gate HIGH-risk tools (send_email, delete_record) behind human approval
- Prompt injection is unsolved: PromptArmor (ICLR 2026) achieves <1% FP/FN on AgentDojo, yet Anthropic's own research still shows a 1% attack success rate for Claude Opus 4.5 under adaptive attack
- Run an attack suite (50+ payloads from the HackAPrompt and OWASP datasets) in CI, and block any deploy where the direct_override bypass_rate is above 0
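The CI gate in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the article's eval framework: `AttackResult`, `bypass_rates`, and `gate` are hypothetical names, and it assumes the attack suite has already run and recorded, per payload, whether the defenses were bypassed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    category: str    # e.g. "direct_override" (category names are assumed)
    payload_id: str  # identifier of the payload in the attack suite
    bypassed: bool   # True if the payload got past the defenses


def bypass_rates(results):
    """Return {category: fraction of payloads that bypassed the defenses}."""
    totals = defaultdict(int)
    bypasses = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        if r.bypassed:
            bypasses[r.category] += 1
    return {cat: bypasses[cat] / totals[cat] for cat in totals}


def gate(results, max_rate=None):
    """Return a list of failure messages; an empty list means the deploy may proceed."""
    # Assumed default: zero tolerance for direct_override only.
    limits = max_rate if max_rate is not None else {"direct_override": 0.0}
    rates = bypass_rates(results)
    failures = []
    for category, limit in limits.items():
        if category not in rates:
            # Treat a missing category as a failure so an empty suite cannot pass.
            failures.append(f"{category}: no payloads ran")
        elif rates[category] > limit:
            failures.append(
                f"{category}: bypass_rate {rates[category]:.2%} exceeds {limit:.2%}"
            )
    return failures
```

In a CI job, a wrapper script would typically exit non-zero whenever `gate(...)` returns a non-empty list. Failing on a category with no payloads is a design choice here, so that a misconfigured suite does not silently pass.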
When Do You Need Injection Defense?
Prompt injection is the most-exploited LLM vulnerability in production. Unlike SQL injection, it has no parameterized-query equivalent: by design, LLMs process instructions and data in the same context window and cannot fully separate them.
Before adding defense layers, answer one question: does your agent have all three of the following properties simultaneously? They are sensitive data access (PII, API keys, private context), exposure to untrusted input (user messages, RAG docs, tool results, external API responses), and the ability to change state (write tools, email sending, database writes, external calls). Meta's security team calls this the Rule of Two: an agent should hold at most two of these properties at once, because an agent with all three is vulnerable to injection regardless of how carefully you write the system prompt.
Rule of Two: break the chain by restricting data access, sanitizing inputs, or removing state-changing tools
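The three-property check can be written down as a small predicate. The sketch below is illustrative only: `AgentProfile` and its field names are hypothetical, and deciding whether a given agent really has each property still takes human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    sensitive_data: bool   # PII, API keys, private context
    untrusted_input: bool  # user messages, RAG docs, tool results, API responses
    changes_state: bool    # write tools, email sending, database writes


def properties_present(agent):
    """Return the names of the Rule of Two properties this agent has."""
    flags = {
        "sensitive_data": agent.sensitive_data,
        "untrusted_input": agent.untrusted_input,
        "changes_state": agent.changes_state,
    }
    return [name for name, present in flags.items() if present]


def violates_rule_of_two(agent):
    """True when the agent holds all three properties at once."""
    return len(properties_present(agent)) == 3
```

For example, `violates_rule_of_two(AgentProfile(True, True, False))` is `False`: an agent with sensitive data and untrusted input but no state-changing tools stays within the rule.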
| Agent Profile | Properties Present | Required Defense Layers |
|---|---|---|
| Read-only assistant | Sensitive data, but no state changes and controlled inputs only | Layer 3 only (instruction hierarchy) |
| RAG assistant | Sensitive data + untrusted RAG docs, but no write tools | Layers 1, 3, 5 (regex + hierarchy + output validator) |
| Tool-using agent (read tools) | Sensitive data + untrusted input + tools, but tools only read data | Layers 1, 2, 3, 4, 5 |
| Agentic system (write tools) | All three, with tools that send email, write files, call APIs | All 5 layers + human-in-the-loop for HIGH-risk tool calls |
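The table can be combined with the per-layer costs from the quick reference to estimate monthly spend for each profile. This is a rough sketch: the profile keys are hypothetical identifiers for the four rows, the costs are the approximate figures quoted above, and it assumes each required layer runs exactly once per request.

```python
# Per-request cost of each layer in USD (approximate figures from the quick reference).
LAYER_COST_USD = {
    1: 0.0,     # regex
    2: 0.0003,  # LLM classifier
    3: 0.0,     # instruction hierarchy
    4: 0.0,     # tool result sanitization
    5: 0.0003,  # LLM output validator
}

# Hypothetical keys, one per table row, mapped to the layers that row requires.
REQUIRED_LAYERS = {
    "read_only_assistant": (3,),
    "rag_assistant": (1, 3, 5),
    "tool_agent_read": (1, 2, 3, 4, 5),
    "agentic_write": (1, 2, 3, 4, 5),
}

# Profiles that also need human approval for HIGH-risk tool calls.
NEEDS_HUMAN_APPROVAL = {"agentic_write"}


def monthly_cost_usd(profile, requests_per_month):
    """Estimate monthly defense cost, assuming one pass through each layer per request."""
    per_request = sum(LAYER_COST_USD[layer] for layer in REQUIRED_LAYERS[profile])
    return per_request * requests_per_month
```

Under these assumptions, a RAG assistant at one million requests per month spends roughly $300 on the output validator, and an agent running all five layers spends roughly $600.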
Regex pattern matching (layer 1) and instruction hierarchy (layer 3) cost nothing per request and take about 4 hours to implement. They catch the majority of commodity injection attacks. Add the LLM classifier (layer 2) only after you have production traffic and can measure its false positive rate.
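A regex pre-filter of the kind described here might look like the sketch below. The patterns are illustrative assumptions, not a vetted rule set: they target a few well-known phrasings and would need tuning against your own traffic and attack suite.

```python
import re

# Illustrative patterns for common override phrasings (assumed, not exhaustive).
INJECTION_PATTERNS = [
    re.compile(
        r"\bignore\s+(?:all\s+|any\s+)?(?:previous|prior|above)\s+"
        r"(?:instructions|prompts?|rules)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bdisregard\s+(?:the\s+|your\s+)?(?:system\s+prompt|previous\s+instructions)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:reveal|print|show)\s+(?:me\s+)?(?:your|the)\s+system\s+prompt\b",
        re.IGNORECASE,
    ),
]


def regex_flags(text):
    """Return the patterns that match, so callers can log why a request was flagged."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]


def is_suspicious(text):
    """True if any pattern matches; cheap enough to run on every request."""
    return bool(regex_flags(text))
```

Returning the matched patterns rather than a bare boolean makes it easier to review false positives later. A filter like this only catches phrasings it already knows about, which is why it serves as the first layer and not the only one.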