Advanced22 min

Prompt Injection Defense

Prompt injection is OWASP LLM01 for the third year running — and it's still not solved. This article gives you the threat model to decide how much defense you need, five production layers with real cost math, an eval framework to measure if they work, and a 30-day runbook to ship it.

Quick Reference

→Prompt injection = attacker-crafted content that overrides the agent's system instructions; LLMs cannot reliably distinguish instructions from data
→Rule of Two (Meta, 2025): an agent is vulnerable when it has ALL THREE of — sensitive data access, exposure to untrusted content, ability to change state. Remove one property to break the chain.
→Five defense layers in cost order: regex (free) → LLM classifier (~$0.0003/req) → instruction hierarchy (free) → tool result sanitization (free) → LLM output validator (~$0.0003/req)
→Use forced tool calls (tool_choice: required) for classifier and output validator — structured output is more reliable than parsing 'SAFE or INJECTION' strings
→Indirect injection via RAG docs, tool results, MCP servers, and API responses is more dangerous than direct user injection — all external data is a potential attack surface
→Human-in-the-loop is the single most effective defense against tool abuse from injection — gate HIGH-risk tools (send_email, delete_record) behind human approval
→Prompt injection is unsolved: PromptArmor (ICLR 2026) achieves <1% FP/FN on AgentDojo, but Anthropic's own research shows 1% attack success rate for Claude Opus 4.5 under adaptive attack
→Run an attack suite (50+ payloads from HackAPrompt and OWASP datasets) in CI — block deploys where direct_override bypass_rate > 0

When Do You Need Injection Defense?

OWASP LLM01:2025 — #1 vulnerability for the third year running

Prompt injection is the most-exploited LLM vulnerability in production. Unlike SQL injection, there is no parameterized query equivalent — LLMs process instructions and data in the same context window and cannot fully separate them by design.

Before adding defense layers, answer one question: does your agent have all three of the following properties simultaneously? Sensitive data access (PII, API keys, private context), exposure to untrusted input (user messages, RAG docs, tool results, external API responses), and the ability to change state (write tools, email sending, database writes, external calls). Meta's security team calls this the Rule of Two — an agent with all three is vulnerable to injection regardless of how carefully you write the system prompt.

Rule of Two: break the chain by restricting data access, sanitizing inputs, or removing state-changing tools

Agent Profile	Properties Present	Required Defense Layers
Read-only assistant	Sensitive data, but no state changes and controlled inputs only	Layer 3 only (instruction hierarchy)
RAG assistant	Sensitive data + untrusted RAG docs, but no write tools	Layers 1, 3, 5 (regex + hierarchy + output validator)
Tool-using agent (read tools)	All three, but tools only read data	Layers 1, 2, 3, 4, 5
Agentic system (write tools)	All three, with tools that send email, write files, call APIs	All 5 layers + human-in-the-loop for HIGH-risk tool calls

Start with layers 1 and 3 — they are free and ship in a day

Regex pattern matching and instruction hierarchy cost nothing and take 4 hours to implement. They catch the majority of commodity injection attacks. Add the LLM classifier (layer 2) only after you have production traffic and can measure its false positive rate.

Attack Surface: Where Injection Enters

5 distinct injection surfaces — indirect vectors are harder to detect and control

Defense Strategy: Layers, Cost, and Tradeoffs

No single defense stops all injection attacks. Production systems layer multiple defenses ordered from cheapest to most expensive. Each layer catches what the previous layer missed, and together they provide defense-in-depth. The diagram below shows the pipeline; the cost diagram below that shows what each layer adds to your per-request budget.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.