Caching Strategies at Scale
Three cache layers that reduce LLM costs and latency in production agents: provider prompt caching, semantic caching, and tool result caching — with cost math, failure modes, and the decision framework most articles skip.
Quick Reference
- →Provider prompt caching: static prefix > 2K tokens gets cached by the provider; reads cost 0.1× base input, writes cost 1.25× (5 min) or 2× (1 hour)
- →Anthropic: requires cache_control on content blocks, max 4 breakpoints per request, GA — no beta header needed
- →OpenAI: fully automatic, no code changes, 50% discount on cached tokens (vs Anthropic's 90%)
- →Semantic caching: embed the query, find cosine-similar past queries, return cached response — skips the LLM call entirely; use threshold ≥ 0.92 or you'll serve wrong answers
- →Tool result caching: hash tool name + sorted args as key, TTL per data type (stock: 60s, KB: 3600s)
- →Cache write costs break the '90% savings' headline: a 5 min cache write costs 1.25× input — savings only start after the second read
- →Never semantic-cache personalized, creative, or context-dependent responses — correctness failure, not a performance trade-off
When to Cache (and When Not To)
Caching for LLM agents is not always a win. Before adding any cache layer, answer three questions: Is the cached value safe to return to a different user? Will the same value be requested again within the TTL? Is serving a slightly stale value acceptable? If any answer is no, that layer doesn't apply.
Evaluate each layer independently — all three can apply to the same agent
| Scenario | Prompt Cache | Semantic Cache | Tool Cache |
|---|---|---|---|
| FAQ chatbot with fixed system prompt | ✓ yes | ✓ yes (>0.92 threshold) | ✓ yes (KB queries) |
| Personalized financial advisor | ✓ yes | ✗ no — user-specific context | ✓ yes (market data, shared) |
| Creative writing agent | ✓ yes | ✗ no — responses must vary | ✗ no — non-deterministic |
| Code review agent | ✓ yes | ✗ rarely — code is unique | ✓ yes (linting, doc lookups) |
| Real-time news agent | ✓ yes | ✗ no — data changes too fast | ✗ 60s TTL max |
Semantic caching across users is a correctness failure, not a performance trade-off. If your system prompt includes user name, account balance, or session state, a cached response from User A will be served to User B with User A's data. Always scope semantic cache keys to a user or session if any personalized context exists.