Advanced · 12 min

Caching Strategies at Scale

Three cache layers that reduce LLM costs and latency in production agents: provider prompt caching, semantic caching, and tool result caching — with cost math, failure modes, and the decision framework most articles skip.

Quick Reference

  • Provider prompt caching: a static prefix > 2K tokens gets cached by the provider; on Anthropic, reads cost 0.1× base input and writes cost 1.25× (5 min TTL) or 2× (1 hour TTL)
  • Anthropic: requires cache_control on content blocks, max 4 breakpoints per request, GA — no beta header needed
  • OpenAI: fully automatic, no code changes, 50% discount on cached tokens (vs Anthropic's 90%)
  • Semantic caching: embed the query, find cosine-similar past queries, return cached response — skips the LLM call entirely; use threshold ≥ 0.92 or you'll serve wrong answers
  • Tool result caching: hash tool name + sorted args as key, TTL per data type (stock: 60s, KB: 3600s)
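  • Cache write costs break the '90% savings' headline: a 5 min cache write costs 1.25× input, so a prefix that is written but never read costs 25% more than no caching. The write pays for itself on the first read (1.25 + 0.1 = 1.35× vs 2× uncached); a 1 hour write at 2× needs two reads (2.2× vs 3×)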
  • Never semantic-cache personalized, creative, or context-dependent responses — correctness failure, not a performance trade-off
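
The cost math behind the pricing bullets can be checked with a few lines of arithmetic. The sketch below is a minimal model, not provider billing logic. It assumes costs are expressed as multiples of the base input price for the cached prefix only, that the first request writes the cache and every later request inside the TTL reads it, and that the 1.25×, 2×, and 0.1× multipliers quoted above apply. The function names are illustrative.

```python
def cached_cost(requests: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Relative prefix cost for `requests` calls inside one TTL window.

    Assumption: the first call pays the write multiplier, every later call
    pays the read multiplier.
    """
    if requests <= 0:
        return 0.0
    return write_mult + read_mult * (requests - 1)


def uncached_cost(requests: int) -> float:
    """Relative prefix cost without caching: 1x base input per call."""
    return float(max(requests, 0))


def break_even_reads(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache reads at which caching is strictly cheaper."""
    reads = 0
    # One write plus `reads` reads is compared with (1 + reads) uncached calls.
    while write_mult + read_mult * reads >= 1 + reads:
        reads += 1
    return reads


def savings(requests: int, write_mult: float = 1.25) -> float:
    """Fraction of prefix cost saved versus no caching (negative means a loss)."""
    return 1 - cached_cost(requests, write_mult) / uncached_cost(requests)


if __name__ == "__main__":
    for label, mult in (("5 min", 1.25), ("1 hour", 2.0)):
        print(label, "break-even reads:", break_even_reads(mult))
    print("savings over 10 requests (5 min write):", round(savings(10), 3))
```

Under these assumptions a 5 min write is recovered by the first read, a 1 hour write needs two reads, and ten requests in one window save roughly 78% rather than 90%. A prefix that is written and never read costs more than not caching at all.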

When to Cache (and When Not To)

Caching for LLM agents is not always a win. Before adding any cache layer, answer three questions: Is the cached value safe to return to a different user? Will the same value be requested again within the TTL? Is serving a slightly stale value acceptable? If any answer is no, that layer doesn't apply.

Which cache layer applies? Evaluate each independently:

  • Static prefix > 2K tokens? Yes: provider prompt cache (0.1× on cache reads). No: skip.
  • Query repeats and is non-personalized? Yes: semantic cache (100% of the call cost saved per hit). No: skip.
  • Tool call is deterministic? Yes: tool result cache (TTL-based). No: skip.

If every answer is no, no cache layer applies.

Evaluate each layer independently — all three can apply to the same agent
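
For the tool-result branch, the quick-reference recipe (hash the tool name plus sorted args, TTL per data type) might look like the sketch below. It is an in-memory illustration under stated assumptions: `ToolResultCache`, `tool_cache_key`, and the `TTL_BY_TYPE` table are hypothetical names, arguments are assumed to be JSON-serializable, and a production agent would more likely back this with a shared store.

```python
import hashlib
import json
import time

# Assumed TTLs in seconds per data type; the two values come from the quick reference.
TTL_BY_TYPE = {"stock": 60, "kb": 3600}


def tool_cache_key(tool_name, args):
    """Stable key: sort_keys makes argument order irrelevant to the hash."""
    payload = json.dumps(
        {"tool": tool_name, "args": args}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolResultCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._store = {}      # key -> (expires_at, result)

    def get_or_call(self, tool_name, args, data_type, fn):
        """Return a fresh cached result, or call fn(**args) and cache the result."""
        key = tool_cache_key(tool_name, args)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        result = fn(**args)
        self._store[key] = (now + TTL_BY_TYPE[data_type], result)
        return result
```

Only deterministic, non-user-specific tools should go through `get_or_call`; anything with side effects or per-user results needs either no cache or a key that includes the user.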

| Scenario | Prompt Cache | Semantic Cache | Tool Cache |
|---|---|---|---|
| FAQ chatbot with fixed system prompt | ✓ yes | ✓ yes (>0.92 threshold) | ✓ yes (KB queries) |
| Personalized financial advisor | ✓ yes | ✗ no (user-specific context) | ✓ yes (market data, shared) |
| Creative writing agent | ✓ yes | ✗ no (responses must vary) | ✗ no (non-deterministic) |
| Code review agent | ✓ yes | ✗ rarely (code is unique) | ✓ yes (linting, doc lookups) |
| Real-time news agent | ✓ yes | ✗ no (data changes too fast) | ✗ 60s TTL max |

The personalization trap

Sharing semantic cache entries across users is a correctness failure, not a performance trade-off, whenever responses depend on user context. If your system prompt includes the user's name, account balance, or session state, a response cached for User A can be served to User B with User A's data in it. Always scope semantic cache keys to a user or session if any personalized context exists.
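
One way to enforce that scoping is to make the scope part of the lookup itself, so an entry written for one user can never match a query from another. The sketch below is a minimal illustration rather than a production design: `ScopedSemanticCache` is a hypothetical class, `embed` is any caller-supplied function that maps text to a vector, the linear scan stands in for a real vector index, and the 0.92 default mirrors the threshold from the quick reference.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ScopedSemanticCache:
    def __init__(self, embed, threshold=0.92):
        self._embed = embed          # assumed: callable mapping str -> list of floats
        self._threshold = threshold
        self._entries = {}           # scope -> list of (embedding, response)

    def put(self, scope, query, response):
        """Store a response under a user or session scope."""
        self._entries.setdefault(scope, []).append((self._embed(query), response))

    def get(self, scope, query):
        """Return the closest cached response within this scope, or None on a miss."""
        q = self._embed(query)
        best_score, best_response = -1.0, None
        # Only entries written under the same scope are ever compared.
        for vec, response in self._entries.get(scope, []):
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

For non-personalized content such as a shared FAQ, a single constant scope (for example `"public"`) gives cross-user hits while keeping the personalized path isolated.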