Advanced12 min

Caching Strategies at Scale

Three cache layers that reduce LLM costs and latency in production agents: provider prompt caching, semantic caching, and tool result caching — with cost math, failure modes, and the decision framework most articles skip.

Quick Reference

→Provider prompt caching: static prefix > 2K tokens gets cached by the provider; reads cost 0.1× base input, writes cost 1.25× (5 min) or 2× (1 hour)
→Anthropic: requires cache_control on content blocks, max 4 breakpoints per request, GA — no beta header needed
→OpenAI: fully automatic, no code changes, 50% discount on cached tokens (vs Anthropic's 90%)
→Semantic caching: embed the query, find cosine-similar past queries, return cached response — skips the LLM call entirely; use threshold ≥ 0.92 or you'll serve wrong answers
→Tool result caching: hash tool name + sorted args as key, TTL per data type (stock: 60s, KB: 3600s)
→Cache write costs break the '90% savings' headline: a 5 min cache write costs 1.25× input — savings only start after the second read
→Never semantic-cache personalized, creative, or context-dependent responses — correctness failure, not a performance trade-off

When to Cache (and When Not To)

Caching for LLM agents is not always a win. Before adding any cache layer, answer three questions: Is the cached value safe to return to a different user? Will the same value be requested again within the TTL? Is serving a slightly stale value acceptable? If any answer is no, that layer doesn't apply.

Evaluate each layer independently — all three can apply to the same agent

Scenario	Prompt Cache	Semantic Cache	Tool Cache
FAQ chatbot with fixed system prompt	✓ yes	✓ yes (>0.92 threshold)	✓ yes (KB queries)
Personalized financial advisor	✓ yes	✗ no — user-specific context	✓ yes (market data, shared)
Creative writing agent	✓ yes	✗ no — responses must vary	✗ no — non-deterministic
Code review agent	✓ yes	✗ rarely — code is unique	✓ yes (linting, doc lookups)
Real-time news agent	✓ yes	✗ no — data changes too fast	✗ 60s TTL max

The personalization trap

Semantic caching across users is a correctness failure, not a performance trade-off. If your system prompt includes user name, account balance, or session state, a cached response from User A will be served to User B with User A's data. Always scope semantic cache keys to a user or session if any personalized context exists.

Provider Prompt Caching

Highest-ROI optimization in this article

Prompt caching reduces input costs by 90% on cached tokens with zero changes to agent logic. Enable it on every endpoint with a static system prompt over 2K tokens. Do it today.

Semantic Caching

Eliminates the LLM call entirely on a hit

Semantic caching returns a cached response for queries that are semantically equivalent — not just lexically identical. A cache hit costs one embedding call (~0.1ms, $0.00002) instead of a full LLM inference.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.