Advanced · 10 min
Caching Strategies at Scale
Reducing LLM API calls through caching: prompt caching, semantic caching, tool result caching, and cache invalidation patterns for agents.
Quick Reference
- Prompt caching (Anthropic, OpenAI): the provider caches your static system prompt and tool definitions, reducing cost and latency
- Semantic caching: embed the user query, search for similar past queries, and return the cached response if similarity exceeds a threshold
- Tool result caching: cache external API responses (search results, database queries) to avoid redundant calls within and across conversations
- Cache invalidation: use TTLs for time-sensitive data, event-based invalidation for state changes, and versioned keys for schema changes
- Cache hit rate is the key metric — track it per cache layer (prompt, semantic, tool) to identify optimization opportunities
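The TTL and versioned-key invalidation patterns above can be sketched as a small in-memory tool-result cache. This is a minimal illustration (the class and method names are hypothetical, and a production system would likely use a shared store such as Redis):

```python
import time


class ToolResultCache:
    """Sketch of a TTL-based tool-result cache with versioned keys.

    Bumping schema_version changes every key, which invalidates all
    previously cached entries at once (the "versioned keys" pattern).
    """

    def __init__(self, ttl_seconds: float, schema_version: str = "v1"):
        self.ttl = ttl_seconds
        self.version = schema_version
        self._store: dict[str, tuple[float, object]] = {}

    def _key(self, tool: str, args_repr: str) -> str:
        return f"{self.version}:{tool}:{args_repr}"

    def get(self, tool: str, args_repr: str):
        key = self._key(tool, args_repr)
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: caller invokes the real tool
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # TTL expiry for time-sensitive data
            return None
        return value

    def put(self, tool: str, args_repr: str, value) -> None:
        self._store[self._key(tool, args_repr)] = (time.monotonic(), value)


cache = ToolResultCache(ttl_seconds=300)
cache.put("web_search", "q=python caching", ["result1", "result2"])
```

Event-based invalidation fits the same shape: on a state-change event, delete the affected keys (or bump the version) instead of waiting for the TTL.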
Cache Layers Overview
Three cache layers
Production agents benefit from caching at three layers: provider prompt caching (system prompt reuse), semantic caching (similar query dedup), and tool result caching (external API response reuse).
Cache layers: provider prompt cache, semantic cache, tool result cache
Each layer targets a different cost driver. Prompt caching reduces per-request token costs by up to 90% on cached prefixes. Semantic caching eliminates redundant LLM calls entirely. Tool result caching avoids repeated external API calls that add latency and may have their own rate limits.
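The semantic layer can be sketched in a few lines. Here a toy bag-of-words vector stands in for a real embedding model (a production cache would call an embedding API and use a vector index; the class name and threshold value are illustrative):

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" — a stand-in for a real embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Return a cached LLM response when a past query is similar enough."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best is not None and cosine(qv, best[0]) >= self.threshold:
            return best[1]  # hit: no LLM call needed
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("how do I reset my password", "Go to Settings > Security > Reset.")
```

On a hit, the LLM call is skipped entirely, which is why each semantic-cache hit saves 100% of that request's cost.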
| Cache Layer | What It Caches | Hit Rate (typical) | Cost Reduction | Implementation Effort |
|---|---|---|---|---|
| Provider prompt cache | Static system prompt + tool defs | 95-100% | Up to 90% on cached tokens | Zero (provider-side) |
| Semantic cache | Full LLM responses for similar queries | 10-40% | 100% per hit (no LLM call) | Medium (embedding + similarity) |
| Tool result cache | External API responses | 30-70% | Eliminates API call latency + cost | Low (TTL-based key-value) |
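The "zero effort" in the provider prompt cache row is almost literal: the only client-side work is marking the static prefix as cacheable. A sketch of an Anthropic-style request body follows — the `cache_control` field name matches Anthropic's Messages API, but the model id and prompt are illustrative, so check current provider docs before relying on the exact shape:

```python
# Imagine a long, static system prompt plus tool definitions — the part
# that is identical on every request and therefore worth caching.
SYSTEM_PROMPT = "You are a research agent. Follow these policies..."

request_body = {
    "model": "claude-sonnet-4-20250514",  # example model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Cache breakpoint: the provider caches everything up to here,
            # so later requests pay the reduced cached-token rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the latest results."}
    ],
}
```

Only the dynamic suffix (the user turn and conversation history) is billed at the full rate on subsequent requests with the same prefix.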