Advanced · 10 min

Caching Strategies at Scale

Reducing LLM API calls through caching: prompt caching, semantic caching, tool result caching, and cache invalidation patterns for agents.

Quick Reference

  • Prompt caching (Anthropic, OpenAI): the provider caches your static system prompt and tool definitions, reducing cost and latency
  • Semantic caching: embed the user query, search for similar past queries, return the cached response if similarity exceeds a threshold
  • Tool result caching: cache external API responses (search results, database queries) to avoid redundant calls within and across conversations
  • Cache invalidation: use TTLs for time-sensitive data, event-based invalidation for state changes, and versioned keys for schema changes
  • Cache hit rate is the key metric — track it per cache layer (prompt, semantic, tool) to identify optimization opportunities
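To make the first bullet concrete, the request below sketches how a static prefix can be marked for provider-side prompt caching, in the style of Anthropic's Messages API `cache_control` blocks. The model name, tool definition, and prompt text are illustrative placeholders, not values from this document.

```python
# Sketch of a request body with prompt-caching breakpoints (Anthropic-style).
# The provider caches everything up to each cache_control marker, so the
# static system prompt and tool definitions are reused across requests.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "tools": [
        {
            "name": "web_search",  # hypothetical tool
            "description": "Search the web for a query.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            # Breakpoint after the static tool definitions.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a research agent...",  # long static prompt
            # Breakpoint after the static system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Only the dynamic user turn falls outside the cached prefix.
    "messages": [{"role": "user", "content": "Find recent LLM caching papers."}],
}
```

The key design constraint is ordering: cached content must be a stable prefix, so static blocks (tools, system prompt) go first and per-request content goes last.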

Cache Layers Overview

Three cache layers

Production agents benefit from caching at three layers: provider prompt caching (reusing the static system prompt), semantic caching (deduplicating similar queries), and tool result caching (reusing external API responses).

[Diagram] Request → Provider Prompt Cache (closest to the LLM, prefix caching) → Semantic Cache (embedding similarity lookup) → Tool Result Cache (closest to external APIs) → External API / LLM. Each layer returns on HIT and falls through on MISS, ordered fastest to slowest.

Cache layers: provider prompt cache, semantic cache, tool result cache

Each layer targets a different cost driver. Prompt caching reduces per-request token costs by up to 90% on cached prefixes. Semantic caching eliminates redundant LLM calls entirely. Tool result caching avoids repeated external API calls that add latency and may have their own rate limits.

| Cache Layer | What It Caches | Hit Rate (typical) | Cost Reduction | Implementation Effort |
| --- | --- | --- | --- | --- |
| Provider prompt cache | Static system prompt + tool defs | 95-100% | Up to 90% on cached tokens | Zero (provider-side) |
| Semantic cache | Full LLM responses for similar queries | 10-40% | 100% per hit (no LLM call) | Medium (embedding + similarity) |
| Tool result cache | External API responses | 30-70% | Eliminates API call latency + cost | Low (TTL-based key-value) |