Intermediate · 8 min

Semantic Caching

Cache similar (not just identical) queries by embedding similarity — a user asking 'What is LangGraph?' gets the cached response for 'Explain LangGraph' instead of a new LLM call.

Quick Reference

  • Exact-match caching: only hits if the query is character-for-character identical — low hit rate
  • Semantic caching: embed the query, find similar cached queries by cosine similarity — 3-5x more hits
  • Set similarity threshold (0.92-0.95) — too low returns wrong answers, too high misses valid cache hits
  • Cache invalidation: TTL-based for time-sensitive data, version-based for knowledge base updates
  • Store: query embedding + response + metadata (timestamp, hit count, source)
  • Saves 40-60% of LLM calls in production with well-tuned thresholds
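The store described above (query embedding + response + metadata, TTL invalidation, a tunable similarity threshold) can be sketched as a small in-memory class. The `embed` callable, field names, and default values here are illustrative assumptions, not a specific library's API; production systems would use a vector store instead of a linear scan.

```python
import math
import time


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


class SemanticCache:
    """Minimal sketch of a semantic cache (assumed interface, not a real library)."""

    def __init__(self, embed, threshold: float = 0.93, ttl: float = 3600.0):
        self.embed = embed          # callable: str -> list[float] (assumed)
        self.threshold = threshold  # cosine cutoff; see 0.92-0.95 guidance above
        self.ttl = ttl              # seconds before an entry is considered stale
        self.entries: list[dict] = []

    def get(self, query: str):
        """Return the best cached response above the threshold, else None."""
        q = self.embed(query)
        now = time.time()
        best, best_sim = None, self.threshold
        for entry in self.entries:
            if now - entry["timestamp"] > self.ttl:
                continue  # TTL-expired entry: treat as a miss
            sim = cosine(q, entry["embedding"])
            if sim >= best_sim:
                best, best_sim = entry, sim
        if best is not None:
            best["hit_count"] += 1
            return best["response"]
        return None

    def put(self, query: str, response: str, source: str = "llm") -> None:
        """Store embedding + response + metadata for future similarity hits."""
        self.entries.append({
            "embedding": self.embed(query),
            "response": response,
            "timestamp": time.time(),
            "hit_count": 0,
            "source": source,
        })
```

A real `embed` would call an embedding model; the linear scan is O(n) per lookup and exists only to keep the sketch dependency-free.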

Exact-Match vs Semantic Caching


Flow: query → embed → search cache (cosine ≥ 0.93) → on a hit, return the cached response; on a miss, call the LLM and cache the result.
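The miss-then-fill flow above reduces to a few lines once the cache and LLM client are injected. `cache` and `call_llm` are placeholder interfaces assumed for this sketch (a `get`/`put` pair and a plain string-to-string callable), not a specific framework's API.

```python
def answer(query: str, cache, call_llm) -> str:
    """Semantic-cache flow: return a cached response if one is similar
    enough, otherwise call the LLM and store the result for next time."""
    cached = cache.get(query)       # embedding similarity search inside
    if cached is not None:
        return cached               # cache hit: no LLM call
    response = call_llm(query)      # cache miss: pay for one LLM call
    cache.put(query, response)      # fill the cache for future paraphrases
    return response
```

Because every miss writes back, the second paraphrase of a popular question is served from the cache rather than the model.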

Aspect           Exact-Match Cache                          Semantic Cache
Match condition  query == cached query (string equality)    embed(query) similar to embed(cached_query)
Hit rate         5-15% in production                        30-50% in production
Risk             None — identical query, identical answer   Wrong answer if threshold too low
Cost             Free (hash lookup)                         Embedding cost (~$0.0001 per query)
Setup            Dict/Redis with string keys                Vector store with similarity search
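For contrast with the semantic approach, the exact-match column above amounts to a hash lookup on the (lightly normalized) query string. This is a minimal sketch; the normalization choices and function names are illustrative, and a production version would live in Redis rather than a Python dict.

```python
import hashlib

exact_cache: dict[str, str] = {}


def _key(query: str) -> str:
    # Light normalization (trim + lowercase) still misses paraphrases.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()


def exact_get(query: str):
    """Hash lookup: hits only on a character-for-character match."""
    return exact_cache.get(_key(query))


def exact_put(query: str, response: str) -> None:
    exact_cache[_key(query)] = response
```

Note the failure mode the table describes: 'What is LangGraph?' and 'Explain LangGraph' hash to different keys, so the exact-match cache misses where a semantic cache would hit.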