Intermediate · 8 min
Semantic Caching
Cache similar (not just identical) queries by embedding similarity — a user asking 'What is LangGraph?' gets the cached response for 'Explain LangGraph' instead of triggering a new LLM call.
Quick Reference
- Exact-match caching: only hits if the query is character-for-character identical — low hit rate
- Semantic caching: embed the query, find similar cached queries by cosine similarity — 3-5x more hits
- Set a similarity threshold (0.92-0.95) — too low returns wrong answers, too high misses valid cache hits
- Cache invalidation: TTL-based for time-sensitive data, version-based for knowledge base updates
- Store: query embedding + response + metadata (timestamp, hit count, source)
- Saves 40-60% of LLM calls in production with well-tuned thresholds
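The "store: query embedding + response + metadata" and TTL bullets above can be sketched as a cache-entry record. This is a minimal illustration, not a specific library's schema; the field names and the `expired` helper are assumptions:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    embedding: list[float]   # embed(query) computed at store time
    response: str            # the cached LLM response
    timestamp: float = field(default_factory=time.time)  # for TTL invalidation
    hit_count: int = 0       # bookkeeping for eviction / tuning
    source: str = "llm"      # provenance metadata

    def expired(self, ttl_seconds: float) -> bool:
        # TTL-based invalidation: a stale entry is treated as a cache miss
        return time.time() - self.timestamp > ttl_seconds

entry = CacheEntry(embedding=[0.1, 0.2], response="LangGraph is ...")
assert not entry.expired(ttl_seconds=3600)  # fresh entry, within TTL
```

Version-based invalidation would add e.g. a `kb_version` field and treat any entry stored under an older knowledge-base version as a miss.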
Exact-Match vs Semantic Caching
Query → embed → search cache (cosine ≥ 0.93) → hit: return cached / miss: call LLM + store
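The flow above can be sketched end-to-end in plain Python. The `embed` function here is a hypothetical hashed bag-of-words stand-in for a real embedding model, and `call_llm` is a placeholder; in production both would be API calls, and the cache would be a vector store rather than a list:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (assumption, for illustration)
    vec = [0.0] * 64
    for tok in text.lower().split():
        vec[hash(tok) % 64] += 1.0
    return vec

def call_llm(query: str) -> str:
    # Placeholder for a real LLM call
    return f"answer to: {query}"

THRESHOLD = 0.93                           # tune within the 0.92-0.95 range
cache: list[tuple[list[float], str]] = []  # (embedding, response) pairs

def lookup(query: str) -> str:
    q = embed(query)
    best = max(cache, key=lambda e: cosine(q, e[0]), default=None)
    if best is not None and cosine(q, best[0]) >= THRESHOLD:
        return best[1]               # hit: return the cached response
    response = call_llm(query)       # miss: call the LLM ...
    cache.append((q, response))      # ... and store for future queries
    return response
```

A second `lookup` with the same or near-identical wording scores above the threshold and returns the stored response without a new `call_llm`.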
| Aspect | Exact-Match Cache | Semantic Cache |
|---|---|---|
| Match condition | Query == cached query (string equality) | embed(query) similar to embed(cached_query) |
| Hit rate | 5-15% in production | 30-50% in production |
| Risk | None — identical query, identical answer | Wrong answer if threshold too low |
| Cost | Free (hash lookup) | Embedding cost (~$0.0001 per query) |
| Setup | Dict/Redis with string keys | Vector store with similarity search |