Intermediate · 8 min

Semantic Caching

Cache similar (not just identical) queries by embedding similarity — a user asking 'What is LangGraph?' gets the cached response for 'Explain LangGraph' instead of a new LLM call.

Quick Reference

  • Exact-match caching: only hits if the query is character-for-character identical — low hit rate
  • Semantic caching: embed the query, find similar cached queries by cosine similarity — 3-5x more hits
  • Set similarity threshold (0.92-0.95) — too low returns wrong answers, too high misses valid cache hits
  • Cache invalidation: TTL-based for time-sensitive data, version-based for knowledge base updates
  • Store: query embedding + response + metadata (timestamp, hit count, source)
  • Saves 40-60% of LLM calls in production with well-tuned thresholds
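The store described above (query embedding + response + metadata, TTL invalidation, a tunable similarity threshold) can be sketched as a small in-memory class. The `embed` callable, field names, and default values here are illustrative assumptions, not a specific library's API; production systems would use a vector store instead of a linear scan.

```python
import math
import time


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


class SemanticCache:
    """Minimal sketch of a semantic cache (assumed interface, not a real library)."""

    def __init__(self, embed, threshold: float = 0.93, ttl: float = 3600.0):
        self.embed = embed          # callable: str -> list[float] (assumed)
        self.threshold = threshold  # cosine cutoff; see 0.92-0.95 guidance above
        self.ttl = ttl              # seconds before an entry is considered stale
        self.entries: list[dict] = []

    def get(self, query: str):
        """Return the best cached response above the threshold, else None."""
        q = self.embed(query)
        now = time.time()
        best, best_sim = None, self.threshold
        for entry in self.entries:
            if now - entry["timestamp"] > self.ttl:
                continue  # TTL-expired entry: treat as a miss
            sim = cosine(q, entry["embedding"])
            if sim >= best_sim:
                best, best_sim = entry, sim
        if best is not None:
            best["hit_count"] += 1
            return best["response"]
        return None

    def put(self, query: str, response: str, source: str = "llm") -> None:
        """Store embedding + response + metadata for future similarity hits."""
        self.entries.append({
            "embedding": self.embed(query),
            "response": response,
            "timestamp": time.time(),
            "hit_count": 0,
            "source": source,
        })
```

A real `embed` would call an embedding model; the linear scan is O(n) per lookup and exists only to keep the sketch dependency-free.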

Exact-Match vs Semantic Caching


Flow: query → embed → search cache (cosine ≥ 0.93) → on a hit, return the cached response; on a miss, call the LLM and cache the result.
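The miss-then-fill flow above reduces to a few lines once the cache and LLM client are injected. `cache` and `call_llm` are placeholder interfaces assumed for this sketch (a `get`/`put` pair and a plain string-to-string callable), not a specific framework's API.

```python
def answer(query: str, cache, call_llm) -> str:
    """Semantic-cache flow: return a cached response if one is similar
    enough, otherwise call the LLM and store the result for next time."""
    cached = cache.get(query)       # embedding similarity search inside
    if cached is not None:
        return cached               # cache hit: no LLM call
    response = call_llm(query)      # cache miss: pay for one LLM call
    cache.put(query, response)      # fill the cache for future paraphrases
    return response
```

Because every miss writes back, the second paraphrase of a popular question is served from the cache rather than the model.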

Aspect           Exact-Match Cache                          Semantic Cache
Match condition  query == cached query (string equality)    embed(query) similar to embed(cached_query)
Hit rate         5-15% in production                        30-50% in production
Risk             None — identical query, identical answer   Wrong answer if threshold too low
Cost             Free (hash lookup)                         Embedding cost (~$0.0001 per query)
Setup            Dict/Redis with string keys                Vector store with similarity search
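For contrast with the semantic approach, the exact-match column above amounts to a hash lookup on the (lightly normalized) query string. This is a minimal sketch; the normalization choices and function names are illustrative, and a production version would live in Redis rather than a Python dict.

```python
import hashlib

exact_cache: dict[str, str] = {}


def _key(query: str) -> str:
    # Light normalization (trim + lowercase) still misses paraphrases.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()


def exact_get(query: str):
    """Hash lookup: hits only on a character-for-character match."""
    return exact_cache.get(_key(query))


def exact_put(query: str, response: str) -> None:
    exact_cache[_key(query)] = response
```

Note the failure mode the table describes: 'What is LangGraph?' and 'Explain LangGraph' hash to different keys, so the exact-match cache misses where a semantic cache would hit.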