Cost & Latency Optimization
Reduce RAG costs in production through measure-first profiling, prompt caching (50-90% savings), semantic caching, multi-provider model tiering, context optimization, and cost monitoring.
Quick Reference
- Measure first: profile your actual per-query cost breakdown before optimizing. Generation input usually dominates.
- Prompt caching: 50-90% savings on static input tokens (system prompt, context). Enable it before anything else.
- Semantic caching: embed queries and return cached answers for semantically similar questions, which eliminates the LLM call on a cache hit.
- Model tiering: route simple queries to nano/Haiku ($0.10-1.00 per 1M input tokens) and complex ones to Sonnet/GPT-5.4 ($2.50-3.00 per 1M input tokens).
- Context stuffing: cut from 5 to 3 chunks after reranking. This saves ~40% of context tokens with minimal quality loss.
- Storage: Matryoshka 256 dims + scalar quantization = 90%+ storage reduction (lowers the fixed monthly cost, not the per-query cost).
- Cost monitoring: track per-query cost, set daily budgets, and alert on spikes. Costs can double overnight.
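The storage figure in the list above can be sanity-checked with quick arithmetic. The sketch below assumes a 1536-dimension float32 baseline, which the list does not state, and counts raw vector bytes only (no index or metadata overhead). The helper names are hypothetical.

```python
# Back-of-envelope check of the storage bullet.
# ASSUMPTION: the baseline is a 1536-dim float32 embedding; adjust to your model.

def vector_bytes(dims: int, bytes_per_value: int) -> int:
    """Raw size of one stored vector, ignoring index and metadata overhead."""
    return dims * bytes_per_value


def storage_reduction(base_dims: int, base_bytes: int,
                      new_dims: int, new_bytes: int) -> float:
    """Fraction of raw vector storage saved by the smaller representation."""
    return 1 - vector_bytes(new_dims, new_bytes) / vector_bytes(base_dims, base_bytes)


baseline = vector_bytes(1536, 4)  # float32: 4 bytes per dimension
reduced = vector_bytes(256, 1)    # truncated to 256 dims, int8 scalar quantization
reduction = storage_reduction(1536, 4, 256, 1)

print(f"{baseline} B -> {reduced} B per vector ({reduction:.1%} smaller)")
# prints: 6144 B -> 256 B per vector (95.8% smaller)
```

Under that assumed baseline the saving is about 96%, consistent with the "90%+" figure. A smaller baseline dimensionality gives a smaller reduction.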
Measure Before You Optimize
The first rule of RAG cost optimization: profile your actual cost breakdown before you change anything. Most teams guess wrong about where their money goes — they optimize embeddings while generation burns 20x more. Run the calculator below with your real numbers, then read the rest of this article in the order that matches your breakdown.
LLM generation dominates per-query cost; embedding dominates indexing cost
| Cost Component | Typical Cost (2026) | Scales With | Optimization Lever |
|---|---|---|---|
| Embedding (indexing) | $0.02/1M tokens | Corpus size (one-time) | Batch processing, cache, fewer re-indexes |
| Embedding (queries) | $0.02/1M tokens | Query volume | Cache repeated query embeddings |
| Vector storage | $0.33/GB + $8.25/1M reads (Pinecone) | Corpus size × dimensions | Dimensionality reduction, quantization |
| Vector search | $0.001-0.003/query | Query volume | Pre-filtering, approximate search |
| LLM generation | $0.10-15/1M tokens | Query volume × context size | Smaller models, prompt caching, fewer chunks |
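The per-query rows of the table can be combined into a rough breakdown. The following is a minimal sketch, not a definitive calculator: the embedding price and search cost come from the table, while every token count, the model prices, and the names `QueryProfile`, `per_query_breakdown`, and `report` are illustrative assumptions to be replaced with your own measurements. Vector storage is left out because it is a fixed monthly cost rather than a per-query one.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    # Token counts and LLM prices are illustrative assumptions;
    # replace them with values from your own logs and price sheet.
    query_tokens: int = 30
    system_tokens: int = 500
    context_tokens: int = 2500             # e.g. 5 chunks of ~500 tokens
    output_tokens: int = 300
    embed_price_per_m: float = 0.02        # $/1M tokens, from the table
    search_cost_per_query: float = 0.002   # within the table's $0.001-0.003 range
    llm_input_price_per_m: float = 3.00    # $/1M input tokens (assumed)
    llm_output_price_per_m: float = 15.00  # $/1M output tokens (assumed)


def per_query_breakdown(p: QueryProfile) -> dict[str, float]:
    """Return the dollar cost of each per-query component."""
    input_tokens = p.system_tokens + p.context_tokens + p.query_tokens
    return {
        "query_embedding": p.query_tokens * p.embed_price_per_m / 1_000_000,
        "vector_search": p.search_cost_per_query,
        "generation_input": input_tokens * p.llm_input_price_per_m / 1_000_000,
        "generation_output": p.output_tokens * p.llm_output_price_per_m / 1_000_000,
    }


def report(p: QueryProfile, daily_queries: int) -> None:
    """Print components from most to least expensive, with their share of the total."""
    costs = per_query_breakdown(p)
    total = sum(costs.values())
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<18} ${cost:.6f}  {cost / total:6.1%}")
    print(f"{'total':<18} ${total:.6f}  (${total * daily_queries:,.2f}/day)")


report(QueryProfile(), daily_queries=10_000)
```

With these assumed numbers, generation input is the largest line item and query embedding is negligible. The ordering can change with a cheaper model or a much smaller context, which is why the inputs should come from real traffic.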
Figure (ROI matrix): the top-left quadrant has the highest ROI. Start there before touching storage optimization.
Run the calculator with YOUR model, YOUR average context size, and YOUR daily query volume. The matrix above shows typical ROI ordering — but if your system prompt is under 100 tokens, prompt caching saves less. If you have no repeated queries, semantic caching saves nothing. Numbers first, optimizations second.
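The two caveats above follow directly from simple ratios. The sketch below is a simplified model under stated assumptions: `discount` is the price reduction on cached tokens (0.9 is taken from the upper end of the 50-90% range), cache-write surcharges and any provider-imposed minimum cacheable prefix length are ignored, and both function names are hypothetical.

```python
def prompt_cache_savings(static_tokens: int, dynamic_tokens: int,
                         discount: float = 0.9, hit_rate: float = 1.0) -> float:
    """Fraction of input-token cost saved when only the static prefix is cached.

    Simplified: ignores cache-write surcharges and minimum prefix lengths.
    """
    total = static_tokens + dynamic_tokens
    if total == 0:
        return 0.0
    return static_tokens * discount * hit_rate / total


def semantic_cache_savings(hit_rate: float, llm_cost_per_query: float,
                           lookup_cost_per_query: float = 0.0) -> float:
    """Expected dollars saved per query: hits skip the LLM, every query pays the lookup."""
    return hit_rate * llm_cost_per_query - lookup_cost_per_query


# A 100-token system prompt in front of 2,900 per-query tokens: about 3% saved.
print(f"{prompt_cache_savings(100, 2900):.1%}")
# A 2,000-token static prefix in front of 1,000 per-query tokens: about 60% saved.
print(f"{prompt_cache_savings(2000, 1000):.1%}")
# No repeated queries: nothing saved, and a net loss once lookups cost anything.
print(semantic_cache_savings(0.0, 0.0136, lookup_cost_per_query=0.0001))
```

The saving from prompt caching is bounded by the static share of the input, and the saving from semantic caching is bounded by the hit rate, so both numbers are worth measuring before enabling either.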