# Cost & Latency Optimization
Reducing RAG costs and latency in production: embedding caching, dimensionality reduction, vector quantization, context stuffing strategies, and model tiering.
## Quick Reference
- Embedding cache: cache query embeddings to avoid redundant API calls — 30-50% of queries are repeated or similar
- Dimensionality reduction: Matryoshka embeddings at 256 dims save 90%+ storage with ~5% quality loss
- Vector quantization: scalar/binary quantization reduces storage 4-32x with minimal quality impact
- Model tiering: use gpt-5.4-mini for simple queries, gpt-5.4 only for complex ones
- Context stuffing: fewer, more relevant chunks in the prompt reduce generation cost and improve quality
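The embedding-cache idea above can be sketched in a few lines. This is a minimal in-memory version; the `embed_fn` callable is a placeholder assumption for whatever embedding client the system actually uses, and the whitespace/case normalization is one simple choice for catching trivially repeated queries:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed on normalized query text.

    embed_fn is an assumed placeholder for the real embedding call
    (e.g. an API client); swap in a persistent store (Redis, SQLite)
    for production use.
    """

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Collapse whitespace and lowercase so trivially repeated
        # queries ("What is RAG?" vs " what is rag? ") share a key.
        norm = " ".join(text.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def get(self, text: str):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vec = self.embed_fn(text)
        self._store[key] = vec
        return vec
```

Exact-match caching only captures repeated queries; catching *similar* queries (the rest of the 30-50% figure) requires a semantic cache, i.e. a nearest-neighbor lookup over previously embedded queries with a similarity threshold.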
## RAG Cost Breakdown
RAG costs come from four sources: embedding (converting text to vectors), storage (storing vectors in a database), retrieval (searching the vector store), and generation (LLM inference to produce the answer). Understanding the cost breakdown for your system is essential for knowing where to optimize. For most systems, generation dominates the per-query cost, while storage dominates the fixed monthly cost.
| Cost Component | Typical Cost | Scales With | Optimization Lever |
|---|---|---|---|
| Embedding (indexing) | $0.02/1M tokens | Corpus size (one-time) | Batch processing, cache, fewer re-indexes |
| Embedding (queries) | $0.02/1M tokens | Query volume | Cache repeated queries |
| Vector storage | $2-12/GB/month | Corpus size × dimensions | Dimensionality reduction, quantization |
| Vector search | $0.001-0.003/query | Query volume | Pre-filtering, approximate search |
| LLM generation | $0.15-10/1M tokens | Query volume × context size | Smaller models, fewer/shorter context chunks |
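A back-of-envelope estimate makes the table concrete. The token counts and query profile below are assumptions, and the $2.50/1M generation price is an illustrative point inside the table's $0.15-10 range, not a quoted price:

```python
# Per-query cost sketch using figures from the table above.
# Token counts and the generation price are assumptions.
QUERY_EMBED_TOKENS = 20          # tokens in the query itself
CONTEXT_TOKENS = 4_000           # retrieved chunks stuffed into the prompt
OUTPUT_TOKENS = 300              # generated answer length

EMBED_PRICE = 0.02 / 1_000_000   # $/token (table: $0.02/1M tokens)
SEARCH_COST = 0.002              # $/query (midpoint of $0.001-0.003)
GEN_PRICE = 2.50 / 1_000_000     # $/token (illustrative mid-range price)

embed_cost = QUERY_EMBED_TOKENS * EMBED_PRICE
gen_cost = (CONTEXT_TOKENS + OUTPUT_TOKENS) * GEN_PRICE
total = embed_cost + SEARCH_COST + gen_cost

for name, cost in [("embedding", embed_cost),
                   ("search", SEARCH_COST),
                   ("generation", gen_cost)]:
    print(f"{name:>10}: ${cost:.6f} ({cost / total:.0%} of total)")
```

Even with a modest 4K-token context, generation dominates; re-running the sketch with different context sizes shows why trimming chunks is such a large lever.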
For most RAG systems, LLM generation accounts for 85-95% of per-query cost. This means the biggest optimization levers are: (1) using a cheaper model (gpt-5.4-mini vs gpt-5.4 is 10x cheaper), (2) reducing context size (fewer/shorter chunks), and (3) caching answers for repeated queries.
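Lever (1), model tiering, needs a router that decides per query which model to call. A minimal heuristic version is sketched below; the thresholds and marker words are illustrative assumptions (a production router would typically use a small classifier or the cheap model itself as a judge):

```python
# Heuristic model router: default to the cheap tier, escalate to the
# expensive tier only when the query looks complex. Thresholds and
# marker words are illustrative assumptions, not tuned values.
COMPLEX_MARKERS = ("compare", "why", "explain", "trade-off", "step by step")

def pick_model(query: str) -> str:
    q = query.lower()
    long_query = len(q.split()) > 25      # long queries tend to be complex
    multi_part = q.count("?") > 1         # multiple questions in one
    has_marker = any(m in q for m in COMPLEX_MARKERS)
    if long_query or multi_part or has_marker:
        return "gpt-5.4"                  # expensive tier
    return "gpt-5.4-mini"                 # ~10x cheaper tier
```

For example, `pick_model("What is vector search?")` stays on the mini tier, while a query containing "compare" or several question marks escalates. The false-negative cost (a complex query answered by the cheap model) is what bounds how aggressive the heuristics can be.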