Advanced · 10 min

Cost & Latency Optimization

Reducing RAG costs and latency in production: embedding caching, dimensionality reduction, vector quantization, context stuffing strategies, and model tiering.

Quick Reference

  • Embedding cache: cache query embeddings to avoid redundant API calls — 30-50% of queries are repeated or similar
  • Dimensionality reduction: Matryoshka embeddings at 256 dims save 90%+ storage with ~5% quality loss
  • Vector quantization: scalar/binary quantization reduces storage 4-32x with minimal quality impact
  • Model tiering: use gpt-5.4-mini for simple queries, gpt-5.4 only for complex ones
  • Context stuffing: fewer, more relevant chunks in the prompt reduce generation cost and improve answer quality

RAG Cost Breakdown

RAG costs come from four sources: embedding (converting text to vectors), storage (storing vectors in a database), retrieval (searching the vector store), and generation (LLM inference to produce the answer). Understanding the cost breakdown for your system is essential for knowing where to optimize. For most systems, generation dominates the per-query cost, while storage dominates the fixed monthly cost.

| Cost component | Typical cost | Scales with | Optimization lever |
|---|---|---|---|
| Embedding (indexing) | $0.02/1M tokens | Corpus size (one-time) | Batch processing, cache, fewer re-indexes |
| Embedding (queries) | $0.02/1M tokens | Query volume | Cache repeated queries |
| Vector storage | $2-12/GB/month | Corpus size × dimensions | Dimensionality reduction, quantization |
| Vector search | $0.001-0.003/query | Query volume | Pre-filtering, approximate search |
| LLM generation | $0.15-10/1M tokens | Query volume × context size | Smaller models, fewer/shorter context chunks |
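A back-of-envelope per-query cost can be computed directly from the table. The defaults below (token counts, a blended $2/1M generation price) are illustrative assumptions, not measurements:

```python
def query_cost(
    query_tokens: int = 20,
    context_tokens: int = 2000,
    output_tokens: int = 300,
    embed_price: float = 0.02,    # $ per 1M tokens (query embedding)
    search_price: float = 0.002,  # $ per vector search
    gen_price: float = 2.00,      # $ per 1M tokens, blended in/out (assumed)
) -> float:
    embed = query_tokens / 1e6 * embed_price
    gen = (query_tokens + context_tokens + output_tokens) / 1e6 * gen_price
    return embed + search_price + gen
```

Plugging in your own prices and context sizes makes the dominant term obvious: embedding a 20-token query costs fractions of a cent of a cent, while the generation term grows linearly with every chunk you stuff into the prompt.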
Generation dominates query cost

For most RAG systems, LLM generation accounts for 85-95% of per-query cost. The biggest optimization levers are therefore: (1) using a cheaper model (gpt-5.4-mini is roughly 10x cheaper than gpt-5.4), (2) reducing context size (fewer/shorter chunks), and (3) caching answers for repeated queries.
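Lever (1), model tiering, can start as a simple router. The heuristic and the `pick_model` function below are an illustrative sketch (the model names come from the text above; real systems often use a small classifier instead of word counts):

```python
CHEAP_MODEL = "gpt-5.4-mini"
STRONG_MODEL = "gpt-5.4"

def pick_model(query: str) -> str:
    # Crude heuristic: long or multi-part questions escalate to the
    # strong model; everything else takes the cheap tier.
    is_complex = len(query.split()) > 30 or query.count("?") > 1
    return STRONG_MODEL if is_complex else CHEAP_MODEL
```

Because most traffic is short, simple queries, even a crude router like this shifts the bulk of generation spend onto the cheap tier; the failure mode to watch is complex queries misrouted to the small model, so log and sample routed queries before trusting the heuristic.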