RAG Architecture Deep Dive
Decision guide for RAG architecture: when to use RAG vs alternatives, what it costs, how the two-pipeline architecture works, how RAG fails in production, and a map of the 22 articles in this topic.
Quick Reference
- →RAG = offline indexing pipeline + online query pipeline sharing a vector store
- →Use RAG when: data changes often, corpus > 100K tokens, or you need source attribution
- →Indexing cost is dominated by embedding volume; per-query cost is dominated by LLM generation
- →Changing your embedding model means full reindex of every document — choose carefully upfront
- →Retrieval quality sets the ceiling for generation quality — no model can fix irrelevant context
- →This article is the overview — see Evaluating RAG, Debugging Retrieval, and Cost Optimization for production depth
Should You Build RAG at All?
RAG is not the default answer to every knowledge problem. Before building a retrieval system, answer three questions in order. If you answer Yes to any of them, RAG is the right tool. If you answer No to all three, you have simpler and cheaper options.
answer Yes to any question → use RAG; answer No to all → evaluate alternatives
| Approach | Best When | Real Limitation |
|---|---|---|
| RAG | Data changes often, large corpus, need attribution | Complex pipeline with two independent failure modes |
| Long-context LLM | Corpus fits in a context window (~200K tokens), low volume | At 200K tokens: ~$0.60–$1.00 per query at current model pricing. Cheap for a demo, expensive at 10K queries/day. |
| Fine-tuning | Need to change model style, tone, or domain vocabulary | Doesn't reliably add factual knowledge — models still hallucinate on fine-tuned facts |
| Prompt Engineering | Static, small knowledge base (< 10K tokens total) | Every prompt becomes longer over time; context window and cost grow linearly with KB size |
Fine-tune the model to follow your citation format and response style. Then use RAG to supply the facts. This combination — style via fine-tuning, knowledge via retrieval — is what most mature production systems evolve toward.