Overview · Intermediate · 15 min

RAG Architecture Deep Dive

Decision guide for RAG architecture: when to use RAG vs alternatives, what it costs, how the two-pipeline architecture works, how RAG fails in production, and a map of the 22 articles in this topic.

Quick Reference

  • RAG = offline indexing pipeline + online query pipeline sharing a vector store
  • Use RAG when: data changes often, corpus > 100K tokens, or you need source attribution
  • Indexing cost is dominated by embedding volume; per-query cost is dominated by LLM generation
  • Changing your embedding model means full reindex of every document — choose carefully upfront
  • Retrieval quality sets the ceiling for generation quality — no model can fix irrelevant context
  • This article is the overview — see Evaluating RAG, Debugging Retrieval, and Cost Optimization for production depth
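The two-pipeline shape from the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy word-count stand-in for a real embedding model, and `VectorStore`, `index_documents`, and `build_prompt` are hypothetical names introduced here. The final LLM call is omitted; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model (assumption): word counts.
    # Both pipelines must call the same function, which is why swapping
    # the embedding model forces a full reindex.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    # In-memory store shared by the indexing and query pipelines.
    def __init__(self):
        self.rows = []  # (doc_id, text, vector)

    def add(self, doc_id, text, vector):
        self.rows.append((doc_id, text, vector))

    def search(self, query_vector, k=2):
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[2]), reverse=True
        )
        return ranked[:k]


def index_documents(store, docs):
    # Offline pipeline: embed every document once and write it to the store.
    for doc_id, text in docs.items():
        store.add(doc_id, text, embed(text))


def build_prompt(store, question, k=2):
    # Online pipeline: embed the query, retrieve top-k, assemble the prompt.
    hits = store.search(embed(question), k=k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text, _ in hits)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {question}"
    )


docs = {
    "refunds": "Refunds are issued within 14 days of purchase",
    "shipping": "Standard shipping takes 5 business days",
    "pricing": "The pro plan costs 20 dollars per month",
}
store = VectorStore()
index_documents(store, docs)
print(build_prompt(store, "How many days until refunds are issued?", k=1))
```

Because the prompt carries document ids alongside the retrieved text, the generation step has what it needs for source attribution. If retrieval returns the wrong rows, nothing downstream can recover, which is the ceiling the Quick Reference describes.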

Should You Build RAG at All?

RAG is not the default answer to every knowledge problem. Before building a retrieval system, answer three questions in order. If you answer Yes to any of them, RAG is the right tool. If you answer No to all three, you have simpler and cheaper options.

Decision flow (ask in order; a No moves to the next question):

  • Data changes monthly or more often? (docs, policies, pricing, knowledge base) Yes → use RAG
  • Corpus exceeds 100K tokens? (too large to stuff into a context window) Yes → use RAG
  • Answers must cite their source? (attribution, auditability, compliance) Yes → use RAG

Use RAG (retrieval-augmented generation). Why it wins:

  • ✓ handles stale data
  • ✓ any corpus size
  • ✓ source attribution
  • ✓ no retraining needed

Alternatives when every answer is No:

  • Fine-tune: custom style/behavior
  • Long-context: fits in context window
  • Prompt engineering: static, small knowledge

answer Yes to any question → use RAG; answer No to all → evaluate alternatives
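The three questions can be written down as a small function, which makes the "any Yes means RAG" rule explicit. This is a sketch under assumptions: `KnowledgeProblem`, `recommend`, and the `needs_custom_style` flag are hypothetical names, and the way the No branch chooses among the alternatives (fine-tuning for style, prompt engineering under 10K tokens, long-context otherwise) is one reasonable reading of the guide rather than a rule it states.

```python
from dataclasses import dataclass

RAG_CORPUS_THRESHOLD = 100_000  # tokens, from the second question
PROMPT_ENG_LIMIT = 10_000       # tokens, "small knowledge base" cutoff


@dataclass
class KnowledgeProblem:
    changes_monthly_or_faster: bool
    corpus_tokens: int
    needs_source_attribution: bool
    needs_custom_style: bool = False  # hypothetical extra input


def recommend(problem):
    # A Yes to any of the three questions selects RAG.
    if (
        problem.changes_monthly_or_faster
        or problem.corpus_tokens > RAG_CORPUS_THRESHOLD
        or problem.needs_source_attribution
    ):
        return ["RAG"]

    # All three answers were No: list the simpler options that apply.
    options = []
    if problem.needs_custom_style:
        options.append("fine-tuning")
    if problem.corpus_tokens < PROMPT_ENG_LIMIT:
        options.append("prompt engineering")
    else:
        options.append("long-context LLM")
    return options


print(recommend(KnowledgeProblem(True, 5_000, False)))
print(recommend(KnowledgeProblem(False, 50_000, False)))
```

Returning a list keeps the No branch honest: the alternatives are not mutually exclusive, so a static 5K-token knowledge base that also needs a house style yields both fine-tuning and prompt engineering.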

Approach | Best When | Real Limitation
RAG | Data changes often, large corpus, need attribution | Complex pipeline with two independent failure modes
Long-context LLM | Corpus fits in a context window (~200K tokens), low volume | At 200K tokens: ~$0.60–$1.00 per query at current model pricing. Cheap for a demo, expensive at 10K queries/day.
Fine-tuning | Need to change model style, tone, or domain vocabulary | Doesn't reliably add factual knowledge; models still hallucinate on fine-tuned facts
Prompt Engineering | Static, small knowledge base (< 10K tokens total) | Every prompt becomes longer over time; context window and cost grow linearly with KB size
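The long-context row is easier to judge with the arithmetic spelled out. The sketch below is illustrative only: the per-million-token prices of $3 and $5 are assumptions chosen because they reproduce the table's $0.60–$1.00 range at 200K input tokens, not quotes from any provider, and output-token cost is ignored. `per_query_cost` and `daily_cost` are hypothetical helper names.

```python
def per_query_cost(context_tokens, usd_per_million_tokens):
    # Input-token cost of sending the whole corpus with every query.
    return context_tokens * usd_per_million_tokens / 1_000_000


def daily_cost(context_tokens, usd_per_million_tokens, queries_per_day):
    return per_query_cost(context_tokens, usd_per_million_tokens) * queries_per_day


CONTEXT_TOKENS = 200_000   # corpus stuffed into every prompt
QUERIES_PER_DAY = 10_000   # the volume named in the table

for price in (3.0, 5.0):   # assumed USD per million input tokens
    per_query = per_query_cost(CONTEXT_TOKENS, price)
    per_day = daily_cost(CONTEXT_TOKENS, price, QUERIES_PER_DAY)
    print(f"${price:.2f}/M tokens: ${per_query:.2f} per query, ${per_day:,.0f} per day")
```

At 10K queries/day the table's per-query range works out to roughly $6,000 to $10,000 per day, which is the point at which paying the fixed cost of a retrieval pipeline starts to look cheap.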
RAG and fine-tuning are not mutually exclusive

Fine-tune the model to follow your citation format and response style. Then use RAG to supply the facts. This combination — style via fine-tuning, knowledge via retrieval — is what most mature production systems evolve toward.