
RAG Architecture Deep Dive

RAG is best understood as a two-pipeline architecture: an offline indexing pipeline and an online query pipeline. This section covers the components of each, how data flows between them, and when RAG is the right approach versus fine-tuning or long-context models.

Quick Reference

  • RAG has two pipelines: indexing (offline, batch) and query (online, real-time)
  • Indexing pipeline: load → split → embed → store in vector database
  • Query pipeline: embed query → retrieve → (optional rerank) → generate answer
  • RAG beats fine-tuning when data changes frequently or you need source attribution
  • Long-context models reduce but don't eliminate the need for RAG — cost and latency still matter
  • The retriever is the most critical component — bad retrieval guarantees bad answers

Two-Pipeline Architecture

Every production RAG system is actually two separate pipelines that share a vector store. The indexing pipeline runs offline (or on a schedule) and converts raw documents into searchable embeddings. The query pipeline runs in real-time and uses those embeddings to find relevant context before generating an answer. Understanding this separation is fundamental — the indexing pipeline is a data engineering problem, while the query pipeline is an inference-time optimization problem.

Indexing Pipeline (Offline)

Document Loaders → Text Splitters → Embedding Model → Vector Store. This runs when new documents arrive. It's batch-oriented, can be slow, and is optimized for throughput. You run this once per document, not once per query.
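The four stages above can be sketched in plain Python. This is a minimal, illustrative sketch: a trigram-hashing trick stands in for a real embedding model, a plain list stands in for the vector database, and the `embed`, `split`, and `index_documents` names are hypothetical rather than from any particular library.

```python
import hashlib
import math

def embed(text, dim=256):
    """Toy stand-in for an embedding model: hash character trigrams
    into a fixed-size unit vector. A real pipeline would call an
    embedding model here (the throughput bottleneck)."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].lower()
        bucket = int(hashlib.md5(trigram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def split(document, chunk_size=200):
    """Naive fixed-width splitter; production splitters respect
    sentence and paragraph boundaries."""
    return [document[i:i + chunk_size]
            for i in range(0, len(document), chunk_size)]

def index_documents(documents):
    """Load -> split -> embed -> store: returns an in-memory
    'vector store' as a list of (embedding, chunk) pairs."""
    store = []
    for doc in documents:
        for chunk in split(doc):
            store.append((embed(chunk), chunk))
    return store
```

Note that the whole thing is batch-shaped: it iterates over documents with no user in the loop, which is why it can tolerate rate limits and retries that would be unacceptable at query time.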

Query Pipeline (Online)

User Query → Query Embedding → Vector Search → (Optional: Rerank) → Context Assembly → LLM Generation → Answer. This runs on every user request. It must be fast (< 2 seconds total) and is optimized for latency and relevance.
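A matching sketch of the query side, under the same assumptions as before (toy trigram embedding, in-memory list as the vector store; `retrieve` and `build_prompt` are illustrative names). The final LLM call is stubbed out as the returned prompt string:

```python
import hashlib
import math

def embed(text, dim=256):
    """Same toy trigram hashing as on the indexing side — both
    pipelines must share one embedding model, or vector search
    compares incompatible spaces."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, store, top_k=2):
    """Vector search: rank stored chunks by cosine similarity to the
    query. Embeddings are unit vectors, so dot product == cosine."""
    q = embed(query)
    scored = sorted(store, key=lambda pair: -sum(a * b for a, b in zip(q, pair[0])))
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(query, store, top_k=2):
    """Context assembly: retrieved chunks become the grounding context
    for the LLM generation step (stubbed as a prompt string here)."""
    context = "\n---\n".join(retrieve(query, store, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Everything here sits on the request path, which is why production systems use approximate nearest-neighbor indexes rather than the exhaustive sort shown above.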

| Aspect         | Indexing Pipeline        | Query Pipeline                |
| -------------- | ------------------------ | ----------------------------- |
| Runs           | Offline / scheduled      | Real-time per request         |
| Optimized for  | Throughput               | Latency                       |
| Bottleneck     | Embedding API rate limits| Vector search + LLM generation|
| Failure impact | Stale or missing data    | Wrong or no answer            |
| Cost driver    | Embedding tokens         | LLM tokens + vector queries   |