
RAG Architecture Deep Dive

RAG is best understood as a two-pipeline architecture: an offline indexing pipeline and an online query pipeline. This section covers the components of each, how data flows between them, and when RAG is the right approach versus fine-tuning or long-context models.

Quick Reference

  • RAG has two pipelines: indexing (offline, batch) and query (online, real-time)
  • Indexing pipeline: load → split → embed → store in vector database
  • Query pipeline: embed query → retrieve → (optional rerank) → generate answer
  • RAG beats fine-tuning when data changes frequently or you need source attribution
  • Long-context models reduce but don't eliminate the need for RAG — cost and latency still matter
  • The retriever is the most critical component — bad retrieval guarantees bad answers

Two-Pipeline Architecture

Every production RAG system is actually two separate pipelines that share a vector store. The indexing pipeline runs offline (or on a schedule) and converts raw documents into searchable embeddings. The query pipeline runs in real-time and uses those embeddings to find relevant context before generating an answer. Understanding this separation is fundamental — the indexing pipeline is a data engineering problem, while the query pipeline is an inference-time optimization problem.

Indexing Pipeline (Offline)

Document Loaders → Text Splitters → Embedding Model → Vector Store. This runs when new documents arrive. It's batch-oriented, can be slow, and is optimized for throughput. You run this once per document, not once per query.
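The four stages above can be sketched in plain Python. This is a minimal, illustrative sketch: a trigram-hashing trick stands in for a real embedding model, a plain list stands in for the vector database, and the `embed`, `split`, and `index_documents` names are hypothetical rather than from any particular library.

```python
import hashlib
import math

def embed(text, dim=256):
    """Toy stand-in for an embedding model: hash character trigrams
    into a fixed-size unit vector. A real pipeline would call an
    embedding model here (the throughput bottleneck)."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].lower()
        bucket = int(hashlib.md5(trigram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def split(document, chunk_size=200):
    """Naive fixed-width splitter; production splitters respect
    sentence and paragraph boundaries."""
    return [document[i:i + chunk_size]
            for i in range(0, len(document), chunk_size)]

def index_documents(documents):
    """Load -> split -> embed -> store: returns an in-memory
    'vector store' as a list of (embedding, chunk) pairs."""
    store = []
    for doc in documents:
        for chunk in split(doc):
            store.append((embed(chunk), chunk))
    return store
```

Note that the whole thing is batch-shaped: it iterates over documents with no user in the loop, which is why it can tolerate rate limits and retries that would be unacceptable at query time.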

Query Pipeline (Online)

User Query → Query Embedding → Vector Search → (Optional: Rerank) → Context Assembly → LLM Generation → Answer. This runs on every user request. It must be fast (< 2 seconds total) and is optimized for latency and relevance.
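A matching sketch of the query side, under the same assumptions as before (toy trigram embedding, in-memory list as the vector store; `retrieve` and `build_prompt` are illustrative names). The final LLM call is stubbed out as the returned prompt string:

```python
import hashlib
import math

def embed(text, dim=256):
    """Same toy trigram hashing as on the indexing side — both
    pipelines must share one embedding model, or vector search
    compares incompatible spaces."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, store, top_k=2):
    """Vector search: rank stored chunks by cosine similarity to the
    query. Embeddings are unit vectors, so dot product == cosine."""
    q = embed(query)
    scored = sorted(store, key=lambda pair: -sum(a * b for a, b in zip(q, pair[0])))
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(query, store, top_k=2):
    """Context assembly: retrieved chunks become the grounding context
    for the LLM generation step (stubbed as a prompt string here)."""
    context = "\n---\n".join(retrieve(query, store, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Everything here sits on the request path, which is why production systems use approximate nearest-neighbor indexes rather than the exhaustive sort shown above.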

| Aspect         | Indexing Pipeline        | Query Pipeline                |
| -------------- | ------------------------ | ----------------------------- |
| Runs           | Offline / scheduled      | Real-time per request         |
| Optimized for  | Throughput               | Latency                       |
| Bottleneck     | Embedding API rate limits| Vector search + LLM generation|
| Failure impact | Stale or missing data    | Wrong or no answer            |
| Cost driver    | Embedding tokens         | LLM tokens + vector queries   |