Integrations/Knowledge
★ OverviewIntermediate22 min

RAG: When to Use It, How to Build It, and How It Breaks

The first RAG decision is whether to use RAG at all — with 200K+ token context windows, it's a choice, not a given. This article covers the RAG-vs-long-context decision framework with cost math, building an indexing and retrieval pipeline, evaluation with concrete thresholds, production failure modes, monitoring, and a production-shaped LangGraph reference implementation.

Quick Reference

  • RAG = Retrieve relevant documents, Augment the prompt, Generate a grounded answer — but only use it when long-context won't do
  • With 200K+ token context windows, corpora under 50K tokens at low query volume are often cheaper with long-context than RAG infrastructure
  • RAG becomes cost-effective at scale: 500 queries/day on an 80K-token corpus costs ~$6/day with RAG vs. ~$120/day with long-context
  • Chunk at 500–1000 tokens with 10–20% overlap; embed with text-embedding-3-small ($0.02/1M tokens); store in Chroma (dev) or Pinecone/pgvector (prod)
  • Evaluate retrieval (precision@5 > 0.7, recall@10 > 0.8) separately from generation (faithfulness > 0.9) — bad retrieval guarantees bad answers
  • Monitor: retrieval score trend, zero-result rate, P95 latency, index freshness — RAG fails silently, not with errors
  • Add a retrieval quality gate: if no document scores above the similarity threshold, return a fallback instead of generating from irrelevant context
  • Start with naive RAG; add hybrid search, reranking, or query rewriting only after evaluation reveals a specific gap

Should I Use RAG at All?

RAG is a choice, not a default

Claude has a 200K-token context window. Gemini has 1M tokens. The assumption that made RAG necessary in 2023 — 'you can't fit your knowledge base in context' — no longer holds for most corpora. The first RAG decision is whether to build RAG infrastructure at all, or whether long-context stuffing is simpler, cheaper, and more accurate for your specific case.

NoNoNoYes →Yes →Yes →Data changes monthly or more often?docs, policies, pricing, knowledge baseCorpus exceeds 100K tokens?too large to stuff into a context windowAnswers must cite their source?attribution, auditability, complianceUse RAGretrieval-augmentedgenerationWhy it wins:✓ handles stale data✓ any corpus size✓ source attribution✓ no retraining neededFine-tunecustom style/behaviorLong-contextfits in context windowPrompt Eng.static, small knowledge

answer Yes to any question → use RAG; answer No to all → evaluate alternatives

FactorUse RAGUse Long-ContextWhy It Matters
Corpus size> 200K tokens (larger than context window)Fits in context windowAbove the limit, long-context is physically impossible
Corpus stabilityChanges frequently; re-index incrementallyStable; no change pipeline neededStale context is worse than no context
Query volume> 500/day on same corpus< 100/dayHigh volume makes RAG's embed-once/retrieve-many economics win
Need source attributionChunk metadata gives citationsNo natural citation mechanismRegulated domains require knowing which document answered the query
Need full-corpus reasoningAvoid — RAG returns fragmentsModel sees everythingMulti-hop reasoning across the whole corpus requires the whole corpus
Latency budget> 1s acceptable< 500ms requiredRetrieval adds 100–300ms; matters for real-time applications
Corpus size 50K–200K tokensRAG works but may be overkillWorks, higher per-query costRun the cost math below — depends on your query volume

Cost math: 80K-token corpus, 500 queries/day, Claude Sonnet at $3/1M input tokens. Long-context: 80K × 500 = 40M tokens/day × $3/1M = $120/day. RAG: embed query (negligible at $0.02/1M) + vector search (negligible) + 4 chunks × 500 tokens × 500 queries = 1M tokens/day × $3/1M = $3/day plus ~$0.01/day embedding cost. RAG is ~40× cheaper at this scale. But for a 5K-token corpus at 20 queries/day: long-context = $0.30/day. RAG adds indexing infrastructure, a reindexing pipeline, and a vector store for $0.28/day in savings. The math doesn't justify it.

Three cases where RAG is the wrong answer

1. Your corpus fits in the context window and query volume is under 100/day — just include it in the system prompt. 2. You need the model to reason across the entire corpus simultaneously — RAG retrieves fragments, and cross-document reasoning requires the whole picture. 3. Your 'knowledge base' is a single document under 10K tokens — include it directly.