Conversational RAG
How to make RAG handle multi-turn conversations: when to add it, how query condensation works under the hood, the LangGraph and legacy chain implementations, context budget math, failure modes, and the production architecture.
Quick Reference
- Follow-up questions are incomplete without context: 'What about the pricing?' retrieves nothing useful without knowing which product
- Query condensation rewrites a follow-up plus the conversation history into a standalone query; this is the only new step vs. single-turn RAG
- Condensation subsumes coreference resolution, so no separate pronoun-replacement step is needed
- LangGraph with checkpointers (MemorySaver, PostgresSaver) is the recommended approach for stateful conversational RAG
- create_history_aware_retriever and create_retrieval_chain moved to langchain-classic in LangChain 1.0; new projects use LangGraph
- Every conversational turn costs at minimum 2 LLM calls (condensation + generation), so measure this before adding it
- Context budget is the key tradeoff: history grows ~300 tokens per turn, directly competing with retrieved docs
- Always log the condensed query: it is the single most important debugging signal for conversational RAG
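To make the condensation step concrete, here is a minimal sketch of it in plain Python. It is not the LangGraph or langchain-classic implementation. The prompt wording, the `condense_query` name, and the `llm` callable (any function that takes a prompt string and returns text) are all assumptions made for illustration. It also logs the condensed query, as recommended above.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("conversational_rag")

# Hypothetical prompt wording; tune it for your own model.
CONDENSE_TEMPLATE = (
    "Given the conversation so far and a follow-up question, rewrite the "
    "follow-up as a standalone question that can be understood without "
    "the conversation.\n\n"
    "Conversation:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def condense_query(
    history: List[Tuple[str, str]],  # (role, text) pairs, oldest first
    question: str,
    llm: Callable[[str], str],  # assumed interface: prompt in, text out
) -> str:
    """Rewrite a follow-up into a standalone query using the history."""
    rendered = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = CONDENSE_TEMPLATE.format(history=rendered, question=question)
    condensed = llm(prompt).strip()
    # Log both forms so a bad retrieval can be traced back to a bad rewrite.
    logger.info("condensed_query=%r original=%r", condensed, question)
    return condensed
```

The condensed string, not the raw follow-up, is what gets passed to the retriever. Because the rewrite sees the whole history, pronouns and elided subjects are resolved in the same call.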
When You Don't Need Conversational RAG
Conversational RAG is needed when: (1) users have multi-turn sessions, (2) follow-up questions reference prior context, and (3) those follow-ups trigger new retrieval. All three conditions must be true. If your system handles one-shot queries, or if users clarify their input without needing new documents retrieved, you don't need this — and adding it wastes compute on every turn after the first.
| Pattern | Example | Need Conversational RAG? | Simpler Alternative |
|---|---|---|---|
| Single-turn Q&A | Search docs, get answer, done | No | Standard RAG |
| Multi-turn clarifications | User rephrases same question | No — no new retrieval needed | Pass full rephrased question as-is |
| Multi-turn topic continuity | 'Tell me about Lambda' → 'What about pricing?' | Yes | — |
| Multi-turn topic shifts | Lambda pricing → then asks about DynamoDB | Yes, but history weight should drop on shift | Detect shift, truncate history |
Every conversational turn costs at minimum 2 LLM calls: one to condense the query, one to generate the answer. A 10-turn conversation costs 20+ LLM calls. If your analytics show >60% of sessions have only 1 turn, you're paying for infrastructure that benefits fewer than 40% of sessions. Check your session length distribution before adding conversational RAG.
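The call-count arithmetic above can be sketched as a small helper. The function names are hypothetical. The single-turn baseline of 1 LLM call per turn (generation only) is an assumption that follows from condensation being the only new step.

```python
from typing import List

CALLS_PER_TURN_CONVERSATIONAL = 2  # condensation + generation (the stated minimum)
CALLS_PER_TURN_SINGLE = 1  # assumed baseline: generation only


def llm_calls(turns: int, calls_per_turn: int = CALLS_PER_TURN_CONVERSATIONAL) -> int:
    """Minimum number of LLM calls for a conversation of `turns` turns."""
    return turns * calls_per_turn


def added_calls(session_lengths: List[int]) -> int:
    """Extra LLM calls conversational RAG adds across a set of sessions."""
    total_turns = sum(session_lengths)
    return (
        llm_calls(total_turns, CALLS_PER_TURN_CONVERSATIONAL)
        - llm_calls(total_turns, CALLS_PER_TURN_SINGLE)
    )


# A 10-turn conversation needs at least 20 calls.
print(llm_calls(10))  # 20
# Five sessions totalling 16 turns add 16 condensation calls.
print(added_calls([1, 1, 1, 3, 10]))  # 16
```

Multiplying the added calls by your per-call latency and price gives a rough lower bound on what the feature costs. Retries, reranking, or other extra steps would push the real figure higher.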
Three signals you actually need it: your logs show >40% of sessions have 3+ turns; users report 'it forgot what I asked'; or your retrieval logs show follow-up queries returning irrelevant results. If none of these are true, don't build it yet.
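The first signal can be checked directly from session logs. The sketch below assumes you can already extract a list of turn counts, one per session. The function names and the shape of that input are hypothetical. The 40% threshold is the one quoted above, and the other two signals (user reports and irrelevant follow-up retrievals) still need manual review.

```python
from typing import Dict, List


def session_signals(session_lengths: List[int]) -> Dict[str, float]:
    """Share of single-turn sessions and of sessions with 3+ turns."""
    n = len(session_lengths)
    if n == 0:
        raise ValueError("need at least one session")
    single = sum(1 for turns in session_lengths if turns == 1) / n
    three_plus = sum(1 for turns in session_lengths if turns >= 3) / n
    return {"single_turn_share": single, "three_plus_share": three_plus}


def meets_turn_signal(session_lengths: List[int], threshold: float = 0.40) -> bool:
    """True when more than `threshold` of sessions have 3+ turns."""
    return session_signals(session_lengths)["three_plus_share"] > threshold


# Mostly one-shot traffic: 70% single-turn, 20% with 3+ turns.
print(meets_turn_signal([1, 1, 1, 1, 1, 1, 1, 2, 3, 5]))  # False
```

Running this on a representative sample of recent sessions before any design work keeps the build decision tied to measured usage.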