Advanced RAG
Deep dive into retrieval-augmented generation: chunking strategies, hybrid search, re-ranking, graph RAG, and production RAG pipelines.
Two-pipeline architecture for RAG: the offline indexing pipeline and the online query pipeline. Components, data flow, and when RAG is the right approach versus fine-tuning or long-context models.
Fixed-size, recursive, semantic, and document-aware chunking strategies. How chunk size affects retrieval quality, and how to choose the right approach for your data.
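A minimal sketch of the first two strategies, assuming character-based sizing (production splitters usually count tokens) — fixed-size windows with overlap, and a recursive splitter that prefers coarse separators and falls back to finer ones:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Slice text into `size`-character windows, carrying `overlap` chars forward."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def recursive_chunks(text: str, size: int = 500,
                     separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, packing pieces up to `size`;
    oversized pieces recurse with finer separators; hard cut as last resort."""
    if len(text) <= size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks, buf = [], ""
            for part in text.split(sep):
                candidate = buf + sep + part if buf else part
                if len(candidate) <= size:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    if len(part) > size:
                        # piece still too big: retry with the remaining, finer separators
                        chunks.extend(recursive_chunks(part, size, separators[i + 1:]))
                        buf = ""
                    else:
                        buf = part
            if buf:
                chunks.append(buf)
            return chunks
    return fixed_size_chunks(text, size, overlap=0)
```

The recursive variant is why paragraph boundaries usually survive chunking: the hard character cut only triggers when no separator fits.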
Comparing OpenAI, Cohere, and open-source embedding models for RAG. Dimensions, pricing, MTEB benchmarks, and Matryoshka embeddings for cost optimization.
Comparing Pinecone, Weaviate, pgvector, Qdrant, and Chroma for production RAG. Features, pricing, scaling characteristics, and when to use each.
Combining keyword search (BM25) with semantic vector search for superior retrieval. Reciprocal Rank Fusion, weighted scoring, and when keyword search beats embeddings.
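Reciprocal Rank Fusion needs only the rank positions from each retriever, which is why it works without score calibration. A sketch, assuming two ranked ID lists and the standard k=60 smoothing constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # keyword ranking
dense_hits = ["d1", "d5", "d3"]  # vector ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

A document that appears high in both lists (here d1) outranks one that tops only a single list, which is exactly the behavior you want from hybrid search.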
Using cross-encoder re-rankers to improve retrieval precision. Cohere Rerank, ColBERT, open-source re-rankers, and the cost/latency tradeoff of adding a reranking stage.
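The two-stage shape is independent of which reranker you buy: retrieve a wide candidate set cheaply, then score each (query, document) pair jointly with a more expensive model. In this sketch `score` is a toy token-overlap stand-in so the code runs without model weights — in practice it would be a sentence-transformers CrossEncoder or a Cohere Rerank call:

```python
def rerank(query: str, candidates: list[str], score, top_k: int = 3) -> list[str]:
    """Second stage of retrieve-then-rerank: score every (query, doc)
    pair with `score` and keep the best `top_k`."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_k]

def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in scorer (token overlap). A real cross-encoder reads
    both texts through one transformer and is far more accurate."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

The latency cost is linear in the candidate count, which is why the first stage typically hands the reranker 50-100 candidates, not thousands.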
Techniques to improve retrieval by transforming user queries before search: HyDE, multi-query expansion, step-back prompting, and query decomposition.
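Two of these transformations sketched with a hypothetical `llm` callable (any prompt-in, text-out function) standing in for a real model call:

```python
def expand_queries(question: str, llm) -> list[str]:
    """Multi-query expansion: ask the LLM for paraphrases, search with
    all of them, and union the results. `llm` is a stand-in callable."""
    prompt = ("Rewrite the question below in 3 different ways, one per line.\n"
              f"Question: {question}")
    rewrites = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [question] + rewrites

def hyde_query(question: str, llm) -> str:
    """HyDE: embed a hypothetical *answer* instead of the question, since
    answer-shaped text sits closer to real documents in embedding space."""
    return llm(f"Write a short passage that answers: {question}")
```

Both techniques trade an extra LLM call for better recall; HyDE helps most when questions and documents are phrased very differently.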
Using metadata to narrow search scope before vector similarity. Attaching metadata during indexing, pre-filtering, self-querying retrievers, and combining filters with semantic search.
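Pre-filtering in one picture, assuming a toy in-memory store with exact-match metadata (vector databases expose the same idea through a filter argument on the query):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, docs, filters: dict, top_k: int = 2):
    """Pre-filter on metadata, then rank only the survivors by similarity."""
    survivors = [d for d in docs
                 if all(d["meta"].get(k) == v for k, v in filters.items())]
    return sorted(survivors, key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:top_k]

docs = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"year": 2024, "team": "infra"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"year": 2023, "team": "infra"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"year": 2024, "team": "sales"}},
]
hits = filtered_search([1.0, 0.0], docs, {"year": 2024})
```

Note that the highly similar `b` never appears: the filter removed it before similarity was ever computed, which is the point — and the failure mode to watch for when filters are too strict.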
Handling questions that require combining information from multiple documents. Iterative retrieval, query decomposition into retrieval steps, and LangGraph-based multi-hop patterns.
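The iterative-retrieval skeleton, with `decompose`, `retrieve`, and `synthesize` as hypothetical callables standing in for the LLM and search components a framework like LangGraph would wire up:

```python
def multi_hop_answer(question: str, decompose, retrieve, synthesize):
    """Multi-hop RAG: split the question into ordered retrieval steps;
    each `retrieve` call sees the evidence gathered so far, so an entity
    found in hop 1 can parameterize the hop-2 query."""
    evidence: list[str] = []
    for step in decompose(question):
        evidence.extend(retrieve(step, evidence))
    return synthesize(question, evidence)
```

The key design point is that retrieval steps are sequential, not parallel: "where was the founder of X born" cannot be answered until hop 1 has identified the founder.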
Knowledge graphs for RAG: structured relationships vs semantic similarity, Microsoft's Graph RAG approach, building knowledge graphs from documents, and combining graph traversal with vector search.
Moving from static RAG pipelines to agent-driven retrieval. The agent decides what to retrieve, when, from which source, and evaluates retrieval quality with self-reflection.
Handling multi-turn conversations in RAG: resolving follow-up questions, history-aware retrieval, coreference resolution, and context window management across turns.
RAG beyond text: indexing images, tables, and diagrams from documents. PDF processing, multi-vector retrieval, and using vision models for table and image understanding.
Corrective RAG adds document grading and question rewriting to the retrieval loop — if retrieved documents don't answer the question, the system rewrites the query and tries again.
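The corrective loop in skeleton form — `retrieve`, `grade`, `rewrite`, and `generate` are hypothetical stand-ins for the retriever, a relevance-grading LLM call, a query-rewriting LLM call, and the answer model:

```python
def corrective_rag(question, retrieve, grade, rewrite, generate, max_tries=3):
    """CRAG loop: retrieve, grade each document against the original
    question, and if nothing passes, rewrite the query and retry."""
    query = question
    for _ in range(max_tries):
        docs = retrieve(query)
        relevant = [d for d in docs if grade(question, d)]
        if relevant:
            return generate(question, relevant)
        query = rewrite(query)
    return generate(question, [])  # give up on retrieval; answer unsupported
```

Grading against the original question (not the rewritten query) is deliberate: the rewrite exists only to improve search, while relevance is always judged against what the user actually asked.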
Route queries to different retrieval sources based on classification — vector stores, SQL databases, APIs, or specialized indexes — for optimal answers from the right source.
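A deliberately tiny keyword router to show the shape of the decision; production routers replace the keyword test with an LLM classification call or a trained classifier, and the route names here are assumptions:

```python
def route(query: str) -> str:
    """Pick a retrieval backend for the query. Toy keyword heuristic:
    aggregation words -> SQL, freshness words -> live API, else vectors."""
    q = query.lower()
    if any(w in q for w in ("sum", "count", "average", "how many")):
        return "sql"
    if any(w in q for w in ("current", "today", "latest")):
        return "api"
    return "vector_store"
```

Whatever the classifier, the contract is the same: one cheap decision up front so each query pays only for the backend that can actually answer it.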
Building production ingestion pipelines for RAG: batch vs streaming, incremental updates, change detection, and pipeline orchestration with Airflow and Prefect.
Measuring RAG quality systematically: retrieval metrics (precision, recall, MRR, NDCG), generation metrics (faithfulness, relevance), the RAGAS framework, and building golden evaluation datasets.
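The four retrieval metrics are short enough to write out directly, assuming binary relevance judgments (a set of relevant IDs per query):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit (0 if none)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """DCG of the ranking divided by the DCG of a perfect ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal
```

Run these over a golden dataset of (query, relevant-IDs) pairs and you can compare chunkers, embedders, and rerankers with numbers instead of vibes; generation-side metrics like faithfulness then come from a framework such as RAGAS.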
Systematic approach to diagnosing RAG failures: is it a retrieval problem or a generation problem? Common failure modes, debugging toolkit, and fixing the most frequent issues.
Reducing RAG costs and latency in production: embedding caching, dimensionality reduction, vector quantization, context stuffing strategies, and model tiering.
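Two of these levers sketched in a few lines — a content-hash embedding cache (identical chunks are embedded once, so re-ingestion is nearly free) and int8 scalar quantization (4 bytes per dimension down to 1). The `embed` callable is a stand-in for whatever paid embedding API you use:

```python
import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed) -> list[float]:
    """Embed each distinct chunk exactly once, keyed by content hash."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)
    return _cache[key]

def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Scalar quantization to int8, returning the scale needed to
    approximately reconstruct the original values (x ~= q * scale)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale
```

Vector databases apply the same quantization idea internally (often with per-segment calibration); the sketch just makes the 4x memory saving and the small reconstruction error concrete.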
Fine-tune embedding models on your domain data to improve retrieval quality, often by 10-25% on domain benchmarks, using contrastive learning with query-document pairs mined from your actual search logs.
Matryoshka embeddings let you truncate vectors to a shorter prefix, trading a little quality for large storage and speed savings. Use 256 dims for fast filtering, 1536 dims for final ranking.
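The truncation itself is trivial, assuming a Matryoshka-trained model (plain embeddings degrade badly when cut): keep the prefix and re-normalize so cosine similarity still behaves.

```python
import math

def truncate_normalize(vec: list[float], dims: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dims` components
    and rescale to unit length for cosine comparisons."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

A typical two-pass search stores the 256-dim prefixes in the hot index, retrieves a wide candidate set with them, then re-scores the candidates with the full-length vectors.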
Three architectures for semantic retrieval: bi-encoders for fast search, cross-encoders for precise reranking, and ColBERT for the best of both — understand when to use each.