Intermediate · 16 min

Hybrid Search

Hybrid search (BM25 + semantic) outperforms either method alone on mixed-query corpora — but not on every corpus. This article covers when to add it, how to fuse results with RRF, how to measure improvement, when LangChain's in-memory BM25 breaks, and how reranking fits in.

Quick Reference

  • Hybrid search = BM25 keyword search + vector semantic search, results merged via Reciprocal Rank Fusion (RRF)
  • BM25 excels at exact matches: error codes, product IDs, acronyms, proper nouns
  • Semantic search excels at meaning: paraphrases, synonyms, conceptual questions
  • RRF score = sum of 1/(k + rank) across retrievers; k=60 from the original 2009 paper
  • LangChain's BM25Retriever is in-memory only: the index must be rebuilt on every restart, and it breaks down past ~100K docs
  • After hybrid retrieval: add a cross-encoder reranker for precision; skip if latency budget is tight
  • Tune weights against a labeled query set — 0.4/0.6 is a starting point, not a universal answer
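
The RRF formula from the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function name `rrf_fuse`, the document IDs, and the optional `weights` argument (a weighted variant that scales each retriever's contribution) are assumptions made for the example. With the default weights it reduces to the plain sum of 1/(k + rank).

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, weights=None):
    """Fuse ranked lists of doc IDs (best first) with Reciprocal Rank Fusion.

    rankings: one ranked list per retriever.
    weights:  optional per-retriever multipliers (assumed extension);
              defaults to 1.0 each, which gives plain RRF.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks are 1-based, so the top hit contributes weight / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers for one query.
bm25_hits = ["doc_err42", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_err42"]

fused = rrf_fuse([bm25_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
# Order: doc_a, doc_err42, doc_c, doc_b
```

Documents that both retrievers return rise to the top even when neither ranked them first, which is the behavior that makes RRF useful without any score normalization. Passing something like `weights=[0.4, 0.6]` is one way to experiment with the starting point mentioned above.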

When Hybrid Search Is Overkill

Add complexity only when you've confirmed the failure mode. Hybrid search adds indexing overhead, two retrieval paths, and a fusion step. Before adding it, check whether you actually have the retrieval gap it solves.

Corpus / Query Pattern | Use Hybrid? | Reason
Pure Q&A knowledge base, users ask in natural language | No: semantic alone is fine | No exact-term queries; BM25 adds noise
Product catalog searched by SKU / model number | No: BM25 alone is fine | No semantic gap; embeddings waste compute
Mixed: some users type error codes, others ask questions | Yes | Classic hybrid case; both gaps exist
Code documentation + developer queries | Yes | Exact function/symbol matches + semantic intent
Legal / compliance docs with exact citations | Yes | Statute numbers and conceptual questions coexist
Chat history over tiny corpus (<500 docs) | No: doesn't matter | Recall is fine with either; not worth the complexity

Measure first

If you add hybrid search without measuring recall before and after, you have no idea whether it helped. Run 20 representative queries through BM25-only and semantic-only retrieval, score recall@5 by hand (the fraction of each query's known-relevant documents that appear in the top 5 results), then add hybrid and remeasure on the same queries.
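
A before/after comparison like this needs very little code. The sketch below assumes you have labeled each query by hand with the set of document IDs that count as relevant. The helper names (`recall_at_k`, `mean_recall_at_k`), the queries, and the document IDs are all hypothetical, and the two retrievers are stubbed out as dictionary lookups so the example runs on its own.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant doc")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(retrieve, labeled_queries, k=5):
    """Average recall@k over a labeled set.

    retrieve:        callable mapping a query string to a ranked list of doc IDs.
    labeled_queries: dict of query string -> set of relevant doc IDs.
    """
    scores = [
        recall_at_k(retrieve(query), relevant, k)
        for query, relevant in labeled_queries.items()
    ]
    return sum(scores) / len(scores)


# Hand-labeled queries (hypothetical; the article suggests about 20).
labeled = {
    "error E4012": {"d1"},
    "how do I reset my password": {"d2", "d3"},
}

# Stand-ins for real retrievers: canned ranked results per query.
bm25_runs = {
    "error E4012": ["d1", "d9"],
    "how do I reset my password": ["d7", "d2"],
}
semantic_runs = {
    "error E4012": ["d8", "d9"],
    "how do I reset my password": ["d2", "d3"],
}

bm25_recall = mean_recall_at_k(lambda q: bm25_runs[q], labeled)
semantic_recall = mean_recall_at_k(lambda q: semantic_runs[q], labeled)
print(f"BM25-only recall@5:     {bm25_recall:.2f}")
print(f"Semantic-only recall@5: {semantic_recall:.2f}")
```

In this toy data each retriever misses a query the other one handles, which is the pattern that suggests hybrid search is worth trying. Swapping the lambdas for real retriever calls, and adding a third run for the hybrid pipeline, gives the comparison on the same labeled set.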