Hybrid Search
Hybrid search (BM25 + semantic) outperforms either method alone on mixed-query corpora — but not on every corpus. This article covers when to add it, how to fuse results with RRF, how to measure improvement, when LangChain's in-memory BM25 breaks, and how reranking fits in.
Quick Reference
- Hybrid search = BM25 keyword search + vector semantic search, with results merged via Reciprocal Rank Fusion (RRF)
- BM25 excels at exact matches: error codes, product IDs, acronyms, proper nouns
- Semantic search excels at meaning: paraphrases, synonyms, conceptual questions
- RRF score = sum of 1/(k + rank) across retrievers; k = 60 is the default from the original 2009 paper
- LangChain's BM25Retriever is in-memory only: the index must be rebuilt on every restart, and it breaks down past ~100K docs
- After hybrid retrieval: add a cross-encoder reranker for precision; skip it if the latency budget is tight
- Tune weights against a labeled query set; 0.4/0.6 is a starting point, not a universal answer
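The RRF formula in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function name `rrf_fuse`, the document IDs, and the optional `weights` parameter are hypothetical, and the weighted variant is only one plausible way to apply per-retriever weights.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, weights=None):
    """Fuse ranked lists of doc IDs with (optionally weighted) Reciprocal Rank Fusion.

    rankings: list of ranked lists, best hit first, one list per retriever.
    k:        smoothing constant; 60 is the value from the 2009 paper.
    weights:  optional per-retriever multipliers (an assumption of this sketch;
              plain RRF weights every retriever equally).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks are 1-based: the top hit contributes weight / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["doc_err42", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_err42"]

fused = rrf_fuse([bm25_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF uses only ranks, it never has to reconcile BM25 scores with cosine similarities, which live on different scales. To try the 0.4/0.6 starting point, pass `weights=[0.4, 0.6]`; which retriever gets which weight is something to settle against your labeled queries.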
When Hybrid Search Is Overkill
Add complexity only when you've confirmed the failure mode. Hybrid search adds indexing overhead, two retrieval paths, and a fusion step. Before adding it, check whether you actually have the retrieval gap it solves.
| Corpus / Query Pattern | Use Hybrid? | Reason |
|---|---|---|
| Pure Q&A knowledge base, users ask in natural language | No — semantic alone is fine | No exact-term queries; BM25 adds noise |
| Product catalog searched by SKU / model number | No — BM25 alone is fine | No semantic gap; embeddings waste compute |
| Mixed: some users type error codes, others ask questions | Yes | Classic hybrid case — both gaps exist |
| Code documentation + developer queries | Yes | Exact function/symbol matches + semantic intent |
| Legal / compliance docs with exact citations | Yes | Statute numbers and conceptual questions coexist |
| Chat history over tiny corpus (<500 docs) | No — doesn't matter | Recall is fine with either; not worth the complexity |
If you add hybrid search without measuring recall before and after, you cannot tell whether it helped. Run 20 representative queries through BM25-only and semantic-only retrieval, score recall@5 manually, then add hybrid and remeasure.
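Once the relevant documents for each query are labeled by hand, the before/after comparison is a small calculation. The sketch below assumes each retriever's output is stored as a dict mapping a query ID to a ranked list of doc IDs; the function names, query IDs, and doc IDs are all hypothetical, and the two-query data set is only there to keep the example short.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the labeled relevant doc IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(run, labels, k=5):
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(run[q], labels[q], k) for q in labels) / len(labels)


# Hand-labeled relevant docs per query (hypothetical).
labels = {"q1": {"d1", "d2"}, "q2": {"d7"}}

# Ranked results from each retriever, best hit first (hypothetical).
bm25_run = {"q1": ["d1", "d9", "d3", "d4", "d5", "d2"], "q2": ["d7", "d8"]}
semantic_run = {"q1": ["d2", "d1", "d3"], "q2": ["d5", "d6"]}

for name, run in [("bm25", bm25_run), ("semantic", semantic_run)]:
    print(name, mean_recall_at_k(run, labels))
```

Run the same function over the hybrid results and compare all three numbers. Looking at the per-query scores as well as the mean shows which query types each retriever misses.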