Re-Ranking
Re-ranking is a second retrieval stage that adds precision without sacrificing recall. This article covers when it's worth the cost, which rerankers are current in 2026, the cost math at scale, how to measure real impact on your corpus, and how to build a pipeline that degrades gracefully when the reranker fails.
Quick Reference
- Bi-encoders optimize for recall: fast but imprecise. Cross-encoders optimize for precision: slow but accurate.
- The production pattern: retrieve top-25 with embeddings (recall), rerank to top-5 with a cross-encoder (precision), pass to the LLM.
- 2026 API leaders: Cohere rerank-v3.5 (~100ms, $0.002/query) and Jina Reranker v3 (~40ms). Open-source leader: BAAI/bge-reranker-v2-m3.
- Cost math matters: at $0.002/query, Cohere costs ~$60/month at 1K queries/day and ~$6,000/month at 100K/day. Self-hosted A10G breaks even around 10K queries/day.
- Never put a reranker in the hot path without a circuit breaker. Otherwise, API downtime breaks your entire retrieval pipeline.
- Benchmark rerankers on your corpus, not BEIR. Run 30-50 labeled queries through both paths (with and without reranking) and compare precision@5 and MRR.
- Conditional reranking (skipping the reranker when the top retrieval score clearly dominates) reduces cost without measurable precision loss on easy queries.
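The circuit-breaker and conditional-reranking points above can be sketched together in a few lines. This is a minimal illustration, not a production implementation: `CircuitBreaker`, `retrieve_and_rerank`, the `rerank` callable, and the `dominance_margin` value of 0.15 are all hypothetical names and numbers chosen for the example. Tune the margin against your own score distribution.

```python
import time
from typing import Callable, List, Optional, Tuple

Scored = Tuple[str, float]  # (doc_id, retrieval score), sorted best-first


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows calls again after `cooldown_s`."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls go through again;
        # a further failure re-opens the breaker immediately.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


def retrieve_and_rerank(
    query: str,
    candidates: List[Scored],
    rerank: Callable[[str, List[str]], List[str]],  # hypothetical reranker wrapper
    breaker: CircuitBreaker,
    top_k: int = 5,
    dominance_margin: float = 0.15,  # assumed threshold, not from the article
) -> List[str]:
    ids = [doc_id for doc_id, _ in candidates]

    # Conditional reranking: skip when the top score clearly beats the runner-up.
    if len(candidates) < 2 or candidates[0][1] - candidates[1][1] >= dominance_margin:
        return ids[:top_k]

    # Breaker open: degrade to plain retrieval order instead of failing the query.
    if not breaker.allow():
        return ids[:top_k]

    try:
        ranked = rerank(query, ids)
    except Exception:
        breaker.record(ok=False)
        return ids[:top_k]

    breaker.record(ok=True)
    return ranked[:top_k]
```

In this sketch, a reranker failure never surfaces to the caller: the query falls back to embedding order, which costs precision but keeps the pipeline answering.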
When Re-Ranking Is a Waste of Time
Add reranking only when you've confirmed the retrieval failure mode it solves. Reranking adds latency (~50–150ms), API cost ($0.002/query for Cohere), and an external dependency to every query. None of that is worth it unless your precision problem is real. Check first.
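To make the per-query cost concrete, the arithmetic can be sketched as follows. The $0.002/query price is the article's Cohere figure; the 30-day month and the function names are assumptions for illustration. The $600/month self-hosting figure is not a quoted GPU price: it is simply the monthly cost implied by a break-even of about 10K queries/day at that API price.

```python
def monthly_api_cost(queries_per_day: float,
                     price_per_query: float = 0.002,
                     days: int = 30) -> float:
    """Monthly API spend in USD for a per-query priced reranker."""
    return queries_per_day * days * price_per_query


def break_even_queries_per_day(self_hosted_monthly_usd: float,
                               price_per_query: float = 0.002,
                               days: int = 30) -> float:
    """Daily query volume at which self-hosting costs the same as the API."""
    return self_hosted_monthly_usd / (price_per_query * days)


if __name__ == "__main__":
    for qpd in (1_000, 10_000, 100_000):
        print(f"{qpd:>7,} queries/day -> ${monthly_api_cost(qpd):,.0f}/month")
    # Assumed all-in self-hosting cost of $600/month (hypothetical figure).
    print(f"break-even: {break_even_queries_per_day(600):,.0f} queries/day")
```

Swap in your own all-in hosting cost (GPU, ops time, redundancy) before relying on the break-even figure, since the result scales linearly with it.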
| Corpus / Query Pattern | Add Reranking? | Reason |
|---|---|---|
| Pure Q&A knowledge base, natural language queries, well-chunked docs | No — vector search is enough | If top-5 results are consistently relevant, reranking just reorders them with added latency |
| Corpus with many near-duplicate docs or dense technical detail | Yes | Cross-encoders distinguish semantically similar documents that embeddings treat as equivalent |
| Mixed query styles (keywords + full questions from the same users) | Yes | Reranking corrects for query-length sensitivity that affects embedding similarity scores |
| Tiny corpus (< 500 docs) | No — doesn't matter | Recall@50 is near 100%; reranking a nearly complete set adds nothing |
| Single-answer QA where only rank-1 matters | Maybe — measure first | Reranking improves the full ranking; verify the rank-1 gain justifies the latency cost |
| Premium UX where irrelevant results cause churn | Yes | Precision improvement compounds with LLM quality — one wrong source in context degrades the answer |
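Run 20 representative queries through your pipeline. Manually score the top-5 retrieved documents for relevance and calculate precision@5 (the fraction of those five that are relevant, averaged across queries). Below 70%: reranking likely helps. Above 80%: reranking may not justify the cost. Between 70% and 80% the result is a judgment call, so test a reranker on a larger labeled set before committing. If you skip this measurement, you may be adding latency and cost to fix a problem that does not exist.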
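The measurement itself is small enough to script. The sketch below computes precision@5 and MRR from hand-labeled relevance judgments; the function names and the dictionary layout (`runs` mapping each query to its ranked document IDs, `labels` mapping each query to its relevant IDs) are assumptions made for the example.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant (always divides by k)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]],
             labels: Dict[str, Set[str]],
             k: int = 5) -> Dict[str, float]:
    """Average precision@k and MRR over all queries in `runs`."""
    n = len(runs)
    p_at_k = sum(precision_at_k(runs[q], labels.get(q, set()), k) for q in runs) / n
    mrr = sum(reciprocal_rank(runs[q], labels.get(q, set())) for q in runs) / n
    return {"precision@k": p_at_k, "mrr": mrr}


if __name__ == "__main__":
    # Toy data: two queries with manually labeled relevant documents.
    runs = {"q1": ["a", "b", "c", "d", "e"], "q2": ["x", "y", "z", "w", "v"]}
    labels = {"q1": {"a", "c", "e"}, "q2": {"y"}}
    print(evaluate(runs, labels))
```

Running `evaluate` once on the retrieval-only output and once on the reranked output for the same labeled queries gives a like-for-like comparison on your own corpus.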