Advanced RAG/Search Quality
Advanced · 10 min

Re-Ranking

Using cross-encoder re-rankers to improve retrieval precision. Cohere Rerank, ColBERT, open-source re-rankers, and the cost/latency tradeoff of adding a reranking stage.

Quick Reference

  • Bi-encoders (embeddings) optimize for recall — fast but imprecise. Cross-encoders optimize for precision — slow but accurate.
  • Re-ranking is a second stage: retrieve top-50 with embeddings, then rerank to find the best 5.
  • Cohere Rerank API: easiest integration, ~100ms latency for 25 documents, $0.002/query.
  • ColBERT uses late interaction — better quality/speed tradeoff than full cross-encoders.
  • Re-ranking typically improves precision@5 by 15-30% over raw vector search.

Bi-Encoders vs Cross-Encoders

Embedding models (bi-encoders) encode the query and each document independently into vectors, then compare them via cosine similarity. This is fast because document vectors are pre-computed at index time. But it's imprecise — the model never sees the query and document together. Cross-encoders take the query and document as a single input, allowing deep token-level interaction. This is much more accurate but requires running inference on every query-document pair at search time.
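The difference in API shape is the key point: a bi-encoder embeds each text once, while a cross-encoder must see every (query, document) pair. The sketch below illustrates this with deliberately toy scorers — bag-of-words cosine standing in for an embedding model, and bigram overlap standing in for a cross-encoder — so no model downloads are needed. The functions and scoring rules are illustrative assumptions, not any real model's behavior.

```python
from math import sqrt

def embed(text: str) -> dict[str, float]:
    """Bi-encoder stand-in: one vector per text, computable at index time."""
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query: str, doc: str) -> float:
    """Cross-encoder stand-in: sees both texts together, so it can model
    interactions (here, word order via bigram overlap) a bi-encoder cannot."""
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / (len(q_bigrams) or 1)

# Two documents with identical word bags but different word order.
docs = ["the cat sat on the mat", "the mat sat on the cat"]
doc_vecs = [embed(d) for d in docs]  # pre-computed once at index time
query = "cat sat on"

bi = [cosine(embed(query), v) for v in doc_vecs]      # identical scores
cross = [cross_score(query, d) for d in docs]          # distinguishes them
print(bi, cross)
```

Because the bi-encoder never sees query and document together, it cannot tell the two documents apart; the pairwise scorer can. Real cross-encoders capture far richer token-level interactions, but the structural tradeoff is the same.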

Aspect   | Bi-Encoder (Embeddings)                 | Cross-Encoder (Re-Ranker)
Input    | Query and document encoded separately   | Query + document encoded together
Speed    | ~1ms per query (pre-computed vectors)   | ~5-10ms per document pair
Accuracy | Good recall, moderate precision         | Excellent precision
Use case | First-stage retrieval (find candidates) | Second-stage re-ranking (pick winners)
Scale    | Millions of documents                   | Top 20-100 candidates only
The Two-Stage Pattern

The standard pattern is: (1) Retrieve top-K candidates with a bi-encoder for speed (K=20 to 100). (2) Re-rank those K candidates with a cross-encoder for precision. (3) Return the top-N re-ranked results (N=3 to 5). This combines the speed of embeddings with the accuracy of cross-encoders.
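The three steps above can be sketched end to end. This is a minimal illustration, not a production pipeline: the toy `embed`/`cosine` pair stands in for an embedding index, and `cross_score` (simple phrase containment, an assumption for demo purposes) stands in for a real re-ranker such as Cohere Rerank or a ColBERT-style model.

```python
from math import sqrt

def embed(text):
    """Stage-1 stand-in: bag-of-words 'embedding', pre-computed per document."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    """Stage-2 stand-in: a pairwise scorer (here, exact phrase containment)."""
    return 1.0 if query.lower() in doc.lower() else 0.0

def search(query, corpus, index, k=50, n=5):
    # Stage 1: cheap retrieval over the whole corpus -> top-K candidates.
    q = embed(query)
    candidates = sorted(range(len(corpus)),
                        key=lambda i: cosine(q, index[i]), reverse=True)[:k]
    # Stage 2: expensive pairwise scoring over only the K candidates.
    reranked = sorted(candidates,
                      key=lambda i: cross_score(query, corpus[i]), reverse=True)
    # Stage 3: return the top-N re-ranked results.
    return [corpus[i] for i in reranked[:n]]

corpus = [
    "how to reset a password on linux",
    "password password password reset reset",  # keyword-stuffed: wins cosine
    "cooking pasta at home",
]
index = [embed(d) for d in corpus]  # built once at index time

print(search("reset a password", corpus, index, k=3, n=1))
```

Note that stage 1 alone would rank the keyword-stuffed document first; the re-ranking stage, which sees query and document together, recovers the genuinely relevant one.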