Re-Ranking
Using cross-encoder re-rankers to improve retrieval precision. Cohere Rerank, ColBERT, open-source re-rankers, and the cost/latency tradeoff of adding a reranking stage.
Quick Reference
- Bi-encoders (embeddings) optimize for recall: fast but imprecise. Cross-encoders optimize for precision: slow but accurate.
- Re-ranking is a second stage: retrieve the top 50 with embeddings, then re-rank to find the best 5.
- Cohere Rerank API: easiest integration; roughly 100 ms of latency for 25 documents, about $0.002/query.
- ColBERT uses late interaction, a better quality/speed tradeoff than full cross-encoders.
- Re-ranking typically improves precision@5 by 15-30% over raw vector search.
Bi-Encoders vs Cross-Encoders
Embedding models (bi-encoders) encode the query and each document independently into vectors, then compare them via cosine similarity. This is fast because document vectors are pre-computed at index time. But it's imprecise — the model never sees the query and document together. Cross-encoders take the query and document as a single input, allowing deep token-level interaction. This is much more accurate but requires running inference on every query-document pair at search time.
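The difference can be seen with a toy sketch. The functions below are hypothetical stand-ins, not real models: `embed`/`cosine` mimic a bi-encoder (each text vectorized independently), and `cross_score` mimics a cross-encoder by looking at query and document together, crudely rewarding an exact phrase match that bag-of-words similarity cannot see.

```python
import math

def embed(text):
    """Bi-encoder stand-in: map text to a bag-of-words vector.
    Query and document are encoded *independently*."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    """Compare two precomputed vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    """Cross-encoder stand-in: sees query and document *together*,
    so it can reward an exact phrase match the bi-encoder misses."""
    score = cosine(embed(query), embed(doc))
    if query.lower() in doc.lower():  # crude token-level interaction
        score += 1.0
    return score

docs = ["rerank results with a cross encoder",
        "encoder results cross a with rerank"]
query = "cross encoder"

# Same bag of words, so the bi-encoder scores both documents identically...
print([round(cosine(embed(query), embed(d)), 3) for d in docs])
# ...while the joint scorer separates them.
print([round(cross_score(query, d), 3) for d in docs])
```

A real cross-encoder learns this interaction from data rather than from a substring check, but the structural point is the same: joint input enables distinctions that independent encoding destroys.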
| Aspect | Bi-Encoder (Embeddings) | Cross-Encoder (Re-Ranker) |
|---|---|---|
| Input | Query and document encoded separately | Query + document encoded together |
| Speed | ~1ms per query (pre-computed vectors) | ~5-10ms per document pair |
| Accuracy | Good recall, moderate precision | Excellent precision |
| Use case | First-stage retrieval (find candidates) | Second-stage re-ranking (pick winners) |
| Scale | Millions of documents | Top 20-100 candidates only |
The standard pattern is: (1) Retrieve top-K candidates with a bi-encoder for speed (K=20 to 100). (2) Re-rank those K candidates with a cross-encoder for precision. (3) Return the top-N re-ranked results (N=3 to 5). This combines the speed of embeddings with the accuracy of cross-encoders.
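The three steps above can be sketched as a single function. `bi_score` and `cross_score` are hypothetical stand-ins: in practice, stage 1 is cosine similarity over precomputed embeddings and stage 2 is a trained cross-encoder (Cohere Rerank, ColBERT, or an open-source model).

```python
def bi_score(query, doc):
    # Stage-1 stand-in: fast, imprecise word overlap as a proxy
    # for cosine similarity over precomputed vectors.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cross_score(query, doc):
    # Stage-2 stand-in: slower, more precise; pretend joint
    # inference rewards an exact phrase match.
    return bi_score(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)

def retrieve_then_rerank(query, corpus, k=50, n=5):
    """Two-stage retrieval: cheap pass over the whole corpus,
    expensive pass over only the K surviving candidates."""
    # (1) Retrieve top-K candidates with the bi-encoder (recall).
    candidates = sorted(corpus, key=lambda d: bi_score(query, d), reverse=True)[:k]
    # (2) Re-rank those K candidates with the cross-encoder (precision).
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    # (3) Return the top-N re-ranked results.
    return reranked[:n]

corpus = [
    "cross encoder reranking improves precision",
    "encoder cross with shuffled words",
    "embeddings are fast but imprecise",
    "unrelated document about cooking pasta",
]
print(retrieve_then_rerank("cross encoder", corpus, k=3, n=2))
```

Because the cross-encoder only ever scores K documents per query, its cost is fixed regardless of corpus size, which is what makes the pattern practical at scale.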