Bi-Encoder vs Cross-Encoder vs ColBERT
Three retrieval architectures differ in how much they pre-compute vs compute at query time. This article covers when each architecture wins, how to calculate storage and latency for your corpus, how to measure which one improves your retrieval, and the failure modes specific to each.
Quick Reference
- Bi-encoder: encode query and doc independently → cosine similarity. After indexing, query time costs one query encoding plus a vector search, with no per-doc model call. One vector per doc.
- Cross-encoder: concatenate query + doc → full cross-attention → relevance score. O(N) model passes per query for N candidate docs. No pre-computation possible.
- ColBERT: encode query and doc separately, keeping one vector per token → MaxSim at query time. Pre-computes doc token vectors. Near-cross-encoder quality.
- Storage math: bi-encoder = dims × 4 bytes/doc (FP32); ColBERT FP16 = avg_tokens × 128 × 2 bytes/doc. For 200-token docs and a 1536-dim bi-encoder: 6 KB vs 50 KB.
- Binary quantization cuts ColBERT storage 16× (FP16→1-bit) to ~3 KB/doc, less than a standard bi-encoder, with ~7% recall tradeoff.
- Production cascade: bi-encoder retrieves top-25 (recall), cross-encoder reranks to top-5 (precision). See the Re-Ranking article for reranker selection and cost math.
- Architecture choice is not permanent: start with bi-encoder, add a reranker when precision matters, consider ColBERT when storage budget allows.
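The storage figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 1536-dim bi-encoder is an assumption inferred from the 6 KB figure (1536 × 4 bytes), not something the list states directly.

```python
def bi_encoder_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """One vector per doc: dims x bytes per value (4 bytes = FP32)."""
    return dims * bytes_per_value


def colbert_bytes(avg_tokens: int, token_dims: int = 128, bits_per_value: int = 16) -> int:
    """One vector per token: tokens x dims x bits, rounded up to whole bytes."""
    total_bits = avg_tokens * token_dims * bits_per_value
    return (total_bits + 7) // 8


def corpus_gib(bytes_per_doc: int, num_docs: int) -> float:
    """Raw vector storage for the whole corpus in GiB (index overhead excluded)."""
    return bytes_per_doc * num_docs / 1024**3


# Assumed example: 1536-dim FP32 bi-encoder, 200-token docs, 128-dim ColBERT tokens.
bi = bi_encoder_bytes(1536)                      # 6144 bytes  (6 KB)
cb_fp16 = colbert_bytes(200)                     # 51200 bytes (50 KB)
cb_1bit = colbert_bytes(200, bits_per_value=1)   # 3200 bytes  (~3 KB)

print(bi, cb_fp16, cb_1bit)
print(f"FP16 -> 1-bit reduction: {cb_fp16 // cb_1bit}x")
```

Multiplying the per-doc figure by corpus size with `corpus_gib` gives a lower bound only; real vector indexes add their own overhead on top of the raw vectors.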
When Architecture Choice Matters
Most RAG systems work well with a default bi-encoder. Before choosing between architectures, check whether you have a retrieval problem at all. Architecture upgrades are expensive to maintain and hard to back out of, because each one adds dependencies your team must own. Use the table below to make that decision before reading the rest of this article.
| Situation | Action | Why |
|---|---|---|
| Corpus under 500 docs | No architecture work needed | Recall@50 is near 100% — any retrieval approach will find the right doc |
| Simple factoid queries, well-chunked docs, more than 2 of the top 5 results relevant on average (precision@5 above 0.4) | Stay with bi-encoder | You have no retrieval problem to solve |
| Dense technical docs, many near-duplicates, mixed query styles | Add cross-encoder reranker to bi-encoder | Cross-encoders distinguish semantically similar docs that embeddings treat as equivalent |
| Fewer than 2 of the top 5 results relevant on average (precision@5 below 0.4) after adding a reranker | Fine-tune your embedding model on domain data | The bi-encoder is not finding good candidates, and reranking cannot fix recall failures |
| Latency budget is tight but cross-encoder quality is required | Consider ColBERT single-stage | ColBERT achieves near-cross-encoder quality with pre-computed doc tokens — no second API call |
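The precision@5 thresholds in the table can be checked on a small labeled query set. The sketch below assumes the threshold of 2 refers to the mean number of relevant docs in the top 5, which is 0.4 when expressed as a fraction. The function names and the sample doc IDs are illustrative only.

```python
def relevant_in_top_k(ranked_ids, relevant_ids, k=5):
    """Count how many of the first k retrieved doc IDs are labeled relevant."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)


def mean_relevant_at_k(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        raise ValueError("need at least one labeled query")
    counts = [relevant_in_top_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(counts) / len(counts)


def precision_at_k(runs, k=5):
    """Same measurement expressed as a fraction between 0 and 1."""
    return mean_relevant_at_k(runs, k) / k


# Hypothetical labeled queries: retrieved order plus the set of relevant doc IDs.
runs = [
    (["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d4"}),  # 3 relevant in top 5
    (["d2", "d5", "d8", "d6", "d0"], {"d5"}),              # 1 relevant in top 5
]

print(mean_relevant_at_k(runs))  # 2.0
print(precision_at_k(runs))      # 0.4
```

Running the same labeled set through each candidate architecture keeps the comparison fair, since the queries and relevance labels stay fixed.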
The Re-Ranking article covers which reranker model to use, how to calculate API cost per query volume, and how to build a circuit breaker. This article covers the architecture choice itself: bi-encoder vs ColBERT vs cascade, how to size storage, and how to measure which architecture wins on your data.
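To make the difference between single-vector and per-token scoring concrete, here is a minimal pure-Python sketch of bi-encoder cosine scoring and ColBERT-style MaxSim. It assumes the embeddings already exist as lists of floats; the toy 2-dim vectors stand in for real model output, and the function names are illustrative rather than taken from any library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Bi-encoder score: cosine similarity between one query and one doc vector."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    cosine match among the doc token vectors, then sum over query tokens."""
    if not doc_tokens:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data (assumed, not real embeddings).
query_vec, doc_vec = [1.0, 1.0], [2.0, 2.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # can be pre-computed at index time

print(cosine(query_vec, doc_vec))
print(maxsim(query_tokens, doc_tokens))
```

The doc-side inputs to both functions can be computed offline, which is what separates these two architectures from a cross-encoder, where the score requires a model pass over each query and doc pair.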