Bi-Encoder vs Cross-Encoder vs ColBERT
Three architectures for semantic retrieval: bi-encoders for fast search, cross-encoders for precise reranking, and ColBERT for the best of both — understand when to use each.
Quick Reference
- Bi-encoder: encode query and docs independently → cosine similarity — fast, scalable, used for initial retrieval
- Cross-encoder: encode query + doc together → relevance score — slow but most accurate, used for reranking
- ColBERT: encode query and doc tokens independently → late interaction — fast retrieval with cross-encoder quality
- Production pattern: bi-encoder for recall (top-100) → cross-encoder for precision (top-5)
- ColBERT is the emerging winner for single-stage retrieval — but requires more storage
- All three are complementary — use them together, not as alternatives
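The production pattern above can be sketched in plain Python. This is a minimal illustration, not a real pipeline: `retrieve_then_rerank` and `cross_score` are hypothetical names, embeddings are toy vectors, and the `cross_score` callable stands in for an actual cross-encoder model.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_then_rerank(query_vec, doc_vecs, cross_score, k_recall=100, k_final=5):
    """Two-stage cascade: bi-encoder for recall, cross-encoder for precision.

    query_vec   -- query embedding from the bi-encoder
    doc_vecs    -- {doc_id: embedding}, computed once at index time
    cross_score -- callable(doc_id) -> relevance score (the expensive
                   cross-encoder, invoked only on the shortlist)
    """
    # Stage 1: cheap vector similarity over the whole corpus -> top-k_recall.
    candidates = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:k_recall]
    # Stage 2: expensive joint scoring, but only over the shortlist.
    return sorted(candidates, key=cross_score, reverse=True)[:k_final]
```

The key property: the cross-encoder's per-pair cost is paid only `k_recall` times per query, not once per document in the corpus.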
Architecture Comparison
Bi-encoder: independent, fast, scalable · Cross-encoder: joint, accurate, slow · Cascade both
| Architecture | How It Works | Speed | Quality | Storage |
|---|---|---|---|---|
| Bi-Encoder | Query and doc encoded separately → cosine similarity | Fast (pre-computed doc vectors) | Good | Low (one vector per doc) |
| Cross-Encoder | Query + doc encoded together → relevance score | Slow (recompute per query-doc pair) | Best | None (no pre-computation) |
| ColBERT | Query and doc tokens encoded separately → token-level similarity | Fast (pre-computed token vectors) | Near-best | High (many vectors per doc) |
Bi-encoders are what most RAG systems use: embed the query, embed the document, compute cosine similarity. Fast and scalable because document embeddings are computed once at index time. Cross-encoders see the query and document together, enabling much richer comparison — but they must score every candidate, making them too slow for initial search. ColBERT is a middle ground: it pre-computes per-token embeddings for documents and computes late interaction at query time, achieving near-cross-encoder quality at bi-encoder speed.
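ColBERT's late interaction reduces to a simple scoring rule, often called MaxSim: for each query token, find its best-matching document token, then sum those maxima. A minimal sketch, assuming token embeddings are already L2-normalized (so a dot product equals cosine similarity) — `maxsim_score` is an illustrative name, not the library's API:

```python
def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction (MaxSim).

    query_tokens / doc_tokens -- lists of token embedding vectors, assumed
    L2-normalized so dot product == cosine similarity. Doc token vectors are
    pre-computed at index time; only this cheap interaction runs per query.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep its best match among doc tokens, then sum.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)
```

Because documents keep one vector per token rather than one per document, every query token can find its own best match — which is where the near-cross-encoder quality comes from, and also why the storage column in the table above reads "High".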