Bi-Encoder vs Cross-Encoder vs ColBERT

Three retrieval architectures differ in how much they pre-compute vs compute at query time. This article covers when each architecture wins, how to calculate storage and latency for your corpus, how to measure which one improves your retrieval, and the failure modes specific to each.

Quick Reference

→Bi-encoder: encode query and doc independently → cosine similarity. One vector per doc. Doc vectors are pre-computed, so query time is one query encoding plus a nearest-neighbor lookup.
→Cross-encoder: concatenate query + doc → full cross-attention → relevance score. One model forward pass per candidate doc, so O(N) in the number of docs scored per query. No pre-computation possible.
→ColBERT: encode query and doc separately into per-token embeddings → MaxSim at query time. Doc token embeddings are pre-computed. Near-cross-encoder quality.
→Storage math: bi-encoder FP32 = dims × 4 bytes/doc; ColBERT FP16 = avg_tokens × 128 × 2 bytes/doc. For a 1536-dim bi-encoder and 200-token docs: 6 KB vs 50 KB.
→Binary quantization cuts ColBERT storage 16× (FP16→1-bit) to ~3 KB/doc, less than a standard bi-encoder, at a cost of ~7% recall.
→Production cascade: bi-encoder retrieves top-25 (recall), cross-encoder reranks to top-5 (precision). See the Re-Ranking article for reranker selection and cost math.
→Architecture choice is not permanent: start with bi-encoder, add a reranker when precision matters, consider ColBERT when storage budget allows.
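
The production cascade from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `cascade_search`, `overlap_score`, and `jaccard_score` are hypothetical names, and the two toy scorers only stand in for a real bi-encoder and cross-encoder. In a real system, stage 1 would be a vector index lookup rather than a sort over the whole corpus.

```python
from typing import Callable, List, Sequence

# A scorer takes (query, doc) and returns a relevance score (higher is better).
Scorer = Callable[[str, str], float]


def cascade_search(
    query: str,
    docs: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    retrieve_k: int = 25,
    final_k: int = 5,
) -> List[str]:
    """Two-stage retrieval: cheap scorer for recall, expensive scorer for precision."""
    # Stage 1 (recall): rank everything with the cheap scorer, keep retrieve_k.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)
    candidates = candidates[:retrieve_k]
    # Stage 2 (precision): the expensive scorer only sees the candidates.
    reranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return reranked[:final_k]


# Toy stand-ins (assumptions for illustration, not real models).
def overlap_score(query: str, doc: str) -> float:
    """Stand-in for a bi-encoder: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def jaccard_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: shared words over all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


docs = ["a b", "a c", "x y", "a b c d"]
top = cascade_search("a b", docs, overlap_score, jaccard_score, retrieve_k=3, final_k=2)
print(top)
```

The point of the structure is that the expensive scorer runs at most `retrieve_k` times per query, regardless of corpus size. A doc that stage 1 misses can never be recovered by stage 2.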

When Architecture Choice Matters

Most RAG systems work well with a default bi-encoder. Before choosing between architectures, check whether you have a retrieval problem at all. Architecture upgrades are expensive to maintain and hard to reverse, because each one adds dependencies your team must own. Use the table below to make that decision before reading the rest of this article.

Situation	Action	Why
Corpus under 500 docs	No architecture work needed	Recall@50 is near 100% — any retrieval approach will find the right doc
Simple factoid queries, well-chunked docs, more than 2 of the top 5 results relevant (precision@5 above 0.4)	Stay with bi-encoder	You have no retrieval problem to solve
Dense technical docs, many near-duplicates, mixed query styles	Add cross-encoder reranker to bi-encoder	Cross-encoders distinguish semantically similar docs that embeddings treat as equivalent
Fewer than 2 of the top 5 results relevant (precision@5 below 0.4) after adding a reranker	Fine-tune your embedding model on domain data	The bi-encoder is not finding good candidates, and reranking cannot fix recall failures
Latency budget is tight but cross-encoder quality is required	Consider ColBERT single-stage	ColBERT achieves near-cross-encoder quality with pre-computed doc tokens — no second API call
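
The rows above depend on measuring precision@k and recall@k on a labeled query set. A minimal sketch of both metrics, assuming each query has a ranked list of unique doc IDs and a set of IDs judged relevant; the function names and the sample IDs are illustrative only.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (always between 0 and 1)."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Made-up example: 2 of the top 5 results are relevant, out of 3 relevant docs.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5"}
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
print(p5, r5)
```

Average these values over the whole query set before comparing architectures; a single query says very little.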

This article and the Re-Ranking article cover different decisions

The Re-Ranking article covers which reranker model to use, how to calculate API cost per query volume, and how to build a circuit breaker. This article covers the architecture choice itself: bi-encoder vs ColBERT vs cascade, how to size storage, and how to measure which architecture wins on your data.

Three Architectures in One Diagram

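[Diagram: the three architectures side by side. Blue: bi-encoder, one vector per doc, fast. Purple: ColBERT, many token vectors per doc, fast, near-best quality. Red: cross-encoder, no pre-computation, highest quality.]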
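
The difference between the two pre-computable architectures comes down to the scoring function. Below is a minimal sketch in plain Python, assuming the embeddings are already available as lists of floats and contain no zero vectors; `bi_encoder_score` and `maxsim_score` are illustrative names, not a library API. A cross-encoder is left out because its score comes from a joint model forward pass over the query and doc together, which cannot be reduced to a function of pre-computed vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bi_encoder_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder: one vector per query, one per doc, one similarity."""
    return cosine(query_vec, doc_vec)


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: for each query token, take its best-matching
    doc token, then sum those maxima over the query tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Tiny hand-made vectors, for illustration only.
single = bi_encoder_score([1.0, 0.0], [1.0, 0.0])
late = maxsim_score(
    [[1.0, 0.0], [0.0, 1.0]],              # two query token vectors
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # three doc token vectors
)
print(single, late)
```

MaxSim needs every doc token vector at query time, which is why storage for this approach grows with document length rather than staying fixed per doc.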

Storage and Latency Math at Scale

Storage is the primary reason teams avoid ColBERT. The math is simple: bi-encoder storage per document is model_dims × 4 bytes (FP32). For text-embedding-3-small (1536 dims): 1536 × 4 = 6,144 bytes ≈ 6 KB per doc. ColBERT v2 uses 128-dimensional token embeddings stored in FP16. For a document with an average of 200 tokens: 200 × 128 × 2 = 51,200 bytes ≈ 50 KB per doc — about 8× more than a standard bi-encoder. Binary quantization changes this calculation significantly.
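
The arithmetic above is easy to rerun for your own model and corpus. A minimal sketch: the function names are illustrative, and the result covers raw vector payload only, so index structures and metadata (which vary by vector store) come on top.

```python
def bi_encoder_bytes_per_doc(dims: int, bytes_per_value: int = 4) -> int:
    """One vector per doc; FP32 by default (4 bytes per dimension)."""
    return dims * bytes_per_value


def colbert_bytes_per_doc(
    avg_tokens: int, token_dims: int = 128, bits_per_value: int = 16
) -> int:
    """One vector per token; FP16 by default, bits_per_value=1 for binary."""
    return avg_tokens * token_dims * bits_per_value // 8


bi = bi_encoder_bytes_per_doc(1536)                          # 6144 bytes
colbert_fp16 = colbert_bytes_per_doc(200)                    # 51200 bytes
colbert_binary = colbert_bytes_per_doc(200, bits_per_value=1)  # 3200 bytes

print(bi, colbert_fp16, colbert_binary)
print(f"ColBERT FP16 vs bi-encoder: {colbert_fp16 / bi:.1f}x")

# Scale to a corpus by multiplying by the doc count, e.g. one million docs.
n_docs = 1_000_000
print(f"bi-encoder corpus: {bi * n_docs / 1e9:.2f} GB (decimal)")
```

Plug in your own average token count: since ColBERT storage scales linearly with document length, chunking strategy changes the comparison directly.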
