Bi-Encoder vs Cross-Encoder vs ColBERT

Three retrieval architectures differ in how much they pre-compute vs compute at query time. This article covers when each architecture wins, how to calculate storage and latency for your corpus, how to measure which one improves your retrieval, and the failure modes specific to each.
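To make the pre-compute versus query-time split concrete, here is a minimal sketch of the two scoring functions that can run on stored document vectors. The vectors are toy 2-dimensional values made up for illustration, and `cosine` and `maxsim` are hypothetical helper names, not any library's API. A cross-encoder is not shown because it has no document-side vectors to store: it needs a model forward pass on each concatenated query and doc pair.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Bi-encoder style score: one vector per query, one vector per doc."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token vector takes its best
    match among the doc token vectors, and the best matches are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy vectors (assumed values, for illustration only).
query_vec = [1.0, 0.0]
doc_vec = [0.6, 0.8]  # stored at index time: one vector per doc

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]  # stored at index time: one vector per token

bi_score = cosine(query_vec, doc_vec)
colbert_score = maxsim(query_tokens, doc_tokens)
```

In both cases the document side is computed once at indexing; only the query is encoded at query time. The difference is how much is stored per document (one vector versus one vector per token) and how much arithmetic the score needs.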

Quick Reference

  • Bi-encoder: encode query and doc independently → cosine similarity. Query time costs one query encoding plus a vector search over the pre-computed index (sublinear with an ANN index). One vector per doc.
  • Cross-encoder: concatenate query + doc → full cross-attention → relevance score. One forward pass per candidate, so O(N) model calls per query for N candidates. No doc-side pre-computation possible.
  • ColBERT: encode query and doc separately, keeping one vector per token → MaxSim at query time. Doc token vectors are pre-computed. Near-cross-encoder quality.
  • Storage math: bi-encoder = dims × 4 bytes/doc (FP32); ColBERT FP16 = avg_tokens × 128 × 2 bytes/doc. For 200-token docs: 6 KB vs 50 KB (the 6 KB figure corresponds to a 1536-dim bi-encoder).
  • Binary quantization cuts ColBERT storage 16× (FP16→1-bit) to ~3 KB/doc, less than a standard bi-encoder, with a ~7% recall tradeoff.
  • Production cascade: bi-encoder retrieves top-25 (recall), cross-encoder reranks to top-5 (precision). See the Re-Ranking article for reranker selection and cost math.
  • Architecture choice is not permanent: start with bi-encoder, add a reranker when precision matters, consider ColBERT when storage budget allows.
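The storage figures in the list above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 1536-dim bi-encoder is inferred from the 6 KB figure, the 128-dim token vectors and 200-token average come from the bullets, and the function names are hypothetical. It counts raw vector bytes only and ignores index overhead.

```python
def bi_encoder_bytes(dims, bytes_per_value=4):
    """One FP32 vector per doc."""
    return dims * bytes_per_value


def colbert_bytes(avg_tokens, token_dims=128, bits_per_value=16):
    """One vector per token; 16 bits per value for FP16, 1 bit for binary."""
    return avg_tokens * token_dims * bits_per_value // 8


def corpus_gib(n_docs, bytes_per_doc):
    """Total raw vector storage for a corpus, in GiB."""
    return n_docs * bytes_per_doc / 1024**3


bi = bi_encoder_bytes(1536)                     # 6144 bytes = 6 KiB (1536 dims is an assumption)
fp16 = colbert_bytes(200)                       # 51200 bytes = 50 KiB
binary = colbert_bytes(200, bits_per_value=1)   # 3200 bytes, about 3.1 KiB

for name, size in [("bi-encoder", bi), ("ColBERT FP16", fp16), ("ColBERT 1-bit", binary)]:
    print(f"{name}: {size / 1024:.1f} KiB/doc, {corpus_gib(1_000_000, size):.1f} GiB per 1M docs")
```

Swap in your own embedding dimension, average chunk length in tokens, and corpus size to see which option fits your storage budget.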

When Architecture Choice Matters

Most RAG systems work well with a default bi-encoder. Before choosing between architectures, check whether you have a retrieval problem at all. Architecture upgrades are expensive to maintain and hard to back out of, because each one adds dependencies your team must own. Use the table below to make that decision before reading the rest of this article.

| Situation | Action | Why |
| --- | --- | --- |
| Corpus under 500 docs | No architecture work needed | Recall@50 is near 100%, so any retrieval approach will find the right doc |
| Simple factoid queries, well-chunked docs, precision@5 above 2.0 | Stay with bi-encoder | You have no retrieval problem to solve |
| Dense technical docs, many near-duplicates, mixed query styles | Add cross-encoder reranker to bi-encoder | Cross-encoders distinguish semantically similar docs that embeddings treat as equivalent |
| Precision@5 below 2.0 after adding a reranker | Fine-tune your embedding model on domain data | The bi-encoder is not finding good candidates, and reranking cannot fix recall failures |
| Latency budget is tight but cross-encoder quality is required | Consider ColBERT single-stage | ColBERT achieves near-cross-encoder quality with pre-computed doc tokens and needs no second API call |

This article and the Re-Ranking article cover different decisions

The Re-Ranking article covers which reranker model to use, how to calculate API cost per query volume, and how to build a circuit breaker. This article covers the architecture choice itself: bi-encoder vs ColBERT vs cascade, how to size storage, and how to measure which architecture wins on your data.
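One way to measure which architecture wins on your data is to run each candidate over the same labeled queries and compare ranking metrics. The sketch below is a minimal illustration, not the article's evaluation harness: the query IDs, document IDs, and ranked results are made up, and `recall_at_k`, `hits_at_k`, and `mean_metric` are hypothetical helper names. It reports recall@k and the raw count of relevant documents in the top k rather than assuming a particular precision@5 scale.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def hits_at_k(ranked_ids, relevant_ids, k):
    """Number of relevant docs that appear in the top k results."""
    return len(set(ranked_ids[:k]) & relevant_ids)


def mean_metric(run, labels, metric, k):
    """Average a per-query metric over every labeled query."""
    return sum(metric(run[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical relevance labels: query ID -> set of relevant doc IDs.
labels = {"q1": {"d1", "d4"}, "q2": {"d7"}}

# Hypothetical ranked outputs from two systems on the same queries.
bi_encoder_run = {
    "q1": ["d2", "d1", "d3", "d4", "d5"],
    "q2": ["d8", "d9", "d7", "d6", "d5"],
}
cascade_run = {
    "q1": ["d1", "d4", "d2", "d3", "d5"],
    "q2": ["d7", "d8", "d9", "d6", "d5"],
}

for name, run in [("bi-encoder", bi_encoder_run), ("cascade", cascade_run)]:
    r2 = mean_metric(run, labels, recall_at_k, 2)
    r5 = mean_metric(run, labels, recall_at_k, 5)
    h2 = mean_metric(run, labels, hits_at_k, 2)
    print(f"{name}: recall@2={r2:.2f} recall@5={r5:.2f} hits@2={h2:.2f}")
```

In this toy data both systems reach the same recall@5, but the cascade puts the relevant docs higher. That is the pattern to look for: if recall at the candidate depth is already high, reordering can help; if it is low, the first-stage retriever is the thing to fix.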