Advanced RAG/Advanced Embeddings
Advanced · 9 min

Bi-Encoder vs Cross-Encoder vs ColBERT

Three architectures for semantic retrieval: bi-encoders for fast search, cross-encoders for precise reranking, and ColBERT for the best of both — understand when to use each.

Quick Reference

  • Bi-encoder: encode query and docs independently → cosine similarity — fast, scalable, used for initial retrieval
  • Cross-encoder: encode query+doc together → relevance score — slow but most accurate, used for reranking
  • ColBERT: encode query and doc tokens independently → late interaction — fast retrieval with near-cross-encoder quality
  • Production pattern: bi-encoder for recall (top-100) → cross-encoder for precision (top-5)
  • ColBERT is the emerging winner for single-stage retrieval — but requires more storage
  • All three are complementary — use them together, not as alternatives
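The production pattern from the list above can be sketched end-to-end. This is a minimal illustration, not a real implementation: `bi_encode` and `cross_score` are hypothetical deterministic stand-ins for actual models (in practice you would use, e.g., a sentence-transformers bi-encoder and a cross-encoder reranker).

```python
import numpy as np

DIM = 64

def bi_encode(text: str) -> np.ndarray:
    """Toy bi-encoder: one normalized vector per text, derived from the text."""
    seed = sum(text.encode()) % (2**32)
    v = np.random.default_rng(seed).normal(size=DIM)
    return v / np.linalg.norm(v)

def cross_score(query: str, doc: str) -> float:
    """Toy cross-encoder: joint score for a (query, doc) pair (token overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve_then_rerank(query, docs, recall_k=100, final_k=5):
    # Stage 1 (bi-encoder): cosine similarity against pre-computed doc vectors.
    doc_vecs = np.stack([bi_encode(d) for d in docs])  # built at index time
    sims = doc_vecs @ bi_encode(query)                 # cosine (rows normalized)
    candidates = np.argsort(-sims)[:recall_k]
    # Stage 2 (cross-encoder): re-score only the candidates, keep the top few.
    reranked = sorted(candidates, key=lambda i: cross_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:final_k]]
```

The key property to notice: the expensive pairwise scorer only ever sees `recall_k` candidates, never the whole corpus.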

Architecture Comparison

[Diagram: Bi-encoder — query and document encoded separately (Encoder A / Encoder B), compared by cosine similarity; fast (~10ms) with pre-computed doc vectors. Cross-encoder — query + document concatenated into a joint encoder with full cross-attention, outputting a relevance score (e.g. 0.94); slow (~50ms/pair), recomputed per query-doc pair. Production: bi-encoder (top-100) → cross-encoder (top-5) — fast recall + precise reranking = best of both.]

Bi-encoder: independent, fast, scalable · Cross-encoder: joint, accurate, slow · Cascade both

Architecture | How It Works | Speed | Quality | Storage
Bi-Encoder | Query and doc encoded separately → cosine similarity | Fast (pre-computed doc vectors) | Good | Low (one vector per doc)
Cross-Encoder | Query + doc encoded together → relevance score | Slow (recompute per query-doc pair) | Best | None (no pre-computation)
ColBERT | Query and doc tokens encoded separately → token-level similarity | Fast (pre-computed token vectors) | Near-best | High (many vectors per doc)
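The speed column is worth quantifying. Using the rough latencies from the diagram (~10ms for a bi-encoder query against a prebuilt index, ~50ms per cross-encoder pair — illustrative numbers, not benchmarks), a quick back-of-envelope shows why reranking is restricted to a small candidate set:

```python
import math

BI_QUERY_MS = 10     # one bi-encoder query against a prebuilt index
CROSS_PAIR_MS = 50   # one cross-encoder forward pass per (query, doc) pair

def rerank_latency_ms(candidates: int, batch_size: int = 1) -> int:
    """Total query latency: bi-encoder stage + batched cross-encoder stage."""
    return BI_QUERY_MS + math.ceil(candidates / batch_size) * CROSS_PAIR_MS
```

Sequentially reranking the top-100 costs about 5 seconds; batching 32 pairs per forward pass on a GPU brings it near 200ms, which is why the cascade stops at tens, not thousands, of candidates.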

Bi-encoders are what most RAG systems use: embed the query, embed the document, compute cosine similarity. They are fast and scalable because document embeddings are computed once at index time. Cross-encoders see the query and document together, enabling much richer comparison — but they must score every candidate at query time, making them too slow for initial search. ColBERT is a middle ground: it pre-computes per-token embeddings for documents and computes late interaction at query time, achieving near-cross-encoder quality at bi-encoder speed.
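ColBERT's late interaction is simple to state: score each query token against its best-matching document token, then sum (the "MaxSim" operator). A sketch with toy orthogonal unit vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """ColBERT-style late interaction.
    query_toks: (Q, d), doc_toks: (D, d); rows assumed L2-normalized."""
    sim = query_toks @ doc_toks.T        # (Q, D) token-token cosine similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

e1, e2 = np.eye(2)           # two orthogonal "concepts" as token embeddings
query = np.stack([e1, e2])   # query mentions both concepts
doc_a = np.stack([e1, e2])   # doc A covers both  -> maxsim 2.0
doc_b = np.stack([e1, e1])   # doc B covers one   -> maxsim 1.0
```

Because `doc_toks` is pre-computed at index time, query-time cost is a single matrix product per document — far cheaper than a cross-encoder forward pass, while still comparing at the token level. The storage cost in the table follows directly: one vector per token instead of one per document.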