Embedding Models
Embedding models convert text into vectors for semantic search and retrieval-augmented generation (RAG). This article covers the 2026 model landscape, cost math at scale, production patterns, and the hidden traps, especially the re-embedding trap you hit when you switch models.
Quick Reference
- Use embed_documents() for corpus indexing and embed_query() for search queries. They may use different internal prefixes, so don't swap one for the other.
- text-embedding-3-small ($0.02 per 1M tokens) is the default for English text. voyage-4-lite is the same price with stronger retrieval.
- Gemini Embedding 2 (#1 on MTEB multilingual, 3072 dims, 8192-token input) is free during preview.
- Switching models means re-embedding your entire corpus, because vectors from different models are not comparable. Plan before you pick.
- Reduce dimensions to 512 via the Matryoshka `dimensions` parameter. Starting from 1536 dims, that cuts vector storage by about two-thirds with minimal recall loss.
- CacheBackedEmbeddings wraps any embedder and skips API calls for text it has already embedded. Always use it.
- Batch embed_documents() calls in groups of 256 to stay within rate limits and maximize throughput.
- Measure retrieval hit-rate@5 on a held-out eval set before calling embeddings 'good enough'.
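Three of the bullets above (dimension reduction, batching, and hit-rate@5) come down to a few lines of arithmetic or plain Python. The sketch below is illustrative only: the helper names are hypothetical, storage is assumed to be raw float32 vectors with no index overhead, and `retrieve` stands in for whatever search function your stack exposes.

```python
from typing import Callable, Iterator, Sequence, Tuple

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def vector_storage_bytes(num_vectors: int, dims: int) -> int:
    """Raw vector storage in bytes, ignoring index and metadata overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32


def batched(texts: Sequence[str], batch_size: int = 256) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def hit_rate_at_k(
    eval_set: Sequence[Tuple[str, str]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Fraction of (query, relevant_doc_id) pairs whose doc id is in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(
        1 for query, relevant_id in eval_set if relevant_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)


# Storage: one million vectors at 1536 dims versus 512 dims.
full = vector_storage_bytes(1_000_000, 1536)
reduced = vector_storage_bytes(1_000_000, 512)
saving = 1 - reduced / full  # fraction of storage saved

# Batching: 600 texts are split into batches of 256, 256, and 88.
batch_sizes = [len(batch) for batch in batched([f"doc {i}" for i in range(600)])]
```

In a real pipeline each batch from `batched` would be passed to embed_documents(), and `retrieve` would return document ids ranked by your vector store.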
When NOT to Use Embeddings
Before reaching for an embedding model, check whether keyword search would work. BM25, the default ranking function in Elasticsearch and OpenSearch, is faster, cheaper, and more interpretable for exact-match queries such as product SKUs, legal citation numbers, or error codes. Embeddings win when the user's vocabulary differs from the document vocabulary: 'heart attack' should match 'myocardial infarction.' If your users search with the same words the documents use, you may not need embeddings at all.
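The vocabulary gap described above is easy to see with a toy lexical score. This sketch uses plain token overlap as a crude stand-in for BM25 (it is not BM25), and `keyword_overlap` is a hypothetical helper written for illustration.

```python
def keyword_overlap(query: str, document: str) -> float:
    """Share of query tokens that also appear in the document (case-insensitive)."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


# Exact identifiers match lexically, so keyword search handles them well.
exact = keyword_overlap("error E1234", "Fix for error E1234 in parser")

# Synonyms share no tokens, so a purely lexical scorer sees no match at all.
synonym = keyword_overlap("heart attack", "Patient admitted with myocardial infarction")
```

If most of your real queries look like the first case, keyword search is likely enough. If they look like the second, that is the gap embeddings are meant to close.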
| Signal | Lean toward keyword | Lean toward embeddings |
|---|---|---|
| Query vocabulary | Same as documents | Different from documents |
| Query type | Exact IDs, codes, SKUs | Conceptual, open-ended |
| Corpus size | < 10K documents | > 100K documents |
| Latency budget | < 10 ms P99 | 50–200 ms P99 acceptable |
| Cost budget | < $0.01/day | Willing to pay for quality |
Most production RAG systems end up with hybrid search: BM25 for exact matches, embeddings for semantic matches, and a reranker to merge the results. Start with keyword search, and add embeddings only once you can measure the recall improvement.
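One simple way to merge keyword and embedding results is reciprocal rank fusion (RRF). Note that RRF is a rank-fusion heuristic, not the learned reranker mentioned above; it is shown here only because it needs no model and fits in a few lines. The function name, the document ids, and the two input rankings are made up for illustration, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_results = ["sku-123", "doc-a", "doc-b"]
embedding_results = ["doc-a", "doc-c", "sku-123"]

fused = reciprocal_rank_fusion([bm25_results, embedding_results])
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still appear further down. A cross-encoder reranker can then be applied to the fused list if measured recall justifies the extra latency.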