Embedding Models

Embedding models convert text into vectors for semantic search and RAG. This article covers the 2026 model landscape, cost math at scale, production patterns, and the hidden traps — especially the re-embedding trap when you switch models.

Quick Reference

→Use embed_documents() for corpus indexing and embed_query() for search queries. The two methods may apply different internal prefixes.
→text-embedding-3-small ($0.02/1M) is the English-text default. voyage-4-lite is the same price with stronger retrieval.
→Gemini Embedding 2 (#1 MTEB multilingual, 3072 dims, 8192 token input) is free during preview.
→Switching models means re-embedding your entire corpus. Plan before you pick.
→Reduce dimensions to 512 via the Matryoshka `dimensions` parameter: starting from 1536 dims, this cuts storage by about two thirds with minimal recall loss.
→CacheBackedEmbeddings wraps any embedder and skips API calls on repeated text. Always use it.
→Batch embed_documents() in groups of 256 to stay within rate limits and maximize throughput.
→Measure retrieval hit-rate@5 on a held-out eval set before calling embeddings 'good enough'.
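
The batching and hit-rate@5 bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: `batched` and `hit_rate_at_k` are hypothetical helper names, and the document IDs are made up. It assumes hit-rate@k means "the share of queries with at least one relevant document in the top k results".

```python
from __future__ import annotations

from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int = 256) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items (256 per the bullet above)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def hit_rate_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k results contain at least one relevant ID."""
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have one entry per query")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if gold.intersection(ranked[:k])
    )
    return hits / len(retrieved)


# Made-up eval set: three queries, ranked doc IDs per query, and the gold IDs.
retrieved_ids = [["d1", "d7", "d3"], ["d9", "d2"], ["d4"]]
gold_ids = [{"d3"}, {"d5"}, {"d4"}]
print(f"hit-rate@5 = {hit_rate_at_k(retrieved_ids, gold_ids, k=5):.2f}")

# Batch sizes for a 600-chunk corpus.
chunks = [f"chunk {i}" for i in range(600)]
print([len(batch) for batch in batched(chunks)])
```

In a real pipeline each batch would be passed to embed_documents(), and the ranked IDs would come from the vector store rather than a literal list.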

When NOT to Use Embeddings

Before reaching for an embedding model, check whether keyword search would work. BM25 (or Elasticsearch/OpenSearch) is faster, cheaper, and produces more interpretable results for exact-match queries like product SKUs, legal citation numbers, or error codes. Embeddings win when the user's vocabulary differs from the document vocabulary — 'heart attack' should match 'myocardial infarction.' If your users search with the same words the documents use, you may not need embeddings at all.

Signal	Lean toward keyword	Lean toward embeddings
Query vocabulary	Same as documents	Different from documents
Query type	Exact IDs, codes, SKUs	Conceptual, open-ended
Corpus size	< 10K documents	> 100K documents
Latency budget	< 10 ms P99	50–200 ms P99 acceptable
Cost budget	< $0.01/day	Willing to pay for quality

Hybrid search first

Most production RAG systems end up with hybrid search: BM25 for exact matches, embeddings for semantic matches, and a reranker to merge the results. Start with keyword search, and add embeddings only when you can measure the recall improvement.
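
As a rough illustration of the merge step, the sketch below combines a keyword ranking and an embedding ranking with reciprocal rank fusion (RRF). Note that RRF is a simple rank-fusion technique swapped in here for illustration, not the reranker model the paragraph mentions. The function name, the document IDs, and the smoothing constant `k = 60` (a commonly used default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-123"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives in the merged list. A cross-encoder reranker could then rescore the top of this list if the measured gain justifies the extra latency.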

The Embeddings Interface

embed_documents() for indexing, embed_query() for search — same class, different methods
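
To make the two-method shape concrete, here is a toy stand-in that mirrors it. It is not a real model or any library's implementation: `ToyEmbeddings`, the `"passage: "` and `"query: "` prefixes, and the hash-based vectors are all assumptions chosen so the example runs without network access or API keys. The point is only that the same text can produce different vectors depending on which method embeds it.

```python
import hashlib
import math
from typing import List


class ToyEmbeddings:
    """Hypothetical embedder with the embed_documents / embed_query shape."""

    def __init__(self, dims: int = 8,
                 doc_prefix: str = "passage: ",
                 query_prefix: str = "query: ") -> None:
        if not 1 <= dims <= 32:
            raise ValueError("dims must be between 1 and 32 (SHA-256 digest size)")
        self.dims = dims
        self.doc_prefix = doc_prefix
        self.query_prefix = query_prefix

    def _embed(self, text: str) -> List[float]:
        # Deterministic pseudo-vector from a hash, L2-normalized to unit length.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        raw = [digest[i] - 127.5 for i in range(self.dims)]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Indexing path: one vector per document, with the document prefix."""
        return [self._embed(self.doc_prefix + text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        """Search path: a single vector, with the query prefix."""
        return self._embed(self.query_prefix + text)


embedder = ToyEmbeddings()
doc_vectors = embedder.embed_documents(["myocardial infarction", "heart attack"])
query_vector = embedder.embed_query("heart attack")

print(len(doc_vectors), len(doc_vectors[0]))
print(query_vector == doc_vectors[1])  # same text, different method
```

Because the prefixes differ, indexing a corpus with embed_query() or searching with embed_documents() can silently degrade retrieval even though both calls succeed.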

Model Landscape (April 2026)

Model	Dims	Cost/1M tokens	MTEB (score or rank)	Best for
text-embedding-3-small (OpenAI)	1536 (configurable)	$0.02	62.3 English	General English RAG, tight budget
voyage-4-lite (Voyage AI)	1024	$0.02	63.1 English	Same budget as OAI small, stronger recall
voyage-4 (Voyage AI)	1024	$0.06	65.4 English	Code, technical docs, high recall
Gemini Embedding 2 (Google)	3072 (configurable)	Free (preview)	#1 multilingual	Multilingual, multimodal, long docs (8K tokens)
Cohere Embed v4	1536	$0.12 text / $0.47 image	Top-5 multilingual	Multimodal (text + images), binary quantization
text-embedding-3-large (OpenAI)	3072 (configurable)	$0.13	64.6 English	Maximum OpenAI accuracy, at 6.5× the cost of small
Qwen3-Embedding-8B (Alibaba)	4096	Free (self-hosted)	#1 open-source	Air-gapped deployments, multilingual
Llama-Embed-Nemotron-8B (NVIDIA)	4096	Free (self-hosted)	Top-3 open-source	Air-gapped deployments, NVIDIA stack
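
The prices and dimensions in the table translate into budget numbers with simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: the corpus size (10 million chunks of 500 tokens each) is hypothetical, vectors are assumed to be stored as float32 (4 bytes per value) with no index overhead, and the function names are illustrative. The per-token prices are the ones listed in the table above.

```python
BYTES_PER_FLOAT32 = 4


def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """API cost of one full embedding pass over the corpus."""
    return total_tokens / 1_000_000 * price_per_million


def index_size_gb(num_vectors: int, dims: int,
                  bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Raw vector storage in GB (1 GB = 1e9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


# Hypothetical corpus; adjust to your own numbers.
NUM_CHUNKS = 10_000_000
TOKENS_PER_CHUNK = 500
total_tokens = NUM_CHUNKS * TOKENS_PER_CHUNK

# Prices per 1M tokens, taken from the table above.
prices = {
    "text-embedding-3-small": 0.02,
    "voyage-4": 0.06,
    "text-embedding-3-large": 0.13,
}
for model, price in prices.items():
    cost = embedding_cost_usd(total_tokens, price)
    print(f"{model}: ${cost:,.2f} per full embedding pass")

full = index_size_gb(NUM_CHUNKS, 1536)
reduced = index_size_gb(NUM_CHUNKS, 512)
print(f"1536 dims: {full:.2f} GB, 512 dims: {reduced:.2f} GB, "
      f"saving {1 - reduced / full:.0%}")
```

Switching models later means paying the per-pass cost again for the whole corpus, since vectors from different models are not comparable, so the same function gives a first estimate of the re-embedding bill.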
