Fine-Tuning Embeddings
Fine-tuning embedding models on domain data can improve retrieval recall by 10-25% — but re-ranking, query expansion, and better chunking solve the same problem with less effort in most cases. This article covers when fine-tuning is the right tool, what it costs, how to build training data and train with the modern SentenceTransformerTrainer API, how to measure improvement, common failure modes, and production deployment.
Quick Reference
- Off-the-shelf embeddings underperform on domain jargon: a domain-tuned model should place 'tortious interference' near 'wrongful business disruption', not just near 'tort'
- Fine-tuning framework: sentence-transformers SentenceTransformerTrainer (Python). OpenAI and Cohere do NOT offer embedding fine-tuning APIs as of April 2026
- Base models (2026): BAAI/bge-m3 (multilingual, MIT license), intfloat/multilingual-e5-large-instruct, Qwen3-Embedding-8B
- Minimum 1,000 query-document pairs with hard negatives; 5,000+ for reliable results
- Training: contrastive learning with MultipleNegativesRankingLoss, which is more data-efficient than TripletLoss because every non-matching query-document pairing in a batch acts as an implicit negative
- Before fine-tuning: try re-ranking (16h setup, +15-30% precision), query expansion (8h, +5-10%), better chunking (4h, +5-15%)
- When NOT to fine-tune: fewer than 1,000 pairs, general-purpose queries, rapidly changing content
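The in-batch negatives idea behind MultipleNegativesRankingLoss can be sketched in plain Python. This is a toy illustration of the principle, not the library's implementation: the function name `in_batch_negatives_loss`, the 2-d vectors, and the `scale=20.0` temperature are assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch is treated as a negative.
    `scale` is an assumed temperature for this sketch."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings": a batch of 3 pairs gives 3 positives and 6 implicit negatives.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # positives deliberately mismatched

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_shuffled = in_batch_negatives_loss(queries, shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each query sits closest to its own document and high when the pairing is scrambled. No explicit negative column is required, which is why a batch of N pairs yields N × (N − 1) negatives for free.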
Should You Fine-Tune at All?
Fine-tuning is the most expensive intervention in the retrieval stack. It requires labeled data, compute, re-embedding your entire corpus, and ongoing maintenance as your domain evolves. Three cheaper interventions address the same problem and should be tried first.
Important correction to common advice: as of April 2026, OpenAI and Cohere do not offer a public API for fine-tuning their embedding models. OpenAI's fine-tuning API supports GPT-4.1 series language models only. Cohere's embed-v4 does not expose a fine-tuning endpoint in their public API. If you need to fine-tune embeddings, you do it through open-source frameworks, primarily sentence-transformers, on a base model you self-host.
Re-ranking reorders what the space retrieves; fine-tuning reshapes the space itself.
| Problem | First Try | Fine-Tune? | Why |
|---|---|---|---|
| Domain jargon misses (legal, medical) | Cross-encoder re-ranking | If Recall@5 after re-ranking stays below 70% | Re-rankers handle most domain mismatches; fine-tune when they don't |
| Internal acronyms not retrieved | Query expansion with synonym injection | If expansion doesn't help | Expand acronyms at query time before paying to retrain a model |
| Low precision in top-5 (noise) | Re-ranking | No | Low precision is a cross-encoder problem, not an embedding space problem |
| Multilingual domain terminology | bge-m3 or multilingual-e5 | Yes, if domain-specific | Off-the-shelf multilingual models may lack domain-specific geometry |
| General Q&A with acceptable retrieval | Nothing | No | You're already performing well |
| Rapidly changing terminology (news, product) | Nothing | No | Model goes stale faster than you can retrain |
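
The "query expansion with synonym injection" row can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the `ACRONYMS` glossary and the `expand_query` helper are hypothetical, and a real glossary would come from your own domain sources.

```python
import re

# Hypothetical glossary for illustration only.
ACRONYMS = {
    "MSA": "master services agreement",
    "SOW": "statement of work",
}

def expand_query(query, glossary=ACRONYMS):
    """Append the long form of each known acronym found in the query.
    Matching is case-sensitive so ordinary words such as 'sow' are left alone."""
    expansions = []
    for token in re.findall(r"[A-Za-z0-9]+", query):
        long_form = glossary.get(token)
        if long_form and long_form not in expansions:
            expansions.append(long_form)
    if not expansions:
        return query
    return f"{query} ({'; '.join(expansions)})"

expanded = expand_query("termination clause in the MSA")
print(expanded)
```

The expanded string is what gets embedded at query time, so the corpus does not need to be re-embedded to try this.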
Run 30 representative domain queries through your current pipeline with a re-ranker and measure Recall@5. If re-ranking lifts Recall@5 above 80%, stop: your retrieval is good enough. If Recall@5 is still below 70% after you have also adjusted chunking and query expansion, fine-tuning has a real signal. This gate prevents wasting 30-40 engineering hours on a problem that a 16-hour re-ranker integration would have solved.
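The gate can be sketched as a small script. Assumptions for illustration: Recall@5 is taken as the fraction of a query's labeled relevant documents that appear in the top 5, averaged over queries. The helper names and toy document IDs are hypothetical. The text gives no rule for the 70-80% band, so the sketch reports it as inconclusive.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one labeled relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids) tuples, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

def fine_tuning_gate(mean_recall, stop_above=0.80, tune_below=0.70):
    """Apply the thresholds from the text to a mean Recall@5 score."""
    if mean_recall > stop_above:
        return "stop: retrieval is good enough"
    if mean_recall < tune_below:
        return "fine-tuning has a real signal"
    return "inconclusive: no rule given for this band"

# Toy data: four queries instead of the recommended 30, results after re-ranking.
runs = [
    (["d1", "d7", "d3", "d9", "d2"], ["d1"]),              # hit
    (["d4", "d5", "d6", "d8", "d0"], ["d2"]),              # miss
    (["d2", "d3", "d1", "d5", "d6", "d9"], ["d9", "d3"]),  # d9 is ranked 6th, so half credit
    (["d8", "d1"], ["d8"]),                                # hit
]

score = mean_recall_at_k(runs, k=5)
decision = fine_tuning_gate(score)
print(f"Recall@5 = {score:.3f} -> {decision}")
```

Score the same query set before and after each cheaper intervention so the comparison stays like for like.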