Fine-Tuning Embeddings

Fine-tuning embedding models on domain data can improve retrieval recall by 10-25% — but re-ranking, query expansion, and better chunking solve the same problem with less effort in most cases. This article covers when fine-tuning is the right tool, what it costs, how to build training data and train with the modern SentenceTransformerTrainer API, how to measure improvement, common failure modes, and production deployment.

Quick Reference

  • Off-the-shelf embeddings underperform on domain jargon: a legal retriever should place 'tortious interference' near 'wrongful business disruption', not just near 'tort'
  • Fine-tuning framework: sentence-transformers SentenceTransformerTrainer (Python). OpenAI and Cohere do NOT offer embedding fine-tuning APIs as of April 2026
  • Base models (2026): BAAI/bge-m3 (multilingual, MIT license), intfloat/multilingual-e5-large-instruct, Qwen3-Embedding-8B
  • Minimum 1,000 query-document pairs with hard negatives. 5,000+ for reliable results
  • Training: contrastive learning with MultipleNegativesRankingLoss — more data-efficient than TripletLoss because every non-matching pair in a batch is an implicit negative
  • Before fine-tuning: try re-ranking (16h setup, +15-30% precision), query expansion (8h, +5-10%), better chunking (4h, +5-15%)
  • When NOT to fine-tune: fewer than 1,000 pairs, general-purpose queries, rapidly changing content
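
The in-batch negatives idea behind MultipleNegativesRankingLoss can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the library's code: the toy 2-d vectors are made up, and the `scale=20.0` multiplier on cosine similarity is assumed to mirror the library default. For each query, the document at the same index is the positive and every other document in the batch acts as a negative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    all other docs in the batch serve as implicit negatives.
    scale=20.0 is an assumption about the usual default."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): query i matches document i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
# Rotating the documents breaks every query-document pairing.
shuffled_docs = [aligned_docs[1], aligned_docs[2], aligned_docs[0]]

aligned_loss = in_batch_negatives_loss(queries, aligned_docs)
shuffled_loss = in_batch_negatives_loss(queries, shuffled_docs)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

A batch of N pairs yields N - 1 negatives per query at no labeling cost, which is why this loss tends to need fewer labeled examples than explicit triplets.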

Should You Fine-Tune at All?

Fine-tuning is the most expensive intervention in the retrieval stack. It requires labeled data, compute, re-embedding your entire corpus, and ongoing maintenance as your domain evolves. Three cheaper interventions address the same problem and should be tried first.

Important correction to common advice: as of April 2026, OpenAI and Cohere do not offer a public API for fine-tuning their embedding models. OpenAI's fine-tuning API supports GPT-4.1 series language models only. Cohere's embed-v4 does not expose a fine-tuning endpoint in their public API. If you need to fine-tune embeddings, you do it through open-source frameworks, primarily sentence-transformers, on a base model you self-host.

Decision flow: should you fine-tune embeddings?

  • Retrieval quality poor on domain queries? No: keep embeddings (retrieval is good enough). Yes: go to the next question
  • Have 1,000+ labeled query pairs? No: try re-ranking first (collect data while it runs). Yes: go to the next question
  • Domain-specific terminology gaps? No: re-ranking is enough (precision gap, not space gap). Yes: fine-tune (reshape the embedding space)

Re-ranking reorders what the space retrieves; fine-tuning reshapes the space itself.
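
The decision flow above can be written as a small function. This is a minimal sketch: the function name, parameter names, and return strings are hypothetical, and only the three questions and the 1,000-pair threshold come from the flow itself.

```python
def recommend_next_step(
    poor_domain_retrieval: bool,
    labeled_pairs: int,
    terminology_gaps: bool,
) -> str:
    """Walk the three questions of the decision flow in order."""
    if not poor_domain_retrieval:
        return "keep embeddings"  # retrieval is good enough
    if labeled_pairs < 1000:
        return "try re-ranking first"  # collect data while it runs
    if not terminology_gaps:
        return "re-ranking is enough"  # precision gap, not space gap
    return "fine-tune"  # reshape the embedding space


print(recommend_next_step(True, 400, True))   # → try re-ranking first
print(recommend_next_step(True, 5000, True))  # → fine-tune
```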

| Problem | First Try | Fine-Tune? | Why |
|---|---|---|---|
| Domain jargon misses (legal, medical) | Cross-encoder re-ranking | If reranking Recall@5 < 70% | Re-rankers handle most domain mismatches; fine-tune when they don't |
| Internal acronyms not retrieved | Query expansion with synonym injection | If expansion doesn't help | Expand acronyms at query time before paying to retrain a model |
| Low precision in top-5 (noise) | Re-ranking | No | Low precision is a cross-encoder problem, not an embedding space problem |
| Multilingual domain terminology | bge-m3 or multilingual-e5 | Yes, if domain-specific | Off-the-shelf multilingual models may lack domain-specific geometry |
| General Q&A with acceptable retrieval | Nothing | No | You're already performing well |
| Rapidly changing terminology (news, product) | Nothing | No | Model goes stale faster than you can retrain |

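
Query-time acronym expansion, the first thing to try for the "internal acronyms" row, might look like the sketch below. The glossary contents and the `expand_acronyms` helper are hypothetical; a real deployment would load the glossary from wherever your organization keeps its internal terminology.

```python
import re

# Hypothetical glossary of internal acronyms.
ACRONYMS = {
    "SLA": "service level agreement",
    "RCA": "root cause analysis",
}


def expand_acronyms(query: str, glossary: dict[str, str]) -> str:
    """Append the expansion after each known all-caps acronym so the
    embedding model sees both the short and the long form."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        expansion = glossary.get(token)
        return f"{token} ({expansion})" if expansion else token

    # Match standalone runs of two or more capital letters.
    return re.sub(r"\b[A-Z]{2,}\b", replace, query)


expanded = expand_acronyms("RCA template for missed SLA", ACRONYMS)
print(expanded)
# → RCA (root cause analysis) template for missed SLA (service level agreement)
```

Keeping the original acronym alongside its expansion preserves exact-match hits on documents that only use the short form.
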
The decision gate

Run 30 representative queries. If reranking improves Recall@5 above 80% on domain-specific queries, stop — your retrieval is good enough. If reranking still leaves Recall@5 below 70% on domain queries after adjusting chunking and query expansion, fine-tuning has a real signal. This gate prevents wasting 30-40 engineering hours on a problem that a 16-hour reranker integration would have solved.
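
The gate can be scripted once you have relevance labels for the test queries. The sketch below makes two assumptions that the text does not specify: Recall@5 is computed as the fraction of a query's relevant documents found in the top 5, averaged over queries, and scores between the 70% and 80% thresholds are treated as borderline. The document IDs are made up.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


def decision_gate(mean_recall, stop_above=0.80, fine_tune_below=0.70):
    """Apply the gate to Recall@5 measured after re-ranking."""
    if mean_recall > stop_above:
        return "stop: retrieval is good enough"
    if mean_recall < fine_tune_below:
        return "fine-tuning has a real signal"
    # Assumption: the 70-80% band is not covered by the gate as written.
    return "borderline: keep adjusting chunking and query expansion"


# Two hypothetical queries; a real run would use about 30.
runs = [
    (["d1", "d2", "d3", "d4", "d5", "d6"], ["d1", "d6"]),  # d6 misses the top 5
    (["d9", "d8", "d7"], ["d7"]),
]
score = mean_recall_at_k(runs)
print(score, decision_gate(score))
```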