When to Fine-Tune
A decision framework for choosing between prompt engineering, RAG, and fine-tuning: when fine-tuning is the right investment, when it is a waste of time, how the costs of the approaches compare, and the use cases where fine-tuning delivers the most value.
Quick Reference
- →Fine-tune for: consistent style/format, domain-specific terminology, latency reduction (smaller model), cost reduction at scale
- →Don't fine-tune for: adding knowledge (use RAG), improving general intelligence (use a better model), one-off tasks
- →Good fine-tuning use cases: consistent email tone, domain classification, structured extraction, code style
- →Cost: OpenAI fine-tuning ~$8/1M training tokens + inference cost. Self-hosted: GPU compute + engineering time
- →You need 200-2000 high-quality examples for most fine-tuning tasks
- →Always compare fine-tuned model against prompt-engineered baseline on your evaluation set
Fine-Tuning vs Prompt Engineering vs RAG
Fine-tuning, prompt engineering, and RAG address different problems. The key insight is understanding what each approach is good at and choosing based on your specific bottleneck.
| Approach | Best for | Not good for | Investment |
|---|---|---|---|
| Prompt engineering | Task specification, output format, reasoning guidance | Teaching new knowledge, consistent style at scale | Hours |
| RAG (retrieval) | Adding factual knowledge, keeping info up-to-date, citations | Teaching style/format, improving reasoning | Days-weeks |
| Fine-tuning | Consistent style/tone, domain adaptation, cost reduction, latency | Adding new facts (knowledge cutoff still applies), general intelligence | Weeks |
| Combining all three | Production systems needing all properties | Simple tasks where one approach suffices | Weeks-months |
Think of it this way: prompt engineering tells the model WHAT to do. RAG gives the model WHAT to know. Fine-tuning teaches the model HOW to behave. If your problem is 'the model does not know X,' use RAG. If your problem is 'the model does not do X consistently,' fine-tune.
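The rule of thumb above can be sketched as a small routing function. This is an illustrative helper, not a library API; the function name, inputs, and the volume threshold are all placeholder assumptions:

```python
def choose_approach(needs_new_facts: bool,
                    needs_consistent_behavior: bool,
                    volume_per_month: int) -> str:
    """Illustrative decision helper mirroring the framework above.

    needs_new_facts: 'the model does not know X'
    needs_consistent_behavior: 'the model does not do X consistently'
    volume_per_month: requests/month, used to judge whether fine-tuning's
        upfront cost can amortize (100K/month is an arbitrary threshold)
    """
    if needs_new_facts:
        return "RAG"  # facts belong in retrieval, not in model weights
    if needs_consistent_behavior and volume_per_month >= 100_000:
        return "fine-tuning"  # behavior at scale justifies training cost
    return "prompt engineering"  # cheapest option; exhaust it first
```

Production systems that need several of these properties typically layer the approaches rather than picking exactly one, as the table's last row notes.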
Good Use Cases for Fine-Tuning
- ▸Consistent style transfer: make o4-mini write in your company's specific voice consistently across thousands of outputs
- ▸Domain-specific classification: classify medical records, legal documents, or financial transactions with domain terminology
- ▸Structured extraction: reliably extract specific fields from domain documents (insurance claims, invoices, contracts)
- ▸Format enforcement: always produce output in a specific schema without format drift on edge cases
- ▸Latency reduction: fine-tune a smaller model (o4-mini, Llama 8B) to match a larger model on your specific task
- ▸Cost reduction: replace GPT-5.4 ($2.00/1M) with fine-tuned o4-mini ($0.30/1M) for 6x cost savings
The highest-ROI fine-tuning scenario is: you have a task currently running on GPT-5.4 (expensive), prompt engineering is working well (80-90% accuracy), and you want to maintain quality while cutting costs. Fine-tuning o4-mini on 1K-2K examples of your GPT-5.4 outputs often matches GPT-5.4 quality at 1/6th the cost.
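One way to run this distillation workflow is to capture the large model's outputs as chat-format JSONL training examples for the smaller model. A minimal sketch, assuming you have already collected (prompt, output) pairs from the expensive model; the message layout shown follows the chat fine-tuning JSONL convention, but verify field names against your provider's current documentation:

```python
import json

def build_finetune_jsonl(pairs, system_prompt, path="train.jsonl"):
    """Write (user_prompt, assistant_output) pairs as chat-format JSONL
    suitable for supervised fine-tuning of a smaller model."""
    with open(path, "w") as f:
        for user_prompt, assistant_output in pairs:
            record = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
                {"role": "assistant", "content": assistant_output},
            ]}
            f.write(json.dumps(record) + "\n")
    return path

# Two toy distilled examples captured from the larger model
pairs = [("Summarize: Q3 revenue grew 12%.", "Revenue up 12% in Q3."),
         ("Summarize: churn fell to 2%.", "Churn down to 2%.")]
build_finetune_jsonl(pairs, "Summarize in our house style.")
```

Filter the captured outputs before training: only examples the large model got right should go into the file, since the small model will faithfully learn mistakes too.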
When NOT to Fine-Tune
| Goal | Why fine-tuning won't help | What to do instead |
|---|---|---|
| Add factual knowledge | Fine-tuning teaches behavior, not facts. Knowledge retention from fine-tuning is unreliable | Use RAG to provide facts at inference time |
| Improve general intelligence | You can't make a small model as smart as a large one through fine-tuning | Use a more capable model (o3, Claude Sonnet 4.6) |
| Handle a one-off task | Fine-tuning overhead is not justified for tasks you do rarely | Prompt engineering with few-shot examples |
| Fix safety/alignment | Fine-tuning can degrade safety alignment, not improve it | Use guardrails, input/output filters, system prompts |
| Process new data types | The model architecture limits what inputs it can process | Use a multimodal model or specialized pipeline |
The most common mistake is trying to fine-tune knowledge into a model. If you fine-tune on 1000 examples mentioning your company's products, the model might learn to mention those products, but it won't reliably recall specific facts about them. For factual accuracy, RAG is almost always better than fine-tuning.
- ▸Fine-tuning can degrade model capabilities outside the fine-tuned task (catastrophic forgetting)
- ▸If you have fewer than 200 examples, few-shot prompting is likely sufficient
- ▸Fine-tuning requires ongoing maintenance: retraining when the base model updates, monitoring for drift
- ▸Regulatory environments may require explainability -- fine-tuning makes the model less interpretable
Cost Analysis: Is Fine-Tuning Worth It?
Fine-tuning has upfront costs (training) and ongoing benefits (cheaper inference, better quality). The break-even depends on your volume.
| Cost component | OpenAI fine-tuning | Self-hosted (LoRA) | Self-hosted (full) |
|---|---|---|---|
| Training compute | $8/1M tokens (~$25-100 for 1K examples) | $50-500 (GPU hours) | $500-5000 (GPU hours) |
| Data preparation | Engineering time (hours-days) | Same | Same |
| Evaluation | Inference cost for test set | Same | Same |
| Inference cost | $0.30 input / $1.20 output per 1M tokens (mini) | Self-hosted rates | Self-hosted rates |
| Maintenance | Retrain on model updates | Full control but full responsibility | Full control but full responsibility |
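The break-even volume implied by the table can be estimated by amortizing the one-time training cost against the per-token savings of the cheaper model. A sketch with illustrative placeholder prices, not quotes:

```python
def break_even_tokens(training_cost: float,
                      large_price_per_1m: float,
                      small_price_per_1m: float) -> float:
    """Tokens you must process before per-token savings repay the
    one-time training cost. Prices are dollars per 1M tokens."""
    savings_per_1m = large_price_per_1m - small_price_per_1m
    if savings_per_1m <= 0:
        raise ValueError("smaller model must be cheaper to break even")
    return training_cost / savings_per_1m * 1_000_000

# $100 training run; $2.00/1M large model vs $0.30/1M fine-tuned small model
tokens = break_even_tokens(100, 2.00, 0.30)
print(f"break-even at ~{tokens / 1e6:.0f}M tokens")  # → break-even at ~59M tokens
```

At high volume the training cost is repaid quickly; at low volume it may never be, which is why the one-off-task row in the earlier table points back to prompt engineering.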
Never start fine-tuning without a clear evaluation framework. Define your metrics, build your test set, and establish your baseline (prompt-engineered model performance) before spending a dollar on training. If you can't measure improvement, you won't know if fine-tuning helped.
Prerequisites Before Fine-Tuning
- ▸A clear, measurable quality metric for your task (accuracy, F1, format compliance, human preference)
- ▸A test set of 50-100 labeled examples to evaluate before and after fine-tuning
- ▸A prompt-engineered baseline that shows the current quality ceiling without fine-tuning
- ▸200-2000 high-quality training examples (quality matters far more than quantity)
- ▸A clear hypothesis about what fine-tuning will improve (e.g., 'format consistency from 85% to 95%')
- ▸Budget for iterating -- the first fine-tuning attempt rarely produces the best model
Teams that invest weeks in data preparation sometimes push ahead with fine-tuning even when the results are not clearly better than prompt engineering -- a sunk-cost trap. Set a clear success criterion before starting; if the fine-tuned model does not beat the prompt-engineered baseline by your target margin, abandon it and try a different approach.
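That success-criterion discipline can be enforced with a tiny comparison harness: score both models on the same test set with the same metric, and accept the fine-tune only if it clears the baseline by the pre-registered margin. The model callables and the 5-point margin below are placeholders:

```python
def accuracy(model, test_set):
    """Fraction of test examples where the model's output matches the label."""
    return sum(model(x) == y for x, y in test_set) / len(test_set)

def accept_finetune(baseline, candidate, test_set, margin=0.05):
    """Pre-registered decision rule: ship the fine-tuned model only if it
    beats the prompt-engineered baseline by at least `margin`."""
    base_score = accuracy(baseline, test_set)
    cand_score = accuracy(candidate, test_set)
    return cand_score - base_score >= margin, base_score, cand_score

# Toy stand-ins for the real model calls
test_set = [("a", 1), ("b", 0), ("c", 1), ("d", 1)]
baseline = lambda x: 1                    # always predicts 1: 75% here
candidate = lambda x: {"b": 0}.get(x, 1)  # gets "b" right too: 100% here
ship, base, cand = accept_finetune(baseline, candidate, test_set)
```

In practice `test_set` would be the 50-100 held-out labeled examples from the prerequisites list, and the metric would be whichever one you chose up front (accuracy, F1, format compliance, or human preference).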
Best Practices
Do
- ✓Establish a prompt-engineering baseline and test set before starting fine-tuning
- ✓Fine-tune for behavior (style, format, consistency), not for knowledge (use RAG for that)
- ✓Start with the smallest model that meets your quality needs -- fine-tuned o4-mini is often sufficient
- ✓Budget for 2-3 iterations of fine-tuning -- the first attempt is a starting point, not the final model
- ✓Compare fine-tuned model against baseline on the same test set with the same metrics
Don’t
- ✗Don't fine-tune as a first resort -- exhaust prompt engineering and RAG first
- ✗Don't try to fine-tune knowledge into a model -- facts should come from retrieval
- ✗Don't fine-tune with fewer than 200 examples -- quality will be poor
- ✗Don't skip the evaluation framework -- without measurement, you are guessing
- ✗Don't assume fine-tuning is permanent -- base model updates may require retraining
Key Takeaways
- ✓Fine-tuning teaches HOW to behave (style, format, consistency), not WHAT to know (use RAG for knowledge).
- ✓Best use cases: consistent style, domain classification, format enforcement, and cost reduction at scale.
- ✓Always establish a prompt-engineering baseline and evaluation set before investing in fine-tuning.
- ✓Fine-tuned o4-mini often matches GPT-5.4 quality on specific tasks at 1/6th the inference cost.
- ✓The evaluation-first principle: if you can't measure improvement, don't fine-tune.