LLM Foundations/Fine-Tuning
★ Overview · Intermediate · 11 min

When to Fine-Tune

A decision framework for choosing between prompt engineering, RAG, and fine-tuning. When fine-tuning is the right investment, when it is a waste of time, cost analysis comparing approaches, and the use cases where fine-tuning delivers the most value.

Quick Reference

  • Fine-tune for: consistent style/format, domain-specific terminology, latency reduction (smaller model), cost reduction at scale
  • Don't fine-tune for: adding knowledge (use RAG), improving general intelligence (use a better model), one-off tasks
  • Good fine-tuning use cases: consistent email tone, domain classification, structured extraction, code style
  • Cost: OpenAI fine-tuning ~$8/1M training tokens + inference cost. Self-hosted: GPU compute + engineering time
  • You need 200-2000 high-quality examples for most fine-tuning tasks
  • Always compare fine-tuned model against prompt-engineered baseline on your evaluation set

Fine-Tuning vs Prompt Engineering vs RAG

Fine-tuning, prompt engineering, and RAG address different problems. The key insight is understanding what each approach is good at and choosing based on your specific bottleneck.

| Approach | Best for | Not good for | Investment |
| --- | --- | --- | --- |
| Prompt engineering | Task specification, output format, reasoning guidance | Teaching new knowledge, consistent style at scale | Hours |
| RAG (retrieval) | Adding factual knowledge, keeping info up-to-date, citations | Teaching style/format, improving reasoning | Days-weeks |
| Fine-tuning | Consistent style/tone, domain adaptation, cost reduction, latency | Adding new facts (knowledge cutoff still applies), general intelligence | Weeks |
| Combining all three | Production systems needing all properties | Simple tasks where one approach suffices | Weeks-months |

The fundamental distinction

Think of it this way: prompt engineering tells the model WHAT to do. RAG gives the model WHAT to know. Fine-tuning teaches the model HOW to behave. If your problem is 'the model does not know X,' use RAG. If your problem is 'the model does not do X consistently,' fine-tune.
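As a sketch, the decision rule above can be encoded as a tiny keyword-to-approach table. The keywords and mapping here are illustrative assumptions, not an official framework:

```python
# Minimal sketch of the decision framework: "does not know X" -> RAG,
# "does not do X consistently" -> fine-tuning, otherwise start with
# prompt engineering (the cheapest option). Keywords are illustrative.

def choose_approach(problem: str) -> str:
    """Map a problem description to the approach this guide suggests."""
    rules = [
        ("does not know", "RAG"),           # missing facts -> retrieval
        ("out of date", "RAG"),             # stale knowledge -> retrieval
        ("inconsistent", "fine-tuning"),    # behavior/consistency -> fine-tune
        ("consistently", "fine-tuning"),
    ]
    text = problem.lower()
    for keyword, approach in rules:
        if keyword in text:
            return approach
    return "prompt engineering"  # default: cheapest approach first
```

Real triage is of course messier, but making the rule explicit forces you to name your actual bottleneck before spending money.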

Good Use Cases for Fine-Tuning

  • Consistent style transfer: make o4-mini write in your company's specific voice consistently across thousands of outputs
  • Domain-specific classification: classify medical records, legal documents, or financial transactions with domain terminology
  • Structured extraction: reliably extract specific fields from domain documents (insurance claims, invoices, contracts)
  • Format enforcement: always produce output in a specific schema without format drift on edge cases
  • Latency reduction: fine-tune a smaller model (o4-mini, Llama 8B) to match a larger model on your specific task
  • Cost reduction: replace GPT-5.4 ($2.00/1M) with fine-tuned o4-mini ($0.30/1M) for 6x cost savings
Cost comparison: fine-tuned small model vs large model
The sweet spot

The highest-ROI fine-tuning scenario is: you have a task currently running on GPT-5.4 (expensive), prompt engineering is working well (80-90% accuracy), and you want to maintain quality while cutting costs. Fine-tuning o4-mini on 1K-2K examples of your GPT-5.4 outputs often matches GPT-5.4 quality at 1/6th the cost.

When NOT to Fine-Tune

| Goal | Why fine-tuning won't help | What to do instead |
| --- | --- | --- |
| Add factual knowledge | Fine-tuning teaches behavior, not facts. Knowledge retention from fine-tuning is unreliable | Use RAG to provide facts at inference time |
| Improve general intelligence | You can't make a small model as smart as a large one through fine-tuning | Use a more capable model (o3, Claude Sonnet 4.6) |
| Handle a one-off task | Fine-tuning overhead is not justified for tasks you do rarely | Prompt engineering with few-shot examples |
| Fix safety/alignment | Fine-tuning can degrade safety alignment, not improve it | Use guardrails, input/output filters, system prompts |
| Process new data types | The model architecture limits what inputs it can process | Use a multimodal model or specialized pipeline |

The knowledge trap

The most common mistake is trying to fine-tune knowledge into a model. If you fine-tune on 1000 examples mentioning your company's products, the model might learn to mention those products, but it won't reliably recall specific facts about them. For factual accuracy, RAG is almost always better than fine-tuning.

  • Fine-tuning can degrade model capabilities outside the fine-tuned task (catastrophic forgetting)
  • If you have fewer than 200 examples, few-shot prompting is likely sufficient
  • Fine-tuning requires ongoing maintenance: retraining when the base model updates, monitoring for drift
  • Regulatory environments may require explainability -- fine-tuning makes the model less interpretable

Cost Analysis: Is Fine-Tuning Worth It?

Fine-tuning has upfront costs (training) and ongoing benefits (cheaper inference, better quality). The break-even depends on your volume.

| Cost component | OpenAI fine-tuning | Self-hosted (LoRA) | Self-hosted (full) |
| --- | --- | --- | --- |
| Training compute | $8/1M tokens (~$25-100 for 1K examples) | $50-500 (GPU hours) | $500-5000 (GPU hours) |
| Data preparation | Engineering time (hours-days) | Same | Same |
| Evaluation | Inference cost for test set | Same | Same |
| Inference cost | $0.30/$1.20 per 1M (mini) | Self-hosted rates | Self-hosted rates |
| Maintenance | Retrain on model updates | Full control but full responsibility | Full control but full responsibility |

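To make the break-even concrete, here is a quick estimate using the illustrative prices from the table and the earlier per-token rates. All dollar figures are assumptions carried over from this article, not live price quotes:

```python
# Break-even sketch: how much traffic recoups a one-time fine-tuning cost?
# Prices are the illustrative figures from the table above, not live quotes.

LARGE_MODEL_COST = 2.00   # $ per 1M tokens (large model)
SMALL_MODEL_COST = 0.30   # $ per 1M tokens (fine-tuned small model)
TRAINING_COST = 100.0     # $ one-time fine-tuning cost (~1K examples)

savings_per_million = LARGE_MODEL_COST - SMALL_MODEL_COST  # $1.70 per 1M tokens

# Tokens needed before the training cost is recouped:
break_even_tokens = TRAINING_COST / savings_per_million * 1_000_000
print(f"Break-even at ~{break_even_tokens / 1e6:.0f}M tokens")

# At 10M tokens/day, roughly how many days until break-even:
days = break_even_tokens / 10_000_000
print(f"~{days:.1f} days at 10M tokens/day")
```

At these assumed prices the training cost pays for itself after roughly 59M tokens, which a high-volume application can reach in under a week; a low-volume one may never get there.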
The evaluation-first principle

Never start fine-tuning without a clear evaluation framework. Define your metrics, build your test set, and establish your baseline (prompt-engineered model performance) before spending a dollar on training. If you can't measure improvement, you won't know if fine-tuning helped.
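A minimal harness for that comparison might look like the sketch below. The test set, labels, and predict functions are all stand-ins; in practice each predict function would wrap a real API call to the baseline and fine-tuned models:

```python
# Evaluation-first sketch: score the fine-tuned model against the
# prompt-engineered baseline on the SAME labeled test set.
# The predict functions are stubs -- swap in real model calls.

def accuracy(predict, test_set):
    """Fraction of examples where predict(input) matches the label."""
    correct = sum(1 for inp, label in test_set if predict(inp) == label)
    return correct / len(test_set)

# Hypothetical labeled test set (you want 50-100 of these):
test_set = [("refund request", "billing"), ("login fails", "technical")]

baseline_predict = lambda inp: "billing"  # stub: prompt-engineered model
finetuned_predict = lambda inp: "billing" if "refund" in inp else "technical"

baseline_acc = accuracy(baseline_predict, test_set)
finetuned_acc = accuracy(finetuned_predict, test_set)

# Only ship the fine-tuned model if it clears the baseline by your margin:
TARGET_MARGIN = 0.05
ship_it = finetuned_acc >= baseline_acc + TARGET_MARGIN
```

The point is the shape, not the stubs: fixed test set, fixed metric, and a pre-committed margin that decides whether the fine-tune ships.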

Prerequisites Before Fine-Tuning

  • A clear, measurable quality metric for your task (accuracy, F1, format compliance, human preference)
  • A test set of 50-100 labeled examples to evaluate before and after fine-tuning
  • A prompt-engineered baseline that shows the current quality ceiling without fine-tuning
  • 200-2000 high-quality training examples (quality matters far more than quantity)
  • A clear hypothesis about what fine-tuning will improve (e.g., 'format consistency from 85% to 95%')
  • Budget for iterating -- the first fine-tuning attempt rarely produces the best model
Readiness checklist for fine-tuning
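If you fine-tune through OpenAI, each of those 200-2000 training examples is one line of JSONL in the chat format: a "messages" list ending with the target assistant reply. The ticket-classification content below is made up for illustration:

```python
# One training example in the JSONL chat format used by OpenAI fine-tuning.
# Field names follow the documented chat format; the content is illustrative.
import json

example = {
    "messages": [
        {"role": "system", "content": "You classify support tickets."},
        {"role": "user", "content": "My invoice shows the wrong amount."},
        {"role": "assistant", "content": "billing"},  # the target output
    ]
}

# A training file is simply one such JSON object per line:
line = json.dumps(example)
```

Quality checks on this file (consistent labels, no contradictory examples, realistic inputs) are where most of the data-preparation time goes.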
The sunk cost trap

Teams that invest weeks in data preparation sometimes push ahead with fine-tuning even when the results are not clearly better than prompt engineering. Set a clear success criterion before starting. If fine-tuning does not beat the prompt-engineered baseline by your target margin, abandon it and try a different approach.

Best Practices

Do

  • Establish a prompt-engineering baseline and test set before starting fine-tuning
  • Fine-tune for behavior (style, format, consistency), not for knowledge (use RAG for that)
  • Start with the smallest model that meets your quality needs -- fine-tuned o4-mini is often sufficient
  • Budget for 2-3 iterations of fine-tuning -- the first attempt is a starting point, not the final model
  • Compare fine-tuned model against baseline on the same test set with the same metrics

Don’t

  • Don't fine-tune as a first resort -- exhaust prompt engineering and RAG first
  • Don't try to fine-tune knowledge into a model -- facts should come from retrieval
  • Don't fine-tune with fewer than 200 examples -- quality will be poor
  • Don't skip the evaluation framework -- without measurement, you are guessing
  • Don't assume fine-tuning is permanent -- base model updates may require retraining

Key Takeaways

  • Fine-tuning teaches HOW to behave (style, format, consistency), not WHAT to know (use RAG for knowledge).
  • Best use cases: consistent style, domain classification, format enforcement, and cost reduction at scale.
  • Always establish a prompt-engineering baseline and evaluation set before investing in fine-tuning.
  • Fine-tuned o4-mini often matches GPT-5.4 quality on specific tasks at 1/6th the inference cost.
  • The evaluation-first principle: if you can't measure improvement, don't fine-tune.

Video on this topic

Should you fine-tune an LLM? (decision framework)
