End-to-End Fine-Tuning Pipeline
Pick your path first (API, self-hosted, or cloud-managed), then follow it end-to-end. Covers cost math for all three paths, updated OpenAI pipeline with GPT-4.1, updated Hugging Face pipeline with TRL v1.0 SFTConfig, training monitoring, post-training LoRA merge and deployment, and the five failure modes that most fine-tuning projects hit.
Quick Reference
- →Fine-tunable OpenAI models (2026): GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, o4-mini
- →GPT-4.1 fine-tuning cost: ~$3/M training tokens; inference $3/M in, $12/M out
- →GPT-4.1-mini fine-tuning cost: ~$0.80/M training tokens; inference $0.80/M in, $3.20/M out
- →QLoRA on Llama 3.1 8B: ~16 GB VRAM (4-bit model ~10 GB + LoRA params ~2 GB + optimizer ~4 GB)
- →Hugging Face TRL v1.0: use SFTConfig (not TrainingArguments) + peft_config on SFTTrainer
- →Overfitting signal: validation loss increases while training loss keeps decreasing — stop immediately
- →Post-training: merge LoRA adapters before serving to eliminate adapter overhead at inference
- →Catastrophic forgetting risk: always evaluate on general tasks after fine-tuning, not just your target task
In this article
Which Fine-Tuning Path Do You Need?
Before writing a line of training code, pick your path. The three options address different constraints — and choosing the wrong one costs weeks of work.
Pick your path before writing a line of training code
| Factor | OpenAI API | Hugging Face + QLoRA | Cloud-Managed |
|---|---|---|---|
| Model ownership | No — OpenAI only | Yes — run anywhere | Depends on provider |
| Setup complexity | Low — API calls only | High — GPU, libs, monitoring | Medium — cloud console |
| Training cost | ~$0.80–$3/M tokens | GPU cost (≈$0.50–2/hr spot) | Per hour or per token |
| Inference control | None — OpenAI endpoint | Full — deploy anywhere | Provider endpoint |
| Best for | Fastest path, closed infra | Open models, max control | Open model, no GPU ops |
| Data sensitivity | Data leaves to OpenAI | Stays on your hardware | Stays in your VPC |
This article is the 'how' — it assumes you've already decided fine-tuning is the right tool. If you haven't done that analysis, the 'When to Fine-Tune' article in this chapter covers the decision framework including cost comparison against prompt engineering and RAG.
LoRA merge is the only step unique to the self-hosted path — API and cloud handle it server-side
What Will It Cost?
Every fine-tuning conversation ends with this question. Here's the math for a typical run: 5,000 training examples, average 300 tokens each, 3 epochs.
Total training tokens = 5,000 examples × 300 tokens × 3 epochs = 4,500,000 tokens = 4.5M tokens.
| Path | Training cost (4.5M tokens) | Inference (vs. base) | Notes |
|---|---|---|---|
| GPT-4.1 API | $3.00/M × 4.5M = $13.50 | $3/M in, $12/M out (50% more than base) | Simplest; no GPU management |
| GPT-4.1-mini API | $0.80/M × 4.5M = $3.60 | $0.80/M in, $3.20/M out | Best cost for lighter tasks |
| GPT-4o API | $25.00/M × 4.5M = $112.50 | $3.75/M in, $15/M out | Only if you need GPT-4o quality |
| HF + QLoRA (A100 spot) | ~$1.50/hr × 2 hrs = $3.00 | Your hardware cost only | Llama 3.1 8B; varies by GPU |
| Together AI LoRA (Llama 3.3 8B) | $4.50/M × 4.5M = $20.25 | Per-request at provider rates | No GPU management needed |
| Vertex AI (Gemini 2.5 Flash) | Check current docs — per GPU-hr | Per-request at Vertex rates | Tight GCP integration |
An A100 40GB spot instance on GCP or AWS runs $1–2/hr vs. $3–4/hr on-demand. A 5,000-example QLoRA run on Llama 3.1 8B typically finishes in 1–2 hours. Budget $3–6 for the training itself; the GPU instance setup time is the real cost.
If you enable data sharing when creating an OpenAI fine-tune job, inference costs drop 50% on both standard and batch modes. For high-volume inference, this discount often makes the API path cheaper than self-hosted at scale.
OpenAI Fine-Tuning Pipeline
The OpenAI API is the fastest path: upload JSONL, configure hyperparameters, wait. The constraints are real — you cannot access model weights, change architecture, or deploy outside OpenAI's infrastructure — but for many production use cases, those constraints don't matter.
GPT-5.4 and GPT-5.4-mini are inference-only. Fine-tunable models as of April 2026: GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, and o4-mini. Always verify at platform.openai.com/docs — this list changes.
| Model | Training / 1M tokens | Inference in / 1M | Inference out / 1M |
|---|---|---|---|
| gpt-4.1-mini (fine-tuned) | ~$0.80 | ~$0.80 | ~$3.20 |
| gpt-4.1 (fine-tuned) | ~$3.00 | ~$3.00 | ~$12.00 |
| gpt-4o (fine-tuned) | $25.00 | $3.75 | $15.00 |
| o4-mini (fine-tuned) | check docs | check docs | check docs |
At ~$0.80/M training tokens and $0.80/$3.20 inference, gpt-4.1-mini is the right default for most fine-tuning tasks. Move to gpt-4.1 only if you've confirmed that the mini variant's output quality is insufficient for your use case on your evaluation set.
Hugging Face + QLoRA Pipeline
For open models with full deployment control, Hugging Face's transformers + TRL v1.0 + PEFT is the standard stack. TRL v1.0 (released April 2026) replaced the old TrainingArguments-based API with SFTConfig — the code below uses the current API.
Pre-v1.0 code passed TrainingArguments to SFTTrainer. In v1.0, use SFTConfig instead — it extends TrainingArguments with SFT-specific params (assistant_only_loss, packing, max_length). The old approach still runs but you lose access to the new SFT-specific options.
VRAM budget for this config on Llama 3.1 8B: 4-bit quantized model ≈ 8B × 0.5 bytes/param = 4 GB, plus activations and overhead ≈ 10 GB total; LoRA parameters (r=16 on 7 modules) ≈ 2 GB; optimizer states for LoRA-only params ≈ 4 GB. Total ≈ 16 GB — fits an RTX 4090 (24 GB) comfortably. For Llama 3.1 70B, scale to ≈ 48 GB (2× RTX 4090 or A100 80GB).
Llama 4 Scout and Maverick are mixture-of-experts models. The LoRA target_modules list changes for MoE layers — check the model's config.json for the correct module names. Llama 3.1 8B is the safer starting point for first fine-tuning runs.
Cloud-Managed Fine-Tuning
Cloud-managed fine-tuning handles GPU provisioning for you while giving access to open models. The right choice when you're already on a cloud platform, need data to stay in your VPC, and don't want to manage GPU infrastructure.
| Platform | Tunable models | Pricing model | Standout feature |
|---|---|---|---|
| Google Vertex AI | Gemini 2.5 Flash, Llama 4 Scout | Per training hour | Gemini fine-tuning + GCP MLOps integration |
| AWS Bedrock | Amazon Nova, Llama, Qwen 3 32B, GPT-OSS 20B | Per training token | Reinforcement fine-tuning; RFT improved accuracy 66% over base in AWS benchmarks |
| Microsoft Foundry | GPT-4.1-nano, Llama 4 Scout, Qwen 3 32B, Llama 3.3 70B | Per training hour | RFT with o4-mini; good for distillation from larger models |
| Together AI | Llama 4, Llama 3.3, Mistral, Qwen, any HF model | Per training token (LoRA from $4.50/M) | Widest model selection; serverless LoRA |
Reinforcement fine-tuning (RFT) — training with reward signals rather than supervised examples — is available on Bedrock and Microsoft Foundry (Foundry with o4-mini). AWS reported 66% accuracy improvement over base models on their benchmarks. If your task has a verifiable correct answer (math, code, structured extraction), RFT is worth testing alongside SFT.
Training Monitoring and Loss Curves
The loss curve tells you whether to continue training, stop early, or change hyperparameters. You need validation data — without it, overfitting is invisible until you test the model after the fact.
Blue = train loss · Dashed = validation loss · Diverging curves = stop training
| Pattern | Diagnosis | Action |
|---|---|---|
| Train ↓ · Val ↓ (parallel) | Healthy — model is learning | Continue training |
| Train ↓ · Val flattens | Near-optimal, diminishing returns | Reduce LR or stop soon |
| Train ↓ · Val ↑ (diverges) | Overfitting — memorizing training data | Stop now; reduce epochs; add more diverse data |
| Train barely moves from start | LR too low or data format issue | Raise LR 5-10×; validate data format |
| Train spiky throughout | LR too high or batch too small | Halve LR; double gradient_accumulation_steps |
Set report_to='wandb' in SFTConfig. W&B logs loss curves, learning rate, GPU utilization, and lets you compare runs side by side. The time to set this up is under 2 minutes; the time saved diagnosing 'why did this run underperform?' is hours.
Post-Training: Merge, Convert, Deploy
After a self-hosted fine-tuning run, you have a LoRA adapter alongside the base model weights. Serving them separately adds a small overhead per forward pass. For production, merge them into a single checkpoint first.
Fine-tuning on a narrow dataset can degrade the base model's safety alignment — especially if your training data doesn't include edge-case refusals. Run a safety eval suite (OpenAI Evals, Giskard, or your own) before deploying. A model that's better at customer support but more willing to reveal PII is not a net improvement.
A team fine-tuned Llama 3 8B on 3,000 customer support transcripts. Accuracy on their target eval jumped from 62% to 89%. They shipped it. Three days later, support agents reported the model was hallucinating product SKU numbers it had memorized from training data — SKUs that had since changed. The fix: they added product catalog lookup as a tool, reverted to prompt-engineering the base model, and rebuilt with RAG instead of fine-tuning. The root cause was fine-tuning on knowledge that should have stayed in a database.
Learn this in → Fine-tune for behavior, not knowledge. If the model needs to 'know' facts that change, use RAG or tool calling — not fine-tuning.
How Fine-Tuning Fails
Most fine-tuning failures are predictable. These five show up repeatedly in production projects.
| Failure mode | How to detect | Defense |
|---|---|---|
| Catastrophic forgetting | General-task accuracy drops after fine-tuning | Eval on diverse benchmarks before/after; use lower learning rate; include general examples in training mix (5-10%) |
| Overfitting | Val loss diverges from train loss | Validation split + OverfitDetector callback; stop at 1-3 epochs; add data diversity |
| Knowledge memorization | Model 'knows' facts from training data rather than current ground truth | Fine-tune for behavior (format, tone, reasoning style); use RAG or tools for facts |
| Mode collapse | Model outputs same format/phrase for all inputs | Check training data diversity; shuffle thoroughly; add entropy penalty; inspect outlier inputs |
| Safety degradation | Model complies with requests the base model refused | Run safety eval suite post-training; compare refusal rates on adversarial prompts |
Build an evaluation set that covers: (1) your target task, (2) adjacent tasks the base model handles, (3) safety-relevant edge cases. If fine-tuning hurts category 2 or 3, you have a failure regardless of how good category 1 looks.
Best Practices
Do
- ✓Validate JSONL format before uploading to OpenAI — malformed examples fail silently during training
- ✓Use a 10-20% validation split and monitor validation loss every 50 steps throughout training
- ✓Start with 3 epochs and auto hyperparameters; tune only after seeing validation loss behavior
- ✓Add an OverfitDetector callback with patience=3 to auto-stop on diverging validation loss
- ✓Merge LoRA adapters before deploying — eliminates adapter overhead and simplifies serving
- ✓Compare fine-tuned model against the prompt-engineered baseline on the same evaluation set before shipping
- ✓Evaluate on a general-task benchmark (not just your target task) to catch catastrophic forgetting
- ✓Document training runs: model ID, data version, hyperparameters, eval scores, known limitations
- ✓Enable data-sharing discount on OpenAI fine-tuning if your data doesn't contain confidential information
- ✓Use assistant_only_loss=True in TRL SFTConfig to train on responses only, not repeated user/system turns
Don’t
- ✗Don't use gpt-5.4 or gpt-5.4-mini as fine-tuning base models — they don't support it
- ✗Don't fine-tune on knowledge that changes (product catalog, prices, policies) — use RAG instead
- ✗Don't skip validation data — you can't detect overfitting without it
- ✗Don't run more than 3 epochs without confirming validation loss hasn't plateaued or diverged
- ✗Don't deploy without a regression check on general-task performance
- ✗Don't assume the API fine-tuned model is safely aligned — run a safety eval after every fine-tuning run
- ✗Don't fine-tune with a learning rate that worked for a different model size — larger models need lower LR
- ✗Don't use TrainingArguments with TRL v1.0 SFTTrainer — use SFTConfig to access SFT-specific options
Key Takeaways
- ✓GPT-5.4 and GPT-5.4-mini do not support fine-tuning — use GPT-4.1, GPT-4.1-mini, or GPT-4o.
- ✓TRL v1.0 uses SFTConfig (not TrainingArguments) — old code still runs but misses SFT-specific options like assistant_only_loss.
- ✓A 4.5M-token fine-tuning run costs $3.60 on GPT-4.1-mini, $13.50 on GPT-4.1, or ~$3 on a spot GPU — compute the math before choosing a path.
- ✓Validation loss diverging from training loss is the overfitting signal — stop training immediately and reduce epochs.
- ✓Merge LoRA adapters before serving: unmerged adapters add inference overhead and complicate deployment.
- ✓Fine-tune for behavior (format, tone, task structure), not knowledge — facts that change belong in RAG or tools, not model weights.
Video on this topic
Fine-tuning an LLM from start to finish