Advanced · 16 min

End-to-End Fine-Tuning Pipeline

Pick your path first (API, self-hosted, or cloud-managed), then follow it end-to-end. This article covers cost math for all three paths, the OpenAI pipeline with GPT-4.1, the Hugging Face pipeline with TRL v1.0 SFTConfig, training monitoring, post-training LoRA merge and deployment, and the five failure modes that most fine-tuning projects hit.

Quick Reference

→Fine-tunable OpenAI models (2026): GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, o4-mini
→GPT-4.1 fine-tuning cost: ~$3/M training tokens; inference $3/M in, $12/M out
→GPT-4.1-mini fine-tuning cost: ~$0.80/M training tokens; inference $0.80/M in, $3.20/M out
→QLoRA on Llama 3.1 8B: ~16 GB VRAM (4-bit model ~10 GB + LoRA params ~2 GB + optimizer ~4 GB)
→Hugging Face TRL v1.0: use SFTConfig (not TrainingArguments) + peft_config on SFTTrainer
→Overfitting signal: validation loss increases while training loss keeps decreasing — stop immediately
→Post-training: merge LoRA adapters before serving to eliminate adapter overhead at inference
→Catastrophic forgetting risk: always evaluate on general tasks after fine-tuning, not just your target task

In this article

1.Which Fine-Tuning Path Do You Need?
2.What Will It Cost?
3.OpenAI Fine-Tuning Pipeline
4.Hugging Face + QLoRA Pipeline
5.Cloud-Managed Fine-Tuning
6.Training Monitoring and Loss Curves
7.Post-Training: Merge, Convert, Deploy
8.How Fine-Tuning Fails
★Best Practices
✓Key Takeaways

Which Fine-Tuning Path Do You Need?

Before writing a line of training code, pick your path. The three options address different constraints — and choosing the wrong one costs weeks of work.

Pick your path before writing a line of training code

Factor	OpenAI API	Hugging Face + QLoRA	Cloud-Managed
Model ownership	No — OpenAI only	Yes — run anywhere	Depends on provider
Setup complexity	Low — API calls only	High — GPU, libs, monitoring	Medium — cloud console
Training cost	~$0.80–$3/M tokens	GPU cost (≈$0.50–2/hr spot)	Per hour or per token
Inference control	None — OpenAI endpoint	Full — deploy anywhere	Provider endpoint
Best for	Fastest path, closed infra	Open models, max control	Open model, no GPU ops
Data sensitivity	Data is sent to OpenAI	Stays on your hardware	Stays in your VPC

Read the when-to-fine-tune article first

This article is the 'how' — it assumes you've already decided fine-tuning is the right tool. If you haven't done that analysis, the 'When to Fine-Tune' article in this chapter covers the decision framework including cost comparison against prompt engineering and RAG.

LoRA merge is the only step unique to the self-hosted path — API and cloud handle it server-side

What Will It Cost?

Every fine-tuning conversation ends with this question. Here's the math for a typical run: 5,000 training examples, average 300 tokens each, 3 epochs.

Total training tokens = 5,000 examples × 300 tokens × 3 epochs = 4,500,000 tokens = 4.5M tokens.
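The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the rates are the approximate figures quoted in this article, so check current pricing before relying on the output.

```python
def training_tokens(examples: int, avg_tokens: int, epochs: int) -> int:
    """Total tokens the trainer processes across all epochs."""
    return examples * avg_tokens * epochs


def api_training_cost(tokens: int, usd_per_million: float) -> float:
    """Cost of a per-token-billed run, rounded to cents."""
    return round(tokens / 1_000_000 * usd_per_million, 2)


def gpu_training_cost(hours: float, usd_per_hour: float) -> float:
    """Cost of a per-hour-billed run, rounded to cents."""
    return round(hours * usd_per_hour, 2)


tokens = training_tokens(examples=5_000, avg_tokens=300, epochs=3)
print(tokens)                           # 4500000
print(api_training_cost(tokens, 3.00))  # 13.5  (GPT-4.1 at ~$3/M)
print(api_training_cost(tokens, 0.80))  # 3.6   (GPT-4.1-mini at ~$0.80/M)
print(gpu_training_cost(2, 1.50))       # 3.0   (spot GPU, 2 hrs at $1.50/hr)
```

Swapping in your own example count and average length is usually enough to see which path is cheapest for training; inference cost needs a separate estimate based on expected traffic.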

Path	Training cost (4.5M tokens)	Inference (vs. base)	Notes
GPT-4.1 API	$3.00/M × 4.5M = $13.50	$3/M in, $12/M out (50% more than base)	Simplest; no GPU management
GPT-4.1-mini API	$0.80/M × 4.5M = $3.60	$0.80/M in, $3.20/M out	Best cost for lighter tasks
GPT-4o API	$25.00/M × 4.5M = $112.50	$3.75/M in, $15/M out	Only if you need GPT-4o quality
HF + QLoRA (A100 spot)	~$1.50/hr × 2 hrs = $3.00	Your hardware cost only	Llama 3.1 8B; varies by GPU
Together AI LoRA (Llama 3.3 8B)	$4.50/M × 4.5M = $20.25	Per-request at provider rates	No GPU management needed
Vertex AI (Gemini 2.5 Flash)	Check current docs — per GPU-hr	Per-request at Vertex rates	Tight GCP integration

Spot instances cut self-hosted cost by 60-80%

An A100 40GB spot instance on GCP or AWS runs $1–2/hr vs. $3–4/hr on-demand. A 5,000-example QLoRA run on Llama 3.1 8B typically finishes in 1–2 hours. Budget $3–6 for the training itself; the larger cost is the engineering time spent setting up the GPU instance.

OpenAI data-sharing discount

If you enable data sharing when creating an OpenAI fine-tune job, inference costs drop 50% on both standard and batch modes. For high-volume inference, this discount often makes the API path cheaper than self-hosted at scale.

OpenAI Fine-Tuning Pipeline

The OpenAI API is the fastest path: upload JSONL, configure hyperparameters, wait. The constraints are real — you cannot access model weights, change architecture, or deploy outside OpenAI's infrastructure — but for many production use cases, those constraints don't matter.

GPT-5.4 does NOT support fine-tuning

GPT-5.4 and GPT-5.4-mini are inference-only. Fine-tunable models as of April 2026: GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, and o4-mini. Always verify at platform.openai.com/docs — this list changes.

Data validation — run this before uploading anything
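A validation pass could look roughly like the sketch below. It assumes plain-text chat examples in the `{"messages": [...]}` JSONL format; examples that use tool calls or multi-part content would need looser checks. The function names and the exact rule set are illustrative, not an official validator.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; an empty dict means the file passed."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                report[lineno] = ["blank line"]
                continue
            problems = validate_example(line)
            if problems:
                report[lineno] = problems
    return report
```

Calling `validate_jsonl("train.jsonl")` (a hypothetical file name) and refusing to upload unless the result is empty catches most formatting mistakes before they cost a training run.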

Complete OpenAI fine-tuning workflow (GPT-4.1)
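The upload, configure, and wait steps might be wired together as in the sketch below. It takes an `openai.OpenAI()` client as an argument and uses the SDK's `files.create`, `fine_tuning.jobs.create`, and `fine_tuning.jobs.retrieve` calls. The snapshot name, the top-level `hyperparameters` argument, and the polling interval are assumptions; confirm them against the current API reference.

```python
import time


def run_fine_tune(client, train_path, val_path=None,
                  model="gpt-4.1-mini-2025-04-14", n_epochs=3, poll_seconds=30):
    """Upload data, start a supervised fine-tuning job, and wait for it to finish.

    `client` is an `openai.OpenAI()` instance. Returns the fine-tuned model
    name on success and raises RuntimeError otherwise.
    """
    def upload(path):
        with open(path, "rb") as handle:
            return client.files.create(file=handle, purpose="fine-tune").id

    job_args = {
        "model": model,  # assumed snapshot name; fine-tuning needs a dated snapshot
        "training_file": upload(train_path),
        "hyperparameters": {"n_epochs": n_epochs},
    }
    if val_path is not None:
        job_args["validation_file"] = upload(val_path)

    job = client.fine_tuning.jobs.create(**job_args)

    # Poll until the job reaches a terminal state.
    while job.status not in {"succeeded", "failed", "cancelled"}:
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job.id)

    if job.status != "succeeded":
        raise RuntimeError(f"fine-tuning job {job.id} ended as {job.status}")
    return job.fine_tuned_model
```

With the `openai` package installed and `OPENAI_API_KEY` set, `run_fine_tune(OpenAI(), "train.jsonl", "val.jsonl")` returns a model name that can be passed as `model=` to `client.chat.completions.create`. Passing a validation file is what makes validation loss available for the job.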

Model	Training / 1M tokens	Inference in / 1M	Inference out / 1M
gpt-4.1-mini (fine-tuned)	~$0.80	~$0.80	~$3.20
gpt-4.1 (fine-tuned)	~$3.00	~$3.00	~$12.00
gpt-4o (fine-tuned)	$25.00	$3.75	$15.00
o4-mini (fine-tuned)	check docs	check docs	check docs

Start with gpt-4.1-mini

At ~$0.80/M training tokens and $0.80/$3.20 inference, gpt-4.1-mini is the right default for most fine-tuning tasks. Move to gpt-4.1 only if you've confirmed that the mini variant's output quality is insufficient for your use case on your evaluation set.

Hugging Face + QLoRA Pipeline

For open models with full deployment control, Hugging Face's transformers + TRL v1.0 + PEFT is the standard stack. TRL v1.0 (released April 2026) replaced the old TrainingArguments-based API with SFTConfig — the code below uses the current API.

TRL v1.0 API break: SFTConfig replaces TrainingArguments

Pre-v1.0 code passed TrainingArguments to SFTTrainer. In v1.0, use SFTConfig instead — it extends TrainingArguments with SFT-specific params (assistant_only_loss, packing, max_length). The old approach still runs but you lose access to the new SFT-specific options.

VRAM budget for this config on Llama 3.1 8B: the 4-bit quantized weights take ≈ 8B × 0.5 bytes/param = 4 GB, which grows to ≈ 10 GB once activations and overhead are included; LoRA parameters (r=16 on 7 modules) ≈ 2 GB; optimizer states for the LoRA-only params ≈ 4 GB. Total ≈ 16 GB, which fits an RTX 4090 (24 GB) comfortably. For Llama 3.1 70B, scale to ≈ 48 GB (2× RTX 4090 or A100 80GB).

QLoRA fine-tuning with TRL v1.0 SFTConfig (Llama 3.1 8B)
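One way to set this up is sketched below. The hyperparameters are kept in plain dictionaries and the heavy imports happen inside `build_trainer`, so the settings can be read without a GPU stack installed. The checkpoint name, output path, batch sizes, learning rate, and LoRA alpha and dropout are assumptions chosen for illustration, not values prescribed by TRL or by this article. `assistant_only_loss` is left out because it needs a chat template that marks assistant turns.

```python
# Plain-Python hyperparameters, kept separate so they can be logged and diffed.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint; gated on the Hub

LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "task_type": "CAUSAL_LM",
    # The seven attention and MLP projections in Llama-style decoder blocks.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

SFT_KWARGS = {
    "output_dir": "llama31-8b-qlora",   # hypothetical path
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,   # effective batch size of 16 per device
    "learning_rate": 2e-4,
    "max_length": 1024,
    "bf16": True,
    "logging_steps": 10,
    "eval_strategy": "steps",
    "eval_steps": 50,
    "report_to": "none",                # switch to "wandb" once logging is set up
}


def build_trainer(train_dataset, eval_dataset):
    """Load the 4-bit base model and wrap it in an SFTTrainer with a LoRA adapter."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=quant, device_map="auto"
    )
    return SFTTrainer(
        model=model,
        args=SFTConfig(**SFT_KWARGS),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=LoraConfig(**LORA_KWARGS),
    )
```

Given two `datasets.Dataset` objects with a `messages` column, `build_trainer(train_ds, val_ds).train()` starts the run, and `trainer.save_model(...)` afterwards writes only the adapter weights.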

Llama 4 models use MoE architecture — QLoRA config differs

Llama 4 Scout and Maverick are mixture-of-experts models. The LoRA target_modules list changes for MoE layers — check the model's config.json for the correct module names. Llama 3.1 8B is the safer starting point for first fine-tuning runs.

Cloud-Managed Fine-Tuning

Cloud-managed fine-tuning handles GPU provisioning for you while giving access to open models. It is the right choice when you're already on a cloud platform, need data to stay in your VPC, and don't want to manage GPU infrastructure.

Platform	Tunable models	Pricing model	Standout feature
Google Vertex AI	Gemini 2.5 Flash, Llama 4 Scout	Per training hour	Gemini fine-tuning + GCP MLOps integration
AWS Bedrock	Amazon Nova, Llama, Qwen 3 32B, GPT-OSS 20B	Per training token	Reinforcement fine-tuning; RFT improved accuracy 66% over base in AWS benchmarks
Microsoft Foundry	GPT-4.1-nano, Llama 4 Scout, Qwen 3 32B, Llama 3.3 70B	Per training hour	RFT with o4-mini; good for distillation from larger models
Together AI	Llama 4, Llama 3.3, Mistral, Qwen, any HF model	Per training token (LoRA from $4.50/M)	Widest model selection; serverless LoRA

Reinforcement fine-tuning is now available on Bedrock and Foundry

Reinforcement fine-tuning (RFT), which trains with reward signals rather than supervised examples, is available on Bedrock and Microsoft Foundry (Foundry with o4-mini). AWS reported a 66% accuracy improvement over base models on their benchmarks. If your task has a verifiable correct answer (math, code, structured extraction), RFT is worth testing alongside SFT.

Training Monitoring and Loss Curves

The loss curve tells you whether to continue training, stop early, or change hyperparameters. You need validation data — without it, overfitting is invisible until you test the model after the fact.

Figure: training loss (blue) and validation loss (dashed) over training steps. When the two curves diverge, stop training.

Pattern	Diagnosis	Action
Train ↓ · Val ↓ (parallel)	Healthy — model is learning	Continue training
Train ↓ · Val flattens	Near-optimal, diminishing returns	Reduce LR or stop soon
Train ↓ · Val ↑ (diverges)	Overfitting — memorizing training data	Stop now; reduce epochs; add more diverse data
Train barely moves from start	LR too low or data format issue	Raise LR 5-10×; validate data format
Train spiky throughout	LR too high or batch too small	Halve LR; double gradient_accumulation_steps

Early stopping callback — add to any SFTTrainer
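A callback along these lines could implement the stop rule. `OverfitDetector` here is a hypothetical custom class, not something shipped with transformers or TRL; it relies on the `TrainerCallback.on_evaluate` hook and the `control.should_training_stop` flag. The `patience` and `min_delta` defaults are assumptions to tune for your run.

```python
try:
    from transformers import TrainerCallback
except ImportError:  # keeps the patience logic importable without transformers
    TrainerCallback = object


class OverfitDetector(TrainerCallback):
    """Stop training after `patience` evaluations without a new best eval_loss."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        loss = (metrics or {}).get("eval_loss")
        if loss is None:
            return control
        if loss < self.best_loss - self.min_delta:
            # New best validation loss: reset the counter.
            self.best_loss = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                control.should_training_stop = True
        return control
```

It is attached with `SFTTrainer(..., callbacks=[OverfitDetector(patience=3)])` and only fires if an eval dataset and an eval schedule are configured. The built-in `EarlyStoppingCallback` from transformers is an alternative, though it also requires `load_best_model_at_end=True` and a `metric_for_best_model`.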

Log to Weights & Biases from the start

Set report_to='wandb' in SFTConfig. W&B logs loss curves, learning rate, GPU utilization, and lets you compare runs side by side. The time to set this up is under 2 minutes; the time saved diagnosing 'why did this run underperform?' is hours.

Post-Training: Merge, Convert, Deploy

After a self-hosted fine-tuning run, you have a LoRA adapter alongside the base model weights. Serving them separately adds a small overhead per forward pass. For production, merge them into a single checkpoint first.

Merge LoRA adapters into base model weights
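The merge itself is short with PEFT. The sketch below assumes the adapter was saved to a local directory and that the base model fits in memory in bf16; the function name and arguments are illustrative.

```python
def merge_adapter(base_model_id, adapter_dir, output_dir):
    """Fold LoRA weights into the base model and save a standalone checkpoint."""
    # Imported lazily: these need torch, transformers, and peft installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base in bf16 rather than 4-bit so the merged weights are
    # stored as ordinary dense tensors.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    # Save model and tokenizer together so the directory is self-contained.
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir
```

The output directory can then be loaded like any other checkpoint, with no `peft` dependency at serving time. Keeping the unmerged adapter around as well is cheap and makes it possible to re-merge onto a different base revision later.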

Post-training regression check — compare fine-tuned vs. base on your eval set
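Once both models have been scored on the same evaluation set, the comparison can be as simple as the sketch below. How the per-category accuracies are produced is left to your eval harness; the category names, the 2-point tolerance, and the numbers in the example are illustrative assumptions, not measurements.

```python
def regression_report(base_scores, tuned_scores, tolerance=0.02):
    """Compare per-category accuracy and flag categories that got worse.

    Both arguments map a category name to accuracy in [0, 1]. A category
    counts as regressed when the fine-tuned score drops by more than `tolerance`.
    """
    report = {}
    for category, base in base_scores.items():
        delta = round(tuned_scores[category] - base, 4)
        report[category] = {"delta": delta, "regressed": delta < -tolerance}
    return report


def safe_to_ship(report):
    """True only when no category regressed."""
    return not any(entry["regressed"] for entry in report.values())


# Illustrative numbers only, not measurements.
base = {"target_task": 0.62, "adjacent_tasks": 0.81, "safety": 0.97}
tuned = {"target_task": 0.89, "adjacent_tasks": 0.80, "safety": 0.90}
report = regression_report(base, tuned)
print(report["safety"])      # {'delta': -0.07, 'regressed': True}
print(safe_to_ship(report))  # False
```

In this made-up example the target task improves sharply, yet the run still fails the check because the safety category dropped by more than the tolerance.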

Run safety eval after fine-tuning

Fine-tuning on a narrow dataset can degrade the base model's safety alignment — especially if your training data doesn't include edge-case refusals. Run a safety eval suite (OpenAI Evals, Giskard, or your own) before deploying. A model that's better at customer support but more willing to reveal PII is not a net improvement.

Real project

A team fine-tuned Llama 3 8B on 3,000 customer support transcripts. Accuracy on their target eval jumped from 62% to 89%. They shipped it. Three days later, support agents reported the model was hallucinating product SKU numbers it had memorized from training data — SKUs that had since changed. The fix: they added product catalog lookup as a tool, reverted to prompt-engineering the base model, and rebuilt with RAG instead of fine-tuning. The root cause was fine-tuning on knowledge that should have stayed in a database.

Lesson → Fine-tune for behavior, not knowledge. If the model needs to 'know' facts that change, use RAG or tool calling, not fine-tuning.

How Fine-Tuning Fails

Most fine-tuning failures are predictable. These five show up repeatedly in production projects.

Failure mode	How to detect	Defense
Catastrophic forgetting	General-task accuracy drops after fine-tuning	Eval on diverse benchmarks before/after; use lower learning rate; include general examples in training mix (5-10%)
Overfitting	Val loss diverges from train loss	Validation split + OverfitDetector callback; stop at 1-3 epochs; add data diversity
Knowledge memorization	Model 'knows' facts from training data rather than current ground truth	Fine-tune for behavior (format, tone, reasoning style); use RAG or tools for facts
Mode collapse	Model outputs same format/phrase for all inputs	Check training data diversity; shuffle thoroughly; add entropy penalty; inspect outlier inputs
Safety degradation	Model complies with requests the base model refused	Run safety eval suite post-training; compare refusal rates on adversarial prompts

Evaluate on a diverse held-out set, not just your task

Build an evaluation set that covers: (1) your target task, (2) adjacent tasks the base model handles, (3) safety-relevant edge cases. If fine-tuning hurts category 2 or 3, you have a failure regardless of how good category 1 looks.

Best Practices

✓Validate JSONL format before uploading to OpenAI — malformed examples fail silently during training
✓Use a 10-20% validation split and monitor validation loss every 50 steps throughout training
✓Start with 3 epochs and auto hyperparameters; tune only after seeing validation loss behavior
✓Add an OverfitDetector callback with patience=3 to auto-stop on diverging validation loss
✓Merge LoRA adapters before deploying — eliminates adapter overhead and simplifies serving
✓Compare fine-tuned model against the prompt-engineered baseline on the same evaluation set before shipping
✓Evaluate on a general-task benchmark (not just your target task) to catch catastrophic forgetting
✓Document training runs: model ID, data version, hyperparameters, eval scores, known limitations
✓Enable data-sharing discount on OpenAI fine-tuning if your data doesn't contain confidential information
✓Use assistant_only_loss=True in TRL SFTConfig to train on responses only, not repeated user/system turns

Don’t

✗Don't use gpt-5.4 or gpt-5.4-mini as fine-tuning base models — they don't support it
✗Don't fine-tune on knowledge that changes (product catalog, prices, policies) — use RAG instead
✗Don't skip validation data — you can't detect overfitting without it
✗Don't run more than 3 epochs without confirming validation loss hasn't plateaued or diverged
✗Don't deploy without a regression check on general-task performance
✗Don't assume the API fine-tuned model is safely aligned — run a safety eval after every fine-tuning run
✗Don't fine-tune with a learning rate that worked for a different model size — larger models need lower LR
✗Don't use TrainingArguments with TRL v1.0 SFTTrainer — use SFTConfig to access SFT-specific options

Key Takeaways

✓GPT-5.4 and GPT-5.4-mini do not support fine-tuning — use GPT-4.1, GPT-4.1-mini, or GPT-4o.
✓TRL v1.0 uses SFTConfig (not TrainingArguments) — old code still runs but misses SFT-specific options like assistant_only_loss.
✓A 4.5M-token fine-tuning run costs $3.60 on GPT-4.1-mini, $13.50 on GPT-4.1, or ~$3 on a spot GPU. Do the math before choosing a path.
✓Validation loss diverging from training loss is the overfitting signal — stop training immediately and reduce epochs.
✓Merge LoRA adapters before serving: unmerged adapters add inference overhead and complicate deployment.
✓Fine-tune for behavior (format, tone, task structure), not knowledge — facts that change belong in RAG or tools, not model weights.
