End-to-End Fine-Tuning Pipeline

Pick your path first (API, self-hosted, or cloud-managed), then follow it end-to-end. Covers cost math for all three paths, updated OpenAI pipeline with GPT-4.1, updated Hugging Face pipeline with TRL v1.0 SFTConfig, training monitoring, post-training LoRA merge and deployment, and the five failure modes that most fine-tuning projects hit.

Quick Reference

  • Fine-tunable OpenAI models (2026): GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, o4-mini
  • GPT-4.1 fine-tuning cost: ~$3/M training tokens; inference $3/M in, $12/M out
  • GPT-4.1-mini fine-tuning cost: ~$0.80/M training tokens; inference $0.80/M in, $3.20/M out
  • QLoRA on Llama 3.1 8B: ~16 GB VRAM (4-bit model ~10 GB + LoRA params ~2 GB + optimizer ~4 GB)
  • Hugging Face TRL v1.0: use SFTConfig (not TrainingArguments) + peft_config on SFTTrainer
  • Overfitting signal: validation loss increases while training loss keeps decreasing — stop immediately
  • Post-training: merge LoRA adapters before serving to eliminate adapter overhead at inference
  • Catastrophic forgetting risk: always evaluate on general tasks after fine-tuning, not just your target task

Which Fine-Tuning Path Do You Need?

Before writing a line of training code, pick your path. The three options address different constraints — and choosing the wrong one costs weeks of work.

[Figure: decision flowchart. Starting from "Need fine-tuning?", two questions ("Open model needed? (data sovereignty / Llama 4)" and "On cloud already? (AWS / GCP / Azure)") lead to three outcomes: Self-Hosted (HF + QLoRA, full control), OpenAI API (GPT-4.1 / o4-mini, simplest path), and Cloud-Managed (Vertex / Bedrock / Foundry, middle ground).]

Pick your path before writing a line of training code

| Factor | OpenAI API | Hugging Face + QLoRA | Cloud-Managed |
|---|---|---|---|
| Model ownership | No (OpenAI only) | Yes (run anywhere) | Depends on provider |
| Setup complexity | Low (API calls only) | High (GPU, libs, monitoring) | Medium (cloud console) |
| Training cost | ~$0.80–$3/M tokens | GPU cost (≈$0.50–2/hr spot) | Per hour or per token |
| Inference control | None (OpenAI endpoint) | Full (deploy anywhere) | Provider endpoint |
| Best for | Fastest path, closed infra | Open models, max control | Open model, no GPU ops |
| Data sensitivity | Data leaves to OpenAI | Stays on your hardware | Stays in your VPC |

Read the when-to-fine-tune article first

This article is the 'how' — it assumes you've already decided fine-tuning is the right tool. If you haven't done that analysis, the 'When to Fine-Tune' article in this chapter covers the decision framework including cost comparison against prompt engineering and RAG.

[Figure: pipeline stages. Data Prep (JSONL format) → Validate (format + quality) → Train (epochs 1-3) → Monitor (loss curves) → Evaluate (vs. baseline) → Merge† (LoRA → weights, HF) → Deploy (serve + monitor, all paths). † HF only; the API and cloud paths skip this step.]

LoRA merge is the only step unique to the self-hosted path — API and cloud handle it server-side

What Will It Cost?

Every fine-tuning conversation ends with this question. Here's the math for a typical run: 5,000 training examples, average 300 tokens each, 3 epochs.

Total training tokens = 5,000 examples × 300 tokens × 3 epochs = 4,500,000 tokens = 4.5M tokens.
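
The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with your own dataset size. This is a minimal sketch: the per-million-token prices are the ones quoted in this article, and the function names are illustrative rather than part of any library. Verify current provider pricing before budgeting.

```python
# Training prices in dollars per million tokens, as quoted in this article.
# These are assumptions that will go stale; check the provider's pricing page.
TRAINING_PRICE_PER_M = {
    "gpt-4.1": 3.00,
    "gpt-4.1-mini": 0.80,
    "gpt-4o": 25.00,
}


def training_tokens(examples: int, avg_tokens: int, epochs: int) -> int:
    """Total billed training tokens: every example is seen once per epoch."""
    return examples * avg_tokens * epochs


def api_training_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of a fine-tuning job priced per training token."""
    return tokens / 1_000_000 * price_per_million


def gpu_training_cost(hours: float, hourly_rate: float) -> float:
    """Dollar cost of a self-hosted run billed per GPU-hour."""
    return hours * hourly_rate


tokens = training_tokens(examples=5_000, avg_tokens=300, epochs=3)
print(f"training tokens: {tokens:,}")
for model, price in TRAINING_PRICE_PER_M.items():
    print(f"{model}: ${api_training_cost(tokens, price):.2f}")
print(f"A100 spot, 2 h at $1.50/h: ${gpu_training_cost(2, 1.50):.2f}")
```

Changing `examples`, `avg_tokens`, or `epochs` shows how quickly the token-priced paths scale compared with a fixed GPU-hour budget.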

| Path | Training cost (4.5M tokens) | Inference (vs. base) | Notes |
|---|---|---|---|
| GPT-4.1 API | $3.00/M × 4.5M = $13.50 | $3/M in, $12/M out (50% more than base) | Simplest; no GPU management |
| GPT-4.1-mini API | $0.80/M × 4.5M = $3.60 | $0.80/M in, $3.20/M out | Best cost for lighter tasks |
| GPT-4o API | $25.00/M × 4.5M = $112.50 | $3.75/M in, $15/M out | Only if you need GPT-4o quality |
| HF + QLoRA (A100 spot) | ~$1.50/hr × 2 hrs = $3.00 | Your hardware cost only | Llama 3.1 8B; varies by GPU |
| Together AI LoRA (Llama 3.3 8B) | $4.50/M × 4.5M = $20.25 | Per-request at provider rates | No GPU management needed |
| Vertex AI (Gemini 2.5 Flash) | Check current docs (per GPU-hr) | Per-request at Vertex rates | Tight GCP integration |

Spot instances cut self-hosted cost by 60-80%

An A100 40GB spot instance on GCP or AWS runs $1–2/hr vs. $3–4/hr on-demand. A 5,000-example QLoRA run on Llama 3.1 8B typically finishes in 1–2 hours. Budget $3–6 for the training itself; the GPU instance setup time is the real cost.

OpenAI data-sharing discount

If you enable data sharing when creating an OpenAI fine-tune job, inference costs drop 50% on both standard and batch modes. For high-volume inference, this discount often makes the API path cheaper than self-hosted at scale.

OpenAI Fine-Tuning Pipeline

The OpenAI API is the fastest path: upload JSONL, configure hyperparameters, wait. The constraints are real — you cannot access model weights, change architecture, or deploy outside OpenAI's infrastructure — but for many production use cases, those constraints don't matter.

GPT-5.4 does NOT support fine-tuning

GPT-5.4 and GPT-5.4-mini are inference-only. Fine-tunable models as of April 2026: GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-4o, GPT-4o-mini, and o4-mini. Always verify at platform.openai.com/docs — this list changes.

Data validation — run this before uploading anything
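
A validator along these lines catches the most common problems before upload. It is a minimal sketch, not OpenAI's official checker: it assumes the chat-style format in which each JSONL line is an object with a `messages` list of `role`/`content` entries, and the specific checks and error strings are illustrative choices.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_example(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        # Tool-call messages may carry no text, so only require content otherwise.
        if "tool_calls" not in msg and not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has no string content")

    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to train on")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; an empty dict means the file passed."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            problems = validate_example(line) if line.strip() else ["blank line"]
            if problems:
                report[lineno] = problems
    return report
```

Calling `validate_file("train.jsonl")` (a hypothetical file name) and refusing to upload unless the result is empty keeps format errors out of the training job.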
Complete OpenAI fine-tuning workflow (GPT-4.1)
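
The upload, train, and wait loop might look like the sketch below, assuming the official `openai` Python SDK and a client created with `OpenAI()`. The snapshot name `gpt-4.1-mini-2025-04-14`, the polling interval, and the helper names are assumptions for illustration. Newer SDK releases also accept hyperparameters under a `method` argument, so check the current API reference.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def build_job_request(training_file_id, validation_file_id,
                      model="gpt-4.1-mini-2025-04-14", n_epochs=3):
    """Keyword arguments for client.fine_tuning.jobs.create()."""
    return {
        "model": model,  # assumed snapshot name; confirm it is still fine-tunable
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "hyperparameters": {"n_epochs": n_epochs},
    }


def run_fine_tune(client, train_path, val_path, poll_seconds=60, **job_options):
    """Upload both files, start the job, and block until it finishes.

    `client` is an openai.OpenAI instance supplied by the caller.
    """
    with open(train_path, "rb") as f:
        train_file = client.files.create(file=f, purpose="fine-tune")
    with open(val_path, "rb") as f:
        val_file = client.files.create(file=f, purpose="fine-tune")

    request = build_job_request(train_file.id, val_file.id, **job_options)
    job = client.fine_tuning.jobs.create(**request)

    # Poll until the job reaches a terminal state.
    while job.status not in TERMINAL_STATES:
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job.id)

    if job.status != "succeeded":
        raise RuntimeError(f"fine-tuning job {job.id} ended with status {job.status}")
    return job.fine_tuned_model  # use this name as `model` in later requests
```

Usage would be along the lines of `run_fine_tune(OpenAI(), "train.jsonl", "val.jsonl")`, with both file names hypothetical. Supplying a validation file is what makes validation loss available for the monitoring step later in this article.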
| Model | Training / 1M tokens | Inference in / 1M | Inference out / 1M |
|---|---|---|---|
| gpt-4.1-mini (fine-tuned) | ~$0.80 | ~$0.80 | ~$3.20 |
| gpt-4.1 (fine-tuned) | ~$3.00 | ~$3.00 | ~$12.00 |
| gpt-4o (fine-tuned) | $25.00 | $3.75 | $15.00 |
| o4-mini (fine-tuned) | check docs | check docs | check docs |

Start with gpt-4.1-mini

At ~$0.80/M training tokens and $0.80/$3.20 inference, gpt-4.1-mini is the right default for most fine-tuning tasks. Move to gpt-4.1 only if you've confirmed that the mini variant's output quality is insufficient for your use case on your evaluation set.

Hugging Face + QLoRA Pipeline

For open models with full deployment control, Hugging Face's transformers + TRL v1.0 + PEFT is the standard stack. TRL v1.0 (released April 2026) replaced the old TrainingArguments-based API with SFTConfig — the code below uses the current API.

TRL v1.0 API break: SFTConfig replaces TrainingArguments

Pre-v1.0 code passed TrainingArguments to SFTTrainer. In v1.0, use SFTConfig instead — it extends TrainingArguments with SFT-specific params (assistant_only_loss, packing, max_length). The old approach still runs but you lose access to the new SFT-specific options.

VRAM budget for this config on Llama 3.1 8B: the 4-bit quantized weights take ≈ 8B × 0.5 bytes/param = 4 GB, which grows to ≈ 10 GB once activations and overhead are included; LoRA parameters (r=16 on 7 modules) ≈ 2 GB; optimizer states for the LoRA-only params ≈ 4 GB. Total ≈ 16 GB, which fits an RTX 4090 (24 GB) comfortably. For Llama 3.1 70B, scale to ≈ 48 GB (2× RTX 4090 or A100 80GB).

QLoRA fine-tuning with TRL v1.0 SFTConfig (Llama 3.1 8B)
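
A configuration matching the description above (4-bit base model, LoRA with r=16 on seven modules, `SFTConfig` passed to `SFTTrainer` together with `peft_config`) could be sketched as follows. The model ID, output directory, batch size, learning rate, and `lora_alpha`/`lora_dropout` values are assumptions chosen for illustration, not settings from the original article. Heavy imports sit inside the function so the hyperparameters can be inspected without a GPU stack installed.

```python
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint; gated on the Hub

# The seven linear projections in each Llama decoder block.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,       # assumption: 2 x r is a common starting point
    lora_dropout=0.05,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

SFT_KWARGS = dict(
    output_dir="llama31-8b-qlora",   # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=4,   # assumption; lower it if you run out of VRAM
    gradient_accumulation_steps=4,
    learning_rate=2e-4,              # assumption
    max_length=2048,
    # Needs conversational data and a chat template that marks assistant turns;
    # set to False if your template lacks generation markers.
    assistant_only_loss=True,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    report_to="wandb",
)


def build_trainer(train_dataset, eval_dataset):
    """Assemble an SFTTrainer for QLoRA; call .train() on the result."""
    import torch
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=quant, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    return SFTTrainer(
        model=model,
        args=SFTConfig(**SFT_KWARGS),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
        peft_config=LoraConfig(**LORA_KWARGS),
    )
```

With these values the effective batch size is 4 × 4 = 16 sequences per optimizer step. Trading `per_device_train_batch_size` against `gradient_accumulation_steps` keeps that product constant while changing peak memory.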
Llama 4 models use MoE architecture — QLoRA config differs

Llama 4 Scout and Maverick are mixture-of-experts models. The LoRA target_modules list changes for MoE layers — check the model's config.json for the correct module names. Llama 3.1 8B is the safer starting point for first fine-tuning runs.

Cloud-Managed Fine-Tuning

Cloud-managed fine-tuning handles GPU provisioning for you while giving access to open models. It is the right choice when you're already on a cloud platform, need data to stay in your VPC, and don't want to manage GPU infrastructure.

| Platform | Tunable models | Pricing model | Standout feature |
|---|---|---|---|
| Google Vertex AI | Gemini 2.5 Flash, Llama 4 Scout | Per training hour | Gemini fine-tuning + GCP MLOps integration |
| AWS Bedrock | Amazon Nova, Llama, Qwen 3 32B, GPT-OSS 20B | Per training token | Reinforcement fine-tuning; RFT improved accuracy 66% over base in AWS benchmarks |
| Microsoft Foundry | GPT-4.1-nano, Llama 4 Scout, Qwen 3 32B, Llama 3.3 70B | Per training hour | RFT with o4-mini; good for distillation from larger models |
| Together AI | Llama 4, Llama 3.3, Mistral, Qwen, any HF model | Per training token (LoRA from $4.50/M) | Widest model selection; serverless LoRA |

Reinforcement fine-tuning is now available on Bedrock and Foundry

Reinforcement fine-tuning (RFT) — training with reward signals rather than supervised examples — is available on Bedrock and Microsoft Foundry (Foundry with o4-mini). AWS reported 66% accuracy improvement over base models on their benchmarks. If your task has a verifiable correct answer (math, code, structured extraction), RFT is worth testing alongside SFT.
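
To make "verifiable correct answer" concrete, here is what a reward function for a structured-extraction task could look like. This is a provider-agnostic sketch: Bedrock and Foundry each define their own grader interfaces, and the function name and scoring rule below are hypothetical, not either platform's API.

```python
import json


def extraction_reward(model_output: str, expected: dict) -> float:
    """Fraction of expected fields reproduced exactly.

    Returns 0.0 when the output is not a JSON object, so malformed
    responses earn no reward.
    """
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(predicted, dict) or not expected:
        return 0.0
    matches = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return matches / len(expected)
```

A graded score like this gives partial credit for partially correct extractions. A stricter all-or-nothing reward is the other common choice when any wrong field makes the output unusable.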

Training Monitoring and Loss Curves

The loss curve tells you whether to continue training, stop early, or change hyperparameters. You need validation data — without it, overfitting is invisible until you test the model after the fact.

[Figure: three loss-vs-steps panels, each plotting train loss and val loss. Healthy: continue training. Overfitting: stop and reduce epochs. Unstable: reduce learning rate.]

Blue = train loss · Dashed = validation loss · Diverging curves = stop training

| Pattern | Diagnosis | Action |
|---|---|---|
| Train ↓ · Val ↓ (parallel) | Healthy: model is learning | Continue training |
| Train ↓ · Val flattens | Near-optimal, diminishing returns | Reduce LR or stop soon |
| Train ↓ · Val ↑ (diverges) | Overfitting: memorizing training data | Stop now; reduce epochs; add more diverse data |
| Train barely moves from start | LR too low or data format issue | Raise LR 5-10×; validate data format |
| Train spiky throughout | LR too high or batch too small | Halve LR; double gradient_accumulation_steps |

Early stopping callback — add to any SFTTrainer
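
One way to build the OverfitDetector callback this article refers to is sketched below. The class is not part of transformers or TRL; it is a hypothetical implementation built on the real `TrainerCallback.on_evaluate` hook. The patience logic lives in a plain class so it can be unit-tested without a training stack. The library's built-in `EarlyStoppingCallback` is an alternative if you also set `load_best_model_at_end=True`.

```python
class PatienceTracker:
    """Counts consecutive evaluations without a new best validation loss."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, eval_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def make_overfit_detector(patience: int = 3):
    """Wrap PatienceTracker in a transformers TrainerCallback instance."""
    from transformers import TrainerCallback  # only needed at training time

    tracker = PatienceTracker(patience)

    class OverfitDetector(TrainerCallback):
        def on_evaluate(self, args, state, control, metrics=None, **kwargs):
            loss = (metrics or {}).get("eval_loss")
            if loss is not None and tracker.update(loss):
                control.should_training_stop = True  # Trainer exits after this step
            return control

    return OverfitDetector()
```

Attach it with `trainer.add_callback(make_overfit_detector(patience=3))`. It only fires when evaluation actually runs, so the trainer needs an `eval_dataset` and a step-based or epoch-based `eval_strategy`.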
Log to Weights & Biases from the start

Set report_to='wandb' in SFTConfig. W&B logs loss curves, learning rate, GPU utilization, and lets you compare runs side by side. The time to set this up is under 2 minutes; the time saved diagnosing 'why did this run underperform?' is hours.

Post-Training: Merge, Convert, Deploy

After a self-hosted fine-tuning run, you have a LoRA adapter alongside the base model weights. Serving them separately adds a small overhead per forward pass. For production, merge them into a single checkpoint first.

Merge LoRA adapters into base model weights
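
A merge step with PEFT could be sketched as follows. The function name and its arguments are illustrative. One assumption worth calling out: the base model is reloaded in bf16 rather than 4-bit, since folding adapters into quantized weights loses precision, which means the merge needs enough memory for the unquantized model.

```python
def merge_lora(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Fold LoRA adapter weights into the base model and save one checkpoint."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the unquantized base weights (newer transformers releases also
    # accept `dtype` in place of `torch_dtype`).
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

    # Attach the trained adapter, then merge it into the base weights.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # plain transformers model, no PEFT wrappers

    # Save weights and tokenizer together so the directory is self-contained.
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
```

The resulting directory loads with a plain `AutoModelForCausalLM.from_pretrained(output_dir)`, so the serving stack does not need PEFT installed.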
Post-training regression check — compare fine-tuned vs. base on your eval set
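
A regression check can be as simple as scoring both models on the same labeled examples, grouped by category. The sketch below assumes exact-match scoring and an eval set stored as a list of dicts with `prompt`, `expected`, and `category` keys. Those field names, the 2-point tolerance, and the `predict` callables are all illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_category(predict, eval_set):
    """Exact-match accuracy per category.

    predict: callable mapping a prompt string to the model's output string.
    eval_set: list of dicts with 'prompt', 'expected', and 'category'.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in eval_set:
        totals[ex["category"]] += 1
        if predict(ex["prompt"]).strip() == ex["expected"].strip():
            hits[ex["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


def regression_report(base_predict, tuned_predict, eval_set, tolerance=0.02):
    """Compare both models per category; flag drops larger than `tolerance`."""
    base = accuracy_by_category(base_predict, eval_set)
    tuned = accuracy_by_category(tuned_predict, eval_set)
    report = {}
    for cat in base:
        delta = tuned[cat] - base[cat]
        report[cat] = {
            "base": base[cat],
            "tuned": tuned[cat],
            "delta": delta,
            "regressed": delta < -tolerance,
        }
    return report
```

Blocking deployment when any category reports `regressed` is a reasonable default. For free-form outputs, exact match would need to be swapped for a task-appropriate scorer.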
Run safety eval after fine-tuning

Fine-tuning on a narrow dataset can degrade the base model's safety alignment — especially if your training data doesn't include edge-case refusals. Run a safety eval suite (OpenAI Evals, Giskard, or your own) before deploying. A model that's better at customer support but more willing to reveal PII is not a net improvement.

Real project

A team fine-tuned Llama 3 8B on 3,000 customer support transcripts. Accuracy on their target eval jumped from 62% to 89%. They shipped it. Three days later, support agents reported the model was hallucinating product SKU numbers it had memorized from training data — SKUs that had since changed. The fix: they added product catalog lookup as a tool, reverted to prompt-engineering the base model, and rebuilt with RAG instead of fine-tuning. The root cause was fine-tuning on knowledge that should have stayed in a database.

Lesson: fine-tune for behavior, not knowledge. If the model needs to 'know' facts that change, use RAG or tool calling, not fine-tuning.

How Fine-Tuning Fails

Most fine-tuning failures are predictable. These five show up repeatedly in production projects.

| Failure mode | How to detect | Defense |
|---|---|---|
| Catastrophic forgetting | General-task accuracy drops after fine-tuning | Eval on diverse benchmarks before/after; use lower learning rate; include general examples in training mix (5-10%) |
| Overfitting | Val loss diverges from train loss | Validation split + OverfitDetector callback; stop at 1-3 epochs; add data diversity |
| Knowledge memorization | Model 'knows' facts from training data rather than current ground truth | Fine-tune for behavior (format, tone, reasoning style); use RAG or tools for facts |
| Mode collapse | Model outputs same format/phrase for all inputs | Check training data diversity; shuffle thoroughly; add entropy penalty; inspect outlier inputs |
| Safety degradation | Model complies with requests the base model refused | Run safety eval suite post-training; compare refusal rates on adversarial prompts |

Evaluate on a diverse held-out set, not just your task

Build an evaluation set that covers: (1) your target task, (2) adjacent tasks the base model handles, (3) safety-relevant edge cases. If fine-tuning hurts category 2 or 3, you have a failure regardless of how good category 1 looks.
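
Of the five failure modes, mode collapse is the easiest to spot with simple statistics over a batch of outputs. The two metrics below are a rough heuristic sketch, not a standard benchmark. The function names and the five-word window are assumptions, and what counts as "too repetitive" depends on how templated your task legitimately is.

```python
from collections import Counter


def opening_phrase_share(outputs, n_words=5):
    """Share of outputs that start with the single most common n-word opening."""
    if not outputs:
        return 0.0
    openings = Counter(" ".join(text.lower().split()[:n_words]) for text in outputs)
    return openings.most_common(1)[0][1] / len(outputs)


def distinct_output_ratio(outputs):
    """Unique outputs divided by total outputs; values near 0 suggest collapse."""
    return len(set(outputs)) / len(outputs) if outputs else 0.0
```

Computing both metrics for the base and fine-tuned models on the same varied prompts gives a before/after comparison. A sharp rise in `opening_phrase_share` after fine-tuning is the signal to go back and inspect training data diversity.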

Best Practices


Do

  • Validate JSONL format before uploading to OpenAI — malformed examples fail silently during training
  • Use a 10-20% validation split and monitor validation loss every 50 steps throughout training
  • Start with 3 epochs and auto hyperparameters; tune only after seeing validation loss behavior
  • Add an OverfitDetector callback with patience=3 to auto-stop on diverging validation loss
  • Merge LoRA adapters before deploying — eliminates adapter overhead and simplifies serving
  • Compare fine-tuned model against the prompt-engineered baseline on the same evaluation set before shipping
  • Evaluate on a general-task benchmark (not just your target task) to catch catastrophic forgetting
  • Document training runs: model ID, data version, hyperparameters, eval scores, known limitations
  • Enable data-sharing discount on OpenAI fine-tuning if your data doesn't contain confidential information
  • Use assistant_only_loss=True in TRL SFTConfig to train on responses only, not repeated user/system turns

Don’t

  • Don't use gpt-5.4 or gpt-5.4-mini as fine-tuning base models — they don't support it
  • Don't fine-tune on knowledge that changes (product catalog, prices, policies) — use RAG instead
  • Don't skip validation data — you can't detect overfitting without it
  • Don't run more than 3 epochs without confirming validation loss hasn't plateaued or diverged
  • Don't deploy without a regression check on general-task performance
  • Don't assume the API fine-tuned model is safely aligned — run a safety eval after every fine-tuning run
  • Don't fine-tune with a learning rate that worked for a different model size — larger models need lower LR
  • Don't use TrainingArguments with TRL v1.0 SFTTrainer — use SFTConfig to access SFT-specific options

Key Takeaways

  • GPT-5.4 and GPT-5.4-mini do not support fine-tuning — use GPT-4.1, GPT-4.1-mini, or GPT-4o.
  • TRL v1.0 uses SFTConfig (not TrainingArguments) — old code still runs but misses SFT-specific options like assistant_only_loss.
  • A 4.5M-token fine-tuning run costs $3.60 on GPT-4.1-mini, $13.50 on GPT-4.1, or ~$3 on a spot GPU. Do the math before choosing a path.
  • Validation loss diverging from training loss is the overfitting signal — stop training immediately and reduce epochs.
  • Merge LoRA adapters before serving: unmerged adapters add inference overhead and complicate deployment.
  • Fine-tune for behavior (format, tone, task structure), not knowledge — facts that change belong in RAG or tools, not model weights.
