Advanced · 12 min

End-to-End Fine-Tuning Pipeline

Complete fine-tuning pipelines for three approaches: OpenAI (simplest), Hugging Face + PEFT (most control), and cloud-managed (Vertex AI, Bedrock). Includes training monitoring, loss curve interpretation, overfitting detection, and a full working Hugging Face training script.

Quick Reference

  • OpenAI fine-tuning: upload JSONL, configure hyperparams, wait -- simplest path with limited control
  • Hugging Face + PEFT: maximum flexibility, open models, full control over training
  • Cloud-managed (Vertex AI, Bedrock): middle ground -- managed infrastructure with open model support
  • Monitor loss curves: training loss should decrease; validation loss should decrease then flatten (not increase)
  • Overfitting signs: validation loss increases while training loss continues to decrease
  • Typical training: 1-3 epochs for most tasks, lower learning rate for larger models

OpenAI Fine-Tuning (Simplest Path)

OpenAI's fine-tuning API is the simplest way to fine-tune an LLM. You upload training data, set a few hyperparameters, and wait. The trade-off is limited control: you cannot change the architecture, access intermediate checkpoints, or deploy the model outside of OpenAI's infrastructure.

Complete OpenAI fine-tuning workflow
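A hedged sketch of this workflow with the `openai` Python SDK (v1+). The JSONL validator is generic; the file paths, base model name, and polling interval are placeholder assumptions — substitute your own.

```python
# Sketch of the OpenAI fine-tuning workflow: validate data, upload, launch, poll.
# Assumes the openai SDK (>= 1.0) and chat-format JSONL files.
import json
import time


def validate_chat_example(line: str) -> bool:
    """Check one JSONL line has the {"messages": [...]} chat format."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    msgs = obj.get("messages")
    return (
        isinstance(msgs, list)
        and len(msgs) >= 2
        and all(isinstance(m, dict) and "role" in m and "content" in m for m in msgs)
    )


def run_finetune(train_path: str, val_path: str, base_model: str) -> str:
    """Upload data, launch the job, and block until it reaches a terminal state."""
    from openai import OpenAI  # deferred so the validator above works offline

    client = OpenAI()
    train = client.files.create(file=open(train_path, "rb"), purpose="fine-tune")
    val = client.files.create(file=open(val_path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=train.id,
        validation_file=val.id,
        model=base_model,  # placeholder: use the base model you are tuning
        hyperparameters={"n_epochs": 3},  # start at 3; reduce if val loss rises
    )
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job.fine_tuned_model or job.status
        time.sleep(60)
```

Validate every line of your JSONL before uploading — a single malformed example fails the whole job.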
Model                         | Training cost / 1M tokens | Inference input / 1M tokens | Inference output / 1M tokens
gpt-5.4-mini (fine-tuned)     | $3.00                     | $0.30                       | $1.20
gpt-5.4 (fine-tuned)          | $25.00                    | $3.75                       | $15.00
gpt-5.4-mini (fine-tuned, RL) | $6.00                     | $0.30                       | $1.20
OpenAI fine-tuning tips

Start with 3 epochs and auto learning rate. If the model overfits (validation loss increases), reduce to 1-2 epochs. If quality is insufficient, increase data quality and quantity before increasing epochs. The 'auto' settings are surprisingly good defaults.

Hugging Face + PEFT (Maximum Control)

For open models with full control over the training process, Hugging Face's transformers library combined with PEFT (Parameter-Efficient Fine-Tuning) is the standard approach. This gives you control over every hyperparameter, access to checkpoints, and the ability to deploy anywhere.

Complete QLoRA fine-tuning script
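A condensed sketch of such a script using transformers, peft, and trl. Exact argument names vary across library versions (e.g. `eval_strategy` vs. `evaluation_strategy`; newer trl prefers `SFTConfig` over `TrainingArguments`), and the model name, hyperparameters, and file paths are illustrative assumptions, not a prescription.

```python
# Hedged QLoRA sketch: 4-bit base model + LoRA adapters + SFT training loop.
# Requires a CUDA GPU with bitsandbytes installed; paths/names are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumption: any causal LM works

# 4-bit NF4 quantization -- the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention and MLP projection matrices
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="qlora-out",
    num_train_epochs=2,                  # 1-3 epochs, per the guidance above
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    eval_strategy="steps", eval_steps=100,
    save_strategy="steps", save_steps=100,
    load_best_model_at_end=True,         # keep the checkpoint with lowest val loss
    bf16=True,
    report_to="wandb",
)

dataset = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "val.jsonl"}
)

trainer = SFTTrainer(
    model=model, args=args, peft_config=lora_config,
    train_dataset=dataset["train"], eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("qlora-out/final")  # saves the LoRA adapter, not merged weights
```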
Memory estimation

For QLoRA with the above config on Llama 3 8B: ~10 GB VRAM for the 4-bit model, ~2 GB for LoRA parameters, ~4 GB for optimizer states and gradients. Total: ~16 GB. An RTX 4090 (24 GB) handles this comfortably. For 70B models, you need ~48 GB (A100 or 2x RTX 4090).
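The arithmetic above, written out (the per-component figures are the same ballpark estimates as in the text, not measured values):

```python
# Back-of-envelope VRAM budget for QLoRA on Llama 3 8B (estimates, not measurements)
budget_gb = {
    "4-bit base model": 10,
    "LoRA parameters": 2,
    "optimizer states + gradients": 4,
}
total_gb = sum(budget_gb.values())
print(total_gb)  # 16 -- fits comfortably on a 24 GB RTX 4090
```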

Cloud-Managed Fine-Tuning

Cloud providers offer managed fine-tuning that handles infrastructure for you while giving you access to open models. This is the middle ground between OpenAI's simplicity and Hugging Face's flexibility.

Platform         | Supported models                     | Pricing model      | Strengths
Google Vertex AI | Gemini, Llama, Mistral               | Per training hour  | Integrated with Google Cloud, strong MLOps
AWS Bedrock      | Llama, Mistral, Cohere, Amazon Titan | Per training token | Integrated with AWS ecosystem, VPC support
Azure AI Studio  | Llama, Mistral, Phi                  | Per training hour  | Integrated with Azure, enterprise features
Together AI      | Llama, Mistral, Qwen, any HF model   | Per GPU hour       | Widest model selection, developer-friendly
When to use managed fine-tuning

Use managed fine-tuning when: (1) you want open model flexibility but don't want to manage GPUs, (2) you're already on a cloud platform and want tight integration, (3) your security team requires data to stay within your cloud VPC. The cost is typically 2-3x higher than self-managed GPU training, but the time savings are significant.
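As one example of the managed path, launching an AWS Bedrock customization job via boto3's `create_model_customization_job` can be sketched as follows. The job name, role ARN, S3 URIs, and base-model identifier are placeholders; the payload-building helper is pure Python so it can be checked without AWS credentials.

```python
# Hedged sketch of a Bedrock fine-tuning job. The helper builds the request
# payload; the actual API call (in launch) needs AWS credentials, an IAM role,
# and S3 buckets your account owns.
def build_customization_job(job_name, model_name, role_arn, base_model,
                            train_s3, output_s3, epochs=2, lr=1e-5):
    """Assemble kwargs for bedrock.create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": model_name,
        "roleArn": role_arn,                      # IAM role Bedrock assumes
        "baseModelIdentifier": base_model,
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": train_s3},
        "outputDataConfig": {"s3Uri": output_s3},
        # Bedrock expects hyperparameter values as strings
        "hyperParameters": {"epochCount": str(epochs), "learningRate": str(lr)},
    }


def launch(kwargs):
    import boto3  # deferred so the helper above stays usable without the AWS SDK

    bedrock = boto3.client("bedrock")
    return bedrock.create_model_customization_job(**kwargs)
```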

Training Monitoring and Loss Curves

Monitoring training progress is essential for catching issues early and knowing when to stop. The key metric is the loss curve: training loss and validation loss plotted over training steps.

Pattern                                    | What it means                                    | Action
Train loss decreasing, val loss decreasing | Healthy training, model is learning              | Continue training
Train loss decreasing, val loss flat       | Model is near optimal, diminishing returns       | Consider stopping or reducing LR
Train loss decreasing, val loss increasing | OVERFITTING -- model is memorizing training data | Stop training, reduce epochs, add regularization
Train loss flat from start                 | Learning rate too low or data issue              | Increase LR, check data format
Train loss spiky/unstable                  | Learning rate too high or batch too small        | Reduce LR, increase batch size
Simple training monitor with early stopping
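A minimal, framework-free sketch of the monitor this caption describes: record validation loss at each evaluation and stop once it has failed to improve for `patience` consecutive evaluations. The loss values in the usage example are made up for illustration.

```python
# Early-stopping monitor: stop when val loss stops improving (the
# "val loss flat or increasing" rows in the table above).
class EarlyStopper:
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience        # how many bad evals to tolerate
        self.min_delta = min_delta      # minimum improvement to count
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1  # val loss flat or rising -- possible overfitting
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=2)
losses = [1.8, 1.4, 1.2, 1.25, 1.3]  # val loss turns upward after step 3
stops = [stopper.update(l) for l in losses]
print(stops)  # [False, False, False, False, True]
```

Hugging Face provides the same behavior out of the box via `EarlyStoppingCallback` with `load_best_model_at_end=True`; the class above just makes the logic explicit.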
Use Weights & Biases or TensorBoard

Always log training metrics to a visualization tool. Weights & Biases (wandb) is the most popular choice for LLM fine-tuning -- it logs loss curves, learning rate schedules, GPU utilization, and allows comparing multiple training runs side by side. Set report_to='wandb' in TrainingArguments.

Post-Training Steps

After training completes, there are several important steps before deploying your fine-tuned model.

  • Evaluate on held-out test set: compare the fine-tuned model against your prompt-engineered baseline
  • Check for regression on general capabilities: make sure fine-tuning did not degrade the model on unrelated tasks
  • Merge LoRA adapters if desired: merged models have zero LoRA overhead at inference time
  • Convert to optimized format: GGUF for llama.cpp, GPTQ or AWQ for vLLM/TGI deployment
  • Run safety evaluation: ensure fine-tuning did not degrade the model's safety alignment
  • Document the training run: hyperparameters, data version, eval results, known limitations
Post-training evaluation and comparison
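A sketch of this comparison: score base-model and fine-tuned predictions against references, then bucket the disagreements (including the base_only_correct count discussed below). `exact_match` is an illustrative stand-in for whatever metric fits your task.

```python
# Bucket test cases by which model got them right. base_only_correct flags
# cases where fine-tuning made things worse.
def exact_match(pred: str, ref: str) -> bool:
    return pred.strip().lower() == ref.strip().lower()


def compare_models(base_preds, ft_preds, refs):
    counts = {"both_correct": 0, "ft_only_correct": 0,
              "base_only_correct": 0, "both_wrong": 0}
    for b, f, r in zip(base_preds, ft_preds, refs):
        b_ok, f_ok = exact_match(b, r), exact_match(f, r)
        if b_ok and f_ok:
            counts["both_correct"] += 1
        elif f_ok:
            counts["ft_only_correct"] += 1
        elif b_ok:
            counts["base_only_correct"] += 1  # regression: investigate these
        else:
            counts["both_wrong"] += 1
    return counts


# Toy illustration with three test cases
counts = compare_models(["a", "b", "x"], ["a", "y", "c"], ["a", "b", "c"])
print(counts)  # {'both_correct': 1, 'ft_only_correct': 1, 'base_only_correct': 1, 'both_wrong': 0}
```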
Watch for regressions

The 'base_only_correct' count shows cases where fine-tuning made things worse. If this is more than 5% of test cases, investigate those specific examples. Fine-tuning can degrade performance on certain input types -- especially if your training data is not diverse enough.

Best Practices

Do

  • Start with OpenAI fine-tuning for simplicity, move to Hugging Face for more control
  • Always include a validation split (10-20%) and monitor validation loss during training
  • Use early stopping based on validation loss to prevent overfitting
  • Log training metrics to Weights & Biases or TensorBoard for visualization
  • Compare fine-tuned model against prompt-engineered baseline on the same test set

Don’t

  • Don't train for too many epochs -- 1-3 is sufficient for most tasks, more leads to overfitting
  • Don't skip validation data -- without it, you have no way to detect overfitting during training
  • Don't use very high learning rates with large models -- they are more sensitive to instability
  • Don't deploy without checking for regressions on the base model's general capabilities
  • Don't forget to document the training run -- future you will want to reproduce it

Key Takeaways

  • Three paths: OpenAI (simplest, limited control), Hugging Face (full control), cloud-managed (middle ground).
  • Always use a validation set and monitor loss curves -- overfitting is the most common failure mode.
  • QLoRA on Hugging Face enables fine-tuning 70B models on a single GPU with near-full-fine-tuning quality.
  • Post-training evaluation must compare against the baseline AND check for regressions on unrelated tasks.
  • Document everything: hyperparameters, data version, eval results -- you will need to reproduce the run.

Video on this topic

Fine-tuning an LLM from start to finish
