End-to-End Fine-Tuning Pipeline
Complete fine-tuning pipelines for three approaches: OpenAI (simplest), Hugging Face + PEFT (most control), and cloud-managed (Vertex AI, Bedrock). Includes training monitoring, loss curve interpretation, overfitting detection, and a full working Hugging Face training script.
Quick Reference
- →OpenAI fine-tuning: upload JSONL, configure hyperparams, wait -- simplest path with limited control
- →Hugging Face + PEFT: maximum flexibility, open models, full control over training
- →Cloud-managed (Vertex AI, Bedrock): middle ground -- managed infrastructure with open model support
- →Monitor loss curves: training loss should decrease; validation loss should decrease then flatten (not increase)
- →Overfitting signs: validation loss increases while training loss continues to decrease
- →Typical training: 1-3 epochs for most tasks, lower learning rate for larger models
OpenAI Fine-Tuning (Simplest Path)
OpenAI's fine-tuning API is the simplest way to fine-tune an LLM. You upload training data, set a few hyperparameters, and wait. The trade-off is limited control: you cannot change the architecture, access intermediate checkpoints, or deploy the model outside of OpenAI's infrastructure.
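The flow above can be sketched with the OpenAI Python SDK. This is a minimal sketch, not a complete pipeline: the file paths and the model identifier are placeholders (the model name is taken from the pricing table below; substitute one your account can fine-tune), and the local JSONL check is a hypothetical helper added here for illustration.

```python
import json
import os


def validate_chat_jsonl(path: str) -> int:
    """Check that every line is a JSON object with a non-empty 'messages' list,
    the format the fine-tuning API expects. Returns the number of valid
    examples; raises ValueError on the first malformed line."""
    count = 0
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            record = json.loads(line)
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {lineno}: missing 'messages' list")
            count += 1
    return count


# Only talk to the API when credentials are configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    validate_chat_jsonl("train.jsonl")  # fail fast before uploading
    train_file = client.files.create(
        file=open("train.jsonl", "rb"), purpose="fine-tune"
    )
    job = client.fine_tuning.jobs.create(
        training_file=train_file.id,
        model="gpt-5.4-mini",              # placeholder model name
        hyperparameters={"n_epochs": 3},   # or omit to use 'auto'
    )
    print(job.id, job.status)  # poll with client.fine_tuning.jobs.retrieve(job.id)
```

Validating locally before uploading saves a round trip: the API rejects malformed files, but only after the upload completes.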
| Model | Training cost / 1M tokens | Inference input / 1M | Inference output / 1M |
|---|---|---|---|
| gpt-5.4-mini (fine-tuned) | $3.00 | $0.30 | $1.20 |
| gpt-5.4 (fine-tuned) | $25.00 | $3.75 | $15.00 |
| gpt-5.4-mini (fine-tuned, RL) | $6.00 | $0.30 | $1.20 |
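To budget a run from the table, note that training cost scales with both dataset size and epoch count, since each epoch re-processes every token. A quick sketch, assuming a hypothetical 2M-token dataset and billing of dataset tokens times epochs:

```python
def finetune_cost_usd(train_tokens: int, epochs: int, price_per_m: float) -> float:
    """Estimated training cost: billed tokens = dataset tokens x epochs."""
    return train_tokens * epochs / 1_000_000 * price_per_m


# 2M-token dataset, 3 epochs, at the gpt-5.4-mini rate of $3.00 / 1M tokens:
print(f"${finetune_cost_usd(2_000_000, 3, 3.00):.2f}")  # → $18.00
```

This is also why cutting from 3 epochs to 1 (see below) cuts the bill by two thirds, not just the wall-clock time.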
Start with 3 epochs and auto learning rate. If the model overfits (validation loss increases), reduce to 1-2 epochs. If quality is insufficient, increase data quality and quantity before increasing epochs. The 'auto' settings are surprisingly good defaults.
Hugging Face + PEFT (Maximum Control)
For open models with full control over the training process, Hugging Face's transformers library combined with PEFT (Parameter-Efficient Fine-Tuning) is the standard approach. This gives you control over every hyperparameter, access to checkpoints, and the ability to deploy anywhere.
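A QLoRA training script might look like the sketch below, assuming Llama 3 8B, a JSONL dataset with a `text` field at `data/train.jsonl`, and rank-16 LoRA on the attention projections; adjust model name, paths, and hyperparameters for your setup. It requires a CUDA GPU plus the `transformers`, `peft`, `bitsandbytes`, and `datasets` packages.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works

# Load the base model in 4-bit NF4 (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach rank-16 LoRA adapters to the attention projections.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

# 90/10 train/validation split -- the validation loss drives early stopping.
ds = load_dataset("json", data_files="data/train.jsonl")["train"]
ds = ds.train_test_split(test_size=0.1)
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out", num_train_epochs=2, learning_rate=2e-4,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        eval_strategy="steps", eval_steps=100, logging_steps=10,
        report_to="wandb",  # log loss curves to Weights & Biases
    ),
    train_dataset=ds["train"], eval_dataset=ds["test"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("out/adapter")  # saves only the LoRA adapter weights
```

Saving only the adapter keeps the artifact small (tens to hundreds of MB); merge it into the base weights later if you want zero adapter overhead at inference.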
For a typical QLoRA setup (4-bit NF4 base weights, rank-16 LoRA on the attention projections) with Llama 3 8B: ~10 GB VRAM for the 4-bit model, ~2 GB for LoRA parameters, ~4 GB for optimizer states and gradients. Total: ~16 GB. A single RTX 4090 (24 GB) handles this comfortably. For 70B models, budget ~48 GB (an 80 GB A100, a 48 GB RTX 6000 Ada, or 2x RTX 4090).
Cloud-Managed Fine-Tuning
Cloud providers offer managed fine-tuning that handles infrastructure for you while giving you access to open models. This is the middle ground between OpenAI's simplicity and Hugging Face's flexibility.
| Platform | Supported models | Pricing model | Strengths |
|---|---|---|---|
| Google Vertex AI | Gemini, Llama, Mistral | Per training hour | Integrated with Google Cloud, strong MLOps |
| AWS Bedrock | Llama, Mistral, Cohere, Amazon Titan | Per training token | Integrated with AWS ecosystem, VPC support |
| Azure AI Studio | Llama, Mistral, Phi | Per training hour | Integrated with Azure, enterprise features |
| Together AI | Llama, Mistral, Qwen, any HF model | Per GPU hour | Widest model selection, developer-friendly |
Use managed fine-tuning when: (1) you want open model flexibility but don't want to manage GPUs, (2) you're already on a cloud platform and want tight integration, (3) your security team requires data to stay within your cloud VPC. The cost is typically 2-3x higher than self-managed GPU training, but the time savings are significant.
Training Monitoring and Loss Curves
Monitoring training progress is essential for catching issues early and knowing when to stop. The key metric is the loss curve: training loss and validation loss plotted over training steps.
| Pattern | What it means | Action |
|---|---|---|
| Train loss decreasing, val loss decreasing | Healthy training, model is learning | Continue training |
| Train loss decreasing, val loss flat | Model is near optimal, diminishing returns | Consider stopping or reducing LR |
| Train loss decreasing, val loss increasing | OVERFITTING -- model is memorizing training data | Stop training, reduce epochs, add regularization |
| Train loss flat from start | Learning rate too low or data issue | Increase LR, check data format |
| Train loss spiky/unstable | Learning rate too high or batch too small | Reduce LR, increase batch size |
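The overfitting row of the table can be turned into an automatic check. Below is a small illustrative helper (not part of any library) that flags the pattern from logged loss histories; in a real Hugging Face run you would more likely use `transformers.EarlyStoppingCallback` with an `early_stopping_patience` setting to the same effect.

```python
def detect_overfitting(train_loss: list[float], val_loss: list[float],
                       patience: int = 3) -> bool:
    """Flag the overfitting pattern: training loss still falling while
    validation loss has risen for `patience` consecutive evaluations."""
    if len(val_loss) <= patience:
        return False  # not enough history to judge
    val_rising = all(val_loss[-i] > val_loss[-i - 1]
                     for i in range(1, patience + 1))
    train_falling = train_loss[-1] < train_loss[-patience - 1]
    return val_rising and train_falling


# Healthy run: both curves falling together.
print(detect_overfitting([2.0, 1.5, 1.2, 1.0],
                         [2.1, 1.7, 1.5, 1.4]))        # → False

# Overfitting: train keeps falling, validation turns upward.
print(detect_overfitting([2.0, 1.5, 1.2, 1.0, 0.8],
                         [2.1, 1.6, 1.7, 1.8, 1.9]))   # → True
```

Requiring several consecutive increases (the `patience` window) avoids stopping on the normal step-to-step noise in validation loss.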
Always log training metrics to a visualization tool. Weights & Biases (wandb) is the most popular choice for LLM fine-tuning -- it logs loss curves, learning rate schedules, GPU utilization, and allows comparing multiple training runs side by side. Set report_to='wandb' in TrainingArguments.
Post-Training Steps
After training completes, there are several important steps before deploying your fine-tuned model.
- ▸Evaluate on held-out test set: compare the fine-tuned model against your prompt-engineered baseline
- ▸Check for regression on general capabilities: make sure fine-tuning did not degrade the model on unrelated tasks
- ▸Merge LoRA adapters if desired: merged models have zero LoRA overhead at inference time
- ▸Convert to optimized format: GGUF for llama.cpp, GPTQ or AWQ for vLLM/TGI deployment
- ▸Run safety evaluation: ensure fine-tuning did not degrade the model's safety alignment
- ▸Document the training run: hyperparameters, data version, eval results, known limitations
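The first two checklist items reduce to a four-way comparison on the held-out test set. A minimal sketch, assuming exact-match scoring against expected answers (the function and bucket names are illustrative, not from any library):

```python
from collections import Counter


def compare_models(expected, base_answers, ft_answers) -> Counter:
    """Bucket each test case by which model (base vs. fine-tuned) got it right."""
    buckets = Counter()
    for want, base, ft in zip(expected, base_answers, ft_answers):
        base_ok, ft_ok = base == want, ft == want
        if base_ok and ft_ok:
            buckets["both_correct"] += 1
        elif ft_ok:
            buckets["ft_only_correct"] += 1    # fine-tuning helped here
        elif base_ok:
            buckets["base_only_correct"] += 1  # fine-tuning made this case worse
        else:
            buckets["both_wrong"] += 1
    return buckets


counts = compare_models(["a", "b", "c", "d"],
                        ["a", "x", "c", "x"],   # base model outputs
                        ["a", "b", "x", "x"])   # fine-tuned model outputs
print(counts["base_only_correct"])  # → 1 (case "c": base right, fine-tuned wrong)
```

Exact match is the simplest scorer; for free-form outputs you would substitute a task-appropriate metric, but the four buckets stay the same.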
When comparing the fine-tuned and base models on the test set, track a 'base_only_correct' count: cases the base model answers correctly but the fine-tuned model gets wrong. These are cases where fine-tuning made things worse. If they exceed 5% of test cases, investigate those specific examples. Fine-tuning can degrade performance on certain input types -- especially if your training data is not diverse enough.
Best Practices
Do
- ✓Start with OpenAI fine-tuning for simplicity, move to Hugging Face for more control
- ✓Always include a validation split (10-20%) and monitor validation loss during training
- ✓Use early stopping based on validation loss to prevent overfitting
- ✓Log training metrics to Weights & Biases or TensorBoard for visualization
- ✓Compare fine-tuned model against prompt-engineered baseline on the same test set
Don’t
- ✗Don't train for too many epochs -- 1-3 is sufficient for most tasks, more leads to overfitting
- ✗Don't skip validation data -- without it, you have no way to detect overfitting during training
- ✗Don't use very high learning rates with large models -- they are more sensitive to instability
- ✗Don't deploy without checking for regressions on the base model's general capabilities
- ✗Don't forget to document the training run -- future you will want to reproduce it
Key Takeaways
- ✓Three paths: OpenAI (simplest, limited control), Hugging Face (full control), cloud-managed (middle ground).
- ✓Always use a validation set and monitor loss curves -- overfitting is the most common failure mode.
- ✓QLoRA on Hugging Face enables fine-tuning 70B models on a single GPU with near-full-fine-tuning quality.
- ✓Post-training evaluation must compare against the baseline AND check for regressions on unrelated tasks.
- ✓Document everything: hyperparameters, data version, eval results -- you will need to reproduce the run.