Evaluating Fine-Tuned Models
How to rigorously evaluate fine-tuned LLMs: train/validation/test splitting for LLMs, detecting overfitting and benchmark contamination, A/B testing fine-tuned vs base models with real users, and a complete evaluation harness implementation.
Quick Reference
- →Split data: 80% train, 10% validation (used during training), 10% test (never seen during training)
- →Overfitting: model memorizes training data -- high train accuracy, low test accuracy
- →Contamination: training data overlaps with evaluation data, giving inflated scores
- →Evaluate on: task accuracy, format compliance, latency, cost, and regression on general tasks
- →A/B test with real users for final validation -- automated metrics do not capture everything
- →Keep the test set strictly separated -- never use it for training decisions
Train/Validation/Test Split for LLMs
Data splitting for LLM fine-tuning follows the same principles as traditional ML, but with some important nuances. The key is maintaining strict separation between data used for training, data used for hyperparameter decisions, and data used for final evaluation.
| Split | Purpose | Size | When used | Who sees it |
|---|---|---|---|---|
| Train | Update model weights | 80% of data | During training | The model during training |
| Validation | Monitor overfitting, early stopping, hyperparameter tuning | 10% | During training (evaluation steps) | Used for training decisions but not weight updates |
| Test | Final quality assessment, model comparison | 10% | After training is complete | Never during training -- only for final evaluation |
Never make training decisions based on test set performance. The test set is for final reporting only. If you tune hyperparameters based on test results, you are effectively training on the test set through your decisions. Use the validation set for all training-time decisions (early stopping, learning rate, epoch count).
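A minimal sketch of such a split, assuming the dataset is a list of example dicts (the function name and fractions here are illustrative, matching the 80/10/10 table above):

```python
import random

def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Deterministic 80/10/10 split. Shuffle once with a fixed seed so the
    test set stays identical across runs -- re-splitting on every run
    silently leaks test examples into training."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (
        examples[:n_train],                  # train: updates model weights
        examples[n_train:n_train + n_val],   # validation: training-time decisions
        examples[n_train + n_val:],          # test: final evaluation only
    )
```

The fixed seed is the important part: it guarantees the same test set for every candidate model, so comparisons between runs are apples-to-apples.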
Detecting Overfitting
Overfitting in LLM fine-tuning manifests differently than in traditional ML. The model may perfectly reproduce training examples while failing to generalize to new inputs. The signs are subtle but detectable.
- ▸Loss gap: training loss continues to decrease while validation loss increases or plateaus
- ▸Memorization: the model reproduces training examples verbatim instead of learning the pattern
- ▸Brittleness: high accuracy on test cases similar to training data, low accuracy on novel inputs
- ▸Repetitive outputs: the model generates the same phrases or structures regardless of input
- ▸Loss of general capability: the model becomes worse at tasks unrelated to the fine-tuning objective
A quick overfitting check: run the model at temperature 0 and temperature 0.7 on test examples. If outputs are nearly identical regardless of temperature, the model may have memorized rather than learned. A well-generalized model shows natural variation at higher temperatures while maintaining accuracy.
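The loss-gap signal from the list above is easy to automate against your training logs. A sketch, assuming you have parallel lists of train/validation losses recorded at each evaluation step (function name and `patience` default are illustrative):

```python
def detect_loss_gap(train_losses, val_losses, patience=3):
    """Flag the first eval step where validation loss has risen for
    `patience` consecutive checks while training loss kept falling --
    the classic overfitting signature. Returns the step index where
    the divergence began, or None if no divergence is detected."""
    rising = 0
    for i in range(1, len(val_losses)):
        val_up = val_losses[i] > val_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        rising = rising + 1 if (val_up and train_down) else 0
        if rising >= patience:
            return i - patience + 1  # step where validation loss started climbing
    return None
```

Requiring several consecutive rising checks (`patience`) avoids false alarms from normal validation-loss noise; the returned index is also a reasonable checkpoint to roll back to.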
Benchmark Contamination in Fine-Tuning
If your training data overlaps with your test data, evaluation results will be inflated. This is especially insidious with synthetic data: when a teacher model generates both your training examples and your test cases from similar prompts, it will often produce near-duplicates across the two sets. Always run contamination checks between synthetic training data and your test set before trusting any benchmark number.
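One common contamination heuristic is word n-gram overlap: flag any test example that shares an n-gram with the training set. A self-contained sketch (function names are illustrative; the n-gram length is a tunable assumption, with 8-grams a typical production choice):

```python
def ngram_set(text, n=8):
    """All word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_indices(train_texts, test_texts, n=8):
    """Return indices of test examples sharing any word n-gram with
    the training data -- candidates for removal before evaluating."""
    train_grams = set()
    for t in train_texts:
        train_grams |= ngram_set(t, n)
    return [i for i, t in enumerate(test_texts) if ngram_set(t, n) & train_grams]
```

Exact n-gram matching misses paraphrased duplicates; for synthetic data it is worth adding a fuzzy pass (e.g., embedding similarity) on top, but the n-gram check is cheap enough to run on every split.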
A/B Testing with Real Users
Automated metrics tell you whether the fine-tuned model is better on your test set. But real-world performance depends on factors automated metrics cannot capture: user satisfaction, task completion rate, and handling of inputs your test set did not cover.
- ▸Route 10-20% of production traffic to the fine-tuned model, 80-90% to the current model
- ▸Measure task-specific success metrics: completion rate, user corrections, escalation rate
- ▸Measure user satisfaction: thumbs up/down, CSAT scores, time-to-resolution
- ▸Run for at least 1 week to capture day-of-week and time-of-day patterns
- ▸Monitor for failure modes not covered by your test set -- production inputs are more diverse
- ▸Have a kill switch: immediately route all traffic back to the base model if quality degrades
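The traffic split in the first bullet is usually implemented with deterministic hashing, so a given user always sees the same variant and per-user metrics stay consistent. A sketch under that assumption (the salt string and percentage are illustrative):

```python
import hashlib

def route_variant(user_id, treatment_pct=15, salt="ft-ab-test"):
    """Deterministic bucketing for an A/B test: hash the user id into
    one of 100 buckets; buckets below `treatment_pct` get the
    fine-tuned model. The salt decorrelates this experiment from any
    other experiment bucketing the same users. Setting treatment_pct
    to 0 acts as the kill switch -- all traffic reverts to baseline."""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100
    return "fine_tuned" if bucket < treatment_pct else "baseline"
```

Because routing is a pure function of the user id, there is no assignment table to store, and replaying logs reproduces exactly who saw which model.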
Before routing real user traffic, run the fine-tuned model in shadow mode: it receives the same inputs as the production model but its outputs are logged, not shown to users. Compare shadow outputs against production outputs offline. This catches catastrophic failures without affecting users.
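Offline comparison of shadow outputs should screen for catastrophic failures rather than exact agreement, since two healthy models rarely produce identical text. A minimal sketch (thresholds and heuristics here are illustrative assumptions):

```python
def screen_shadow_outputs(pairs, max_len_ratio=3.0):
    """Screen (production_output, shadow_output) pairs for catastrophic
    shadow-model failures: empty replies, degenerate repetition, and
    outputs wildly longer or shorter than production's. Returns a list
    of (index, reason) flags for manual review."""
    flags = []
    for i, (prod, shadow) in enumerate(pairs):
        words = shadow.split()
        if not shadow.strip():
            flags.append((i, "empty"))
        elif len(words) >= 8 and len(set(words)) / len(words) < 0.3:
            flags.append((i, "repetitive"))  # same tokens looping
        elif prod.strip() and not (
            1 / max_len_ratio <= max(len(shadow), 1) / len(prod) <= max_len_ratio
        ):
            flags.append((i, "length_anomaly"))
    return flags
```

A low flag rate does not prove the shadow model is better, only that it is not obviously broken; quality comparison still needs the task-specific metrics and A/B test above.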
Complete Evaluation Harness
Wrap the evaluation in a CI/CD pipeline that runs automatically whenever you produce a new fine-tuned model. The pipeline should: (1) load the model, (2) run the test suite, (3) compare against the baseline, (4) produce a report, and (5) block deployment if accuracy falls below threshold or regressions are detected.
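The deployment gate in steps (3) and (5) can be sketched as a pure function CI calls after scoring both models; thresholds, metric names, and the function name below are illustrative assumptions:

```python
def gate_deployment(candidate_scores, baseline_scores,
                    min_accuracy=0.85, max_regression=0.02):
    """Deployment gate: block if task accuracy is below threshold or if
    any metric regresses versus the baseline by more than
    `max_regression`. Both score dicts map metric name -> score in
    [0, 1]. Returns (deploy_ok, reasons) so CI can log why a
    candidate model was blocked."""
    reasons = []
    if candidate_scores.get("task_accuracy", 0.0) < min_accuracy:
        reasons.append(f"task_accuracy below {min_accuracy}")
    for metric, base in baseline_scores.items():
        cand = candidate_scores.get(metric, 0.0)
        if base - cand > max_regression:
            reasons.append(f"regression on {metric}: {base:.3f} -> {cand:.3f}")
    return (not reasons, reasons)
```

Returning machine-readable reasons instead of a bare boolean makes the CI report (step 4) trivial to generate and keeps the block/deploy decision auditable.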
Best Practices
Do
- ✓Maintain strict separation between train, validation, and test data
- ✓Run contamination checks between all data splits before evaluating
- ✓Evaluate on multiple dimensions: task accuracy, format compliance, latency, cost, general capability
- ✓Compare fine-tuned model against prompt-engineered baseline on the same test set
- ✓Use shadow mode and A/B testing with real users before full production deployment
Don’t
- ✗Don't make training decisions (epochs, learning rate) based on test set performance
- ✗Don't skip the contamination check -- overlapping train/test data gives meaninglessly inflated scores
- ✗Don't evaluate only on accuracy -- format compliance, latency, and cost matter in production
- ✗Don't trust a single evaluation run -- LLM outputs have variance even at temperature 0
- ✗Don't deploy a fine-tuned model without checking for regressions on general tasks
Key Takeaways
- ✓Strict train/validation/test separation is non-negotiable -- the test set must never influence training decisions.
- ✓Check for contamination between all splits, especially when using synthetic training data.
- ✓Overfitting manifests as: high train accuracy + low test accuracy, verbatim memorization, and loss of general capability.
- ✓Evaluate on multiple dimensions: accuracy, format compliance, latency, cost, and regression on general tasks.
- ✓Shadow mode and A/B testing with real users are the final validation before full deployment.
Video on this topic
How to tell if your fine-tuned model is actually better