Evaluating Fine-Tuned Models
Five dimensions you must evaluate before shipping a fine-tuned model — including catastrophic forgetting, which kills most fine-tunes that look good on paper. Covers loss curve interpretation, contamination detection, LLM-as-judge eval, and a go/no-go decision framework.
Quick Reference
- →Evaluate on 5 dimensions: task accuracy, format compliance, general capability, latency/cost, user satisfaction
- →Loss curves are your first diagnostic — diverging train/val loss means stop training
- →Catastrophic forgetting: fine-tuned model loses general capability. Always run a regression suite
- →Contamination check: verify zero overlap between train and test sets before trusting any score
- →LLM-as-judge (different provider): more reliable than string-match for open-ended outputs
- →Shadow mode before any A/B traffic — catch catastrophic failures without affecting users
- →Test set is sacred: never use it for training decisions, only for final evaluation
What to Measure (and What Most Teams Miss)
Most teams evaluate a fine-tuned model on exactly one dimension: task accuracy on the test set. If it goes up, they ship. This is how you end up with a model that is technically better at your task but breaks everything else.
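Figure: the five evaluation dimensions (task accuracy, format compliance, general capability, latency/cost, user satisfaction). Fail any one of these and you may ship a model that is worse in production.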
Each dimension catches a different class of failure. Task accuracy confirms the model learned the task. Format compliance catches silent failures where outputs are correct but unparseable. General capability catches catastrophic forgetting. Latency and cost catch cases where the improvement is real but unaffordable. User satisfaction catches the gap between what automated metrics measure and what users actually experience.
A model that improves from 71% to 94% on your test set can still degrade in production if it loses general capability, violates output format contracts, or increases latency beyond your SLA. Establish a baseline for all 5 dimensions before fine-tuning, then verify none regressed after.
Reading Training Loss Curves
Training and validation loss curves are the first thing you check — before running any test set evaluation. They tell you whether training is going well before you spend time on downstream metrics. There are three patterns, each requiring a different response.
Figure: training loss (blue) and validation loss (red dashed) for the three patterns. The gap between the curves tells you which problem you have.
Underfitting: both curves stay high and track each other. The model is not learning. Check data quality and quantity first: you likely have too few examples, noisy labels, or mismatched system prompts.
Overfitting: train loss keeps dropping while val loss rises. The model is memorizing training examples instead of learning the pattern. Roll back to the checkpoint before divergence (usually 1-2 epochs earlier).
Good fit: both curves decrease together and plateau with a small gap.
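The three patterns can be checked mechanically from per-epoch loss values. The following is a minimal sketch, not a definitive rule: the function names, the `patience` window, and the `min_rel_drop` cutoff are all assumptions introduced for illustration, and the cutoffs should be calibrated against your own baseline runs.

```python
from typing import Sequence, Tuple


def best_epoch(val_loss: Sequence[float]) -> int:
    """Return the 0-based epoch index with the lowest validation loss."""
    return min(range(len(val_loss)), key=lambda i: val_loss[i])


def diagnose(
    train_loss: Sequence[float],
    val_loss: Sequence[float],
    patience: int = 2,          # assumed: epochs without a new val minimum
    min_rel_drop: float = 0.1,  # assumed: minimum relative train-loss drop
) -> Tuple[str, int]:
    """Classify a run and return (label, epoch of the best checkpoint)."""
    if not train_loss or len(train_loss) != len(val_loss):
        raise ValueError("need equal-length, non-empty loss histories")

    best = best_epoch(val_loss)
    epochs_since_best = len(val_loss) - 1 - best
    train_still_falling = train_loss[-1] < train_loss[best]

    # Overfitting: val loss has not improved for `patience` epochs
    # while train loss kept dropping.
    if epochs_since_best >= patience and train_still_falling:
        return "overfitting", best

    # Underfitting: train loss barely moved from where it started.
    rel_drop = (train_loss[0] - train_loss[-1]) / max(train_loss[0], 1e-12)
    if rel_drop < min_rel_drop:
        return "underfitting", best

    return "good_fit", best


label, checkpoint = diagnose(
    train_loss=[2.0, 1.5, 1.1, 0.8, 0.5, 0.3],
    val_loss=[2.1, 1.7, 1.4, 1.5, 1.7, 1.9],
)
print(label, checkpoint)
```

When the label is "overfitting", the returned epoch index is the checkpoint to roll back to.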
When submitting a fine-tuning job to OpenAI, always pass a separate validation_file. Without it, you only see training loss — and you cannot distinguish good fit from overfitting until you evaluate on the test set, which is too late to cheaply fix.
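As a sketch of what passing a validation file looks like with the OpenAI Python SDK (v1 style), the helper below uploads both files and creates the job. The helper name, the file paths, and the model string in the usage comment are placeholders chosen for illustration; `client` is assumed to be an `openai.OpenAI()` instance.

```python
def submit_finetune_job(client, train_path: str, val_path: str, base_model: str):
    """Upload train and validation JSONL files, then start a fine-tuning job.

    `client` is assumed to be an `openai.OpenAI()` instance.
    """
    with open(train_path, "rb") as f:
        train_file = client.files.create(file=f, purpose="fine-tune")
    with open(val_path, "rb") as f:
        val_file = client.files.create(file=f, purpose="fine-tune")

    # validation_file is what makes validation loss appear in the job metrics.
    return client.fine_tuning.jobs.create(
        model=base_model,
        training_file=train_file.id,
        validation_file=val_file.id,
    )


# Usage (placeholder paths and model name; requires an API key):
#   from openai import OpenAI
#   job = submit_finetune_job(OpenAI(), "train.jsonl", "val.jsonl",
#                             "gpt-4o-mini-2024-07-18")
#   print(job.id)
```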
Data Splitting for LLMs (What's Different)
Train/validation/test splitting for LLMs usually starts from the standard 80/10/10 split, but two pitfalls bite harder here than in classical ML: multi-turn conversations split across sets, and contamination between train and test.
| Split | Purpose | LLM-specific concern |
|---|---|---|
| Train (80%) | Update model weights | Never split a multi-turn conversation across sets — split at conversation boundaries |
| Validation (10%) | Early stopping, hyperparameter tuning | Used during training; never use for reporting final scores |
| Test (10%) | Final evaluation only | Must be checked for contamination with train before trusting any score |
The conversation-level constraint matters when your training data contains multi-turn dialogues. If turn 1 is in train and turn 3 is in test, the model has seen the context — your test score is contaminated. Split entire conversations, not individual turns.
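One way to enforce the conversation-level constraint is to shuffle and split conversation IDs rather than individual turns. This sketch assumes each example is a dict carrying a `conversation_id` key; that field name, the function name, and the fixed seed are illustrative choices, not part of any particular dataset format.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_by_conversation(
    examples: Sequence[dict],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Split examples into train/val/test without breaking up a conversation."""
    by_conv = defaultdict(list)
    for ex in examples:
        by_conv[ex["conversation_id"]].append(ex)

    # Shuffle whole conversations, deterministically for reproducibility.
    conv_ids = sorted(by_conv)
    random.Random(seed).shuffle(conv_ids)

    n_train = int(len(conv_ids) * fractions[0])
    n_val = int(len(conv_ids) * fractions[1])
    groups = {
        "train": conv_ids[:n_train],
        "val": conv_ids[n_train:n_train + n_val],
        "test": conv_ids[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [ex for cid in ids for ex in by_conv[cid]]
        for name, ids in groups.items()
    }


data = [
    {"conversation_id": c, "turn": t, "text": f"conv {c} turn {t}"}
    for c in range(20)
    for t in range(3)
]
splits = split_by_conversation(data)
print({name: len(rows) for name, rows in splits.items()})
```

Because the split is over IDs, the realized example counts only approximate 80/10/10 when conversations differ in length.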
Never make training decisions based on test set performance. If you tune epoch count or learning rate by looking at test scores, you are effectively training on the test set through your decisions. Use validation for all in-training decisions. Use test only to report the final number.
Contamination Detection
Contamination means training examples ended up in your test set, so the test score is inflated because the model already saw those inputs. Run a contamination check before trusting any score.
The risk is highest with synthetic data. When you generate synthetic training data from a teacher model, the teacher may produce examples semantically similar to your test set, especially if both sets were generated from overlapping prompt templates. Always run the contamination check between synthetic training data and your test set, not just between human-labeled splits.
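A minimal contamination check might look like the sketch below. It catches exact duplicates (after normalization) and lexical near-duplicates via n-gram Jaccard overlap. The function names, the trigram size, and the 0.8 threshold are assumptions to tune on your own data. Paraphrases that share meaning but not wording would need an embedding-based comparison, which is not shown here.

```python
import re
from typing import List, Sequence, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def ngrams(text: str, n: int = 3) -> Set[tuple]:
    """Word n-grams of the normalized text (whole text if shorter than n)."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[tuple], b: Set[tuple]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_contamination(
    train_inputs: Sequence[str],
    test_inputs: Sequence[str],
    threshold: float = 0.8,  # assumed cutoff; calibrate on known duplicates
    n: int = 3,
) -> List[Tuple[str, str, float]]:
    """Return (test, train, score) pairs whose n-gram overlap >= threshold.

    Pairwise comparison is O(len(train) * len(test)); fine for small sets.
    """
    train_grams = [(t, ngrams(t, n)) for t in train_inputs]
    hits = []
    for test in test_inputs:
        test_grams = ngrams(test, n)
        for train, grams in train_grams:
            score = jaccard(test_grams, grams)
            if score >= threshold:
                hits.append((test, train, score))
    return hits


train = ["Classify this transaction: STARBUCKS #4821 $5.40",
         "What is your refund policy?"]
test = ["classify this transaction:  Starbucks #4821 $5.40!",
        "How do I reset my password?"]
for hit in find_contamination(train, test):
    print(hit)
```

Any hit means the affected test examples should be removed or replaced before a score is reported.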
Detecting Catastrophic Forgetting
Catastrophic forgetting is the most common fine-tuning failure that passes automated tests. The model improves on its target task but loses general language capabilities — answering questions outside the fine-tune scope with task-specific jargon, refusing to deviate from learned output patterns, or simply producing worse results on anything it was not trained on.
A payments company fine-tuned GPT-4o-mini on 8,000 transaction classification examples. Task accuracy jumped from 71% to 94% on their test set. They shipped it. Three days later, support tickets spiked: the bot was answering refund policy questions and account help requests with transaction classification jargon — it had learned to classify everything. Their regression suite had zero examples of general customer questions. They rolled back in 4 hours, added 150 regression examples to the test suite, retrained with a lower epoch count, and reshipped a week later.
The lesson: a regression suite of 100-200 general-capability examples takes about 2 hours to build. A production rollback costs days.
Cover: (1) tasks your users do outside the fine-tuned workflow, (2) refusals and safety cases, (3) format and tone diversity. A suite of that size will catch the most expensive class of fine-tuning failure before it reaches users.
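Once the regression suite exists, comparing base and fine-tuned pass rates per category makes forgetting visible. The sketch below assumes each result has already been graded to a `(category, passed)` pair by whatever checker you use; the category names and the `max_drop` tolerance are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Result = Tuple[str, bool]  # (category, passed)


def pass_rates(results: Iterable[Result]) -> Dict[str, float]:
    """Pass rate per category."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += bool(ok)
    return {c: passed[c] / totals[c] for c in totals}


def find_regressions(
    base_results: Iterable[Result],
    tuned_results: Iterable[Result],
    max_drop: float = 0.02,  # assumed tolerance, in absolute pass rate
) -> Dict[str, Tuple[float, float]]:
    """Categories where the fine-tuned model dropped more than max_drop."""
    base = pass_rates(base_results)
    tuned = pass_rates(tuned_results)
    return {
        c: (base[c], tuned.get(c, 0.0))
        for c in base
        if base[c] - tuned.get(c, 0.0) > max_drop
    }


base_run = [("general_qa", True)] * 9 + [("general_qa", False)] + [("refusals", True)] * 10
tuned_run = [("general_qa", True)] * 5 + [("general_qa", False)] * 5 + [("refusals", True)] * 10
print(find_regressions(base_run, tuned_run))
```

Reporting per category rather than as one average keeps a collapse in a small category from being hidden by stable results elsewhere.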
LLM-as-Judge for Fine-Tuned Models
String-match evaluation works for structured extraction (JSON fields, categories, codes). It fails for anything conversational, generative, or style-dependent. If your fine-tuned model produces 'Order confirmed for SKU 4821' and the expected output is 'Your order for item 4821 has been confirmed' — string-match scores zero on a perfect answer. LLM-as-judge solves this.
Use a frontier model from a different provider as your judge. This reduces same-provider bias — an OpenAI model asked to judge another OpenAI model's output will tend to favor responses that match its own generation style. Use Claude to judge fine-tuned OpenAI models, or GPT-5.4 to judge fine-tuned Claude models.
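Whichever provider you pick, the judge needs an explicit rubric and a verdict you can parse. The sketch below only builds the prompt and parses the reply, since the call to the judge model depends on your provider's SDK. The rubric wording, the 1-5 scale, and the `SCORE:` output convention are illustrative assumptions.

```python
import re
from typing import Optional

# Assumed rubric; replace with criteria specific to your task.
RUBRIC = """\
5 = same meaning as the reference, nothing missing or invented
3 = mostly correct, but a detail is missing or imprecise
1 = contradicts the reference or does not answer the input
"""


def build_judge_prompt(user_input: str, reference: str, candidate: str) -> str:
    """Assemble a grading prompt that asks for meaning-level comparison."""
    return (
        "You are grading a model response against a reference answer.\n"
        "Judge meaning, not wording.\n\n"
        f"Scoring rubric (1-5):\n{RUBRIC}\n"
        f"Input:\n{user_input}\n\n"
        f"Reference:\n{reference}\n\n"
        f"Candidate:\n{candidate}\n\n"
        "Explain briefly, then end with a line of the form 'SCORE: <1-5>'."
    )


_SCORE_RE = re.compile(r"SCORE:\s*([1-5])\b")


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the last 'SCORE: n' from the reply, or None if absent."""
    matches = _SCORE_RE.findall(judge_reply)
    return int(matches[-1]) if matches else None


prompt = build_judge_prompt(
    user_input="Confirm the customer's order.",
    reference="Your order for item 4821 has been confirmed",
    candidate="Order confirmed for SKU 4821",
)
print(prompt)
```

Returning `None` for unparseable replies, rather than a default score, keeps judge failures from silently counting as low or high grades.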
Run the same test cases through the judge twice and compare scores. A well-calibrated judge should agree with itself on >90% of cases. If consistency is below 80%, your judge prompt is under-specified: add more explicit scoring criteria, with concrete examples of what earns each score level.
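The self-consistency check reduces to an agreement rate between two scoring runs. In this sketch each run is assumed to be a dict mapping a case ID to the judge's integer score. The "borderline" label for the 80-90% band is an addition for illustration, since the thresholds above leave that range unspecified.

```python
from typing import Dict, Hashable


def self_agreement(
    run_a: Dict[Hashable, int],
    run_b: Dict[Hashable, int],
    tolerance: int = 0,  # 0 = scores must match exactly
) -> float:
    """Fraction of cases where two judge runs gave the same score."""
    if not run_a or run_a.keys() != run_b.keys():
        raise ValueError("both runs must score the same non-empty set of cases")
    agree = sum(abs(run_a[k] - run_b[k]) <= tolerance for k in run_a)
    return agree / len(run_a)


def judge_status(agreement: float) -> str:
    """Map an agreement rate onto the >90% / <80% thresholds."""
    if agreement > 0.90:
        return "calibrated"
    if agreement < 0.80:
        return "tighten_prompt"
    return "borderline"  # assumed label for the unspecified middle band


first = {case_id: 4 for case_id in range(20)}
second = dict(first)
second[7] = 2  # one of twenty cases flips between runs
rate = self_agreement(first, second)
print(rate, judge_status(rate))
```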
Production Evaluation Harness
Combine all five dimensions into a single harness that runs automatically when you produce a new fine-tuned checkpoint. The harness compares fine-tuned vs base on every dimension and blocks deployment if any threshold is breached.
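A deployment gate of this kind can be sketched as a list of per-metric regression budgets checked against base and fine-tuned results. Everything below is hypothetical: the metric names, the budgets, and the numbers in the example. User satisfaction is left out on the assumption that it is only measurable later, in the A/B stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Gate:
    metric: str
    higher_is_better: bool
    max_regression: float  # allowed worsening vs base, in the metric's units


def evaluate_gates(
    base: Dict[str, float],
    tuned: Dict[str, float],
    gates: Sequence[Gate],
) -> List[str]:
    """Return one message per breached gate; an empty list means deployable."""
    failures = []
    for gate in gates:
        delta = tuned[gate.metric] - base[gate.metric]
        # Express the change as "how much worse", whatever the direction.
        worsening = -delta if gate.higher_is_better else delta
        if worsening > gate.max_regression:
            failures.append(
                f"{gate.metric}: base={base[gate.metric]} "
                f"tuned={tuned[gate.metric]} exceeds budget {gate.max_regression}"
            )
    return failures


# Hypothetical budgets; set yours from the pre-fine-tuning baseline.
GATES = [
    Gate("task_accuracy", higher_is_better=True, max_regression=0.0),
    Gate("format_compliance", higher_is_better=True, max_regression=0.01),
    Gate("regression_pass_rate", higher_is_better=True, max_regression=0.02),
    Gate("p95_latency_ms", higher_is_better=False, max_regression=100.0),
]

base_metrics = {"task_accuracy": 0.71, "format_compliance": 0.99,
                "regression_pass_rate": 0.95, "p95_latency_ms": 800.0}
tuned_metrics = {"task_accuracy": 0.94, "format_compliance": 0.99,
                 "regression_pass_rate": 0.80, "p95_latency_ms": 820.0}

failures = evaluate_gates(base_metrics, tuned_metrics, GATES)
print("BLOCK" if failures else "PASS", failures)
```

In CI, a non-empty failure list would translate to a non-zero exit code so the checkpoint cannot be promoted.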
The Go/No-Go Decision
Once automated evaluation passes, the deployment path is shadow mode → limited A/B → full rollout. Each stage gates the next. Never skip shadow mode — it is the only way to catch failure modes your test set did not cover before they affect real users.
| Stage | Traffic | What you learn | Gate to next stage |
|---|---|---|---|
| Shadow mode | 0% (log only) | Catastrophic output failures, format violations at production input distribution | Zero catastrophic failures in 48 hours |
| Limited A/B | 5-10% | User satisfaction delta, task completion, escalation rate | Statistically significant improvement or neutral; no regression |
| Full rollout | 100% | Long-tail edge cases, time-of-day patterns | Monitor for 1 week; define kill-switch criteria upfront |
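For the limited A/B gate, one common way to test whether a difference in a rate such as task completion is statistically significant is a two-proportion z-test. The sketch below uses only the standard library. The function name and the example counts are made up, and the choice of test is an assumption rather than something this framework prescribes.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value); z > 0 means arm B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both arms all-success or all-failure
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Made-up counts: control completes 800/1000 tasks, fine-tuned 850/1000.
z, p = two_proportion_z(800, 1000, 850, 1000)
print(f"z={z:.2f} p={p:.4f}")
```

A significant negative `z` is a regression and fails the gate; a non-significant result corresponds to the "neutral" case in the table.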
Before any A/B traffic, write down the exact numbers that trigger an immediate rollback, e.g. format compliance drops below 95%, escalation rate increases more than 20%, or any single catastrophic output type appears more than twice per 1,000 requests. Decide them before launch: you will not make good decisions while staring at a degrading metric at 2am.
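Writing the criteria down as code makes them hard to renegotiate mid-incident. This sketch encodes the three example thresholds above. The class and field names are hypothetical, and the 20% escalation increase is interpreted as relative to the baseline rate, which is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KillSwitch:
    min_format_compliance: float = 0.95
    max_escalation_increase: float = 0.20  # relative to baseline (assumed)
    max_catastrophic_per_1k: float = 2.0   # per failure type


def rollback_reasons(
    ks: KillSwitch,
    format_compliance: float,
    escalation_rate: float,
    baseline_escalation_rate: float,
    catastrophic_counts: Dict[str, int],
    requests: int,
) -> List[str]:
    """Return every breached criterion; any entry means roll back now."""
    reasons = []
    if format_compliance < ks.min_format_compliance:
        reasons.append(f"format compliance {format_compliance:.2%}")
    if baseline_escalation_rate > 0:
        increase = (escalation_rate - baseline_escalation_rate) / baseline_escalation_rate
        if increase > ks.max_escalation_increase:
            reasons.append(f"escalation rate up {increase:.0%}")
    if requests > 0:
        for kind, count in catastrophic_counts.items():
            per_1k = count / requests * 1000
            if per_1k > ks.max_catastrophic_per_1k:
                reasons.append(f"{kind}: {per_1k:.1f} per 1,000 requests")
    return reasons


reasons = rollback_reasons(
    KillSwitch(),
    format_compliance=0.97,
    escalation_rate=0.13,
    baseline_escalation_rate=0.10,
    catastrophic_counts={"wrong_account_data": 3},
    requests=1000,
)
print(reasons)
```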
Figure: eval pipeline. Traces feed the dataset, a judge scores outputs, and a CI gate blocks regressions.
Best Practices
Do
- ✓Upload a validation file for every fine-tuning job — you need val loss to diagnose training
- ✓Run contamination checks between all data splits before reporting any evaluation score
- ✓Build a 100-200 example regression suite covering tasks outside the fine-tune scope
- ✓Use a judge from a different provider to evaluate open-ended outputs (cross-provider reduces bias)
- ✓Run shadow mode for at least 48 hours before any A/B traffic
- ✓Gate deployment on all 5 dimensions: accuracy, format, regression, latency, and user satisfaction
- ✓Set per-category task accuracy gates — a model that improves average accuracy but regresses on a critical category is not safe to ship
- ✓Define kill-switch criteria in writing before launching any A/B test
- ✓Compare fine-tuned model against a prompt-engineered baseline, not just the zero-shot base model
Don’t
- ✗Don't make training decisions (epochs, learning rate) based on test set performance — that's data leakage
- ✗Don't skip the contamination check for synthetic data — teacher models produce semantically similar outputs from similar prompts
- ✗Don't evaluate only on task accuracy — format compliance and regression are as important for production safety
- ✗Don't trust a single LLM judge score — verify judge consistency (same cases, twice) before relying on results
- ✗Don't split multi-turn conversations across train/test — split at conversation boundaries
- ✗Don't go straight from automated tests to full traffic — shadow mode is not optional
- ✗Don't use magic-number thresholds (like 0.15 accuracy gap = overfitting) — establish your own baseline gap before fine-tuning
- ✗Don't treat the 80/10/10 split as sacred — for small datasets (<500 examples), consider 90/5/5 to maximize training signal
- ✗Don't deploy without monitoring the same 5 eval dimensions in production — drift is real
Key Takeaways
- ✓Fine-tuned models must be evaluated on 5 dimensions — task accuracy alone will miss production failures.
- ✓Loss curve divergence is the earliest signal of overfitting — catch it during training, not on the test set.
- ✓Catastrophic forgetting kills fine-tuned models that look perfect on paper — always run a regression suite.
- ✓Contamination between train and test sets inflates scores silently; it is especially common with synthetic data.
- ✓LLM-as-judge from a different provider reduces same-provider bias and handles cases string-match cannot.
- ✓Shadow mode is non-optional — it catches failures your test set did not cover, before they affect users.