
Evaluating Fine-Tuned Models

How to rigorously evaluate fine-tuned LLMs: train/validation/test splitting for LLMs, detecting overfitting and benchmark contamination, A/B testing fine-tuned vs base models with real users, and a complete evaluation harness implementation.

Quick Reference

  • Split data: 80% train, 10% validation (used during training), 10% test (never seen during training)
  • Overfitting: model memorizes training data -- high train accuracy, low test accuracy
  • Contamination: training data overlaps with evaluation data, giving inflated scores
  • Evaluate on: task accuracy, format compliance, latency, cost, and regression on general tasks
  • A/B test with real users for final validation -- automated metrics do not capture everything
  • Keep the test set strictly separated -- never use it for training decisions

Train/Validation/Test Split for LLMs

Data splitting for LLM fine-tuning follows the same principles as traditional ML, but with some important nuances. The key is maintaining strict separation between data used for training, data used for hyperparameter decisions, and data used for final evaluation.

  • Train (80% of data) -- purpose: update model weights; used during training; seen by the model as it trains.
  • Validation (10%) -- purpose: monitor overfitting, early stopping, hyperparameter tuning; used at evaluation steps during training; informs training decisions but never updates weights.
  • Test (10%) -- purpose: final quality assessment and model comparison; used only after training is complete; never seen during training.
Proper data splitting with stratification
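A stratified split keeps each category's proportion intact in every split, so no split is accidentally skewed toward one task type. A minimal sketch (the `key` field name, the seeded shuffle, and the 80/10/10 ratios are illustrative choices, not a prescribed implementation):

```python
import random
from collections import defaultdict

def stratified_split(examples, key, seed=42, train=0.8, val=0.1):
    """Split examples ~80/10/10 while preserving each category's share.

    `examples` is a list of dicts; `key` names the field to stratify on
    (e.g. intent, domain, or difficulty).
    """
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex[key]].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for items in by_cat.values():
        rng.shuffle(items)
        n_train = int(len(items) * train)
        n_val = int(len(items) * val)
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder, never trained on
    return splits
```

Because the shuffle is seeded, re-running the split yields the same test set -- important when you compare models over time.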
The test set is sacred

Never make training decisions based on test set performance. The test set is for final reporting only. If you tune hyperparameters based on test results, you are effectively training on the test set through your decisions. Use the validation set for all training-time decisions (early stopping, learning rate, epoch count).

Detecting Overfitting

Overfitting in LLM fine-tuning manifests differently than in traditional ML. The model may perfectly reproduce training examples while failing to generalize to new inputs. The signs are subtle but detectable.

  • Loss gap: training loss continues to decrease while validation loss increases or plateaus
  • Memorization: the model reproduces training examples verbatim instead of learning the pattern
  • Brittleness: high accuracy on test cases similar to training data, low accuracy on novel inputs
  • Repetitive outputs: the model generates the same phrases or structures regardless of input
  • Loss of general capability: the model becomes worse at tasks unrelated to the fine-tuning objective
Detecting memorization vs generalization
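One simple memorization signal is verbatim n-gram overlap: if a model's output on a held-out prompt copies long word sequences straight from a training completion, it has likely memorized rather than generalized. A minimal sketch (the 8-word n-gram size is an illustrative heuristic, not a standard):

```python
def ngram_set(text, n=8):
    """All n-word sequences in a text, as a set of tuples."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def memorization_score(output, train_completion, n=8):
    """Fraction of the output's n-grams copied verbatim from a training completion.

    Near 1.0 suggests memorization; near 0.0 suggests the model rephrased.
    """
    out_grams = ngram_set(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngram_set(train_completion, n)) / len(out_grams)
```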
The temperature test

A quick overfitting check: run the model at temperature 0 and temperature 0.7 on test examples. If outputs are nearly identical regardless of temperature, the model may have memorized rather than learned. A well-generalized model shows natural variation at higher temperatures while maintaining accuracy.
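The temperature test above can be sketched as a small loop: generate at temperature 0 and 0.7, measure string similarity, and flag prompts where sampling barely changes the output. Here `generate(prompt, temperature)` is an assumed wrapper around your model API, and the 0.95 similarity threshold is an illustrative assumption:

```python
import difflib

def temperature_test(generate, prompts, threshold=0.95):
    """Flag prompts where temperature 0.7 output is near-identical to greedy output.

    `generate(prompt, temperature)` is an assumed wrapper around your model.
    Many flagged prompts suggest memorization rather than generalization.
    """
    flagged = []
    for prompt in prompts:
        greedy = generate(prompt, temperature=0.0)
        sampled = generate(prompt, temperature=0.7)
        similarity = difflib.SequenceMatcher(None, greedy, sampled).ratio()
        if similarity >= threshold:
            flagged.append(prompt)
    return flagged
```

In practice you would sample several completions per prompt at the higher temperature; a single sample keeps the sketch short.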

Benchmark Contamination in Fine-Tuning

If your training data accidentally overlaps with your test data, evaluation results will be inflated. This is especially insidious when using synthetic data, because the teacher model might generate examples that are very similar to your test cases.

Detecting train/test contamination
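A common contamination check compares word n-grams between the training corpus and each test example, flagging test cases that share too much verbatim text with training data. A minimal sketch (the 13-word n-gram size echoes a widely used heuristic, and the 0.3 overlap threshold is an illustrative assumption):

```python
def word_ngrams(text, n=13):
    """All lowercased n-word sequences in a text, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(train_texts, test_texts, n=13, threshold=0.3):
    """Return test examples whose n-gram overlap with the training corpus
    exceeds `threshold` -- candidates to remove before evaluating."""
    corpus = set()
    for text in train_texts:
        corpus |= word_ngrams(text, n)

    contaminated = []
    for text in test_texts:
        grams = word_ngrams(text, n)
        if grams and len(grams & corpus) / len(grams) >= threshold:
            contaminated.append(text)
    return contaminated
```

Exact n-gram matching misses paraphrased duplicates; for synthetic data, an embedding-similarity pass on top of this is a reasonable addition.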
Synthetic data contamination

When you generate synthetic training data from a teacher model, the teacher might produce examples similar to your test set (especially if both are generated from similar prompts). Always run contamination checks between synthetic training data and your test set.

A/B Testing with Real Users

Automated metrics tell you whether the fine-tuned model is better on your test set. But real-world performance depends on factors automated metrics cannot capture: user satisfaction, task completion rate, and handling of inputs your test set did not cover.

  • Route 10-20% of production traffic to the fine-tuned model, 80-90% to the current model
  • Measure task-specific success metrics: completion rate, user corrections, escalation rate
  • Measure user satisfaction: thumbs up/down, CSAT scores, time-to-resolution
  • Run for at least 1 week to capture day-of-week and time-of-day patterns
  • Monitor for failure modes not covered by your test set -- production inputs are more diverse
  • Have a kill switch: immediately route all traffic back to the base model if quality degrades
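The traffic-routing step above needs to be sticky: a given user should always hit the same variant, or their experience flip-flops mid-conversation. A minimal hash-based sketch (the salt string and 15% treatment fraction are illustrative assumptions):

```python
import hashlib

def assign_variant(user_id, treatment_fraction=0.15, salt="ft-ab-test"):
    """Deterministically route a user to 'fine_tuned' or 'base'.

    Hashing (salt, user_id) gives a stable bucket, so the same user always
    sees the same model; changing the salt reshuffles all assignments.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "fine_tuned" if bucket < treatment_fraction * 10_000 else "base"
```

A kill switch then amounts to setting `treatment_fraction` to 0 in config, which routes everyone back to the base model on the next request.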
Shadow mode before A/B testing

Before routing real user traffic, run the fine-tuned model in shadow mode: it receives the same inputs as the production model but its outputs are logged, not shown to users. Compare shadow outputs against production outputs offline. This catches catastrophic failures without affecting users.

Complete Evaluation Harness

Comprehensive evaluation comparing base vs fine-tuned model
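One way such a harness might look, as a minimal sketch: score any `generate(prompt)` callable (an assumed wrapper around a model) on task accuracy, format compliance, and latency, then run base and fine-tuned models over the same test set. The JSON `{"label": ...}` output schema is an assumption for illustration:

```python
import json
import time

def evaluate(generate, test_cases):
    """Score one model on accuracy, JSON format compliance, and median latency.

    Each test case is {"prompt": ..., "expected": ...}; outputs are expected
    to be JSON objects with a "label" field (illustrative schema).
    """
    correct = fmt_ok = 0
    latencies = []
    for case in test_cases:
        start = time.perf_counter()
        output = generate(case["prompt"])
        latencies.append(time.perf_counter() - start)
        try:
            parsed = json.loads(output)
            fmt_ok += 1  # parseable output counts as format-compliant
            if parsed.get("label") == case["expected"]:
                correct += 1
        except json.JSONDecodeError:
            pass  # malformed output: format failure, and cannot be correct
    n = len(test_cases)
    return {
        "accuracy": correct / n,
        "format_compliance": fmt_ok / n,
        "p50_latency_s": sorted(latencies)[n // 2],
    }

def compare(base_generate, ft_generate, test_cases):
    """Run both models over the identical test set and return paired metrics."""
    return {
        "base": evaluate(base_generate, test_cases),
        "fine_tuned": evaluate(ft_generate, test_cases),
    }
```

A production harness would add cost tracking, multiple runs to average out sampling variance, and a general-capability suite to catch regressions.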
Automate the evaluation pipeline

Wrap this evaluation in a CI/CD pipeline that runs automatically when you produce a new fine-tuned model. The pipeline should: (1) load the model, (2) run the test suite, (3) compare against the baseline, (4) produce a report, (5) block deployment if accuracy is below threshold or regressions are detected.
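Step (5), the deployment gate, might be sketched as a pure function over the metric reports, so the CI job can fail on its result. The 0.85 accuracy floor, 0.02 regression tolerance, and metric names are illustrative assumptions:

```python
def deployment_gate(baseline, candidate, min_accuracy=0.85, max_regression=0.02):
    """Decide whether a candidate model may deploy.

    Blocks if accuracy is below the floor, or if any tracked metric drops
    more than `max_regression` relative to the baseline model.
    """
    failures = []
    if candidate["accuracy"] < min_accuracy:
        failures.append(f"accuracy {candidate['accuracy']:.2f} below {min_accuracy}")
    for metric in ("accuracy", "format_compliance", "general_task_score"):
        if baseline.get(metric, 0.0) - candidate.get(metric, 0.0) > max_regression:
            failures.append(f"regression on {metric}")
    return {"deploy": not failures, "failures": failures}
```

The CI job then exits nonzero when `deploy` is false, which blocks the release and surfaces the failure list in the report.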

Best Practices

Do

  • Maintain strict separation between train, validation, and test data
  • Run contamination checks between all data splits before evaluating
  • Evaluate on multiple dimensions: task accuracy, format compliance, latency, cost, general capability
  • Compare fine-tuned model against prompt-engineered baseline on the same test set
  • Use shadow mode and A/B testing with real users before full production deployment

Don’t

  • Don't make training decisions (epochs, learning rate) based on test set performance
  • Don't skip the contamination check -- overlapping train/test data gives meaninglessly inflated scores
  • Don't evaluate only on accuracy -- format compliance, latency, and cost matter in production
  • Don't trust a single evaluation run -- LLM outputs have variance even at temperature 0
  • Don't deploy a fine-tuned model without checking for regressions on general tasks

Key Takeaways

  • Strict train/validation/test separation is non-negotiable -- the test set must never influence training decisions.
  • Check for contamination between all splits, especially when using synthetic training data.
  • Overfitting manifests as: high train accuracy + low test accuracy, verbatim memorization, and loss of general capability.
  • Evaluate on multiple dimensions: accuracy, format compliance, latency, cost, and regression on general tasks.
  • Shadow mode and A/B testing with real users are the final validation before full deployment.

Video on this topic

How to tell if your fine-tuned model is actually better
