Training Data Engineering
A decision-first guide to building training datasets that actually improve fine-tuned models. Covers when to invest, correct data formats for SFT/DPO/RFT, annotation quality, cleaning pipelines, synthetic generation with cost math, distribution engineering, and the failure modes that kill fine-tuning projects before training even starts.
Quick Reference
- →Try prompt engineering and few-shot first — fine-tuning is rarely the bottleneck
- →OpenAI supports three fine-tuning methods: SFT, DPO, and RFT — each needs a different data format
- →LIMA showed 1,000 curated examples can match RLHF-trained models — quality wins over volume
- →Deduplication is table stakes: exact hash + MinHash (Jaccard ≥ 0.8) at minimum
- →Synthetic data from GPT-5.4 or Claude Opus 4.7 costs real money — compute the budget before starting
- →Inter-annotator agreement below κ = 0.7 means your labels are noise, not signal
- →The data↔eval loop, not data collection, is where fine-tuning projects succeed or fail
- →Always hold out 10-20% for validation — never train on your test set
In this article
- 1.When to Invest in Training Data (and When Not To)
- 2.Data Formats: SFT, DPO, and RFT
- 3.How Much Data Do You Actually Need?
- 4.Annotation Quality and Label Consistency
- 5.Data Cleaning Pipeline
- 6.Synthetic Data Generation
- 7.Distribution Engineering
- 8.The Data↔Eval Feedback Loop
- 9.Failure Modes and Defenses
- ★Best Practices
- ✓Key Takeaways
When to Invest in Training Data (and When Not To)
Before spending weeks on data engineering, run this decision gate: have you exhausted prompting? System prompt improvements, few-shot examples, chain-of-thought, and structured output constraints each solve a distinct class of problems faster and cheaper than fine-tuning. Fine-tuning earns its cost when you need style/format that can't be prompted, consistent behavior across millions of calls without per-call token overhead, or capabilities that require dozens of in-context examples to work reliably.
| Situation | Right tool | Why not fine-tuning |
|---|---|---|
| Model ignores output format | Structured output / system prompt | Format is a prompting problem |
| Model occasionally hallucinates facts | RAG + grounding | Data won't fix hallucination — retrieval will |
| Model needs consistent brand voice | Fine-tuning SFT | Style can't be prompted at scale |
| Model needs to classify 30 categories | Fine-tuning SFT | Few-shot runs out at ~10 categories reliably |
| Model makes errors on edge cases | Eval-driven prompt improvement first | Edge cases need eval, not more data |
| Model is too slow / expensive per call | Model routing or smaller model | Fine-tuning doesn't reduce latency |
Fine-tuning steers capabilities that already exist from pretraining. If the base model can't reliably do the task at all (even with 10 good few-shot examples), more training data won't fix it — you need a different base model or a different approach.
A support team spent three weeks curating 2,000 ticket classification examples and fine-tuned gpt-4.1-mini. F1 improved from 0.71 to 0.74. They then spent two days rewriting the system prompt with 8 few-shot examples and reached 0.79. The fine-tuning wasn't wrong — it was premature. Prompt engineering should have come first.
Learn this in → Always benchmark a well-prompted base model before starting data collection.
Data Formats: SFT, DPO, and RFT
OpenAI currently supports three fine-tuning methods, each requiring a different data structure. SFT (supervised fine-tuning) trains on fixed input/output pairs. DPO (direct preference optimization) trains on preference pairs — a preferred and a non-preferred response to the same prompt. RFT (reinforcement fine-tuning) trains using a grader function that scores each response — no fixed labels needed. The methods are complementary: SFT first, then DPO to refine tone/preference, then RFT for tasks with computable correctness.
SFT trains on fixed labels · DPO trains on preference pairs · RFT trains on a grader score
OpenAI SFT/DPO: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano. OpenAI RFT: o4-mini (o3 in private preview). GPT-5.4 does NOT support fine-tuning — use it as a teacher model for synthetic data generation, not as a student model. Anthropic fine-tuning is in limited access; check platform.claude.com for current status.
How Much Data Do You Actually Need?
The LIMA paper (Meta AI, 2023) showed that a 65B model fine-tuned on 1,000 carefully curated examples was preferred over GPT-4 in 43% of head-to-head comparisons, and over InstructGPT (DaVinci003) in 65% of comparisons. The key insight: LLMs already have strong capabilities from pretraining. Fine-tuning steers those capabilities toward your task format — it doesn't build new knowledge from scratch. This means a small number of high-quality demonstrations can shift behavior dramatically, while large amounts of mediocre data can degrade it.
| Starting point | Target quality | Rough range |
|---|---|---|
| Narrow extraction / classification | Expert-annotated | 100–500 examples |
| Complex task, specific style | Expert-annotated | 500–2,000 examples |
| Multi-step workflow with edge cases | High quality + synthetic augmentation | 2,000–10,000 examples |
| Domain-specific chat / multi-turn | High quality, diverse | 1,000–5,000 examples |
| Large-scale behavior shift | Rigorous curation + filtering pipeline | 10,000+ (only if quality holds) |
More data helps when it adds coverage of genuinely different patterns. More data hurts when it adds near-duplicates, inconsistent labels, or edge cases that contradict each other. Before adding data, measure whether new examples are meaningfully different from what you have. If 80% of new examples are paraphrases of existing ones, stop collecting.
Annotation Quality and Label Consistency
The section most fine-tuning guides skip. You can have 10,000 examples with systematically wrong labels and never know it until your fine-tuned model performs worse than the base model. Annotation quality has two dimensions: accuracy (are the labels correct?) and consistency (do different annotators agree?). Both matter. A label that's 80% accurate and 95% consistent is better for training than one that's 95% accurate but 70% consistent — inconsistency is random noise; accuracy errors at least have a pattern.
The most common annotation failure: collecting 1,000 examples across a week, then discovering annotators had different interpretations of 'urgency: high'. Write explicit decision rules with worked examples before starting. For each category, provide: a definition, 2 clear positive examples, 2 clear negative examples, and rules for ambiguous edge cases. Annotator disagreements that cluster on specific categories reveal missing guidance.
Data Cleaning Pipeline
Raw data almost always contains exact duplicates, near-duplicates, length outliers, and structurally malformed examples. Cleaning is not optional — exact duplicates teach the model to memorize specific responses rather than generalize, and near-duplicates (paraphrased versions of the same example) create implicit oversampling that biases the model toward common patterns.
pip install datasketch. The MinHashLSH approach finds examples where >80% of words overlap — these are the near-duplicates that exact hashing misses. A Jaccard threshold of 0.8 is a reasonable default; lower it to 0.7 for shorter responses where small differences matter less.
Synthetic Data Generation
When you have fewer than 50 real examples, you can bootstrap a dataset using a frontier model (GPT-5.4, Claude Opus 4.7, or Gemini 3.1 Pro) as a teacher to generate examples for a smaller model to learn from. This is knowledge distillation at the data level. The workflow is: collect a small set of gold examples → use the teacher to generate variations → filter synthetic examples for quality → validate a sample by hand → combine real + validated synthetic for training.
Never train on raw synthetic output. Sample 10% of synthetic examples for manual review. Check: (1) does the format match exactly? (2) are the outputs actually correct? (3) are the examples diverse or mostly paraphrases? A 90% pass rate on manual review is the bar for including synthetic data. Below that, fix the generation prompt first.
OpenAI's Services Agreement restricts using their outputs to develop AI models that compete with OpenAI's products. The restriction applies to the user who generates the data — not to the data itself as an object. If you use GPT-5.4 to generate training data for gpt-4.1-mini (an OpenAI model), this is explicitly supported. Using it to train a competing external model is restricted. Check current terms at openai.com/policies — they update regularly.
A team used Claude Opus 4.7 to generate 3,000 contract clause examples for a legal extraction task (50 real examples as seeds). Without validation, synthetic accuracy was 71%. After filtering to only examples where Claude's output matched a rule-based validator, accuracy on the filtered 1,800 examples jumped to 89%. The 1,200 filtered-out examples would have actively degraded the fine-tuned model.
Learn this in → Validate synthetic examples against whatever verifiable ground truth you have — even a simple rule-based check — before including them.
Distribution Engineering
A dataset that only covers common cases produces a model that fails on uncommon production inputs. 'Distribution engineering' means deliberately ensuring your training data covers the same distribution you'll see in production — not the distribution that's easiest to collect examples for. The failure mode is subtle: your fine-tuned model works well on 80% of inputs (the ones that match your training distribution) and fails unexpectedly on the other 20% (which look similar but aren't).
- ▸Length diversity: include very short (1 sentence) and long (3+ paragraphs) inputs — training on uniform lengths produces a model that degrades on outliers
- ▸Category coverage: every target class should have ≥ 30 examples before training; fewer than 20 means the model is guessing for that class
- ▸Difficulty mix: 60% clear-cut cases, 25% moderate difficulty, 15% genuinely hard edge cases — skipping hard cases produces a model that confidently fails on them
- ▸Negative examples: include cases where the correct output is 'not found', 'not applicable', or an empty result — models not trained on negatives over-predict positive labels
- ▸Production input representation: sample 100 real production inputs and compare their distribution to your training data — gaps here predict deployment failures
- ▸Temporal diversity: if your data spans months, include examples from different time periods — models trained only on recent data fail on older patterns still present in production
The Data↔Eval Feedback Loop
Fine-tuning is an iterative process, not a one-shot pipeline. The most common mistake is treating data collection and model training as sequential phases. In practice, you collect → train → evaluate → find failure modes → collect targeted examples for those failure modes → retrain. Your eval results tell you exactly what data to collect next. A training run without an eval harness is a data collection exercise with no feedback signal.
Training data is not a one-pass pipeline — eval failures feed back to annotation
Hold out 15-20% of your data for validation before any training run — never train on your test set. When you collect new examples based on failure analysis, those examples go into training only, not the eval set. Letting eval-driven examples enter your test set inflates eval scores and hides real performance. Treat the test set as locked from the moment you start fine-tuning.
Failure Modes and Defenses
Fine-tuning projects fail in predictable ways. Most failures are data problems, not model problems. Understanding the specific failure modes lets you detect them before a bad training run costs you time and money.
| Failure mode | Symptom | Defense |
|---|---|---|
| Train/test leakage | Eval scores look great; production is much worse | Lock test set before any data collection; never include eval examples in training |
| Distribution shift | Model works on training inputs, fails on production inputs | Sample real production inputs; compare distributions before training |
| Label noise from inconsistent annotation | Model performance below base model on simple cases | Measure κ before training; retrain annotators or consolidate ambiguous labels |
| Synthetic data collapse | Model outputs sound fluent but are systematically wrong | Validate synthetic data with a rule-based checker; limit synthetic to < 60% of training set |
| Overfitting to training format | Model ignores minor input variations; brittle on paraphrases | Add paraphrased versions of seed examples; include format variation deliberately |
| Catastrophic forgetting | Fine-tuned model is better on target task but worse on general tasks | Use LoRA instead of full fine-tuning; evaluate on general benchmarks alongside task metrics |
When more than 60-70% of your training set is synthetic, the model starts learning the generation patterns of your teacher model rather than the underlying task. The output sounds fluent and confident but reflects the teacher's biases. Symptoms: high training accuracy, poor performance on real-world inputs that don't match the teacher's style. Defense: always keep real examples as the anchor (minimum 30-40% of training set), and validate synthetic examples against verifiable ground truth.
A team training a fine-tuned model for code review feedback got excellent eval scores (F1 = 0.91) but poor production performance. Root cause: their eval set was sampled from the same GitHub repos as their training data. When tested on repos from a different organization's codebase, F1 dropped to 0.67. The training and eval data had the same distribution; production did not. Fix: rebuild eval from production traffic, not from the same source as training data.
Learn this in → Your eval set must come from production traffic or be deliberately constructed to match it. Eval built from the same source as training data only measures memorization.
Best Practices
Do
- ✓Exhaust prompt engineering and few-shot examples before investing in data collection — fine-tuning is rarely the bottleneck
- ✓Measure inter-annotator agreement (Cohen's κ) before training; κ < 0.7 means your labels are noise
- ✓Write annotation guidelines with positive and negative worked examples before annotating a single example
- ✓Run exact-match deduplication (MD5 hash) plus near-duplicate detection (MinHash, Jaccard ≥ 0.8) as a pipeline step
- ✓Lock your test set before any data collection begins; never let eval-driven examples enter the test split
- ✓Validate 10% of synthetic examples by hand before including synthetic data in training
- ✓Track cost per synthetic example — 500 GPT-5.4 generated examples costs roughly $4–8; know your budget
- ✓Analyze category distribution, length distribution, and difficulty mix before training — gaps predict deployment failures
- ✓Use eval failures to drive the next round of data collection — the data↔eval loop is the job
- ✓Keep real human-labeled examples as ≥ 30% of your training set even when using synthetic augmentation
Don’t
- ✗Don't start data collection before writing annotation guidelines — you will get inconsistent labels
- ✗Don't use the wrong DPO format: OpenAI uses input/preferred_output, not prompt/chosen/rejected
- ✗Don't train on raw synthetic data without validation — fluent output from a teacher model is not the same as correct output
- ✗Don't let synthetic data exceed 60-70% of your training set — beyond this, models learn the teacher's biases, not the task
- ✗Don't skip near-duplicate removal — near-duplicates create implicit oversampling that biases models toward common patterns
- ✗Don't build your eval set from the same source as your training data — it only measures memorization
- ✗Don't use 'the more data the better' as a strategy — unfiltered large datasets actively degrade fine-tuned models
- ✗Don't fine-tune when RAG or a better system prompt would solve the problem faster
- ✗Don't skip the base model eval step — always benchmark a well-prompted base model before any fine-tuning
- ✗Don't assume fine-tuning worked without testing on real production inputs from a held-out time window
Key Takeaways
- ✓Prompt engineering and few-shot examples should precede fine-tuning — they solve most problems faster and cheaper.
- ✓OpenAI has three fine-tuning methods (SFT, DPO, RFT) with different data formats; using the wrong format is a silent failure.
- ✓LIMA established that 1,000 curated examples can match RLHF-trained models — quality of labels, not volume, determines fine-tuning success.
- ✓Inter-annotator agreement below κ = 0.7 means your training data is noise — measure it before any training run.
- ✓Synthetic data collapse happens when synthetic examples exceed 60-70% of training; always anchor with real human-labeled examples.
- ✓The data↔eval loop — using eval failures to drive targeted data collection — is what separates fine-tuning projects that ship from ones that stall.
Video on this topic
How to prepare training data for LLM fine-tuning
tiktok