LLM Foundations/Fine-Tuning
Advanced20 min

Training Data Engineering

A decision-first guide to building training datasets that actually improve fine-tuned models. Covers when to invest, correct data formats for SFT/DPO/RFT, annotation quality, cleaning pipelines, synthetic generation with cost math, distribution engineering, and the failure modes that kill fine-tuning projects before training even starts.

Quick Reference

  • Try prompt engineering and few-shot first — fine-tuning is rarely the bottleneck
  • OpenAI supports three fine-tuning methods: SFT, DPO, and RFT — each needs a different data format
  • LIMA showed 1,000 curated examples can match RLHF-trained models — quality wins over volume
  • Deduplication is table stakes: exact hash + MinHash (Jaccard ≥ 0.8) at minimum
  • Synthetic data from GPT-5.4 or Claude Opus 4.7 costs real money — compute the budget before starting
  • Inter-annotator agreement below κ = 0.7 means your labels are noise, not signal
  • The data↔eval loop, not data collection, is where fine-tuning projects succeed or fail
  • Always hold out 10-20% for validation — never train on your test set

When to Invest in Training Data (and When Not To)

Before spending weeks on data engineering, run this decision gate: have you exhausted prompting? System prompt improvements, few-shot examples, chain-of-thought, and structured output constraints each solve a distinct class of problems faster and cheaper than fine-tuning. Fine-tuning earns its cost when you need style/format that can't be prompted, consistent behavior across millions of calls without per-call token overhead, or capabilities that require dozens of in-context examples to work reliably.

SituationRight toolWhy not fine-tuning
Model ignores output formatStructured output / system promptFormat is a prompting problem
Model occasionally hallucinates factsRAG + groundingData won't fix hallucination — retrieval will
Model needs consistent brand voiceFine-tuning SFTStyle can't be prompted at scale
Model needs to classify 30 categoriesFine-tuning SFTFew-shot runs out at ~10 categories reliably
Model makes errors on edge casesEval-driven prompt improvement firstEdge cases need eval, not more data
Model is too slow / expensive per callModel routing or smaller modelFine-tuning doesn't reduce latency
Fine-tuning amplifies existing problems

Fine-tuning steers capabilities that already exist from pretraining. If the base model can't reliably do the task at all (even with 10 good few-shot examples), more training data won't fix it — you need a different base model or a different approach.

Real project

A support team spent three weeks curating 2,000 ticket classification examples and fine-tuned gpt-4.1-mini. F1 improved from 0.71 to 0.74. They then spent two days rewriting the system prompt with 8 few-shot examples and reached 0.79. The fine-tuning wasn't wrong — it was premature. Prompt engineering should have come first.

Learn this in → Always benchmark a well-prompted base model before starting data collection.

Data Formats: SFT, DPO, and RFT

OpenAI currently supports three fine-tuning methods, each requiring a different data structure. SFT (supervised fine-tuning) trains on fixed input/output pairs. DPO (direct preference optimization) trains on preference pairs — a preferred and a non-preferred response to the same prompt. RFT (reinforcement fine-tuning) trains using a grader function that scores each response — no fixed labels needed. The methods are complementary: SFT first, then DPO to refine tone/preference, then RFT for tasks with computable correctness.

SFTSupervised Fine-Tuningmessages [ ] system: … user: … assistant: … ★Fixed label — train on assistant turnDPODirect Preference Opt.input { } messages [ ]preferred_output ★non_preferred_outputPair of outputs — model learns preferenceRFTReinforcement Fine-Tuningmessages [ ] user: …grader_fn( ) ★ → score: 0–1No fixed label — grader scores the output★ = the signal the model learns from

SFT trains on fixed labels · DPO trains on preference pairs · RFT trains on a grader score

SFT format — OpenAI JSONL (one JSON per line)
DPO format — OpenAI API format (not the HuggingFace format)
RFT format — grader function scores the response (no fixed label)
Which models support fine-tuning (April 2026)

OpenAI SFT/DPO: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano. OpenAI RFT: o4-mini (o3 in private preview). GPT-5.4 does NOT support fine-tuning — use it as a teacher model for synthetic data generation, not as a student model. Anthropic fine-tuning is in limited access; check platform.claude.com for current status.

How Much Data Do You Actually Need?

The LIMA paper (Meta AI, 2023) showed that a 65B model fine-tuned on 1,000 carefully curated examples was preferred over GPT-4 in 43% of head-to-head comparisons, and over InstructGPT (DaVinci003) in 65% of comparisons. The key insight: LLMs already have strong capabilities from pretraining. Fine-tuning steers those capabilities toward your task format — it doesn't build new knowledge from scratch. This means a small number of high-quality demonstrations can shift behavior dramatically, while large amounts of mediocre data can degrade it.

Starting pointTarget qualityRough range
Narrow extraction / classificationExpert-annotated100–500 examples
Complex task, specific styleExpert-annotated500–2,000 examples
Multi-step workflow with edge casesHigh quality + synthetic augmentation2,000–10,000 examples
Domain-specific chat / multi-turnHigh quality, diverse1,000–5,000 examples
Large-scale behavior shiftRigorous curation + filtering pipeline10,000+ (only if quality holds)
When more data helps vs. hurts

More data helps when it adds coverage of genuinely different patterns. More data hurts when it adds near-duplicates, inconsistent labels, or edge cases that contradict each other. Before adding data, measure whether new examples are meaningfully different from what you have. If 80% of new examples are paraphrases of existing ones, stop collecting.

Annotation Quality and Label Consistency

The section most fine-tuning guides skip. You can have 10,000 examples with systematically wrong labels and never know it until your fine-tuned model performs worse than the base model. Annotation quality has two dimensions: accuracy (are the labels correct?) and consistency (do different annotators agree?). Both matter. A label that's 80% accurate and 95% consistent is better for training than one that's 95% accurate but 70% consistent — inconsistency is random noise; accuracy errors at least have a pattern.

Measuring inter-annotator agreement with Cohen's kappa
Build annotation guidelines before you annotate

The most common annotation failure: collecting 1,000 examples across a week, then discovering annotators had different interpretations of 'urgency: high'. Write explicit decision rules with worked examples before starting. For each category, provide: a definition, 2 clear positive examples, 2 clear negative examples, and rules for ambiguous edge cases. Annotator disagreements that cluster on specific categories reveal missing guidance.

Finding systematic disagreements to improve annotation guidelines

Data Cleaning Pipeline

Raw data almost always contains exact duplicates, near-duplicates, length outliers, and structurally malformed examples. Cleaning is not optional — exact duplicates teach the model to memorize specific responses rather than generalize, and near-duplicates (paraphrased versions of the same example) create implicit oversampling that biases the model toward common patterns.

Complete cleaning pipeline with near-duplicate detection
Install datasketch for near-duplicate detection

pip install datasketch. The MinHashLSH approach finds examples where >80% of words overlap — these are the near-duplicates that exact hashing misses. A Jaccard threshold of 0.8 is a reasonable default; lower it to 0.7 for shorter responses where small differences matter less.

Synthetic Data Generation

When you have fewer than 50 real examples, you can bootstrap a dataset using a frontier model (GPT-5.4, Claude Opus 4.7, or Gemini 3.1 Pro) as a teacher to generate examples for a smaller model to learn from. This is knowledge distillation at the data level. The workflow is: collect a small set of gold examples → use the teacher to generate variations → filter synthetic examples for quality → validate a sample by hand → combine real + validated synthetic for training.

Synthetic data generation with cost tracking
Validate synthetic data before training

Never train on raw synthetic output. Sample 10% of synthetic examples for manual review. Check: (1) does the format match exactly? (2) are the outputs actually correct? (3) are the examples diverse or mostly paraphrases? A 90% pass rate on manual review is the bar for including synthetic data. Below that, fix the generation prompt first.

Synthetic data licensing (OpenAI, Jan 2026)

OpenAI's Services Agreement restricts using their outputs to develop AI models that compete with OpenAI's products. The restriction applies to the user who generates the data — not to the data itself as an object. If you use GPT-5.4 to generate training data for gpt-4.1-mini (an OpenAI model), this is explicitly supported. Using it to train a competing external model is restricted. Check current terms at openai.com/policies — they update regularly.

Real project

A team used Claude Opus 4.7 to generate 3,000 contract clause examples for a legal extraction task (50 real examples as seeds). Without validation, synthetic accuracy was 71%. After filtering to only examples where Claude's output matched a rule-based validator, accuracy on the filtered 1,800 examples jumped to 89%. The 1,200 filtered-out examples would have actively degraded the fine-tuned model.

Learn this in → Validate synthetic examples against whatever verifiable ground truth you have — even a simple rule-based check — before including them.

Distribution Engineering

A dataset that only covers common cases produces a model that fails on uncommon production inputs. 'Distribution engineering' means deliberately ensuring your training data covers the same distribution you'll see in production — not the distribution that's easiest to collect examples for. The failure mode is subtle: your fine-tuned model works well on 80% of inputs (the ones that match your training distribution) and fails unexpectedly on the other 20% (which look similar but aren't).

Analyzing training data distribution and detecting gaps
  • Length diversity: include very short (1 sentence) and long (3+ paragraphs) inputs — training on uniform lengths produces a model that degrades on outliers
  • Category coverage: every target class should have ≥ 30 examples before training; fewer than 20 means the model is guessing for that class
  • Difficulty mix: 60% clear-cut cases, 25% moderate difficulty, 15% genuinely hard edge cases — skipping hard cases produces a model that confidently fails on them
  • Negative examples: include cases where the correct output is 'not found', 'not applicable', or an empty result — models not trained on negatives over-predict positive labels
  • Production input representation: sample 100 real production inputs and compare their distribution to your training data — gaps here predict deployment failures
  • Temporal diversity: if your data spans months, include examples from different time periods — models trained only on recent data fail on older patterns still present in production

The Data↔Eval Feedback Loop

Fine-tuning is an iterative process, not a one-shot pipeline. The most common mistake is treating data collection and model training as sequential phases. In practice, you collect → train → evaluate → find failure modes → collect targeted examples for those failure modes → retrain. Your eval results tell you exactly what data to collect next. A training run without an eval harness is a data collection exercise with no feedback signal.

Raw Datalogs / exportsAnnotatehuman labelsCleandedup + filterAugmentsynthetic dataValidatedistribution checkSplittrain / val / testTrainSFT / DPO / RFTEvalmetrics + humanfailing examples → back to annotation queue

Training data is not a one-pass pipeline — eval failures feed back to annotation

Extracting failure cases from eval to drive next data collection round
Split strategy that prevents contamination

Hold out 15-20% of your data for validation before any training run — never train on your test set. When you collect new examples based on failure analysis, those examples go into training only, not the eval set. Letting eval-driven examples enter your test set inflates eval scores and hides real performance. Treat the test set as locked from the moment you start fine-tuning.

Failure Modes and Defenses

Fine-tuning projects fail in predictable ways. Most failures are data problems, not model problems. Understanding the specific failure modes lets you detect them before a bad training run costs you time and money.

Failure modeSymptomDefense
Train/test leakageEval scores look great; production is much worseLock test set before any data collection; never include eval examples in training
Distribution shiftModel works on training inputs, fails on production inputsSample real production inputs; compare distributions before training
Label noise from inconsistent annotationModel performance below base model on simple casesMeasure κ before training; retrain annotators or consolidate ambiguous labels
Synthetic data collapseModel outputs sound fluent but are systematically wrongValidate synthetic data with a rule-based checker; limit synthetic to < 60% of training set
Overfitting to training formatModel ignores minor input variations; brittle on paraphrasesAdd paraphrased versions of seed examples; include format variation deliberately
Catastrophic forgettingFine-tuned model is better on target task but worse on general tasksUse LoRA instead of full fine-tuning; evaluate on general benchmarks alongside task metrics
Synthetic data collapse is subtle and dangerous

When more than 60-70% of your training set is synthetic, the model starts learning the generation patterns of your teacher model rather than the underlying task. The output sounds fluent and confident but reflects the teacher's biases. Symptoms: high training accuracy, poor performance on real-world inputs that don't match the teacher's style. Defense: always keep real examples as the anchor (minimum 30-40% of training set), and validate synthetic examples against verifiable ground truth.

Real project

A team training a fine-tuned model for code review feedback got excellent eval scores (F1 = 0.91) but poor production performance. Root cause: their eval set was sampled from the same GitHub repos as their training data. When tested on repos from a different organization's codebase, F1 dropped to 0.67. The training and eval data had the same distribution; production did not. Fix: rebuild eval from production traffic, not from the same source as training data.

Learn this in → Your eval set must come from production traffic or be deliberately constructed to match it. Eval built from the same source as training data only measures memorization.

Best Practices

Best Practices

Do

  • Exhaust prompt engineering and few-shot examples before investing in data collection — fine-tuning is rarely the bottleneck
  • Measure inter-annotator agreement (Cohen's κ) before training; κ < 0.7 means your labels are noise
  • Write annotation guidelines with positive and negative worked examples before annotating a single example
  • Run exact-match deduplication (MD5 hash) plus near-duplicate detection (MinHash, Jaccard ≥ 0.8) as a pipeline step
  • Lock your test set before any data collection begins; never let eval-driven examples enter the test split
  • Validate 10% of synthetic examples by hand before including synthetic data in training
  • Track cost per synthetic example — 500 GPT-5.4 generated examples costs roughly $4–8; know your budget
  • Analyze category distribution, length distribution, and difficulty mix before training — gaps predict deployment failures
  • Use eval failures to drive the next round of data collection — the data↔eval loop is the job
  • Keep real human-labeled examples as ≥ 30% of your training set even when using synthetic augmentation

Don’t

  • Don't start data collection before writing annotation guidelines — you will get inconsistent labels
  • Don't use the wrong DPO format: OpenAI uses input/preferred_output, not prompt/chosen/rejected
  • Don't train on raw synthetic data without validation — fluent output from a teacher model is not the same as correct output
  • Don't let synthetic data exceed 60-70% of your training set — beyond this, models learn the teacher's biases, not the task
  • Don't skip near-duplicate removal — near-duplicates create implicit oversampling that biases models toward common patterns
  • Don't build your eval set from the same source as your training data — it only measures memorization
  • Don't use 'the more data the better' as a strategy — unfiltered large datasets actively degrade fine-tuned models
  • Don't fine-tune when RAG or a better system prompt would solve the problem faster
  • Don't skip the base model eval step — always benchmark a well-prompted base model before any fine-tuning
  • Don't assume fine-tuning worked without testing on real production inputs from a held-out time window

Key Takeaways

  • Prompt engineering and few-shot examples should precede fine-tuning — they solve most problems faster and cheaper.
  • OpenAI has three fine-tuning methods (SFT, DPO, RFT) with different data formats; using the wrong format is a silent failure.
  • LIMA established that 1,000 curated examples can match RLHF-trained models — quality of labels, not volume, determines fine-tuning success.
  • Inter-annotator agreement below κ = 0.7 means your training data is noise — measure it before any training run.
  • Synthetic data collapse happens when synthetic examples exceed 60-70% of training; always anchor with real human-labeled examples.
  • The data↔eval loop — using eval failures to drive targeted data collection — is what separates fine-tuning projects that ship from ones that stall.

Video on this topic

How to prepare training data for LLM fine-tuning

tiktok