Training Data Engineering
How to prepare high-quality training data for LLM fine-tuning. Covers data formats, quality-over-quantity principles, data cleaning and deduplication, synthetic data generation, and a complete data preparation pipeline.
Quick Reference
- Quality > quantity: 1000 excellent examples beat 100K mediocre ones for most fine-tuning tasks
- Data format: instruction/response pairs, multi-turn conversations, or preference pairs (DPO)
- Deduplication removes 5-20% of typical datasets and significantly improves model quality
- Synthetic data: use a stronger model to generate training data, then human-filter for quality
- Data diversity matters: cover all input categories, lengths, and edge cases in your training set
- Always hold out 10-20% for validation -- never train on your test set
Quality Over Quantity
The most counterintuitive finding in LLM fine-tuning is that data quality matters far more than quantity. Research from LIMA (Less Is More for Alignment) showed that 1,000 carefully curated examples can produce a model competitive with ones trained on 50,000+ examples. This is because LLMs already have strong capabilities from pretraining -- fine-tuning just needs to steer those capabilities, not build them from scratch.
| Dataset size (examples) | Quality level | Typical outcome |
|---|---|---|
| 50-200 | Expert-curated | Reasonable for narrow tasks (classification, simple extraction) |
| 200-1000 | Expert-curated | Sweet spot for most fine-tuning tasks |
| 1000-5000 | High quality | Excellent for complex tasks, style transfer, multi-step workflows |
| 5000-50000 | Mixed quality | Only if filtered aggressively -- unfiltered large sets often hurt quality |
| 50000+ | Any quality | Rarely needed. Risk of overfitting to noise. Data cleaning is critical |
Spend 80% of your data preparation time on quality and 20% on quantity. One engineer spending 2 days curating 500 perfect examples will produce a better fine-tuned model than a team spending 2 weeks scraping and cleaning 50,000 examples.
Data Formats for Fine-Tuning
Different fine-tuning approaches require different data formats. The most common are instruction/response pairs (supervised fine-tuning) and preference pairs (RLHF/DPO).
For Hugging Face fine-tuning, data can be in any format your training script can parse. Common options: JSONL with 'instruction'/'output' fields, or a Hugging Face Dataset with a 'text' column in chat-template format. The Alpaca format (instruction, input, output) is widely supported.
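The formats above can be sketched as JSONL records. A minimal sketch in Python -- the field names follow the Alpaca, chat-message, and DPO conventions mentioned above, while the example content is invented for illustration:

```python
import json

# Supervised fine-tuning: instruction/response pair (Alpaca-style fields)
sft_example = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "input": "Customer reports that exported CSV files are missing the header row.",
    "output": "A customer's CSV exports lack the header row.",
}

# Multi-turn conversation as role/content messages (chat-template input)
chat_example = {
    "messages": [
        {"role": "user", "content": "What does HTTP 429 mean?"},
        {"role": "assistant", "content": "The server is rate-limiting you: too many requests."},
    ]
}

# Preference pair for DPO: same prompt, a chosen and a rejected response
dpo_example = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": "Recursion is when a function solves a problem by calling itself on a smaller piece of it.",
    "rejected": "Recursion is complicated.",
}

# In practice each format lives in its own file; one file here for brevity.
with open("train.jsonl", "w") as f:
    for record in (sft_example, chat_example, dpo_example):
        f.write(json.dumps(record) + "\n")
```

Each line of the JSONL file is one self-contained JSON object, which is what most training scripts expect to stream.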
Data Cleaning Pipeline
Raw data almost always contains duplicates, low-quality examples, formatting inconsistencies, and outliers. A systematic cleaning pipeline is essential before fine-tuning.
Exact deduplication catches identical responses, but near-duplicates (slight rephrasing) are equally problematic. For thorough deduplication, use MinHash (datasketch library) with a Jaccard similarity threshold of 0.8. Near-duplicates typically represent another 5-10% of the dataset.
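A minimal, dependency-free sketch of near-duplicate filtering using character shingles and exact Jaccard similarity -- the O(n^2) comparison is fine for small sets, but for large datasets you would replace it with MinHashLSH from the datasketch library as mentioned above. The shingle size and threshold are illustrative defaults:

```python
def shingles(text: str, n: int = 3) -> set:
    """Character n-gram shingles of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(1, len(t) - n + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(examples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep an example only if no previously kept example is >= threshold similar.
    O(n^2) pairwise sketch -- swap in datasketch's MinHashLSH at scale."""
    kept, kept_shingles = [], []
    for ex in examples:
        s = shingles(ex)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(ex)
            kept_shingles.append(s)
    return kept
```

Because exact Jaccard over shingles is what MinHash approximates, the 0.8 threshold here corresponds directly to the MinHash threshold recommended above.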
Synthetic Data Generation
When you have limited real examples, you can use a stronger model (GPT-5.4, Claude Sonnet 4.6) to generate synthetic training data for a smaller model. This is the 'distillation' approach and is surprisingly effective when done correctly.
The reliable workflow is: (1) Collect 50-100 real, human-labeled examples. (2) Use GPT-5.4 to generate 1000-5000 synthetic examples seeded from the real ones. (3) Filter synthetic examples for quality (human review a sample, automated checks for all). (4) Combine real + validated synthetic data for training. The real examples anchor quality while synthetic data provides volume and diversity.
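The automated-check part of step (3) can be sketched as a filter over the synthetic examples. The specific checks and thresholds below are illustrative assumptions, not a fixed recipe -- length bounds, seed-regurgitation detection, and refusal boilerplate are common first-pass filters:

```python
def passes_checks(example: dict, seed_outputs: set,
                  min_len: int = 20, max_len: int = 2000) -> bool:
    """Automated quality filters for one synthetic example; thresholds are illustrative."""
    out = example.get("output", "").strip()
    if not (min_len <= len(out) <= max_len):
        return False  # too short to be useful, or suspiciously long
    if out in seed_outputs:
        return False  # teacher regurgitated a seed example verbatim
    if out.lower().startswith(("as an ai", "i cannot")):
        return False  # refusal boilerplate, not a real answer
    return True

def filter_synthetic(synthetic: list[dict], seeds: list[dict]) -> list[dict]:
    """Keep only synthetic examples that pass all automated checks."""
    seed_outputs = {s["output"].strip() for s in seeds}
    return [ex for ex in synthetic if passes_checks(ex, seed_outputs)]
```

Automated filters like these handle the bulk of the data; human review of a random sample then catches subtler quality problems the heuristics miss.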
Some model providers restrict using their outputs as training data for competing models. OpenAI's terms allow fine-tuning on their own platform but restrict training competing models. Check the terms of service before using synthetic data generated by model X to train model Y.
Ensuring Data Diversity
A training dataset that only covers common cases will produce a model that fails on edge cases. Deliberate diversity engineering ensures your fine-tuned model handles the full range of production inputs.
- Input length diversity: include very short (1 sentence) and very long (multiple paragraphs) inputs
- Category coverage: every category in your classification task should have at least 20 examples
- Language diversity: if your app serves multilingual users, include examples in all supported languages
- Difficulty diversity: mix easy, moderate, and hard examples (60/25/15 split)
- Negative examples: include cases where the correct output is 'not found', 'not applicable', or empty
- Adversarial examples: include deliberately tricky inputs that test the model's boundaries
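A quick distribution audit covering the first two checklist items can be sketched with the standard library. The 'category' and 'instruction' field names are assumptions for illustration -- adapt them to your schema:

```python
from collections import Counter

def audit(examples: list[dict]) -> dict:
    """Report category balance and input-length spread for a training set.
    Assumes each example dict has 'category' and 'instruction' fields."""
    cats = Counter(ex["category"] for ex in examples)
    lengths = sorted(len(ex["instruction"].split()) for ex in examples)
    return {
        "per_category": dict(cats),
        # categories below the 20-example minimum recommended above
        "under_min_20": [c for c, n in cats.items() if n < 20],
        "min_len": lengths[0],
        "median_len": lengths[len(lengths) // 2],
        "max_len": lengths[-1],
    }
```

Running a report like this before training surfaces category imbalances and length outliers while they are still cheap to fix.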
Best Practices
Do
- Invest 80% of your time in data quality and 20% in quantity
- Deduplicate rigorously -- both exact-match and near-duplicate (MinHash with Jaccard > 0.8)
- Include diverse examples: varying length, difficulty, category, and edge cases
- Use synthetic data generation from a strong teacher model to augment limited real data
- Analyze your training data distribution before training -- look for category imbalances and length outliers
Don't
- Don't prioritize quantity over quality -- 500 perfect examples beat 50K mediocre ones
- Don't skip data cleaning -- duplicates and low-quality examples actively harm fine-tuning
- Don't generate synthetic data without seed examples -- the teacher needs examples to calibrate
- Don't forget negative examples (correct answer is 'none' or 'not found')
- Don't violate model provider terms when using synthetic data across providers
Key Takeaways
- Quality vastly outweighs quantity: 1,000 curated examples often outperform 50,000 unfiltered ones.
- Deduplication (exact plus near-duplicate) removes 5-20% of typical datasets and measurably improves model quality.
- Synthetic data from a teacher model (e.g., GPT-5.4 generating data for o4-mini) is a proven, cost-effective strategy.
- Data diversity engineering -- covering all categories, lengths, difficulties, and edge cases -- prevents production failures.
- Always analyze your training data distribution before fine-tuning: category balance, length distribution, and quality metrics.