Data Quality for AI Systems
Data quality investment pays off at scale — but only if you know when to invest, what failure class you are fighting, and how to build gates that block bad documents rather than just logging them. This article gives you a decision framework, a failure taxonomy grounded in observable symptoms, ingestion contracts, layered deduplication (syntactic and semantic), LLM-as-judge scoring with real cost math, and CI/CD quality gates.
Quick Reference
- Invest in quality pipelines only when you have retrieval failures in production traces; below 500 docs or with a single reliable source, ingestion contracts alone are sufficient
- Five observable failure classes: stale content, near-duplicates, contradictions, incomplete docs, and format corruption. Each has distinct symptoms and detection methods
- Ingestion contracts: every document needs a stable_id, source_uri, content_hash, last_modified, and content_type before it enters the pipeline
- Layer dedup: MinHash LSH (Jaccard threshold=0.85) for syntactic similarity, then embedding cosine similarity on candidates for semantic duplicates
- Structural pre-filter is free and catches 60-80% of bad documents; run it before LLM-as-judge to control cost
- LLM-as-judge evaluates factual consistency, completeness, and currency that heuristics cannot detect. With claude-haiku-4-5-20251001 at $1/MTok in and $5/MTok out, scoring a 500-token document costs ~$0.001
- Freshness SLAs are domain-specific: calibrate by measuring how often your source data actually changes, not by guessing
- Quality gates in CI should block ingestion, not just warn: a clear 'I don't know' from a coverage gap is better than a confidently wrong answer from bad data
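The per-document cost figure in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the prices are the ones quoted in the text, while the 100-token judge output and the zero-token prompt overhead are assumptions chosen for illustration. A real scoring prompt adds instruction tokens to every call.

```python
# Prices quoted in the text, in USD per million tokens.
PRICE_IN_PER_MTOK = 1.00
PRICE_OUT_PER_MTOK = 5.00


def judge_cost_usd(doc_tokens: int,
                   prompt_overhead_tokens: int = 0,
                   output_tokens: int = 100) -> float:
    """Estimate the cost of one LLM-as-judge call.

    prompt_overhead_tokens and output_tokens are assumed values;
    measure them from your own judge prompt before budgeting.
    """
    input_tokens = doc_tokens + prompt_overhead_tokens
    total = input_tokens * PRICE_IN_PER_MTOK + output_tokens * PRICE_OUT_PER_MTOK
    return total / 1_000_000


per_doc = judge_cost_usd(500)                                  # document only
with_prompt = judge_cost_usd(500, prompt_overhead_tokens=300)  # plus rubric text
corpus_cost = 8_000 * per_doc                                  # hypothetical 8,000-doc corpus
```

Under these assumptions the input and output halves each contribute about $0.0005, which is where the ~$0.001 figure comes from. Output tokens cost five times as much as input tokens here, so a verbose judge response moves the total more than a longer document does.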
When Data Quality Work Pays Off (and When It Doesn't)
Fix retrieval before investing in ingestion-side quality work.
Data quality pipelines are not free: they add latency to ingestion, consume compute, and require maintenance as your source schemas evolve. Before you build them, you need a production signal that justifies the investment. The clearest signal is retrieval failures showing up in traces. If your agent retrieves stale content, returns redundant chunks, or consistently says 'I don't have enough information', the root cause is almost always in the data. Even so, rule out a retrieval or prompt problem before you build quality infrastructure.
Teams often build MinHash deduplication pipelines and freshness trackers before their knowledge base has 200 documents. The ROI is negative. At small scale, manually curating documents is faster and more effective than automated pipelines. Start with ingestion contracts (catch missing metadata), then add quality gates only when you see retrieval failures at scale.
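As a starting point, an ingestion contract can be a single validation function that runs before a document is accepted. The sketch below checks the five fields named in the Quick Reference. The dict representation, the `body` key, the SHA-256 hash, and the ISO 8601 timestamp format are all assumptions made for illustration, not requirements stated in this article.

```python
import hashlib
from datetime import datetime

# The five contract fields listed in the Quick Reference.
REQUIRED_FIELDS = ("stable_id", "source_uri", "content_hash",
                   "last_modified", "content_type")


def contract_violations(doc: dict) -> list:
    """Return contract violations; an empty list means the doc may be ingested."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not doc.get(field)]

    # Assumption: content_hash is the SHA-256 hex digest of the UTF-8 body.
    body = doc.get("body", "")
    expected = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if doc.get("content_hash") and doc["content_hash"] != expected:
        problems.append("content_hash does not match body")

    # Assumption: last_modified is an ISO 8601 string.
    last_modified = doc.get("last_modified")
    if last_modified:
        try:
            datetime.fromisoformat(last_modified)
        except (TypeError, ValueError):
            problems.append("last_modified is not ISO 8601")

    return problems
```

A caller would reject any document for which the returned list is non-empty and record the reasons. Returning every violation at once, instead of failing on the first, makes the rejection log more useful to whoever owns the source system.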
| Trigger Signal | Recommended Action | When to Add the Next Layer |
|---|---|---|
| Agents say 'I don't have enough information' on known topics | Coverage gap analysis from query logs | When gap reports stop driving content creation |
| Same question gets different answers in different sessions | Contradiction detection via LLM-as-judge | When contradictions persist after manual dedup pass |
| Retriever returns obviously outdated facts with a high score | Freshness SLAs + auto-deprecation | When stale docs represent > 10% of retrieved context |
| Corpus > 1K docs from multiple authors or systems | MinHash deduplication before chunking | When semantic duplicates appear after syntactic dedup |
| Ingestion from PDFs, wikis, or scraped sources | Structural pre-filter at ingestion gate | When pre-filter false-positive rate drops below 5% |
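The freshness row in the table depends on a number you can compute from retrieval traces: the share of retrieved context that comes from stale documents. The sketch below is one way to measure it, assuming each retrieved chunk carries its source document's `last_modified` timestamp. The 180-day maximum age is a hypothetical SLA, and the share is counted per chunk, not per token.

```python
from datetime import datetime, timedelta, timezone

# The "> 10% of retrieved context" threshold from the table above.
STALE_SHARE_TRIGGER = 0.10


def stale_share(last_modified_times, max_age, now):
    """Fraction of retrieved chunks whose source doc is older than max_age."""
    times = list(last_modified_times)
    if not times:
        return 0.0
    stale = sum(1 for ts in times if now - ts > max_age)
    return stale / len(times)


# Hypothetical trace: ages in days of five chunks retrieved for one query.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
retrieved = [now - timedelta(days=d) for d in (3, 10, 40, 200, 400)]

share = stale_share(retrieved, max_age=timedelta(days=180), now=now)
needs_freshness_layer = share > STALE_SHARE_TRIGGER
```

In practice you would aggregate this over many traces before acting, since a single query that happens to hit old documents says little about the corpus as a whole.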
A platform team spent three weeks building a MinHash deduplication pipeline for a 150-document internal knowledge base. In the same sprint, they could have manually reviewed and cleaned all 150 documents twice. The pipeline was correct — it just was not the right investment at that scale. They eventually used it when the corpus grew to 8,000 documents from four different Confluence spaces. The lesson: build quality infrastructure in response to production failure signals, not in anticipation of them.