Advanced · 15 min

Data Quality for AI Systems

Data quality investment pays off at scale — but only if you know when to invest, what failure class you are fighting, and how to build gates that block bad documents rather than just logging them. This article gives you a decision framework, a failure taxonomy grounded in observable symptoms, ingestion contracts, layered deduplication (syntactic and semantic), LLM-as-judge scoring with real cost math, and CI/CD quality gates.

Quick Reference

  • Invest in quality pipelines only when you have retrieval failures in production traces — below 500 docs or with a single reliable source, ingestion contracts alone are sufficient
  • Five observable failure classes: stale content, near-duplicates, contradictions, incomplete docs, and format corruption — each has distinct symptoms and detection methods
  • Ingestion contracts: every document needs a stable_id, source_uri, content_hash, last_modified, and content_type before it enters the pipeline
  • Layer dedup: MinHash LSH (Jaccard threshold=0.85) for syntactic similarity, then embedding cosine similarity on candidates for semantic duplicates
  • Structural pre-filter is free and catches 60-80% of bad documents — run it before LLM-as-judge to control cost
  • LLM-as-judge evaluates factual consistency, completeness, and currency that heuristics cannot detect — with claude-haiku-4-5-20251001 at $1/MTok in and $5/MTok out, scoring a 500-token document costs ~$0.001
  • Freshness SLAs are domain-specific — calibrate by measuring how often your source data actually changes, not by guessing
  • Quality gates in CI should block ingestion, not just warn — a clear 'I don't know' from a coverage gap is better than a confidently wrong answer from bad data
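
The LLM-as-judge cost figure in the list above can be checked with simple arithmetic. The sketch below uses the prices quoted there ($1/MTok in, $5/MTok out). The 100-token judge response and the zero-token rubric prompt are assumptions made for illustration, and `judge_cost_usd` is a hypothetical helper, so measure your own token counts before budgeting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgePricing:
    # Prices quoted in the Quick Reference, in USD per million tokens.
    input_usd_per_mtok: float = 1.0
    output_usd_per_mtok: float = 5.0


def judge_cost_usd(doc_tokens, prompt_tokens=0, output_tokens=100,
                   pricing=JudgePricing()):
    """Estimate the cost of scoring one document with an LLM judge.

    prompt_tokens (rubric/instructions) and output_tokens (the judge's
    verdict) are assumptions; replace them with measured values.
    """
    input_cost = (doc_tokens + prompt_tokens) * pricing.input_usd_per_mtok / 1_000_000
    output_cost = output_tokens * pricing.output_usd_per_mtok / 1_000_000
    return input_cost + output_cost


per_doc = judge_cost_usd(500)
print(f"per document: ${per_doc:.4f}")
print(f"per 10,000 documents: ${per_doc * 10_000:.2f}")
```

Under these assumptions, a 500-token document costs $0.0005 in input plus $0.0005 in output, which matches the ~$0.001 figure. A long rubric prompt raises the input side, which is one more reason to run the free structural pre-filter first.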

When Data Quality Work Pays Off (and When It Doesn't)

Decision flow: fix retrieval before investing in ingestion-side quality work

Do you have retrieval failures in production traces?
  • No: defer quality pipelines; fix retrieval & prompts first
  • Yes: do you have > 500 docs from multiple sources?
      • No: contracts only (freshness SLAs + gap detection)
      • Yes: full pipeline (contracts → dedup → score → gates)

Data quality pipelines are not free: they add latency to ingestion, consume compute, and require maintenance as your source schemas evolve. Before you build them, you need a production signal that justifies the investment. The clearest signal is retrieval failures showing up in traces. If your agent retrieves stale content, returns redundant chunks, or consistently says 'I don't have enough information', the root cause is almost always in the data. Even so, rule out a retrieval or prompt problem before you build quality infrastructure.
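
The decision logic of this section can be written down as a small function. This is a minimal sketch: the 500-document threshold comes from the text, while the names `Investment` and `recommend_investment` and the three boolean/integer inputs are hypothetical choices made for illustration.

```python
from enum import Enum


class Investment(Enum):
    DEFER = "defer quality pipelines; fix retrieval and prompts first"
    CONTRACTS_ONLY = "contracts only: freshness SLAs + gap detection"
    FULL_PIPELINE = "full pipeline: contracts -> dedup -> score -> gates"


def recommend_investment(has_retrieval_failures: bool,
                         doc_count: int,
                         multiple_sources: bool) -> Investment:
    """Map production signals to a level of data quality investment."""
    # No failures in production traces: nothing yet justifies a pipeline.
    if not has_retrieval_failures:
        return Investment.DEFER
    # Failures plus a large, multi-source corpus: build the full pipeline.
    if doc_count > 500 and multiple_sources:
        return Investment.FULL_PIPELINE
    # Failures on a small or single-source corpus: contracts are enough.
    return Investment.CONTRACTS_ONLY


print(recommend_investment(True, 8000, True).value)
```

Treat the threshold as a starting point rather than a rule; the inputs that matter are the ones you can read directly from production traces and your corpus inventory.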

Don't quality-engineer a prototype

Teams often build MinHash deduplication pipelines and freshness trackers before their knowledge base has 200 documents. The ROI is negative. At small scale, manually curating documents is faster and more effective than automated pipelines. Start with ingestion contracts (catch missing metadata), then add quality gates only when you see retrieval failures at scale.
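
An ingestion contract that catches missing metadata can be very small. The sketch below checks the five fields named in the Quick Reference. Everything else is an assumption made for illustration: documents arrive as dicts, the body lives under a `content` key, `last_modified` is an ISO 8601 string, and `content_hash` is a SHA-256 hex digest of the UTF-8 body. `contract_violations` is a hypothetical helper, not a library function.

```python
import hashlib
from datetime import datetime

REQUIRED_FIELDS = ("stable_id", "source_uri", "content_hash",
                   "last_modified", "content_type")


def contract_violations(doc: dict) -> list[str]:
    """Return a list of contract violations; an empty list means accept."""
    violations = [f"missing {name}" for name in REQUIRED_FIELDS
                  if not doc.get(name)]

    # Assumption: last_modified is an ISO 8601 string.
    last_modified = doc.get("last_modified")
    if last_modified:
        try:
            datetime.fromisoformat(last_modified)
        except (TypeError, ValueError):
            violations.append("last_modified is not ISO 8601")

    # Assumption: content_hash is the SHA-256 hex digest of the UTF-8 body.
    content_hash = doc.get("content_hash")
    if content_hash:
        actual = hashlib.sha256(doc.get("content", "").encode("utf-8")).hexdigest()
        if actual != content_hash:
            violations.append("content_hash does not match content")

    return violations


body = "Reset your password from the account settings page."
good = {
    "stable_id": "kb-001",
    "source_uri": "https://example.com/kb/001",
    "content": body,
    "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    "last_modified": "2024-05-01T12:00:00+00:00",
    "content_type": "text/markdown",
}
bad = {key: value for key, value in good.items() if key != "stable_id"}

print(contract_violations(good))
print(contract_violations(bad))
```

Returning a list of violations rather than a boolean lets the ingestion gate reject the document and report every problem in one pass, which keeps the fix cycle short for whoever owns the source.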

| Trigger Signal | Recommended Action | When to Add the Next Layer |
|---|---|---|
| Agents say 'I don't have enough information' on known topics | Coverage gap analysis from query logs | When gap reports stop driving content creation |
| Same question gets different answers on different sessions | Contradiction detection via LLM-as-judge | When contradictions persist after manual dedup pass |
| Retriever returns obviously outdated facts with high score | Freshness SLAs + auto-deprecation | When stale docs represent > 10% of retrieved context |
| Corpus > 1K docs from multiple authors or systems | MinHash deduplication before chunking | When semantic duplicates appear after syntactic dedup |
| Ingestion from PDFs, wikis, or scraped sources | Structural pre-filter at ingestion gate | When pre-filter false-positive rate drops below 5% |

Real project

A platform team spent three weeks building a MinHash deduplication pipeline for a 150-document internal knowledge base. In the same sprint, they could have manually reviewed and cleaned all 150 documents twice. The pipeline was correct — it just was not the right investment at that scale. They eventually used it when the corpus grew to 8,000 documents from four different Confluence spaces. The lesson: build quality infrastructure in response to production failure signals, not in anticipation of them.