Data Quality for AI Systems

Data quality investment pays off at scale — but only if you know when to invest, what failure class you are fighting, and how to build gates that block bad documents rather than just logging them. This article gives you a decision framework, a failure taxonomy grounded in observable symptoms, ingestion contracts, layered deduplication (syntactic and semantic), LLM-as-judge scoring with real cost math, and CI/CD quality gates.

Quick Reference

→Invest in quality pipelines only when you have retrieval failures in production traces — below 500 docs or with a single reliable source, ingestion contracts alone are sufficient
→Five observable failure classes: stale content, near-duplicates, contradictions, incomplete docs, and format corruption — each has distinct symptoms and detection methods
→Ingestion contracts: every document needs a stable_id, source_uri, content_hash, last_modified, and content_type before it enters the pipeline
→Layer dedup: MinHash LSH (Jaccard threshold=0.85) for syntactic similarity, then embedding cosine similarity on candidates for semantic duplicates
→Structural pre-filter is free and catches 60-80% of bad documents — run it before LLM-as-judge to control cost
→LLM-as-judge evaluates factual consistency, completeness, and currency that heuristics cannot detect — with claude-haiku-4-5-20251001 at $1/MTok in and $5/MTok out, scoring a 500-token document costs ~$0.001
→Freshness SLAs are domain-specific — calibrate by measuring how often your source data actually changes, not by guessing
→Quality gates in CI should block ingestion, not just warn — a clear 'I don't know' from a coverage gap is better than a confidently wrong answer from bad data
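
The ~$0.001 figure above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the per-token prices come from the list above, while the 100-token verdict length, the `judge_cost_usd` name, and the `prompt_overhead_tokens` parameter are illustrative and not from the article. Check current pricing before budgeting.

```python
# Prices per million tokens, as quoted in the Quick Reference.
INPUT_USD_PER_MTOK = 1.00
OUTPUT_USD_PER_MTOK = 5.00


def judge_cost_usd(doc_tokens: int,
                   prompt_overhead_tokens: int = 0,
                   output_tokens: int = 100) -> float:
    """Estimate the cost of one LLM-as-judge call.

    Assumptions (not from the article): the judge returns a short verdict
    of about 100 output tokens, and rubric/prompt overhead is zero unless
    the caller passes prompt_overhead_tokens.
    """
    input_tokens = doc_tokens + prompt_overhead_tokens
    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


per_doc = judge_cost_usd(500)
print(f"per document: ${per_doc:.4f}")
print(f"per 10,000 documents: ${per_doc * 10_000:.2f}")
```

Half of the estimate is output cost even though the verdict is much shorter than the document, so keeping the judge's response terse matters about as much as trimming its input.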

When Data Quality Work Pays Off (and When It Doesn't)

fix retrieval before investing in ingestion-side quality work

Data quality pipelines are not free: they add latency to ingestion, consume compute, and require maintenance as your source schemas evolve. Before you build them, you need a production signal that justifies the investment. The clearest signal is retrieval failures showing up in traces. If your agent retrieves stale content, returns redundant chunks, or consistently says 'I don't have enough information', the root cause is almost always in the data. Even so, rule out a retrieval or prompt problem before you build quality infrastructure.

Don't quality-engineer a prototype

Teams often build MinHash deduplication pipelines and freshness trackers before their knowledge base has 200 documents. The ROI is negative. At small scale, manually curating documents is faster and more effective than automated pipelines. Start with ingestion contracts (catch missing metadata), then add quality gates only when you see retrieval failures at scale.

Trigger Signal	Recommended Action	When to Add the Next Layer
Agents say 'I don't have enough information' on known topics	Coverage gap analysis from query logs	When gap reports stop driving content creation
Same question gets different answers in different sessions	Contradiction detection via LLM-as-judge	When contradictions persist after manual dedup pass
Retriever returns obviously outdated facts with high score	Freshness SLAs + auto-deprecation	When stale docs represent > 10% of retrieved context
Corpus > 1K docs from multiple authors or systems	MinHash deduplication before chunking	When semantic duplicates appear after syntactic dedup
Ingestion from PDFs, wikis, or scraped sources	Structural pre-filter at ingestion gate	When pre-filter false-positive rate drops below 5%
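
To make the MinHash row in the table concrete, here is a minimal standard-library sketch of estimating Jaccard similarity from MinHash signatures. It is not the article's pipeline: it compares two documents directly and omits the LSH banding step that makes candidate lookup scale. The shingle size, hash count, and function names are assumptions. Only the 0.85 threshold comes from the Quick Reference.

```python
import hashlib

NUM_HASHES = 128   # assumption: signature length
THRESHOLD = 0.85   # Jaccard threshold from the Quick Reference


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = NUM_HASHES) -> list[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = THRESHOLD) -> bool:
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

A production pipeline would typically index signatures in an LSH structure so that each new document is compared only against likely candidates rather than against the whole corpus.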

Real project

A platform team spent three weeks building a MinHash deduplication pipeline for a 150-document internal knowledge base. In the same sprint, they could have manually reviewed and cleaned all 150 documents twice. The pipeline was correct — it just was not the right investment at that scale. They eventually used it when the corpus grew to 8,000 documents from four different Confluence spaces. The lesson: build quality infrastructure in response to production failure signals, not in anticipation of them.

How Bad Data Breaks RAG Systems

each failure class has distinct symptoms — no single quality metric covers all five

Ingestion Contracts: Catch Problems at the Door

An ingestion contract is a schema every document must satisfy before entering your pipeline. It is not a quality check — it is a precondition. Without stable IDs, you cannot track document history. Without source URIs, you cannot re-fetch stale content. Without last_modified timestamps, you cannot enforce freshness SLAs. Contracts cost nothing to enforce and prevent a large class of quality problems from ever reaching your vector index.
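
A contract check of this kind can be sketched in a few lines. The five field names come from the Quick Reference; everything else here is an assumption made for illustration: the dict representation, the `validate_document` and `ContractViolation` names, SHA-256 as the hash, and ISO 8601 strings with a timezone offset for `last_modified`.

```python
import hashlib
from datetime import datetime

REQUIRED_FIELDS = ("stable_id", "source_uri", "content_hash",
                   "last_modified", "content_type")


class ContractViolation(ValueError):
    """Raised when a document fails the ingestion contract."""


def validate_document(meta: dict, body: str) -> None:
    """Raise ContractViolation unless meta satisfies the contract for body."""
    missing = [name for name in REQUIRED_FIELDS if not meta.get(name)]
    if missing:
        raise ContractViolation("missing fields: " + ", ".join(missing))

    # Assumption: content_hash is the SHA-256 hex digest of the UTF-8 body.
    expected = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if meta["content_hash"] != expected:
        raise ContractViolation("content_hash does not match document body")

    # Assumption: last_modified is an ISO 8601 string with a timezone offset.
    try:
        modified = datetime.fromisoformat(meta["last_modified"])
    except (TypeError, ValueError) as exc:
        raise ContractViolation("last_modified is not ISO 8601") from exc
    if modified.tzinfo is None:
        raise ContractViolation("last_modified must include a timezone offset")


body = "Reset your password from the account settings page."
meta = {
    "stable_id": "kb-0001",                         # hypothetical values
    "source_uri": "https://example.com/kb/reset-password",
    "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    "last_modified": "2025-01-15T09:30:00+00:00",
    "content_type": "text/markdown",
}
validate_document(meta, body)  # passes silently when the contract holds
```

Raising instead of logging means a violating document stops at the gate, which matches the point that the contract is a precondition rather than a score.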
