Human Review & Confidence Calibration
Designing human-in-the-loop review workflows with field-level confidence scoring, stratified sampling for error rate measurement, and routing logic that sends low-confidence extractions to human review while automating high-confidence ones.
Quick Reference
- Aggregate accuracy (e.g., 97%) can mask catastrophic failure on specific document types or fields
- Stratified random sampling measures error rates per document type and per field -- not just overall
- Field-level confidence scores must be calibrated with labeled validation sets, not trusted at face value
- Route by confidence: high (>0.95) -> auto-process, medium (0.80-0.95) -> sample for QA, low (<0.80) -> human review
- Validate accuracy by document type AND field before deciding what to automate
- Contradictory-source extractions should always route to human review regardless of confidence score
- Human review queues need priority ordering: highest business impact first, not FIFO
- Track per-reviewer accuracy to calibrate the human review process itself
- Confidence thresholds should be tuned per field: names might be accurate at 0.90, dollar amounts need 0.99
- The goal is not 100% automation -- it's knowing exactly which cases need humans
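The per-field thresholds and routing rules above can be sketched as a small lookup plus one function. The field names and cutoff values here are illustrative assumptions, not a fixed schema -- the point is that each field carries its own auto-approve and QA-sample cutoffs, with a hard override for contradictory sources.

```python
# Illustrative per-field confidence thresholds (hypothetical values).
# Tolerable error differs by field: a misread vendor name is usually
# recoverable downstream; a misread dollar amount is not.
FIELD_THRESHOLDS = {
    # field_name: (auto_approve_min, qa_sample_min)
    "vendor_name":  (0.90, 0.75),
    "invoice_date": (0.95, 0.80),
    "total_amount": (0.99, 0.90),  # dollar amounts need near-certainty
}

DEFAULT_THRESHOLDS = (0.95, 0.80)  # fallback for fields not listed above

def field_route(field: str, confidence: float, contradictory: bool = False) -> str:
    """Return 'auto', 'qa_sample', or 'human_review' for one field."""
    if contradictory:
        # Contradictory sources always go to a human, regardless of score.
        return "human_review"
    auto_min, qa_min = FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLDS)
    if confidence >= auto_min:
        return "auto"
    if confidence >= qa_min:
        return "qa_sample"
    return "human_review"
```

Note that a 0.97 confidence auto-approves a name but only QA-samples a dollar amount -- the same score routes differently depending on what the field costs when wrong.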
The Aggregate Accuracy Trap
The most dangerous metric in AI system evaluation is aggregate accuracy. A document extraction system that reports '97% accuracy overall' sounds production-ready. But that 97% might hide a much darker reality: 99.5% accuracy on standard invoices (95% of volume) and 50% accuracy on handwritten receipts (5% of volume). The 3% error rate is not evenly distributed -- it is concentrated in specific document types, specific fields, and specific edge cases.
If your system processes 1,000 documents/day (950 invoices at 99.5% + 50 receipts at 50%), the aggregate works out to 97% and looks great. But you're producing roughly 30 wrong extractions per day, 25 of them from receipts. A customer who only submits receipts experiences a 50% failure rate. Always decompose accuracy by type.
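Decomposing accuracy means tallying correct/total per (document type, field) from a labeled sample instead of pooling everything into one number. A minimal sketch, assuming a hypothetical record shape with `doc_type`, `field`, and `correct` keys:

```python
from collections import defaultdict

def accuracy_breakdown(labeled):
    """Per-(doc_type, field) accuracy from labeled validation records.

    `labeled` is an iterable of dicts with keys 'doc_type', 'field',
    and 'correct' (bool) -- an assumed record shape for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in labeled:
        key = (rec["doc_type"], rec["field"])
        total[key] += 1
        correct[key] += bool(rec["correct"])
    return {key: correct[key] / total[key] for key in total}

# Tiny illustrative sample: invoices look perfect, receipts do not.
sample = (
    [{"doc_type": "invoice", "field": "total", "correct": True}] * 9
    + [{"doc_type": "receipt", "field": "total", "correct": True}] * 3
    + [{"doc_type": "receipt", "field": "total", "correct": False}] * 2
)
breakdown = accuracy_breakdown(sample)
# Pooled accuracy is 12/14, but receipts alone sit at 0.6.
```

In practice the `labeled` sample should come from stratified random sampling -- drawn per document type so rare types are represented -- rather than a uniform draw that will be dominated by the majority type.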
Route extractions by confidence: auto-approve, sample, or queue for human review
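A minimal sketch of that routing, paired with a review queue ordered by business impact rather than arrival time. The document-level rule (route by the weakest field's confidence) and the impact scores are assumptions for illustration:

```python
import heapq

AUTO, SAMPLE, REVIEW = "auto", "qa_sample", "human_review"

def route_document(field_confidences, auto_min=0.95, qa_min=0.80):
    """Route a whole extraction by its weakest field's confidence."""
    worst = min(field_confidences.values())
    if worst >= auto_min:
        return AUTO
    if worst >= qa_min:
        return SAMPLE
    return REVIEW

class ReviewQueue:
    """Priority queue: highest business impact reviewed first, not FIFO."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps ordering deterministic
    def push(self, doc_id, impact):
        # heapq is a min-heap, so negate impact to pop highest-impact first.
        heapq.heappush(self._heap, (-impact, self._counter, doc_id))
        self._counter += 1
    def pop(self):
        return heapq.heappop(self._heap)[2]

queue = ReviewQueue()
for doc_id, confs, impact in [
    ("doc-1", {"total": 0.70, "date": 0.99}, 120.0),   # low-confidence total
    ("doc-2", {"total": 0.99, "date": 0.99}, 40.0),    # clean, auto-approved
    ("doc-3", {"total": 0.60, "date": 0.91}, 5000.0),  # high-value, review first
]:
    if route_document(confs) == REVIEW:
        queue.push(doc_id, impact)
# doc-3 comes off the queue before doc-1; doc-2 never enters it.
```

Routing by the weakest field is a deliberately conservative choice: one low-confidence dollar amount should pull the whole document into review even if every other field is near-certain.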