Primary Domains Tested
- Domain 4 — API and Optimization: tool_choice, JSON schemas, Batch API, and structured output patterns
- Domain 5 — Prompt Engineering: schema design for extraction accuracy, instructions for handling missing data
Design a structured data extraction pipeline using tool_use and JSON schemas. Covers schema design for accuracy, forced tool selection, validation-retry loops, batch processing, and human review routing.
Quick Reference
You are building a structured data extraction pipeline that processes unstructured business documents — invoices, contracts, medical records, and legal filings — and extracts structured data for downstream systems. The pipeline receives documents as text (pre-OCR), sends them to Claude with a JSON schema defining the expected fields, validates the extracted data, and either routes it to downstream systems or flags it for human review. The pipeline processes 500-2,000 documents per day.
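A minimal sketch of what the extraction tool definition might look like, assuming an invoice document type. The tool name `record_invoice`, the specific schema fields, and the `missing_mandatory` helper are illustrative assumptions, not part of the scenario. With the Anthropic Messages API, passing `tool_choice={"type": "tool", "name": "record_invoice"}` forces the model to respond with this tool, guaranteeing structured output:

```python
# Illustrative tool definition for invoice extraction.
# All field names are assumptions for this sketch.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": (
        "Record structured fields extracted from an invoice. "
        "Use null for any field that is not present in the document."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": ["string", "null"]},
            "vendor_name": {"type": ["string", "null"]},
            "total_amount": {"type": ["number", "null"]},
            "currency": {"type": ["string", "null"]},
        },
        "required": ["invoice_number", "vendor_name", "total_amount", "currency"],
    },
}

def missing_mandatory(extracted: dict,
                      mandatory=("invoice_number", "total_amount")) -> list:
    """Return mandatory fields that are absent or null in the extracted payload."""
    return [field for field in mandatory if extracted.get(field) is None]
```

Listing every field under `required` while allowing `null` as a type nudges the model to state explicitly when data is absent rather than silently omitting keys or inventing values, which is one common tactic against fabrication.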
Key challenges include:
- designing schemas that maximize extraction accuracy while minimizing fabrication (the model inventing data that is not in the document)
- ensuring the model always produces structured output, even for unusual documents
- distinguishing cases where retrying extraction improves quality from cases where the information simply is not present
- efficiently reprocessing failed documents in batch workflows
- routing low-confidence extractions to human reviewers without creating a bottleneck

The system must achieve 95% or higher accuracy on mandatory fields while processing each document within a 10-second latency budget.
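One way to tie validation, bounded retries, and human-review routing together is a loop like the sketch below. The `extract_fn` and `validate_fn` callables are hypothetical stand-ins (a real `extract_fn` would call the model; `validate_fn` would check the tool output against the schema), and the retry budget of 2 is an assumption chosen to stay within a per-document latency budget:

```python
from typing import Callable

MAX_RETRIES = 2  # assumption: small retry budget to respect the latency target

def process_document(doc_text: str,
                     extract_fn: Callable[[str], dict],
                     validate_fn: Callable[[dict], list]) -> dict:
    """Run extraction with a bounded validation-retry loop.

    extract_fn: calls the model and returns an extracted dict (stubbed here).
    validate_fn: returns a list of problem descriptions; empty means valid.
    Documents that still fail after the retry budget are routed to human
    review rather than retried indefinitely -- retrying cannot conjure
    data the document never contained.
    """
    last_errors: list = []
    for attempt in range(1 + MAX_RETRIES):
        extracted = extract_fn(doc_text)
        last_errors = validate_fn(extracted)
        if not last_errors:
            # Valid on this attempt: hand off to downstream systems.
            return {"status": "ok", "data": extracted, "attempts": attempt + 1}
    # Retries exhausted: flag for human review with the remaining errors.
    return {"status": "needs_review", "data": extracted, "errors": last_errors}
```

Capping retries is what separates recoverable failures (transient formatting mistakes, fixable with a retry) from unrecoverable ones (the field genuinely is not in the document), which go to the review queue instead of burning latency.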