Exam Scenarios/Practice Scenarios
Advanced · 15 min

Scenario: Structured Data Extraction

Design a structured data extraction pipeline using tool_use and JSON schemas. This scenario covers schema design for accuracy, forced tool selection, validation-retry loops, batch processing, and human review routing.

Quick Reference

  • Nullable optional fields prevent fabrication — the model can say 'I don't know' instead of inventing data
  • tool_choice: forced ensures the model always calls the extraction tool, never skips it
  • Validation-retry loops help when the info exists but was missed — not when info is absent from the source
  • Batch API with custom_id enables efficient resubmission of failed documents without reprocessing successes
  • Field-level confidence scores route uncertain extractions to human review without bottlenecking everything
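The first two points above can be combined in one tool definition: optional fields declared nullable in the schema, and `tool_choice` set to force the extraction tool. A minimal sketch, assuming the Anthropic Messages API tool format; the tool name, field names, and model id here are illustrative, not from the scenario:

```python
# Sketch of an extraction tool definition. The field names
# (invoice_number, total, po_number) are hypothetical examples.

EXTRACT_TOOL = {
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Mandatory fields: present in any valid invoice.
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            # Optional field made nullable so the model can return null
            # instead of fabricating a value that is not in the document.
            "po_number": {
                "type": ["string", "null"],
                "description": "Purchase order number, or null if absent.",
            },
        },
        # Requiring the nullable field forces an explicit null rather
        # than silent omission.
        "required": ["invoice_number", "total", "po_number"],
    },
}

def build_request(document_text: str) -> dict:
    """Build Messages API kwargs with the extraction tool forced."""
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model id
        "max_tokens": 1024,
        "tools": [EXTRACT_TOOL],
        # Forced tool choice: the model must call this tool, never skip it.
        "tool_choice": {"type": "tool", "name": "record_invoice"},
        "messages": [{"role": "user", "content": document_text}],
    }
```

These kwargs would be passed to `client.messages.create(**build_request(text))`; the sketch builds them without making a network call.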

Scenario Description

You are building a structured data extraction pipeline that processes unstructured business documents — invoices, contracts, medical records, and legal filings — and extracts structured data for downstream systems. The pipeline receives documents as text (pre-OCR), sends them to Claude with a JSON schema defining the expected fields, validates the extracted data, and either routes it to downstream systems or flags it for human review. The pipeline processes 500-2,000 documents per day.

Key challenges include: designing schemas that maximize extraction accuracy while minimizing fabrication (the model inventing data that is not in the document), ensuring the model always produces structured output even for unusual documents, handling cases where retrying extraction improves quality versus cases where the information simply is not present, efficiently reprocessing failed documents in batch workflows, and routing low-confidence extractions to human reviewers without creating a bottleneck. The system must achieve 95% or higher accuracy on mandatory fields while processing documents within a 10-second per-document latency budget.