Exam Scenarios/Practice Scenarios
Advanced · 15 min

Scenario: Structured Data Extraction

Design a structured data extraction pipeline using tool_use and JSON schemas. This scenario covers schema design for accuracy, forced tool selection, validation-retry loops, batch processing, and human review routing.

Quick Reference

  • Nullable optional fields prevent fabrication — the model can say 'I don't know' instead of inventing data
  • tool_choice: forced ensures the model always calls the extraction tool, never skips it
  • Validation-retry loops help when the info exists but was missed — not when info is absent from the source
  • Batch API with custom_id enables efficient resubmission of failed documents without reprocessing successes
  • Field-level confidence scores route uncertain extractions to human review without bottlenecking everything
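The first two points above can be combined in one tool definition: optional fields declared nullable in the schema, and `tool_choice` set to force the extraction tool. A minimal sketch, assuming the Anthropic Messages API tool format; the tool name, field names, and model id here are illustrative, not from the scenario:

```python
# Sketch of an extraction tool definition. The field names
# (invoice_number, total, po_number) are hypothetical examples.

EXTRACT_TOOL = {
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Mandatory fields: present in any valid invoice.
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            # Optional field made nullable so the model can return null
            # instead of fabricating a value that is not in the document.
            "po_number": {
                "type": ["string", "null"],
                "description": "Purchase order number, or null if absent.",
            },
        },
        # Requiring the nullable field forces an explicit null rather
        # than silent omission.
        "required": ["invoice_number", "total", "po_number"],
    },
}

def build_request(document_text: str) -> dict:
    """Build Messages API kwargs with the extraction tool forced."""
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model id
        "max_tokens": 1024,
        "tools": [EXTRACT_TOOL],
        # Forced tool choice: the model must call this tool, never skip it.
        "tool_choice": {"type": "tool", "name": "record_invoice"},
        "messages": [{"role": "user", "content": document_text}],
    }
```

These kwargs would be passed to `client.messages.create(**build_request(text))`; the sketch builds them without making a network call.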

Scenario Description

You are building a structured data extraction pipeline that processes unstructured business documents — invoices, contracts, medical records, and legal filings — and extracts structured data for downstream systems. The pipeline receives documents as text (pre-OCR), sends them to Claude with a JSON schema defining the expected fields, validates the extracted data, and either routes it to downstream systems or flags it for human review. The pipeline processes 500-2,000 documents per day.

Key challenges include: designing schemas that maximize extraction accuracy while minimizing fabrication (the model inventing data that is not in the document), ensuring the model always produces structured output even for unusual documents, handling cases where retrying extraction improves quality versus cases where the information simply is not present, efficiently reprocessing failed documents in batch workflows, and routing low-confidence extractions to human reviewers without creating a bottleneck. The system must achieve 95% or higher accuracy on mandatory fields while processing documents within a 10-second per-document latency budget.