Structured Output Techniques
Both OpenAI and Anthropic now offer native constrained decoding — the model physically cannot produce tokens that violate your schema. This guide covers when to use it, how it works, and the three layers of validation you still need on top of it.
Quick Reference
- →Constrained decoding: model cannot produce tokens that violate the schema grammar
- →OpenAI: completions.parse(response_format=MyModel) — or responses.parse() in the Responses API
- →Anthropic: messages.parse(output_format=MyModel) — uses output_config.format internally
- →instructor: cross-provider wrapper with Pydantic validation + auto-retry
- →Grammar compilation adds latency on first request; cached 24 h — reuse schema objects
- →Schema compliance ≠ semantic correctness: age: -5 passes constrained decoding
- →Anthropic does not support recursive schemas or min/max constraints
- →All paths still need Pydantic field validators for domain-specific correctness
In this article
Should You Use Structured Output?
Constrained decoding forces the model to produce schema-valid output at the token level — it physically cannot emit tokens that violate the grammar. That is powerful, but it comes with costs: latency on first use (grammar compilation), potential quality degradation when the schema constrains reasoning, and provider-specific schema limitations. Before reaching for it, confirm you actually need it.
- ▸Skip structured output for: long-form prose (reports, emails), reasoning chains where the model needs to think freely, highly dynamic schemas that change per-request, and early prototypes where schema failures are not the bottleneck
- ▸Use structured output when: the response feeds downstream code (parsing, database writes, UI rendering), missing fields or wrong types cause crashes, or you are extracting specific entities from unstructured text at scale
Forcing a model to produce a specific JSON structure prevents it from expressing uncertainty or adding useful context. A field like confidence: float that the schema requires may get filled with a plausible-looking but meaningless value. If you need the model to reason, let it reason in free text — then extract structured data from that output in a second step.
All paths converge on semantic validation — constrained decoding handles syntax only
How Constrained Decoding Works
Constrained decoding pre-compiles your JSON schema into a context-free grammar, then masks invalid tokens at each generation step. The model never tries to follow the schema — it simply cannot produce a token that would create an invalid continuation. This is different from JSON mode, which only guarantees well-formed JSON, or instruction-following, which relies on the model cooperating.
The first request with a new schema triggers grammar compilation (typically 50–200 ms added latency). Subsequent requests reuse the compiled grammar, cached for 24 hours. Always reuse the same Pydantic model class across requests — do not reconstruct it dynamically — so the cache stays warm.
| Provider | Constrained Decoding | Native Pydantic SDK | Streaming | Key Constraint |
|---|---|---|---|---|
| OpenAI | Yes | Yes — completions.parse() | Yes | strict: true required on schema |
| Anthropic | Yes | Yes — messages.parse() | Yes | No recursive schemas or min/max |
| Gemini | Yes | Partial — response_schema= | Yes | Fewer supported JSON types |
OpenAI's response_format={"type": "json_object"} only guarantees well-formed JSON — no schema enforcement. Anthropic's equivalent is similarly limited. Both providers now recommend using constrained decoding (Structured Outputs on OpenAI, output_config on Anthropic) instead. JSON mode still works but should not appear in new production code.
Native Structured Outputs: OpenAI and Anthropic
Both providers expose constrained decoding through their native Python SDKs with first-class Pydantic integration. No external library required. The SDK handles schema generation, API call, and response parsing — you work directly with typed Python objects.
Use messages.parse() on Anthropic and completions.parse() on OpenAI when you know your provider. They give you constrained decoding without the overhead of a retry loop. Reserve instructor for multi-provider code — when you need to swap between OpenAI and Anthropic without changing application code, or when you need customized retry logic.
output_config does not support recursive schemas, min/max constraints on strings or numbers, additionalProperties: true, or external $ref. If your Pydantic model uses these features, the SDK will attempt to strip them and update field descriptions. Test your schema at integration time, not in production.
Instructor: Cross-Provider Structured Output
instructor wraps any provider's API with Pydantic validation and automatic retry. When a response fails validation, it sends the error back to the model as context and retries — up to max_retries times. Use it when you need provider portability, when native SDK ergonomics do not cover your case, or when you need semantic retries (not just syntactic compliance).
instructor's max_retries=3 means up to 4 total API calls if all retries fail. At high volume, even a 5% semantic failure rate can double your effective per-extraction cost. Use native constrained decoding to eliminate syntactic failures, then only retry for semantic validator failures — which should be rare when your Pydantic model is well-designed.
How Structured Output Fails
A team ran instructor with max_retries=5 on a document extraction pipeline processing thousands of requests per hour. Schema-valid JSON was already guaranteed by Anthropic's output_config — but the retry loop was still running on semantic validator failures. Analyzing logs showed 8% retry rate, adding 40% to token costs. The fix: tighten the Pydantic validators to only raise on genuinely unrecoverable data quality issues, and separate hard failures (stop processing) from soft failures (flag for human review). Retries are for unrecoverable extraction failures, not data quality signals — log every validation failure as an extraction quality metric.
- ▸Grammar compilation latency (50–200 ms) on the first request after the 24-hour cache expires — mitigate by warming the cache on startup
- ▸Model refusals bypass constrained decoding: a refusal returns plain text, not schema-valid JSON — always check finish_reason before parsing
- ▸Quality degradation: schemas with many required fields force the model to generate values for fields it cannot infer — you get plausible-looking garbage
- ▸Schema limitation failures on Anthropic: recursive schemas, min/max constraints, and additionalProperties: true silently fail or get stripped
- ▸Cross-field validation cannot run during streaming — validate the full object only after the stream completes
Extract, validate, and retry with specific error feedback until quality passes
Streaming Structured Output
Both providers support streaming with constrained decoding. You receive tokens incrementally while the grammar constraint applies throughout. Partial JSON is invalid until the stream completes — use partial parsers or instructor's streaming helpers to extract fields as they arrive.
Pydantic cross-field validators (like end_date > start_date) cannot run during streaming because both fields may not be populated yet. Show partial results in the UI but only run final validation after the stream completes. Flag the output as provisional until then.
Defense in Depth: Three Validation Layers
Constrained decoding handles syntax. It does not catch semantic errors. A schema-valid response can still contain age: -5, score: 1.5 for a 0–1 field, or date: 2099-13-45. Production extraction needs three distinct validation layers, each catching different classes of failures.
All three layers needed — schema compliance does not guarantee correct values
Build an eval set of 50–100 real documents with known ground-truth extractions. Gate CI on per-field precision and recall thresholds — not just 'does it parse.' A field with 70% recall is a prompt problem, not a Pydantic problem. Catch it before it ships.
Best Practices
Do
- ✓Use messages.parse() on Anthropic and completions.parse() on OpenAI — they handle schema generation and typed output natively
- ✓Reuse the same Pydantic model class across requests to benefit from 24-hour grammar caching
- ✓Add @field_validator and @model_validator for domain-specific constraints — type coercion alone is not enough
- ✓Check finish_reason before parsing — model refusals return text, not schema-valid JSON
- ✓Log every validation failure as an extraction quality metric, not just an error
- ✓Build an eval set with ground-truth labels and gate CI on per-field recall thresholds
- ✓Use instructor for cross-provider code; use native SDKs when the provider is fixed
- ✓Separate hard extraction failures (stop processing) from soft failures (flag for review)
Don’t
- ✗Don't use json.loads() directly — you lose type safety and get no validation
- ✗Don't use JSON mode in new production code — it guarantees syntax, not schema
- ✗Don't set max_retries higher than 2–3 without measuring the failure rate first — retries cost full token counts
- ✗Don't use min/max constraints or recursive schemas with Anthropic's output_config — they are silently stripped
- ✗Don't assume schema-valid JSON is semantically correct — add field validators for ranges, formats, and cross-field rules
- ✗Don't run cross-field validators during streaming — both fields may not be populated yet
- ✗Don't use constrained decoding for long-form reasoning — it constrains what the model can express, not just what it returns
- ✗Don't reconstruct Pydantic models dynamically per-request — the grammar cache will miss every time
Key Takeaways
- ✓Both OpenAI and Anthropic now support native constrained decoding — the model physically cannot produce tokens that violate the schema grammar.
- ✓Use completions.parse() on OpenAI and messages.parse() on Anthropic; reserve instructor for cross-provider code or custom retry logic.
- ✓Constrained decoding handles syntax only — add Pydantic field validators for semantic correctness and business logic checks for domain rules.
- ✓Grammar compilation adds latency on first use; reuse the same Pydantic model class across requests to benefit from 24-hour caching.
- ✓Each instructor retry is a full API call — at scale, even a 5% semantic failure rate meaningfully increases extraction cost.
- ✓JSON mode (json_object response format) is legacy on both providers — it guarantees well-formed JSON, not schema compliance.
Video on this topic
Getting reliable JSON from LLMs every time
tiktok