LLM Foundations/Prompt Engineering as a Discipline

Intermediate14 min

Structured Output Techniques

Both OpenAI and Anthropic now offer native constrained decoding — the model physically cannot produce tokens that violate your schema. This guide covers when to use it, how it works, and the three layers of validation you still need on top of it.

Quick Reference

→Constrained decoding: model cannot produce tokens that violate the schema grammar
→OpenAI: completions.parse(response_format=MyModel) — or responses.parse() in the Responses API
→Anthropic: messages.parse(output_format=MyModel) — uses output_config.format internally
→instructor: cross-provider wrapper with Pydantic validation + auto-retry
→Grammar compilation adds latency on first request; cached 24 h — reuse schema objects
→Schema compliance ≠ semantic correctness: age: -5 passes constrained decoding
→Anthropic does not support recursive schemas or min/max constraints
→All paths still need Pydantic field validators for domain-specific correctness

In this article

1.Should You Use Structured Output?
2.How Constrained Decoding Works
3.Native Structured Outputs: OpenAI and Anthropic
4.Instructor: Cross-Provider Structured Output
5.How Structured Output Fails
6.Streaming Structured Output
7.Defense in Depth: Three Validation Layers
★Best Practices
✓Key Takeaways

Should You Use Structured Output?

Constrained decoding forces the model to produce schema-valid output at the token level — it physically cannot emit tokens that violate the grammar. That is powerful, but it comes with costs: latency on first use (grammar compilation), potential quality degradation when the schema constrains reasoning, and provider-specific schema limitations. Before reaching for it, confirm you actually need it.

▸Skip structured output for: long-form prose (reports, emails), reasoning chains where the model needs to think freely, highly dynamic schemas that change per-request, and early prototypes where schema failures are not the bottleneck
▸Use structured output when: the response feeds downstream code (parsing, database writes, UI rendering), missing fields or wrong types cause crashes, or you are extracting specific entities from unstructured text at scale

Schema constraints can reduce output quality

Forcing a model to produce a specific JSON structure prevents it from expressing uncertainty or adding useful context. A field like confidence: float that the schema requires may get filled with a plausible-looking but meaningless value. If you need the model to reason, let it reason in free text — then extract structured data from that output in a second step.

All paths converge on semantic validation — constrained decoding handles syntax only

How Constrained Decoding Works

Constrained decoding pre-compiles your JSON schema into a context-free grammar, then masks invalid tokens at each generation step. The model never tries to follow the schema — it simply cannot produce a token that would create an invalid continuation. This is different from JSON mode, which only guarantees well-formed JSON, or instruction-following, which relies on the model cooperating.

Grammar compilation and caching

The first request with a new schema triggers grammar compilation (typically 50–200 ms added latency). Subsequent requests reuse the compiled grammar, cached for 24 hours. Always reuse the same Pydantic model class across requests — do not reconstruct it dynamically — so the cache stays warm.

Provider	Constrained Decoding	Native Pydantic SDK	Streaming	Key Constraint
OpenAI	Yes	Yes — completions.parse()	Yes	strict: true required on schema
Anthropic	Yes	Yes — messages.parse()	Yes	No recursive schemas or min/max
Gemini	Yes	Partial — response_schema=	Yes	Fewer supported JSON types

JSON mode is legacy

OpenAI's response_format={"type": "json_object"} only guarantees well-formed JSON — no schema enforcement. Anthropic's equivalent is similarly limited. Both providers now recommend using constrained decoding (Structured Outputs on OpenAI, output_config on Anthropic) instead. JSON mode still works but should not appear in new production code.

Native Structured Outputs: OpenAI and Anthropic

Both providers expose constrained decoding through their native Python SDKs with first-class Pydantic integration. No external library required. The SDK handles schema generation, API call, and response parsing — you work directly with typed Python objects.

OpenAI: completions.parse() with Pydantic (constrained decoding)

Anthropic: messages.parse() with Pydantic (constrained decoding)

When to use native SDKs vs instructor

Use messages.parse() on Anthropic and completions.parse() on OpenAI when you know your provider. They give you constrained decoding without the overhead of a retry loop. Reserve instructor for multi-provider code — when you need to swap between OpenAI and Anthropic without changing application code, or when you need customized retry logic.

Anthropic schema limitations

output_config does not support recursive schemas, min/max constraints on strings or numbers, additionalProperties: true, or external $ref. If your Pydantic model uses these features, the SDK will attempt to strip them and update field descriptions. Test your schema at integration time, not in production.

Instructor: Cross-Provider Structured Output

instructor wraps any provider's API with Pydantic validation and automatic retry. When a response fails validation, it sends the error back to the model as context and retries — up to max_retries times. Use it when you need provider portability, when native SDK ergonomics do not cover your case, or when you need semantic retries (not just syntactic compliance).

instructor: cross-provider extraction with Pydantic and retry

Each retry is a full LLM call

instructor's max_retries=3 means up to 4 total API calls if all retries fail. At high volume, even a 5% semantic failure rate can double your effective per-extraction cost. Use native constrained decoding to eliminate syntactic failures, then only retry for semantic validator failures — which should be rare when your Pydantic model is well-designed.

How Structured Output Fails

Real project

A team ran instructor with max_retries=5 on a document extraction pipeline processing thousands of requests per hour. Schema-valid JSON was already guaranteed by Anthropic's output_config — but the retry loop was still running on semantic validator failures. Analyzing logs showed 8% retry rate, adding 40% to token costs. The fix: tighten the Pydantic validators to only raise on genuinely unrecoverable data quality issues, and separate hard failures (stop processing) from soft failures (flag for human review). Retries are for unrecoverable extraction failures, not data quality signals — log every validation failure as an extraction quality metric.

▸Grammar compilation latency (50–200 ms) on the first request after the 24-hour cache expires — mitigate by warming the cache on startup
▸Model refusals bypass constrained decoding: a refusal returns plain text, not schema-valid JSON — always check finish_reason before parsing
▸Quality degradation: schemas with many required fields force the model to generate values for fields it cannot infer — you get plausible-looking garbage
▸Schema limitation failures on Anthropic: recursive schemas, min/max constraints, and additionalProperties: true silently fail or get stripped
▸Cross-field validation cannot run during streaming — validate the full object only after the stream completes

Handling refusals and schema limits safely

Extract, validate, and retry with specific error feedback until quality passes

Streaming Structured Output

Both providers support streaming with constrained decoding. You receive tokens incrementally while the grammar constraint applies throughout. Partial JSON is invalid until the stream completes — use partial parsers or instructor's streaming helpers to extract fields as they arrive.

Streaming with instructor partial models (OpenAI)

Cross-field validation waits for the full response

Pydantic cross-field validators (like end_date > start_date) cannot run during streaming because both fields may not be populated yet. Show partial results in the UI but only run final validation after the stream completes. Flag the output as provisional until then.

Defense in Depth: Three Validation Layers

Constrained decoding handles syntax. It does not catch semantic errors. A schema-valid response can still contain age: -5, score: 1.5 for a 0–1 field, or date: 2099-13-45. Production extraction needs three distinct validation layers, each catching different classes of failures.

All three layers needed — schema compliance does not guarantee correct values

All three layers in a single extraction pipeline

Set per-field minimum recall as a CI gate

Build an eval set of 50–100 real documents with known ground-truth extractions. Gate CI on per-field precision and recall thresholds — not just 'does it parse.' A field with 70% recall is a prompt problem, not a Pydantic problem. Catch it before it ships.

Best Practices

✓Use messages.parse() on Anthropic and completions.parse() on OpenAI — they handle schema generation and typed output natively
✓Reuse the same Pydantic model class across requests to benefit from 24-hour grammar caching
✓Add @field_validator and @model_validator for domain-specific constraints — type coercion alone is not enough
✓Check finish_reason before parsing — model refusals return text, not schema-valid JSON
✓Log every validation failure as an extraction quality metric, not just an error
✓Build an eval set with ground-truth labels and gate CI on per-field recall thresholds
✓Use instructor for cross-provider code; use native SDKs when the provider is fixed
✓Separate hard extraction failures (stop processing) from soft failures (flag for review)

Don’t

✗Don't use json.loads() directly — you lose type safety and get no validation
✗Don't use JSON mode in new production code — it guarantees syntax, not schema
✗Don't set max_retries higher than 2–3 without measuring the failure rate first — retries cost full token counts
✗Don't use min/max constraints or recursive schemas with Anthropic's output_config — they are silently stripped
✗Don't assume schema-valid JSON is semantically correct — add field validators for ranges, formats, and cross-field rules
✗Don't run cross-field validators during streaming — both fields may not be populated yet
✗Don't use constrained decoding for long-form reasoning — it constrains what the model can express, not just what it returns
✗Don't reconstruct Pydantic models dynamically per-request — the grammar cache will miss every time

Key Takeaways

✓Both OpenAI and Anthropic now support native constrained decoding — the model physically cannot produce tokens that violate the schema grammar.
✓Use completions.parse() on OpenAI and messages.parse() on Anthropic; reserve instructor for cross-provider code or custom retry logic.
✓Constrained decoding handles syntax only — add Pydantic field validators for semantic correctness and business logic checks for domain rules.
✓Grammar compilation adds latency on first use; reuse the same Pydantic model class across requests to benefit from 24-hour caching.
✓Each instructor retry is a full API call — at scale, even a 5% semantic failure rate meaningfully increases extraction cost.
✓JSON mode (json_object response format) is legacy on both providers — it guarantees well-formed JSON, not schema compliance.

Video on this topic

Getting reliable JSON from LLMs every time

tiktok

←

Techniques That Work

Systematic Prompt Iteration

→