Error Handling & Retry
Build production error handling that classifies first, retries selectively, and degrades gracefully. Covers the full stack: error classification, exponential backoff, circuit breakers, fallback chains, dead letter queues, idempotency for tool calls, and the metrics that tell you when any of it is failing.
Quick Reference
- →Classify errors before any retry: retriable (429, 5xx, timeouts) vs. non-retriable (401, 400, content policy)
- →Import RetryPolicy from `langgraph.types` (the public API path), not from `langgraph.pregel`
- →Formula: delay = min(initial × backoff^attempt + jitter, max_interval); cap total retry budget at 30s
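- →At claude-sonnet-4-6 pricing, a request that fails and is retried three times costs up to 4× the no-retry price; if `max_attempts` counts the initial call, as LangGraph's `RetryPolicy` does, max_attempts=3 caps the worst case at 3×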
- →Circuit breakers stop retries from making provider outages worse: CLOSED → OPEN after N failures, OPEN → HALF-OPEN after timeout
- →Fallback chain: primary model → cross-provider model → semantic cache → static message
- →Idempotency keys prevent duplicate side effects when retrying tool calls that write to external systems
- →Monitor retry_rate (>15% = systemic issue), fallback_rate (>5% = primary degraded), DLQ depth (>100 = replay needed)
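To make the backoff formula from the list above concrete, here is a minimal sketch. The function names, the default values for `initial`, `backoff`, `max_interval` and `jitter`, and the injectable `rng` parameter are all illustrative assumptions, not part of any library. The budget check counts only time spent sleeping, not the duration of the requests themselves.

```python
import random


def backoff_delay(attempt, initial=0.5, backoff=2.0, max_interval=8.0,
                  jitter=0.25, rng=random.random):
    """delay = min(initial * backoff**attempt + jitter, max_interval)."""
    noise = rng() * jitter  # uniform noise in [0, jitter)
    return min(initial * backoff ** attempt + noise, max_interval)


def delays_within_budget(max_retries, budget=30.0, **kwargs):
    """Return the per-retry delays that fit inside the total retry budget (seconds)."""
    delays, total = [], 0.0
    for attempt in range(max_retries):
        delay = backoff_delay(attempt, **kwargs)
        if total + delay > budget:
            break  # the next sleep would exceed the budget, so stop retrying
        delays.append(delay)
        total += delay
    return delays
```

With jitter disabled, ten requested retries shrink to six (0.5s, 1s, 2s, 4s, 8s, 8s), because a seventh sleep would push the total past the 30s budget.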
Error Classification: The First Branch
Every error must be classified before a retry decision is made. Retrying a non-retriable error wastes tokens and time. Not retrying a retriable error throws away a request that would have succeeded. This classification shapes every other pattern in this article.
| Error Type | Retriable? | Examples | Strategy |
|---|---|---|---|
| Rate Limit (429) | Yes | Anthropic 429, OpenAI 429 | Backoff with jitter; respect Retry-After header |
| Timeout | Yes | LLM call exceeds 30s, tool API timeout | Retry 2–3 times with same or shorter timeout |
| Server Error (5xx) | Yes | Provider outage, internal service error | Backoff with jitter; open circuit after N failures |
| Auth Failure (401/403) | No | Expired API key, revoked permissions | Fail fast, alert on-call, do not retry |
| Invalid Input (400) | No | Malformed tool args, schema validation failure | Log input + state, return user-facing error |
| Content Policy (400) | No | Model refused due to content policy | Route to guardrail handler, do not retry with same input |
| Context Length Exceeded | Conditional | Input + history exceeds model window | Retry with truncated context or summarized history |
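The table above can be expressed as a small classification function. This is a sketch under stated assumptions: the `Retriability` enum, the `classify` signature, and the `retry_after_seconds` helper are hypothetical names, and treating any unlisted status as non-retriable is a conservative choice of this example rather than something the table prescribes.

```python
from enum import Enum
from typing import Mapping, Optional


class Retriability(Enum):
    RETRIABLE = "retriable"
    NON_RETRIABLE = "non_retriable"
    CONDITIONAL = "conditional"  # retry only after changing the request


def classify(status: Optional[int], *, timed_out: bool = False,
             context_overflow: bool = False) -> Retriability:
    """Map an error onto the table's three retry classes."""
    if timed_out:
        return Retriability.RETRIABLE
    # Check the overflow flag before the status code: the table lists
    # context-length errors separately from ordinary invalid input.
    if context_overflow:
        return Retriability.CONDITIONAL
    if status == 429 or (status is not None and 500 <= status <= 599):
        return Retriability.RETRIABLE
    # 400, 401, 403 and anything unrecognised: fail fast (assumption).
    return Retriability.NON_RETRIABLE


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read a numeric Retry-After header; HTTP-date values are ignored here."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return float(lowered["retry-after"])
    except (KeyError, ValueError):
        return None
```

When `retry_after_seconds` returns a value for a 429, it would typically take precedence over the computed backoff delay.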
Wrap all external calls in typed exceptions that carry the retriability classification. This lets your retry logic branch on error type rather than parsing status codes or error messages at every call site. Here is the hierarchy for Anthropic calls; extend it for each provider you use.
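A minimal sketch of such a hierarchy might look like the following. All class names here are hypothetical and defined locally; they are not the Anthropic SDK's own exception types, and the code that translates SDK exceptions into these classes is left out. The `call_with_retry` helper and its fixed backoff constants are likewise illustrative assumptions.

```python
import time


class LLMError(Exception):
    """Base class; every subclass carries its retriability classification."""
    retriable = False

    def __init__(self, message, status_code=None):
        super().__init__(message)
        self.status_code = status_code


class RetriableError(LLMError):
    retriable = True


class RateLimitError(RetriableError): pass        # 429
class ProviderTimeoutError(RetriableError): pass  # call exceeded its timeout
class ServerError(RetriableError): pass           # 5xx


class NonRetriableError(LLMError): pass
class AuthError(NonRetriableError): pass           # 401/403
class InvalidInputError(NonRetriableError): pass   # 400
class ContentPolicyError(NonRetriableError): pass  # 400, model refusal


class ContextLengthError(LLMError):
    """Conditional: not retried as-is; the caller must truncate or summarize first."""
    retriable = False


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying only errors whose class is marked retriable."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except LLMError as exc:
            last_attempt = attempt == max_attempts - 1
            if not exc.retriable or last_attempt:
                raise  # fail fast, or give up after the final attempt
            sleep(min(0.5 * 2 ** attempt, 8.0))  # simple capped backoff
```

In this sketch `max_attempts` includes the first call, so the retry logic branches on `exc.retriable` alone and never inspects status codes or error messages at the call site.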
Figure: Error retry flow (classify errors, retry with backoff, or route to dead letter queue)