Advanced · 10 min
Error Handling & Retry
Production-grade error handling: retry strategies, fallback chains, dead letter queues, and graceful degradation patterns.
Quick Reference
- Classify errors into retriable (rate limits, timeouts) and non-retriable (invalid input, auth failure), and handle each class differently
- Use exponential backoff with jitter: delay = min(base * 2^attempt + random_jitter, max_delay)
- Implement fallback chains for LLM calls: primary model → fallback model → cached response → error message
- Store failed requests in a dead letter queue for manual review and replay after fixes
- Use LangGraph's retry_policy on nodes for automatic retry with configurable backoff and max attempts
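The backoff formula and the fallback chain from the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: `backoff_delay`, `answer_with_fallbacks`, `FALLBACK_MESSAGE`, and the default values for `base`, `jitter`, and `max_delay` are all assumptions chosen for the example.

```python
import random
from typing import Callable, Dict, Sequence

# Hypothetical user-facing message for the last step of the chain.
FALLBACK_MESSAGE = "The service is temporarily unavailable. Please try again."


def backoff_delay(attempt: int, base: float = 0.5, max_delay: float = 30.0,
                  jitter: float = 0.25) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Implements delay = min(base * 2^attempt + random_jitter, max_delay).
    """
    random_jitter = random.uniform(0.0, jitter)
    return min(base * 2 ** attempt + random_jitter, max_delay)


def answer_with_fallbacks(prompt: str,
                          models: Sequence[Callable[[str], str]],
                          cache: Dict[str, str]) -> str:
    """Try each model in order, then the cache, then a fixed error message."""
    for call_model in models:  # primary model first, then fallback models
        try:
            answer = call_model(prompt)
        except Exception:
            # Simplification: a real chain would only fall through on
            # errors it has classified as worth falling back from.
            continue
        cache[prompt] = answer  # remember the last good answer
        return answer
    if prompt in cache:  # cached response
        return cache[prompt]
    return FALLBACK_MESSAGE  # final step: error message
```

The jitter term spreads retries from many clients over time so they do not all hit the provider at the same instant after an outage.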
Error Classification
Retriable vs non-retriable is the first branch
Every error must be classified before a retry decision is made. Retrying a non-retriable error wastes time and tokens. Not retrying a retriable error loses a request unnecessarily.
| Error Type | Retriable? | Examples | Strategy |
|---|---|---|---|
| Rate Limit (429) | Yes | OpenAI 429, Anthropic rate limit | Backoff with jitter, respect Retry-After header |
| Timeout | Yes | LLM call exceeds 30s, tool API timeout | Retry with same or shorter timeout, max 2-3 attempts |
| Server Error (5xx) | Yes | Provider outage, internal service error | Backoff with jitter, switch to fallback after N failures |
| Auth Failure (401/403) | No | Expired API key, revoked permissions | Fail fast, alert on-call, do not retry |
| Invalid Input (400) | No | Malformed tool args, schema validation failure | Log input + state, return user-facing error |
| Content Policy (400) | No | Model refused due to content policy | Route to guardrail handler, do not retry with same input |
| Context Length Exceeded | Conditional | Input + history exceeds model window | Retry with truncated context or summarized history |
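
The rate-limit row mentions respecting the Retry-After header. That header can carry either a number of seconds or an HTTP date, so a small parser is useful. The sketch below uses only the standard library; the function name `retry_after_seconds` and its `now` parameter are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value into seconds, or None if absent/invalid."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

One reasonable policy is to wait for the larger of this value and the computed backoff delay, so the client never retries sooner than the server asked.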
Wrap all external calls in typed exceptions that carry the retriability classification. This lets your retry logic branch on error type rather than parsing error messages or status codes at every call site.
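
A minimal sketch of such a typed hierarchy might look like the following. The class names and the `error_from_status` helper are hypothetical, not part of any provider SDK, and the status-code mapping covers only the rows of the table that can be told apart by status alone.

```python
from typing import Optional


class ExternalCallError(Exception):
    """Base class for failures of external calls; non-retriable by default."""
    retriable = False

    def __init__(self, message: str = "", *, status_code: Optional[int] = None):
        super().__init__(message)
        self.status_code = status_code


class RetriableError(ExternalCallError):
    retriable = True


class RateLimitError(RetriableError):
    """429 from the provider."""


class ServerError(RetriableError):
    """5xx from the provider."""


class AuthError(ExternalCallError):
    """401 or 403: fail fast, do not retry."""


class InvalidInputError(ExternalCallError):
    """400: malformed arguments or schema validation failure."""


def error_from_status(status_code: int, message: str = "") -> ExternalCallError:
    """Translate an HTTP status code into a typed error (assumed mapping)."""
    if status_code == 429:
        return RateLimitError(message, status_code=status_code)
    if 500 <= status_code <= 599:
        return ServerError(message, status_code=status_code)
    if status_code in (401, 403):
        return AuthError(message, status_code=status_code)
    if status_code == 400:
        # Content-policy refusals also arrive as 400; telling them apart
        # needs the response body, which this sketch does not inspect.
        return InvalidInputError(message, status_code=status_code)
    return ExternalCallError(message, status_code=status_code)
```

With this in place, retry logic can check `exc.retriable` or catch `RetriableError` directly instead of inspecting status codes at each call site.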
Figure: Error retry flow. Classify the error, retry with backoff, or route to the dead letter queue.
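
The flow (classify, retry with backoff, or route to the dead letter queue) can be sketched as a single wrapper function. This is an illustrative sketch under stated assumptions: `run_with_retry`, the two exception classes, and the in-memory list standing in for a dead letter queue are all hypothetical, and a production queue would be durable storage rather than a Python list.

```python
import random
import time
from typing import Any, Callable, Dict, List


class RetriableError(Exception):
    """Rate limits, timeouts, 5xx."""


class NonRetriableError(Exception):
    """Auth failures, invalid input, content-policy refusals."""


def run_with_retry(call: Callable[[Any], Any], request: Any,
                   dead_letters: List[Dict[str, Any]],
                   max_attempts: int = 3, base: float = 0.5,
                   max_delay: float = 30.0,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `call(request)`, retrying retriable errors with backoff.

    Requests that fail permanently are recorded in `dead_letters`
    before the error is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call(request)
        except NonRetriableError as exc:
            # Fail fast: no retry, straight to the dead letter queue.
            dead_letters.append({"request": request, "error": repr(exc),
                                 "attempts": attempt + 1})
            raise
        except RetriableError as exc:
            last_error = exc
            if attempt + 1 < max_attempts:
                jitter = random.uniform(0.0, base)
                sleep(min(base * 2 ** attempt + jitter, max_delay))
    # Retries exhausted: park the request for manual review and replay.
    dead_letters.append({"request": request, "error": repr(last_error),
                         "attempts": max_attempts})
    raise last_error
```

The `sleep` parameter is injectable so that tests can run without real delays.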