Advanced · 20 min

Error Handling & Retry

Build production error handling that classifies first, retries selectively, and degrades gracefully. Covers the full stack: error classification, exponential backoff, circuit breakers, fallback chains, dead letter queues, idempotency for tool calls, and the metrics that tell you when any of it is failing.

Quick Reference

  • Classify errors before any retry: retriable (429, 5xx, timeouts) vs. non-retriable (401, 400, content policy)
  • Import RetryPolicy from `langgraph.types`, not `langgraph.pregel` — the public API path
  • Formula: delay = min(initial × backoff^attempt + jitter, max_interval); cap total retry budget at 30s
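  • Retries multiply token cost at any model's pricing, claude-sonnet-4-6 included: LangGraph's RetryPolicy counts the first call in max_attempts, so max_attempts=3 caps a worst-case request at 3× the no-retry price (a setting that counts only retries would cap it at 4×)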
  • Circuit breakers stop retries from making provider outages worse: CLOSED → OPEN after N failures, OPEN → HALF-OPEN after timeout
  • Fallback chain: primary model → cross-provider model → semantic cache → static message
  • Idempotency keys prevent duplicate side effects when retrying tool calls that write to external systems
  • Monitor retry_rate (>15% = systemic issue), fallback_rate (>5% = primary degraded), DLQ depth (>100 = replay needed)
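The backoff formula in the list above can be sketched in a few lines. This is a minimal illustration, not LangGraph's implementation: the function names, the default values, and the choice to count only sleep time against the 30s budget are all assumptions made for the example.

```python
import random


def backoff_delay(attempt, initial=0.5, backoff=2.0, max_interval=8.0,
                  jitter=0.25, rng=random):
    """delay = min(initial * backoff**attempt + jitter, max_interval).

    `attempt` is 0 for the first retry. Jitter is drawn uniformly from
    [0, jitter] seconds so that many clients do not retry in lockstep.
    """
    noise = rng.uniform(0.0, jitter)
    return min(initial * backoff ** attempt + noise, max_interval)


def retry_schedule(max_retries, budget_s=30.0, **kwargs):
    """Return the delays for successive retries, stopping early if the
    next delay would push total sleep time past budget_s.

    Assumption: the budget covers sleep time only, not the time spent
    inside each call.
    """
    delays, total = [], 0.0
    for attempt in range(max_retries):
        delay = backoff_delay(attempt, **kwargs)
        if total + delay > budget_s:
            break
        delays.append(delay)
        total += delay
    return delays
```

With jitter disabled, `retry_schedule(4, jitter=0.0)` yields 0.5, 1.0, 2.0, and 4.0 seconds. Asking for 20 retries still stops once the 30s budget would be exceeded.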

Error Classification: The First Branch

Retriable vs. non-retriable is the first decision

Every error must be classified before a retry decision is made. Retrying a non-retriable error wastes tokens and time. Not retrying a retriable error throws away a request that would have succeeded. This classification shapes every other pattern in this article.

| Error Type | Retriable? | Examples | Strategy |
|---|---|---|---|
| Rate Limit (429) | Yes | Anthropic 429, OpenAI 429 | Backoff with jitter; respect Retry-After header |
| Timeout | Yes | LLM call exceeds 30s, tool API timeout | Retry 2–3 times with same or shorter timeout |
| Server Error (5xx) | Yes | Provider outage, internal service error | Backoff with jitter; open circuit after N failures |
| Auth Failure (401/403) | No | Expired API key, revoked permissions | Fail fast, alert on-call, do not retry |
| Invalid Input (400) | No | Malformed tool args, schema validation failure | Log input + state, return user-facing error |
| Content Policy (400) | No | Model refused due to content policy | Route to guardrail handler, do not retry with same input |
| Context Length Exceeded | Conditional | Input + history exceeds model window | Retry with truncated context or summarized history |
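The rate-limit row says to respect the Retry-After header. That header can carry either a number of seconds or an HTTP date, so a small parser helps. The sketch below uses only the standard library; the function name `retry_after_seconds` and the 30s cap are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, cap=30.0):
    """Return how long to wait per a Retry-After header, in seconds.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to its own backoff delay. The result is clamped to
    [0, cap] so a misbehaving server cannot stall the client indefinitely.
    """
    if not header_value:
        return None
    value = header_value.strip()
    try:
        seconds = float(value)  # delta-seconds form, e.g. "5"
    except ValueError:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    return max(0.0, min(seconds, cap))
```

A caller would typically use this value when it is present and otherwise fall back to its own jittered exponential delay.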

Wrap all external calls in typed exceptions that carry the retriability classification. This lets your retry logic branch on error type rather than parsing status codes or error messages at every call site. Here is the hierarchy for Anthropic calls; extend it for each provider you use.

Typed exception hierarchy — one classify_error call per provider
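One possible shape for such a hierarchy is sketched below. It is not the original listing. The class names, the `Retriability` enum, and the message markers used to split the 400 cases are all assumptions for illustration. The classifier duck-types on a `status_code` attribute rather than importing a provider SDK, and it treats the built-in `TimeoutError` as the timeout signal. Adapt both to the exception types your client library actually raises.

```python
from enum import Enum


class Retriability(Enum):
    YES = "yes"
    NO = "no"
    CONDITIONAL = "conditional"  # retry only after changing the request


class LLMError(Exception):
    """Base class; unknown errors default to non-retriable."""
    retriability = Retriability.NO

    def __init__(self, message, status_code=None):
        super().__init__(message)
        self.status_code = status_code


class RateLimitError(LLMError):
    retriability = Retriability.YES


class ProviderTimeoutError(LLMError):
    retriability = Retriability.YES


class ServerError(LLMError):
    retriability = Retriability.YES


class AuthError(LLMError):
    pass


class InvalidInputError(LLMError):
    pass


class ContentPolicyError(LLMError):
    pass


class ContextLengthError(LLMError):
    retriability = Retriability.CONDITIONAL


# Hypothetical substrings; check your provider's real error messages.
CONTEXT_MARKERS = ("prompt is too long", "context length")
POLICY_MARKERS = ("content policy",)


def classify_error(exc):
    """Map a raw exception to a typed LLMError carrying its classification."""
    status = getattr(exc, "status_code", None)
    text = str(exc).lower()
    if isinstance(exc, TimeoutError):
        cls = ProviderTimeoutError
    elif status == 429:
        cls = RateLimitError
    elif status in (401, 403):
        cls = AuthError
    elif status == 400 and any(m in text for m in CONTEXT_MARKERS):
        cls = ContextLengthError
    elif status == 400 and any(m in text for m in POLICY_MARKERS):
        cls = ContentPolicyError
    elif status == 400:
        cls = InvalidInputError
    elif isinstance(status, int) and status >= 500:
        cls = ServerError
    else:
        cls = LLMError
    return cls(str(exc), status_code=status)
```

Retry logic can then branch on `err.retriability` at a single point instead of inspecting status codes at every call site.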
[Figure: Request → Try → Success or Failure. On Failure → Classify. Retriable (yes) → Exponential Backoff (2^n × base delay) → Retry. Non-Retriable (no) → Fallback Chain → Dead Letter Queue (manual review).]

Error retry flow: classify errors, retry with backoff, or route to dead letter queue
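The flow in the figure can be sketched as one function. This is a simplified illustration under stated assumptions. `call_with_retry`, its parameters, and the in-memory list standing in for a dead letter queue are hypothetical. Jitter is omitted to mirror the figure's 2^n × base delay. Retriable errors that exhaust their attempts are sent down the same fallback path as non-retriable ones, which is a design choice the figure does not spell out.

```python
import time


def call_with_retry(primary, fallbacks, is_retriable, dead_letters, request,
                    max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try `primary`; retry retriable failures with exponential backoff;
    otherwise walk the fallback chain; finally park the request in the DLQ.

    `is_retriable` is a caller-supplied predicate (exception -> bool).
    `dead_letters` is any object with .append(), standing in for a real queue.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return primary(request)
        except Exception as exc:
            last_error = exc
            if not is_retriable(exc):
                break  # non-retriable: skip straight to the fallback chain
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 2^n * base delay

    for fallback in fallbacks:
        try:
            return fallback(request)
        except Exception as exc:
            last_error = exc  # keep going down the chain

    # Nothing worked: record the request for manual review or later replay.
    dead_letters.append({"request": request, "error": repr(last_error)})
    return None
```

Passing `sleep` as a parameter keeps the function testable without real delays. In production the default `time.sleep` (or an async equivalent) would apply.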