Advanced · 10 min

Error Handling & Retry

Production-grade error handling: retry strategies, fallback chains, dead letter queues, and graceful degradation patterns.

Quick Reference

  • Classify errors as retriable (rate limits, timeouts) or non-retriable (invalid input, auth failure), and handle each class differently
  • Use exponential backoff with jitter: delay = min(base * 2^attempt + random_jitter, max_delay)
  • Implement fallback chains for LLM calls: primary model → fallback model → cached response → error message
  • Store failed requests in a dead letter queue for manual review and replay after fixes
  • Use LangGraph's retry_policy on nodes for automatic retry with configurable backoff and max attempts
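The backoff formula in the list above can be sketched as a small helper. This is a minimal illustration, not a library API: the function name backoff_delay and the default values for base, max_delay, and jitter are assumptions chosen for the example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,        # assumed base delay in seconds
    max_delay: float = 30.0,  # assumed cap in seconds
    jitter: float = 0.25,     # assumed upper bound for the random jitter
    rng: Optional[random.Random] = None,
) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    Implements: delay = min(base * 2^attempt + random_jitter, max_delay)
    """
    source = rng if rng is not None else random
    random_jitter = source.uniform(0.0, jitter)
    return min(base * (2 ** attempt) + random_jitter, max_delay)


if __name__ == "__main__":
    for n in range(6):
        print(n, round(backoff_delay(n), 3))
```

The jitter spreads out clients that failed at the same moment so they do not all retry in lockstep, and the cap keeps late attempts from waiting unboundedly long.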

Error Classification

Retriable vs non-retriable is the first branch

Every error must be classified before a retry decision is made. Retrying a non-retriable error wastes time and tokens. Not retrying a retriable error loses a request unnecessarily.

Error Type | Retriable? | Examples | Strategy
--- | --- | --- | ---
Rate Limit (429) | Yes | OpenAI 429, Anthropic rate limit | Backoff with jitter, respect Retry-After header
Timeout | Yes | LLM call exceeds 30s, tool API timeout | Retry with same or shorter timeout, max 2-3 attempts
Server Error (5xx) | Yes | Provider outage, internal service error | Backoff with jitter, switch to fallback after N failures
Auth Failure (401/403) | No | Expired API key, revoked permissions | Fail fast, alert on-call, do not retry
Invalid Input (400) | No | Malformed tool args, schema validation failure | Log input + state, return user-facing error
Content Policy (400) | No | Model refused due to content policy | Route to guardrail handler, do not retry with same input
Context Length Exceeded | Conditional | Input + history exceeds model window | Retry with truncated context or summarized history
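The table can be read as a single classification function. The sketch below is one possible encoding: the names Retriability and classify, and the timed_out and context_exceeded flags, are hypothetical, and treating unknown errors as non-retriable is an assumption made here rather than something the table states.

```python
from enum import Enum
from typing import Optional


class Retriability(Enum):
    RETRIABLE = "retriable"
    NON_RETRIABLE = "non_retriable"
    CONDITIONAL = "conditional"  # retry only after changing the request


def classify(
    status: Optional[int],
    *,
    timed_out: bool = False,
    context_exceeded: bool = False,
) -> Retriability:
    """Map an HTTP status (or a timeout / context overflow) to a retry class."""
    if context_exceeded:
        # Retry only with truncated context or summarized history.
        return Retriability.CONDITIONAL
    if timed_out:
        return Retriability.RETRIABLE
    if status == 429:
        return Retriability.RETRIABLE
    if status is not None and 500 <= status <= 599:
        return Retriability.RETRIABLE
    # 400, 401, 403 and anything unrecognized: fail fast (assumed default).
    return Retriability.NON_RETRIABLE


if __name__ == "__main__":
    for code in (429, 503, 401, 400):
        print(code, classify(code).value)
```

Content policy refusals and invalid input both arrive as 400 in the table, so a real classifier would likely need to inspect the provider's error body to route them to different handlers.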

Wrap all external calls in typed exceptions that carry the retriability classification. This lets your retry logic branch on error type rather than parsing error messages or status codes at every call site.
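One way to carry the classification on the exception itself is a small hierarchy with a retriable flag, which a generic retry loop can then branch on. Everything named here (ExternalCallError, RateLimitError, AuthError, call_with_retry, and the default limits) is hypothetical and only illustrates the idea.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ExternalCallError(Exception):
    """Base class for wrapped external failures; subclasses set `retriable`."""

    retriable = False

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, e.g. parsed from Retry-After


class RateLimitError(ExternalCallError):
    retriable = True


class AuthError(ExternalCallError):
    retriable = False


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying only errors whose type is marked retriable."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ExternalCallError as exc:
            is_last = attempt == max_attempts - 1
            if not exc.retriable or is_last:
                raise  # fail fast, or give up after the final attempt
            if exc.retry_after is not None:
                delay = exc.retry_after  # prefer the server's hint
            else:
                delay = min(base * 2 ** attempt + random.uniform(0, base), max_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Because the loop only looks at exc.retriable, call sites never need to parse status codes or error messages; the wrapper that raises the typed exception is the single place where that mapping lives.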

Diagram: Request → Try → Success or Failure; Failure → Classify; Retriable (yes) → Exponential Backoff (2^n * base delay) → Retry; Non-Retriable (no) → Fallback Chain, Dead Letter Queue (manual review)

Error retry flow: classify errors, retry with backoff, or route to dead letter queue
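The tail of this flow, where retries are exhausted or the error is non-retriable, can be sketched as a fallback chain that ends in a dead letter queue. This is an illustration under stated assumptions: answer_with_fallbacks, the in-memory dict cache, and the list used as a dead letter queue are all hypothetical stand-ins, and recording a dead letter only when every option has failed is a choice made for this example.

```python
from typing import Callable, Dict, List, Sequence

ERROR_MESSAGE = "Sorry, something went wrong. Please try again later."


def answer_with_fallbacks(
    prompt: str,
    providers: Sequence[Callable[[str], str]],  # primary model first, then fallbacks
    cache: Dict[str, str],                      # stand-in for a response cache
    dead_letters: List[dict],                   # stand-in for a durable queue
) -> str:
    """Try each provider in order, then the cache, then degrade gracefully."""
    errors: List[str] = []
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # a real version would catch typed errors only
            name = getattr(provider, "__name__", "provider")
            errors.append(f"{name}: {exc}")

    cached = cache.get(prompt)
    if cached is not None:
        return cached

    # Nothing worked: keep the request and its errors for review and replay.
    dead_letters.append({"prompt": prompt, "errors": errors})
    return ERROR_MESSAGE
```

In production the dead letter store would be durable (a queue or database table) so that failed requests survive restarts and can be replayed once the underlying problem is fixed.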