Advanced · 14 min

RetryPolicy, Error Taxonomy & CachePolicy

Most LangGraph error handling fails not from missing retry logic, but from misclassifying errors. This article covers how LangGraph handles errors by default, a 4-category taxonomy for routing errors to the right handler, and the three production tools — RetryPolicy, self-healing feedback loops, and CachePolicy — with their failure modes and multi-layer retry pitfalls.

Quick Reference

→RetryPolicy defaults: max_attempts=3, initial_interval=0.5s, backoff_factor=2.0, max_interval=128s, jitter=True
→default_retry_on excludes ValueError, TypeError, RuntimeError, OSError and 8 more — code bugs are NOT retried by default
→For requests/httpx, default_retry_on only retries 5xx status codes — not 4xx
→When max_attempts is exhausted, the exception bubbles up and the graph stops — no built-in fallback hook
→Error taxonomy: transient → RetryPolicy, LLM-recoverable → feedback loop, user-fixable → interrupt(), unexpected → bubble up
→CachePolicy requires both cache_policy on the node AND cache=... at compile() — missing either produces zero cache hits with no error
→LLM SDK retries (max_retries) and LangGraph RetryPolicy multiply: worst case = (sdk_max_retries + 1) × lg_max_attempts total API calls, because each SDK call makes one initial request plus up to max_retries retries
→runtime.execution_info.node_attempt gives the current 1-indexed attempt number for in-node fallback logic
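
The retry defaults and the multi-layer multiplication above can be worked through with a small sketch. It does not import LangGraph. The backoff formula (initial_interval × backoff_factor^n, capped at max_interval) is an assumption about how the listed parameters combine, jitter is left out so the numbers are deterministic, and both function names are hypothetical. The call count assumes the SDK makes one initial request in addition to its max_retries retries.

```python
def backoff_delays(max_attempts=3, initial_interval=0.5,
                   backoff_factor=2.0, max_interval=128.0):
    """Seconds slept before each retry, ignoring jitter (assumed formula)."""
    # One sleep sits between consecutive attempts, so there are
    # max_attempts - 1 delays in total.
    return [
        min(max_interval, initial_interval * backoff_factor ** n)
        for n in range(max_attempts - 1)
    ]


def worst_case_api_calls(sdk_max_retries, lg_max_attempts):
    """Upper bound on HTTP requests when SDK retries nest inside RetryPolicy."""
    # Assumption: every LangGraph attempt triggers 1 initial SDK request
    # plus up to sdk_max_retries SDK-level retries.
    return (sdk_max_retries + 1) * lg_max_attempts


print(backoff_delays())                      # [0.5, 1.0]
print(backoff_delays(max_attempts=11)[-2:])  # [128.0, 128.0] (capped)
print(worst_case_api_calls(sdk_max_retries=2, lg_max_attempts=3))  # 9
```

With the defaults, a node that keeps failing waits roughly 1.5 seconds in total before the exception bubbles up, yet it may already have issued nine requests if the SDK retries underneath it.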

When Default Error Handling Is Enough

Before adding retry logic, ask whether you need it. Graphs that only call an LLM through a managed provider's SDK (Claude, GPT-4) already get HTTP-level retries from that SDK. Graphs with no external API calls will mostly surface programming errors, which should not be retried. Custom error handling adds complexity; add it only when you can name the specific failure mode it addresses.

When to add custom error handling

Add RetryPolicy when your node calls an external API that transiently fails (rate limits, network timeouts, 503s). Add a feedback loop when the LLM produces structurally invalid output. Add interrupt() when a human needs to provide missing information. Leave everything else to bubble up — unexpected errors are bugs, not runtime conditions.
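
One way to make this decision explicit is a small classifier that maps an exception to one of the four categories. The sketch below uses only the standard library. The enum, the two custom exception classes, and the classify function are hypothetical names introduced for illustration; a real graph would substitute the exceptions its own clients and validators raise.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "retry with RetryPolicy"
    LLM_RECOVERABLE = "feedback loop"
    USER_FIXABLE = "interrupt()"
    UNEXPECTED = "bubble up"


# Hypothetical application-level exceptions, defined here for illustration only.
class InvalidLLMOutputError(Exception):
    """The model returned output that failed structural validation."""


class MissingUserInputError(Exception):
    """A human has to supply information before the node can proceed."""


def classify(exc: BaseException) -> ErrorCategory:
    # Network-level failures are the typical transient case.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorCategory.TRANSIENT
    if isinstance(exc, InvalidLLMOutputError):
        return ErrorCategory.LLM_RECOVERABLE
    if isinstance(exc, MissingUserInputError):
        return ErrorCategory.USER_FIXABLE
    # Anything unrecognized is treated as a bug and should surface loudly.
    return ErrorCategory.UNEXPECTED


for err in (TimeoutError(), InvalidLLMOutputError(), KeyError("user_id")):
    print(type(err).__name__, "->", classify(err).value)
```

Keeping the fallthrough branch as UNEXPECTED means a new failure mode has to be named and classified deliberately before it gets any special handling.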

How LangGraph Handles Errors by Default

By default, when a node raises an exception, LangGraph stops the graph and surfaces the exception immediately. It does not silently retry, skip the node, or recover. This is deliberate — unexpected errors should be loud. The RetryPolicy default for retry_on is not 'all exceptions'. It uses default_retry_on, a function that explicitly excludes 12 common exception types. This matters: a node that raises RuntimeError or ValueError will NOT be retried by default, even with RetryPolicy() in place.
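
A practical consequence: if a client library signals a rate limit with a RuntimeError subclass, the default predicate will not retry it, and you need an explicit one. The sketch below shows the shape of such a predicate in plain Python. RateLimitError, retry_on_transient, and call_with_retries are hypothetical names, and the loop is a stand-in for illustration (no backoff), not LangGraph's retry implementation. It assumes retry_on accepts a function like this in place of default_retry_on.

```python
class RateLimitError(RuntimeError):
    """Hypothetical client error; subclasses RuntimeError on purpose."""


def retry_on_transient(exc: Exception) -> bool:
    """Return True only for failures that may succeed on a later attempt."""
    return isinstance(exc, (ConnectionError, TimeoutError, RateLimitError))


def call_with_retries(fn, should_retry, max_attempts=3):
    """Tiny stand-in for a retry loop (no backoff), for illustration only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up when attempts are exhausted or the error is not retryable.
            if attempt == max_attempts or not should_retry(exc):
                raise


calls = {"flaky": 0, "buggy": 0}


def flaky():
    # Fails twice with a rate limit, then succeeds on the third attempt.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RateLimitError("rate limited")
    return "ok"


def buggy():
    # A programming error: retrying cannot fix it.
    calls["buggy"] += 1
    raise ValueError("bad argument")


result = call_with_retries(flaky, retry_on_transient)

try:
    call_with_retries(buggy, retry_on_transient)
except ValueError:
    pass  # surfaced after a single attempt, as intended
```

Here flaky runs three times and returns, while buggy runs once and raises, which is the split between transient failures and code bugs that the default predicate is designed to preserve.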

Error Taxonomy: Classify Before You Handle

Classify errors by who can fix them → route to the right handler
