Intermediate · 14 min

Durable Execution

With a checkpointer attached, LangGraph can checkpoint graph state after every super-step, so a crashed run resumes without losing completed work. Three durability modes (exit, async, sync) let you trade write overhead against recovery guarantees.

Quick Reference

→Durable execution: graph state is checkpointed as the run progresses, so after a crash the graph resumes from the last checkpoint, not the beginning
→durability='sync': the checkpoint write completes before the next step starts; safest, at the cost of one blocking DB write per step
→durability='async': the checkpoint is written while the next step runs; little added latency, with a small window in which a crash can lose the latest checkpoint
→durability='exit': the checkpoint is written only when the graph exits; fastest, but no mid-run crash recovery
→Completed nodes are skipped on recovery; only the interrupted super-step re-executes
→MemorySaver cannot provide durable execution because its checkpoints live in process memory and vanish with the process; use PostgresSaver or SqliteSaver for production
→Wrap external side effects in idempotency keys: exactly-once applies to the graph, not to your API calls
→@task in the Functional API caches results in the checkpoint, preventing re-execution of completed tasks

Should You Use Durable Execution?

Most short-lived agents don't need durable execution. A request-response agent that runs in under 30 seconds, fails fast, and can be retried from scratch doesn't benefit from per-step checkpointing — it only pays the write overhead. Durable execution earns its cost when: your workflow runs for minutes or hours (LLM calls, tool chains, human approvals), your graph calls external APIs mid-execution where re-running from scratch causes duplicates (emails, payments, database writes), or you're building human-in-the-loop flows where a crashed server between an interrupt and a resume would lose the user's state.

When NOT to use durable execution

Skip durable execution for: simple chat agents (the checkpointer already persists conversation history between turns), pipelines that run in under 10 seconds and are safe to retry in full, and batch jobs where starting over is cheaper than managing checkpoint storage. A persistent checkpointer is still useful for memory across turns, but that is different from relying on durable execution semantics for crash recovery.

Durability Modes: Exit, Async, Sync

LangGraph exposes a durability parameter on invoke() and stream() that controls when checkpoints are written. This is the core production decision for any workflow where crash recovery matters. Choosing the wrong mode means either paying unnecessary write overhead or having no crash recovery when you thought you did.
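
To make the trade-off concrete, here is a toy model of the three modes. It is a sketch under simplifying assumptions, not LangGraph's implementation: `last_durable_step` is a hypothetical function, steps are numbered from 1, a "crash" means the process dies while a step is running, and async writes are modeled in the worst case, where a write still in flight at crash time is lost.

```python
from typing import Optional


def last_durable_step(
    mode: str, total_steps: int, crash_during: Optional[int] = None
) -> int:
    """Return the highest step whose checkpoint is safely stored.

    0 means nothing was persisted. `crash_during` is the 1-based step
    that is running when the process dies, or None for a clean run.
    """
    if mode not in ("exit", "async", "sync"):
        raise ValueError(f"unknown durability mode: {mode}")

    persisted = 0     # highest step with a confirmed checkpoint
    in_flight = None  # async only: a write started but not yet confirmed

    for step in range(1, total_steps + 1):
        if crash_during == step:
            # The process dies mid-step; any in-flight write is assumed lost.
            return persisted
        # The step finished. In async mode, the previous step's write
        # is modeled as having completed while this step was running.
        if mode == "async" and in_flight is not None:
            persisted = in_flight
            in_flight = None
        if mode == "sync":
            persisted = step  # block until the write is confirmed
        elif mode == "async":
            in_flight = step  # write proceeds alongside the next step
        # mode == "exit": nothing is written until the graph exits

    # Clean exit: every mode ends with the final state persisted.
    return total_steps


# Crash while step 3 of 4 is running:
after_sync = last_durable_step("sync", 4, crash_during=3)
after_async = last_durable_step("async", 4, crash_during=3)
after_exit = last_durable_step("exit", 4, crash_during=3)
```

In this model sync keeps both completed steps, async may lose the most recent one, and exit keeps nothing. In LangGraph itself the mode is chosen per call through the durability parameter described above, for example durability='sync' on invoke().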

How Crash Recovery Works

When you call invoke() with a thread_id that already has a checkpoint, LangGraph reads the latest checkpoint from the backend and skips every node that has already completed. Only the interrupted super-step re-executes; in sync mode, that re-execution starts from the beginning of the interrupted node, with its inputs taken from the checkpoint. Recovery goes through the ordinary invoke() entry point with the same thread_id: there is no special recovery logic and no restart API. Per the LangGraph documentation, pass None as the input when resuming after an error, so the run continues from the checkpoint rather than starting over with new input.
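
The resume behavior can be modeled in a few lines of standard-library Python. This is a toy checkpointed runner, not LangGraph's internals: `run`, `STEPS`, the `checkpoints` table, and the step functions are hypothetical, an in-memory SQLite connection stands in for a durable backend, and a raised exception stands in for a process crash.

```python
import json
import sqlite3
from typing import Optional

# An in-memory SQLite database stands in for a durable checkpoint backend.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE checkpoints ("
    "thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
)

executed: list = []            # names of the steps that actually ran
crash_armed = {"value": True}  # makes summarize fail on its first attempt


def fetch(state: dict) -> dict:
    executed.append("fetch")
    return {**state, "doc": "raw text"}


def summarize(state: dict) -> dict:
    executed.append("summarize")
    if crash_armed["value"]:
        crash_armed["value"] = False
        raise RuntimeError("simulated crash")
    return {**state, "summary": state["doc"].upper()}


def publish(state: dict) -> dict:
    executed.append("publish")
    return {**state, "published": True}


STEPS = [fetch, summarize, publish]


def run(thread_id: str, initial_state: Optional[dict] = None) -> dict:
    """Run all steps, resuming from the thread's checkpoint if one exists."""
    row = db.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    if row is None:
        next_step, state = 0, dict(initial_state or {})
    else:
        # Skip completed steps and restore the state they produced.
        next_step, state = row[0], json.loads(row[1])
    for index in range(next_step, len(STEPS)):
        state = STEPS[index](state)
        # Checkpoint after every step, before moving on to the next one.
        db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        db.commit()
    return state


try:
    run("thread-1", {"url": "https://example.com"})
except RuntimeError:
    pass  # the first attempt "crashed" during summarize

# Same thread_id, no special recovery call: fetch is skipped.
final_state = run("thread-1")
```

After the second call, `fetch` has run once while `summarize` has run twice: the interrupted step starts over from its checkpointed inputs. That repeat is exactly why side effects inside a step need to be idempotent.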
