Intermediate · 14 min

Durable Execution

With a checkpointer configured, LangGraph can save graph state after every super-step, so a crashed run resumes from the last saved step instead of starting over. Three durability modes (exit, async, sync) let you trade write overhead against recovery guarantees.

Quick Reference

  • Durable execution: graph state is checkpointed after every super-step so the graph resumes from the crash point, not the beginning
  • durability='sync': persists the checkpoint synchronously before the next step starts; safest, but every step waits on one blocking DB write
  • durability='async': persists the checkpoint in the background while the next step runs; low write overhead, but a crash during the write can lose that checkpoint
  • durability='exit': persists a checkpoint only when the graph exits; fastest, but offers no recovery from a mid-run crash
  • Completed nodes are skipped on recovery — only the interrupted super-step re-executes
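  • MemorySaver cannot provide durable execution because its checkpoints live in process memory and are lost on a crash; use a persistent checkpointer such as PostgresSaver for production (SqliteSaver suits local or single-machine workloads)
  • Attach idempotency keys to external side effects: exactly-once applies to the graph's checkpointed steps, not to the API calls a re-executed step makes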
  • @task in the Functional API caches results in the checkpoint, preventing re-execution of completed tasks
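
To make the recovery behavior above concrete, here is a minimal toy model in plain Python. It is not LangGraph code: `ToyStore`, `run`, and `demo` are hypothetical names invented for this sketch, the "crash" is a simulated process death, and only the 'sync' and 'exit' modes are modeled ('async' is left out because it needs background writes). It only illustrates why per-step checkpoints let a run skip completed steps, while an exit-only checkpoint forces a full re-run after a mid-run crash.

```python
import json


class ToyStore:
    """Stand-in for a persistent checkpointer (hypothetical, not a LangGraph class)."""

    def __init__(self):
        self.rows = {}   # thread_id -> JSON-encoded checkpoint
        self.writes = 0  # number of checkpoint writes performed

    def put(self, thread_id, next_step, state):
        self.rows[thread_id] = json.dumps({"next_step": next_step, "state": state})
        self.writes += 1

    def get(self, thread_id):
        raw = self.rows.get(thread_id)
        return json.loads(raw) if raw else None


def run(steps, store, thread_id, state, durability="sync", crash_at=None):
    """Run steps in order, resuming from the saved checkpoint if one exists."""
    saved = store.get(thread_id)
    start = saved["next_step"] if saved else 0
    state = saved["state"] if saved else state
    for i in range(start, len(steps)):
        if i == crash_at:
            # Simulated process death: nothing after this point gets to run.
            raise RuntimeError(f"simulated crash before step {i}")
        state = steps[i](state)
        if durability == "sync":
            # Persist before the next step is allowed to start.
            store.put(thread_id, i + 1, state)
    if durability == "exit":
        # Persist once, only when the run finishes.
        store.put(thread_id, len(steps), state)
    return state


calls = []  # records every step execution, including repeats


def make_step(name):
    def step(state):
        calls.append(name)
        return state + [name]
    return step


steps = [make_step(n) for n in ("fetch", "summarize", "send")]


def demo(durability):
    """Crash before the third step, then run again on the same thread."""
    calls.clear()
    store = ToyStore()
    try:
        run(steps, store, "t1", [], durability, crash_at=2)
    except RuntimeError:
        pass
    final = run(steps, store, "t1", [], durability)
    return final, list(calls), store.writes


print(demo("sync"))
print(demo("exit"))
```

In this sketch the 'sync' run executes each step once and pays three writes, while the 'exit' run pays one write but repeats "fetch" and "summarize" after the crash. That repetition is the cost the 'exit' bullet refers to.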

Should You Use Durable Execution?

Most short-lived agents don't need durable execution. A request-response agent that runs in under 30 seconds, fails fast, and can be retried from scratch gains nothing from per-step checkpointing; it only pays the write overhead. Durable execution earns its cost when:

  • your workflow runs for minutes or hours (LLM calls, tool chains, human approvals)
  • your graph calls external APIs mid-execution, where re-running from scratch causes duplicates (emails, payments, database writes)
  • you're building human-in-the-loop flows, where a server crash between an interrupt and a resume would lose the user's state
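
The duplicate-side-effect risk is usually handled with idempotency keys. The sketch below shows one way this might look, assuming the external service deduplicates on a caller-supplied key. `FakePaymentAPI`, `idempotency_key`, and `charge_customer` are hypothetical names for illustration, not part of LangGraph or any real payment SDK.

```python
import hashlib


class FakePaymentAPI:
    """Hypothetical external service that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> charge record

    def charge(self, amount_cents, idempotency_key):
        # A repeated key returns the original record instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "id": len(self.charges) + 1,
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]


def idempotency_key(thread_id, step_name, payload):
    """Derive a stable key from values that are identical on a re-run."""
    raw = f"{thread_id}:{step_name}:{payload}"
    return hashlib.sha256(raw.encode()).hexdigest()


def charge_customer(api, thread_id, amount_cents):
    """A step body that is safe to re-execute after a crash."""
    key = idempotency_key(thread_id, "charge_customer", amount_cents)
    return api.charge(amount_cents, idempotency_key=key)


api = FakePaymentAPI()
first = charge_customer(api, "thread-42", 1999)
retry = charge_customer(api, "thread-42", 1999)  # simulates re-execution after a crash
print(first == retry, len(api.charges))
```

The key is built only from inputs that stay the same across a retry (here the thread ID, step name, and amount). A key derived from a timestamp or a fresh random value would change on re-execution and defeat the deduplication.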

When NOT to use durable execution

Skip durable execution for simple chat agents (the checkpointer already stores conversation history between turns), pipelines that run in under 10 seconds and are safe to retry in full, and batch jobs where starting over is cheaper than managing checkpoint storage. A persistent checkpointer is still useful for memory across turns, but that is different from relying on durable execution semantics for crash recovery.