Long-Running Agents
When to build an agent that runs for hours instead of seconds — which orchestration framework to choose, how to compute real costs, the five ways long-running agents fail in production, and a reference implementation with checkpointing, error classification, idempotency, and budget enforcement.
Quick Reference
- If the task takes under 5 minutes and is safe to retry fully, skip durable execution and just retry the whole request
- Long-running agents fail in five specific ways: retry storms (429s), context window exhaustion, state serialization failures, idempotency violations, and session drift
- LangGraph checkpoints between nodes; Temporal checkpoints within workflow steps; Managed Agents (public beta) handles both. Choose based on how much mid-step durability you need
- Cost example: a 2-hour research agent (200 LLM calls, 2K input / 800 output tokens each) at Sonnet 4.6 pricing costs ~$3.60 in tokens; Opus 4.7 costs ~$6.00
- Classify errors before retrying: transient (5xx) → exponential backoff with jitter; rate-limited (429) → honor Retry-After exactly; fatal (400/401/403) → bubble up immediately
- Wrap every side effect (email, DB write, webhook) in an idempotency key, because resumption replays everything after the last checkpoint
- Output quality degrades non-linearly past ~60% context-window utilization, so long sessions need compaction or summarization to maintain quality across session boundaries
- Always return partial results when a budget fires: 70% of a research task is useful; a complete failure is not
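The error-classification rule in the list above can be sketched as a small function that returns how long to wait before the next attempt. This is a minimal illustration, not a definitive implementation. The names `classify` and `retry_delay`, the `base` and `cap` defaults, and the choice of "full jitter" are assumptions made for the example. It also assumes the `Retry-After` header has already been parsed into seconds, and it treats every status other than 429 and 5xx as fatal, which is stricter than the three codes the rule names.

```python
import random
from enum import Enum
from typing import Callable, Optional


class ErrorClass(Enum):
    TRANSIENT = "transient"        # 5xx: retry with exponential backoff
    RATE_LIMITED = "rate_limited"  # 429: wait as long as the server asks
    FATAL = "fatal"                # 400/401/403 (and, here, anything else)


def classify(status: int) -> ErrorClass:
    """Map an HTTP status code to a retry class."""
    if status == 429:
        return ErrorClass.RATE_LIMITED
    if 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    # Assumption: unrecognized statuses are treated as non-retryable.
    return ErrorClass.FATAL


def retry_delay(
    status: int,
    attempt: int,                        # 0 for the first retry
    retry_after: Optional[float] = None, # parsed Retry-After, in seconds
    base: float = 1.0,                   # hypothetical default
    cap: float = 60.0,                   # hypothetical default
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Return seconds to sleep before retrying, or None to give up."""
    kind = classify(status)
    if kind is ErrorClass.FATAL:
        return None  # bubble the error up immediately
    if kind is ErrorClass.RATE_LIMITED and retry_after is not None:
        return retry_after  # honor the server's value exactly
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
    # Also used for a 429 that arrives without a Retry-After value.
    return rng() * min(cap, base * 2 ** attempt)
```

A caller would sleep for the returned number of seconds and raise when the result is `None`. Injecting `rng` keeps the jitter deterministic in tests.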
When to Build a Long-Running Agent
A typical agent handles a request in 5–30 seconds: receive input, call an LLM a few times, return a result. Long-running agents — data migration pipelines, research synthesis, code generation workflows — run for minutes, hours, or days. The decision to go long-running is a significant architectural commitment. Over-engineer a 2-minute task with Temporal and you've added weeks of infrastructure. Under-engineer a 2-hour task with in-memory state and you'll lose progress on the first deploy.
Rule of thumb: a task under 5 minutes that is safe to retry needs no durable execution. For anything longer, choose the orchestration tier by side-effect risk and tolerance for mid-step failure.
| Dimension | Short-Lived (< 5 min) | Long-Running (> 5 min) |
|---|---|---|
| Process lifecycle | One HTTP request | Must survive deploys, restarts, crashes |
| State | In-memory, lost on completion | Persisted externally, resumable |
| Failure recovery | Retry the entire request | Resume from last checkpoint |
| Side effects | Usually idempotent to retry | Replay risk: duplicate emails, writes |
| User experience | Loading spinner → result | Progress updates, partial results |
| Rate limits | Rare for single requests | Guaranteed over hours of API calls |
| Observability | Single trace, seconds long | Distributed trace spanning hours |
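To make the "persisted externally, resumable" column concrete, here is a minimal sketch of checkpoint-and-resume using a JSON file as the external store. It is an illustration under stated assumptions, not how LangGraph or Temporal implement checkpointing. The function name `run_with_checkpoints`, the state layout, and the file-based store are all hypothetical, and each step's result is assumed to be JSON-serializable.

```python
import json
import os
from pathlib import Path
from typing import Callable, Sequence


def run_with_checkpoints(
    steps: Sequence[Callable[[], object]],
    checkpoint: Path,
) -> dict:
    """Run steps in order, persisting progress after each one."""
    state = {"next_step": 0, "results": []}
    if checkpoint.exists():
        # Resume: pick up where the previous process stopped.
        state = json.loads(checkpoint.read_text())

    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, checkpoint)
    return state
```

A step that crashes partway through is re-run in full on resume, so any side effect inside it can fire twice. That is the replay risk the table refers to.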
If your task runs in under 5 minutes and has no side effects that break on retry, don't use durable execution. Temporal, or LangGraph with PostgresSaver, adds latency, operational complexity, and storage costs. The simpler pattern, retrying the whole request with exponential backoff, works for most tasks. Reach for durable execution only when task duration, irreversible side effects, or crash-recovery requirements make a full retry unacceptable.
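The decision rule above can be written down as a small predicate. This is a sketch of the rule of thumb, assuming a hard 5-minute threshold. The `TaskProfile` fields and the function name are hypothetical, and a real decision would weigh more context than three flags.

```python
from dataclasses import dataclass

FULL_RETRY = "full-retry"
DURABLE = "durable-execution"


@dataclass(frozen=True)
class TaskProfile:
    expected_minutes: float
    has_irreversible_side_effects: bool  # e.g. emails, non-idempotent writes
    must_survive_restarts: bool          # deploys, crashes, scale-downs


def choose_execution_model(
    task: TaskProfile,
    threshold_minutes: float = 5.0,  # the article's rule-of-thumb cutoff
) -> str:
    """Return DURABLE only when a full retry would be unacceptable."""
    if (
        task.expected_minutes >= threshold_minutes
        or task.has_irreversible_side_effects
        or task.must_survive_restarts
    ):
        return DURABLE
    return FULL_RETRY
```

For example, a 2-minute summarization job with no side effects maps to `FULL_RETRY`, while a 2-hour research run maps to `DURABLE` on duration alone.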