Agent Architecture/Autonomous Agents
★ OverviewAdvanced12 min

Long-Running Agents

Architecting agents that run for hours or days: durable execution, checkpointing strategies, progress reporting, and timeout/budget management to prevent runaway costs.

Quick Reference

  • Long-running agents (minutes to days) need fundamentally different architecture than request-response agents (seconds)
  • Durable execution: use frameworks like Temporal, Inngest, or LangGraph's persistent checkpointing to survive crashes
  • Checkpoint after every significant step — the cost of re-doing work after a crash far exceeds the cost of saving state
  • Progress webhooks: push status updates to your frontend so users know the agent is still working
  • Budget management: set hard limits on cost ($), time (hours), and actions (iterations) — enforce them in the agent loop
  • Always design a graceful shutdown: save current state, report partial results, and allow manual resume

Why Long-Running Agents Are Different

A typical agent handles a request in 5-30 seconds: receive input, call LLM a few times, return result. Long-running agents — data migration, research synthesis, code generation pipelines — run for minutes, hours, or even days. Everything that works for short-lived agents breaks at this timescale: server restarts kill your process, LLM rate limits require backoff, and costs can spiral without budgets.

ChallengeShort-Lived AgentLong-Running Agent
Process lifecycleLives within one HTTP requestMust survive deploys, restarts, crashes
StateIn-memory, lost on completionMust be persisted and resumable
Failure recoveryRetry the whole requestResume from last checkpoint
CostPredictable ($0.01-0.50 per request)Unbounded without budget limits
User experienceLoading spinner → resultProgress bar, status updates, partial results
Rate limitsRare for single requestsGuaranteed over hours of API calls
ObservabilitySingle traceDistributed trace spanning hours
The silent cost explosion

A long-running agent without a budget can burn through $100+ in API calls before anyone notices. Always set a hard dollar limit that kills the agent when exceeded. Better to stop early than explain an unexpected bill.