Advanced · 11 min
Designing for Failure
Building agents that degrade gracefully: circuit breakers, fallback chains, timeout handling, and recovery strategies.
Quick Reference
- Assume every external call (LLM, tool, API) will fail: wrap each in try/catch and record a structured error state
- Implement circuit breakers: after N consecutive failures, skip the failing tool and use a fallback path
- Set per-node timeouts and a global workflow timeout; an agent that hangs forever is worse than one that fails fast
- Use fallback model chains: try Claude Sonnet → fall back to Haiku → fall back to cached response
- Log all failures with full context (input, error, stack trace, state snapshot) for post-mortem debugging
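
The circuit-breaker bullet above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `CircuitBreaker`, `CallResult`, `max_failures`, and `reset_after` are hypothetical names chosen for this example, and the sketch assumes the fallback callable itself does not raise.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CallResult:
    """Structured outcome of one external call (hypothetical shape)."""
    ok: bool                      # True if a usable value was produced
    value: Any = None
    error: Optional[str] = None   # why the primary path was not used, if so
    source: str = "primary"       # "primary" or "fallback"


class CircuitBreaker:
    """Skips the primary callable after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow one trial call. A single further
            # failure re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary: Callable, fallback: Callable, *args) -> CallResult:
        if self.is_open():
            # Fail fast: do not touch the failing tool at all.
            return CallResult(True, fallback(*args), "circuit open", "fallback")
        try:
            value = primary(*args)
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return CallResult(True, fallback(*args), repr(exc), "fallback")
        self.failures = 0  # any success resets the consecutive-failure count
        return CallResult(True, value)
```

A caller would typically keep one breaker per tool, for example `breaker.call(search_tool, cached_search, query)` with both callables supplied by the application, and log `result.error` whenever `result.source` is `"fallback"`.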
Common Agent Failure Modes
| Failure | Symptom | Impact | Mitigation |
|---|---|---|---|
| LLM rate limit (429) | API returns 429 Too Many Requests | Agent hangs or crashes | Exponential backoff + model fallback chain |
| Tool timeout | External API doesn't respond | Agent stuck on one step | Per-tool timeout (5-30s) + skip with error state |
| Infinite loop | Agent keeps calling the same tool | Cost explosion, no response | `recursion_limit` + loop detection in state |
| Malformed tool output | Tool returns unexpected format | LLM hallucinates on bad input | Output validation + structured error messages |
| Context overflow | Conversation exceeds context window | Unpredictable behavior, truncation | Token budgets + `trim_messages()` |
| State corruption | Node writes invalid data to state | Downstream nodes crash | State validation at node boundaries |
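
Two of the mitigations in the table, exponential backoff with a model fallback chain and loop detection in state, might look something like the sketch below. All names here (`RateLimitError`, `call_with_backoff`, `call_with_fallbacks`, `is_looping`) are hypothetical and stand in for whatever the real provider SDK and agent framework expose; the model callables are assumed to take a prompt and return a string.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn: Callable[[str], str], prompt: str, retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `fn` on RateLimitError, doubling the delay after each attempt."""
    for attempt in range(retries):
        try:
            return fn(prompt)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: let the caller try the next model
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter so parallel agents do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("retries must be at least 1")


def call_with_fallbacks(models: Sequence[Tuple[str, Callable[[str], str]]],
                        prompt: str, cached: Optional[str] = None,
                        **backoff_kwargs) -> Tuple[str, str]:
    """Try each (name, callable) in order, then fall back to a cached answer."""
    for name, fn in models:
        try:
            return name, call_with_backoff(fn, prompt, **backoff_kwargs)
        except Exception:
            continue  # in real code, log the failure with full context here
    if cached is not None:
        return "cache", cached
    raise RuntimeError("all models failed and no cached response is available")


def is_looping(tool_calls: Sequence[tuple], window: int = 3) -> bool:
    """True if the last `window` tool calls (hashable tuples) are identical."""
    recent = list(tool_calls)[-window:]
    return len(recent) == window and len(set(recent)) == 1
```

In an agent loop, `is_looping(state_tool_history)` would be checked before each tool call, alongside the framework's own step limit, so a repeated call is cut off before it runs up cost.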