Advanced · 11 min

Designing for Failure

Building agents that degrade gracefully: circuit breakers, fallback chains, timeout handling, and recovery strategies.

Quick Reference

  • Assume every external call (LLM, tool, API) will fail — wrap each in try/catch with structured error state
  • Implement circuit breakers: after N consecutive failures, skip the failing tool and use a fallback path
  • Set per-node timeouts and a global workflow timeout — an agent that hangs forever is worse than one that fails fast
  • Use fallback model chains: try Claude Sonnet → fall back to Haiku → fall back to cached response
  • Log all failures with full context (input, error, stack trace, state snapshot) for post-mortem debugging
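
As a rough illustration of how the first four points can fit together, here is a minimal Python sketch using only the standard library. The names CallResult, CircuitBreaker, call_with_timeout, and run_chain are hypothetical and not taken from any framework, and the provider callables stand in for real model clients. The breaker is deliberately simplified: it opens after N consecutive failures and has no cooldown or half-open state, which a production version would likely need.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CallResult:
    """Structured outcome of one external call: a value or an error string."""
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None
    source: Optional[str] = None  # which provider (or the cache) answered


@dataclass
class CircuitBreaker:
    """Opens after `max_failures` consecutive failures (no cooldown in this sketch)."""
    max_failures: int = 3
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, ok: bool) -> None:
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if ok else self.failures + 1


def call_with_timeout(fn: Callable[[str], str], prompt: str, timeout_s: float) -> CallResult:
    """Run fn(prompt) in a worker thread and convert any failure into a CallResult."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, prompt)
    try:
        return CallResult(ok=True, value=future.result(timeout=timeout_s))
    except FutureTimeout:
        # The worker thread is not killed; we only stop waiting for it.
        return CallResult(ok=False, error=f"timed out after {timeout_s}s")
    except Exception as exc:
        return CallResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    finally:
        pool.shutdown(wait=False)


def run_chain(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
    breakers: Dict[str, CircuitBreaker],
    cached: Optional[str] = None,
    timeout_s: float = 10.0,
) -> CallResult:
    """Try providers in order, skipping open circuits; fall back to a cached value last."""
    errors = []
    for name, fn in providers:
        breaker = breakers.setdefault(name, CircuitBreaker())
        if breaker.is_open:
            errors.append(f"{name}: circuit open, skipped")
            continue
        result = call_with_timeout(fn, prompt, timeout_s)
        breaker.record(result.ok)
        if result.ok:
            result.source = name
            return result
        errors.append(f"{name}: {result.error}")
    if cached is not None:
        return CallResult(ok=True, value=cached, source="cache")
    return CallResult(ok=False, error="; ".join(errors))
```

Returning a CallResult instead of raising keeps the failure inside the agent's state, so a downstream step can decide whether to retry, degrade, or report the error. The per-call timeout here covers a single node; a global workflow timeout would wrap the whole run in a similar way.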

Common Agent Failure Modes

| Failure | Symptom | Impact | Mitigation |
|---|---|---|---|
| LLM rate limit (429) | API returns 429 Too Many Requests | Agent hangs or crashes | Exponential backoff + model fallback chain |
| Tool timeout | External API doesn't respond | Agent stuck on one step | Per-tool timeout (5–30 s) + skip with error state |
| Infinite loop | Agent keeps calling the same tool | Cost explosion, no response | recursion_limit + loop detection in state |
| Malformed tool output | Tool returns unexpected format | LLM hallucinates on bad input | Output validation + structured error messages |
| Context overflow | Conversation exceeds context window | Unpredictable behavior, truncation | Token budgets + trim_messages() |
| State corruption | Node writes invalid data to state | Downstream nodes crash | State validation at node boundaries |
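
Three of the mitigations in the table (exponential backoff, loop detection in state, and output validation) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a framework API: RateLimitError is a hypothetical stand-in for whatever exception a provider client raises on a 429, and with_backoff, is_looping, and validate_tool_output are made-up helper names. It also assumes tool calls are recorded in state as (tool_name, args) pairs and that tools return JSON objects.

```python
import json
import random
import time
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def with_backoff(
    fn: Callable[[], Any],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry fn on RateLimitError, waiting a random time up to base * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller fall back to another model
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))


def is_looping(tool_calls: Iterable[Tuple[str, Dict[str, Any]]], max_repeats: int = 3) -> bool:
    """True if any identical (tool, args) call appears more than max_repeats times."""
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in tool_calls
    )
    return any(n > max_repeats for n in counts.values())


def validate_tool_output(raw: str, required_keys: List[str]) -> Dict[str, Any]:
    """Parse tool output as a JSON object; return a structured error on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"ok": False, "error": "expected a JSON object"}
    missing = sorted(set(required_keys) - set(data))
    if missing:
        return {"ok": False, "error": f"missing keys: {missing}"}
    return {"ok": True, "data": data}
```

The random jitter in with_backoff spreads retries out so that several agents hitting the same limit do not all retry at the same moment. Passing the structured error from validate_tool_output back to the model, rather than the raw malformed payload, gives it something concrete to react to.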