Designing for Failure
How to architect agents that don't crash: classify errors before handling them, design timeout budgets at every level, validate state at node boundaries, and verify failure paths with fault injection.
Quick Reference
- Classify errors before handling them: transient (retry), rate-limited (honor Retry-After), LLM-recoverable (re-prompt), user-fixable (interrupt), fatal (bubble up)
- Retrying non-idempotent tools creates duplicates, so generate an idempotency key once per logical write and reuse it on every retry
- Design a four-level timeout budget: tool (5–30s) → node (30–60s) → workflow (60–120s) → user-facing SLA
- Validate state at every node boundary with Pydantic: state corruption caught at entry is a one-line fix; caught three nodes downstream, it is a two-hour debugging session
- The Anthropic SDK retries automatically (max_retries=2 by default, so up to 3 attempts per call). Wrapping it in tenacity without setting max_retries=0 multiplies the attempts: two tenacity attempts already mean up to 6 requests instead of 3
- Fault injection in staging finds failures that happy-path unit tests never reach
- For runtime resilience (fallback chains, circuit breakers, feature degradation under load), see the Graceful Degradation article
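As a rough illustration of the first point, the five error classes can be modeled as an enum plus a single classification function that runs before any handling logic. This is a minimal sketch, not a definitive implementation: `ToolError`, its `status`, `bad_arguments`, and `needs_user` fields, and the mapping rules are all hypothetical names chosen for the example, and a real agent would map its own exception types.

```python
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    """The five classes from the quick reference, with their handling strategy."""
    TRANSIENT = "retry with backoff"
    RATE_LIMITED = "honor Retry-After, then retry"
    LLM_RECOVERABLE = "re-prompt the model with the error"
    USER_FIXABLE = "interrupt and ask the user"
    FATAL = "bubble up"


class ToolError(Exception):
    """Hypothetical tool exception carrying an HTTP-like status and hints."""

    def __init__(self, message: str, status: Optional[int] = None,
                 bad_arguments: bool = False, needs_user: bool = False):
        super().__init__(message)
        self.status = status
        self.bad_arguments = bad_arguments  # the model passed invalid arguments
        self.needs_user = needs_user        # e.g. missing credentials or approval


def classify(exc: Exception) -> ErrorClass:
    """Decide what kind of failure this is before deciding what to do about it."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, ToolError):
        if exc.status == 429:
            return ErrorClass.RATE_LIMITED
        if exc.status is not None and 500 <= exc.status < 600:
            return ErrorClass.TRANSIENT
        if exc.bad_arguments:
            return ErrorClass.LLM_RECOVERABLE
        if exc.needs_user:
            return ErrorClass.USER_FIXABLE
    # Anything unrecognized is treated as fatal rather than retried blindly.
    return ErrorClass.FATAL
```

Defaulting unknown errors to `FATAL` is a deliberate choice in this sketch: an unclassified error that gets retried can hide a bug or repeat a write.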
When Failure Design Matters
A single LLM call with no external tools needs a try/catch and not much else. Add one external API and you need timeouts and retries. Add user-facing latency requirements and you need a timeout budget. Add multi-step workflows and you need state validation. Add write operations and you need idempotency. The following table maps agent complexity to the patterns that are actually required — not as a checklist to complete, but as a guide to where bugs hide:
| Agent Type | What Can Go Wrong | Minimum Required |
|---|---|---|
| Single LLM call | Model error, rate limit | try/catch + retry |
| LLM + read-only tools | Tool timeout, API error | Timeout + error taxonomy |
| LLM + write tools | Duplicate writes on retry | Idempotency key + timeout |
| Multi-step workflow | State corruption cascade | All of the above + state validation at boundaries |
| User-facing, multi-step | Latency SLA violations | All of the above + timeout budget hierarchy |
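The timeout budget hierarchy in the last row can be sketched with nested deadlines, where each level asks for its own cap but never gets more time than its parent has left. The `Budget` class and `run_with_budget` helper below are hypothetical names for illustration, built only on the standard library's `asyncio.wait_for`; the durations are shrunk to fractions of a second so the demo finishes quickly.

```python
import asyncio
import time


class Budget:
    """Tracks an absolute deadline; a child can never outlive its parent."""

    def __init__(self, seconds: float, parent: "Budget | None" = None):
        deadline = time.monotonic() + seconds
        if parent is not None:
            # Cap the requested time by whatever the parent has left.
            deadline = min(deadline, parent.deadline)
        self.deadline = deadline

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

    def child(self, seconds: float) -> "Budget":
        return Budget(seconds, parent=self)


async def run_with_budget(coro, budget: Budget):
    """Run a coroutine, cancelling it once the budget is exhausted."""
    return await asyncio.wait_for(coro, timeout=budget.remaining())


async def demo() -> str:
    workflow = Budget(0.2)       # stands in for the 60-120 s workflow level
    node = workflow.child(0.1)   # stands in for the 30-60 s node level
    tool = node.child(5.0)       # requests 5 s but is capped by the node
    try:
        await run_with_budget(asyncio.sleep(1), tool)
        return "tool finished"
    except asyncio.TimeoutError:
        return "tool timed out"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Using absolute deadlines rather than per-call durations means time spent in earlier steps is automatically subtracted from later ones, so a slow first tool call cannot push the whole workflow past its limit.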
This article covers design-time decisions: how to build an agent that doesn't crash when things go wrong. For runtime behavior — fallback chains, circuit breakers, feature degradation under load — see the Graceful Degradation article. The two are complementary, not overlapping.