Advanced · 17 min

Designing for Failure

How to architect agents that don't crash: classify errors before handling them, design timeout budgets at every level, validate state at node boundaries, and verify failure paths with fault injection.

Quick Reference

  • Classify errors before handling: transient (retry), rate-limited (honor Retry-After), LLM-recoverable (re-prompt), user-fixable (interrupt), fatal (bubble up)
  • Retrying non-idempotent tools creates duplicates — always generate an idempotency key before calling write operations
  • Design a four-level timeout budget: tool (5–30s) → node (30–60s) → workflow (60–120s) → user-facing SLA
  • Validate state at every node boundary with Pydantic — state corruption caught at entry is a one-line fix; caught 3 nodes downstream it's a 2-hour debugging session
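  • The Anthropic SDK retries automatically (max_retries=2 by default, so up to 3 attempts per call). Wrapping it in tenacity without setting max_retries=0 multiplies the two: a 2-attempt tenacity policy makes up to 6 API calls, and a 3-attempt policy up to 9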
  • Fault injection in staging finds failures that happy-path unit tests never reach
  • For runtime resilience — fallback chains, circuit breakers, feature degradation under load — see the Graceful Degradation article
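
The first bullet's taxonomy can be sketched as a single classification function that runs before any handling logic. This is a minimal illustration, not a definitive implementation: the exception types RateLimitError, ToolArgumentError, and MissingCredentialsError are hypothetical names defined here for the example, and a real agent would map its own SDK and tool exceptions instead.

```python
from enum import Enum


class ErrorClass(Enum):
    """The five classes from the quick reference, each paired with its response."""
    TRANSIENT = "retry with backoff"
    RATE_LIMITED = "wait for Retry-After, then retry"
    LLM_RECOVERABLE = "re-prompt the model with the error"
    USER_FIXABLE = "interrupt and ask the user"
    FATAL = "bubble up"


# Hypothetical exception types, defined only for this illustration.
class RateLimitError(Exception):
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


class ToolArgumentError(Exception):
    """The model produced arguments that the tool rejected."""


class MissingCredentialsError(Exception):
    """Only the user can supply what is missing."""


def classify(exc: Exception) -> ErrorClass:
    """Map an exception to exactly one class; anything unrecognized is fatal."""
    if isinstance(exc, RateLimitError):
        return ErrorClass.RATE_LIMITED
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, ToolArgumentError):
        return ErrorClass.LLM_RECOVERABLE
    if isinstance(exc, MissingCredentialsError):
        return ErrorClass.USER_FIXABLE
    return ErrorClass.FATAL
```

Defaulting unknown exceptions to FATAL is a deliberate choice in this sketch: an error nobody has classified should surface loudly rather than be retried blindly.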

When Failure Design Matters

A single LLM call with no external tools needs a try/catch and not much else. Add one external API and you need timeouts and retries. Add user-facing latency requirements and you need a timeout budget. Add multi-step workflows and you need state validation. Add write operations and you need idempotency. The following table maps agent complexity to the patterns that are actually required — not as a checklist to complete, but as a guide to where bugs hide:
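
One way to express a timeout budget is a deadline object that each level consults before making an inner call, so a tool never gets more time than its enclosing node or workflow has left. The sketch below is an assumption about how such a budget could be structured; the Deadline class, its method names, and the cap constants are hypothetical, with the caps taken from the upper ends of the ranges in the quick reference.

```python
import time


class Deadline:
    """Tracks one overall budget; inner calls ask for a slice of what remains."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, cap_s: float) -> float:
        """Timeout for an inner call: its own cap, or less if the budget is nearly spent."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("timeout budget exhausted")
        return min(cap_s, remaining)


# Illustrative caps (assumptions, not prescribed values).
TOOL_CAP_S = 30.0
NODE_CAP_S = 60.0
WORKFLOW_BUDGET_S = 120.0

workflow = Deadline(WORKFLOW_BUDGET_S)
node = Deadline(workflow.timeout_for(NODE_CAP_S))   # node cannot outlive the workflow
tool_timeout = node.timeout_for(TOOL_CAP_S)         # tool cannot outlive the node
```

The clock parameter exists so tests can substitute a fake clock instead of sleeping; tool_timeout is the value that would be passed to the HTTP client or tool call.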

Agent Type              | What Can Go Wrong         | Minimum Required
Single LLM call         | Model error, rate limit   | try/catch + retry
LLM + read-only tools   | Tool timeout, API error   | Timeout + error taxonomy
LLM + write tools       | Duplicate writes on retry | Idempotency key + timeout
Multi-step workflow     | State corruption cascade  | + State validation at boundaries
User-facing, multi-step | Latency SLA violations    | + Timeout budget hierarchy
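
The "LLM + write tools" row is the one where a retry can do real damage, so it is worth a concrete sketch. The example below derives an idempotency key once, before the first call, and reuses it on retry. Everything here is illustrative: idempotency_key, FakePaymentAPI, and the run and step names are hypothetical, and the in-memory fake stands in for a real write API that honors such keys.

```python
import hashlib
import json


def idempotency_key(run_id: str, step: str, args: dict) -> str:
    """Derive a stable key: the same logical write always yields the same key."""
    payload = json.dumps({"run": run_id, "step": step, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FakePaymentAPI:
    """Hypothetical in-memory stand-in for a write API that honors idempotency keys."""

    def __init__(self):
        self.charges = []
        self._seen = {}

    def charge(self, amount_cents: int, key: str) -> dict:
        if key in self._seen:
            # Replayed request: return the original result, create nothing new.
            return self._seen[key]
        result = {"charge_id": len(self.charges) + 1, "amount_cents": amount_cents}
        self.charges.append(result)
        self._seen[key] = result
        return result


api = FakePaymentAPI()
args = {"amount_cents": 1999}
key = idempotency_key("run-42", "charge_card", args)  # generated once, before the first call

first = api.charge(args["amount_cents"], key)
retry = api.charge(args["amount_cents"], key)  # e.g. the first response was lost to a timeout
```

After the retry, api.charges still holds a single entry and retry equals first. Deriving the key from the run, step, and arguments is one option; generating a random key once and storing it in workflow state before the call would serve the same purpose.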

Boundary with Graceful Degradation

This article covers design-time decisions: how to build an agent that doesn't crash when things go wrong. For runtime behavior — fallback chains, circuit breakers, feature degradation under load — see the Graceful Degradation article. The two are complementary, not overlapping.