Advanced · 17 min

Designing for Failure

How to architect agents that don't crash: classify errors before handling them, design timeout budgets at every level, validate state at node boundaries, and verify failure paths with fault injection.

Quick Reference

→Classify errors before handling: transient (retry), rate-limited (honor Retry-After), LLM-recoverable (re-prompt), user-fixable (interrupt), fatal (bubble up)
→Retrying non-idempotent tools creates duplicates — always generate an idempotency key before calling write operations
→Design a four-level timeout budget: tool (5–30s) → node (30–60s) → workflow (60–120s) → user-facing SLA
→Validate state at every node boundary with Pydantic — state corruption caught at entry is a one-line fix; caught 3 nodes downstream it's a 2-hour debugging session
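→The Anthropic SDK retries automatically (max_retries=2 by default, so up to 3 attempts per call). Wrapping it in tenacity without setting max_retries=0 multiplies the attempts: a tenacity policy of just 2 attempts already produces up to 6 requests instead of 3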
→Fault injection in staging finds failures that happy-path unit tests never reach
→For runtime resilience — fallback chains, circuit breakers, feature degradation under load — see the Graceful Degradation article

When Failure Design Matters

A single LLM call with no external tools needs a try/catch and not much else. Add one external API and you need timeouts and retries. Add user-facing latency requirements and you need a timeout budget. Add multi-step workflows and you need state validation. Add write operations and you need idempotency. The following table maps agent complexity to the patterns that are actually required — not as a checklist to complete, but as a guide to where bugs hide:

Agent Type	What Can Go Wrong	Minimum Required
Single LLM call	Model error, rate limit	try/catch + retry
LLM + read-only tools	Tool timeout, API error	Timeout + error taxonomy
LLM + write tools	Duplicate writes on retry	Idempotency key + timeout
Multi-step workflow	State corruption cascade	+ State validation at boundaries
User-facing, multi-step	Latency SLA violations	+ Timeout budget hierarchy
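The timeout budget hierarchy in the last row can be sketched with a deadline object that each level clips to whatever its parent has left. This is a minimal illustration using only the standard library; `Budget` and `call_tool` are hypothetical names, not part of any framework, and the second values simply echo the ranges from the Quick Reference.

```python
import asyncio
import time


class Budget:
    """Hypothetical helper: tracks a deadline so inner levels never outlive outer ones."""

    def __init__(self, seconds: float):
        self.deadline = time.monotonic() + seconds

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

    def child(self, seconds: float) -> "Budget":
        # A child level gets its own cap, clipped to what the parent has left.
        return Budget(min(seconds, self.remaining()))


async def call_tool(coro, budget: Budget, tool_timeout: float):
    # Tool-level timeout, clipped by the enclosing node budget.
    timeout = min(tool_timeout, budget.remaining())
    return await asyncio.wait_for(coro, timeout=timeout)


async def main():
    workflow = Budget(120)      # workflow level
    node = workflow.child(60)   # node level, never longer than the workflow

    async def slow_tool():
        await asyncio.sleep(0.5)  # stands in for a slow external API
        return "result"

    try:
        # 0.1s is an artificially short tool timeout so the demo finishes quickly.
        await call_tool(slow_tool(), node, tool_timeout=0.1)
    except asyncio.TimeoutError:
        print("tool timed out; node still has", round(node.remaining()), "s left")


if __name__ == "__main__":
    asyncio.run(main())
```

The point of clipping is that a tool started late in a node cannot consume time the workflow no longer has, so the outermost deadline is the one that always holds.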

Boundary with Graceful Degradation

This article covers design-time decisions: how to build an agent that doesn't crash when things go wrong. For runtime behavior — fallback chains, circuit breakers, feature degradation under load — see the Graceful Degradation article. The two are complementary, not overlapping.

Classify Failures Before Handling Them

The most common mistake in agent error handling is treating all exceptions the same: catch everything, log it, and apply one blanket policy. Whatever that policy is, it breaks somewhere. Retry everything and you retry fatal errors (code bugs, schema mismatches). Retry nothing and you give up on transient ones (a network hiccup, a brief service overload). Either way, you silently swallow user-fixable issues (auth failures) that needed human attention. Every exception in an agent falls into one of five categories, each with a different handler: transient (retry), rate-limited (honor Retry-After), LLM-recoverable (re-prompt), user-fixable (interrupt), and fatal (bubble up).
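A classifier for the five categories might look like the sketch below. The exception types and the status-code mapping are assumptions made for illustration (`ToolError` and `BadToolArguments` are hypothetical); a real agent would map the concrete exceptions its own tools and SDKs raise.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "retry"
    RATE_LIMITED = "honor Retry-After"
    LLM_RECOVERABLE = "re-prompt"
    USER_FIXABLE = "interrupt"
    FATAL = "bubble up"


class ToolError(Exception):
    """Hypothetical tool exception carrying an HTTP-style status code."""

    def __init__(self, message, status=None):
        super().__init__(message)
        self.status = status


class BadToolArguments(Exception):
    """Hypothetical: the model called a tool with arguments it can correct."""


def classify(exc: Exception) -> ErrorClass:
    status = getattr(exc, "status", None)
    if status == 429:
        return ErrorClass.RATE_LIMITED
    if status in (401, 403):
        # Auth failures need a human, not a retry.
        return ErrorClass.USER_FIXABLE
    if status is not None and 500 <= status < 600:
        return ErrorClass.TRANSIENT
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, BadToolArguments):
        return ErrorClass.LLM_RECOVERABLE
    # Anything unrecognized (code bugs, schema mismatches) is treated as fatal.
    return ErrorClass.FATAL
```

Defaulting unknown exceptions to fatal is a deliberate choice in this sketch: an unclassified error that bubbles up is visible, while one that is silently retried is not.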

Retry Without Making Things Worse

Retrying is not free. Three failure modes hide inside every retry loop. First: retrying a non-retryable error (a code bug, a schema error) wastes token budget and obscures the root cause. Second: retrying without jitter creates thundering herds — all agents recover from the same rate limit at the same second, immediately triggering it again. Third, and most dangerous: retrying a non-idempotent tool creates duplicate records. A `search()` is safe to retry. A `create_invoice(amount=500)` is not — a timeout doesn't tell you whether the first attempt succeeded before it timed out.
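All three failure modes can be addressed in one small retry helper, sketched here with the standard library only. It assumes the write tool accepts an `idempotency_key` argument and that the backend deduplicates on it; `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names for illustration.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Full jitter: a uniform draw up to the exponential ceiling, so clients
    # that failed together do not all retry at the same instant.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(fn, *, max_attempts=3, sleep=time.sleep, **kwargs):
    # The key is generated once, before the first attempt, and reused on
    # every retry so the backend can recognize a repeated write.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return fn(idempotency_key=key, **kwargs)
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
        # Any other exception (code bug, schema error) propagates immediately.
```

A call such as `call_with_retry(create_invoice, amount=500)` would then send the same key on every attempt, so a first attempt that succeeded before timing out does not produce a second invoice, provided the backend honors the key.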
