Agent Architecture/System Design
Advanced · 18 min

Graceful Degradation

How to design AI agents that always return something useful — even when LLM APIs fail, rate limits hit, or traffic spikes. Covers fallback chains, circuit breakers, semantic degradation detection, and progressive load shedding.

Quick Reference

  • Three failure types: binary (5xx), partial (slow 200), semantic (fast 200, wrong content) — most teams only handle the first
  • Fallback chain: primary model → cheaper model → cached response → static message — always return something
  • Circuit breakers: after 5 consecutive failures, route to fallback instantly; cooldown 120s for LLM rate limits, 30s for vector DBs
  • Semantic degradation: the API returns 200 OK but content quality silently drops — your monitors say green, users get wrong answers
  • Canary queries: send a known Q+A through your agent every 5 minutes; alert when similarity score drops below threshold
  • Load shedding: derive thresholds from measured peak RPM, not invented numbers — start shedding RAG at 2× normal peak
  • Disable SDK auto-retry (max_retries=0) when you own the fallback logic — otherwise both the SDK and your code retry
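The circuit breaker rule in the list above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name `CircuitBreaker`, the helper `call_with_breaker`, and the injectable `clock` parameter are hypothetical names chosen for this example. The defaults (5 consecutive failures, 120 s cooldown) come from the list. The sketch is single-threaded and treats any exception as a failure.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; closes again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cooldown has elapsed, calls are allowed through again.
        # If the next call fails, record_failure() reopens the circuit.
        return self._clock() - self._opened_at < self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_breaker(breaker: CircuitBreaker,
                      primary: Callable[[], str],
                      fallback: Callable[[], str]) -> str:
    """Route to the fallback immediately while the circuit is open."""
    if breaker.is_open():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

While the circuit is open, `primary` is never invoked, so a failing dependency costs no latency. A vector DB breaker would use the same class with `cooldown_s=30.0`.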

When Graceful Degradation Is Worth the Complexity

Fallback chains and circuit breakers add real complexity. Before building them, decide whether your use case actually needs them.

| Use case | Build degradation? | Why |
|---|---|---|
| User-facing chat or assistant | Yes | Users are waiting; an error page ends the session |
| Real-time product search or recommendations | Yes | Revenue impact is immediate when answers disappear |
| Internal batch pipeline | No | Failures are visible to engineers; retry at job level |
| Offline data processing or enrichment | No | Correctness matters more than availability; fail loudly |
| Developer tooling used by your own team | Maybe | Depends on how disruptive downtime is to their workflow |
| Prototype or MVP under active development | No | Complexity slows iteration; add it when you have traffic |

Start with a cache and one fallback model

You don't need a full circuit breaker framework on day one. A response cache plus a fallback to a cheaper model such as Haiku on any APIStatusError with a 5xx status code gets you, as a rule of thumb, roughly 80% of the reliability for 10% of the code. Add circuit breakers once you have enough production data to tune the thresholds.
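A minimal sketch of that starting point follows. To stay self-contained it does not import any SDK: `UpstreamError` is a hypothetical stand-in for an SDK error that exposes a `status_code` attribute, and `primary` and `fallback` are hypothetical callables wrapping your two model calls. The in-memory dict keyed on the exact prompt is also an assumption; a real cache would likely normalize keys and expire entries.

```python
from typing import Callable, Dict


class UpstreamError(Exception):
    """Hypothetical stand-in for an SDK error carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upstream returned {status_code}")
        self.status_code = status_code


STATIC_MESSAGE = "Sorry, I can't answer right now. Please try again shortly."


def answer(prompt: str,
           primary: Callable[[str], str],
           fallback: Callable[[str], str],
           cache: Dict[str, str]) -> str:
    """Try the primary model, then the fallback model, then the cache,
    then a static message, so the caller always gets a string back."""
    try:
        text = primary(prompt)
    except UpstreamError as err:
        if err.status_code < 500:
            # Only 5xx triggers the fallback in this sketch; other
            # errors are re-raised for the caller to handle.
            raise
        try:
            text = fallback(prompt)
        except UpstreamError:
            # Both models failed: serve a cached answer if one exists.
            return cache.get(prompt, STATIC_MESSAGE)
    # Remember every successful answer for future outages.
    cache[prompt] = text
    return text
```

If you wire this to a real client, construct that client with `max_retries=0` so the SDK's automatic retries do not stack on top of the fallback logic here.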