Advanced · 18 min
Graceful Degradation
How to design AI agents that always return something useful — even when LLM APIs fail, rate limits hit, or traffic spikes. Covers fallback chains, circuit breakers, semantic degradation detection, and progressive load shedding.
Quick Reference
- Three failure types: binary (5xx), partial (slow 200), semantic (fast 200, wrong content). Most teams only handle the first.
- Fallback chain: primary model → cheaper model → cached response → static message. Always return something.
- Circuit breakers: after 5 consecutive failures, route to fallback instantly; cooldown 120s for LLM rate limits, 30s for vector DBs.
- Semantic degradation: the API returns 200 OK but content quality silently drops. Your monitors say green, users get wrong answers.
- Canary queries: send a known Q+A through your agent every 5 minutes; alert when the similarity score drops below threshold.
- Load shedding: derive thresholds from measured peak RPM, not invented numbers; start shedding RAG at 2× normal peak.
- Disable SDK auto-retry (max_retries=0) when you own the fallback logic; otherwise both the SDK and your code retry.
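The circuit-breaker rule above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `CircuitBreaker` class, the `call_with_breaker` helper, and the `primary`/`fallback` callables are hypothetical names introduced here, and the sketch assumes a single-threaded caller. The defaults (5 consecutive failures, 120s cooldown) are the LLM rate-limit values from the list.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 5,      # consecutive failures before opening
        cooldown_s: float = 120.0,       # 120s for LLM rate limits, 30s for vector DBs
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows to the primary
        # Open: block until the cooldown has elapsed, then let a trial call through.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Open the circuit, or re-open it if a trial call just failed.
            self._opened_at = self._clock()


def call_with_breaker(
    breaker: CircuitBreaker,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
) -> str:
    """Route to the fallback instantly while the circuit is open."""
    if not breaker.allow_request():
        return fallback(prompt)
    try:
        result = primary(prompt)
    except Exception:
        breaker.record_failure()
        return fallback(prompt)
    breaker.record_success()
    return result
```

While the circuit is open, `primary` is never invoked, so a failing dependency costs no extra latency. If the primary is an SDK client, this pairs with max_retries=0 so that only the breaker decides when to try again.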
When Graceful Degradation Is Worth the Complexity
Fallback chains and circuit breakers add real complexity. Before building them, decide whether your use case actually needs them.
| Use case | Build degradation? | Why |
|---|---|---|
| User-facing chat or assistant | Yes | Users are waiting; an error page ends the session |
| Real-time product search or recommendations | Yes | Revenue impact is immediate when answers disappear |
| Internal batch pipeline | No | Failures are visible to engineers; retry at job level |
| Offline data processing or enrichment | No | Correctness matters more than availability; fail loudly |
| Developer tooling used by your own team | Maybe | Depends on how disruptive downtime is to their workflow |
| Prototype or MVP under active development | No | Complexity slows iteration; add it when you have traffic |
Start with a cache and one fallback model
You don't need a full circuit breaker framework on day one. A response cache plus a fallback to Haiku on any 5xx APIStatusError get you, as a rule of thumb, about 80% of the reliability for 10% of the code. Add circuit breakers once you have enough production data to tune their thresholds.
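A day-one version of that advice might look like the sketch below. It is an illustration under stated assumptions: `UpstreamError` is a hypothetical stand-in for an SDK status error (in the Anthropic Python SDK the equivalent would be `anthropic.APIStatusError`, which carries a `status_code`), `primary` and `fallback` are hypothetical callables wrapping the main model and Haiku, and the cache is a plain dict keyed on the exact prompt.

```python
from typing import Callable, Dict


class UpstreamError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upstream returned {status_code}")
        self.status_code = status_code


STATIC_MESSAGE = "Sorry, I can't answer that right now. Please try again shortly."


def answer(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    cache: Dict[str, str],
) -> str:
    """Try primary, then the cheaper model, then the cache, then a static message."""
    for model in (primary, fallback):
        try:
            reply = model(prompt)
        except UpstreamError as err:
            if err.status_code < 500:
                # Assumption: only 5xx triggers failover, as in the text above.
                # A 4xx is re-raised so request bugs stay visible.
                raise
            continue  # 5xx: move on to the next model in the chain
        cache[prompt] = reply  # remember the last good answer for this prompt
        return reply
    # Both models failed with 5xx: serve a cached answer if one exists.
    return cache.get(prompt, STATIC_MESSAGE)
```

An exact-match cache only helps with repeated prompts, which is often enough for FAQ-style traffic. The function always returns a string for 5xx failures, so the caller never has to render an error page.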