Error Propagation in Multi-Agent Systems
How errors cascade through multi-agent systems, why generic error statuses hide valuable context, and how to design structured error propagation that enables intelligent recovery by the coordinator.
Quick Reference
- Structured error context (failure type, attempted query, partial results, alternatives) enables intelligent coordinator recovery
- Distinguish access failures (timeouts, auth errors) from valid empty results (no matches found) -- they require different handling
- Generic error messages like 'search unavailable' hide whether it was a timeout, auth failure, rate limit, or malformed query
- Neither silent suppression (pretend it didn't happen) nor total workflow termination (abort everything) is appropriate
- Subagents should retry transient errors locally (timeouts, rate limits) and propagate unresolvable errors (auth failures, missing data)
- The coordinator must synthesize results with coverage annotations: which claims are well-supported and where gaps exist
- Error budgets define how many subagent failures a workflow can tolerate before the overall result is unreliable
- Partial results are often more valuable than no results -- design your system to degrade gracefully
- Log the full error chain (subagent -> coordinator -> final output) for post-mortem debugging
- Circuit breakers prevent cascading failures: if a service fails 3 times in 5 minutes, stop calling it
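Several of the points above (structured error context, access failure versus valid empty result, local retry of transient errors) can be sketched together. The following is a minimal illustration, not a prescribed implementation: `FailureType`, `ToolError`, `SubagentResult`, and `run_search` are hypothetical names, and the split between transient and unresolvable failure types is an assumption taken from the list above.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FailureType(Enum):
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    AUTH = "auth"
    MALFORMED_QUERY = "malformed_query"


# Assumption: timeouts and rate limits are worth retrying locally;
# auth failures and malformed queries are not.
TRANSIENT = {FailureType.TIMEOUT, FailureType.RATE_LIMIT}


class ToolError(Exception):
    """Hypothetical exception raised by a tool, tagged with a failure type."""

    def __init__(self, failure_type: FailureType, message: str):
        super().__init__(message)
        self.failure_type = failure_type


@dataclass
class SubagentResult:
    query: str                                        # the attempted query
    items: list = field(default_factory=list)         # full or partial results
    failure_type: Optional[FailureType] = None        # None means success
    message: str = ""
    alternatives: list = field(default_factory=list)  # suggested fallbacks

    @property
    def ok(self) -> bool:
        return self.failure_type is None

    @property
    def is_valid_empty(self) -> bool:
        # The search ran and found nothing: an answer, not a failure.
        return self.ok and not self.items


def run_search(query, search, max_attempts=3, backoff_s=0.0, alternatives=()):
    """Retry transient errors locally; propagate everything else with context."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return SubagentResult(query=query, items=list(search(query)))
        except ToolError as err:
            last_error = err
            if err.failure_type not in TRANSIENT:
                break  # unresolvable locally, so stop retrying
            if attempt < max_attempts - 1:
                time.sleep(backoff_s * 2 ** attempt)  # exponential backoff
    return SubagentResult(
        query=query,
        failure_type=last_error.failure_type,
        message=str(last_error),
        alternatives=list(alternatives),
    )
```

With a result shaped like this, the coordinator can tell "the search ran and found nothing" (`is_valid_empty`) apart from "the search never ran" (`failure_type` set), which a generic 'search unavailable' string cannot express.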
Why Errors Cascade in Multi-Agent Systems
In a single-agent system, an error is a local event: the tool call failed, you retry or tell the user. In a multi-agent system, errors are distributed events that propagate through the agent graph. A research subagent's timeout becomes the coordinator's missing data, which becomes a gap in the final report, which becomes a user receiving incomplete information presented as complete. The architect must design error handling at every boundary.
Figure: Errors bubble up with context; the coordinator picks the best recovery strategy.
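One guard against the cascade described above is the circuit breaker mentioned in the Quick Reference. The sketch below assumes the "3 failures in 5 minutes" rule from that list and uses a sliding time window; the `CircuitBreaker` class and its parameters are hypothetical, and a production breaker would likely add a half-open probing state that this version omits.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when max_failures occur within a sliding window of window_s seconds."""

    def __init__(self, max_failures=3, window_s=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock          # injectable so the window can be tested
        self._failures = deque()    # timestamps of recent failures

    def record_failure(self) -> None:
        self._failures.append(self.clock())

    def is_open(self) -> bool:
        # Drop failures that have aged out of the window.
        cutoff = self.clock() - self.window_s
        while self._failures and self._failures[0] < cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

A subagent would check `is_open()` before calling the service and, when the breaker is open, return a structured error immediately instead of adding another timeout to the chain.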
Scenario 3 (Multi-Agent Research) is the primary testing ground for error propagation. Expect questions where one of three research subagents fails and you must determine the correct coordinator behavior. The answer is never 'ignore the failure' and never 'abort the entire workflow.'
The exam tests three levels of error handling: (1) subagent-level -- what the failing agent does locally, (2) coordinator-level -- how the orchestrator handles partial results, and (3) output-level -- how the final response communicates completeness and confidence to the user.
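The coordinator and output levels can be illustrated with a short sketch. It assumes each subagent hands back a report with a topic, any findings (possibly partial), and an optional error label; `SubagentReport`, `synthesize`, `render`, and the `max_failures` error budget are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubagentReport:
    topic: str
    findings: list = field(default_factory=list)  # may be partial
    error: Optional[str] = None                   # e.g. "timeout"; None means success


def synthesize(reports, max_failures=1):
    """Coordinator level: keep partial results and record where coverage is missing."""
    failed = [r for r in reports if r.error is not None]
    return {
        "findings": {r.topic: r.findings for r in reports if r.findings},
        "no_matches": [r.topic for r in reports if r.error is None and not r.findings],
        "gaps": {r.topic: r.error for r in failed},
        # Error budget: beyond max_failures the overall result is flagged.
        "reliable": len(failed) <= max_failures,
    }


def render(summary):
    """Output level: tell the user what is covered and what is not."""
    lines = []
    for topic, findings in summary["findings"].items():
        lines.append(f"{topic}: {'; '.join(findings)}")
    for topic in summary["no_matches"]:
        lines.append(f"{topic}: searched, no matches found")
    for topic, error in summary["gaps"].items():
        lines.append(f"{topic}: INCOMPLETE ({error})")
    if not summary["reliable"]:
        lines.append("Warning: too many subagents failed; treat this report as unreliable.")
    return "\n".join(lines)
```

In the one-of-three-fails case, this neither drops the failure silently nor aborts: the two successful topics are reported, and the third is listed as a gap with its cause.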