Advanced · 12 min

Error Propagation in Multi-Agent Systems

How errors cascade through multi-agent systems, why generic error statuses hide valuable context, and how to design structured error propagation that enables intelligent recovery by the coordinator.

Quick Reference

  • Structured error context (failure type, attempted query, partial results, alternatives) enables intelligent coordinator recovery
  • Distinguish access failures (timeouts, auth errors) from valid empty results (no matches found) -- they require different handling
  • Generic error messages like 'search unavailable' hide whether it was a timeout, auth failure, rate limit, or malformed query
  • Neither silent suppression (pretend it didn't happen) nor total workflow termination (abort everything) is appropriate
  • Subagents should retry transient errors locally (timeouts, rate limits) and propagate unresolvable errors (auth failures, missing data)
  • The coordinator must synthesize results with coverage annotations: which claims are well-supported and where gaps exist
  • Error budgets define how many subagent failures a workflow can tolerate before the overall result is unreliable
  • Partial results are often more valuable than no results -- design your system to degrade gracefully
  • Log the full error chain (subagent -> coordinator -> final output) for post-mortem debugging
  • Circuit breakers prevent cascading failures: for example, if a service fails 3 times in 5 minutes, stop calling it until a cooldown period has passed
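
The circuit-breaker rule in the last bullet can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name `CircuitBreaker`, its method names, and the injectable `clock` parameter are assumptions made for this example, and the defaults simply mirror the "3 failures in 5 minutes" figure above.

```python
import time
from collections import deque


class CircuitBreaker:
    """Hypothetical sketch: block calls after `threshold` failures within `window` seconds."""

    def __init__(self, threshold=3, window=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock          # injectable so the breaker can be tested without waiting
        self.failures = deque()     # timestamps of recent failures, oldest first

    def record_failure(self):
        self.failures.append(self.clock())

    def allow_call(self):
        cutoff = self.clock() - self.window
        while self.failures and self.failures[0] <= cutoff:
            self.failures.popleft()  # forget failures that fell out of the window
        return len(self.failures) < self.threshold
```

A subagent would check `allow_call()` before each request and call `record_failure()` when a request fails. Once old failures age out of the window, calls are allowed again, which acts as a simple cooldown.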

Why Errors Cascade in Multi-Agent Systems

In a single-agent system, an error is a local event: the tool call failed, you retry or tell the user. In a multi-agent system, errors are distributed events that propagate through the agent graph. A research subagent's timeout becomes the coordinator's missing data, which becomes a gap in the final report, which becomes a user receiving incomplete information presented as complete. The architect must design error handling at every boundary.
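
One way to keep a timeout from being mistaken for "nothing found" is to have every subagent return a structured envelope. The sketch below is an assumption-laden illustration: `SubagentResult`, `FailureType`, and all field names are hypothetical, chosen to match the four pieces of context listed in the Quick Reference (failure type, attempted query, partial results, alternatives).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FailureType(Enum):
    TIMEOUT = "timeout"
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    MALFORMED_QUERY = "malformed_query"


@dataclass
class SubagentResult:
    """Hypothetical envelope a subagent returns to the coordinator."""
    agent: str
    attempted_query: str
    results: list = field(default_factory=list)        # may hold partial results
    failure: Optional[FailureType] = None              # None means the call succeeded
    alternatives: list = field(default_factory=list)   # suggested fallback approaches

    @property
    def is_valid_empty(self) -> bool:
        # The search ran successfully and simply found nothing.
        return self.failure is None and not self.results

    @property
    def is_access_failure(self) -> bool:
        # The search could not run or finish, so missing results prove nothing.
        return self.failure is not None


no_matches = SubagentResult(agent="patents", attempted_query="example query A")
timed_out = SubagentResult(
    agent="news",
    attempted_query="example query B",
    results=["one item retrieved before the timeout"],
    failure=FailureType.TIMEOUT,
    alternatives=["retry with a narrower date range"],
)
```

With this shape, the coordinator can report `no_matches` as a genuine negative finding while treating `timed_out` as a coverage gap that still carries one usable partial result.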

Error flow (diagram):

  • Subagent error: API timeout or bad data
  • Local retry: transient errors only, with exponential backoff
  • Resolved? Yes: continue. No: send a structured error (type + context + partial results) to the coordinator
  • Coordinator: decides the recovery strategy
  • Recovery options: retry modified (different query / params), try alternative (different agent / tool), or proceed partial (use what we have)

Errors bubble up with context; the coordinator picks the best recovery strategy
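
The flow above can be sketched as two small functions: one that runs inside the subagent, and one that runs in the coordinator. Everything here is illustrative. The exception classes, the dictionary keys, and the recovery policy in `choose_recovery` are assumptions for this example; a real system would tune the policy to its own tools.

```python
import time


class TransientError(Exception):
    """Hypothetical: timeouts and rate limits, worth retrying locally."""


class PermanentError(Exception):
    """Hypothetical: auth failures or missing data, where retrying will not help."""


def run_with_local_retry(call, query, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient errors with exponential backoff; otherwise return a structured error."""
    last = None
    for attempt in range(max_attempts):
        try:
            return {"status": "ok", "query": query, "results": call(query)}
        except TransientError as exc:
            last = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, then 1s with the defaults
        except PermanentError as exc:
            # Unresolvable locally: propagate immediately with context.
            return {"status": "error", "type": "permanent", "query": query,
                    "detail": str(exc), "partial_results": []}
    return {"status": "error", "type": "transient_exhausted", "query": query,
            "detail": str(last), "partial_results": []}


def choose_recovery(error, alternatives_available):
    """Coordinator-side policy (illustrative only): pick one of the three branches."""
    if alternatives_available:
        return "try_alternative"   # different agent / tool
    if error["type"] == "transient_exhausted":
        return "retry_modified"    # e.g. a narrower query that may finish in time
    return "proceed_partial"       # use what we have and annotate the gap
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a recorder and check the backoff schedule without actually waiting.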

Exam context

Scenario 3 (Multi-Agent Research) is the primary testing ground for error propagation. Expect questions where one of three research subagents fails and you must determine the correct coordinator behavior. The answer is never 'ignore the failure' and never 'abort the entire workflow.'

The exam tests three levels of error handling: (1) subagent-level -- what the failing agent does locally, (2) coordinator-level -- how the orchestrator handles partial results, and (3) output-level -- how the final response communicates completeness and confidence to the user.
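
Levels (2) and (3) can be illustrated with a small synthesis step that keeps partial findings, records which topics are covered, and applies an error budget. This is a sketch under stated assumptions: the function name `synthesize`, the input dictionary keys, and the default budget of one failed subagent are all hypothetical.

```python
def synthesize(subagent_results, error_budget=1):
    """Merge subagent outputs and annotate coverage (illustrative policy).

    subagent_results: list of dicts with keys "topic", "status" ("ok" or "error"),
    and optionally "findings" (which may be partial).
    error_budget: number of failed subagents tolerated before the overall
    result is flagged as unreliable.
    """
    covered, gaps, findings = [], [], []
    for r in subagent_results:
        findings.extend(r.get("findings", []))  # keep partial results from failed agents too
        (covered if r["status"] == "ok" else gaps).append(r["topic"])
    return {
        "findings": findings,
        "coverage": {"well_supported": covered, "gaps": gaps},
        "reliable": len(gaps) <= error_budget,
    }


report = synthesize([
    {"topic": "market size", "status": "ok", "findings": ["finding A"]},
    {"topic": "competitors", "status": "ok", "findings": ["finding B"]},
    {"topic": "pricing", "status": "error", "findings": []},
])
```

Here one of three subagents failed, so the report stays within the budget but lists "pricing" as a gap. The final response can then state explicitly which areas are supported and which are missing, rather than presenting the result as complete.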