
Parallelization: Concurrent LLM Execution

Parallelization runs multiple LLMs concurrently to gain confidence (voting: same input, N opinions) or speed (sectioning: split input, parallel workers). This article starts with whether you should parallelize at all, walks through the actual cost math — 3× Haiku can be cheaper than 1× Sonnet with caching — and covers the two production failure modes most articles skip: superstep atomicity and correlated errors.

Quick Reference

  • Parallelization earns its keep only when single-call accuracy is below requirements AND tasks decompose independently
  • Voting (same input, N opinions) is for accuracy; sectioning (different chunks, same task) is for speed
  • Parallel latency = max(worker latencies), not sum — voting adds almost no latency over a single call
  • 3× Haiku 4.5 ≈ 77% of 1× Sonnet 4.6 cost without caching; ~40% with Anthropic prompt caching
  • LangGraph superstep atomicity: one unhandled exception retries ALL parallel workers — always wrap in try/except
  • Set max_concurrency in the config passed to app.invoke(), e.g. config={"max_concurrency": 4}, to keep parallel calls from exceeding your provider's RPM limit
  • Correlated errors are the silent failure — 3 models wrong in the same direction gives false confidence; use diverse prompts, not just temperatures
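Three of the points above (latency is the max rather than the sum, one failure should not sink the batch, and concurrency should be capped) can be illustrated without any framework. The following is a minimal sketch in plain asyncio, not LangGraph code: `fake_llm_call`, `run_parallel`, and `demo` are hypothetical names, and the sleeps stand in for real provider calls.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional, Tuple

Result = Tuple[Optional[str], Optional[str]]  # (label, error message)


async def fake_llm_call(label: str, delay: float, fail: bool = False) -> str:
    """Hypothetical stand-in for an LLM call: sleep, then return or raise."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"simulated provider error for {label}")
    return label


async def run_parallel(
    jobs: List[Callable[[], Awaitable[str]]],
    max_concurrency: int,
) -> List[Result]:
    """Run jobs concurrently with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job: Callable[[], Awaitable[str]]) -> Result:
        async with semaphore:
            try:
                return (await job(), None)
            except Exception as exc:
                # A failed worker becomes a value, so the batch still completes.
                return (None, str(exc))

    return await asyncio.gather(*(guarded(job) for job in jobs))


def demo() -> Tuple[List[Result], float]:
    """Run three simulated workers and report results plus wall-clock time."""
    jobs = [
        lambda: fake_llm_call("spam", 0.1),
        lambda: fake_llm_call("spam", 0.2),
        lambda: fake_llm_call("ham", 0.1, fail=True),
    ]
    start = time.perf_counter()
    results = asyncio.run(run_parallel(jobs, max_concurrency=3))
    return results, time.perf_counter() - start
```

With all three workers in flight, the wall-clock time tracks the slowest worker (about 0.2 s here) rather than the 0.4 s sum. Lowering `max_concurrency` trades that latency for a lower request rate.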

Should I Parallelize at All?

Most articles open with how to build a parallelization pipeline. Start one step earlier: should you? Parallelization multiplies your API calls: voting with three voters triples your input-token cost, and sectioning into N chunks multiplies your request count by N. That added cost and complexity pays off only under specific conditions.
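To make the multiplication concrete, here is a rough usage model. It is a sketch under simplifying assumptions: the function names are hypothetical, output tokens and any synthesis call are ignored, and sectioning is assumed to repeat a fixed instruction prefix (`prompt_overhead`) in every chunk request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    requests: int
    input_tokens: int


def single_call(input_tokens: int) -> Usage:
    """One request that reads the whole input once."""
    return Usage(requests=1, input_tokens=input_tokens)


def voting(input_tokens: int, voters: int) -> Usage:
    """Every voter reads the full input, so input tokens scale with voters."""
    return Usage(requests=voters, input_tokens=voters * input_tokens)


def sectioning(input_tokens: int, sections: int, prompt_overhead: int) -> Usage:
    """The input is read once in total, split across `sections` requests.

    Each request also carries `prompt_overhead` instruction tokens
    (an assumption of this sketch, not a measured figure).
    """
    return Usage(
        requests=sections,
        input_tokens=input_tokens + sections * prompt_overhead,
    )
```

For a 1,000-token input, `voting(1000, 3)` reports 3 requests and 3,000 input tokens, while `sectioning(1000, 4, 50)` reports 4 requests and 1,200 input tokens. Voting multiplies tokens; sectioning mostly multiplies requests.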

| Scenario | Verdict | Reason |
| --- | --- | --- |
| Single-call accuracy already meets requirements | Don't parallelize | You're paying N× for a benefit you don't need |
| Tasks have dependencies (narrative text, code with cross-file refs) | Don't section | Workers summarizing fragment 3 don't know what fragment 2 said |
| Accuracy is the bottleneck, categories are discrete | Voting | Redundancy helps when the single-call answer is a close call |
| Speed is the bottleneck, input is large and splittable | Sectioning | N parallel workers finish in max(worker latencies), not N × latency |
| Tasks require planning + coordination across subtasks | Orchestrator-Worker | Parallelization assumes no inter-worker coordination; O-W provides it |
| Categories overlap / inputs are frequently ambiguous | Voting may not help | Low agreement rate means the task is too fuzzy; fix the prompt first |

The honest trade: before running 3× Haiku voting, ask whether 1× Sonnet 4.6 or Opus 4.7 already answers correctly. A single more-capable model often beats an ensemble of weaker ones at lower total cost. Measure first.
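One way to "measure first" is to score the single-model baseline and the voting ensemble on the same labeled set before committing to either. The sketch below uses toy keyword classifiers as hypothetical stand-ins for model calls; the dataset, the labels, and every function name are illustrative, not results from real models.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], str]

# Toy labeled set, invented for illustration only.
DATASET: List[Tuple[str, str]] = [
    ("refund please", "billing"),
    ("app crashes on login", "bug"),
    ("charged twice", "billing"),
    ("button does nothing", "bug"),
]


def majority_vote(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of voters that chose it.

    Ties go to the label that appeared first.
    """
    winner, count = Counter(labels).most_common(1)[0]
    return winner, count / len(labels)


def ensemble(voters: Sequence[Classifier]) -> Classifier:
    """Wrap several classifiers into one that returns the majority label."""
    def predict(text: str) -> str:
        return majority_vote([vote(text) for vote in voters])[0]
    return predict


def accuracy(predict: Classifier, dataset: Sequence[Tuple[str, str]]) -> float:
    """Fraction of examples where the prediction matches the gold label."""
    return sum(predict(text) == gold for text, gold in dataset) / len(dataset)


# Hypothetical "strong" model: knows both billing keywords.
def strong(text: str) -> str:
    return "billing" if ("refund" in text or "charged" in text) else "bug"


# Hypothetical "weak" models: each has a different blind spot.
def weak_a(text: str) -> str:
    return "billing" if "refund" in text else "bug"


def weak_b(text: str) -> str:
    return "billing" if "charged" in text else "bug"


def weak_c(text: str) -> str:
    return "bug" if "crash" in text else "billing"
```

On this toy data each weak classifier scores 0.75, the three-way ensemble scores 1.0 because their mistakes do not overlap, and the single strong classifier also scores 1.0 with one call. The same comparison on a real evaluation set is what decides whether voting is worth its cost.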

| Pattern | LLM calls | Latency | Best for |
| --- | --- | --- | --- |
| Single call (Sonnet/Opus) | 1 |  | Most cases; try this first |
| Voting (N× Haiku) | N | ~1× (max of N) | Accuracy bottleneck with discrete categories |
| Sectioning | N + 1 (synthesis) | ~1× (max of N) | Speed bottleneck with independent document sections |
| Orchestrator-Worker | 1 + N + 1 | Higher (plan phase adds latency) | Complex tasks requiring coordination |
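The call counts in the table can be written down as a small lookup, which is handy when budgeting requests against a rate limit. This is only a sketch: `llm_calls` and the pattern keys are hypothetical names, and reading "1 + N + 1" as one planning call, N workers, and one synthesis call is an assumption.

```python
from typing import Callable, Dict

# Call-count formulas taken from the table; n is the number of voters,
# sections, or workers.
CALLS_BY_PATTERN: Dict[str, Callable[[int], int]] = {
    "single": lambda n: 1,
    "voting": lambda n: n,
    "sectioning": lambda n: n + 1,               # N workers + 1 synthesis call
    "orchestrator_worker": lambda n: 1 + n + 1,  # assumed: plan + N workers + synthesis
}


def llm_calls(pattern: str, n: int = 1) -> int:
    """Return the number of LLM calls a pattern makes for n parallel units."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if pattern not in CALLS_BY_PATTERN:
        raise ValueError(f"unknown pattern: {pattern!r}")
    return CALLS_BY_PATTERN[pattern](n)
```

For example, `llm_calls("sectioning", 4)` gives 5 and `llm_calls("orchestrator_worker", 4)` gives 6.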
[Figure: decision flowchart. Input task → Single call accurate enough? Yes: don't parallelize (save 3× cost). No → Tasks decompose independently? No: Orchestrator-Worker or Prompt Chaining. Yes → Primary bottleneck? Accuracy: Voting (accuracy/confidence). Speed: Sectioning (speed/throughput).]

Parallelization earns its keep only when a single call can't meet requirements and tasks decompose independently
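The decision flow in this section is small enough to encode directly, which makes the order of the questions explicit. A minimal sketch follows; the function name, parameter names, and return strings are hypothetical.

```python
def choose_pattern(
    single_call_accurate_enough: bool,
    decomposes_independently: bool,
    bottleneck: str,
) -> str:
    """Walk the three questions in order and return a recommended pattern.

    `bottleneck` must be "accuracy" or "speed"; it is only consulted
    once the first two checks have passed.
    """
    if single_call_accurate_enough:
        return "single call (don't parallelize)"
    if not decomposes_independently:
        return "orchestrator-worker or prompt chaining"
    if bottleneck == "accuracy":
        return "voting"
    if bottleneck == "speed":
        return "sectioning"
    raise ValueError(f"unknown bottleneck: {bottleneck!r}")
```

Note that the accuracy check comes first: if a single call already meets requirements, the other two answers never matter.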