Parallelization: Concurrent LLM Execution
Parallelization runs multiple LLMs concurrently to gain confidence (voting: same input, N opinions) or speed (sectioning: split input, parallel workers). This article starts with whether you should parallelize at all, walks through the actual cost math — 3× Haiku can be cheaper than 1× Sonnet with caching — and covers the two production failure modes most articles skip: superstep atomicity and correlated errors.
Quick Reference
- →Parallelization earns its keep only when single-call accuracy is below requirements AND tasks decompose independently
- →Voting (same input, N opinions) is for accuracy; sectioning (different chunks, same task) is for speed
- →Parallel latency = max(worker latencies), not sum — voting adds almost no latency over a single call
- →3× Haiku 4.5 ≈ 77% of 1× Sonnet 4.6 cost without caching; ~40% with Anthropic prompt caching
- →LangGraph superstep atomicity: one unhandled exception retries ALL parallel workers — always wrap in try/except
- →Set max_concurrency on app.invoke() to prevent parallel calls from exceeding your provider's RPM limit
- →Correlated errors are the silent failure — 3 models wrong in the same direction gives false confidence; use diverse prompts, not just temperatures
Should I Parallelize at All?
Most articles open with how to build a parallelization pipeline. Start one step earlier: should you? Parallelization multiplies your API calls — voting 3× your input costs, sectioning N× your requests. That complexity pays off only under specific conditions.
| Scenario | Verdict | Reason |
|---|---|---|
| Single-call accuracy already meets requirements | Don't parallelize | You're paying N× for a benefit you don't need |
| Tasks have dependencies (narrative text, code with cross-file refs) | Don't section | Workers summarizing fragment 3 don't know what fragment 2 said |
| Accuracy is the bottleneck, categories are discrete | Voting | Redundancy helps when the single-call answer is a close call |
| Speed is the bottleneck, input is large and splittable | Sectioning | N parallel workers finish in max(worker latencies), not N × latency |
| Tasks require planning + coordination across subtasks | Orchestrator-Worker | Parallelization assumes no inter-worker coordination — O-W provides it |
| Categories overlap / inputs are frequently ambiguous | Voting may not help | Low agreement rate means the task is too fuzzy — fix the prompt first |
The honest trade: before running 3× Haiku voting, ask whether 1× Sonnet 4.6 or Opus 4.7 already answers correctly. A single more-capable model often beats an ensemble of weaker ones at lower total cost. Measure first.
| Pattern | LLM calls | Latency | Best for |
|---|---|---|---|
| Single call (Sonnet/Opus) | 1 | 1× | Most cases — try this first |
| Voting (N× Haiku) | N | ~1× (max of N) | Accuracy bottleneck with discrete categories |
| Sectioning | N + 1 (synthesis) | ~1× (max of N) | Speed bottleneck with independent document sections |
| Orchestrator-Worker | 1 + N + 1 | Higher (plan phase adds latency) | Complex tasks requiring coordination |
Parallelization earns its keep only when a single call can't meet requirements and tasks decompose independently