Advanced16 min

Parallelization: Concurrent LLM Execution

Parallelization runs multiple LLMs concurrently to gain confidence (voting: same input, N opinions) or speed (sectioning: split input, parallel workers). This article starts with whether you should parallelize at all, walks through the actual cost math — 3× Haiku can be cheaper than 1× Sonnet with caching — and covers the two production failure modes most articles skip: superstep atomicity and correlated errors.

Quick Reference

→Parallelization earns its keep only when single-call accuracy is below requirements AND tasks decompose independently
→Voting (same input, N opinions) is for accuracy; sectioning (different chunks, same task) is for speed
→Parallel latency = max(worker latencies), not sum — voting adds almost no latency over a single call
→3× Haiku 4.5 ≈ 77% of 1× Sonnet 4.6 cost without caching; ~40% with Anthropic prompt caching
→LangGraph superstep atomicity: one unhandled exception retries ALL parallel workers — always wrap in try/except
→Set max_concurrency on app.invoke() to prevent parallel calls from exceeding your provider's RPM limit
→Correlated errors are the silent failure — 3 models wrong in the same direction gives false confidence; use diverse prompts, not just temperatures

Should I Parallelize at All?

Most articles open with how to build a parallelization pipeline. Start one step earlier: should you? Parallelization multiplies your API calls — voting 3× your input costs, sectioning N× your requests. That complexity pays off only under specific conditions.

Scenario	Verdict	Reason
Single-call accuracy already meets requirements	Don't parallelize	You're paying N× for a benefit you don't need
Tasks have dependencies (narrative text, code with cross-file refs)	Don't section	Workers summarizing fragment 3 don't know what fragment 2 said
Accuracy is the bottleneck, categories are discrete	Voting	Redundancy helps when the single-call answer is a close call
Speed is the bottleneck, input is large and splittable	Sectioning	N parallel workers finish in max(worker latencies), not N × latency
Tasks require planning + coordination across subtasks	Orchestrator-Worker	Parallelization assumes no inter-worker coordination — O-W provides it
Categories overlap / inputs are frequently ambiguous	Voting may not help	Low agreement rate means the task is too fuzzy — fix the prompt first

The honest trade: before running 3× Haiku voting, ask whether 1× Sonnet 4.6 or Opus 4.7 already answers correctly. A single more-capable model often beats an ensemble of weaker ones at lower total cost. Measure first.

Pattern	LLM calls	Latency	Best for
Single call (Sonnet/Opus)	1	1×	Most cases — try this first
Voting (N× Haiku)	N	~1× (max of N)	Accuracy bottleneck with discrete categories
Sectioning	N + 1 (synthesis)	~1× (max of N)	Speed bottleneck with independent document sections
Orchestrator-Worker	1 + N + 1	Higher (plan phase adds latency)	Complex tasks requiring coordination

Parallelization earns its keep only when a single call can't meet requirements and tasks decompose independently

Two Patterns, Two Goals

Same input → multiple parallel workers (different temps) → aggregate via voting

What Does Parallelization Cost?

Concrete cost math for a support ticket classifier (stated assumptions: 500 input tokens, 15 output tokens, 300-token system prompt that is cacheable):

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.