Intermediate · 12 min

Batch Processing

batch() parallelizes LLM calls client-side: requests run concurrently and the results come back together, in input order. When you want the roughly 50% provider discount and can tolerate about an hour of latency (24 hours at worst), use your provider's asynchronous batch API instead. This article shows which to pick, how to handle partial failures, and what a production pipeline looks like.

Quick Reference

  • model.batch([...inputs]) fires all requests concurrently and blocks until all complete
  • batch_as_completed() yields (index, result) as each request finishes — results arrive out of order
  • abatch() / abatch_as_completed() are the async variants — use these in FastAPI / async apps
  • return_exceptions=True collects failures as exceptions instead of raising on the first error
  • max_concurrency in config caps parallel calls — tune to stay within provider rate limits
  • Provider batch APIs (Anthropic, OpenAI) offer 50% off and complete most batches in ~1h
  • InMemoryRateLimiter on the model caps requests per second (it is a token-bucket limiter and does not count LLM tokens), independent of the concurrency cap
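As a rough illustration of the return_exceptions and max_concurrency bullets above, the sketch below splits a batch into successes and failures by index. The helper name run_batch and the FakeModel class are hypothetical: FakeModel only mimics the shape of the batch() call so the example runs without an API key, and a real chat model would be passed in its place.

```python
from typing import Any


def run_batch(runnable: Any, inputs: list[str], max_concurrency: int = 5):
    """Run inputs through runnable.batch() and split successes from failures."""
    results = runnable.batch(
        inputs,
        config={"max_concurrency": max_concurrency},  # cap on parallel calls
        return_exceptions=True,  # failures come back as Exception objects
    )
    successes, failures = {}, {}
    # batch() returns results in input order, so the index identifies the input.
    for index, result in enumerate(results):
        if isinstance(result, Exception):
            failures[index] = result
        else:
            successes[index] = result
    return successes, failures


class FakeModel:
    """Hypothetical stand-in that mimics the batch() call shape (no network)."""

    def batch(self, inputs, config=None, *, return_exceptions=False):
        outputs = []
        for text in inputs:
            if text:
                outputs.append(text.upper())
            elif return_exceptions:
                outputs.append(ValueError("empty input"))
            else:
                raise ValueError("empty input")
        return outputs


successes, failures = run_batch(FakeModel(), ["alpha", "", "gamma"])
print(successes)
print(sorted(failures))
```

Keeping the failures keyed by index makes it straightforward to retry only the inputs that failed, rather than resubmitting the whole batch.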

When to Use Batch Processing

You have N independent inputs — classifying 500 documents, translating 200 strings, summarizing 100 articles. The question isn't whether to parallelize; it's which parallelism to use. Three options exist, each with a different cost/latency tradeoff.

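| Option                 | Latency              | Cost             | When to use                                          |
|------------------------|----------------------|------------------|------------------------------------------------------|
| batch() / abatch()     | Seconds              | 1× standard rate | Need results now, any batch size up to ~1000         |
| Provider batch API     | ~1h typical, 24h max | 0.5× (50% off)   | Latency-tolerant offline jobs, large volumes         |
| asyncio.gather() (raw) | Seconds              | 1× standard rate | Already in async code, don't want LangChain overhead |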
Decision flow for N independent inputs:

  • Need results in < 30 seconds? No → Provider Batch API (Anthropic / OpenAI async endpoint): 50% cost discount, completes in ~1h
  • Need results in < 30 seconds? Yes → 1× cost, results in seconds; then check for an async context:
      Yes → abatch() / abatch_as_completed()
      No → batch() / batch_as_completed()

Start with latency need → then check sync vs async context
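For the raw asyncio.gather() option, a minimal sketch might look like the following. It uses only the standard library; call_model is a hypothetical stand-in for an async LLM call, and the semaphore plays the role that max_concurrency plays for batch().

```python
import asyncio


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM call."""
    await asyncio.sleep(0.01)  # simulate network latency
    if not prompt:
        raise ValueError("empty prompt")
    return f"summary of {prompt}"


async def gather_bounded(prompts: list[str], limit: int = 5) -> list:
    """Run all prompts concurrently, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    # return_exceptions=True keeps one failure from cancelling the rest;
    # results come back in the same order as the prompts.
    return await asyncio.gather(
        *(guarded(p) for p in prompts), return_exceptions=True
    )


results = asyncio.run(gather_bounded(["doc-1", "", "doc-3"], limit=2))
print(results)
```

This gives the same latency and cost profile as abatch(), at the price of writing the concurrency cap and error collection by hand.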

When NOT to use batch()

batch() is for independent inputs only. If input B depends on the result of input A, you need sequential calls or a chain. Also avoid batch() for streaming UX: when a user is waiting on a response, call stream() so output appears as it is generated, because batch() returns nothing until every call has completed.
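To make the dependency case concrete, here is a small sketch of a rolling summary in which each prompt is built from the previous result, so the calls cannot be batched. The names rolling_summary and fake_invoke are hypothetical; fake_invoke stands in for a call such as model.invoke() so the example runs offline.

```python
from typing import Callable


def rolling_summary(invoke: Callable[[str], str], chunks: list[str]) -> str:
    """Summarize chunks one at a time, feeding each result into the next call."""
    summary = ""
    for chunk in chunks:
        prompt = f"Previous summary: {summary}\nNew text: {chunk}"
        # This call must finish before the next prompt can be built.
        summary = invoke(prompt)
    return summary


calls: list[str] = []


def fake_invoke(prompt: str) -> str:
    """Hypothetical stand-in for a model call; records prompts for inspection."""
    calls.append(prompt)
    return f"s{len(calls)}"


final = rolling_summary(fake_invoke, ["a", "b", "c"])
print(final)
```

Because the second prompt contains the first result, there is no list of inputs to hand to batch() up front, which is the signal to use sequential calls or a chain.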