Batch Processing
batch() parallelizes LLM calls client-side: all requests fire concurrently, and the results return together in input order. When you need roughly 50% cost savings and can tolerate about an hour of latency, use your provider's async batch API instead. This article shows you which to pick, how to handle partial failures, and what a production pipeline looks like.
Quick Reference
- model.batch([...inputs]) fires all requests concurrently and blocks until all complete
- batch_as_completed() yields (index, result) as each request finishes, so results arrive out of order
- abatch() / abatch_as_completed() are the async variants; use these in FastAPI / async apps
- return_exceptions=True returns failures as exception objects in the result list instead of raising on the first error
- max_concurrency in config caps parallel calls; tune it to stay within provider rate limits
- Provider batch APIs (Anthropic, OpenAI) offer 50% off and complete most batches in ~1h
- InMemoryRateLimiter on the model throttles requests per second using a token-bucket algorithm, independent of the concurrency cap; it limits request rate, not LLM token counts
When to Use Batch Processing
You have N independent inputs — classifying 500 documents, translating 200 strings, summarizing 100 articles. The question isn't whether to parallelize; it's which parallelism to use. Three options exist, each with a different cost/latency tradeoff.
| Option | Latency | Cost | When to use |
|---|---|---|---|
| batch() / abatch() | Seconds | 1× standard rate | Need results now, any batch size up to ~1000 |
| Provider batch API | ~1h typical, 24h max | 0.5× (50% off) | Latency-tolerant offline jobs, large volumes |
| asyncio.gather() (raw) | Seconds | 1× standard rate | Already in async code, don't want LangChain overhead |
Start with the latency requirement, then check whether you are in a sync or async context.
batch() is for independent inputs only. If input B depends on the result of input A, you need sequential calls or a chain. Also avoid batch() for streaming UX — a user waiting for a response should use stream(), not batch().
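For the raw asyncio.gather() option in the table, a standard-library sketch might look like the following. `call_model` is a hypothetical placeholder for whatever async client call you already have, and the semaphore plays the role that max_concurrency plays for batch(); neither name comes from LangChain.

```python
import asyncio


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM client call."""
    await asyncio.sleep(0.01)  # simulate network latency
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


async def gather_bounded(prompts, max_concurrency=5):
    """Run all prompts concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        async with semaphore:
            return await call_model(prompt)

    # return_exceptions=True keeps one failure from cancelling the rest;
    # results come back in input order.
    return await asyncio.gather(
        *(guarded(p) for p in prompts),
        return_exceptions=True,
    )


results = asyncio.run(
    gather_bounded(["translate this", "", "summarize that"], max_concurrency=2)
)
```

As with batch(), this only suits independent inputs: every coroutine is scheduled up front, so no call can consume another call's result.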