Batching & Throughput
Maximize LLM serving throughput with continuous batching, dynamic request grouping, and request coalescing. Understand the prefill vs decode bottleneck and tune the throughput-latency tradeoff for your workload.
Quick Reference
- Continuous batching: process new requests while others are still generating — 3-5x throughput vs static batching
- Dynamic batching: collect requests over a short window (5-50ms) and process as a group
- Prefill phase (prompt processing) is compute-bound; decode phase (token generation) is memory-bandwidth-bound
- Request coalescing: merge semantically similar concurrent requests to serve one response to many users
- Tradeoff: more batching = higher throughput but higher per-request latency — tune based on your SLA
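The dynamic-batching window from the list above can be sketched as a small collector that returns as soon as the batch fills or the window elapses. This is a minimal illustration, not any server's real API; the function name, queue type, and default parameters are assumptions:

```python
import time
from collections import deque

def dynamic_batch(queue, max_batch=8, window_ms=20):
    """Pop requests until the batch fills or the window expires,
    whichever comes first (hypothetical helper for illustration)."""
    batch = []
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch and time.monotonic() < deadline:
        if queue:
            batch.append(queue.popleft())
        elif batch:
            # Queue drained and we already have work: ship a partial batch
            # rather than waiting out the full window.
            break
    return batch

pending = deque(range(10))        # ten queued request IDs
batch = dynamic_batch(pending)    # fills to max_batch immediately
```

With ten requests already queued, the collector returns a full batch of 8 and leaves 2 behind for the next window; in a real server the empty-queue branch would await new arrivals instead of spinning.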
Why Batching Matters for LLM Serving
Without batching, an LLM server processes one request at a time. The GPU is heavily underutilized because the decode phase (generating tokens one by one) uses only a fraction of the GPU's compute capacity: each decode step must stream the entire set of model weights from memory to produce a single token, so it is bottlenecked on memory bandwidth, not arithmetic. Batching amortizes each weight read across many requests, filling the idle compute and achieving 3-10x higher total throughput.
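A back-of-envelope calculation makes the bandwidth bound concrete. The model size and bandwidth below are illustrative assumed figures (roughly a 7B-parameter model in fp16 on a modern datacenter GPU), not measurements:

```python
# Decode must read every weight from HBM once per generated token,
# so per-request throughput is capped at bandwidth / weight bytes.
weights_gb = 14        # assumed: 7B params * 2 bytes (fp16)
bandwidth_gbs = 2000   # assumed: ~2 TB/s HBM bandwidth

single_request_tok_s = bandwidth_gbs / weights_gb   # ~143 tok/s ceiling

# A batch of B requests reuses the same weight read to produce B tokens,
# so aggregate throughput scales roughly linearly until compute saturates.
batch_size = 16
aggregate_tok_s = single_request_tok_s * batch_size

print(round(single_request_tok_s), round(aggregate_tok_s))
```

The single-request ceiling (~143 tok/s here) is far below what the GPU's arithmetic units could sustain, which is exactly the gap batching fills.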
| Batching Strategy | Throughput Gain | Latency Impact | Implementation Complexity |
|---|---|---|---|
| No batching (sequential) | 1x (baseline) | Lowest per-request | None |
| Static batching | 2-4x | Wait for batch to fill — high variance | Low |
| Dynamic batching | 3-5x | Bounded wait window (5-50ms) | Medium |
| Continuous batching | 5-10x | No wait — requests enter immediately | High (vLLM/TGI handle this) |
| Continuous + coalescing | 10-20x | Minimal added latency | Very high |
vLLM and TGI both implement continuous batching out of the box. New requests are inserted into the running batch without waiting for all current requests to finish. A request that completes its generation is removed and a waiting request takes its slot. This eliminates the 'waiting for the batch' latency penalty of static batching.
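The slot-swapping behavior can be illustrated with a toy scheduler loop. This is a simplified sketch, not vLLM's or TGI's actual scheduler; the request format (id, tokens remaining) and one-token-per-step semantics are invented for illustration:

```python
from collections import deque

def continuous_batching(requests, max_batch=2, step_limit=100):
    """Toy continuous-batching loop: requests are (id, tokens_remaining)
    pairs. Finished requests leave mid-stream and waiting ones take the
    freed slot, so no request waits for a whole batch to drain."""
    waiting = deque(requests)
    running = {}
    completed = []
    for _ in range(step_limit):
        # Admit waiting requests into free slots immediately (no batch wait).
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        if not running:
            break
        # One decode step: every running request generates one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                completed.append(rid)
                del running[rid]   # slot is freed for the next step
    return completed

order = continuous_batching(
    [("a", 2), ("b", 1), ("c", 3), ("d", 1), ("e", 1)], max_batch=2
)
```

Short requests like "b" and "d" finish and exit while long ones are still generating, which is the property that removes static batching's wait penalty.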