Batching & Throughput
Configure continuous batching, speculative decoding, and disaggregated serving to maximize LLM throughput on vLLM and SGLang. Understand prefill/decode interference, tune per-workload profiles, recognize the five failure modes before they hit production, and build the monitoring layer that tells you when your configuration has drifted.
Quick Reference
- Continuous batching admits new requests at every iteration, so nothing waits for the slowest request in the batch to finish
- Prefill (processing input) is compute-bound; decode (generating output) is memory-bandwidth-bound, so they need different optimizations
- Chunked prefill prevents a single large prompt from stalling decode for all other requests in the batch
- Speculative decoding drafts N tokens per step and verifies them in one pass of the target model, instead of generating 1 token per step; expect up to a 2-3x speedup on code-heavy workloads
- Disaggregated serving routes prefill and decode to separate GPU pools; it is worth the complexity only above ~100 sustained req/s
- KV cache utilization above 85% is the leading indicator of batch failures; monitor it before tuning anything else
- If you use a hosted API (Anthropic, OpenAI, Google), the provider handles batching; this article is for self-hosters
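To make the first point concrete, here is a toy sketch that counts decode iterations under static and continuous batching. It is not how vLLM or SGLang schedule work internally: it ignores prefill, memory limits, and preemption, and the function names and request lengths are illustrative assumptions.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: free slots are refilled before every iteration."""
    queue = deque(lengths)  # remaining output tokens per waiting request
    running = []            # remaining output tokens per running request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode iteration: each running request emits one token,
        # and requests that have no tokens left leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    output_lengths = [2, 10, 2, 2, 10, 2]  # hypothetical tokens to generate
    print("static:", static_batching_steps(output_lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these made-up lengths and two slots, the static schedule needs 22 iterations and the continuous one needs 16, because short requests no longer hold a slot idle while a long neighbor finishes.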
Should You Optimize Batching?
Before reaching for tuning parameters, check whether batching optimization is even your problem. If you call a hosted API, the provider's infrastructure handles batching — your only lever is request parallelism from your application code, and the self-hosted tuning in this article does not apply. If you run vLLM or SGLang, the defaults are already good for most workloads. Tuning pays off when you have specific throughput targets, cost-per-token pressure, or latency SLAs that the defaults don't meet.
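For hosted APIs, request parallelism usually comes down to capping how many calls are in flight at once. The sketch below shows one way to do that with `asyncio`. The `call_model` argument is a hypothetical coroutine standing in for whatever async client call your provider's SDK exposes, and `max_in_flight=8` is an arbitrary starting value, not a recommendation.

```python
import asyncio


async def run_with_limit(prompts, call_model, max_in_flight=8):
    """Send prompts concurrently, with at most max_in_flight requests active."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await call_model(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_call_model(prompt):
    """Stand-in for a real provider call (assumption for this sketch)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    results = asyncio.run(
        run_with_limit(["a", "b", "c"], fake_call_model, max_in_flight=2)
    )
    print(results)
```

Raising `max_in_flight` increases throughput until you reach the provider's rate limits, so in practice the cap is typically set from your account's quota rather than tuned freely.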
| Situation | Do You Need to Tune? | What to Do |
|---|---|---|
| Using Anthropic, OpenAI, or Google APIs | No | Tune request parallelism in your app; use the provider's Batch API for async workloads |
| Self-hosting with default vLLM or SGLang config | Probably not yet | Baseline first (see the First-30-Days Runbook); tune only if TTFT or cost targets aren't met |
| High-throughput workload (>50 req/s sustained) | Yes | Tune --max-num-seqs and --max-num-batched-tokens; consider speculative decoding |
| Mixed interactive + batch traffic | Yes | Implement priority queuing; tune separate profiles for each traffic class |
| Cost-per-token pressure at scale | Yes | Speculative decoding + prefix caching are the highest-ROI levers; quantization is in the sibling article |
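As a rough illustration of keeping separate profiles per traffic class, the sketch below builds a `vllm serve` command line from a small profile table. Only the two flags named in the table above are used. The profile names, the numeric values, and the model identifier `my-org/my-model` are placeholders chosen for this example, not tuned recommendations.

```python
import shlex

# Hypothetical per-workload profiles; the numbers are placeholders to
# replace with values from your own baseline measurements.
PROFILES = {
    "interactive": {"max_num_seqs": 64, "max_num_batched_tokens": 8192},
    "batch": {"max_num_seqs": 256, "max_num_batched_tokens": 16384},
}


def vllm_serve_command(model, max_num_seqs, max_num_batched_tokens):
    """Return a shell-safe `vllm serve` command for the given batching limits."""
    args = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        # "my-org/my-model" is a placeholder model identifier.
        print(name, "->", vllm_serve_command("my-org/my-model", **profile))
```

Keeping the profiles in one table makes it easier to diff what actually changed between a baseline run and a tuned run.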
This article assumes you have already chosen between vLLM and SGLang. If you haven't, read the Self-Hosting article first — it covers framework selection, PagedAttention vs RadixAttention, and basic deployment. This article covers what comes after: tuning the configuration you've already chosen.