Batching & Throughput

Configure continuous batching, speculative decoding, and disaggregated serving to maximize LLM throughput on vLLM and SGLang. Understand prefill/decode interference, tune per-workload profiles, recognize the five failure modes before they hit production, and build the monitoring layer that tells you when your configuration has drifted.

Quick Reference

→Continuous batching admits new requests at every iteration, so no request waits for the slowest one in the batch to finish
→Prefill (processing input) is compute-bound; decode (generating output) is memory-bandwidth-bound — they need different optimizations
→Chunked prefill prevents a single large prompt from stalling decode for all other requests in the batch
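→Speculative decoding drafts N tokens cheaply and verifies them in a single pass of the target model, so one step can emit several tokens instead of 1; 2-3x speedup on code-heavy workloads, where most drafted tokens are accepted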
→Disaggregated serving routes prefill and decode to separate GPU pools — worth the complexity only above ~100 sustained req/s
→KV cache utilization above 85% is the leading indicator of batch failures — monitor before tuning anything else
→If you use a hosted API (Anthropic, OpenAI, Google), the provider handles batching — this article is for self-hosters

Should You Optimize Batching?

Before reaching for tuning parameters, check whether batching optimization is even your problem. If you call a hosted API, the provider's infrastructure handles batching — your only lever is request parallelism from your application code, and the self-hosted tuning in this article does not apply. If you run vLLM or SGLang, the defaults are already good for most workloads. Tuning pays off when you have specific throughput targets, cost-per-token pressure, or latency SLAs that the defaults don't meet.

Situation	Do You Need to Tune?	What to Do Instead
Using Anthropic, OpenAI, or Google APIs	No	Tune request parallelism in your app; use the provider's Batch API for async workloads
Self-hosting with default vLLM or SGLang config	Probably not yet	Baseline first (see the First-30-Days Runbook); tune only if TTFT or cost targets aren't met
High-throughput workload (>50 req/s sustained)	Yes	Tune --max-num-seqs and --max-num-batched-tokens; consider speculative decoding
Mixed interactive + batch traffic	Yes	Implement priority queuing; tune separate profiles for each traffic class
Cost-per-token pressure at scale	Yes	Speculative decoding + prefix caching are the highest-ROI levers; quantization is in the sibling article
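
For the hosted-API row, "request parallelism" usually means capping how many requests your application keeps in flight at once. The sketch below shows one way to do that with an asyncio semaphore. The function `call_model` is a hypothetical stand-in that only sleeps; it is not any provider's SDK, and the limit of 8 is an arbitrary illustration, not a recommendation.

```python
import asyncio

async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a hosted-API call; swap in your client's async method."""
    await asyncio.sleep(0.01)  # simulate network + inference latency
    return f"response to: {prompt}"

async def run_with_limit(prompts, max_in_flight=8):
    """Send prompts concurrently, never exceeding max_in_flight open requests."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(prompt):
        async with semaphore:  # blocks while max_in_flight requests are already open
            return await call_model(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(worker(p) for p in prompts))

def run(prompts, max_in_flight=8):
    """Synchronous entry point for scripts."""
    return asyncio.run(run_with_limit(prompts, max_in_flight))
```

Raising `max_in_flight` increases throughput until you reach the provider's rate limits, so the right value depends on your account's quota rather than on anything in this article.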

Prerequisite: pick your framework first

This article assumes you have already chosen between vLLM and SGLang. If you haven't, read the Self-Hosting article first — it covers framework selection, PagedAttention vs RadixAttention, and basic deployment. This article covers what comes after: tuning the configuration you've already chosen.

Continuous Batching: How It Works

Traditional static batching waits for a fixed number of requests before processing them as a group. Every request in the batch must finish before the next batch starts, so one slow request with a long output holds up everyone else. Continuous batching, introduced by Orca (2022) and now standard in vLLM and SGLang, schedules at the iteration level. After each token generation step, finished sequences are evicted and waiting requests are inserted into the running batch as soon as there is room for them. No request waits for another to finish; the batch composition can change at every step.
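
The difference is easiest to see in a toy simulation. The sketch below is not how vLLM or SGLang schedule internally; it assumes all requests arrive at step 0, every sequence emits exactly one token per step, output lengths are positive, and prefill cost and KV cache limits are ignored. All function names are illustrative.

```python
from collections import deque

def static_batching(output_lengths, batch_size):
    """Completion step per request when a new batch starts only after the
    previous batch's longest request has finished."""
    completion = []
    start = 0
    for i in range(0, len(output_lengths), batch_size):
        batch = output_lengths[i:i + batch_size]
        completion.extend(start + n for n in batch)
        start += max(batch)  # the next batch waits for the slowest request
    return completion

def continuous_batching(output_lengths, batch_size):
    """Completion step per request when free slots are refilled before every step."""
    waiting = deque(enumerate(output_lengths))
    running = {}  # request index -> tokens still to generate
    completion = [0] * len(output_lengths)
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            idx, n = waiting.popleft()
            running[idx] = n
        # One iteration: every running sequence emits one token.
        step += 1
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                completion[idx] = step  # evict the finished sequence
                del running[idx]
    return completion

def mean(values):
    return sum(values) / len(values)

lengths = [100, 5, 5, 5, 5, 5, 5, 5]  # one long output, seven short ones
print(mean(static_batching(lengths, batch_size=4)))      # 66.875
print(mean(continuous_batching(lengths, batch_size=4)))  # 20.0
```

With one 100-token request among seven 5-token requests, the second static batch cannot start until step 100, while the continuous scheduler reuses the three freed slots at step 5.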

Prefill vs Decode Interference

Prefill and decode compete for the same GPU

Prefill (processing the input prompt) is compute-bound: it processes all input tokens in parallel and keeps the GPU's arithmetic units at high utilization. Decode (generating output tokens) is memory-bandwidth-bound: every decode step reads the full model weights, plus each sequence's KV cache, to produce just one token per sequence, so the arithmetic units sit partially idle. In a continuous batch, prefilling a new request and decoding existing ones share the same iterations, and the prefill's compute demand can starve ongoing decodes, spiking inter-token latency (ITL) for all other requests.
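
A back-of-the-envelope roofline estimate shows why the two phases land on opposite sides of the hardware limit. The sketch below assumes roughly 2 FLOPs per parameter per token and that weights are read from memory once per forward pass; it ignores attention FLOPs and KV cache traffic. The hardware figures (1e15 FLOP/s, 3e12 bytes/s) are round illustrative numbers, not the specification of any particular GPU.

```python
def arithmetic_intensity(tokens_in_step, bytes_per_param=2):
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and weights read once per pass,
    so the parameter count cancels out of the ratio.
    """
    return 2 * tokens_in_step / bytes_per_param

def bound(tokens_in_step, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Classify a step by comparing its intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte at which compute and memory balance
    intensity = arithmetic_intensity(tokens_in_step, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"

# Illustrative hardware numbers (assumptions, not a real GPU spec).
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

print(bound(2048, PEAK_FLOPS, MEM_BANDWIDTH))  # prefill of a 2048-token prompt
print(bound(32, PEAK_FLOPS, MEM_BANDWIDTH))    # decode step with 32 sequences
```

Under these assumptions the ridge point is about 333 FLOPs per byte: a 2048-token prefill sits well above it, while a decode step stays below it until the batch holds several hundred sequences.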
