Advanced · 18 min

Batching & Throughput

Configure continuous batching, speculative decoding, and disaggregated serving to maximize LLM throughput on vLLM and SGLang. Understand prefill/decode interference, tune per-workload profiles, recognize the five failure modes before they hit production, and build the monitoring layer that tells you when your configuration has drifted.

Quick Reference

  • Continuous batching admits new requests into the running batch at each iteration, so no request waits for the slowest one to finish
  • Prefill (processing input) is compute-bound; decode (generating output) is memory-bandwidth-bound — they need different optimizations
  • Chunked prefill prevents a single large prompt from stalling decode for all other requests in the batch
  • Speculative decoding drafts N tokens per step and verifies them in one pass of the target model, so a step can emit several tokens instead of 1; 2-3x speedup on code-heavy workloads
  • Disaggregated serving routes prefill and decode to separate GPU pools — worth the complexity only above ~100 sustained req/s
  • KV cache utilization above 85% is the leading indicator of batch failures — monitor before tuning anything else
  • If you use a hosted API (Anthropic, OpenAI, Google), the provider handles batching — this article is for self-hosters
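To make the first bullet concrete, here is a toy step-count comparison between static and continuous batching. It is a sketch under simplifying assumptions, not how vLLM or SGLang schedule internally: every request emits exactly one token per iteration, prefill cost is ignored, and the function names are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Iterations needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        # The whole batch is held until the slowest member is done.
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Iterations needed when a freed slot is refilled at the next iteration."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: each running request emits one token,
        # and finished requests leave the batch immediately.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [1, 4, 1, 1]  # output tokens per request (made-up numbers)
    print(static_batching_steps(lengths, batch_size=2))      # 5
    print(continuous_batching_steps(lengths, batch_size=2))  # 4
```

The gap grows when output lengths vary widely within a batch, which is the usual case for chat and code traffic.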

Should You Optimize Batching?

Before reaching for tuning parameters, check whether batching optimization is even your problem. If you call a hosted API, the provider's infrastructure handles batching — your only lever is request parallelism from your application code, and the self-hosted tuning in this article does not apply. If you run vLLM or SGLang, the defaults are already good for most workloads. Tuning pays off when you have specific throughput targets, cost-per-token pressure, or latency SLAs that the defaults don't meet.
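For the hosted-API case, request parallelism can be sketched as a bounded pool of in-flight calls. The example below assumes a hypothetical `call_model` coroutine that wraps whichever provider SDK you use; the function names and the `max_in_flight` default are placeholders, and the right limit depends on your provider's rate limits.

```python
import asyncio


async def run_with_parallelism(prompts, call_model, max_in_flight=8):
    """Send prompts concurrently, never exceeding max_in_flight open requests."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded_call(prompt):
        # Wait for a free slot, then issue the request.
        async with semaphore:
            return await call_model(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(bounded_call(p) for p in prompts))


async def fake_call_model(prompt):
    """Stand-in for a real provider call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


if __name__ == "__main__":
    outputs = asyncio.run(
        run_with_parallelism(["a", "b", "c"], fake_call_model, max_in_flight=2)
    )
    print(outputs)
```

Raising `max_in_flight` increases throughput until the provider starts rate-limiting, at which point retries with backoff matter more than additional concurrency.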

Situation | Do You Need to Tune? | What to Do Instead
Using Anthropic, OpenAI, or Google APIs | No | Tune request parallelism in your app; use the provider's Batch API for async workloads
Self-hosting with default vLLM or SGLang config | Probably not yet | Baseline first (see the First-30-Days Runbook); tune only if TTFT or cost targets aren't met
High throughput workload (>50 req/s sustained) | Yes | Tune --max-num-seqs and --max-num-batched-tokens; consider speculative decoding
Mixed interactive + batch traffic | Yes | Implement priority queuing; tune separate profiles for each traffic class
Cost-per-token pressure at scale | Yes | Speculative decoding + prefix caching are the highest-ROI levers; quantization is in the sibling article
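For the high-throughput row, the two flags can be set explicitly when launching vLLM. The sketch below only builds the `vllm serve` command line; the model name and both numeric values are placeholders rather than recommendations, and the sanity check is this sketch's own guard, added on the assumption that each running sequence needs at least one token of budget per step.

```python
import shlex


def vllm_serve_command(model, max_num_seqs, max_num_batched_tokens):
    """Build a `vllm serve` argv with the two batching limits set explicitly."""
    # Sketch-level sanity check: the per-step token budget should cover
    # at least one token for every concurrently running sequence.
    if max_num_batched_tokens < max_num_seqs:
        raise ValueError("max_num_batched_tokens must be >= max_num_seqs")
    return [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


if __name__ == "__main__":
    # Placeholder model name and values; baseline before changing them.
    argv = vllm_serve_command("my-org/my-model", 256, 8192)
    print(shlex.join(argv))
```

Keeping the launch arguments in code makes it easier to version one profile per traffic class and to diff a tuned configuration against the baseline.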
Prerequisite: pick your framework first

This article assumes you have already chosen between vLLM and SGLang. If you haven't, read the Self-Hosting article first — it covers framework selection, PagedAttention vs RadixAttention, and basic deployment. This article covers what comes after: tuning the configuration you've already chosen.