The Inference Pipeline
What actually happens when you call an LLM API -- from prompt tokenization through logit computation to output sampling. Understand KV caching, sampling strategies (temperature, top-p, top-k), batching, and how these choices affect output quality and latency.
Quick Reference
- Inference: prompt -> tokenize -> compute logits for each vocab token -> sample next token -> repeat
- Prefill phase processes all input tokens in parallel; decode phase generates output tokens one at a time
- KV cache stores attention computations from previous tokens, avoiding redundant recomputation
- Temperature controls randomness: 0 = deterministic (greedy), 1 = default, >1 = more creative/chaotic
- top-p (nucleus sampling) ranks tokens by probability, keeps the smallest set whose cumulative probability reaches p, and samples from that set; at p=0.9, rare tokens outside that 90% mass are excluded
- Speculative decoding uses a small model to draft tokens that the large model verifies, speeding up inference 2-3x
From Prompt to Tokens to Logits
When you send a prompt to an LLM, the first thing that happens is tokenization -- your text is converted to a sequence of integer token IDs. These token IDs are mapped to embedding vectors (learned during training) and passed through every transformer layer. The final layer's output is projected to a vector of size |vocabulary| -- these are the logits, representing the model's raw (unnormalized) score for every possible next token. The token with the highest logit is the model's best guess for what comes next.
- Step 1: Tokenize the prompt into token IDs (e.g., [9906, 11, 1268, ...])
- Step 2: Look up embedding vectors for each token ID
- Step 3: Pass embeddings through all transformer layers (prefill phase)
- Step 4: Project the final hidden state to vocabulary-size logits
- Step 5: Apply softmax to convert logits to probabilities
- Step 6: Sample the next token according to the sampling strategy
- Step 7: Repeat steps 2-6 for each output token (decode phase)
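The steps above can be sketched in a few lines of Python. `toy_model` here is a stand-in for the real transformer forward pass (it just derives fake logits from the last token ID), and sampling is greedy for simplicity:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (step 5)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids):
    """Stand-in for a transformer forward pass (steps 2-4): returns
    fake logits over a 5-token vocabulary derived from the last token."""
    last = token_ids[-1]
    return [float((last + i) % 5) for i in range(5)]

def generate(prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive decode loop: logits -> probabilities -> sample -> repeat."""
    ids = list(prompt_ids)  # tokenization (step 1) assumed done
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy (step 6)
        ids.append(next_id)  # feed back and repeat (step 7)
        if next_id == eos_id:
            break
    return ids

print(generate([9906, 11], max_new_tokens=3))
```

With a real model, `toy_model` would be a full forward pass over the sequence, and step 6 would use the sampling strategies described later.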
The prefill phase processes all input tokens in parallel -- this is fast because GPUs excel at parallel matrix multiplications. The decode phase generates one token at a time, each depending on all previous tokens. This is why prefill is far faster per token than decode: a 1000-token prompt might prefill in 200ms (the bulk of time-to-first-token, TTFT), while generating 100 output tokens takes 2-3 seconds.
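A back-of-envelope latency model makes the asymmetry concrete. The throughput numbers below are illustrative assumptions, not benchmarks; total latency splits into a parallel prefill term and a sequential decode term:

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=5000.0, decode_tps=40.0):
    """Rough end-to-end latency: parallel prefill + sequential decode.
    prefill_tps and decode_tps are assumed, illustrative throughputs."""
    ttft = prompt_tokens / prefill_tps    # all input tokens at once
    decode = output_tokens / decode_tps   # one output token at a time
    return ttft, ttft + decode

ttft, total = estimate_latency(prompt_tokens=1000, output_tokens=100)
print(f"TTFT ~{ttft:.2f}s, total ~{total:.2f}s")
```

With these assumed rates, the 1000-token prompt prefills in 0.2s while the 100 output tokens add 2.5s -- matching the shape of the example above.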
KV Cache: The Key to Fast Inference
During autoregressive generation, the model computes attention over all previous tokens for each new token it generates. Without optimization, generating the 100th token would require recomputing attention for tokens 1-99. The KV cache solves this by storing the Key and Value matrices from all previous tokens, so only the new token's Q, K, V need to be computed. This turns an O(n^2) per-token operation into O(n).
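A toy single-head attention loop shows the mechanics. The `project` function is a stand-in for the learned Q/K/V projections; the point is that the cached decode produces exactly the same outputs as the naive version that rebuilds K and V at every step:

```python
import math

def attn(q, K, V):
    """One query attending over cached keys/values (scaled dot-product)."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(w[i] * V[i][j] for i in range(len(V)))
            for j in range(len(V[0]))]

def project(tok, seed):
    """Stand-in for a learned Q/K/V projection of a token embedding."""
    return [((tok * seed + j) % 7) / 7.0 for j in range(4)]

def decode_with_cache(tokens):
    K_cache, V_cache, outputs = [], [], []
    for t in tokens:
        # Only the NEW token's Q, K, V are computed; old K, V come from the cache.
        q, k, v = project(t, 3), project(t, 5), project(t, 11)
        K_cache.append(k)
        V_cache.append(v)
        outputs.append(attn(q, K_cache, V_cache))
    return outputs

def decode_recompute(tokens):
    """Naive version: rebuilds all K, V for every step (what the cache avoids)."""
    outputs = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]
        K = [project(t, 5) for t in prefix]
        V = [project(t, 11) for t in prefix]
        outputs.append(attn(project(prefix[-1], 3), K, V))
    return outputs

toks = [3, 1, 4, 1, 5]
assert decode_with_cache(toks) == decode_recompute(toks)
```

The cached version does O(1) projection work per new token; the naive version redoes O(n) projections per step, which is the redundancy the KV cache eliminates.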
| Model | KV cache per token | 128K context KV cache | Full 128K naive (no cache) |
|---|---|---|---|
| Llama 3 70B (FP16) | ~1.3 MB | ~160 GB | Computationally infeasible |
| Llama 3 8B (FP16) | ~0.25 MB | ~32 GB | Infeasible for real-time |
| Mistral 7B (GQA) | ~0.06 MB | ~8 GB | Still too slow without cache |
For long-context models, the KV cache -- not the model weights -- dominates GPU memory. A Llama 3 70B model with 128K context needs ~160 GB just for the KV cache, on top of ~140 GB for the model weights. This is why providers limit effective context lengths and why techniques like GQA (fewer KV heads) and KV cache quantization are critical.
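The per-token cache size follows directly from the model config. This is a sketch using illustrative values for a Llama-3-8B-class config; exact figures depend on the model card, the cache dtype, and whether GQA is accounted for, which is why published estimates vary:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV cache: K and V (factor of 2) for every layer and KV head.
    With GQA, n_kv_heads is much smaller than the number of query heads --
    which is exactly how GQA shrinks the cache."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_tok = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_tok / 2**20, "MB per token")                  # 0.125 MB
print(per_tok * 128_000 / 2**30, "GB at 128K context")
```

Halving `dtype_bytes` (INT8 cache quantization) or `n_kv_heads` halves the cache, which is why those two knobs dominate the techniques listed below.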
- Prompt caching (Anthropic, OpenAI): reuses KV cache from repeated system prompts, cutting prefill time and cost by ~90%
- PagedAttention (vLLM): manages KV cache like virtual memory pages, eliminating fragmentation and enabling efficient batching
- KV cache quantization: storing cached K, V in lower precision (INT8, INT4) to reduce memory 2-4x
- Sliding window attention (Mistral): only caches the last N tokens, bounding memory at the cost of limited long-range attention
Sampling Strategies
After computing logits, the model needs to choose the next token. The sampling strategy determines how this choice is made, and it profoundly affects output quality. The three key parameters are temperature, top-p (nucleus sampling), and top-k.
| Parameter | Range | Effect | Recommended for |
|---|---|---|---|
| temperature | 0.0 - 2.0 | Scales logits before softmax. Lower = more deterministic | 0.0-0.3 for factual tasks, 0.7-1.0 for creative tasks |
| top_p | 0.0 - 1.0 | Only sample from tokens whose cumulative prob >= p | 0.9-0.95 for most tasks, 1.0 to disable |
| top_k | 1 - vocab_size | Only consider the top k most likely tokens | 20-50 typical, rarely needed with top_p |
| stop sequences | list of strings | Stop generation when any of these strings appear | Always set for structured output |
For deterministic, factual tasks (classification, extraction, code generation): use temperature=0. For creative tasks (writing, brainstorming): use temperature=0.7-1.0 with top_p=0.95. Never set both temperature and top_p to extreme values simultaneously. Always set stop sequences when you need the output to end at a specific point.
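A minimal sampler combining the three parameters might look like the sketch below. The filtering order (temperature, then top-k, then top-p) follows common implementations, though libraries differ in the details:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, top_k=0):
    """Apply temperature scaling, top-k, then top-p filtering, then sample."""
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Rank tokens by probability, highest first.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    # top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized kept set.
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

For example, `sample_next(logits, temperature=0.0)` reproduces greedy decoding, and a very small `top_p` collapses the kept set to the single most likely token.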
Batching and Throughput
LLM providers don't run one request at a time. They batch multiple requests together, processing them simultaneously on GPUs. This is where the economics of LLM serving get interesting: batching amortizes the cost of loading model weights from memory, which is the actual bottleneck (not computation) for most inference workloads.
- Static batching: group requests of similar length, process together. Simple but wasteful -- short requests wait for long ones
- Continuous batching (Orca): dynamically add new requests and retire finished ones without waiting. Used by vLLM, TGI, and most production systems
- Memory-bound vs compute-bound: for small batches, inference is memory-bandwidth limited (loading weights). For large batches, it becomes compute-limited
- Optimal batch size depends on model size, sequence length, and GPU memory. Typical: 8-64 concurrent requests
- Prefill and decode can interfere: a long prefill can block decoding for other requests. Some systems separate prefill and decode into different GPU pools
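The static-vs-continuous difference is easy to see in a toy scheduler. In this sketch each "step" stands in for one decode iteration across the batch; finished requests free their slot immediately for waiting ones, which is the whole idea of continuous batching:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy scheduler: each step decodes one token for every active request,
    retires finished requests, and immediately admits waiting ones.
    `requests` maps request id -> number of output tokens needed."""
    waiting = deque(requests.items())
    active = {}       # id -> tokens remaining
    finish_step = {}  # id -> step at which it completed
    step = 0
    while waiting or active:
        # Admit waiting requests as soon as a slot is free (no batch barrier).
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        for rid in list(active):  # one decode step for every active request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # retire immediately, freeing the slot
                finish_step[rid] = step
    return finish_step

# The short request "b" finishes at step 1 and frees a slot for "e" right
# away, instead of "e" waiting for the whole batch as in static batching.
print(continuous_batching({"a": 5, "b": 1, "c": 3, "d": 5, "e": 2}))
```

A static batcher would hold "e" until the entire first batch of four drained; here it starts on step 2 and finishes on step 3.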
Speculative Decoding
Speculative decoding is a technique where a small 'draft' model generates several candidate tokens, then the large model verifies them all in a single forward pass. If the draft model guessed correctly (which it does ~70-80% of the time for predictable text), you get multiple tokens from one large-model pass. This can speed up inference 2-3x with no quality loss. Used by Anthropic, Google, and available in vLLM for open models.
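A greedy-variant sketch of one speculative round is below. Real systems verify the whole draft in a single batched target forward pass and use probabilistic acceptance; this toy verifies token by token with stand-in next-token functions:

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One round of (greedy) speculative decoding: the draft proposes k
    tokens; the target keeps the longest correct prefix plus one token
    of its own, so even total disagreement still yields one token."""
    # Draft phase: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase (conceptually one large-model pass over the draft).
    accepted, ctx = [], list(context)
    for t in draft:
        if target_next(ctx) == t:      # draft guessed what the target wanted
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target always contributes one token
    return accepted

# Toy "models": the target continues with last+1; the draft agrees only
# when the last token is odd.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 2 else ctx[-1] + 2

print(speculative_step([1], draft, target, k=4))
```

When the draft agrees often, each round emits several tokens for one target pass; when it never agrees, the round degrades gracefully to ordinary one-token decoding.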
What This Means for Your Application
Understanding the inference pipeline helps you make better architectural decisions. Here are the practical takeaways for building LLM-powered applications.
- Streaming is essential for UX: output tokens arrive one at a time, so stream them to the user rather than waiting for completion
- Input is cheap, output is expensive: prefill is parallel and fast; generation is sequential and slow. Minimize output tokens where possible
- Long prompts with short outputs are the ideal workload -- summarization, classification, extraction
- Long outputs with short prompts are the worst workload -- creative writing, long code generation
- Prompt caching cuts cost by up to 90% for repeated system prompts -- use it when your system prompt is stable
For interactive applications, aim for <500ms time-to-first-token (TTFT). For batch processing, optimize for throughput (tokens/second) instead. These are fundamentally different optimization targets -- the same model configuration can't optimize for both.
Best Practices
Do
- Stream responses in user-facing applications -- never make users wait for full completion
- Use temperature=0 for deterministic tasks (classification, extraction, structured output)
- Set explicit stop sequences to prevent runaway generation
- Monitor token usage per request to catch unexpected cost spikes
- Use prompt caching for stable system prompts that repeat across requests
Don’t
- Don't set temperature > 0 for tasks requiring consistent, reproducible output
- Don't ignore the prefill vs decode cost difference -- they have different latency profiles
- Don't assume all providers have the same inference performance for the same model size
- Don't use top_k and top_p simultaneously unless you understand their interaction
- Don't send long prompts expecting immediate responses -- prefill time scales with input length
Key Takeaways
- LLM inference has two phases: prefill (fast, parallel input processing) and decode (slow, sequential output generation).
- The KV cache stores previous attention computations, making each new token generation O(n) instead of O(n^2).
- Temperature, top-p, and top-k control how tokens are sampled from the probability distribution -- choose based on your task.
- Continuous batching and speculative decoding are the key techniques providers use to maximize throughput.
- Understanding the pipeline helps you optimize for the right metric: TTFT for interactive apps, throughput for batch jobs.