Intermediate13 min

The Inference Pipeline

Two fundamentally different phases explain every latency, cost, and quality trade-off you face when calling an LLM API. Understanding them tells you what to optimize — and stops you from optimizing the wrong thing.

Quick Reference

→Prefill: all input tokens processed in parallel — fast, GPU-efficient. Decode: one token per pass — slow, sequential.
→KV cache stores attention keys/values from previous tokens so each decode step is O(n), not O(n²).
→Memory formula: 2 × layers × kv_heads × head_dim × seq_len × bytes_per_value (GQA reduces kv_heads significantly).
→Output tokens cost 5× more than input tokens (Sonnet 4.6: $3/MTok in, $15/MTok out) — minimize output length, not prompt length.
→Prompt caching drops repeated input cost by 90% — cache reads cost $0.30/MTok vs $3.00/MTok standard (Sonnet 4.6, April 2026).
→temperature=0 is greedy decoding — reproducible within a provider version, not guaranteed identical across model updates.
→top-k and top-p filter the sampling pool after temperature scaling; setting both is usually redundant.
→TTFT and throughput are different optimization targets — streaming solves the first, prompt caching solves the second.

In this article

1.Two Phases: Why Your First Token Is Fast and the Rest Are Slow
2.KV Cache: The Memory You're Paying For
3.Sampling: The Only Knobs You Actually Control
4.Cost and Latency: The Math That Matters
5.Behind the API: Batching, Speculation, and What Affects Your Latency
★Best Practices
✓Key Takeaways

Two Phases: Why Your First Token Is Fast and the Rest Are Slow

When you call an LLM API, two fundamentally different computations happen back to back. In the prefill phase, all your input tokens are processed in one parallel forward pass — GPUs are designed for exactly this kind of matrix multiplication, so a 1,000-token prompt completes prefill in roughly the same time as a 100-token prompt (not 10× slower). Then the first output token arrives, and everything changes.

Prefill is fast (parallel). Decode is slow (sequential). TTFT marks the switch.

In the decode phase, the model generates one token at a time. Each new token attends to all previous tokens — input and output — so generation is inherently sequential: token N cannot be computed until token N-1 is done. A 200-token response requires 200 sequential decode steps. This is why time-to-first-token (TTFT) and generation throughput are measured separately: they're different operations with different bottlenecks, and optimizing one does nothing for the other.

Typical latency on Claude Sonnet 4.6

A 1,000-token prompt typically completes prefill in 200–400ms. Subsequent tokens arrive at roughly 40–80ms each (15–25 tokens/second). A 200-token response therefore takes 8–16 seconds total — most of which is decode, not prefill. This is why streaming is essential: without it, a user stares at a blank screen for 10+ seconds.

The design implication

Input is cheap in both cost and latency. Output is expensive in both. This inverts most engineers' instincts: prefer long, detailed prompts over vague prompts that require long explanations back. Use stop sequences to terminate generation the moment you have what you need. Design for structured, brief outputs rather than prose.

KV Cache: The Memory You're Paying For

During the decode phase, each new token attends to all previous tokens through self-attention. Without caching, generating token 100 would require recomputing attention across tokens 1–99 — that's O(n²) work per step. The KV cache eliminates this: after prefill, the model stores the Key and Value matrices computed for every input token. Each decode step reads those cached values and only computes Q, K, V for the single new token.

KV cache memory per token (simplified formula for full attention): memory = 2 × num_layers × num_kv_heads × head_dim × bytes_per_value Example with Llama 3 70B (80 layers, 8 GQA heads, head_dim=128, FP16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 0.32 MB per token At 128K context: 0.32 MB × 128,000 ≈ 40 GB — just for the cache, on top of ~140 GB for the model weights. The '2' is for K and V (not Q). The 8 KV heads vs 64 Q heads is GQA: by sharing KV heads across multiple query heads, the cache shrinks by 8× compared to a naive multi-head attention model.

KV cache is the real memory ceiling

For long-context models, the KV cache — not the weights — dictates how many concurrent requests fit in GPU memory. A single Llama 3 70B request at 128K context needs ~40 GB of KV cache. With 80 GB VRAM (H100 SXM5), that leaves room for two concurrent full-context requests. This is why providers charge more for long-context inputs and why KV cache quantization (INT8, INT4) is critical: halving cache precision roughly doubles the number of concurrent sessions.

▸Prompt caching (Anthropic, OpenAI): reuses KV cache from identical prompt prefixes across requests — the first call pays full prefill cost, all subsequent calls skip it. Anthropic charges $3.75/MTok to write a cache entry (5-min TTL) and $0.30/MTok to read it.
▸GQA (Grouped Query Attention): multiple query heads share one KV head — Llama 3 70B uses 8 KV heads vs 64 Q heads, cutting cache size 8× vs standard multi-head attention.
▸KV cache quantization: storing K and V in INT8 or INT4 halves or quarters memory with minimal quality impact — standard in production serving frameworks.
▸Chunked prefill (2024+): breaks long prompts into smaller chunks interleaved with decode steps for other requests, preventing one large prompt from blocking the entire batch.
▸Sliding window attention: only caches the last N tokens — bounds memory at the cost of no cross-attention beyond the window. Used in older Mistral models; less common now.

Sampling: The Only Knobs You Actually Control

After computing logits — one score per vocabulary token — the model must select the next token. Temperature, top-k, and top-p apply in sequence, each narrowing the candidate pool before the final random draw.

Sampling applies three filters in sequence — each stage narrows the candidate pool

Parameter	What it does	Use when	Avoid when
temperature	Divides logits before softmax. <1 sharpens (high-prob tokens dominate). >1 flattens (low-prob tokens get more weight). 0 = greedy: always picks highest-prob token.	0 for extraction, classification, structured output. 0.7–1.0 for writing, brainstorming.	>1.5 — outputs become incoherent. Don't combine with tight top-p (redundant).
top_p	After temperature, discard tokens until only the top P% of probability mass remains. p=0.9 cuts any token outside the top-90% cumulative.	0.9–0.95 as a backstop against garbage tokens. 1.0 to disable.	Combined with temperature=0 (sampling never runs; top_p is irrelevant).
top_k	Keep only the k highest-probability tokens. Applied after temperature, before top_p.	Rarely needed alongside top_p. Useful for very constrained outputs (k=5 for yes/no).	k < 5 on open-ended generation — causes repetition loops.
max_tokens	Hard cap on output length. Generation stops at this limit regardless of content.	Always. Set this on every request.	Leaving it unset — models can generate 8K+ tokens unexpectedly.

temperature=0 is not truly deterministic

Greedy decoding picks the highest-probability token at each step — which is reproducible within a single provider deployment. But floating-point arithmetic on different hardware can break probability ties differently. Batch configuration (how many requests run concurrently) can shift outputs. And when a provider updates a model, the underlying distribution changes. Never write snapshot tests that compare LLM outputs by exact string equality and expect them to pass across deploys.

Real project

A fintech team built nightly regression tests for their document extraction pipeline using temperature=0 output snapshots. When Anthropic updated Claude Sonnet in late 2024, the model's preferred formatting for monetary amounts changed subtly — 'USD 1,234.56' became '$1,234.56'. All their downstream regex extractors continued working, but 30% of snapshot tests broke. Two days of debugging later, they realized the model — not their code — had changed. The fix: test for semantic correctness (does the extracted value match the source?), not exact string output.

Streaming response with explicit sampling parameters (Anthropic SDK)

Cost and Latency: The Math That Matters

The most common mismatch between how engineers think about LLM costs and how billing actually works: output tokens cost 5× more than input tokens. Sending a longer, more detailed prompt is almost always cheaper than receiving a longer response. Engineers who optimize prompt brevity instead of output brevity are optimizing the wrong side.

Token type	Sonnet 4.6 price ($/MTok)	1,000 req × 500 tokens
Input (standard)	$3.00	$1.50
Output	$15.00	$7.50
Cache write (5-min TTL)	$3.75	$1.88 — first call only
Cache write (1-hour TTL)	$6.00	$3.00 — first call only
Cache read	$0.30	$0.15 — every subsequent call

Prompt caching ROI example: your system prompt is 2,000 tokens and you receive 1,000 requests/day with steady traffic. Standard cost for that system prompt: 2,000 × 1,000 × $3.00/MTok = $6.00/day. With 5-minute TTL caching, each cache write covers all requests in a 5-min window (~70 requests at 1,000/day). Write cost per window: 2,000 × $3.75/MTok = $0.0075. Read cost per request: 2,000 × $0.30/MTok = $0.0006. Break-even is 2 cache reads per write (the write is 1.25×, so 2 reads at 0.1× = 0.2×, total 1.45× vs 2.0× standard). At 70 reads per write, prompt caching saves ~$5.90/day (98%).

When to use 5-min vs 1-hour TTL

The 5-minute TTL (write cost: $3.75/MTok) is optimal for real-time applications with consistent traffic — you pay the write overhead once per 5-min window and read cheaply for all requests in that window. The 1-hour TTL (write cost: $6.00/MTok) is better for batch workloads with sparser traffic: fewer cache writes per hour offsets the higher write cost. If your traffic is too sparse to hit the cache more than once per 5 minutes, skip caching entirely — you'll pay the write premium without capturing reads.

TTFT vs throughput: choose your target before you optimize

Time-to-first-token (TTFT) determines perceived responsiveness in interactive apps — users notice anything above 500ms. Throughput (tokens/second or requests/second) determines batch job cost and speed. These require opposite approaches: for TTFT, stream responses, use smaller models, and minimize input token count. For throughput, maximize prompt caching, batch requests, and accept higher per-request latency in exchange for lower cost. Configuring a system optimized for one and measuring the other will always look like a failure.

Behind the API: Batching, Speculation, and What Affects Your Latency

API providers run your requests alongside thousands of others on shared GPU clusters. You don't control this — but understanding it explains the latency patterns you'll observe, especially the gap between median and 99th-percentile response times.

▸Continuous batching (Orca paper, 2022): instead of waiting for all batch requests to finish before starting new ones, requests join and leave the batch mid-flight. This maximizes GPU utilization but means a short request can queue behind a long prefill from another request.
▸Chunked prefill (2024+): long prompts are segmented and processed alongside decode steps from other requests — prevents a single 128K-token prefill from stalling everyone else in the batch.
▸Memory-bandwidth bound at small batch sizes: for small batches, inference speed is limited by how fast model weights are loaded from GPU memory, not by computation. This is why your request latency doesn't scale linearly with model size.
▸Speculative decoding: a small draft model proposes several tokens; the large model verifies them all in one forward pass. For predictable text, acceptance rates are 70–80%, yielding 2–3× faster generation at identical quality. For unusual or highly specialized text, acceptance rates drop to 40–60%. Now standard in vLLM, SGLang, and TensorRT-LLM.
▸Prefill-decode disaggregation: some providers (and self-hosted setups) route prefill and decode to separate GPU pools — prefill to compute-optimized hardware, decode to memory-bandwidth-optimized hardware. Reduces prefill interference on decode latency.

Self-hosting: vLLM vs SGLang

vLLM (PagedAttention) remains the most widely deployed open-source serving framework. SGLang (RadixAttention) has emerged as a strong competitor: it achieves 29% higher throughput on H100s and up to 6× gains on prefix-heavy workloads like RAG or multi-turn chat where many requests share a long common prefix. For new self-hosting setups in 2026, benchmark both on your specific workload — the winner depends heavily on your traffic pattern.

Best Practices

✓Stream every user-facing response — never make users wait for full completion before showing output
✓Use temperature=0 for extraction, classification, structured output, and any task requiring near-reproducible results
✓Set max_tokens explicitly on every request — unset, some models will generate 8K+ tokens
✓Set stop sequences when your output format has a natural termination point (e.g., stop=["</answer>", "\n\n"])
✓Enable prompt caching for system prompts longer than 1,024 tokens that repeat across requests
✓Minimize output token count — design prompts to return short structured answers, not prose explanations
✓Prefer detailed 2,000-token prompts over vague 200-token prompts that produce 500-token explanations
✓Test LLM outputs semantically (does the answer match the source?), not by exact string equality
✓Measure TTFT and throughput separately — they require different optimizations and shouldn't share a single SLA

Don’t

✗Don't assume temperature=0 is guaranteed-identical output across model updates or different hardware
✗Don't set top_p=0.95 when temperature=0 — greedy decoding bypasses sampling; top_p has no effect
✗Don't ignore the input/output token cost asymmetry — output costs 5× more; optimize generation length first
✗Don't build latency SLAs on median TTFT — p99 latency spikes sharply when prefill queues fill under load
✗Don't use top_k < 5 for open-ended generation — the tiny candidate pool causes repetition loops
✗Don't apply prompt caching to prompts that change per-request — caching only reuses identical token prefixes
✗Don't stream responses in batch pipelines — streaming adds overhead when throughput is the goal
✗Don't set temperature > 1.5 — outputs become incoherent; for diversity, use 0.8–1.0 with multiple samples

Key Takeaways

✓LLM inference has two phases: prefill (parallel input processing, fast) and decode (sequential output generation, slow) — optimizing one doesn't help the other.
✓Output tokens cost 5× more than input tokens — a longer, richer prompt is almost always cheaper than a longer response.
✓The KV cache trades GPU memory for O(n) decode cost instead of O(n²) — long context is expensive because the cache is large, not because the math is hard.
✓Prompt caching saves up to 90% on repeated system prompts — at 1,000 requests/day with a 2K-token system prompt, that's ~$5.90/day in savings on Sonnet 4.6.
✓temperature=0 is greedy decoding, not guaranteed deterministic — don't snapshot-test LLM outputs across provider updates.
✓Speculative decoding and continuous batching are now default in production serving frameworks — understanding them explains latency variance, but you don't implement them when using an API.

Video on this topic

What happens when you call ChatGPT's API

tiktok

←

Tokenization Deep Dive

Context Windows & Context Management

→