
Context Windows & Attention

What context windows really mean, why the 'lost in the middle' problem plagues long-context models, how attention patterns change at different positions, and practical strategies for working within context limits.

Quick Reference

  • Context window = maximum number of tokens (input + output combined) the model can process
  • More context is not always better -- models struggle with information in the middle of long inputs
  • GPT-5.4: 128K tokens, Claude Sonnet 4.6: 200K tokens, Gemini 3.1 Pro: 1M tokens
  • Effective recall drops significantly for facts placed in the middle 30-60% of the context
  • Strategies: place critical info at start/end, use retrieval instead of stuffing, summarize intermediate context
  • Input tokens are roughly 4-10x cheaper than output tokens across major providers

What Context Window Actually Means

The context window is the total number of tokens a model can process in a single call -- including both input tokens (your prompt, system message, conversation history) and output tokens (the model's response). When people say GPT-5.4 has a '128K context window,' they mean 128,000 tokens total for input and output combined. This is not 128K words -- it is roughly 96K words for English prose, less for code or non-English text.
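Because the window counts tokens rather than words, it helps to estimate usage before sending a prompt. A minimal sketch using the rough 4-characters-per-token heuristic for English prose (an exact count requires the provider's tokenizer; `fits_in_window` and its default values are illustrative assumptions, not any provider's API):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose (~4 characters per token).

    Real counts require the provider's tokenizer; this heuristic is only
    for budgeting, and code or non-English text tokenizes less efficiently.
    """
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, window: int = 128_000, output_reserve: int = 4_096) -> bool:
    """Check that the prompt leaves room for the response within the window."""
    return estimate_tokens(prompt) + output_reserve <= window
```

Note the `output_reserve`: since input and output share the window, a prompt that "fits" with no room left for the response will produce a truncated answer.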

Model | Context window | Approx. English words | Input price / 1M tokens | Output price / 1M tokens
GPT-5.4 | 128K | ~96K | $2.00 | $8.00
o4-mini | 128K | ~96K | $1.10 | $4.40
Claude Sonnet 4.6 | 200K | ~150K | $3.00 | $15.00
Claude Haiku 4.5 | 200K | ~150K | $1.00 | $5.00
Gemini 3.1 Pro | 1M | ~750K | $1.00 | $10.00
Gemini 3 Flash | 1M | ~750K | Self-hosted pricing varies | Self-hosted pricing varies
Llama 4 (70B) | 128K | ~96K | Self-hosted | Self-hosted

Context window != useful context

Just because a model accepts 200K tokens does not mean it uses all 200K equally well. Research consistently shows that retrieval accuracy degrades with context length, especially for information placed in the middle. A focused 4K-token prompt with the right information often outperforms a 100K-token prompt with everything included.

The 'Lost in the Middle' Problem

In 2023, researchers discovered that LLMs show a U-shaped performance curve over position: they reliably use information at the beginning and end of the context, but accuracy degrades significantly for information placed in the middle. This 'lost in the middle' phenomenon has been confirmed across multiple models and tasks. The implication is stark: you cannot simply dump a large document into the context and expect the model to find the relevant needle.

  • Experiments with 'needle in a haystack' tests show 10-30% accuracy drops for needles placed in the middle third
  • The effect is stronger for longer contexts -- a 4K context might show mild degradation, but 100K shows dramatic drops
  • Recency bias: models tend to weight recent (end of context) information more heavily
  • Primacy effect: the very beginning of the context also gets strong attention (system prompt, first instructions)
  • Different models handle this differently -- Gemini 3.1 Pro with its ring attention shows less middle degradation than many competitors
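The 'needle in a haystack' tests mentioned above work by planting a known fact at varying depths in filler text and measuring recall per depth. A minimal sketch of the prompt construction (the filler and needle strings here are made up; scoring against a real model is left to your eval harness):

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)
    of the filler text, as in needle-in-a-haystack evaluations."""
    pos = round(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

filler = [f"Background sentence number {i}." for i in range(1000)]
needle = "The secret code is AZURE-42."
prompts = {d: build_haystack(filler, needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
# Each prompt is then sent to the model with the question
# "What is the secret code?" and recall is scored per depth.
```

Plotting recall against depth typically reproduces the U-shape: strong at 0.0 and 1.0, weakest in the middle.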
Position your critical information strategically

Place the most important context at the beginning (right after the system prompt) or at the very end (just before the user question). If you are injecting retrieved documents, put the most relevant ones first and last. This simple reordering can improve accuracy by 10-20% on information retrieval tasks.
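One way to implement this reordering for retrieved documents is to alternate placement from the front and back of the list, so the top-ranked chunks land at the edges and the weakest sink to the middle. A sketch (the input is assumed to be ranked best-first, e.g. by retrieval score):

```python
def reorder_for_position(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end of the context,
    pushing the weakest ones into the middle where recall is worst.

    Input is ranked best-first; output alternates front/back placement.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranked best-first: A is most relevant, E least.
print(reorder_for_position(["A", "B", "C", "D", "E"]))  # → ['A', 'C', 'E', 'D', 'B']
```

The two strongest chunks (A and B) end up first and last; the weakest (E) lands in the middle, where degraded recall costs the least.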

How Attention Actually Distributes

Attention is not uniformly distributed. Different heads in different layers attend to different parts of the input, and the patterns vary by task. Understanding these patterns helps explain model behavior and informs prompt design.

  • Attention sinks: some attention heads strongly attend to the first token regardless of content, using it as a 'default' attention target
  • Local attention: many heads focus on nearby tokens (within 50-200 positions), capturing local syntax and grammar
  • Retrieval heads: specific attention heads specialize in long-range retrieval, attending to semantically relevant distant tokens
  • Instruction heads: certain heads attend strongly to instruction tokens in the system prompt, connecting them to the task
  • The sparse attention pattern means most token pairs have near-zero attention -- the model effectively ignores most of the context
Attention sink phenomenon

StreamingLLM research showed that the first few tokens receive disproportionate attention across almost all heads and layers, regardless of their content. This is why some implementations keep the first few tokens in a 'sink' buffer even when using sliding window attention. It also suggests that the very first tokens of your prompt carry outsized influence on model behavior.
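The sink-buffer idea can be made concrete as an attention mask: each query position may attend to the first few 'sink' tokens plus a sliding window of recent tokens. A minimal boolean-mask sketch (parameter names and values are illustrative, not any library's API):

```python
def attention_mask(seq_len: int, window: int, n_sinks: int) -> list[list[bool]]:
    """Causal attention mask combining a sliding window with 'sink' tokens:
    every query may attend to the first n_sinks tokens and to the most
    recent `window` tokens, mirroring the StreamingLLM layout."""
    mask = []
    for q in range(seq_len):
        # k <= q enforces causality; sinks and the recent window are allowed.
        row = [k <= q and (k < n_sinks or q - window < k) for k in range(seq_len)]
        mask.append(row)
    return mask

m = attention_mask(seq_len=8, window=3, n_sinks=2)
```

For the last query position, only tokens 0-1 (sinks) and 5-7 (window) are visible; everything in between is masked out, which is exactly the memory saving sliding-window attention buys.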

Strategies for Working With Limited Context

Even with 200K-token context windows, you will frequently need strategies to work within context limits. More importantly, you often should use less context than you can, because quality degrades with length.

Strategy | When to use | Trade-off
Retrieval (RAG) | Large knowledge bases, dynamic data | Adds latency for retrieval step, requires embedding pipeline
Summarization | Long conversation histories, documents | Loses detail, requires extra LLM call
Chunking + map-reduce | Processing very long documents | Multiple API calls, higher cost
Sliding window | Chat applications with long histories | Loses early context, may miss important details
Hierarchical context | Complex multi-step tasks | More complex prompt engineering

Practical context management for a chat application
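A minimal sketch of such management, combining the sliding-window and summarization strategies from the table: recent turns are kept verbatim, older turns are folded into a single summary message. The `summarize` callable is a hypothetical stand-in for an extra LLM call, and the 4-chars-per-token heuristic is an assumption for budgeting only.

```python
def manage_history(messages: list[dict], max_tokens: int, summarize) -> list[dict]:
    """Keep recent turns verbatim; fold older turns into one summary message.

    `summarize` is a stand-in for an extra LLM call that condenses text.
    Token counts use a rough 4-characters-per-token heuristic.
    """
    def tokens(msg):
        return len(msg["content"]) // 4 + 4  # +4 rough per-message overhead

    total, keep = 0, []
    for msg in reversed(messages):  # walk newest-first until the budget is spent
        if total + tokens(msg) > max_tokens:
            break
        keep.append(msg)
        total += tokens(msg)
    keep.reverse()

    older = messages[: len(messages) - len(keep)]
    if older:
        summary = summarize(" ".join(m["content"] for m in older))
        keep.insert(0, {"role": "system",
                        "content": f"Earlier conversation summary: {summary}"})
    return keep
```

This keeps the most recent (and most attended-to) turns intact while preserving a compressed trace of earlier context, instead of silently dropping it.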
The context budget approach

Treat your context window like a budget. Allocate fixed portions: 10-15% for system prompt, 20-30% for retrieved context, 30-40% for conversation history, and 20-30% reserved for output. Track actual usage and alert when any category exceeds its budget. This prevents the common failure mode of context silently overflowing.
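The budget check described above can be sketched in a few lines. The window size and the exact fractions are illustrative values taken from the ranges in the paragraph; tune them to your application:

```python
WINDOW = 200_000  # e.g. a 200K-token model

BUDGET = {                      # fractions from the guidance above
    "system_prompt": 0.10,
    "retrieved_context": 0.25,
    "history": 0.35,
    "output_reserve": 0.30,
}

def check_budget(usage: dict[str, int]) -> list[str]:
    """Return a warning for each category exceeding its allocated share."""
    return [
        f"{name}: {used} tokens exceeds budget of {int(BUDGET[name] * WINDOW)}"
        for name, used in usage.items()
        if used > BUDGET[name] * WINDOW
    ]

warnings = check_budget({"system_prompt": 1_500,
                         "retrieved_context": 80_000,
                         "history": 40_000})
```

Here the retrieved context (80K tokens against a 50K allocation) trips the alert before it can silently crowd out history or the output reserve.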

How Much Context Is Actually Useful?

The marketing numbers (200K, 1M tokens) are impressive, but in practice the relationship between context size and output quality is not linear. Here is what research and production experience show.

  • For most tasks, the sweet spot is 2K-8K tokens of focused, relevant context -- beyond this, diminishing returns set in rapidly
  • RAG retrieval: 3-5 highly relevant chunks (2K-4K tokens) typically outperforms 20+ marginally relevant chunks (20K+ tokens)
  • Conversation history: beyond ~20 turns, quality often degrades. Summarize earlier history instead of preserving it verbatim
  • Code context: providing the full file is often better than snippets, because the model needs structural understanding. But providing 50 files rarely helps
  • Long-form analysis (legal documents, research papers): models can handle 50K-100K tokens for extraction tasks but struggle with synthesis across the full context
Test your actual context usage

Run an experiment: take your production prompts and measure quality at 25%, 50%, 75%, and 100% of your current context usage. You may find that 50% of the context produces 95% of the quality -- saving significant cost and latency. The cheapest optimization is sending less context.
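A sketch of how to build the truncated variants for that experiment, assuming your context is already split into chunks ranked most-relevant-first (scoring each variant against your quality metric is left to your eval harness):

```python
def ablate_context(context_chunks: list[str],
                   fractions=(0.25, 0.5, 0.75, 1.0)) -> dict[float, str]:
    """Build truncated context variants for a quality-vs-size experiment.

    Chunks should be ranked most-relevant-first so each truncation keeps
    the best material; run each variant through your prompt and compare
    output quality (and cost/latency) across fractions.
    """
    variants = {}
    for frac in fractions:
        n = max(1, round(frac * len(context_chunks)))
        variants[frac] = "\n\n".join(context_chunks[:n])
    return variants

v = ablate_context([f"chunk {i}" for i in range(8)])
```

If the 0.5 variant scores nearly as well as 1.0, you have found free cost and latency savings.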

Context window ≠ memory

LLMs have no persistent memory. Every API call starts fresh. The context window is working memory, not long-term storage. If your application needs to 'remember' things across conversations, you must implement that yourself through databases, vector stores, or conversation summaries.

Best Practices

Do

  • Place critical information at the beginning and end of the context, not the middle
  • Budget your context window: allocate fixed portions for system prompt, retrieved context, history, and output
  • Measure quality at different context sizes -- you likely need less context than you think
  • Use retrieval (RAG) for large knowledge bases instead of stuffing everything into context
  • Always reserve tokens for output -- forgetting this causes truncated responses

Don't

  • Don't assume longer context always means better results -- it often means worse results
  • Don't dump entire documents into context when only specific sections are relevant
  • Don't preserve full conversation history indefinitely -- summarize or truncate older messages
  • Don't confuse context window size with the model's ability to reason over that context
  • Don't rely on the model to find a needle in a haystack -- use retrieval to pre-filter

Key Takeaways

  • Context window is total input + output tokens, and marketing numbers overstate practical utility.
  • The 'lost in the middle' problem means models reliably attend to the beginning and end but struggle with middle content.
  • For most tasks, 2K-8K tokens of focused context outperforms 50K+ tokens of unfocused context.
  • Treat context as a budget: allocate portions for system prompt, retrieved data, history, and output reserve.
  • LLMs have no persistent memory -- the context window is working memory that resets on every call.

Video on this topic

Context windows explained: what 200K tokens really means
