Intermediate12 min

Context Windows & Context Management

Context windows in 2026 are large -- the problem is no longer size, it's quality degradation. Learn what every major model actually offers, what the 'lost in the middle' problem means for your production prompts, how to do real cost math, and how prompt caching cuts your context spend by up to 90%.

Quick Reference

→Context window = total tokens (input + output combined) the model can process in one call
→Claude Opus 4.7 / Sonnet 4.6: 1M tokens. GPT-5.4: ~1.05M. Gemini 2.5 Pro/Flash: 1M. Haiku 4.5: 200K
→The 'lost in the middle' problem persists in 2026 -- every frontier model still degrades for information in the middle of long contexts
→Prompt caching reduces input cost by up to 90% (Claude: 0.1x base price on cache hits) -- the biggest context cost lever
→GPT-5.4 doubles input pricing above 272K tokens per session -- long-context has a cost cliff
→Treat context like a budget: allocate fixed portions for system prompt, retrieved docs, history, and output reserve
→Server-side compaction (Claude) and RAG are the two recommended approaches for conversations that grow beyond your window

In this article

1.The 2026 Context Window Landscape
2.Bigger Windows, Worse Results: The Degradation Problem
3.The Cost of Context: Real Numbers
4.Prompt Caching: The 90% Discount
5.Context Management Strategies That Actually Work
6.The Context Budget Framework
7.How Much Context Is Actually Useful?
★Best Practices
✓Key Takeaways

The 2026 Context Window Landscape

The context window is the total number of tokens a model can process in a single API call -- including both input (system prompt, conversation history, retrieved documents) and output (the model's response). When you see 'Claude Sonnet 4.6 has a 1M context window,' that means 1,000,000 tokens total for everything in and out of that call. One million tokens is roughly 750,000 English words -- more than two copies of War and Peace.

Model	Context window	Input price / 1M tokens	Output price / 1M tokens	Cached input
Claude Opus 4.7	1M	$5.00	$25.00	$0.50 (10%)
Claude Sonnet 4.6	1M	$3.00	$15.00	$0.30 (10%)
Claude Haiku 4.5	200K	$1.00	$5.00	$0.10 (10%)
GPT-5.4	~1.05M	$2.50*	$15.00*	$1.25 (50%)
Gemini 3.1 Pro	1M	$2.00 (≤200K) / $4.00	$12.00 / $18.00	Context caching available
Gemini 3 Flash	1M	$0.50	$3.00	Context caching available
Llama 4 Maverick	1M	Self-hosted	Self-hosted	—
Llama 4 Scout	10M	Self-hosted	Self-hosted	—
DeepSeek V3	164K	$0.014	$0.028	—

GPT-5.4 has a context surcharge above 272K tokens

GPT-5.4 doubles input pricing for sessions that exceed 272K input tokens. A 300K-token request costs 2x the listed input rate for the entire session, not just the overflow. Budget accordingly if you're targeting long-context tasks with OpenAI.

Claude 4.5+ models are context-aware

Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 actively track their remaining token budget throughout a conversation. The model receives explicit token budget updates after each tool call, so it can plan work rather than guessing how much space remains. This is particularly valuable for long-running agentic sessions.

Bigger Windows, Worse Results: The Degradation Problem

The marketing numbers (1M, 10M tokens) are real. The useful capacity is not. In 2023, researchers discovered that LLMs have a U-shaped attention pattern: they reliably attend to information at the beginning and end of the context, but performance degrades for information placed in the middle. In 2025, Chroma tested 18 frontier models including GPT-5.4, Claude Opus 4, and Gemini 2.5 Pro. Every single model still showed performance degradation as input length increased. Claude models degraded the slowest. No model was immune.

▸NIAH (Needle in a Haystack) tests: modern models often score near 100% -- but NIAH is considered too easy. It only tests single-fact retrieval.
▸RULER benchmark (more rigorous): tests multi-hop reasoning and aggregation across long contexts. Almost all models fail to maintain performance at longer lengths even when they ace NIAH.
▸The primacy effect: the very beginning of the context (system prompt, first instructions) receives disproportionate attention.
▸The recency effect: the end of the context (the most recent message) also gets strong attention.
▸The middle gap: facts or documents placed in the middle 40-70% of a long context are the least reliably retrieved.
▸The effect grows with length: a 4K context shows mild degradation, a 200K context shows substantial drops for mid-context retrieval.

Position critical information at the edges

Place the most important context at the very beginning (right after the system prompt) or at the very end (immediately before the user question). For RAG, put the most relevant retrieved documents first. This simple reordering consistently improves recall for mid-context information -- the exact gain varies by model and task, but the direction is reliable.

Real project

A team building a document Q&A system was dumping 20 retrieved chunks into the context in random order. Their accuracy on questions requiring synthesis across multiple chunks was poor. After reordering so the most semantically relevant chunks appeared first and last, accuracy improved noticeably -- without changing the model, the chunks, or the question. The fix cost zero dollars and took 20 minutes.

Learn this in → Position bias is real and cheap to exploit.

The Cost of Context: Real Numbers

Context costs money proportional to size -- but the split between input and output matters a lot. Input tokens are 3-5x cheaper than output tokens across providers. And cached input tokens are up to 10x cheaper than fresh input tokens. Here's what a realistic 50K-token RAG request actually costs.

Assume: 45K input tokens (system prompt + 10 retrieved chunks + conversation history) and 5K output tokens (model response). This is a realistic mid-sized production request.

Provider	Input cost (45K uncached)	Output cost (5K)	Total uncached	Total with caching
Claude Sonnet 4.6	$0.135	$0.075	$0.210	$0.013 + $0.075 = $0.088
GPT-5.4	$0.113	$0.075	$0.188	$0.056 + $0.075 = $0.131
Gemini 3.1 Pro (≤200K)	$0.090	$0.060	$0.150	Context caching varies
Gemini 3 Flash	$0.023	$0.015	$0.038	Context caching varies
DeepSeek V3	$0.0006	$0.0001	$0.0007	Ultra-cheap, self-hosted

The 90/10 rule for context cost

If your system prompt and retrieved documents are stable across requests (common in RAG and agent systems), caching them can cut your input cost by 90%. On Claude, a cache hit costs 0.1x the base input rate. On a system doing 10,000 requests/day with a 30K-token shared prefix, this is the difference between $45/day and $4.50/day in input costs alone. Implement caching before you optimize anything else.

Prompt Caching: The 90% Discount

Prompt caching lets you reuse previously processed context across API calls. Instead of the model re-processing your 30K-token system prompt and knowledge base on every request, it reads from a server-side cache at a fraction of the cost. Cache hits are billed at 10% of standard input rates on Anthropic; at 50% on OpenAI (auto-applied); and through explicit context caching on Gemini.

Provider	How to enable	Cache hit cost	Cache TTL options	Max cache breakpoints
Claude (Anthropic)	Add cache_control to content blocks	0.1x base input (10%)	5 min (1.25x write) or 1 hour (2x write)	4 per request
OpenAI	Automatic on shared prefixes >1024 tokens	0.5x base input (50%)	Automatic (hours)	Automatic
Gemini	Explicit context caching API	Varies	Configurable	Explicit

Anthropic prompt caching: cache the system prompt and knowledge base, keep the user message dynamic

Structure your prompt so stable content comes first

Caching works on prefixes. Put your system instructions and knowledge base before any dynamic content (user messages, per-request context). If you inject user-specific data into the system prompt, put it after the stable sections so the stable prefix stays cacheable.

1-hour cache TTL costs 2x write vs. 5-minute TTL

Claude's 5-minute cache write costs 1.25x base input rate; the 1-hour TTL costs 2x. A cache write pays off after one hit for the 5-minute option, or two hits for the 1-hour option. Use 1-hour for high-value, infrequently-requested content (rare but expensive queries). Use 5-minute for busy endpoints where the prefix is hit often.

Context Management Strategies That Actually Work

The question is not 'how do I fit more in context?' -- it is 'what should be in context at all?' More context is not free. It costs money, it takes latency, and past a point it hurts quality. Here are the strategies ordered by when to reach for them.

Strategy	When to use it	Primary cost	Primary benefit
Prompt caching	Stable prefixes hit more than once	1.25x write cost (5-min)	90% reduction on cached tokens
Retrieval (RAG)	Large or dynamic knowledge bases	Retrieval latency + embedding cost	Only relevant content in context
Server-side compaction	Long conversations approaching context limit	Minimal (Anthropic beta)	Conversation continues beyond window
Summarization	Long conversation history you can't drop	One extra LLM call	Preserves gist, frees tokens
Sliding window	Chat where early history rarely matters	Drops early messages	Simple, zero extra cost
Chunking + map-reduce	Full-document analysis (legal, research)	Multiple API calls	Handles any document length

Server-side compaction (Anthropic, 2026)

Anthropic's server-side compaction automatically condenses earlier parts of a conversation when it approaches the context limit. It is the recommended approach for long-running agentic sessions on Claude Opus 4.7, Opus 4.6, and Sonnet 4.6. Unlike manual summarization, it requires no changes to your prompt logic -- add a single parameter and the API handles the rest.

Provider-agnostic context trimming with token counting (Anthropic API example)

Claude now raises a validation error on context overflow

Newer Claude models (starting with Sonnet 3.7) return an error rather than silently truncating when your prompt plus max_tokens exceeds the context window. This is safer than silent truncation -- you know immediately when you're over budget. Count your tokens before sending with client.messages.count_tokens().

The Context Budget Framework

Treat your context window as a budget with fixed allocations for each category. Silent overflow -- where the model drops content to fit -- is a silent quality bug. Explicit budgeting prevents it.

Budget your context window: system prompt is fixed, history grows, available space shrinks

Category	Typical allocation	Notes
System prompt	5-15%	Fixed. Cache this. It should rarely change.
Retrieved context (RAG)	20-40%	Variable per request. Rank and truncate before injection.
Conversation history	20-35%	Grows over time. Summarize or trim aggressively.
Tool definitions + results	5-15%	Often underestimated. Count tools in your budget.
Output reserve	15-25%	Never skip this. Forgetting it causes truncated responses.

Alert when any budget category exceeds its allocation

Track actual token usage per category in your logging. When retrieved context exceeds 40% of your budget on a regular basis, your retrieval is returning too much or your chunks are too large. When conversation history exceeds 35%, it is time to summarize or introduce compaction. Treat these as health metrics, not one-time configs.

How Much Context Is Actually Useful?

The 1M-token context window exists. But 'fits in context' and 'reliably understood in context' are different things. Here is what the research and production experience actually shows -- without fabricated percentages.

▸RAG: 3-8 highly relevant chunks (2K-6K tokens) consistently outperforms 30+ marginally relevant chunks (30K+ tokens). More chunks means more noise and more mid-context content that the model under-weights.
▸Conversation history: summarize after roughly 15-20 turns. Beyond this, the model reliably attends to recent turns and underweights earlier ones -- so preserving verbatim history wastes tokens you could use for recent context.
▸Code context: providing a full file is often better than isolated snippets because the model needs structural context (imports, class definitions). But providing 50 files rarely helps for a single-function task.
▸Long-form analysis (legal, research): models can reliably extract specific facts from very long documents. They struggle to synthesize insights across the full document from a single pass. For synthesis tasks, break it into chunks.
▸The practical ceiling: diminishing returns set in well below the model's maximum context. The correct ceiling depends on your task -- measure quality at different context sizes rather than assuming more is better.

Run a context size ablation before you ship

Take 20 representative production queries. Run them at 25%, 50%, 75%, and 100% of your current context usage by trimming retrieved chunks. Compare output quality. You will often find that 50% of the context produces 90%+ of the quality -- and halves your cost and latency. This is the cheapest optimization available.

Context window ≠ memory

LLMs have no persistent memory. Every API call starts fresh. The context window is working memory that resets on every call. If your application needs to remember things across conversations, you must implement that yourself -- through a database, vector store, or conversation summary stored externally.

Best Practices

✓Implement prompt caching on any stable prefix longer than a few thousand tokens -- it is the highest-ROI context optimization
✓Place critical information at the beginning and end of the context, not in the middle
✓Use client.messages.count_tokens() (Anthropic) or equivalent before sending to avoid validation errors
✓Budget your context window explicitly: allocate fixed portions for system prompt, retrieved context, history, tool results, and output
✓Use retrieval (RAG) with 3-8 high-quality chunks rather than stuffing the full knowledge base into context
✓Set up server-side compaction or manual summarization for conversations that run longer than 15-20 turns
✓Track token usage per budget category as a health metric in your production logs
✓Run a context size ablation before shipping -- measure quality at 25/50/75/100% of context to find the optimal level

Don’t

✗Don't assume a 1M-token context window means 1M tokens of reliable comprehension -- degradation is real and well-documented
✗Don't ignore the GPT-5.4 context surcharge: input price doubles above 272K tokens per session
✗Don't dump entire documents into context when only specific sections are relevant -- use retrieval to pre-filter
✗Don't preserve full conversation history indefinitely -- summarize or trim after 15-20 turns
✗Don't skip the output token reserve -- context overflow at generation time silently truncates your response
✗Don't structure your cached prefix so dynamic content comes before static content -- caching only works on prefixes
✗Don't confuse context window size with the model's ability to reason over that context at any position
✗Don't pay for re-processing the same static content on every request -- implement caching before any other optimization

Key Takeaways

✓Context window = total input + output tokens. Marketing numbers overstate practical utility: every frontier model still degrades for information in the middle of long contexts.
✓The 2026 landscape: Claude Opus 4.7 / Sonnet 4.6 have 1M tokens. GPT-5.4 has ~1.05M (with a cost cliff above 272K). Gemini 2.5 Pro/Flash have 1M. Haiku 4.5 has 200K.
✓Prompt caching is the highest-ROI context optimization: cache hits cost 10% of standard input rates on Anthropic, 50% on OpenAI. Implement this before anything else.
✓Treat context as a budget: allocate fixed portions for system prompt, retrieved docs, history, tool results, and output reserve. Track overages as a health metric.
✓3-8 high-quality retrieved chunks outperforms 30+ marginally relevant ones. More context is not better -- more relevant context is better.
✓LLMs have no persistent memory. The context window is working memory that resets on every API call -- persistent state requires external storage.

Video on this topic

Context windows explained: what 1M tokens really means in 2026

instagram

←

The Inference Pipeline

Hallucinations: The Engineering Response

→