Advanced18 min

Context Window Management

When to manage context, how context rot degrades agents before you hit any limit, and the full strategy stack — server-side compaction, context editing, trimming, and summarization — with cost math and production failure modes.

Quick Reference

→Context rot: model recall degrades non-linearly past ~60% utilization — more tokens is not always better
→1M context windows are standard on Opus 4.6/4.7 and Sonnet 4.6 at no premium — don't manage what you don't need to
→Server-side compaction (compact_20260112 beta) is Anthropic's recommended approach for long-running conversations
→Context editing (context-management-2025-06-27 beta) clears tool results and thinking blocks server-side — no client state sync needed
→LangMem SummarizationNode is the production-grade LangGraph equivalent of hand-rolled summarization
→trim_messages() triggers on token count, not message count — a single tool result can be 5K tokens
→Compaction adds one extra LLM sampling step; build the cost into your budget before enabling it
→Use the token counting API (free) to measure utilization before every turn — your alerting gate should fire at 60%

When Context Management Matters

The first question is whether you need active context management at all. Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 all include a 1M-token context window at standard pricing — no surcharge. A 1M-token window holds roughly 3,000 pages of text. If your agent runs sessions under 15 minutes with light tool use, it may never approach 60% utilization, and implementing trim_messages or compaction just adds latency and complexity for no benefit.

Signal	Implication
Sessions consistently under 200K tokens	No management needed — monitor and revisit at 400K
Heavy tool use (≥5 search/retrieval calls per session)	Tool results are the primary budget risk — enable context editing
Sessions run >20 minutes or >30 turns	Context rot becomes measurable — enable compaction
Using extended thinking (claude-opus-4-7)	Thinking blocks are stripped automatically by the API — don't manually manage them
Cost is primary constraint, not latency	Monitor input tokens per turn — at $3/MTok, 600K input tokens = $1.80/turn on Sonnet 4.6

Context awareness is built in for Claude 4.5+ models

Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 receive internal token budget tracking: the model is told its total window and receives a remaining-token update after each tool call. This doesn't replace active management, but it means these models won't generate runaway output that blows the limit — they'll self-regulate.

Context Rot: Why Bigger Isn't Better

Anthropic's own documentation explicitly warns: 'more context isn't automatically better. As token count grows, accuracy and recall degrade.' This phenomenon — context rot — means that stuffing a 1M window to capacity produces worse results than a well-curated 300K window. The MRCR v2 benchmark shows Claude Opus 4.6 scoring 78.3% at 1M tokens — impressive, but meaningfully below its performance on shorter contexts. The degradation is non-linear: it's manageable up to ~60% utilization, then accelerates.

Token Budget Allocation

Budget your context window: leave room for the model to reason and respond

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.