Context Window Management
When to manage context, how context rot degrades agents before you hit any limit, and the full strategy stack — server-side compaction, context editing, trimming, and summarization — with cost math and production failure modes.
Quick Reference
- Context rot: model recall degrades non-linearly past ~60% utilization; more tokens is not always better
- 1M context windows are standard on Opus 4.6/4.7 and Sonnet 4.6 at no premium, so don't manage what you don't need to
- Server-side compaction (compact_20260112 beta) is Anthropic's recommended approach for long-running conversations
- Context editing (context-management-2025-06-27 beta) clears tool results and thinking blocks server-side, with no client state sync needed
- LangMem SummarizationNode is the production-grade LangGraph equivalent of hand-rolled summarization
- trim_messages() triggers on token count, not message count; a single tool result can be 5K tokens
- Compaction adds one extra LLM sampling step; build the cost into your budget before enabling it
- Use the token counting API (free) to measure utilization before every turn; your alerting gate should fire at 60%
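The 60% gate in the last bullet can be sketched as a small pure function. This is a minimal illustration, not a definitive implementation: `check_utilization`, `UtilizationCheck`, and the two constants are hypothetical names introduced here, and the token counts are hard-coded where a real agent loop would pass in the figure returned by the token counting API before each turn.

```python
from dataclasses import dataclass

# Assumptions taken from this guide: a 1M-token window and a 60% alert gate.
CONTEXT_WINDOW_TOKENS = 1_000_000
ALERT_THRESHOLD = 0.60


@dataclass(frozen=True)
class UtilizationCheck:
    """Result of one pre-turn measurement (hypothetical helper type)."""
    input_tokens: int
    utilization: float
    should_alert: bool


def check_utilization(
    input_tokens: int,
    window: int = CONTEXT_WINDOW_TOKENS,
    threshold: float = ALERT_THRESHOLD,
) -> UtilizationCheck:
    """Return the fraction of the window in use and whether the gate fires."""
    if input_tokens < 0 or window <= 0:
        raise ValueError("input_tokens must be >= 0 and window must be > 0")
    utilization = input_tokens / window
    return UtilizationCheck(input_tokens, utilization, utilization >= threshold)


if __name__ == "__main__":
    # Hard-coded counts for illustration; in practice these come from the
    # token counting API, measured before each turn.
    for tokens in (150_000, 600_000, 720_000):
        result = check_utilization(tokens)
        print(f"{tokens:>7} tokens -> {result.utilization:.0%} "
              f"alert={result.should_alert}")
```

Keeping the check separate from the API call makes the threshold easy to unit-test and to tune per deployment.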
When Context Management Matters
The first question is whether you need active context management at all. Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 all include a 1M-token context window at standard pricing — no surcharge. A 1M-token window holds roughly 3,000 pages of text. If your agent runs sessions under 15 minutes with light tool use, it may never approach 60% utilization, and implementing trim_messages or compaction just adds latency and complexity for no benefit.
| Signal | Implication |
|---|---|
| Sessions consistently under 200K tokens | No management needed — monitor and revisit at 400K |
| Heavy tool use (≥5 search/retrieval calls per session) | Tool results are the primary budget risk — enable context editing |
| Sessions run >20 minutes or >30 turns | Context rot becomes measurable — enable compaction |
| Using extended thinking (claude-opus-4-7) | Thinking blocks are stripped automatically by the API — don't manually manage them |
| Cost is the primary constraint, not latency | Monitor input tokens per turn: at $3/MTok, 600K input tokens = $1.80/turn on Sonnet 4.6 |
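The table's thresholds and its cost arithmetic can be written down as a short triage helper. This is a sketch under stated assumptions: `SessionProfile`, `recommend`, and `input_cost_per_turn` are hypothetical names, the thresholds are copied from the rows above, and treating "no management needed" as applying only when no other signal fires is an interpretation, not something the table states.

```python
from __future__ import annotations

from dataclasses import dataclass


def input_cost_per_turn(input_tokens: int, usd_per_mtok: float) -> float:
    """Input-side cost of one turn: tokens / 1M * price per million tokens."""
    return input_tokens / 1_000_000 * usd_per_mtok


@dataclass(frozen=True)
class SessionProfile:
    """Per-session metrics (hypothetical container for illustration)."""
    peak_tokens: int
    retrieval_calls: int
    minutes: float
    turns: int


def recommend(profile: SessionProfile) -> list[str]:
    """Map the table's signals to its recommendations.

    An empty list means none of the table's rows apply to this profile.
    """
    actions: list[str] = []
    if profile.retrieval_calls >= 5:
        # Heavy tool use: tool results are the main budget risk.
        actions.append("enable context editing")
    if profile.minutes > 20 or profile.turns > 30:
        # Long sessions: context rot becomes measurable.
        actions.append("enable compaction")
    # Assumption: the "no management" row only applies if nothing else fired.
    if not actions and profile.peak_tokens < 200_000:
        actions.append("no management needed; monitor and revisit at 400K")
    return actions


if __name__ == "__main__":
    short = SessionProfile(peak_tokens=80_000, retrieval_calls=2, minutes=10, turns=12)
    heavy = SessionProfile(peak_tokens=450_000, retrieval_calls=9, minutes=35, turns=40)
    print(recommend(short))
    print(recommend(heavy))
    # The table's example: 600K input tokens at $3/MTok.
    print(f"${input_cost_per_turn(600_000, 3.0):.2f} per turn")
```

Because the cost scales linearly with input tokens, the same function gives the per-turn saving from any strategy that shrinks the prompt: compare the cost before and after trimming or compaction.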
Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 receive internal token budget tracking: the model is told its total window and receives a remaining-token update after each tool call. This doesn't replace active management, but it lets these models pace their output against the remaining budget instead of running into the limit unaware.