Advanced · 18 min

Context Window Management

When to manage context, how context rot degrades agents before you hit any limit, and the full strategy stack — server-side compaction, context editing, trimming, and summarization — with cost math and production failure modes.

Quick Reference

  • Context rot: model recall degrades non-linearly past ~60% utilization — more tokens is not always better
  • 1M context windows are standard on Opus 4.6/4.7 and Sonnet 4.6 at no premium — don't manage what you don't need to
  • Server-side compaction (compact_20260112 beta) is Anthropic's recommended approach for long-running conversations
  • Context editing (context-management-2025-06-27 beta) clears tool results and thinking blocks server-side — no client state sync needed
  • LangMem SummarizationNode is the production-grade LangGraph equivalent of hand-rolled summarization
  • trim_messages() should be budgeted by token count, not message count: a single tool result can be 5K tokens
  • Compaction adds one extra LLM sampling step; build the cost into your budget before enabling it
  • Use the token counting API (free) to measure utilization before every turn — your alerting gate should fire at 60%
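
The 60% alerting gate from the last bullet can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not a definitive implementation: `CONTEXT_WINDOW`, `ALERT_THRESHOLD`, `UtilizationReport`, and `check_utilization` are hypothetical names, and `input_tokens` is assumed to have been measured beforehand with the token counting API.

```python
from dataclasses import dataclass

CONTEXT_WINDOW = 1_000_000   # assumed 1M-token window, per the bullets above
ALERT_THRESHOLD = 0.60       # alerting gate from the quick reference


@dataclass(frozen=True)
class UtilizationReport:
    input_tokens: int
    utilization: float   # fraction of the window in use
    should_alert: bool


def check_utilization(input_tokens: int,
                      window: int = CONTEXT_WINDOW,
                      threshold: float = ALERT_THRESHOLD) -> UtilizationReport:
    """Compare a measured token count against the alert threshold."""
    if input_tokens < 0 or window <= 0:
        raise ValueError("input_tokens must be >= 0 and window must be > 0")
    utilization = input_tokens / window
    return UtilizationReport(input_tokens, utilization, utilization >= threshold)


# input_tokens would normally be measured before each turn with the token
# counting API; the literal value here is illustrative only.
report = check_utilization(650_000)
print(f"{report.utilization:.0%} used, alert={report.should_alert}")
```

Keeping the gate separate from the measurement call means the threshold logic can be unit-tested without network access, and the same function works for any window size.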

When Context Management Matters

The first question is whether you need active context management at all. Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 all include a 1M-token context window at standard pricing, with no long-context surcharge. A 1M-token window holds roughly 3,000 pages of text. If your agent runs sessions under 15 minutes with light tool use, it may never approach 60% utilization (600K tokens), and adding trim_messages or compaction only adds latency and complexity for no benefit.

Signal | Implication
Sessions consistently under 200K tokens | No management needed; monitor and revisit at 400K
Heavy tool use (≥5 search/retrieval calls per session) | Tool results are the primary budget risk; enable context editing
Sessions run >20 minutes or >30 turns | Context rot becomes measurable; enable compaction
Using extended thinking (claude-opus-4-7) | Thinking blocks are stripped automatically by the API; don't manage them manually
Cost is the primary constraint, not latency | Monitor input tokens per turn: at $3/MTok, 600K input tokens = $1.80/turn on Sonnet 4.6

Context awareness is built in for Claude 4.5+ models

Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 receive internal token budget tracking: the model is told its total window and receives a remaining-token update after each tool call. This doesn't replace active management, but it means these models are less likely to generate runaway output that overruns the limit, because they can pace themselves against the remaining budget.
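
The signal table in this section can also be expressed as a small decision helper, which makes the thresholds explicit and testable. This is a sketch under assumptions: `SessionProfile`, `recommend`, and `input_cost_per_turn` are hypothetical names, the thresholds are copied from the table, and the $3/MTok default is the Sonnet 4.6 input price quoted in the table's last row. Treating the "under 200K tokens" row as a fallback that applies only when no other signal fires is an interpretation, not something the table states.

```python
from dataclasses import dataclass


@dataclass
class SessionProfile:
    peak_tokens: int       # largest input size observed in a session
    retrieval_calls: int   # search/retrieval tool calls per session
    minutes: float         # wall-clock session length
    turns: int             # conversation turns per session


def recommend(p: SessionProfile) -> list[str]:
    """Map observed session signals to the strategies named in the table."""
    actions = []
    if p.retrieval_calls >= 5:
        # Tool results are the primary budget risk.
        actions.append("enable context editing")
    if p.minutes > 20 or p.turns > 30:
        # Context rot becomes measurable in long sessions.
        actions.append("enable compaction")
    if not actions and p.peak_tokens < 200_000:
        # Assumption: this row only applies when no other signal fired.
        actions.append("no management needed; monitor and revisit at 400K")
    return actions or ["monitor utilization"]


def input_cost_per_turn(input_tokens: int, usd_per_mtok: float = 3.0) -> float:
    """Input-token cost of one turn in USD at a given price per million tokens."""
    return input_tokens / 1_000_000 * usd_per_mtok


profile = SessionProfile(peak_tokens=500_000, retrieval_calls=8, minutes=45, turns=60)
print(recommend(profile))
print(f"${input_cost_per_turn(600_000):.2f} per turn")
```

The cost function reproduces the table's arithmetic: 600K input tokens is 0.6 MTok, and 0.6 × $3 = $1.80 per turn.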