Managing Message History

Message history is your biggest uncontrolled cost in production agents. This article covers the decision between transient and persistent trimming, when summarization beats deletion, and the four failure modes that produce wrong answers without throwing exceptions.

Quick Reference

→wrap_model_call — transient trim: model sees less, checkpointer is unchanged
→@before_model + RemoveMessage(REMOVE_ALL_MESSAGES) — persistent trim: state is permanently rewritten
→RemoveMessage(id=m.id) — surgical delete of a specific message by ID
→SummarizationMiddleware(model, trigger={'tokens': N}, keep={'messages': N}) — auto-compress old turns
→trim_messages(max_tokens, strategy='last', start_on='human') — LCEL chain utility
→Always delete tool-call messages and their ToolMessage results together
→Log token count per turn — silent truncation produces wrong answers, not exceptions
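
The pairing rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain code: messages are stand-in dicts with hypothetical `id`, `role`, `tool_calls`, and `tool_call_id` keys, and `ids_to_remove_together` is a helper name invented for this example. In a real agent, each returned ID would presumably be turned into a RemoveMessage(id=...).

```python
def ids_to_remove_together(messages, target_id):
    """Return the IDs that must be deleted along with `target_id`.

    Deleting an AI tool-call message without its tool results (or the
    reverse) leaves an orphaned half of the exchange in the history.
    """
    by_id = {m["id"]: m for m in messages}
    target = by_id[target_id]

    if target["role"] == "ai" and target.get("tool_calls"):
        owner = target
    elif target["role"] == "tool":
        # Find the AI message that issued this tool call.
        owner = next(
            (m for m in messages
             if m["role"] == "ai"
             and any(c["id"] == target["tool_call_id"]
                     for c in m.get("tool_calls", []))),
            None,
        )
        if owner is None:
            return [target_id]  # already orphaned: nothing else to pair
    else:
        return [target_id]  # plain human or AI message

    call_ids = {c["id"] for c in owner["tool_calls"]}
    group = [owner["id"]]
    group += [m["id"] for m in messages
              if m["role"] == "tool" and m.get("tool_call_id") in call_ids]
    return group


# Stand-in history: one tool exchange between a question and the answer.
history = [
    {"id": "m1", "role": "human", "content": "Weather in Oslo?"},
    {"id": "m2", "role": "ai", "content": "",
     "tool_calls": [{"id": "call_1", "name": "get_weather"}]},
    {"id": "m3", "role": "tool", "tool_call_id": "call_1", "content": "rain"},
    {"id": "m4", "role": "ai", "content": "It is raining in Oslo."},
]

print(ids_to_remove_together(history, "m3"))
```

Asking to delete either half of the exchange returns both IDs, so the caller never has to remember which side it started from.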

What Message History Actually Costs You

A typical agent turn (user message + tool calls + tool results + AI reply) runs 600–1,200 tokens. At 50 turns, that's 30,000–60,000 tokens of history resent on every call. With claude-opus-4-7 at $15/$75 per million tokens in/out, history is billed as input, so a 50-turn conversation costs roughly $0.45–$0.90 per call before any actual work. At 100 turns, double that. Most agent bugs in production aren't logic errors; they're token budget failures that show up as degraded reasoning or silent truncation.

Turns	Tokens (low est.)	Tokens (high est.)	Input cost per call @ claude-opus-4-7
10	6,000	12,000	$0.09–$0.18
30	18,000	36,000	$0.27–$0.54
50	30,000	60,000	$0.45–$0.90
100	60,000	120,000	$0.90–$1.80

Assumptions: 600–1,200 tokens/turn average. claude-opus-4-7 input at $15/M; history is resent as input, so the table prices it at the input rate only. Output at $75/M is billed on top for each reply. These are floor estimates: tool-heavy agents with large tool results can easily exceed 3,000 tokens/turn.
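
To rerun this arithmetic for your own agent, a short sketch may help. It assumes history is billed purely as input at the $15/M rate quoted above and ignores output tokens, caching discounts, and the system prompt. The name `history_input_cost` is made up for this example.

```python
INPUT_PRICE_PER_M = 15.00  # USD per million input tokens (figure quoted in this article)


def history_input_cost(turns, tokens_per_turn, price_per_m=INPUT_PRICE_PER_M):
    """USD cost of resending `turns` turns of history as input on ONE call."""
    history_tokens = turns * tokens_per_turn
    return history_tokens * price_per_m / 1_000_000


# Print the low and high estimate for a few conversation lengths.
for turns in (10, 30, 50, 100):
    low = history_input_cost(turns, 600)
    high = history_input_cost(turns, 1_200)
    print(f"{turns:>4} turns: ${low:.2f} to ${high:.2f} per call")
```

Swap in your own measured tokens per turn and current pricing; the point is that the per-call cost scales linearly with the number of turns you keep.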

Figure: Budget your context window. The system prompt is fixed, history grows, and available space shrinks.

Choosing a Strategy: Trim vs. Summarize vs. Offload

Three strategies exist, and picking the wrong one costs you either money or answer quality. The right choice depends on conversation length and whether early decisions still drive current answers.

Transient vs. Persistent Trimming

The most important architectural decision in message history management is whether your trim changes stored state. Transient trimming (wrap_model_call) modifies only what goes to the model for one call; the checkpointed history stays intact. Persistent trimming (@before_model) rewrites the checkpointed state itself, so the removed messages are gone for every later call.
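
The difference can be shown without any framework. The sketch below is plain Python, not the wrap_model_call or @before_model API: `store` stands in for checkpointed history, and `transient_call` and `persistent_call` are hypothetical names chosen for illustration.

```python
def trim_last(messages, keep):
    """Return the last `keep` messages (keep must be >= 1)."""
    return messages[-keep:]


def transient_call(store, keep):
    """Trim only the copy sent to the model; `store` is left untouched."""
    return trim_last(store, keep)


def persistent_call(store, keep):
    """Trim the stored history in place; dropped messages cannot be recovered."""
    store[:] = trim_last(store, keep)
    return list(store)


store = [f"turn {i}" for i in range(1, 7)]  # six stored turns

seen_transient = transient_call(store, keep=2)
len_after_transient = len(store)  # still 6: nothing was deleted

seen_persistent = persistent_call(store, keep=2)
len_after_persistent = len(store)  # now 2: the first four turns are gone

print(seen_transient, len_after_transient)
print(seen_persistent, len_after_persistent)
```

Both calls show the model the same two turns. Only the persistent variant changes what a later call, a replay, or an audit can still read.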
