
Prompt Compression

Reduce input tokens without losing quality: conversation summarization, context pruning, document compression, and progressive detail reduction. Cut costs 40-60% on long conversations.

Quick Reference

  • Conversation summarization: compress old messages into a brief summary — SummarizationMiddleware
  • Context pruning: remove irrelevant retrieved documents before passing to the model
  • Document compression: extract only the relevant sentences from long documents
  • Progressive detail: recent messages in full, older messages summarized, oldest dropped
  • trim_messages: simple token-based trimming — keep last N tokens of conversation
  • Compression ratio: measure tokens saved vs quality impact — aim for 60% reduction, <5% quality loss
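The simplest technique above, token-based trimming, can be sketched in a few lines. This is a hedged illustration of the keep-last-N-tokens idea behind `trim_messages`, not its real API; `est_tokens` is a crude character-count heuristic standing in for a real tokenizer.

```python
# Sketch of token-budget trimming: keep only the most recent messages
# that fit within max_tokens, dropping older history.

def est_tokens(text: str) -> int:
    # Rough heuristic: ~1 token per 4 characters (not a real tokenizer).
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages whose combined size fits max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):        # walk newest-first
        cost = est_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order

history = ["msg " + "x" * 40 for _ in range(20)]   # 11 tokens each
recent = trim_to_budget(history, max_tokens=50)    # keeps the last 4
```

Walking newest-first means the budget is always spent on the most recent context, which is usually the most relevant to the next turn.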

Why Compress?

| Scenario | Without Compression | With Compression | Savings |
|---|---|---|---|
| 50-turn conversation | ~100K input tokens | ~15K tokens (summary + recent) | 85% |
| RAG with 10 retrieved docs | ~20K tokens | ~5K tokens (relevant sentences only) | 75% |
| Agent with 5 tool results | ~30K tokens | ~8K tokens (offloaded to files) | 73% |
| System prompt + few-shot | ~8K tokens | ~3K tokens (cached prefix) | 63% |
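The RAG row relies on document compression: extracting only the relevant sentences before the document reaches the model. A minimal sketch, assuming a keyword-overlap relevance test; a real system would use embeddings or an LLM-based extractor instead.

```python
# Sketch of document compression: keep only sentences that share
# vocabulary with the query. Keyword overlap is a stand-in for a
# proper relevance scorer (embeddings or an LLM extractor).

def compress_document(doc: str, query: str, min_overlap: int = 1) -> str:
    query_terms = set(query.lower().split())
    kept = []
    for sentence in doc.split(". "):
        words = set(sentence.lower().split())
        if len(words & query_terms) >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)

doc = ("Refunds are processed within 5 days. Our office is in Berlin. "
       "Shipping takes 2 days")
print(compress_document(doc, "when will my refund be processed"))
# keeps only the refund-processing sentence
```

Even this naive filter cuts the three-sentence document to one sentence; the tradeoff is that overlap-based matching misses paraphrases, which is why production systems score relevance semantically.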

Input tokens are the largest cost component for most agents. A 50-turn customer support conversation accumulates ~100K tokens — at $3/1M tokens (Sonnet), that's $0.30 per message just for input. Compression to 15K tokens cuts that to $0.045 — an 85% savings that compounds across thousands of daily conversations.
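The arithmetic above is straightforward to verify, using the $3 per 1M input tokens rate quoted for Sonnet:

```python
# Per-message input cost before and after compression,
# at $3 per 1M input tokens.
PRICE_PER_TOKEN = 3 / 1_000_000

full = 100_000 * PRICE_PER_TOKEN        # uncompressed: $0.30 per message
compressed = 15_000 * PRICE_PER_TOKEN   # compressed:   $0.045 per message
savings = 1 - compressed / full         # 0.85 -> 85% saved
```

At 10,000 conversations a day, that per-message difference is the gap between $3,000 and $450 in daily input spend.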