Intermediate · 8 min read
Prompt Compression
Reduce input tokens without losing quality: conversation summarization, context pruning, document compression, and progressive detail reduction. Cut costs 40-60% on long conversations.
Quick Reference
- Conversation summarization: compress old messages into a brief summary — SummarizationMiddleware
- Context pruning: remove irrelevant retrieved documents before passing to the model
- Document compression: extract only the relevant sentences from long documents
- Progressive detail: recent messages in full, older messages summarized, oldest dropped
- trim_messages: simple token-based trimming — keep the last N tokens of the conversation
- Compression ratio: measure tokens saved vs. quality impact — aim for 60% reduction, <5% quality loss
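The simplest technique above, token-based trimming, can be sketched in a few lines. This is a minimal stand-in for what a helper like `trim_messages` does, not the library API itself: it walks the history newest-first and keeps messages until a token budget is exhausted. The whitespace token counter and the `trim_to_budget` name are illustrative; a real setup would count with the model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace word count.
    return len(text.split())

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the newest messages whose combined size fits within max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk newest-first
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break  # adding this older message would blow the budget
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = [
    {"role": "user", "content": "first question about billing and invoices"},
    {"role": "assistant", "content": "a long detailed answer " * 20},
    {"role": "user", "content": "ok now a short follow up"},
]
trimmed = trim_to_budget(history, max_tokens=30)
```

Note that trimming drops whole messages from the front; the summarization and progressive-detail techniques instead preserve a condensed trace of what was dropped.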
Why Compress?
| Scenario | Without Compression | With Compression | Savings |
|---|---|---|---|
| 50-turn conversation | ~100K input tokens | ~15K tokens (summary + recent) | 85% |
| RAG with 10 retrieved docs | ~20K tokens | ~5K tokens (relevant sentences only) | 75% |
| Agent with 5 tool results | ~30K tokens | ~8K tokens (offloaded to files) | 73% |
| System prompt + few-shot | ~8K tokens | ~3K tokens (cached prefix) | 63% |
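The "summary + recent" shape from the first table row can be sketched as follows. The `summarize` function here is a placeholder that just emits a marker string; in practice it would be an LLM call that condenses the older turns (function names and the `keep_recent` default are illustrative, not from any library).

```python
def summarize(messages: list[dict]) -> str:
    # Placeholder: a real implementation would call a model to
    # condense these messages into a short factual summary.
    return f"[Summary of {len(messages)} earlier messages]"

def compress_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Keep the last keep_recent turns verbatim; fold the rest into one summary."""
    if len(messages) <= keep_recent:
        return messages  # nothing old enough to compress
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {"role": "system", "content": summarize(old)}
    return [summary_msg] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
compressed = compress_history(history)  # 50 messages -> 1 summary + 4 recent
```

A production version would also re-summarize incrementally (fold the previous summary plus newly aged-out turns into a fresh summary) rather than re-reading the whole history each time.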
Input tokens are the largest cost component for most agents. A 50-turn customer support conversation accumulates ~100K tokens — at $3/1M tokens (Sonnet), that's $0.30 per message just for input. Compression to 15K tokens cuts that to $0.045 — an 85% savings that compounds across thousands of daily conversations.
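The arithmetic above is worth making explicit, since it drives the whole cost case (the $3/1M figure is the Sonnet input rate assumed in the text):

```python
# Per-message input cost, before and after compression.
PRICE_PER_INPUT_TOKEN = 3 / 1_000_000  # $3 per 1M input tokens

cost_full = 100_000 * PRICE_PER_INPUT_TOKEN        # full 50-turn history
cost_compressed = 15_000 * PRICE_PER_INPUT_TOKEN   # summary + recent turns
savings = 1 - cost_compressed / cost_full          # fraction saved
```

At 10,000 conversations a day, that difference is roughly $2,550 daily in input cost alone.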