
Prompt Compression

Reduce input tokens without losing quality: conversation summarization, context pruning, document compression, and progressive detail reduction. Cut costs 40-60% on long conversations.

Quick Reference

  • Conversation summarization: compress old messages into a brief summary — SummarizationMiddleware
  • Context pruning: remove irrelevant retrieved documents before passing to the model
  • Document compression: extract only the relevant sentences from long documents
  • Progressive detail: recent messages in full, older messages summarized, oldest dropped
  • trim_messages: simple token-based trimming — keep last N tokens of conversation
  • Compression ratio: measure tokens saved vs quality impact — aim for 60% reduction, <5% quality loss
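The simplest technique above, token-based trimming, can be sketched in a few lines. This is a hedged illustration of the keep-last-N-tokens idea behind `trim_messages`, not its real API; `est_tokens` is a crude character-count heuristic standing in for a real tokenizer.

```python
# Sketch of token-budget trimming: keep only the most recent messages
# that fit within max_tokens, dropping older history.

def est_tokens(text: str) -> int:
    # Rough heuristic: ~1 token per 4 characters (not a real tokenizer).
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages whose combined size fits max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):        # walk newest-first
        cost = est_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order

history = ["msg " + "x" * 40 for _ in range(20)]   # 11 tokens each
recent = trim_to_budget(history, max_tokens=50)    # keeps the last 4
```

Walking newest-first means the budget is always spent on the most recent context, which is usually the most relevant to the next turn.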

Why Compress?

| Scenario | Without Compression | With Compression | Savings |
|---|---|---|---|
| 50-turn conversation | ~100K input tokens | ~15K tokens (summary + recent) | 85% |
| RAG with 10 retrieved docs | ~20K tokens | ~5K tokens (relevant sentences only) | 75% |
| Agent with 5 tool results | ~30K tokens | ~8K tokens (offloaded to files) | 73% |
| System prompt + few-shot | ~8K tokens | ~3K tokens (cached prefix) | 63% |
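The RAG row relies on document compression: extracting only the relevant sentences before the document reaches the model. A minimal sketch, assuming a keyword-overlap relevance test; a real system would use embeddings or an LLM-based extractor instead.

```python
# Sketch of document compression: keep only sentences that share
# vocabulary with the query. Keyword overlap is a stand-in for a
# proper relevance scorer (embeddings or an LLM extractor).

def compress_document(doc: str, query: str, min_overlap: int = 1) -> str:
    query_terms = set(query.lower().split())
    kept = []
    for sentence in doc.split(". "):
        words = set(sentence.lower().split())
        if len(words & query_terms) >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)

doc = ("Refunds are processed within 5 days. Our office is in Berlin. "
       "Shipping takes 2 days")
print(compress_document(doc, "when will my refund be processed"))
# keeps only the refund-processing sentence
```

Even this naive filter cuts the three-sentence document to one sentence; the tradeoff is that overlap-based matching misses paraphrases, which is why production systems score relevance semantically.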

Input tokens are the largest cost component for most agents. A 50-turn customer support conversation accumulates ~100K tokens — at $3/1M tokens (Sonnet), that's $0.30 per message just for input. Compression to 15K tokens cuts that to $0.045 — an 85% savings that compounds across thousands of daily conversations.
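The arithmetic above is straightforward to verify, using the $3 per 1M input tokens rate quoted for Sonnet:

```python
# Per-message input cost before and after compression,
# at $3 per 1M input tokens.
PRICE_PER_TOKEN = 3 / 1_000_000

full = 100_000 * PRICE_PER_TOKEN        # uncompressed: $0.30 per message
compressed = 15_000 * PRICE_PER_TOKEN   # compressed:   $0.045 per message
savings = 1 - compressed / full         # 0.85 -> 85% saved
```

At 10,000 conversations a day, that per-message difference is the gap between $3,000 and $450 in daily input spend.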