Advanced · 10 min
Cost Optimization & Caching
Reducing LLM costs by 60-90%: prompt caching, model tiering, semantic caching, and token budget management.
Quick Reference
- Anthropic prompt caching reduces costs up to 90% for repeated system prompts and tool definitions
- Model tiering: use Haiku for classification/routing, Sonnet for general tasks, Opus for complex reasoning
- Semantic caching: embed the query, check if a similar query was answered recently, return cached response
- Set per-request token budgets using max_tokens and track cumulative cost per conversation
- Compress tool outputs before passing back to the LLM — most tool results contain redundant data
Cost Breakdown
Agents are 10-100x more expensive than single LLM calls
A single agent turn may involve multiple LLM calls (reasoning, tool selection, response generation), each carrying the full conversation context. Multi-turn conversations compound this multiplicatively.
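The compounding is easy to see with a rough cost model. All numbers below are illustrative assumptions (a 2K-token system prompt, 500 tokens of new content per turn, 3 LLM calls per agent turn, and example rates of $3/M input and $15/M output tokens):

```python
# Illustrative cost model; prices and token counts are assumptions.
PRICE_IN = 3.00 / 1_000_000    # $ per input token
PRICE_OUT = 15.00 / 1_000_000  # $ per output token

SYSTEM_TOKENS = 2_000   # system prompt + tool definitions
TURN_TOKENS = 500       # new user/assistant content added per turn
OUTPUT_TOKENS = 300     # output tokens per LLM call
CALLS_PER_TURN = 3      # reasoning, tool selection, response generation

def conversation_cost(turns: int) -> float:
    total = 0.0
    history = 0
    for _ in range(turns):
        history += TURN_TOKENS
        # Every call in the turn resends the system prompt + full history.
        input_tokens = CALLS_PER_TURN * (SYSTEM_TOKENS + history)
        output_tokens = CALLS_PER_TURN * OUTPUT_TOKENS
        total += input_tokens * PRICE_IN + output_tokens * PRICE_OUT
    return total

single_call = (SYSTEM_TOKENS + TURN_TOKENS) * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
agent_10_turns = conversation_cost(10)
print(f"single call:   ${single_call:.4f}")
print(f"10-turn agent: ${agent_10_turns:.4f} ({agent_10_turns / single_call:.0f}x)")
```

Under these assumptions a 10-turn agent conversation costs tens of times more than one LLM call, which is where the 10-100x range comes from.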
| Cost Component | % of Total (typical) | Optimization Lever |
|---|---|---|
| Input tokens (system prompt + history) | 40-60% | Prompt caching, context compression, summarization |
| Output tokens (reasoning + response) | 15-25% | Model tiering, max_tokens limits, concise prompts |
| Tool call overhead (tool descriptions) | 10-20% | Cache tool definitions, lazy-load tool schemas |
| Retry/fallback tokens | 5-15% | Better error classification, fewer unnecessary retries |
| Eval & monitoring | 2-5% | Sample production traces instead of logging 100% |
Input tokens dominate cost because every turn resends the full system prompt, tool definitions, and conversation history. A 10-turn conversation with a 2K system prompt sends that prompt 10 times. This is why prompt caching has the highest ROI of any optimization.
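A back-of-envelope check on that ROI, using the 10-turn conversation with a 2K system prompt: the 1.25x cache-write and 0.1x cache-read multipliers below match Anthropic's documented prompt-caching pricing, while the base rate and token counts are illustrative assumptions:

```python
# Illustrative arithmetic. The 1.25x write / 0.1x read multipliers follow
# Anthropic's published prompt-caching pricing; other numbers are assumptions.
BASE_PRICE = 3.00 / 1_000_000  # $ per input token (example rate)
PROMPT_TOKENS = 2_000          # system prompt + tool definitions
TURNS = 10

# Without caching: the full prompt is billed at the base rate every turn.
uncached = TURNS * PROMPT_TOKENS * BASE_PRICE

# With caching: turn 1 writes the cache (1.25x), turns 2-10 read it (0.1x).
cached = (PROMPT_TOKENS * BASE_PRICE * 1.25
          + (TURNS - 1) * PROMPT_TOKENS * BASE_PRICE * 0.1)

savings = 1 - cached / uncached
print(f"uncached: ${uncached:.4f}, cached: ${cached:.4f}, saved {savings:.0%}")
```

Even in this short conversation the cached prompt costs roughly a fifth of the uncached one; longer conversations amortize the single cache write further, approaching the 90% ceiling.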