
Cost Optimization & Caching

Reducing LLM costs by 60-90%: prompt caching, model tiering, semantic caching, and token budget management.

Quick Reference

  • Anthropic prompt caching reduces costs up to 90% for repeated system prompts and tool definitions
  • Model tiering: use Haiku for classification/routing, Sonnet for general tasks, Opus for complex reasoning
  • Semantic caching: embed the query, check if a similar query was answered recently, return cached response
  • Set per-request token budgets using max_tokens and track cumulative cost per conversation
  • Compress tool outputs before passing back to the LLM — most tool results contain redundant data
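The first bullet can be sketched with Anthropic's prompt caching: placing a `cache_control` breakpoint on the last tool definition and on the system prompt caches that stable prefix so repeated turns read it at a fraction of the base input price. A minimal sketch — the model name, prompt text, and helper function here are illustrative, not part of any SDK:

```python
# Sketch: building a messages.create() payload with cache_control breakpoints
# on the stable prefix (tool definitions + system prompt). Model name and
# helper are placeholders; the cache_control field is Anthropic's API.

def build_cached_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    """Mark the stable prefix as cacheable so repeated turns hit the cache."""
    cached_tools = [dict(t) for t in tools]
    if cached_tools:
        # A breakpoint on the last tool caches all tool definitions before it.
        cached_tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "tools": cached_tools,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache the system prompt
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The resulting dict can be passed as keyword arguments to `client.messages.create(**request)`; because the cached prefix must be byte-identical between calls, keep anything dynamic (timestamps, user data) out of the system prompt and tools.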

Cost Breakdown

Agents are 10-100x more expensive than single LLM calls

A single agent turn may involve multiple LLM calls (reasoning, tool selection, response generation), each carrying the full conversation context. Multi-turn conversations compound this multiplicatively.
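A back-of-the-envelope sketch makes the compounding concrete: every turn resends the fixed prefix (system prompt + tool definitions) plus the growing history. The token counts below are illustrative:

```python
# Why agent conversations compound cost: each turn resends the fixed prefix
# plus all prior messages. Token counts here are illustrative examples.

def conversation_input_tokens(turns: int, prefix: int, per_turn: int) -> int:
    """Total input tokens across a conversation: turn i resends the prefix
    plus i * per_turn tokens of accumulated history."""
    return sum(prefix + i * per_turn for i in range(1, turns + 1))

# 10-turn conversation, 2K-token prefix, 500 new history tokens per turn
total = conversation_input_tokens(10, 2_000, 500)
single = 2_000 + 500  # one standalone LLM call
print(total, total / single)  # → 47500 19.0: ~19x the input cost of one call
```

The prefix alone accounts for 20,000 of those 47,500 tokens, which is exactly the portion prompt caching attacks.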

| Cost Component | % of Total (typical) | Optimization Lever |
|---|---|---|
| Input tokens (system prompt + history) | 40-60% | Prompt caching, context compression, summarization |
| Output tokens (reasoning + response) | 15-25% | Model tiering, max_tokens limits, concise prompts |
| Tool call overhead (tool descriptions) | 10-20% | Cache tool definitions, lazy-load tool schemas |
| Retry/fallback tokens | 5-15% | Better error classification, fewer unnecessary retries |
| Eval & monitoring | 2-5% | Sample production traces instead of logging 100% |
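The model-tiering lever from the table can be as simple as a routing map: send each request to the cheapest model that can handle it, defaulting to the mid tier when unsure. The task labels and model names below are illustrative placeholders:

```python
# Minimal model-tiering sketch: route by task complexity so cheap tasks
# never pay Opus prices. Task labels and model names are placeholders.

TIERS = {
    "classification": "claude-3-5-haiku-latest",   # cheap, fast
    "routing": "claude-3-5-haiku-latest",
    "general": "claude-sonnet-4-20250514",
    "complex_reasoning": "claude-opus-4-20250514",  # expensive, most capable
}

def pick_model(task_type: str) -> str:
    """Return the cheapest adequate model; default to the mid tier."""
    return TIERS.get(task_type, TIERS["general"])
```

In practice the `task_type` label usually comes from a fast classifier (often itself a Haiku-tier call) or from which tool or route the agent is executing.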

Input tokens dominate cost because every turn resends the full system prompt, tool definitions, and conversation history. A 10-turn conversation with a 2K system prompt sends that prompt 10 times. This is why prompt caching has the highest ROI of any optimization.
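The ROI claim can be estimated with Anthropic's published multipliers, assuming a cache write costs ~1.25x the base input price and a cache read ~0.1x; the prompt size and turn count below are examples:

```python
# Rough prompt-caching ROI on the dominant input-token cost, assuming
# Anthropic's multipliers: cache write ~1.25x base input price, read ~0.1x.

def caching_savings_pct(prefix_tokens: int, turns: int) -> float:
    """Percent saved on the cached prefix across a conversation:
    turn 1 writes the cache (1.25x), later turns read it (0.1x each)."""
    uncached = turns * prefix_tokens                     # full price every turn
    cached = (1.25 + 0.1 * (turns - 1)) * prefix_tokens  # one write, then reads
    return 100 * (1 - cached / uncached)

# 2K-token system prompt resent over 10 turns
print(round(caching_savings_pct(2_000, 10), 1))  # → 78.5
```

Savings approach the ~90% ceiling as conversations get longer, since the one-time 1.25x write is amortized over more 0.1x reads.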