Advanced · 14 min

Cost Optimization & Caching

Diagnose where your agent spends money, then apply optimizations in ROI order: prompt caching first, then the batch API for offline work, model tiering for per-task routing, and context compression for long conversations. All cost math is computed with April 2026 pricing.

Quick Reference

  • Prompt caching: 90% savings on cached tokens; no beta header needed. Put static content first, in the order the cached prefix is built: tools, then system prompt, then history
  • Minimum cacheable: 4,096 tokens for Opus 4.7/4.6/4.5 and Haiku 4.5; 2,048 for Sonnet 4.6; 1,024 for Sonnet 4.5 and older
  • Batch API: 50% off all tokens for offline workloads that tolerate ~1 hour latency
  • Model tiering: Haiku 4.5 ($1/MTok input) for classification/routing, Sonnet 4.6 ($3/MTok input) for general tasks, Opus ($5/MTok input) for complex reasoning
  • 1-hour cache: cache_control.ttl = '1h' at 2x write cost; cheaper than no caching from the second read within the hour
  • Opus 4.7 tokenizer produces up to 35% more tokens for the same text — upgrading can increase costs even at the same per-token rate
  • Track cost per conversation from response.response_metadata['usage']; alert when it exceeds 2x the 7-day average
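
The caching figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the two multipliers quoted in this list (cache reads at 10% of the base input price, 1-hour cache writes at 2x) and assumes every request reuses an identical prefix and arrives inside the 1-hour TTL. The function names are hypothetical and not part of any SDK.

```python
# Multipliers relative to the base input price, taken from the Quick Reference.
CACHE_READ_MULTIPLIER = 0.10      # "90% savings on cached tokens"
CACHE_WRITE_1H_MULTIPLIER = 2.0   # "2x write cost" for the 1-hour TTL


def uncached_cost(prefix_tokens: int, requests: int, price_per_mtok: float) -> float:
    """Cost of resending the same prefix on every request with no caching."""
    return prefix_tokens * requests * price_per_mtok / 1_000_000


def cached_cost_1h(prefix_tokens: int, requests: int, price_per_mtok: float) -> float:
    """Cost when the first request writes the cache and the rest read it.

    Assumption: all requests land inside the 1-hour TTL, so there is
    exactly one cache write followed by (requests - 1) cache reads.
    """
    if requests <= 0:
        return 0.0
    base = prefix_tokens * price_per_mtok / 1_000_000
    reads = requests - 1
    return base * (CACHE_WRITE_1H_MULTIPLIER + CACHE_READ_MULTIPLIER * reads)


def break_even_reads() -> int:
    """Smallest number of cache reads at which the 1-hour cache beats no caching."""
    reads = 0
    # Cached: one write plus `reads` reads. Uncached: (1 + reads) full-price sends.
    while CACHE_WRITE_1H_MULTIPLIER + CACHE_READ_MULTIPLIER * reads >= 1 + reads:
        reads += 1
    return reads
```

For a 10,000-token prefix at $3/MTok, one write plus two reads costs about $0.066 with the 1-hour cache versus $0.09 uncached, while one write plus a single read ($0.063) is still slightly more expensive than two uncached sends ($0.06).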

Should You Optimize Yet?

Fix behavior before fixing cost

If your agent does not work correctly yet, stop. A cheap broken agent is still broken. Optimize cost after your agent passes its eval suite and handles edge cases reliably. Premature cost optimization delays shipping and creates technical debt in the wrong direction.

The two most common mistakes: (1) shipping an agent with no cost controls and getting a $10K surprise bill, and (2) spending two weeks on cost optimization before the agent is accurate enough to deploy. The right time to optimize is after behavior works and before you scale to production traffic.

  • Optimize now if: cost per conversation exceeds your unit economics, you are scaling beyond dev traffic, or you see runaway token usage in logs
  • Optimize now if: you have working evals and your agent consistently passes them at your quality threshold
  • Wait if: your agent's accuracy is below your quality gate; cheaper wrong answers are not better
  • Wait if: you are still iterating on the system prompt — cache structure changes with every prompt iteration
  • Wait if: you have not measured where your tokens actually go — measure first, optimize second
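
One possible way to "measure first" is to total the cost of each conversation from its per-call usage records and compare it with a rolling baseline, following the 2x-of-7-day-average rule from the Quick Reference. This is a minimal sketch under stated assumptions: each usage record is assumed to be a dict with `input_tokens` and `output_tokens` keys, the prices are placeholders to replace with your model's actual rates, cache read and write tokens (which are billed at different rates) are ignored, and `conversation_cost` and `SpikeDetector` are hypothetical names.

```python
from collections import deque
from statistics import mean

# Placeholder prices in USD per million tokens; substitute your model's rates.
INPUT_PRICE_PER_MTOK = 3.0
OUTPUT_PRICE_PER_MTOK = 15.0


def conversation_cost(usages: list[dict]) -> float:
    """Sum the cost of every model call in one conversation.

    Assumption: each usage dict has 'input_tokens' and 'output_tokens'.
    Cached-token pricing is deliberately left out of this sketch.
    """
    total = 0.0
    for usage in usages:
        total += usage.get("input_tokens", 0) * INPUT_PRICE_PER_MTOK / 1_000_000
        total += usage.get("output_tokens", 0) * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return total


class SpikeDetector:
    """Flags a conversation whose cost exceeds a multiple of the rolling average."""

    def __init__(self, window_days: int = 7, threshold: float = 2.0) -> None:
        # Keeps only the most recent `window_days` daily averages.
        self.daily_averages: deque[float] = deque(maxlen=window_days)
        self.threshold = threshold

    def record_day(self, average_cost: float) -> None:
        """Store one day's average cost per conversation."""
        self.daily_averages.append(average_cost)

    def is_spike(self, cost: float) -> bool:
        """True when cost is more than threshold x the rolling average."""
        if not self.daily_averages:
            return False  # no baseline yet, so nothing to compare against
        return cost > self.threshold * mean(self.daily_averages)
```

In use, you would call `record_day` once per day with that day's average and `is_spike` on each finished conversation; logging the per-call usage dicts alongside the total also shows which step of the agent consumes the most tokens.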