
Cost Forensics

Token accounting reveals where your AI budget actually goes. Learn to track costs per user, per feature, and per conversation — then optimize with context trimming, model tiering, and caching to cut costs by 50-80% without sacrificing quality.

Quick Reference

  • System prompts are charged on EVERY call — a 2000-token system prompt at 100 calls/user/day adds up fast
  • Conversation history is the #1 hidden cost — it grows linearly with each turn and is re-sent every time
  • Retries, tool call overhead, and embedding calls are often invisible in billing dashboards
  • Track cost per conversation (not per API call) to understand true unit economics
  • The biggest savings come from: shorter system prompts, history trimming, and using cheaper models for routing
  • Set per-user and per-feature budget limits to prevent runaway costs
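The last point above can be enforced with a small guardrail in front of your LLM client. This is a minimal sketch, assuming an in-memory per-user daily spend tracker; the class and method names are illustrative, not a real library API:

```python
from collections import defaultdict

class BudgetGuard:
    """Sketch of a per-user daily budget limit (illustrative names)."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit_usd = daily_limit_usd
        self.spend = defaultdict(float)  # user_id -> USD spent today

    def record(self, user_id: str, cost_usd: float) -> None:
        self.spend[user_id] += cost_usd

    def allowed(self, user_id: str) -> bool:
        # Check before each call; block once the user hits the daily limit
        return self.spend[user_id] < self.daily_limit_usd

guard = BudgetGuard(daily_limit_usd=5.00)
guard.record("alice", 4.99)
print(guard.allowed("alice"))  # True: still under the limit
guard.record("alice", 0.02)
print(guard.allowed("alice"))  # False: limit exceeded, calls blocked
```

In production you would back this with a shared store (e.g. Redis) and reset the counters daily, but the gating logic is the same.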

Where Your Tokens Actually Go

Most teams look at their monthly LLM bill and have no idea where the money went. Token accounting breaks down every request into its components: system prompt, conversation history, retrieved context, user message, and response. The results are often surprising — system prompts and conversation history dominate, not the user's question.
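The breakdown described above can be captured per request. A minimal sketch, with hypothetical token counts and example rates of $3/M input and $15/M output:

```python
from dataclasses import dataclass

@dataclass
class RequestBreakdown:
    """Token counts for one LLM call, split by component (illustrative)."""
    system_prompt: int
    history: int
    retrieved_context: int
    user_message: int
    response: int

    def input_tokens(self) -> int:
        return (self.system_prompt + self.history
                + self.retrieved_context + self.user_message)

    def cost_usd(self, input_per_m: float, output_per_m: float) -> float:
        # Input and output tokens are priced separately, per million tokens
        return (self.input_tokens() * input_per_m
                + self.response * output_per_m) / 1_000_000

req = RequestBreakdown(system_prompt=2000, history=4000,
                       retrieved_context=1000, user_message=100, response=500)
# Only 100 of the 7100 input tokens are the user's actual question
print(f"${req.cost_usd(3.0, 15.0):.4f}")  # $0.0288
```

Logging one of these records per call, tagged with user and feature IDs, is what makes the per-user and per-feature rollups in the next sections possible.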

| Token Source | Typical Size | Sent Every Call? | Cost Impact |
|---|---|---|---|
| System prompt | 500-3000 tokens | Yes — every single call | HIGH — multiplied by call count |
| Conversation history | 100-10,000+ tokens (grows) | Yes — grows with each turn | HIGHEST — cumulative cost grows quadratically |
| Retrieved context (RAG) | 500-2000 tokens | Per RAG call | Medium — proportional to chunk count |
| User message | 20-200 tokens | Once per turn | Low |
| Tool call descriptions | 200-1000 tokens | Every agent call | Medium — often overlooked |
| LLM response | 100-2000 tokens | Once per generation | Medium — output tokens cost 2-5x more |

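The conversation-history row deserves a closer look: each call grows linearly, but because every turn re-sends all prior turns, the cumulative input tokens across a conversation grow quadratically. A quick sketch, assuming a hypothetical ~150 tokens per turn:

```python
# Each call re-sends the entire prior history (hypothetical 150 tokens/turn)
tokens_per_turn = 150
cumulative_input = 0
for turn in range(1, 21):  # a 20-turn conversation
    history = (turn - 1) * tokens_per_turn  # prior turns, re-sent this call
    cumulative_input += history + tokens_per_turn
print(cumulative_input)  # 31500 — not 20 * 150 = 3000
```

Twenty turns of 150 tokens each is only 3,000 tokens of new content, but re-sending history means you are billed for 31,500 input tokens: over 10x the naive estimate, and the gap widens as conversations get longer.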
Output Tokens Cost 2-5x More Than Input Tokens

Every major provider charges more for output tokens than input tokens. GPT-5.4 charges $2.50/M input but $10/M output. Claude Sonnet 4.6 charges $3/M input but $15/M output. A verbose response costs 2-5x more per token than the prompt that generated it. Instruct models to be concise when you do not need verbose output.
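The effect of response verbosity on cost falls out of simple arithmetic. A sketch using the Claude Sonnet 4.6 rates quoted above ($3/M input, $15/M output) and a hypothetical 1000-token prompt:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_per_m: float, output_per_m: float) -> float:
    """Cost of one call, with separate input/output per-million-token rates."""
    return (input_tokens * input_per_m
            + output_tokens * output_per_m) / 1_000_000

# Identical 1000-token prompt; only the response length differs
verbose = call_cost(1000, 1500, 3.0, 15.0)  # rambling answer
concise = call_cost(1000, 300, 3.0, 15.0)   # instructed to be brief
print(f"${verbose:.4f} vs ${concise:.4f}")  # $0.0255 vs $0.0075
```

Trimming the response from 1,500 to 300 tokens cuts the call cost by more than two thirds, without touching the prompt at all. This is why a one-line "be concise" instruction is often the cheapest optimization available.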