# Cost Forensics
Token accounting reveals where your AI budget actually goes. Learn to track costs per user, per feature, and per conversation — then optimize with context trimming, model tiering, and caching to cut costs by 50-80% without sacrificing quality.
## Quick Reference
- System prompts are charged on EVERY call — a 2000-token system prompt at 100 calls/user/day adds up fast
- Conversation history is the #1 hidden cost — it grows with each turn and the full history is re-sent every time
- Retries, tool call overhead, and embedding calls are often invisible in billing dashboards
- Track cost per conversation (not per API call) to understand true unit economics
- The biggest savings come from: shorter system prompts, history trimming, and using cheaper models for routing
- Set per-user and per-feature budget limits to prevent runaway costs
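The last point can be enforced with a small guard in front of your LLM client. This is a minimal sketch, assuming you record actual cost after each call; the names (`BudgetGuard`, `allow`, `record`) are hypothetical, and a production version would persist spend and reset it daily.

```python
from collections import defaultdict

class BudgetGuard:
    """Hypothetical per-user daily budget guard (illustrative sketch)."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit = daily_limit_usd
        self.spent = defaultdict(float)  # user_id -> spend so far today

    def allow(self, user_id: str) -> bool:
        """Check BEFORE making a call; refuse once the budget is hit."""
        return self.spent[user_id] < self.daily_limit

    def record(self, user_id: str, cost_usd: float) -> None:
        """Record actual cost AFTER each call completes."""
        self.spent[user_id] += cost_usd

guard = BudgetGuard(daily_limit_usd=1.00)
guard.record("user-42", 0.95)
print(guard.allow("user-42"))  # still under the $1 limit
guard.record("user-42", 0.10)
print(guard.allow("user-42"))  # over budget: refuse, or route to a cheaper model
```

When a user trips the limit, you can fail closed (refuse) or degrade gracefully (switch to a cheaper model tier), depending on the feature.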
## Where Your Tokens Actually Go
Most teams look at their monthly LLM bill and have no idea where the money went. Token accounting breaks down every request into its components: system prompt, conversation history, retrieved context, user message, and response. The results are often surprising — system prompts and conversation history dominate, not the user's question.
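A per-request breakdown can be sketched in a few lines. The token counts below use a crude 4-characters-per-token estimate purely for illustration; a real implementation would use the provider's tokenizer (e.g. tiktoken for OpenAI models), and `account_request` is a hypothetical helper, not a library API.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English."""
    return max(1, len(text) // 4)

def account_request(system_prompt, history, retrieved_context, user_message):
    """Break one chat request into its input-token cost components."""
    breakdown = {
        "system_prompt": estimate_tokens(system_prompt),
        "history": sum(estimate_tokens(m["content"]) for m in history),
        "retrieved_context": estimate_tokens(retrieved_context),
        "user_message": estimate_tokens(user_message),
    }
    breakdown["total_input"] = sum(breakdown.values())
    return breakdown

breakdown = account_request(
    system_prompt="You are a helpful support agent." * 20,  # bloated prompt
    history=[{"content": "Hi, my order is late."},
             {"content": "Sorry to hear that..."}],
    retrieved_context="Order #1234 shipped on ...",
    user_message="Where is my package?",
)
print(breakdown)  # the system prompt dwarfs the actual question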
| Token Source | Typical Size | Sent Every Call? | Cost Impact |
|---|---|---|---|
| System prompt | 500-3000 tokens | Yes — every single call | HIGH — multiplied by call count |
| Conversation history | 100-10000+ tokens (grows) | Yes — grows with each turn | HIGHEST — quadratic cumulative cost |
| Retrieved context (RAG) | 500-2000 tokens | Per RAG call | Medium — proportional to chunk count |
| User message | 20-200 tokens | Once per turn | Low |
| Tool call descriptions | 200-1000 tokens | Every agent call | Medium — often overlooked |
| LLM response | 100-2000 tokens | Once per generation | Medium — output tokens cost 2-5x more |
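The history row deserves a closer look: because every turn re-sends all prior turns plus the system prompt, cumulative input tokens grow roughly quadratically with conversation length. The numbers below are illustrative assumptions, not measurements.

```python
SYSTEM_TOKENS = 1000  # system prompt, re-sent on every call (assumed size)
TURN_TOKENS = 300     # avg user message + assistant reply per turn (assumed)

def cumulative_input_tokens(turns: int) -> int:
    """Total input tokens billed across a whole conversation."""
    total = 0
    history = 0
    for _ in range(turns):
        total += SYSTEM_TOKENS + history  # this call re-sends everything so far
        history += TURN_TOKENS            # this turn joins the history
    return total

for turns in (5, 20, 50):
    print(turns, cumulative_input_tokens(turns))
# 5 turns:  8,000 tokens
# 20 turns: 77,000 tokens
# 50 turns: 417,500 tokens — 10x the turns of a 5-turn chat, ~52x the tokens
```

This is why trimming or summarizing history pays off far more in long conversations than any single-call optimization.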
Every major provider charges more for output tokens than input tokens. GPT-5.4 charges $2.50/M input but $10/M output. Claude Sonnet 4.6 charges $3/M input but $15/M output. A verbose response costs 2-5x more per token than the prompt that generated it. Instruct models to be concise when you do not need verbose output.
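The asymmetry is easy to see in a per-call cost function. Prices here are the per-million-token figures quoted above; treat them as examples that change over time, not current quotes.

```python
PRICES = {  # model -> (input $/M tokens, output $/M tokens), from the text above
    "gpt":    (2.50, 10.00),
    "sonnet": (3.00, 15.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call under asymmetric input/output pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A verbose 1500-token answer to a 500-token prompt:
verbose = call_cost("gpt", 500, 1500)
# The same question answered concisely in 300 tokens:
concise = call_cost("gpt", 500, 300)
print(f"verbose: ${verbose:.5f}, concise: ${concise:.5f}")
# verbose: $0.01625, concise: $0.00425 — the concise answer costs ~74% less
```

Because the input side of both calls is identical, the entire difference comes from output tokens, which is why "be concise" instructions translate directly into savings.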