Latency Profiling
Most AI system latency comes from LLM calls and tool execution — not your code. Learn to profile every stage of an agent run, understand TTFT vs TPS, optimize streaming pipelines, and find the biggest wins by reducing LLM calls rather than optimizing code.
Quick Reference
- In a typical agent, 60-80% of latency is LLM inference time — optimizing your code yields marginal gains
- Time-to-first-token (TTFT) is what users feel; total generation time determines throughput
- Streaming transforms a 5-second wait into a 500ms perceived latency — always stream for user-facing responses
- The biggest optimization: reduce the NUMBER of LLM calls (combine steps, cache, use rules for simple cases)
- Tool execution (DB queries, API calls) is the second biggest bottleneck — parallelize independent tool calls
- Profile in production, not just development — latency varies significantly with provider load
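The "reduce the number of calls" lever can be as simple as memoizing repeated, deterministic prompts before they reach the model. A minimal in-memory sketch — `call_llm` is a hypothetical stand-in for a real provider client, and the 300ms sleep simulates its latency:

```python
import hashlib
import time

_cache = {}  # prompt hash -> response; in production, use Redis or similar

def call_llm(prompt):
    """Hypothetical stand-in for a real provider call (~300ms)."""
    time.sleep(0.3)
    return f"answer to: {prompt}"

def cached_llm(prompt):
    """Skip the model entirely for repeated deterministic prompts."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

start = time.perf_counter()
cached_llm("What are your support hours?")  # first call pays full latency
first = time.perf_counter() - start

start = time.perf_counter()
cached_llm("What are your support hours?")  # repeat is served from cache
second = time.perf_counter() - start
print(f"first: {first*1000:.0f}ms, cached: {second*1000:.0f}ms")
```

This only works for calls that are safe to replay (e.g. temperature 0, no per-user context in the prompt); anything user-specific needs the user's identity folded into the cache key.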
Where Time Actually Goes
Before optimizing, you need to know where time is actually spent. In a typical AI agent handling a customer query, the breakdown looks roughly like this: LLM inference takes 60-80% of total time, tool execution (database queries, API calls) takes 15-30%, and your application code takes 2-5%. This means optimizing your Python code has almost no impact — the gains come from reducing LLM calls and parallelizing tool execution.
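Getting this breakdown doesn't require a heavyweight profiler: a per-stage timer wrapped around each call site is enough. A minimal sketch — the stage names and `sleep` calls are illustrative stand-ins for real LLM and tool calls:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named stage of an agent run."""

    def __init__(self):
        self.durations = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name] += time.perf_counter() - start

    def report(self):
        """Return {stage: (seconds, percent_of_total)}, slowest first."""
        total = sum(self.durations.values()) or 1.0
        return {name: (secs, 100 * secs / total)
                for name, secs in sorted(self.durations.items(),
                                         key=lambda kv: -kv[1])}

timer = StageTimer()
with timer.stage("llm_main_call"):
    time.sleep(0.05)   # stand-in for the real LLM call
with timer.stage("db_query"):
    time.sleep(0.01)   # stand-in for a tool call
for name, (secs, pct) in timer.report().items():
    print(f"{name}: {secs*1000:.0f}ms ({pct:.0f}%)")
```

Wrapping each LLM call, tool call, and retrieval step this way yields exactly the kind of per-stage percentages shown in the table.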
| Stage | Typical Latency | % of Total | Optimization Lever |
|---|---|---|---|
| LLM routing/classification call | 200-500ms | 10-15% | Replace with rules or embedding classifier |
| LLM main generation call | 1-5s | 40-60% | Shorter prompts, smaller models, streaming |
| Vector search retrieval | 50-200ms | 2-5% | Index optimization, caching |
| Database queries (tool) | 50-500ms | 5-15% | Query optimization, connection pooling |
| External API calls (tool) | 100-2000ms | 5-20% | Caching, parallel execution |
| Serialization/parsing | 5-20ms | < 1% | Not worth optimizing |
| Application logic | 10-50ms | 1-3% | Rarely the bottleneck |
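The "parallel execution" lever applies whenever tool calls don't depend on each other's results. A sketch with `asyncio` — the two fetchers are hypothetical tools whose latency is simulated with sleeps:

```python
import asyncio
import time

async def fetch_order_status(order_id):
    """Hypothetical tool: stand-in for a ~300ms external API call."""
    await asyncio.sleep(0.3)
    return {"order": order_id, "status": "shipped"}

async def fetch_account_tier(user_id):
    """Hypothetical tool: stand-in for a ~200ms database query."""
    await asyncio.sleep(0.2)
    return {"user": user_id, "tier": "gold"}

async def sequential(user_id, order_id):
    a = await fetch_order_status(order_id)
    b = await fetch_account_tier(user_id)
    return a, b

async def parallel(user_id, order_id):
    # Independent calls run concurrently: latency ~max(300, 200)ms,
    # not the 500ms sum.
    return await asyncio.gather(fetch_order_status(order_id),
                                fetch_account_tier(user_id))

start = time.perf_counter()
asyncio.run(parallel("u1", "o1"))
print(f"parallel: {time.perf_counter() - start:.2f}s")
```

The caveat is ordering: only tools whose inputs don't come from each other's outputs can be gathered this way.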
Time-to-first-token (TTFT) is the delay before any output appears — this is what the user 'feels'. Tokens-per-second (TPS) determines how fast the streaming text renders. Total time = TTFT + (output_tokens / TPS). For user experience, TTFT matters most. For throughput and cost, total time matters most.
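Both numbers fall out of a single streaming loop. A sketch that measures TTFT and TPS — the generator here is a fake stand-in for a real streaming client, with illustrative timings:

```python
import time

def fake_stream():
    """Stand-in for a streaming LLM response (hypothetical timings)."""
    time.sleep(0.2)                 # model latency before the first token
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)            # inter-token gap
        yield tok

start = time.perf_counter()
ttft = None
n_tokens = 0
for token in fake_stream():
    if ttft is None:
        ttft = time.perf_counter() - start   # time-to-first-token
    n_tokens += 1
total = time.perf_counter() - start

# TPS measured over the tokens after the first;
# total ~= TTFT + output_tokens / TPS
tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
print(f"TTFT: {ttft*1000:.0f}ms, total: {total*1000:.0f}ms, ~{tps:.0f} tok/s")
```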