
Latency Profiling

Most AI system latency comes from LLM calls and tool execution — not your code. Learn to profile every stage of an agent run, understand TTFT vs TPS, optimize streaming pipelines, and find the biggest wins by reducing LLM calls rather than optimizing code.

Quick Reference

  • In a typical agent, 60-80% of latency is LLM inference time — optimizing your own code yields marginal gains
  • Time-to-first-token (TTFT) is what users feel; total generation time determines throughput
  • Streaming transforms a 5-second wait into a 500ms perceived latency — always stream for user-facing responses
  • The biggest optimization: reduce the NUMBER of LLM calls (combine steps, cache, use rules for simple cases)
  • Tool execution (DB queries, API calls) is the second biggest bottleneck — parallelize independent tool calls
  • Profile in production, not just development — latency varies significantly with provider load
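
The parallelization point above can be sketched with `asyncio.gather`. The tool names and sleep durations here are hypothetical stand-ins for a real database query and external API call; the pattern, not the numbers, is what matters:

```python
import asyncio
import time

# Hypothetical async tools -- stand-ins for a real DB query and API call.
async def fetch_order_status(order_id: str) -> str:
    await asyncio.sleep(0.3)  # simulate a 300ms database query
    return f"order {order_id}: shipped"

async def fetch_account_info(user_id: str) -> str:
    await asyncio.sleep(0.4)  # simulate a 400ms external API call
    return f"user {user_id}: premium"

async def handle_query() -> tuple[str, str, float]:
    start = time.perf_counter()
    # The two tools don't depend on each other, so run them concurrently:
    # wall-clock cost is max(0.3, 0.4) ~ 0.4s instead of 0.3 + 0.4 = 0.7s.
    order, account = await asyncio.gather(
        fetch_order_status("A123"),
        fetch_account_info("u42"),
    )
    return order, account, time.perf_counter() - start

order, account, elapsed = asyncio.run(handle_query())
```

Sequential awaits would pay for both latencies; `gather` pays only for the slowest.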

Where Time Actually Goes

Before optimizing, you need to know where time is actually spent. In a typical AI agent handling a customer query, the breakdown looks roughly like this: LLM inference takes 60-80% of total time, tool execution (database queries, API calls) takes 15-30%, and your application code takes 2-5%. This means optimizing your Python code has almost no impact — the gains come from reducing LLM calls and parallelizing tool execution.

| Stage | Typical Latency | % of Total | Optimization Lever |
|---|---|---|---|
| LLM routing/classification call | 200-500ms | 10-15% | Replace with rules or embedding classifier |
| LLM main generation call | 1-5s | 40-60% | Shorter prompts, smaller models, streaming |
| Vector search retrieval | 50-200ms | 2-5% | Index optimization, caching |
| Database queries (tool) | 50-500ms | 5-15% | Query optimization, connection pooling |
| External API calls (tool) | 100-2000ms | 5-20% | Caching, parallel execution |
| Serialization/parsing | 5-20ms | < 1% | Not worth optimizing |
| Application logic | 10-50ms | 1-3% | Rarely the bottleneck |

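
To get a breakdown like this for your own agent, time each stage explicitly. A minimal sketch (the stage names and `time.sleep` calls are illustrative placeholders for real LLM and tool calls):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per stage, in seconds.
timings: dict[str, float] = defaultdict(float)

@contextmanager
def timed(stage: str):
    """Wrap any agent stage to record how long it took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start

# Simulated agent run -- sleeps stand in for real LLM/tool latencies.
with timed("llm_routing"):
    time.sleep(0.02)
with timed("llm_generation"):
    time.sleep(0.10)
with timed("tool_db_query"):
    time.sleep(0.03)
with timed("app_logic"):
    time.sleep(0.005)

# Convert raw seconds into a percentage-of-total report.
total = sum(timings.values())
report = {stage: round(100 * t / total, 1) for stage, t in timings.items()}
```

In production, the same context manager can emit spans to your tracing backend instead of a local dict, so the breakdown reflects real provider load rather than development conditions.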
TTFT vs TPS vs Total Time

Time-to-first-token (TTFT) is the delay before any output appears — this is what the user 'feels'. Tokens-per-second (TPS) determines how fast the streaming text renders. Total time = TTFT + (output_tokens / TPS). For user experience, TTFT matters most. For throughput and cost, total time matters most.
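
The identity above is worth making concrete. With illustrative numbers (500ms TTFT, 300 output tokens, 60 tokens/sec — not measurements from any particular provider):

```python
def total_latency(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total time = TTFT + (output_tokens / TPS), all in seconds."""
    return ttft_s + output_tokens / tps

# Non-streaming: the user stares at a spinner for the full total.
total = total_latency(ttft_s=0.5, output_tokens=300, tps=60.0)  # 0.5 + 5.0 = 5.5s

# Streaming: the user sees the first token after just the TTFT (0.5s),
# even though the complete response still takes `total` seconds to finish.
perceived = 0.5
```

This is why streaming dominates perceived latency: it cannot reduce total time, but it moves the user-visible wait from `total` down to `ttft_s`.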