Latency Profiling
Most AI system latency comes from LLM calls and tool execution — not your code. Learn to profile every stage of an agent run, understand TTFT vs TPS, optimize streaming pipelines, and find the biggest wins by reducing LLM calls rather than optimizing code.
Quick Reference
- In a typical agent, 60-80% of latency is LLM inference time — optimizing your code yields marginal gains
- Time-to-first-token (TTFT) is what users feel; total generation time determines throughput
- Streaming transforms a 5-second wait into a 500ms perceived latency — always stream for user-facing responses
- The biggest optimization: reduce the NUMBER of LLM calls (combine steps, cache, use rules for simple cases)
- Tool execution (DB queries, API calls) is the second biggest bottleneck — parallelize independent tool calls
- Profile in production, not just development — latency varies significantly with provider load
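The "reduce the number of calls" lever can be as simple as memoizing repeated, deterministic prompts before they reach the model. A minimal in-memory sketch — `call_llm` is a hypothetical stand-in for a real provider client, and the 300ms sleep simulates its latency:

```python
import hashlib
import time

_cache = {}  # prompt hash -> response; in production, use Redis or similar

def call_llm(prompt):
    """Hypothetical stand-in for a real provider call (~300ms)."""
    time.sleep(0.3)
    return f"answer to: {prompt}"

def cached_llm(prompt):
    """Skip the model entirely for repeated deterministic prompts."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

start = time.perf_counter()
cached_llm("What are your support hours?")  # first call pays full latency
first = time.perf_counter() - start

start = time.perf_counter()
cached_llm("What are your support hours?")  # repeat is served from cache
second = time.perf_counter() - start
print(f"first: {first*1000:.0f}ms, cached: {second*1000:.0f}ms")
```

This only works for calls that are safe to replay (e.g. temperature 0, no per-user context in the prompt); anything user-specific needs the user's identity folded into the cache key.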
Where Time Actually Goes
Before optimizing, you need to know where time is actually spent. In a typical AI agent handling a customer query, the breakdown looks roughly like this: LLM inference takes 60-80% of total time, tool execution (database queries, API calls) takes 15-30%, and your application code takes 2-5%. This means optimizing your Python code has almost no impact — the gains come from reducing LLM calls and parallelizing tool execution.
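Getting this breakdown doesn't require a heavyweight profiler: a per-stage timer wrapped around each call site is enough. A minimal sketch — the stage names and `sleep` calls are illustrative stand-ins for real LLM and tool calls:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named stage of an agent run."""

    def __init__(self):
        self.durations = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name] += time.perf_counter() - start

    def report(self):
        """Return {stage: (seconds, percent_of_total)}, slowest first."""
        total = sum(self.durations.values()) or 1.0
        return {name: (secs, 100 * secs / total)
                for name, secs in sorted(self.durations.items(),
                                         key=lambda kv: -kv[1])}

timer = StageTimer()
with timer.stage("llm_main_call"):
    time.sleep(0.05)   # stand-in for the real LLM call
with timer.stage("db_query"):
    time.sleep(0.01)   # stand-in for a tool call
for name, (secs, pct) in timer.report().items():
    print(f"{name}: {secs*1000:.0f}ms ({pct:.0f}%)")
```

Wrapping each LLM call, tool call, and retrieval step this way yields exactly the kind of per-stage percentages shown in the table.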
| Stage | Typical Latency | % of Total | Optimization Lever |
|---|---|---|---|
| LLM routing/classification call | 200-500ms | 10-15% | Replace with rules or embedding classifier |
| LLM main generation call | 1-5s | 40-60% | Shorter prompts, smaller models, streaming |
| Vector search retrieval | 50-200ms | 2-5% | Index optimization, caching |
| Database queries (tool) | 50-500ms | 5-15% | Query optimization, connection pooling |
| External API calls (tool) | 100-2000ms | 5-20% | Caching, parallel execution |
| Serialization/parsing | 5-20ms | < 1% | Not worth optimizing |
| Application logic | 10-50ms | 1-3% | Rarely the bottleneck |
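The "parallel execution" lever applies whenever tool calls don't depend on each other's results. A sketch with `asyncio` — the two fetchers are hypothetical tools whose latency is simulated with sleeps:

```python
import asyncio
import time

async def fetch_order_status(order_id):
    """Hypothetical tool: stand-in for a ~300ms external API call."""
    await asyncio.sleep(0.3)
    return {"order": order_id, "status": "shipped"}

async def fetch_account_tier(user_id):
    """Hypothetical tool: stand-in for a ~200ms database query."""
    await asyncio.sleep(0.2)
    return {"user": user_id, "tier": "gold"}

async def sequential(user_id, order_id):
    a = await fetch_order_status(order_id)
    b = await fetch_account_tier(user_id)
    return a, b

async def parallel(user_id, order_id):
    # Independent calls run concurrently: latency ~max(300, 200)ms,
    # not the 500ms sum.
    return await asyncio.gather(fetch_order_status(order_id),
                                fetch_account_tier(user_id))

start = time.perf_counter()
asyncio.run(parallel("u1", "o1"))
print(f"parallel: {time.perf_counter() - start:.2f}s")
```

The caveat is ordering: only tools whose inputs don't come from each other's outputs can be gathered this way.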
Time-to-first-token (TTFT) is the delay before any output appears — this is what the user 'feels'. Tokens-per-second (TPS) determines how fast the streaming text renders. Total time = TTFT + (output_tokens / TPS). For user experience, TTFT matters most. For throughput and cost, total time matters most.
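Both numbers fall out of a single streaming loop. A sketch that measures TTFT and TPS — the generator here is a fake stand-in for a real streaming client, with illustrative timings:

```python
import time

def fake_stream():
    """Stand-in for a streaming LLM response (hypothetical timings)."""
    time.sleep(0.2)                 # model latency before the first token
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)            # inter-token gap
        yield tok

start = time.perf_counter()
ttft = None
n_tokens = 0
for token in fake_stream():
    if ttft is None:
        ttft = time.perf_counter() - start   # time-to-first-token
    n_tokens += 1
total = time.perf_counter() - start

# TPS measured over the tokens after the first;
# total ~= TTFT + output_tokens / TPS
tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
print(f"TTFT: {ttft*1000:.0f}ms, total: {total*1000:.0f}ms, ~{tps:.0f} tok/s")
```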