Evaluating Agent Trajectories
Evaluating an agent on its final answer misses most of the story. An agent that stumbles through 15 wrong steps before reaching the right answer is not the same as one that reaches it in 3 clean steps. This article covers trajectory evaluation: scoring the reasoning path, measuring efficiency, evaluating decision quality at each step, and tracking cost per trajectory.
Quick Reference
- Trajectory = the complete sequence of thoughts, tool calls, and observations an agent makes
- Final-answer-only evaluation misses efficiency, reasoning quality, and cost — evaluate the path too
- Trajectory efficiency: did the agent take unnecessary steps, redundant tool calls, or circular reasoning?
- Decision quality: at each branching point, did the agent make the right choice?
- Cost per trajectory: total tokens consumed across ALL LLM calls in one user interaction
- Use LLM judges to score individual steps and aggregate into trajectory-level metrics
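The last point, aggregating per-step judge scores into a trajectory-level metric, can be sketched as follows. This is a minimal illustration, not a standard: the particular aggregates chosen here (mean, worst step, final step) are one reasonable set of assumptions.

```python
def trajectory_score(step_scores: list[float]) -> dict[str, float]:
    """Aggregate per-step LLM-judge scores (each in [0, 1]) into
    trajectory-level metrics. The choice of aggregates is illustrative."""
    return {
        "mean": sum(step_scores) / len(step_scores),  # overall path quality
        "min": min(step_scores),    # the weakest decision in the path
        "final": step_scores[-1],   # quality of the concluding step
    }

scores = trajectory_score([1.0, 0.5, 0.75])
```

Reporting the minimum alongside the mean matters: a trajectory with one catastrophic step can still have a respectable average.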
Beyond the Final Answer: Why Trajectory Matters
Traditional evaluation asks: 'Did the agent get the right answer?' Trajectory evaluation asks: 'How did the agent get there?' Two agents can produce the same correct answer, but one may have done it in 2 tool calls costing $0.01, while the other used 15 tool calls costing $0.50 with 3 dead-end branches. In production, the path matters as much as the destination because it determines cost, latency, and reliability.
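The cost side of that comparison is straightforward to compute: sum token usage across every LLM call in one user interaction. A minimal sketch, assuming per-call token counts are available from your provider's usage metadata (the prices below are placeholder rates, not any provider's actual pricing):

```python
# Assumed per-1K-token rates in USD; substitute your provider's real pricing.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def trajectory_cost(calls: list[dict]) -> float:
    """Total dollar cost of all LLM calls in a single trajectory."""
    total = 0.0
    for call in calls:
        total += call["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        total += call["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return total

# Two calls in one interaction: the reasoning step and the final answer.
calls = [
    {"input_tokens": 1200, "output_tokens": 300},
    {"input_tokens": 2500, "output_tokens": 450},
]
cost = trajectory_cost(calls)
```

Tracking this per trajectory, rather than per call, is what lets you compare the $0.01 agent against the $0.50 agent directly.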
A trajectory is the complete record of an agent's execution: (1) The initial query, (2) Each reasoning/thinking step, (3) Each tool call with arguments and results, (4) Each intermediate decision or branching point, (5) The final response. Think of it as the agent's execution trace — everything that happened between receiving the query and producing the answer.