Evaluating Agent Trajectories
Evaluating an agent on its final answer misses most of the story. An agent that stumbles through 15 wrong steps before reaching the right answer is not the same as one that reaches it in 3 clean steps. This article covers trajectory evaluation: scoring the reasoning path, measuring efficiency, evaluating decision quality at each step, and tracking cost per trajectory.
Quick Reference
- Trajectory = the complete sequence of thoughts, tool calls, and observations an agent makes
- Final-answer-only evaluation misses efficiency, reasoning quality, and cost — evaluate the path too
- Trajectory efficiency: did the agent take unnecessary steps, redundant tool calls, or circular reasoning?
- Decision quality: at each branching point, did the agent make the right choice?
- Cost per trajectory: total tokens consumed across ALL LLM calls in one user interaction
- Use LLM judges to score individual steps and aggregate into trajectory-level metrics
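The last point, aggregating per-step judge scores into a trajectory-level metric, can be sketched as follows. This is a minimal illustration, not a standard: the particular aggregates chosen here (mean, worst step, final step) are one reasonable set of assumptions.

```python
def trajectory_score(step_scores: list[float]) -> dict[str, float]:
    """Aggregate per-step LLM-judge scores (each in [0, 1]) into
    trajectory-level metrics. The choice of aggregates is illustrative."""
    return {
        "mean": sum(step_scores) / len(step_scores),  # overall path quality
        "min": min(step_scores),    # the weakest decision in the path
        "final": step_scores[-1],   # quality of the concluding step
    }

scores = trajectory_score([1.0, 0.5, 0.75])
```

Reporting the minimum alongside the mean matters: a trajectory with one catastrophic step can still have a respectable average.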
Beyond the Final Answer: Why Trajectory Matters
Traditional evaluation asks: 'Did the agent get the right answer?' Trajectory evaluation asks: 'How did the agent get there?' Two agents can produce the same correct answer, but one may have done it in 2 tool calls costing $0.01, while the other used 15 tool calls costing $0.50 with 3 dead-end branches. In production, the path matters as much as the destination because it determines cost, latency, and reliability.
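The cost side of that comparison is straightforward to compute: sum token usage across every LLM call in one user interaction. A minimal sketch, assuming per-call token counts are available from your provider's usage metadata (the prices below are placeholder rates, not any provider's actual pricing):

```python
# Assumed per-1K-token rates in USD; substitute your provider's real pricing.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def trajectory_cost(calls: list[dict]) -> float:
    """Total dollar cost of all LLM calls in a single trajectory."""
    total = 0.0
    for call in calls:
        total += call["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        total += call["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return total

# Two calls in one interaction: the reasoning step and the final answer.
calls = [
    {"input_tokens": 1200, "output_tokens": 300},
    {"input_tokens": 2500, "output_tokens": 450},
]
cost = trajectory_cost(calls)
```

Tracking this per trajectory, rather than per call, is what lets you compare the $0.01 agent against the $0.50 agent directly.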
A trajectory is the complete record of an agent's execution: (1) The initial query, (2) Each reasoning/thinking step, (3) Each tool call with arguments and results, (4) Each intermediate decision or branching point, (5) The final response. Think of it as the agent's execution trace — everything that happened between receiving the query and producing the answer.