
Trajectory Match: AgentEvals Package

The agentevals package provides formal trajectory evaluation with four match modes (strict, unordered, subset, superset) and LLM-as-judge trajectory scoring.

Quick Reference

  • agentevals is a dedicated package for evaluating agent tool-call trajectories
  • Four match modes: strict (exact order), unordered (same calls, any order), subset, superset
  • LLM-as-judge trajectory evaluator for semantic comparison of agent behavior
  • Works with or without reference trajectories — reference-free for production monitoring
  • Integrates with LangSmith datasets and experiments for systematic evaluation
  • Run `pip install agentevals` to get started

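The evaluators in agentevals consume OpenAI-style chat messages, where each assistant message may carry a `tool_calls` list. As a concrete illustration (a stdlib-only sketch, not the package's own code), here is how a tool-call trajectory can be pulled out of such a message history; field names follow the OpenAI chat format:

```python
import json

def extract_trajectory(messages):
    """Collect (tool_name, args) pairs from assistant tool_calls, in order."""
    trajectory = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        for call in msg.get("tool_calls", []):
            fn = call["function"]
            # Arguments arrive as a JSON string in the OpenAI format
            trajectory.append((fn["name"], json.loads(fn["arguments"])))
    return trajectory

messages = [
    {"role": "user", "content": "How does LangGraph manage state?"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "search",
                      "arguments": json.dumps({"query": "LangGraph state"})}},
    ]},
    {"role": "tool", "content": "...search results..."},
    {"role": "assistant", "content": "LangGraph stores state in a channel-based graph."},
]

print(extract_trajectory(messages))  # [('search', {'query': 'LangGraph state'})]
```

The resulting list of (tool, args) pairs is the trajectory that the match evaluators compare against a reference.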
What Is Trajectory Evaluation?

Definition

Trajectory evaluation assesses not just the final answer but the sequence of actions (tool calls) an agent took to get there. Two agents might produce the same answer, but one made 3 efficient tool calls while the other made 15 redundant ones. Trajectory eval catches this difference.

A trajectory is the ordered list of tool calls an agent made during execution: `[{tool: 'search', args: {query: 'LangGraph state'}}, {tool: 'read_file', args: {path: 'docs.md'}}, ...]`. Trajectory evaluation compares actual trajectories against reference trajectories or evaluates them against quality criteria.
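To make the four match modes concrete, here is a minimal stdlib-only sketch of the semantics each mode applies when comparing an actual trajectory to a reference. This compares tool names only for brevity; it illustrates the idea and is not the agentevals implementation, which also accounts for arguments and message content:

```python
from collections import Counter

def tool_names(trajectory):
    """Extract the ordered list of tool names from a trajectory."""
    return [call["tool"] for call in trajectory]

def trajectory_match(actual, reference, mode="strict"):
    """Compare two tool-call trajectories under one of four modes.

    strict    -- same tool calls in the same order
    unordered -- same tool calls, order ignored
    subset    -- the agent called nothing outside the reference
    superset  -- the agent covered every call in the reference
    """
    a, r = Counter(tool_names(actual)), Counter(tool_names(reference))
    if mode == "strict":
        return tool_names(actual) == tool_names(reference)
    if mode == "unordered":
        return a == r
    if mode == "subset":
        return all(a[t] <= r[t] for t in a)
    if mode == "superset":
        return all(r[t] <= a[t] for t in r)
    raise ValueError(f"unknown mode: {mode}")

actual = [
    {"tool": "read_file", "args": {"path": "docs.md"}},
    {"tool": "search", "args": {"query": "LangGraph state"}},
]
reference = [
    {"tool": "search", "args": {"query": "LangGraph state"}},
    {"tool": "read_file", "args": {"path": "docs.md"}},
]

print(trajectory_match(actual, reference, "strict"))     # False: order differs
print(trajectory_match(actual, reference, "unordered"))  # True: same calls
```

Note how the choice of mode encodes what you consider a failure: a strict evaluator penalizes reordering, while a subset evaluator only penalizes extra, unrequested tool calls such as the 15-call agent above.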