
Trajectory Match: AgentEvals Package

The agentevals package provides formal trajectory evaluation with four match modes (strict, unordered, subset, superset) and LLM-as-judge trajectory scoring.

Quick Reference

  • agentevals is a dedicated package for evaluating agent tool-call trajectories
  • Four match modes: strict (exact order), unordered (same calls, any order), subset, superset
  • LLM-as-judge trajectory evaluator for semantic comparison of agent behavior
  • Works with or without reference trajectories — reference-free for production monitoring
  • Integrates with LangSmith datasets and experiments for systematic evaluation
  • Run `pip install agentevals` to get started

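The evaluators in agentevals consume OpenAI-style chat messages, where each assistant message may carry a `tool_calls` list. As a concrete illustration (a stdlib-only sketch, not the package's own code), here is how a tool-call trajectory can be pulled out of such a message history; field names follow the OpenAI chat format:

```python
import json

def extract_trajectory(messages):
    """Collect (tool_name, args) pairs from assistant tool_calls, in order."""
    trajectory = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        for call in msg.get("tool_calls", []):
            fn = call["function"]
            # Arguments arrive as a JSON string in the OpenAI format
            trajectory.append((fn["name"], json.loads(fn["arguments"])))
    return trajectory

messages = [
    {"role": "user", "content": "How does LangGraph manage state?"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "search",
                      "arguments": json.dumps({"query": "LangGraph state"})}},
    ]},
    {"role": "tool", "content": "...search results..."},
    {"role": "assistant", "content": "LangGraph stores state in a channel-based graph."},
]

print(extract_trajectory(messages))  # [('search', {'query': 'LangGraph state'})]
```

The resulting list of (tool, args) pairs is the trajectory that the match evaluators compare against a reference.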
What Is Trajectory Evaluation?

Definition

Trajectory evaluation assesses not just the final answer but the sequence of actions (tool calls) an agent took to get there. Two agents might produce the same answer, but one made 3 efficient tool calls while the other made 15 redundant ones. Trajectory eval catches this difference.

A trajectory is the ordered list of tool calls an agent made during execution: `[{tool: 'search', args: {query: 'LangGraph state'}}, {tool: 'read_file', args: {path: 'docs.md'}}, ...]`. Trajectory evaluation compares actual trajectories against reference trajectories or evaluates them against quality criteria.
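To make the four match modes concrete, here is a minimal stdlib-only sketch of the semantics each mode applies when comparing an actual trajectory to a reference. This compares tool names only for brevity; it illustrates the idea and is not the agentevals implementation, which also accounts for arguments and message content:

```python
from collections import Counter

def tool_names(trajectory):
    """Extract the ordered list of tool names from a trajectory."""
    return [call["tool"] for call in trajectory]

def trajectory_match(actual, reference, mode="strict"):
    """Compare two tool-call trajectories under one of four modes.

    strict    -- same tool calls in the same order
    unordered -- same tool calls, order ignored
    subset    -- the agent called nothing outside the reference
    superset  -- the agent covered every call in the reference
    """
    a, r = Counter(tool_names(actual)), Counter(tool_names(reference))
    if mode == "strict":
        return tool_names(actual) == tool_names(reference)
    if mode == "unordered":
        return a == r
    if mode == "subset":
        return all(a[t] <= r[t] for t in a)
    if mode == "superset":
        return all(r[t] <= a[t] for t in r)
    raise ValueError(f"unknown mode: {mode}")

actual = [
    {"tool": "read_file", "args": {"path": "docs.md"}},
    {"tool": "search", "args": {"query": "LangGraph state"}},
]
reference = [
    {"tool": "search", "args": {"query": "LangGraph state"}},
    {"tool": "read_file", "args": {"path": "docs.md"}},
]

print(trajectory_match(actual, reference, "strict"))     # False: order differs
print(trajectory_match(actual, reference, "unordered"))  # True: same calls
```

Note how the choice of mode encodes what you consider a failure: a strict evaluator penalizes reordering, while a subset evaluator only penalizes extra, unrequested tool calls such as the 15-call agent above.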