
Tool Use Evaluation

Agents interact with the world through tools. Evaluating tool use means checking whether the agent selected the right tool, passed correct arguments, called tools in the right order, and handled errors gracefully. This article builds a complete tool use evaluator with per-step scoring and production-relevant examples.

Quick Reference

  • Tool selection accuracy: did the agent pick the right tool for the task?
  • Argument correctness: were the tool arguments valid and well-formed?
  • Sequence evaluation: were tools called in a logical, efficient order?
  • Hallucinated tool calls: calling tools that do not exist is a critical failure mode
  • Error handling: how does the agent respond when a tool call fails?
  • Evaluate tool use separately from the final answer: an agent can reach the right answer despite wrong tool usage

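The argument-correctness check above can be made concrete by validating each call's arguments against the tool's declared schema. Here is a minimal sketch under stated assumptions: `TOOL_SCHEMAS`, `check_arguments`, and the dict-based schema format are hypothetical illustrations, not any particular framework's API.

```python
# Hypothetical schema registry: required/optional argument names and types.
TOOL_SCHEMAS = {
    "code_search": {"required": {"query": str}, "optional": {"max_results": int}},
}

def check_arguments(tool_name, args):
    """Return a list of problems with this call's arguments (empty = valid)."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    # Required arguments must be present and correctly typed.
    for key, typ in schema["required"].items():
        if key not in args:
            problems.append(f"missing required argument: {key}")
        elif not isinstance(args[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    # Optional arguments, if present, must also be correctly typed.
    for key, typ in schema["optional"].items():
        if key in args and not isinstance(args[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    # Flag arguments the schema does not declare at all.
    allowed = set(schema["required"]) | set(schema["optional"])
    for key in args:
        if key not in allowed:
            problems.append(f"unexpected argument: {key}")
    return problems
```

Returning a list of problems rather than a boolean makes per-step scoring easier: an empty list scores the step as correct, and the problem strings double as failure annotations in the evaluation report.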
Tool Selection Accuracy

The most fundamental tool use question: given the user's query and the available tools, did the agent select the appropriate tool? A code search agent that calls a web search tool instead of the code search tool has a tool selection error. This is the highest-impact evaluation because a wrong tool selection usually means a wrong answer, no matter how well the subsequent steps execute.

Tool selection evaluation with precision, recall, and hallucination detection
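A minimal sketch of such an evaluator, assuming a labeled trace where `expected` is the set of tool names a reference solution uses, `actual` is the ordered list of names the agent called, and `available_tools` is what the agent was offered (all names here are illustrative, not a real API):

```python
def evaluate_tool_selection(expected, actual, available_tools):
    """Score one agent trace for tool selection quality."""
    actual_set = set(actual)
    # Calls to tools that were never offered are hallucinations.
    hallucinated = actual_set - set(available_tools)
    valid_calls = actual_set & set(available_tools)

    # Precision: how many of the agent's distinct calls were needed.
    # Recall: how many of the needed tools the agent actually called.
    true_positives = valid_calls & set(expected)
    precision = len(true_positives) / len(actual_set) if actual_set else 0.0
    recall = len(true_positives) / len(expected) if expected else 1.0

    bad_calls = [name for name in actual if name not in available_tools]
    return {
        "precision": precision,
        "recall": recall,
        "hallucinated": sorted(hallucinated),
        "hallucination_rate": len(bad_calls) / len(actual) if actual else 0.0,
    }
```

For the code search example above, an agent that calls `web_search` and then `code_search` when only `code_search` was needed scores precision 0.5 and recall 1.0. Averaging these per-trace scores across a dataset gives the aggregate selection metrics, with the hallucination rate tracked as its own number.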
Hallucinated tool calls are critical failures

When an agent calls a tool that does not exist in its available tool set, it reveals a fundamental confusion about its capabilities. Track hallucinated tool call rate separately and treat any hallucination as a critical bug. Common causes: tool names from training data that are not in your actual tool set, or confusing similarly-named tools.
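Both causes can be surfaced in one check. The sketch below (assuming traces are lists of called tool names; `find_hallucinations` is a hypothetical helper) flags every call to a nonexistent tool and uses `difflib` to attach the closest real tool name, so near-miss names, likely confusion between similarly named tools, are distinguishable from pure inventions:

```python
import difflib

def find_hallucinations(called_tools, available_tools):
    """Return (bad_name, closest_real_name_or_None) for each hallucinated call.

    A suggestion being present hints at name confusion between similar
    tools; None hints at a name invented from training data.
    """
    available = sorted(available_tools)
    report = []
    for name in called_tools:
        if name in available_tools:
            continue  # legitimate call, not a hallucination
        close = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
        report.append((name, close[0] if close else None))
    return report
```

For example, with available tools `{"code_search", "web_search"}`, a call to `codesearch` is reported with the suggestion `code_search` (a near-miss), while a call to `fetch_url` is reported with no suggestion (a pure invention). Either way, any nonempty report should fail the trace.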