
Tool Use Evaluation

Agents interact with the world through tools. Evaluating tool use means checking whether the agent selected the right tool, passed correct arguments, called tools in the right order, and handled errors gracefully. This article builds a complete tool use evaluator with per-step scoring and production-relevant examples.

Quick Reference

  • Tool selection accuracy: did the agent pick the right tool for the task?
  • Argument correctness: were the tool arguments valid and well-formed?
  • Sequence evaluation: were tools called in a logical, efficient order?
  • Hallucinated tool calls: calling tools that do not exist is a critical failure mode
  • Error handling: how does the agent respond when a tool call fails?
  • Evaluate tool use separately from the final answer: an agent can reach the right answer despite wrong tool usage

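The argument-correctness check above can be made concrete by validating each call's arguments against the tool's declared schema. Here is a minimal sketch under stated assumptions: `TOOL_SCHEMAS`, `check_arguments`, and the dict-based schema format are hypothetical illustrations, not any particular framework's API.

```python
# Hypothetical schema registry: required/optional argument names and types.
TOOL_SCHEMAS = {
    "code_search": {"required": {"query": str}, "optional": {"max_results": int}},
}

def check_arguments(tool_name, args):
    """Return a list of problems with this call's arguments (empty = valid)."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    # Required arguments must be present and correctly typed.
    for key, typ in schema["required"].items():
        if key not in args:
            problems.append(f"missing required argument: {key}")
        elif not isinstance(args[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    # Optional arguments, if present, must also be correctly typed.
    for key, typ in schema["optional"].items():
        if key in args and not isinstance(args[key], typ):
            problems.append(f"wrong type for {key}: expected {typ.__name__}")
    # Flag arguments the schema does not declare at all.
    allowed = set(schema["required"]) | set(schema["optional"])
    for key in args:
        if key not in allowed:
            problems.append(f"unexpected argument: {key}")
    return problems
```

Returning a list of problems rather than a boolean makes per-step scoring easier: an empty list scores the step as correct, and the problem strings double as failure annotations in the evaluation report.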
Tool Selection Accuracy

The most fundamental tool use question: given the user's query and the available tools, did the agent select the appropriate tool? A code search agent that calls a web search tool instead of the code search tool has a tool selection error. This is the highest-impact evaluation because a wrong tool selection usually means a wrong answer, no matter how well the subsequent steps execute.

Tool selection evaluation with precision, recall, and hallucination detection
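A minimal sketch of such an evaluator, assuming a labeled trace where `expected` is the set of tool names a reference solution uses, `actual` is the ordered list of names the agent called, and `available_tools` is what the agent was offered (all names here are illustrative, not a real API):

```python
def evaluate_tool_selection(expected, actual, available_tools):
    """Score one agent trace for tool selection quality."""
    actual_set = set(actual)
    # Calls to tools that were never offered are hallucinations.
    hallucinated = actual_set - set(available_tools)
    valid_calls = actual_set & set(available_tools)

    # Precision: how many of the agent's distinct calls were needed.
    # Recall: how many of the needed tools the agent actually called.
    true_positives = valid_calls & set(expected)
    precision = len(true_positives) / len(actual_set) if actual_set else 0.0
    recall = len(true_positives) / len(expected) if expected else 1.0

    bad_calls = [name for name in actual if name not in available_tools]
    return {
        "precision": precision,
        "recall": recall,
        "hallucinated": sorted(hallucinated),
        "hallucination_rate": len(bad_calls) / len(actual) if actual else 0.0,
    }
```

For the code search example above, an agent that calls `web_search` and then `code_search` when only `code_search` was needed scores precision 0.5 and recall 1.0. Averaging these per-trace scores across a dataset gives the aggregate selection metrics, with the hallucination rate tracked as its own number.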
Hallucinated tool calls are critical failures

When an agent calls a tool that does not exist in its available tool set, it reveals a fundamental confusion about its capabilities. Track hallucinated tool call rate separately and treat any hallucination as a critical bug. Common causes: tool names from training data that are not in your actual tool set, or confusing similarly-named tools.
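Both causes can be surfaced in one check. The sketch below (assuming traces are lists of called tool names; `find_hallucinations` is a hypothetical helper) flags every call to a nonexistent tool and uses `difflib` to attach the closest real tool name, so near-miss names, likely confusion between similarly named tools, are distinguishable from pure inventions:

```python
import difflib

def find_hallucinations(called_tools, available_tools):
    """Return (bad_name, closest_real_name_or_None) for each hallucinated call.

    A suggestion being present hints at name confusion between similar
    tools; None hints at a name invented from training data.
    """
    available = sorted(available_tools)
    report = []
    for name in called_tools:
        if name in available_tools:
            continue  # legitimate call, not a hallucination
        close = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
        report.append((name, close[0] if close else None))
    return report
```

For example, with available tools `{"code_search", "web_search"}`, a call to `codesearch` is reported with the suggestion `code_search` (a near-miss), while a call to `fetch_url` is reported with no suggestion (a pure invention). Either way, any nonempty report should fail the trace.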