Intermediate · 16 min

Writing Effective Tool Descriptions

Tool descriptions are serialized into every request as the model's only guide for tool selection. This article covers the full anatomy of a production-grade description, the token cost math, disambiguation patterns, schema enforcement with strict mode, and how to measure and debug tool selection quality.

Quick Reference

  • Tool descriptions are serialized into every request as JSON schema — the model reads them at every turn, not just at setup
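  • The Anthropic API adds a tool-use system prompt to every request that includes tools (346 tokens for recent Claude models with tool_choice auto; the exact figure varies by model and tool_choice); each tool schema adds ~100–300 tokens on top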
  • A production-grade description has 6 components: name, purpose, USE WHEN, DO NOT USE WHEN, parameters, and returns
  • DO NOT USE WHEN blocks prevent more wrong calls than positive examples alone — add one to every tool that has a similar counterpart
  • Add strict: true to tool definitions (Anthropic and OpenAI) to prevent hallucinated parameters at no accuracy cost
  • Selection accuracy degrades at 15+ tools regardless of description quality — use deferred loading or prefix-grouped categorization
  • Measure tool selection accuracy with a test suite before shipping — fix descriptions before blaming the model
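
As a minimal sketch of the six-component layout listed above, the following Python assembles one tool definition in the Anthropic-style shape (`name`, `description`, `input_schema`). The tool `search_orders`, its counterpart `get_order_by_id`, and the helper `build_description` are hypothetical names invented for illustration. Placing `strict` at the top level of the definition follows the bullet above; check the exact placement against the current API reference for your provider.

```python
import json


def build_description(purpose, use_when, do_not_use_when, returns):
    """Join the prose components of a tool description into one string."""
    return "\n".join([
        purpose,
        f"USE WHEN: {use_when}",
        f"DO NOT USE WHEN: {do_not_use_when}",
        f"RETURNS: {returns}",
    ])


# Hypothetical tool: the name, fields, and wording are illustrative only.
search_orders = {
    "name": "search_orders",  # component 1: name
    "description": build_description(
        # component 2: purpose
        purpose="Search a customer's orders by email address and status.",
        # component 3: USE WHEN
        use_when="the user asks about their orders but gives no order ID.",
        # component 4: DO NOT USE WHEN (points at the similar counterpart)
        do_not_use_when="the user supplies an order ID; "
                        "call get_order_by_id instead.",
        # component 6: returns
        returns="a list of up to 10 orders with id, status, and total.",
    ),
    "strict": True,  # assumption: top-level placement, per the article's claim
    "input_schema": {  # component 5: parameters
        "type": "object",
        "properties": {
            "customer_email": {
                "type": "string",
                "description": "Email address the orders were placed under.",
            },
            "status": {
                "type": "string",
                "enum": ["open", "shipped", "cancelled"],
                "description": "Order status to filter on.",
            },
        },
        "required": ["customer_email", "status"],
        "additionalProperties": False,
    },
}

# This serialized form is what travels with every request.
print(json.dumps(search_orders, indent=2))
```

Keeping the USE WHEN and DO NOT USE WHEN labels in a fixed order across all tools makes descriptions easier to diff and review, and makes it obvious when a tool with a similar counterpart is missing its exclusion block.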

Why Tool Descriptions Are Your Highest-Leverage Investment

Most engineers treat tool descriptions as lightweight config — a docstring to fill in before moving on. They are not. Every tool definition is serialized into the system prompt and sent with every API request. The description field is the model's only signal for deciding which tool to call, when to call it, and what arguments to pass. Writing a vague description is equivalent to shipping an API without documentation — except the consumer never crashes, it just quietly uses the wrong endpoint.

Token overhead formula — compute this before binding tools
Bad descriptions cost twice

Directly: vague descriptions are often longer yet less useful, so you pay more tokens for lower quality. Indirectly: every wrong tool selection produces an incorrect observation that the model then tries to recover from, at full turn cost. A 5% misrouting rate at 10K turns/day means 500 misrouted turns per day, each costing at least one extra recovery turn, plus the cost of wrong answers reaching users.
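
The misrouting arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and two inputs are assumptions rather than figures from a measurement: one recovery turn per misrouted call, and roughly 1,600 input tokens per turn (the five-tool budget used in this article).

```python
def misrouted_turns(turns_per_day: int, misroute_rate: float) -> int:
    """Number of turns per day that select the wrong tool."""
    return round(turns_per_day * misroute_rate)


def wasted_input_tokens(
    turns_per_day: int,
    misroute_rate: float,
    tokens_per_turn: int,
    recovery_turns: int = 1,  # assumption: one extra turn to recover
) -> int:
    """Input tokens spent per day on recovering from wrong tool calls."""
    extra = misrouted_turns(turns_per_day, misroute_rate) * recovery_turns
    return extra * tokens_per_turn


extra = misrouted_turns(10_000, 0.05)
wasted = wasted_input_tokens(10_000, 0.05, tokens_per_turn=1_600)
print(f"{extra} extra turns/day, {wasted:,} wasted input tokens/day")
```

This counts only input tokens on the recovery turns. It leaves out output tokens and the harder-to-price cost of wrong answers reaching users, so treat the result as a lower bound.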

Figure: Token budget per API call (gpt-5.4 · $2.50/M input · $15/M output)

  • No tools: System 500T + Message 100T = 600T input
  • 5 tools bound: System 500T + Tool schemas ×5 1,000T + Message 100T = ~1,600T input
  • Tool round-trip: …prev input… 600T, then Tool call 50T + ToolMessage 200T = +250T overhead

5 tools add ~1,000 input tokens per call. At $2.50/M that is $0.0025 extra per call; at 10,000 calls/day it is $25/day in schema overhead alone.

Tool schemas are paid upfront on every call, even when no tool is invoked.
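
The budget above reduces to a short calculation you can rerun with your own numbers. This is a rough sketch: the function names are hypothetical, the 200 tokens per tool is simply the figure's 1,000 tokens divided across 5 tools, and real schema sizes should be measured with your provider's token counter.

```python
def input_tokens(system: int, message: int, tools: int, tokens_per_tool: int) -> int:
    """Input tokens for one call: system prompt + tool schemas + user message."""
    return system + tools * tokens_per_tool + message


def schema_overhead_usd(schema_tokens: int, calls_per_day: int,
                        usd_per_million_input: float) -> tuple[float, float]:
    """Return (cost per call, cost per day) of carrying the tool schemas."""
    per_call = schema_tokens * usd_per_million_input / 1_000_000
    per_day = schema_tokens * calls_per_day * usd_per_million_input / 1_000_000
    return per_call, per_day


no_tools = input_tokens(system=500, message=100, tools=0, tokens_per_tool=200)
five_tools = input_tokens(system=500, message=100, tools=5, tokens_per_tool=200)
schema_tokens = five_tools - no_tools  # tokens added by the schemas alone

per_call, per_day = schema_overhead_usd(schema_tokens, 10_000, 2.50)
print(f"{no_tools}T -> {five_tools}T input per call")
print(f"${per_call:.4f} per call, ${per_day:.2f} per day")
```

With the figure's inputs this reproduces 600T versus 1,600T per call and $25/day of schema overhead. The cost scales linearly with both tool count and call volume, which is why trimming or deferring rarely used tools pays off at high traffic.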