Evaluation & Quality/Evaluation Foundations
Intermediate · 11 min

Choosing What to Measure

Not everything that can be measured matters, and not everything that matters can be easily measured. This article builds a framework for selecting the right evaluation metrics for your specific AI system — from task completion and faithfulness to operational metrics like cost and latency.

Quick Reference

  • Task completion rate is the most important metric for any AI system — did it do the job?
  • Faithfulness measures whether the output sticks to provided context (critical for RAG)
  • Operational metrics (latency, cost, tokens) are evaluation metrics too — not just infrastructure concerns
  • Metric hierarchy: when metrics conflict, you need an explicit priority ordering
  • Different use cases demand different metric weightings — a chatbot is not a code generator
  • Composite scorecards combine multiple metrics into a single decision framework
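The scorecard idea in the last bullet can be sketched in a few lines. The metric names and weights below are illustrative assumptions, not a standard; the point is that the weights make your priority ordering explicit instead of implicit.

```python
# Illustrative composite scorecard. Metric names and weights are
# assumptions for this sketch -- choose your own per use case.
WEIGHTS = {
    "task_completion": 0.5,  # highest priority: did it do the job?
    "faithfulness": 0.3,     # does the output stick to provided context?
    "latency": 0.1,          # operational metrics count too
    "cost": 0.1,
}


def composite_score(metrics: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each normalized to [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


score = composite_score({
    "task_completion": 0.9,
    "faithfulness": 0.8,
    "latency": 0.7,
    "cost": 0.6,
})
```

Because the weights are a single dict, a disagreement about priorities becomes a one-line code review instead of an argument after the fact.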

Task Completion: Did the System Do the Job?

Before measuring how well your AI responds, ask the most basic question: did it accomplish what the user wanted? Task completion rate is the single most important metric for any AI system because it directly maps to user value. A beautifully written response that does not answer the question is a failure. A rough but correct answer is a partial success.

Define task completion per use case

Task completion means different things for different systems. For a code assistant, it means the code compiles and passes tests. For a RAG system, it means the answer is found in the retrieved documents. For a customer service bot, it means the issue is resolved without escalation. Write down your definition before measuring.
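One way to make those written definitions concrete is a checker function per use case. The three functions below are hypothetical sketches of the examples above; the substring check in the RAG variant is a crude stand-in for a real grounding check.

```python
# Hypothetical per-use-case completion checks. Each returns True only if
# the system did the job by that use case's own definition.

def code_assistant_completed(output: str, tests_passed: bool) -> bool:
    # Code assistant: non-empty code that compiles and passes tests.
    return bool(output.strip()) and tests_passed


def rag_completed(answer: str, retrieved_docs: list[str]) -> bool:
    # RAG: the answer must be supported by a retrieved document.
    # (Substring containment is a crude stand-in for a real support check.)
    return any(answer.lower() in doc.lower() for doc in retrieved_docs)


def support_bot_completed(resolved: bool, escalated: bool) -> bool:
    # Customer service: issue resolved without escalation to a human.
    return resolved and not escalated
```

Writing the checker forces you to decide edge cases up front: is a resolved-but-escalated ticket a success? The code says no, and now that decision is reviewable.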

Task completion tracking with granular status categories
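A minimal sketch of such tracking, assuming four status categories (completed, partial, failed, refused) and a simple report helper; the names are illustrative, not a fixed schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TaskStatus(Enum):
    COMPLETED = "completed"  # user's goal fully achieved
    PARTIAL = "partial"      # right direction, but incomplete
    FAILED = "failed"        # attempted an answer and got it wrong
    REFUSED = "refused"      # declined to answer; tracked apart from FAILED


@dataclass
class TaskResult:
    task_id: str
    status: TaskStatus


def completion_report(results: list[TaskResult]) -> dict[str, float]:
    """Per-status rates, so refusals stay visible instead of being
    lumped in with failures."""
    counts = Counter(r.status for r in results)
    total = len(results)
    return {s.value: counts.get(s, 0) / total for s in TaskStatus}


results = [
    TaskResult("t1", TaskStatus.COMPLETED),
    TaskResult("t2", TaskStatus.COMPLETED),
    TaskResult("t3", TaskStatus.REFUSED),
    TaskResult("t4", TaskStatus.FAILED),
]
print(completion_report(results))
# {'completed': 0.5, 'partial': 0.0, 'failed': 0.25, 'refused': 0.25}
```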

Notice that we distinguish between 'failed' and 'refused.' A system that refuses to answer a dangerous question is working correctly. A system that refuses a legitimate question has a different problem. Tracking these separately gives you actionable information.