Choosing What to Measure
Not everything that can be measured matters, and not everything that matters can be easily measured. This article builds a framework for selecting the right evaluation metrics for your specific AI system — from task completion and faithfulness to operational metrics like cost and latency.
Quick Reference
- Task completion rate is the most important metric for any AI system — did it do the job?
- Faithfulness measures whether the output sticks to provided context (critical for RAG)
- Operational metrics (latency, cost, tokens) are evaluation metrics too — not just infrastructure concerns
- Metric hierarchy: when metrics conflict, you need an explicit priority ordering
- Different use cases demand different metric weightings — a chatbot is not a code generator
- Composite scorecards combine multiple metrics into a single decision framework
Task Completion: Did the System Do the Job?
Before measuring how well your AI responds, ask the most basic question: did it accomplish what the user wanted? Task completion rate is the single most important metric for any AI system because it directly maps to user value. A beautifully written response that does not answer the question is a failure. A rough but correct answer is a partial success.
Task completion means different things for different systems. For a code assistant, it means the code compiles and passes tests. For a RAG system, it means the answer is supported by the retrieved documents. For a customer service bot, it means the issue is resolved without escalation. Write down your definition before you start measuring.
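Once you have written down a definition, computing the rate is straightforward. Here is a minimal sketch using hypothetical outcome labels (the `Outcome` enum and the example data are illustrative, not a standard library or benchmark):

```python
from collections import Counter
from enum import Enum

class Outcome(Enum):
    """Hypothetical outcome labels for a single task attempt."""
    COMPLETED = "completed"  # the user's goal was achieved
    FAILED = "failed"        # the system tried but did not succeed
    REFUSED = "refused"      # the system declined to attempt the task

def completion_rate(outcomes: list[Outcome]) -> float:
    """Fraction of attempts that accomplished the user's goal."""
    if not outcomes:
        return 0.0
    counts = Counter(outcomes)
    return counts[Outcome.COMPLETED] / len(outcomes)

# Example: 2 of 4 attempts completed the task
results = [Outcome.COMPLETED, Outcome.COMPLETED,
           Outcome.FAILED, Outcome.REFUSED]
print(completion_rate(results))  # 0.5
```

Note that a refusal counts against the completion rate here; whether that is the right call depends on why the system refused, which is the next point.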
It is worth distinguishing between 'failed' and 'refused.' A system that refuses to answer a dangerous question is working correctly; a system that refuses a legitimate question has a different problem. Tracking these categories separately gives you actionable information.