Choosing What to Measure
Not everything that can be measured matters, and not everything that matters can be easily measured. This article builds a framework for selecting the right evaluation metrics for your specific AI system — from task completion and faithfulness to operational metrics like cost and latency.
Quick Reference
- Task completion rate is the most important metric for any AI system — did it do the job?
- Faithfulness measures whether the output sticks to provided context (critical for RAG)
- Operational metrics (latency, cost, tokens) are evaluation metrics too — not just infrastructure concerns
- Metric hierarchy: when metrics conflict, you need an explicit priority ordering
- Different use cases demand different metric weightings — a chatbot is not a code generator
- Composite scorecards combine multiple metrics into a single decision framework
Task Completion: Did the System Do the Job?
Before measuring how well your AI responds, ask the most basic question: did it accomplish what the user wanted? Task completion rate is the single most important metric for any AI system because it directly maps to user value. A beautifully written response that does not answer the question is a failure. A rough but correct answer is a partial success.
Task completion means different things for different systems. For a code assistant, it means the code compiles and passes tests. For a RAG system, it means the answer is supported by the retrieved documents. For a customer service bot, it means the issue is resolved without escalation. Write down your definition before you start measuring.
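Once you have written down a definition, computing the rate is straightforward. Here is a minimal sketch using hypothetical outcome labels (the `Outcome` enum and the example data are illustrative, not a standard library or benchmark):

```python
from collections import Counter
from enum import Enum

class Outcome(Enum):
    """Hypothetical outcome labels for a single task attempt."""
    COMPLETED = "completed"  # the user's goal was achieved
    FAILED = "failed"        # the system tried but did not succeed
    REFUSED = "refused"      # the system declined to attempt the task

def completion_rate(outcomes: list[Outcome]) -> float:
    """Fraction of attempts that accomplished the user's goal."""
    if not outcomes:
        return 0.0
    counts = Counter(outcomes)
    return counts[Outcome.COMPLETED] / len(outcomes)

# Example: 2 of 4 attempts completed the task
results = [Outcome.COMPLETED, Outcome.COMPLETED,
           Outcome.FAILED, Outcome.REFUSED]
print(completion_rate(results))  # 0.5
```

Note that a refusal counts against the completion rate here; whether that is the right call depends on why the system refused, which is the next point.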
It is worth distinguishing between 'failed' and 'refused.' A system that refuses to answer a dangerous question is working correctly; a system that refuses a legitimate question has a different problem. Tracking these categories separately gives you actionable information.