Why AI Evaluation Is Different
Traditional software testing gives you deterministic pass/fail. AI evaluation is fundamentally different: outputs are non-deterministic, quality is subjective, and production data drifts away from your test set. This article reframes your mental model from 'testing' to 'measurement science.'
Quick Reference
- Same prompt + same model + same temperature > 0 can produce different outputs every time
- AI evaluation is measurement, not assertion — you track distributions, not exact matches
- Production data always drifts from your eval set — continuous evaluation is mandatory
- Evaluation must cover correctness, safety, cost, and latency simultaneously
- A single metric is never enough — use composite scorecards
- Start evaluating before you ship, but never stop evaluating after you ship
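The composite-scorecard idea can be sketched in a few lines. This is a minimal illustration, not a standard API: the `EvalRun` fields, the budget defaults, and the budget-relative normalization are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical per-run measurements; field names and budgets are
# illustrative, not from any specific eval framework.
@dataclass
class EvalRun:
    correctness: float  # 0..1, e.g. fraction of rubric criteria met
    safety: float       # 0..1, e.g. fraction of safety checks passed
    cost_usd: float     # dollars per request
    latency_s: float    # seconds per request

def scorecard(run: EvalRun, cost_budget: float = 0.01,
              latency_budget: float = 2.0) -> dict:
    """Normalize each dimension to 0..1 and report them side by side."""
    return {
        "correctness": run.correctness,
        "safety": run.safety,
        # Budget-relative: 1.0 means at or under budget, lower is worse.
        "cost": min(1.0, cost_budget / run.cost_usd),
        "latency": min(1.0, latency_budget / run.latency_s),
    }

card = scorecard(EvalRun(correctness=0.9, safety=1.0,
                         cost_usd=0.02, latency_s=1.5))
print(card)
```

The point of keeping the dimensions separate rather than averaging them into one number: a cheap, fast model that fails safety checks should never look "good overall."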
Non-Determinism: The Same Input Gives Different Outputs
In traditional software, f(x) always returns the same y. You write an assertion, it passes or fails, and you move on. LLMs break this contract fundamentally. Even with temperature set to 0, different providers handle sampling differently — some still introduce tiny variations due to floating-point non-determinism in GPU operations. At any temperature above 0, the same prompt can produce meaningfully different outputs across runs.
Many teams assume temperature=0 guarantees identical outputs. It does not. GPU floating-point arithmetic, batching strategies, and provider-side changes (model updates, infrastructure changes) can all introduce variation. Always design your evaluation to handle variance.
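A simple way to design for variance is to sample the same prompt repeatedly and look at the distribution of outputs instead of asserting on one string. In this sketch, `call_model` is a stand-in for your provider's completion call (any real client can be dropped in); the random choice inside it merely simulates the variation you should expect in production.

```python
import random
from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    # Stand-in for a real API call. Even at temperature=0, real providers
    # can return slightly different phrasings across runs.
    return random.choice(["Paris", "Paris.", "The capital is Paris."])

def output_distribution(prompt: str, n: int = 20) -> Counter:
    """Sample the same prompt n times and tally distinct outputs."""
    return Counter(call_model(prompt) for _ in range(n))

dist = output_distribution("What is the capital of France?")
# Assert on a property of the distribution, not on one exact string.
assert all("Paris" in answer for answer in dist)
```

Note the shape of the final assertion: it checks a semantic property that every acceptable output shares, which survives variance that an exact-match assertion would not.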
This variance is not a bug — it is a feature of probabilistic models. But it means you cannot evaluate a single output. You must evaluate distributions of outputs, which requires running evaluations multiple times and reasoning about statistical properties rather than individual pass/fail results.
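"Reasoning about statistical properties" can be as lightweight as reporting a confidence interval around your pass rate instead of a point estimate. A minimal sketch using the standard Wilson score interval (the run counts are made up for illustration):

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate observed over n runs."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin, center + margin)

# 17 passes out of 20 runs: report the interval, not just "85%".
lo, hi = wilson_interval(17, 20)
print(f"pass rate 0.85, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 20 runs the interval is wide, which is exactly the information a single pass/fail number hides: it tells you whether an observed difference between two prompt versions could plausibly be noise.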