Statistical Rigor for AI
Most AI teams make deployment decisions based on 'it feels better.' This article equips you with the statistical tools to make defensible claims: confidence intervals, hypothesis testing, bootstrap methods, and variance reduction for non-deterministic AI systems.
Quick Reference
- Never report a single eval score — always include confidence intervals
- Bootstrap resampling: resample your eval results with replacement to estimate variance
- For comparing two systems, use paired statistical tests, not just average scores
- Multiple comparisons (testing many models at once) inflate false positive rates — apply Bonferroni correction
- Run the same eval 3-5 times and average to reduce variance from model non-determinism
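The bootstrap idea above fits in a few lines. Here is a minimal sketch of a percentile bootstrap confidence interval for a mean eval score, using only the standard library; the `scores` data is hypothetical (78 passes out of 100 cases):

```python
import random

def bootstrap_ci(results, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores:
    resample the results with replacement, recompute the mean each
    time, and take the middle (1 - alpha) of the resampled means."""
    rng = random.Random(seed)
    n = len(results)
    means = sorted(sum(rng.choices(results, k=n)) / n for _ in range(n_boot))
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])

# Hypothetical eval: 78 of 100 cases pass (1 = pass, 0 = fail).
scores = [1] * 78 + [0] * 22
low, high = bootstrap_ci(scores)
print(f"78% accuracy, 95% CI: [{low:.2f}, {high:.2f}]")
```

On 100 binary outcomes the interval spans roughly ±8 points, which is exactly why a headline "78%" on its own says very little.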
Why You Need More Than 'It Feels Better'
You run your new prompt on 100 eval cases. System A scores 78%. System B scores 82%. You deploy System B. Two weeks later, users complain. What went wrong? The 4-point difference was within the margin of error. You deployed a coin flip as a product decision. This happens constantly in AI teams because engineers are trained to think in deterministic terms: 82 > 78, therefore B is better. But with 100 samples and non-deterministic outputs, that 4-point gap may not be statistically significant.
A 5% score difference on 100 samples has roughly a 30% chance of being noise (not a real improvement). On 500 samples, that drops to about 5%. On 50 samples, it is basically a coin flip. If you are not calculating confidence intervals, you are guessing.
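You can check this kind of claim yourself with a quick simulation. The sketch below (illustrative numbers, not a general calculator) asks: if two systems have *identical* true accuracy, how often does sampling noise alone produce a gap as large as the one in the opening scenario?

```python
import random

rng = random.Random(0)

def false_gap_rate(true_acc=0.80, n_cases=100, gap=0.04, trials=10_000):
    """Simulate two systems with the SAME true accuracy evaluated on
    independent n_cases-sized evals, and count how often random
    sampling alone produces a score gap of `gap` or more."""
    hits = 0
    for _ in range(trials):
        a = sum(rng.random() < true_acc for _ in range(n_cases))
        b = sum(rng.random() < true_acc for _ in range(n_cases))
        if abs(a - b) >= gap * n_cases:
            hits += 1
    return hits / trials

rate = false_gap_rate()
print(f"4-point gaps between identical systems: {rate:.0%} of the time")
```

At 100 cases, gaps of 4+ points between genuinely identical systems are routine, not rare — which is the whole argument for reporting uncertainty.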
- Point estimates (single scores) are meaningless without uncertainty ranges
- Small eval sets amplify randomness — 50 examples can swing 10+ points between runs
- Non-deterministic models add another layer of variance on top of sampling variance
- Stakeholders trust numbers with confidence intervals more than bare percentages
- Statistical rigor prevents the expensive mistake of deploying regressions
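Returning to the opening scenario (System A at 78%, System B at 82%): because both systems ran on the same 100 cases, the right move is a paired analysis of per-case differences. The sketch below bootstraps a CI for the mean difference; the pass/fail overlap between A and B is a hypothetical split chosen to match the 78/82 totals:

```python
import random

def paired_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean per-case score difference (B minus A).
    Pairing by case removes the shared difficulty of each example, so
    only the systems' disagreements contribute variance. If the interval
    contains 0, the gap is not distinguishable from noise."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])

# Hypothetical per-case results for the 78% vs. 82% scenario:
# 70 cases both pass, 8 only A passes, 12 only B passes, 10 both fail.
a = [1] * 70 + [1] * 8 + [0] * 12 + [0] * 10   # 78/100
b = [1] * 70 + [0] * 8 + [1] * 12 + [0] * 10   # 82/100
lo, hi = paired_diff_ci(a, b)
print(f"mean gap +4 points, 95% CI: [{lo:+.2f}, {hi:+.2f}]")
```

For this split the interval straddles zero: the 4-point gap that triggered the deployment would not have survived the paired test.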