
Statistical Rigor for AI

Most AI teams make deployment decisions based on 'it feels better.' This article equips you with the statistical tools to make defensible claims: confidence intervals, hypothesis testing, bootstrap methods, and variance reduction for non-deterministic AI systems.

Quick Reference

  • Never report a single eval score — always include confidence intervals
  • Bootstrap resampling: resample your eval results with replacement to estimate variance
  • For comparing two systems, use paired statistical tests, not just average scores
  • Multiple comparisons (testing many models at once) inflate false positive rates — apply Bonferroni correction
  • Run the same eval 3-5 times and average to reduce variance from model non-determinism
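The bootstrap bullet above can be sketched in a few lines. This is an illustrative standard-library implementation, not a prescribed API; the function name, seed, and 10,000-resample default are choices made for this example:

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean eval score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-example scores with replacement, n_boot times,
    # and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

scores = [1] * 82 + [0] * 18   # 100 graded examples, 82 passes
lo, hi = bootstrap_ci(scores)  # report this interval, not just "82%"
```

The interval here spans well over ten points, which is exactly why a bare "82%" on 100 examples overstates what you know.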

Why You Need More Than 'It Feels Better'

You run your new prompt on 100 eval cases. System A scores 78%. System B scores 82%. You deploy System B. Two weeks later, users complain. What went wrong? The 4-point difference was within the margin of error. You deployed a coin flip as a product decision. This happens constantly in AI teams because engineers are trained to think in deterministic terms: 82 > 78, therefore B is better. But with 100 samples and non-deterministic outputs, that 4-point gap may not be statistically significant.
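To see this concretely, here is a minimal sketch (standard library only; the normal-approximation formula is the textbook one, the function name is ours) that puts 95% intervals around the two scores from the scenario above:

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval for an accuracy score."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    return p_hat - z * se, p_hat + z * se

a_lo, a_hi = proportion_ci(0.78, 100)  # System A: roughly (0.70, 0.86)
b_lo, b_hi = proportion_ci(0.82, 100)  # System B: roughly (0.74, 0.90)
# The intervals overlap almost entirely: the 4-point gap is inside the noise.
```

Each interval is about ±8 points wide at n = 100, so a 4-point gap cannot distinguish the two systems on its own.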

Real numbers from practice

A 5-point score difference on 100 samples has roughly a 30% chance of arising from noise alone (no real improvement). On 500 samples, that drops to about 5%. On 50 samples, it is essentially a coin flip. If you are not calculating confidence intervals, you are guessing.
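Figures like these can be checked with a quick two-proportion z-test. A sketch (standard library only; this is an unpaired test for illustration, and the exact p-values depend on the base accuracies assumed, here 78% vs 83%):

```python
import math

def two_prop_pvalue(p1, p2, n):
    """Two-sided two-proportion z-test, equal sample sizes, normal approx."""
    pooled = (p1 + p2) / 2                 # pooled success rate
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))     # two-sided p-value

p_100 = two_prop_pvalue(0.78, 0.83, 100)  # roughly 0.37: plausibly noise
p_500 = two_prop_pvalue(0.78, 0.83, 500)  # roughly 0.046: borderline significant
```

At n = 100 the same 5-point gap is nowhere near significance; at n = 500 it just crosses the conventional 0.05 line, broadly in line with the figures above.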

  • Point estimates (single scores) are meaningless without uncertainty ranges
  • Small eval sets amplify randomness — 50 examples can swing 10+ points between runs
  • Non-deterministic models add another layer of variance on top of sampling variance
  • Stakeholders trust numbers with confidence intervals more than bare percentages
  • Statistical rigor prevents the expensive mistake of deploying regressions
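The multiple-comparisons point from the Quick Reference comes down to one line of arithmetic: with k comparisons, each test must clear alpha / k instead of alpha. A minimal illustration (the function name and p-values are hypothetical):

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni correction: each test must clear alpha / k."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values], threshold

# Five candidate models compared against one baseline (hypothetical p-values).
# Uncorrected at alpha = 0.05, four of these would look "significant".
significant, threshold = bonferroni([0.04, 0.012, 0.20, 0.008, 0.03])
# Corrected threshold is 0.01, and only the 0.008 result survives.
```

This is deliberately conservative; it trades some statistical power for protection against declaring a winner purely because you ran many comparisons.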