Evaluation & Quality/Human Evaluation & Experimentation
Advanced · 11 min

A/B Testing AI Features

Offline evaluation tells you a system should be better. A/B testing tells you it actually is better for real users. This article covers experiment design for AI features: proper randomization, choosing metrics that matter, guardrail metrics that must not regress, sample size planning, and common pitfalls that invalidate results.

Quick Reference

  • A/B test = randomized controlled experiment: users are randomly assigned to control (old) or treatment (new)
  • Primary metric: the user behavior you want to improve (task completion, satisfaction)
  • Guardrail metrics: safety, latency, cost — must not regress even if primary improves
  • Sample size depends on the minimum detectable effect and baseline conversion rate
  • Run for at least 1-2 weeks to capture weekly usage patterns and avoid novelty effects
  • Never peek at results before the planned analysis time — it inflates false positive rates
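The sample-size bullet above can be made concrete with the standard two-proportion power calculation. A minimal sketch (the function name, and the default alpha = 0.05 and power = 0.80, are illustrative choices, not from the article):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate: float, min_detectable_effect: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm for a two-proportion z-test.

    baseline_rate: current conversion rate (control).
    min_detectable_effect: smallest absolute lift worth detecting.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_detectable_effect ** 2)

# Detecting a 2-point lift on a 40% baseline task-completion rate
# needs roughly 9,500 users per arm:
print(sample_size_per_arm(0.40, 0.02))
```

Note how the required sample size grows quadratically as the minimum detectable effect shrinks: halving the effect you want to detect roughly quadruples the users you need.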

Experiment Design: Control, Treatment, and Randomization

An A/B test compares two versions of your system: the control (current production system) and the treatment (the new version). Users are randomly assigned to one group, and you measure the difference in key metrics. Random assignment is critical — it ensures that any difference in metrics is caused by the system change, not by differences in the user populations.

Randomize at the user level, not the request level

If you randomize per-request, a single user might see both versions in the same session — creating a confusing experience and contaminating your results. Randomize per-user (using a hash of user ID) so each user consistently sees the same version throughout the experiment.

User-level experiment assignment with hash-based randomization
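A minimal sketch of hash-based assignment (the function name and the SHA-256 bucketing scheme are illustrative; any stable hash works). Including the experiment name in the hash key gives each experiment an independent split, so the same users don't always land in the same group across experiments:

```python
import hashlib

def assign_variant(user_id: str, experiment_name: str) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    The assignment depends only on (experiment_name, user_id), so a
    user sees the same variant on every request for this experiment.
    """
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    # Map the hash to one of 100 buckets; buckets 0-49 get treatment.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

# The same user always gets the same variant within an experiment:
assert assign_variant("user-123", "new-ranker") == assign_variant("user-123", "new-ranker")
```

Using buckets rather than a raw 50/50 cut also makes ramp-ups easy: to start the treatment at 10% of users, change the threshold from 50 to 10 without reassigning anyone already in the treatment group.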