Running the experiment

In some respects, this is the easiest part of the experimental process: you are simply collecting data and waiting for the experiment to finish. In other respects, the waiting requires great discipline and patience.

Before running the experiment with the full percentage of traffic from your power analysis, it may be wise to start with a smaller percentage and monitor the data for a day. You can’t draw any conclusions about the null hypothesis, but you can check that nothing is broken and that your change hasn’t had a disastrous effect (for example, nobody is making a purchase any more). In other words, this is a sanity check. Look for warning signs, but bear in mind that there is a lot of statistical noise in the early stages of an experiment.
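As a concrete illustration, here is a minimal sanity-check sketch in Python. The function name, threshold, and day-one numbers are all hypothetical; it flags obvious breakage but is deliberately not a significance test.

```python
def sanity_check(control_conversions, control_visitors,
                 challenger_conversions, challenger_visitors,
                 min_ratio=0.5):
    """Warn if the challenger's conversion rate falls below
    min_ratio * the control's rate. A crude 'nothing is broken'
    check, NOT a hypothesis test (hypothetical threshold)."""
    control_rate = control_conversions / control_visitors
    challenger_rate = challenger_conversions / challenger_visitors
    if challenger_rate < min_ratio * control_rate:
        print(f"WARNING: challenger rate {challenger_rate:.1%} is far below "
              f"control rate {control_rate:.1%}. Investigate for breakage.")
    else:
        print("No obvious breakage (this says nothing about the hypothesis).")

# Example with made-up day-one numbers:
sanity_check(control_conversions=40, control_visitors=1000,
             challenger_conversions=2, challenger_visitors=1000)
```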

Whether you ramp up the traffic or not will depend on how risky you judge your change to be and on the consequences of a poorly performing challenger. Testing a creative in a social media campaign, for example, is likely less risky than changing the UI of a payment process.

Once you are confident there are no big issues (or you have fixed any you identified!), you can start using the full percentage of traffic needed for your required power (remember: power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true).
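If you want to revisit the numbers, a power analysis for a two-proportion test can be sketched with statsmodels. The baseline rate and Minimum Detectable Effect below are illustrative values, not taken from any real experiment.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # e.g. the current conversion rate (assumed)
mde_rate = 0.11        # smallest lift worth detecting (assumed)

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(mde_rate, baseline_rate)

# Solve for the visitors needed per group at the chosen alpha and power
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # false-positive rate
    power=0.80,              # probability of detecting a true effect
    alternative="two-sided",
)
print(f"Visitors needed per group: {n_per_group:.0f}")
```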

Stopping an experiment prematurely (before the duration used in your power analysis) is a common mistake. You must resist the temptation to stop a test early when you see the challenger performing better than the control. Even if the difference is greater than the Minimum Detectable Effect, it may well be a false positive: early in the experiment there are fewer sessions overall, so a handful of extra conversions (clicks, purchases, etc.) in the challenger group has an outsized effect on its conversion rate. Repeatedly checking the results and stopping at the first “significant” difference also inflates the false-positive rate well above your chosen significance level.
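A small simulation makes the danger concrete. Both groups below share the same true rate (an A/A test), yet stopping at the first “significant” daily peek rejects the null far more often than the nominal 5% level. The traffic figures and the pooled z-test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_rate, daily_visitors, days, alpha = 0.10, 500, 20, 0.05
n_sims = 2000

def p_value(c1, n1, c2, n2):
    # Two-sided pooled two-proportion z-test
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (c2 / n2 - c1 / n1) / se
    return 2 * (1 - norm.cdf(abs(z)))

peeking_fp = fixed_fp = 0
for _ in range(n_sims):
    # Cumulative conversions for two groups with the SAME true rate
    a = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    b = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    n = daily_visitors * np.arange(1, days + 1)
    daily_p = [p_value(a[d], n[d], b[d], n[d]) for d in range(days)]
    peeking_fp += min(daily_p) < alpha  # stop at the first significant peek
    fixed_fp += daily_p[-1] < alpha     # test once, at the planned end

print(f"False positives with daily peeking: {peeking_fp / n_sims:.1%}")
print(f"False positives testing once at the end: {fixed_fp / n_sims:.1%}")
```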

For example, say you are testing two creatives in a paid media campaign. After one day, group A has 10 visitors and 1 click (a click-through rate, CTR, of 10%), while group B has 10 visitors and 2 clicks, just one more than A. B therefore has a CTR of 20%, and you might be tempted to conclude that the challenger has doubled the CTR. However, as more visitors arrive, a single click carries less weight. After a few more days, group A has 100 visitors and 10 clicks (a CTR of 10%), while group B has 100 visitors and 11 clicks, still one more than A, giving B a CTR of only 11%. Over time, fluctuations in the conversion rates settle down and they converge to their true, long-term values, because they are averaged over a larger number of visitors. This is called “regression to the mean”.

[Figure: regression to the mean. Observed conversion rates fluctuate early on and converge to their long-term values as traffic accumulates.]
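To see this convergence for yourself, here is a short simulation with a made-up true CTR of 10%: the early running estimates swing wildly before settling near the true value.

```python
import numpy as np

rng = np.random.default_rng(42)
true_ctr, visitors = 0.10, 1000

# One Bernoulli outcome (click / no click) per visitor
clicks = rng.binomial(1, true_ctr, size=visitors)
running_ctr = clicks.cumsum() / np.arange(1, visitors + 1)

for n in (10, 50, 100, 500, 1000):
    print(f"After {n:>4} visitors: observed CTR = {running_ctr[n - 1]:.1%}")
```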

The best way to avoid false positives is to decide, before you start, how long the experiment will run. Then let it run, and don’t stop it until that time has elapsed. Only then should you start to analyse it.
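Fixing the duration up front can be as simple as dividing the required sample size by your expected traffic. The numbers below are hypothetical back-of-the-envelope values; the per-group figure is roughly what the power analysis sketch above returns for its assumed rates.

```python
import math

n_per_group = 7_400              # e.g. from the power analysis (assumed)
groups = 2
daily_eligible_visitors = 1_500  # traffic allocated to the experiment (assumed)

days_needed = math.ceil(n_per_group * groups / daily_eligible_visitors)
print(f"Commit to running the experiment for at least {days_needed} days.")
```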