Analysing the experiment

Over thousands of years, natural selection has made humans very good at seeing patterns where none exist and inferring causal relationships that are not real. Early humans who dismissed a wolf in the forest as just a shadow were far less likely to survive and reproduce. Those who thought they saw a wolf, when it was actually a trick of the light, still ran away. The cost of this false positive was low, and it enabled those humans to pass on their pattern-recognition genes to future generations.

This is easy to demonstrate: if you flip a coin 10 times and get 9 heads, part of you is surprised and tempted to say the coin is rigged, despite knowing that if you flipped it another 90 times the heads and tails counts would even out to around 50 each. So how can you tell whether the value of your key metric in the B group is genuinely different from the A group, or whether any difference is due to random chance (like getting 9 heads)? The answer is to calculate statistical significance, and to think like a scientist when analysing the data.

“The first principle is that you must not fool yourself - and you are the easiest person to fool.” - Richard Feynman

In the previous section we said we could use “null hypothesis significance testing” to assess whether an observed difference in the key metrics is more interesting than what chance could reasonably produce. In other words: how surprising is this data? We want to avoid false positives (type I errors), and the probability of making a type I error is called \(\alpha\) (pronounced ‘alpha’), the significance level. The probability of chance producing the observed difference between A and B (or a greater difference), when the null hypothesis is true, is called the p-value. The smaller the p-value, the less plausible it is that statistical noise alone produced the observed difference. The difference between A and B is statistically significant when the p-value is smaller than the significance level (p-value < \(\alpha\)), ie. the difference is unlikely to occur by chance alone. If the difference is statistically significant (p-value < \(\alpha\)) then we can reject the null hypothesis and accept the alternative hypothesis. If the difference is not statistically significant (p-value > \(\alpha\)) then we cannot reject the null hypothesis, even if the key metrics appear different (ie. the “difference” is probably just statistical noise). This doesn’t necessarily mean there is definitely no difference between A and B; it just means that any real difference is likely smaller than your MDE. To detect a smaller effect, you would need to run another experiment with a larger sample size (ie. more traffic and/or a longer time).

Remember: the p-value is not the probability of your alternative hypothesis being correct; it is the probability of seeing a difference at least as large as the one you observed, assuming the null hypothesis is true. It is just a tool to help you decide whether to reject the null hypothesis.

“In some sense [the p-value] offers a first line of defense against being fooled by randomness, separating signal from noise” – Yoav Benjamini, It’s not the p-values’ fault

The significance level is not measured or calculated; it is chosen in advance. Because we can never completely eliminate false positives (see the “I got the Power” section), there is always a non-zero probability of making a type I error, so we need to set a significance level that represents odds we consider acceptable. Conventionally, 0.05 is chosen for \(\alpha\). The p-value is calculated from the number of people in each variant and the number of people who perform your goal, commonly known as the number of conversions; for example, the number of people who open your email and the number of people who click a Call To Action in your email, like a button offering a deal or a link to “read more”. It is a statistical calculation, but thankfully there are many online calculators you can use (like this one on Experimentation Hub) or Excel sheet templates (like the one in this article).

Since the significance level, \(\alpha\), is the probability of incorrectly rejecting the null hypothesis (when actually it’s true), 1 - \(\alpha\) is the probability of correctly not rejecting the null hypothesis (because it is true). This is known as the confidence level, \(\gamma\) (pronounced 'gamma'). When you declare a statistically significant result it is good form to state the p-value and the significance level you chose in advance (or the confidence level). If we chose 0.05 as our significance level and calculated a p-value of 0.03 from our experimental data, then the p-value is less than 0.05 and we could say: “the difference between the key metrics is statistically significant at a 0.05 significance level, with a p-value of 0.03”. You may also see this written as “the difference between the key metrics is statistically significant at a 95% confidence level, with a p-value of 0.03”. Regardless of how you say it, there is less than a 5% chance that random noise alone would produce a difference this large, so we can reasonably safely reject the null hypothesis and accept the alternative hypothesis.

Examples of p-value calculations and statistical significance

Visitors A: 5000 Conversions A: 600
Visitors B: 5000 Conversions B: 675

p-value = 0.024 (calculated here). The difference in the rate of conversions is statistically significant (at a 0.05 significance level) so we can reject the null hypothesis and accept the alternative hypothesis. The B variant has a 12.5% higher rate of conversion than A.

Visitors A: 5000 Conversions A: 600
Visitors B: 5000 Conversions B: 650

p-value = 0.13. The difference in the rate of conversions is not statistically significant (at a 0.05 significance level). We cannot reject the null hypothesis, so we behave as if there is no difference in the rate of conversion between A and B (the numbers may appear different, but that could easily be statistical noise).

Visitors A: 5000 Conversions A: 600
Visitors B: 5000 Conversions B: 665

p-value = 0.051. The difference in the rate of conversions is not statistically significant (at a 0.05 significance level). We cannot reject the null hypothesis, so we behave as if there is no difference in the rate of conversion between A and B (the numbers may appear different, but we cannot rule out statistical noise).
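If you want to sanity-check an online calculator, the three p-values above can be reproduced with a short script. This is a minimal sketch in Python using scipy (an assumption, not something this article requires); note that calculators differ in whether they use a pooled or unpooled variance estimate, so the third decimal place may vary slightly.

```python
# Minimal two-proportion z-test (pooled variance) for A/B conversion data.
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-tailed p-value for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    return 2 * norm.sf(abs(z))  # area in both tails of the standard normal curve

print(two_proportion_p_value(5000, 600, 5000, 675))  # ~0.024
print(two_proportion_p_value(5000, 600, 5000, 650))  # ~0.13
print(two_proportion_p_value(5000, 600, 5000, 665))  # ~0.05
```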

In the last example, the p-value was only just greater than 0.05. However, a result is either statistically significant or it is not. There is no grey area in between. We cannot say “the results are almost statistically significant” or “the results are trending towards significance”. We cannot say “the key metric in B is leaning towards a slight improvement” or “the key metric in B was greater but it is not statistically significant”. These are all incorrect statements. If the result is not statistically significant then you should behave as if there is no difference in the key metrics (even if the numbers appear different).

We can illustrate this with a coin analogy. Say we took 2 brand new £1 coins and flipped each coin 100 times. The first coin gave 50 heads and 50 tails, while the second coin gave 63 heads. Because we know both coins are the same we would probably dismiss this difference as just a fluke, just statistical noise. But what if it was an experiment testing different paid media creatives? We can calculate the p-value, which is 0.061. This is greater than 0.05 so the difference between the coins is not statistically significant (at a 0.05 significance level). In the case of the coins there is no temptation to say the result is “almost significant”. In experiments where we are testing different variants we must resist any temptation to view the p-value as being “close” to the significance level. If the p-value is greater than the significance level, the result is simply not statistically significant and we cannot reject the null hypothesis.

It is worth taking a moment to look at statistical significance using bell curves because this can help you visualise why two key metrics that appear “different” may actually not be statistically significantly different (even if the p-value is “close” to the significance level).

As an example, say we’re testing email subject lines and our key metric is email open rate (let’s call this rate E: the number of emails opened per number of emails delivered, and let’s call EA the email open rate in our control group A, while EB is the email open rate in our challenger group B which has a different subject line to A). Once our experiment has finished we can plot these rates on a line:

Email open rate

How can we tell whether the difference between EA and EB is statistically significant? In other words, are they far enough apart on the line? The answer is that we look at their bell curves. For that, we need a small diversion into the world of the normal distribution.

The normal distribution

Imagine we were counting the number of people in the street and keeping track of their height. We add a dot on a graph for each person, next to their height. After 20 people the graph might look like this:

Normal Distribution 10's of data points

If we keep counting we’d notice that on average most people are around 165cm tall and far fewer people are 145cm or 185cm tall:

Normal Distribution 100's of data points

If we keep counting thousands of people, and instead of showing each dot we just trace a line over the top of the highest dots, we would see something like this:

Normal Distribution 1000's of data points

This sort of graph is commonly known as a bell curve due to its shape. Another way of looking at this is that we have counted more people who are 165cm tall than people who are 145cm or 185cm tall. So the probability of seeing somebody 165cm tall is greater than the probability of seeing somebody 145cm tall or 185cm tall. In fact, we can use exactly the same curve to model the probability of seeing people with certain heights. This curve is called a normal distribution. It looks identical, but the values on the vertical axis are related to the probability of each height and are calculated using the mean and standard deviation of the data. You can think of the mean as the average, which is 165cm in our case. The standard deviation (\(\sigma\), pronounced 'sigma') is the distance from the mean to the point where the curve goes from convex to concave (shown on the image below as an arrow from the middle, of length 1\(\sigma\)). The standard deviation for our data is 8.33cm. Many natural phenomena can be modelled with the normal distribution, as long as you have a large number of observations (ie. a large sample size). You don’t need to know the formula for the calculation but if you’re really curious it’s \(f(Height) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(Height - \mu)^2}{2 \sigma^2}} \), where \(\mu\) (pronounced ‘mew’) is the mean.

Normal Distribution showing standard deviation

The area under this curve between two heights gives you the probability of seeing someone whose height lies between them. The total area under the curve is 1, because the probability of seeing someone with some height is 100%. There is a 34% probability of seeing someone between 165cm and 173.33cm (the area under the curve between the mean and 1 standard deviation above it). The most unlikely (or surprising) heights are more than 2 standard deviations above or below the mean. Together these two regions have a 5% probability, which is the same as the significance level we chose in the previous section (note: strictly, those regions start 1.96 standard deviations above or below the mean, but 2 standard deviations is a useful approximation).
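As a quick check of these numbers, here is a short sketch using scipy’s normal distribution with the mean (165cm) and standard deviation (8.33cm) from the heights example (scipy being an assumption, not something the article requires):

```python
# Reproduce the probabilities quoted above for the heights example.
from scipy.stats import norm

heights = norm(loc=165, scale=8.33)

# Probability of a height between the mean and one standard deviation above it
print(heights.cdf(165 + 8.33) - heights.cdf(165))   # ~0.34

# Probability of being more than 1.96 standard deviations from the mean (both tails)
print(2 * heights.sf(165 + 1.96 * 8.33))            # ~0.05
```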

Normal Distribution showing probabilities

Conversion rates and the normal distribution

Let’s return to our email subject lines example. This is a little different to the heights example because height is a continuous variable, whereas you either open an email or you don’t. The email open rate (our conversion rate) is an average measure for a sample of people we look at. For each sample we look at, we could see a slightly different open rate (in other words, we could pick a hundred random people and see an open rate of 33% but picking another hundred random people might give us an open rate of 37%). So we build up a picture from many samples drawn from each group (the complete set of people in each group is called the population). With a large enough number of samples, we can model this as a normal distribution. The open rate we measure in the experiment is actually just the most likely rate we would observe if we took lots of random samples of the population. So now, instead of 2 single values for the email open rates we have 2 distributions, each centred on the average conversion rate we measured:

Email open rate with probability distributions

Notice the way the distributions overlap. The larger this overlap, the less likely it is that the difference between the email open rates is statistically significant, because the probability of seeing the other email open rate is reasonably high. The overlap between conversion rate normal distribution curves depends on two things: the size of the test groups (ie. the sample sizes) and the size of the difference in conversion rates (ie. the number of conversions/opens/clicks/purchases/etc). The larger the difference in conversion rates, the smaller the overlap (in other words, it is easier to detect changes that have a large impact on metrics). The larger the sample sizes, the narrower the distributions and the smaller the overlap (in other words, it is easier to detect changes when you have more traffic, but don’t forget that you need to do a Power Analysis to work out the appropriate level of traffic for your experiment).

Comparing distributions with different traffic and conversions

Instead of looking at the difference between email open rates, we need to look at the difference between their normal distributions. Luckily the difference between two normal distributions is also a normal distribution, centred on the difference between the mean conversion rates (ie. \(E_{(B-A)} = EB - EA\)) of the two distributions (with a standard deviation based on the sum of the squared individual standard deviations: \(\sigma_{\small{E_{(B-A)}}}=\sqrt{\sigma_{EA}^2+\sigma_{EB}^2 } \)).
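If you want to convince yourself of that property, a quick simulation works well. This is a sketch using numpy (an assumption) with made-up open rates and standard deviations, purely for illustration:

```python
# Simulate two normal distributions and check that their difference is normal
# with standard deviation sqrt(sd_a**2 + sd_b**2).
import numpy as np

rng = np.random.default_rng(0)
ea = rng.normal(loc=0.30, scale=0.010, size=1_000_000)  # illustrative open-rate distribution for A
eb = rng.normal(loc=0.32, scale=0.012, size=1_000_000)  # illustrative open-rate distribution for B

diff = eb - ea
print(diff.mean())                    # ~0.02   (= 0.32 - 0.30)
print(diff.std())                     # ~0.0156
print(np.sqrt(0.010**2 + 0.012**2))   # ~0.0156 (matches the formula above)
```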

The null hypothesis states that the difference between the mean conversion rates is zero and so the normal distribution of the difference is centred on zero. The alternative hypothesis states that the difference between the means is non-zero. In order for us to reject the null hypothesis (with a significance level of 0.05) this difference must be at least 2 standard deviations (or more correctly 1.96 standard deviations) away from zero so that the p-value is less than 0.05 (or precisely equal to 0.05). Remember that the p-value is the probability of chance producing the observed difference between A and B (or a greater difference) when the null hypothesis is true, and that the probability of measuring a value over Z standard deviations away from the mean is the area underneath the normal distribution curve past Z. So the p-value is the area in the 2 tails of the normal distribution curve. We use both tails because we want to detect whether the email subject line has had any effect, good or bad, so we need to be able to detect whether EB is greater or less than EA (in other words, to get a non-zero difference between the means EB can be greater or less than EA).

Difference between distributions

For more details on this you can read the “Some maths (optional but worth it :-)” section below. The good news is that online calculators (like this one on ExperimentationHub.com) give you the p-value directly from the sample sizes and number of conversions!

Note that it is not as simple as rejecting the null hypothesis whenever EB-EA is non-zero (even though that is what the alternative hypothesis requires) because the rates are not exact, single values. They are the most likely value within a range of other values that have a lower probability. An analogy is tolerance. If you measure the length of 2 official league football fields they are unlikely to be precisely the same length, but within a certain acceptable tolerance (stated upfront, for example 10cm) they are effectively the same. So the difference in the length of those two similar fields could be written as \(L_2 - L_1 = 0 \pm 10cm\). If the difference in length was outside this tolerance we would say the fields were different lengths. Similarly, the “tolerance” for our experiments (at a significance level of 0.05) is a difference in means of less than \(\pm2\) standard deviations of the normal distribution of differences. Anything greater than that would result in a p-value of less than 0.05 and we could reject the null hypothesis and say the rates are different.

An important note on sample sizes

As we can see above, calculating the p-value is based on modelling the measured conversion rates as normal distributions. This is a good approximation as long as the sample size is large enough. A good rule of thumb is to have at least 300 samples in each group for this to be a decent approximation.

The control and challenger groups should also be approximately the same size. This means the distribution of the difference between the means’ normal distributions will be roughly symmetric and more suited to being modelled by a normal distribution itself.

A tale of two tails

In most experiments, we want to know whether our change had any impact on our key conversion rate, either positive or negative. The null hypothesis states that the mean conversion rate in our control group is the same as the mean conversion rate in our challenger group (\(\mu_B = \mu_A\)). The alternative hypothesis states that they are not the same (\(\mu_B \ne \mu_A\)). To reject the null hypothesis, we still need a p-value of less than 0.05, but it doesn’t matter on which side of the normal distribution of the difference of means the observed difference lies. This means we need 0.025 in each tail of the distribution, and the difference between the conversion rates must be at least 1.96 standard deviations (\(1.96\sigma\)) away from the centre (in either direction). This is called a two-tailed test.

Two tailed p-value distribution

If all we care about is whether our change has had a positive impact on the key conversion rate then we can use a one-tailed upper test. This time, we are also predicting a direction of change, and the alternative hypothesis states that the mean conversion rate in our challenger group is greater than the mean conversion rate in our control group (\(\mu_B > \mu_A\)). We still need a p-value of less than 0.05 to reject the null hypothesis, but this time all of that area is in the right tail of the normal distribution of the difference of means. Because this single tail area (0.05) is larger than the 0.025 in each tail of a two-tailed test, the difference only needs to be \(1.64\sigma\) away from the centre. However, if the key conversion rate is not statistically significantly larger, we cannot make any statements about whether it is lower or not. This test only gives us information about one direction.

One tailed upper p-value distribution

Similarly, if all we care about is whether our change has had a negative impact on the key conversion rate (eg. aiming to reduce bounce rate) then the p-value is located in the left tail. The area (ie. the p-value threshold) is still 0.05, so the critical value has the same magnitude but is now negative: the difference must be at least \(1.64\sigma\) below the centre (ie. \(-1.64\sigma\)). However, if the key conversion rate is not statistically significantly smaller, we cannot make any statements about whether it is larger or not. Like the previous one-tailed test, this only gives us information about one direction.

One tailed lower p-value distribution

Which tailed test should you choose?

With a two-tailed test, the power is split between both directions (an increase or a decrease in your key metric), so it is harder to detect a change in any one direction. If you have a hypothesis about the direction of change in your key metric, you could consider using a one-tailed test because it has all its power in one direction, which means it can detect a smaller effect in that direction (for the same sample size). However, you cannot make any conclusions about changes in the other direction, so you risk missing an effect in that direction. The consequences of this depend on your product and your key metric. As an extreme example, let’s return to our cancer analogy: if we were testing a drug that we believed performed better than an existing drug, we could use a one-tailed test, but then we would not be able to detect the possibility that it was actually less effective than the existing drug. However, if our new drug had fewer side effects and all we wanted to test was that it was no less effective than the existing drug (it could be more effective, but that would be a bonus, not the priority), then we could use a one-tailed test. So there are scenarios where a one-tailed test could be appropriate, but if you are in doubt a two-tailed test is the safer choice.
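The 1.96 and 1.64 thresholds quoted above come straight from the inverse of the standard normal CDF. A small sketch (again assuming scipy) that reproduces them:

```python
# Critical Z values at a 0.05 significance level, via the inverse CDF (ppf).
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(1 - alpha / 2))  # ~1.96  : two-tailed threshold (0.025 in each tail)
print(norm.ppf(1 - alpha))      # ~1.64  : one-tailed (upper) threshold
print(norm.ppf(alpha))          # ~-1.64 : one-tailed (lower) threshold
```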

Some maths (optional but worth it :-)

The variance (\(\sigma^2\)) is a measure of how far each value in the data set is from the mean (\(\mu\)), where there are N samples in the data set (ie. N is the sample size).

$$\mu = \frac{\Sigma X}{N}$$ $$\sigma^2 = \frac{\Sigma(X - \mu)^2}{N} = \frac{\Sigma X^2}{N} - \mu^2$$

Since we are talking about conversion rates, the mean (\(\mu\)) is our average conversion rate for the sample. And since people either convert or they don’t, X is either 1 or 0. Since \(1^2 = 1\) and \(0^2 = 0\), we can simplify this formula:

$$\sigma^2 = \frac{\Sigma X^2}{N} - \mu^2 = \frac{\Sigma X}{N} - \mu^2 = \mu - \mu^2 = \mu(1 - \mu)$$

The standard deviation (\(\sigma\)) is the square root of the variance:

$$\sigma = \sqrt{\mu(1 - \mu)}$$
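A quick numerical sanity check of this simplification, sketched with numpy (an assumption; the article itself doesn’t need any code here):

```python
# Check that sigma = sqrt(mu * (1 - mu)) for 0/1 conversion data.
import numpy as np

conversions = np.array([1] * 600 + [0] * 4400)  # e.g. 600 conversions out of 5000 visitors
mu = conversions.mean()                          # 0.12

print(conversions.std())       # standard deviation of the raw 0/1 data
print(np.sqrt(mu * (1 - mu)))  # same value, from the simplified formula
```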

One standard deviation is the distance from the mean to the point of inflection on the normal distribution curve (ie. where the curve goes from convex to concave).

When we look at a test group we are looking at a sample of the whole population. The standard error is the standard deviation of the sampling distribution of the sample mean (ie. it tells us how much the sample mean \(\mu\) would vary from sample to sample):

$$std.err. = \frac{\sigma}{\sqrt{N}} = \sqrt{\frac{\mu(1 - \mu)}{N}}$$

Where \(\sigma\) is the standard deviation of the population. In our experiment, we have 2 sample (ie. ‘test’) groups A and B. So the standard deviation of each sample group (ie. the standard error of each sample group distribution) is \(\sigma_A\) and \(\sigma_B\). The conversion rate for each group (ie. the sample mean for each distribution) is \(\mu_A\) and \(\mu_B\).

Since the conversion rates for each test group are averages based on a sample of data from the whole population there is a level of uncertainty in the value of the mean conversion rate (by ‘uncertainty’ we mean the mean of the sample may differ slightly from the mean of the whole population because we have only looked at a sub-set of all the possible data). If the sample size is large enough (N>300 is a good rule of thumb) then we can model each group’s conversion rate as a normal distribution centred on the measured average conversion rates \(\mu_A\) and \(\mu_B\). The distributions have standard errors (ie. standard deviations for the sample) of:

$$\sigma_{\small{A}} = \sqrt{\frac{\mu_{\small{A}} (1 - \mu_{\small{A}})}{N_A}}$$

and

$$\sigma_{\small{B}} = \sqrt{\frac{\mu_{\small{B}} (1 - \mu_{\small{B}})}{N_B}}$$

When we take the difference of the sample group distributions (ie. effectively the difference between the test groups’ conversion rates) we get another normal distribution with a mean of

$$\mu_{\small{(B-A)}} = \mu_{\small{B}} - \mu_{\small{A}}$$

and a standard error (ie. standard deviation for the difference of sample distributions) of

$$\sigma_{\small{(B-A)}} = \sqrt{\sigma_B^2 + \sigma_A^2}$$

Note that the normal distribution of the difference is wider than each of the original normal distributions because its standard deviation is based on the sum (not the difference) of the original distribution standard deviations.

To determine whether the difference between conversion rates \(\mu_{\small{(B-A)}}\) is large enough to say it could not reasonably be produced by chance alone, and therefore to be able to reject the null hypothesis, we need to calculate the p-value. First, we use a test statistic to convert the observed data (ie. the means and standard deviations/errors) into a single value from which we can calculate the p-value. There are many different test statistics (listed here) that are used in different situations. The most applicable test for us is the Z-test, which can be used when comparing two samples (ie. the control and challenger in our scientifically run experiment) with normal distributions (which we can assume if our sample sizes are large enough), independent observations, and known standard deviations/errors (when measuring conversions, we can determine these from the average conversion rate; when measuring other quantities, we would need to calculate them from the data itself). The formula for the Z-score (the test statistic) is generally written as:

$$Z = \frac{(\bar{x}_1 - \bar{x}_2)-d_\circ}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

Where \((\bar{x}_1 - \bar{x}_2)\) is the difference between the means of samples 1 and 2, \(d_\circ\) is the hypothesised mean difference (which in our case is 0 because the null hypothesis states that there is no difference between the test groups), and \(\frac{\sigma^2}{n}\) is the squared standard error of each sample. So, for our experiments we can summarise this in terms of the normal distribution of the difference between the sample mean distributions:

$$Z = \frac{\mu_{\small{(B-A)}}}{\sigma_{\small{(B-A)}}}$$

The Z-score can be thought of as the distance from the null hypothesis mean difference to the mean difference measured in the experiment, in units of standard error (ie. units of standard deviation). To convert this to a p-value, which is a probability, we calculate the area under the standard normal distribution curve (which is a normal distribution with a mean of 0 and a standard deviation of 1) for values greater than Z and less than -Z (ie. the areas in the tails of the curve beyond these Z values). This area calculation (the cumulative standard normal distribution function) is complex, but luckily programs like Excel (NORM.S.DIST) can do it for us, or we can look up tables of pre-computed values (for example, we know that a Z-score of 1.96 gives a p-value of 0.05 in a two-tailed test). Normally, though, we use online calculators that give us the p-value directly from the sample sizes and number of conversions!

Note that the cumulative standard normal distribution function (CDF) gives the area under the curve from \(-\infty\) to Z. We actually need the remaining area (ie. 1 - CDF) because the p-value is the area beyond Z (ie. in the tail of the curve). In a two-tailed test the p-value is double this (ie. \(2\times(1-\text{CDF})\)) because a non-zero difference in the mean can be achieved if \(\mu_B \gt \mu_A\) (positive Z, right-hand tail) or \(\mu_B \lt \mu_A\) (negative Z, left-hand tail).
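To make the steps concrete, here is the full calculation for the first worked example above (600/5000 vs 675/5000), following the formulas in this section (ie. unpooled standard errors). It is a sketch that assumes scipy for the normal CDF; the pooled calculation shown earlier gives a very similar answer.

```python
# Step-by-step Z-score and two-tailed p-value using the formulas above.
from math import sqrt
from scipy.stats import norm

N_A, conv_A = 5000, 600
N_B, conv_B = 5000, 675

mu_A = conv_A / N_A                        # sample mean (conversion rate) for A
mu_B = conv_B / N_B                        # sample mean (conversion rate) for B

se_A = sqrt(mu_A * (1 - mu_A) / N_A)       # standard error of A's sample distribution
se_B = sqrt(mu_B * (1 - mu_B) / N_B)       # standard error of B's sample distribution

se_diff = sqrt(se_A**2 + se_B**2)          # standard error of the difference
z = (mu_B - mu_A) / se_diff                # Z-score

p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # 2 x (1 - CDF), as described above
print(z, p_two_tailed)                     # ~2.25, ~0.024
```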

What next?

Let’s recap: you’ve crafted a hypothesis based on insights and data, you’ve designed an experiment that will test your hypothesis, you’ve run it for a full business cycle, and you’ve just calculated the p-value. Now your decision essentially boils down to this rule:

For a significance level of 0.05, either:

If p < 0.05 then behave as if the challenger has caused an impact and the difference in the data is real. We can reject the null hypothesis  

Or:

If p > 0.05 then behave as if the challenger has had no effect and any difference in the data is just noise. We cannot reject the null hypothesis  

Note: if the p-value is precisely equal to 0.05 then we can also reject the null hypothesis (ie. p <= 0.05). It is highly unlikely that the p-value will be exactly 0.05 though, so most of the time you will just need to consider whether the p-value is less than or greater than the significance level.
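Written as code, the rule (including the edge case above) is just a comparison. This is a sketch with a hypothetical helper name, not a library function:

```python
# The decision rule, written out explicitly.
def decision(p_value, alpha=0.05):
    if p_value <= alpha:
        return "Reject the null hypothesis: behave as if the challenger caused the difference."
    return "Cannot reject the null hypothesis: behave as if any difference is just noise."

print(decision(0.024))  # statistically significant at alpha = 0.05
print(decision(0.13))   # not statistically significant
```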

Sticking to the rule above will keep you right in the vast majority of cases. As you know though, the p-value is not the probability of your alternative hypothesis being correct; it is the probability of chance alone producing a difference at least as large as the one observed. So even if the p-value is less than your significance level it doesn’t guarantee that the challenger caused an effect (it could still be a false positive). Similarly, even if the p-value is greater than your significance level it doesn’t necessarily mean there is definitely no difference between A and B; it just means you could not detect one.

“A wise man once said that you should never believe a thing simply because you want to believe it.” - Tyrion Lannister

At this point it can be very tempting to try and make the data fit with what you want to believe (otherwise known as confirmation bias). Or to look at other metrics, interactions, behaviours, or flows, to find one that could justify keeping the challenger (also known as HARKing, Hypothesising After the Results are Known). Or to dig into specific segments (like devices, markets, new/returning visitors, etc) to find one where there appears to be a beneficial effect. As Colin McFarland puts it:

“There are many ways you can arrive at the wrong conclusion” – Colin McFarland

This is often fed by the pressure of needing to be “successful”, or of not wanting to have “wasted” your time. It is further compounded by some particularly unhelpful language around A/B testing. “Verify” and “validate” are two of the most dangerous words you can use when running experiments, doing product discovery, or testing assumptions and MVPs. They assume that our ideas are correct and that the point of running an experiment is to find evidence to back them up. In reality, at least 80% of experiments will have no beneficial impact on our key metrics. We shouldn’t think of our experiments as “successes” or “failures”. Each one generates a wealth of valuable insights, even if you cannot reject the null hypothesis. So as long as you learn something, use that new knowledge to run a more informed experiment, and share your results, then you’re not wasting anything. (You can read more about failure here.)

"Don't fear failure. Not failure, but low aim, is the crime. In great attempts, it is glorious even to fail." - Bruce Lee

Sharing your results is one of the most important steps in A/B testing. The results of your experiment will give your colleagues new insights, and increase their understanding of your customers, product, and market. As a result, they will be able to make better decisions, run better experiments, and not repeat work. Additionally, they may ask you questions or raise things that you didn’t think about in your first analysis, which will increase your own understanding of the experiment. An A/B test should not be considered finished until you have clearly and concisely written up the results and shared them with all key stakeholders. Ideally the experimental analysis and conclusions should be put in a single repository so that anyone in your company can easily find historical experiments. This becomes particularly important over time, as new people join your company, want to get up to speed, and bring ideas that may have already been tested. Note that this doesn’t imply you should never run the same experiment again (things do change over time), but the insights from the original experiment will help you prioritise the idea against other opportunities and refine any future experiment.

Prove yourself wrong

Before you share your results, you need to be confident you have done a thorough analysis and drawn objective conclusions. The key word here is objective. The IKEA effect is very common in experimentation: you value your idea more than other people do because you’ve put a lot of effort into creating it (be it a bookcase, a feature, or a marketing campaign). When analysing your experiment, you need to try and remain impartial. Humans have many cognitive biases and it’s easy to ignore inconvenient data. However, as Aldous Huxley says:

“Facts do not cease to exist because they are ignored” – Aldous Huxley

We need to be extra vigilant when analysing experiment results. Twyman’s Law is a useful rule of thumb to bear in mind: “Any data that appears interesting is almost certainly a mistake.” In other words, if it seems too good to be true…it probably is!

This is a challenge research scientists face every day. One method they use is to try and prove that the result is wrong. They analyse the data in lots of different ways, searching for reasons to show the conclusion is incorrect. If they can’t find any data to the contrary then they can happily accept the results. You could simulate this by pretending the A/B test is someone else’s experiment and you don’t believe their conclusions, so you look for ways it might be wrong. In other words, employ a healthy dose of scepticism towards your own conclusions!

Another approach they take is to ask whether the result makes sense. Instead of just accepting it at face value, try to understand everything around it. Are there different, independent ways of measuring the same thing (so you can double-check the conclusion)? If the change is real, what are its consequences (directly, and indirectly in other areas) and can you see those consequences in the data? If an overall funnel conversion has changed, what does that look like at a more fine-grained, step-by-step view? Does the data in that view seem reasonable? Has the flow or behaviour of people in related funnels changed? Are people taking a different path through your product, and is that new path beneficial or detrimental?

Conversion rates can sometimes hide real, meaningful numbers. If checkout funnel conversion has increased by 10%, how many additional purchases does that mean you actually got? Is that a believable number of people buying something per day? If it seems surprisingly high, perhaps you should double check your figures.

It’s good to get into the habit of asking these sorts of probing questions. You can even take this one step further and ask a colleague to impartially review your conclusions. In academia, a number of other experts in your field must review your results before they can be published in a scientific journal. This is known as Peer Review. It is one of the foundations of scientific research because it maintains high standards of quality and integrity.

Another good way of understanding your experiment data and informing your conclusions is to run usability testing. Watching how participants behave in the challenger and comparing that to how you know people use the product normally can give you valuable insights.

Some common pitfalls

Post-experiment segmentation (a.k.a. “It’s statistically significant for people on iPhones in Corsica”)

Once an experiment has finished it’s important to understand how it performed with different segments of your customers. An increase in your key conversion rate overall may be due to a single market, or to returning users only. Similarly, no change in your key metric overall may hide the fact that the change benefited mobile users but had a detrimental effect on the tablet experience. However, the risk here is that you look at lots of different slices, notice one that appears to be statistically significant (p < 0.05), and incorrectly claim the challenger caused a change in the key metric for that segment. Remember that with a significance level of 5%, there is a 5% chance of seeing a “statistically significant” difference between A and B purely by chance (ie. statistical noise). So the more segments you look at, the more likely you are to see a false positive (there is a 5% chance each time you look, and these chances accumulate). We can still dig into the data to understand the behaviour of different customers, but if you see something interesting you should repeat the experiment for that particular segment only (including a power calculation for that segment only).
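Loosely, the chance of at least one false positive across k segment checks is 1 - (1 - 0.05)^k. The sketch below assumes each segment’s test is independent, which is a simplification, so treat it as a rough illustration rather than an exact figure:

```python
# How false positives accumulate as you check more segments.
alpha = 0.05
for segments in (1, 5, 10, 20):
    at_least_one_false_positive = 1 - (1 - alpha) ** segments
    print(segments, round(at_least_one_false_positive, 2))
# 1 segment   -> 0.05
# 5 segments  -> 0.23
# 10 segments -> 0.40
# 20 segments -> 0.64
```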

Under-powered tests (a.k.a. “If I run this for longer, I’ll definitely see a change”)

We are normally more concerned with mistakenly claiming the challenger caused an effect (false positive), than we are with mistakenly missing an effect (false negative). However, if you see no statistically significant change in your key metric (p > 0.05) then either a) the challenger had no impact on your key metric, or b) there were not enough sessions in your experiment to see any change. The latter is known as an under-powered experiment because in the power analysis we chose to expose a certain number of users which meant the Minimum Detectable Effect we accepted was higher than the actual, real effect. So, we couldn’t detect the smaller change (ie. a change smaller than the MDE) and we couldn’t reject the null hypothesis.

It is always tempting to re-run an experiment with a larger number of users if the first one didn’t show a statistically significant change in your key metric. However, we need to be wary of the IKEA effect and confirmation bias: if the true effect was at least as large as your MDE, there was only a 20% chance of a false negative (assuming 80% power). Furthermore, even if the re-run experiment gave us a statistically significant result, the change in the conversion rate would be smaller than your original MDE (remember, we need fewer sessions to detect large changes in your key metric). So, you should ask yourself whether you care about launching a feature that only has a small effect (given that it will take time to run and analyse the experiment, and that the change probably increases code and product complexity). Would your time, your designer’s time, and your engineers’ time be better spent on potentially higher impact work?

Multiple challengers (a.k.a. “If I test 7 things, one of them will definitely work”)

Deciding what change to test is hard. It requires a lot of work to understand the problem you want to solve, and to gain insights into the best way to solve it. Hedging your bets and testing multiple challengers at the same time would appear to be a good way of moving quickly. Unfortunately, the more variants you have (ie. an A/B/C/D/etc test), the more likely you are to get false positives.

With 1 challenger (a normal A/B test) there is a 5% chance of any difference between the control and the challenger being a false positive (at a significance level of 5%). With more than 1 challenger, each comparison against the control carries its own 5% chance of a false positive, and those chances accumulate, so the likelihood that at least one of the challengers shows a false positive is much higher. So, in a similar way to post-experiment segmentation, by looking at multiple options we have increased our likelihood of wrongly rejecting the null hypothesis.

There are a number of “corrections” you can choose to apply to the significance threshold when you have multiple variants, to reduce the risk of false positives. The most common is the Bonferroni correction, which for m comparisons requires a p-value below \(\alpha/m\) for statistical significance. For example, at a significance level of 5%, if you have 3 challengers each compared against your control (m = 3), then to be able to reject the null hypothesis for any one of them you need a p-value below 0.05/3 ≈ 0.017. This is much lower than in an A/B test (with 1 challenger), so the criterion for statistical significance is much stricter and your chance of detecting an effect is greatly reduced. Not only that, but now that we have split traffic across more variants, we have fewer sessions per challenger, so the Minimum Detectable Effect we can see is much higher, and we require our variants to be highly impactful to see any effect. This is unlikely, and it also defeats the point of having multiple variants (if we were confident of having a high-impact change then we wouldn’t have needed to test lots of other variants)!
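As a sketch of how a Bonferroni-corrected check might look (the function name is hypothetical, not from any library):

```python
# Bonferroni-corrected significance check for multiple challenger-vs-control comparisons.
def bonferroni_significant(p_value, alpha=0.05, comparisons=1):
    """Each comparison must beat alpha / comparisons to count as significant."""
    return p_value <= alpha / comparisons

print(bonferroni_significant(0.03, comparisons=1))  # True: fine for a single A/B test
print(bonferroni_significant(0.03, comparisons=3))  # False: 0.03 > 0.05 / 3 ~= 0.017
```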

In practice, you are better off limiting yourself to 1 or 2 challengers. Use your qualitative insights and previous experiments’ data to narrow down what you want to test to the most likely candidates (and largest impacts). This will also save time and effort when designing and developing the challenger(s) and when analysing them after. In other words, taking a scattershot approach is a false economy.