The probability of making a type I error. Also known as the significance level.

A prediction, based on quantitative or qualitative insight, that a proposed change to your product will cause a specific, measurable change in a metric.

The probability of making a type II error.

The length of your business' natural browsing/purchasing cycle.

Also known as the treatment, this is the B in A/B testing. The version of your experiment with a specific change that you want to test.

The probability of *not* making a type I error. Confidence level = (100% - Significance level)

The temptation to look at data that supports the hypothesis while ignoring data that would argue against it.

The A in A/B testing. The version of your experiment that has no change in it so that you can compare the metrics in A with the metrics in B (the Challenger).

This happens when the null hypothesis is false and we fail to reject it. In other words, we fail to detect an effect that is present. Consider the following analogy where we have taken a walk in the forest and the null hypothesis is: "there is no wolf in the forest". So a false negative would be failing to see a wolf (when actually there *is* a wolf there).

This happens when the null hypothesis is true and we incorrectly reject it. In other words, we are detecting an effect that's not actually there. Consider the following analogy where we have taken a walk in the forest and the null hypothesis is: "there is no wolf in the forest". So a false positive would be claiming we saw a wolf (when actually there is *no* wolf).

Hypothesising After the Results are Known: the act of forming or changing a hypothesis after having seen the results and presenting that as the original hypothesis.

The Minimum Detectable Effect (MDE) is calculated by doing a power analysis. It is the smallest statistically significant change we can measure between the control and the challenger, given the sample size and significance level.

This is the hypothesis that the change you are testing (ie. B) will have no effect. Therefore there will be no difference between the results of A (the control) and B (the challenger).

The method of running an experiment to compare the results of the control (A) with those of the challenger (B) to determine if we can reject the null hypothesis (and therefore conclude there is an effect).

In addition to predicting an effect you must also predict a certain direction (eg. an *increase* in conversion rate). This means you are completely disregarding the possibility of a change in the other direction. In a one-tailed test all of your significance level (α) is allotted to the predicted direction, which means you need a smaller difference between (for example) conversion rates to reject the null hypothesis. However, it is easy to fall into the trap of confirmation bias and you should only use one-tailed tests in rare circumstances. It is much safer to use a two-tailed test.

The p-value is calculated statistically from the measured experimental data (using a standard normal distribution). It represents the probability that an observed measurement (eg. the difference between conversion rates in A and B) occurred by chance.
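As a rough sketch of how such a p-value can be computed for conversion rates, the following assumes a pooled two-proportion z-test (the original calculator's exact method is not shown here, so the function name and approach are illustrative):

```python
import math
from statistics import NormalDist

def two_tailed_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-tailed p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test and the standard normal distribution."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Probability of a result at least this extreme, in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

For example, 200 conversions from 1,000 visitors in A versus 250 from 1,000 in B gives a p-value of roughly 0.007, comfortably below the conventional 5% threshold.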

The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. In other words, the ability of an A/B test to detect a difference between the two groups if that difference actually exists. The power of a statistical test is (1-β), where β is the probability of making a type II error.
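As an illustration, power can be approximated for a conversion-rate test with a normal approximation; this sketch (the function name and formula choice are assumptions, not a specific tool's implementation) computes (1-β) for a two-tailed two-proportion z-test:

```python
import math
from statistics import NormalDist

def power_of_test(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-tailed two-proportion z-test:
    the probability of detecting a true difference between rates p_a and p_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96 at 5%
    se = math.sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_effect = abs(p_b - p_a) / se
    return NormalDist().cdf(z_effect - z_alpha)
```

With a 5% rate in the control, a true 6% rate in the challenger and 8,000 visitors per group, the power works out at just under 80%; more visitors per group means higher power.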

A calculation that gives you the minimum sample size required to be reasonably confident of minimising inherent statistical errors (false positives and false negatives) and of detecting an effect of a given size (eg. a 3% increase in conversion rate).
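A simple power analysis along these lines can be sketched by inverting the power formula for a two-proportion z-test. This is a minimal illustration (the function name is hypothetical, and the MDE here is an *absolute* lift in conversion rate):

```python
import math
from statistics import NormalDist

def min_sample_size_per_group(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate minimum visitors per group for a two-tailed
    two-proportion z-test to detect an absolute lift of `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)          # e.g. 0.84 for 80% power
    p_b = baseline_rate + mde
    variance = baseline_rate * (1 - baseline_rate) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)
```

For a 5% baseline conversion rate and a 1 percentage point MDE, at the conventional 5% significance level and 80% power, this gives roughly 8,200 visitors per group; note that a larger MDE requires far fewer visitors.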

The number of people who see either the control (A) or the challenger (B).


Also known as α. It is the probability of making a type I error. It is not measured from the observed data but is chosen as an acceptable threshold before the experiment begins. Conventionally the significance level is taken as 5%. If the p-value is less than the significance level then the results are statistically significant.

Statistical significance helps you understand how compelling your experimental data is and whether you can reject the null hypothesis. The result of an experiment is statistically significant if it is unlikely to occur by chance alone. Conventionally, if the p-value is less than 5% then the results are statistically significant and we reject the null hypothesis.

Another word for the challenger.

In a two-tailed test you only predict that there will be an effect, not the direction of the effect (eg. "conversion rate will change", *not* "conversion rate will increase"). Your significance level is split equally between each direction, which means you need a larger difference between (for example) conversion rates to reject the null hypothesis. However, it is a much more rigorous test than a one-tailed test.

Experimentation Hub was created by Rik Higham, who is a Senior Product Manager at Skyscanner.

Read Rik's Medium posts on experimentation and Product Management here.

Copyright © Rik Higham 2016 - 2017