
α

The probability of making a type I error. Also known as the significance level.

Alternative hypothesis

A prediction based on quantitative or qualitative insight that a proposed change in your product will cause a specific impact and measurable change in a metric.

β

The probability of making a type II error.

Business period

The length of your business' natural browsing/purchasing cycle.

Challenger

Also known as the treatment, this is the B in A/B testing. The version of your experiment with a specific change that you want to test.

Confidence level

The probability of not making a type I error. Confidence level = 100% - significance level, so a 5% significance level corresponds to a 95% confidence level.

Confirmation bias

The temptation to look at data that supports the hypothesis while ignoring data that would argue against it.

Control

The A in A/B testing. The version of your experiment that has no change in it so that you can compare the metrics in A with the metrics in B (the Challenger).

False negative

This happens when the null hypothesis is false and we fail to reject it. In other words we fail to detect an effect that is present. Consider the following analogy where we have taken a walk in the forest and the null hypothesis is: "there is no wolf in the forest". So a false negative would be failing to see a wolf (when actually there is a wolf there).

False positive

This happens when the null hypothesis is true and we incorrectly reject it. In other words we are detecting an effect that's not actually there. Consider the following analogy where we have taken a walk in the forest and the null hypothesis is: "there is no wolf in the forest". So a false positive would be claiming we saw a wolf (when actually there is no wolf).

HARKing

Hypothesising After the Results are Known: the act of forming or changing a hypothesis after having seen the results and presenting that as the original hypothesis.

Minimum Detectable Effect

The Minimum Detectable Effect (MDE) is calculated by doing a power analysis. It is the smallest difference between the control and the challenger that the experiment can reliably detect as statistically significant, given the sample size, significance level and power.
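As a rough illustration, the MDE for a conversion-rate experiment can be approximated with the normal distribution. This is a minimal sketch, not a definitive method: the function name, the example numbers and the simplifying assumption that both groups have roughly the baseline variance are all hypothetical choices made for this example.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(base_rate, n_per_group, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-tailed test on two conversion rates.

    Assumption: both groups have roughly the variance of the baseline rate.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.inv_cdf(power)           # quantile matching the desired power (1 - beta)
    std_err = sqrt(2 * base_rate * (1 - base_rate) / n_per_group)
    return (z_alpha + z_beta) * std_err


# Made-up example: 5% baseline conversion rate, 10,000 people per group.
mde = minimum_detectable_effect(base_rate=0.05, n_per_group=10_000)
print(f"Approximate MDE: {mde:.4f} (absolute change in conversion rate)")
```

Under this approximation, quadrupling the sample size halves the MDE, which is why small effects need much larger experiments.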

Null hypothesis

This is the hypothesis that the change you are testing (i.e. B) will have no effect. Therefore there will be no difference between the results of A (the control) and B (the challenger).

Null hypothesis significance testing

The method of running an experiment to compare the results of the control (A) with those of the challenger (B) to determine if we can reject the null hypothesis (and therefore conclude there is an effect).

One-tailed test

In addition to predicting an effect you must also predict a certain direction (e.g. an increase in conversion rate). This means you are completely disregarding the possibility of a change in the other direction. In a one-tailed test all of your significance level (α) is allotted to the predicted direction, which means you need a smaller difference between (for example) conversion rates to reject the null hypothesis. However, it is easy to fall into the trap of confirmation bias and you should only use one-tailed tests in rare circumstances. It is much safer to use a two-tailed test.

P-value

The p-value is calculated statistically from the measured experimental data (typically using a standard normal distribution). It is the probability of observing a result at least as extreme as the one measured (e.g. the difference between conversion rates in A and B) if the null hypothesis were true. It is not the probability that the null hypothesis is true.
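For conversion rates, one common way to get a p-value is a two-proportion z-test. The sketch below assumes that approach; the function name and the conversion counts are hypothetical and only serve to illustrate the calculation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null hypothesis is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up example: 500/10,000 conversions in A versus 560/10,000 in B.
p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"p-value: {p:.3f}")
```

With these made-up counts the p-value lands slightly above 0.05, so at a 5% significance level the null hypothesis would not be rejected even though B converted better in the sample.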

Power

The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. In other words, the ability of an A/B test to detect a difference between the two groups if that difference actually exists. The power of a statistical test is (1-β), where β is the probability of making a type II error.
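To make (1-β) concrete, here is a minimal sketch that approximates the power of a two-tailed z-test on two conversion rates. The function name and example rates are hypothetical, and the approximation ignores the very small chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(base_rate, challenger_rate, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-tailed z-test on two conversion rates."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    std_err = sqrt(
        base_rate * (1 - base_rate) / n_per_group
        + challenger_rate * (1 - challenger_rate) / n_per_group
    )
    effect = abs(challenger_rate - base_rate)
    # Chance that the observed z-score clears the critical value
    # in the direction of the true effect.
    return 1 - norm.cdf(z_crit - effect / std_err)


# Made-up example: true rates of 5% (A) and 6% (B), 10,000 people per group.
power = power_two_proportions(0.05, 0.06, n_per_group=10_000)
print(f"Approximate power: {power:.2f}")
```

Running the same function with a smaller `n_per_group` shows how quickly power drops when an experiment is under-sized.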

Power analysis

A calculation that gives you the minimum sample size required to be reasonably confident of minimising inherent statistical errors (false positives and false negatives) and of detecting an effect of a given size (e.g. a 3% increase in conversion rate).
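A power analysis for conversion rates can be sketched with the standard normal-approximation formula for comparing two proportions. The function name, default values (5% significance level, 80% power) and example rates are assumptions made for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(base_rate, challenger_rate, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-tailed test on two conversion rates."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # controls false positives (type I errors)
    z_beta = norm.inv_cdf(power)           # controls false negatives (type II errors)
    variance = (
        base_rate * (1 - base_rate)
        + challenger_rate * (1 - challenger_rate)
    )
    effect = challenger_rate - base_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Made-up example: detect a change from a 5% to a 6% conversion rate.
n = sample_size_per_group(base_rate=0.05, challenger_rate=0.06)
print(f"People needed in each of A and B: {n}")
```

Because the effect size is squared in the denominator, halving the effect you want to detect roughly quadruples the required sample size.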

Sample size

The number of people who see either the control (A) or the challenger (B).

Significance

The probability of making a type I error.

Significance level

Also known as α. It is the probability of making a type I error. It is not measured from the observed data but is chosen as an acceptable threshold before the experiment begins. Conventionally the significance level is taken as 5%. If the p-value is less than the significance level then the results are statistically significant.
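One way to see what a 5% significance level means is to simulate many A/A experiments, where the null hypothesis is true by construction, and count how often a two-tailed z-test still reports a significant result. This is an illustrative sketch only: the function name, the 10% conversion rate and the group sizes are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_experiments=1000, n_per_group=500, true_rate=0.1,
                        alpha=0.05, seed=0):
    """Share of simulated A/A experiments that wrongly reject the null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups share the same true conversion rate, so any "effect" is noise.
        conv_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        pooled = (conv_a + conv_b) / (2 * n_per_group)
        std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if std_err == 0:
            continue  # identical all-or-nothing outcomes, so no difference to test
        z = (conv_b - conv_a) / n_per_group / std_err
        if abs(z) > z_crit:
            false_positives += 1
    return false_positives / n_experiments


print(f"Observed false positive rate: {false_positive_rate():.3f}")
```

The observed rate should sit close to the chosen α, which is the sense in which the significance level is the probability of a type I error.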

Statistical significance

Statistical significance helps you understand how compelling your experimental data is and whether you can reject the null hypothesis. The result of an experiment is statistically significant if it is unlikely to occur by chance alone. Conventionally, if the p-value is less than 5% then the results are statistically significant and we reject the null hypothesis.

Treatment

Another word for the challenger.

Two-tailed test

In a two-tailed test you only predict that there will be an effect, not the direction of the effect (e.g. "conversion rate will change", not "conversion rate will increase"). Your significance level is split equally between each direction, which means you need a larger difference between (for example) conversion rates to reject the null hypothesis. However, it is a much more rigorous test than a one-tailed test.
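The difference between one-tailed and two-tailed tests shows up in the critical z-score and in the p-value for the same observed result. The helper names below are hypothetical, and the sketch assumes a z-test in which a positive z-score is the predicted direction.

```python
from statistics import NormalDist

norm = NormalDist()


def critical_z(alpha=0.05, tails=2):
    """z-score the result must exceed to be significant at the given alpha."""
    return norm.inv_cdf(1 - alpha / tails)


def p_value_from_z(z, tails=2):
    """p-value for an observed z-score; positive z is the predicted direction."""
    if tails == 1:
        return 1 - norm.cdf(z)
    return 2 * (1 - norm.cdf(abs(z)))


print(f"One-tailed critical z: {critical_z(tails=1):.3f}")  # about 1.645
print(f"Two-tailed critical z: {critical_z(tails=2):.3f}")  # about 1.960

# The same observed z-score of 1.8 is significant one-tailed but not two-tailed.
print(f"One-tailed p-value: {p_value_from_z(1.8, tails=1):.3f}")
print(f"Two-tailed p-value: {p_value_from_z(1.8, tails=2):.3f}")
```

The two-tailed p-value is exactly double the one-tailed one here, which is why a two-tailed test needs a larger observed difference to reach the same significance level.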

Type I error

A false positive.

Type II error

A false negative.

About

Experimentation Hub was created by Rik Higham, who is a Senior Product Manager at Skyscanner.
Rik also writes about experimentation and Product Management on Medium.

Copyright © Rik Higham 2016 - 2017