Hypothesis Kit for A/B testing
One of the most important parts of A/B testing is having a solid hypothesis.
That’s why we developed the Hypothesis Kit:
Design like you’re right:
Based on [quantitative/qualitative insight].
We predict that [product change] will cause [impact].
Test like you’re wrong:
We will test this by assuming the change has no effect (the null hypothesis) and running an experiment for [X weeks].
If we can measure a [Y%] statistically significant change in [metric] then we reject the null hypothesis and conclude there is an effect.
The kit helps people to frame the experiment properly, focusing them on the most crucial elements.
Design like you’re right
The insight behind the proposed change is key. Leading with this forces people to think about why they want to do the experiment and why they believe the change will have any effect. It also helps avoid running tests simply because somebody (possibly a HiPPO: Highest Paid Person’s Opinion) thinks it’s a good idea or has a gut feeling.
Being clear about the change and predicted impact means we can design a rigorous, trustworthy experiment and measure the appropriate metric.
Test like you’re wrong
This encapsulates the need to be objective and apply critical thinking to the experiment. Look hard for where you might be wrong.
Stating in advance the minimum change we need in our key metric to reject the null hypothesis helps protect us against Confirmation Bias and HARKing (Hypothesising After the Results are Known). Confirmation Bias is the temptation to look at data that supports the hypothesis while ignoring data that would argue against it. HARKing is the act of forming or changing a hypothesis after having seen the results and presenting that as the original hypothesis.
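As a rough illustration of deciding the rule before seeing the data, here is a minimal sketch of a two-sided, two-proportion z-test for a conversion-rate metric. The function name, the significance level and the visitor counts are all assumptions made up for this example; they are not part of the Hypothesis Kit itself, and other metrics would need a different test.

```python
from math import sqrt
from statistics import NormalDist

# Fixed BEFORE the experiment starts (assumed value, for illustration only).
ALPHA = 0.05


def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-sided p-value for the null hypothesis that both variants convert equally."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Under the null hypothesis both variants share one pooled conversion rate.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical results: control converts 2,000 of 20,000, variant 2,200 of 20,000.
p_value = two_proportion_p_value(2_000, 20_000, 2_200, 20_000)
print(f"p-value: {p_value:.4f}, reject null: {p_value < ALPHA}")
```

The point of the sketch is the order of operations: the threshold is written down first and the test is evaluated once, at the end of the planned run, rather than adjusted after looking at the results.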
Reverse the logic
The minimum statistically significant change is found by doing a power calculation, another cornerstone of rigorous A/B testing. A power calculation is normally used to keep the two inherent statistical errors (false positives and false negatives) at acceptable rates by calculating the minimum sample size (and therefore time) required to be reasonably confident of detecting a change of a given size. However, I find it much more valuable to reverse the logic and ask: what can we learn in 1 or 2 complete weeks (or business periods, if your business has longer cycles than a week)? What is the minimum change we could detect, given the traffic we are willing to expose to the test and the current value of the metric we want to improve?
This avoids running experiments for a long time in order to detect a small change that is statistically significant but insignificant to the business. It helps us focus on the most promising, impactful experiments. It also helps guard against power hacking, where a power calculation is manipulated to give a short experiment time by exaggerating the predicted metric change. The risk with power hacking is that the test fails to detect a real change simply because it did not have enough traffic (it was under-powered). Fixing the experiment length gets around this issue.
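To make the reversed logic concrete, here is a minimal sketch for a conversion-rate metric using the standard normal approximation. The function names, the 5% significance level, the 80% power target and the traffic figures are assumptions chosen for illustration, not values from the Hypothesis Kit. The sketch approximates the variance of both variants with the baseline rate, so treat the numbers as ballpark estimates.

```python
from math import ceil, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def standard_error(baseline_rate, users_per_variant):
    """Approximate standard error of the difference between two conversion rates."""
    return sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)


def minimum_detectable_change(baseline_rate, users_per_variant, alpha=0.05, power=0.80):
    """Reversed logic: smallest absolute change detectable with the traffic available."""
    z_total = NORMAL.inv_cdf(1 - alpha / 2) + NORMAL.inv_cdf(power)
    return z_total * standard_error(baseline_rate, users_per_variant)


def users_needed(baseline_rate, absolute_change, alpha=0.05, power=0.80):
    """Conventional logic: users per variant needed to detect a predicted change."""
    z_total = NORMAL.inv_cdf(1 - alpha / 2) + NORMAL.inv_cdf(power)
    return ceil(2 * baseline_rate * (1 - baseline_rate) * (z_total / absolute_change) ** 2)


def achieved_power(baseline_rate, absolute_change, users_per_variant, alpha=0.05):
    """Chance of detecting a true change of this size (ignores the negligible opposite tail)."""
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    return NORMAL.cdf(abs(absolute_change) / standard_error(baseline_rate, users_per_variant) - z_alpha)


baseline = 0.10                    # hypothetical current conversion rate
weekly_users_per_variant = 20_000  # hypothetical traffic per variant per week

# What can we learn in 1 or 2 complete weeks?
for weeks in (1, 2):
    change = minimum_detectable_change(baseline, weeks * weekly_users_per_variant)
    print(f"{weeks} week(s): absolute {change:.4f}, relative {change / baseline:.1%}")

# Power hacking: size the test for an exaggerated prediction, then see how
# likely it is to detect a smaller, more realistic change.
exaggerated_change, realistic_change = 0.02, 0.005
users = users_needed(baseline, exaggerated_change)
print(f"users per variant: {users}")
print(f"power if the exaggerated change is real: {achieved_power(baseline, exaggerated_change, users):.2f}")
print(f"power if the realistic change is real: {achieved_power(baseline, realistic_change, users):.2f}")
```

Doubling the run time shrinks the detectable change by a factor of about the square root of 2, not by half, which is one reason very small changes take so long to detect. The second half shows the under-powered case: a test sized for the exaggerated prediction has only a small chance of detecting the more realistic change.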
Try it out
Try it out for yourself here and do an online power analysis to calculate the minimum statistically significant change you can measure in your experiment.
Rik Higham and Colin McFarland developed the Hypothesis Kit, with contributions from David Pier, Lukas Vermeer, Ya Xu and Ronny Kohavi. “Design like you’re right, test like you’re wrong” props to Jane Murison and Karl Weick. Original Hypothesis Kit from Craig Sullivan.