Part 2 - Designing the experiment conditions

Design the experiment

When you have an idea for an experiment, the first thing you should do is fill out a Hypothesis Kit statement. This will help you frame the experiment properly, and focus on the most crucial elements.

The Hypothesis Kit consists of two parts: design like you’re right, and test like you’re wrong. These encapsulate the yin and yang of experimentation: your intuition that making a specific change will improve your product’s performance, balanced by a rigorous, scientific methodology to ensure you draw the right conclusions. This is the Hypothesis Kit template:

Design like you're right:

Based on [quantitative/qualitative insight].

We predict that [product change] will cause [impact].

Test like you're wrong:

We will test this by assuming the change has no effect (the null hypothesis) and running an experiment for [X weeks].

If we can measure a [Y%] statistically significant change in [metric] then we reject the null hypothesis and conclude there is an effect.


The first part of the Hypothesis Kit is about what you want to change and why. Fundamental to this are the insights underpinning it, which is why they are explicitly stated at the beginning. The crucial point is that we aren’t asserting a speculative impact upfront. Instead we’re making a considered hypothesis after interpreting observed qualitative and quantitative insights. Leading with the insight forces people to think about why they want to do the experiment and why they believe the change will have any effect. It also helps avoid HiPPO (Highest Paid Person’s Opinion) experiments! When you’re filling out the “quantitative/qualitative insight” part of the template, be detailed. List the actual data and observations that led you to this prediction.

The “product change” describes the actual change you’re making, or the new element you’re adding, which you predict will cause a specific change in behaviour. The “impact” is the anticipated outcome of that behaviour change. Will more people sign up? Or make a purchase? Whatever outcomes you predict should be captured here.

Building a hypothesis:

Say you work for the travel company Expedia and you have data showing that Russian passport holders require a visa for a large number of destinations. So you make the hypothesis that “a lack of visa information affects bookings made by Russian passport holders”. Not a bad hypothesis, and excellently grounded in objective data, but it’s not falsifiable or testable yet. Let’s give it some direction: “showing visa information will increase bookings made by Russian passport holders”. Now let’s fill out the Hypothesis Kit: “based on the data that Russian passport holders require a visa to visit 73% of the Russian market’s top 100 destinations, we predict that showing visa information for each destination (in the top 100) will increase bookings made by Russian passport holders”. Falsifiable, testable, specific, and small in scope; an excellent hypothesis.

The second part of the Hypothesis Kit encapsulates the need to be objective and apply critical thinking to the experiment. We need to look hard for all the ways we might be wrong. The crucial point is that we state in advance our criteria for rejecting the null hypothesis (and therefore accepting that our change has had an effect). This is incredibly important because it helps protect us against the temptation to accept our change when the evidence doesn’t support it. The two most common temptations are Confirmation Bias and HARKing (Hypothesising After the Results are Known). Confirmation Bias is the desire to look at data that supports the hypothesis while ignoring data that would argue against it. HARKing is the act of forming or changing a hypothesis after having seen the results and presenting that as the original hypothesis. It is human nature to look at things in a way that backs up what you already believe, or want to believe. Stating the criteria for rejecting the null hypothesis upfront, before running the experiment, is the best guardrail against these pitfalls.

There’s lots going on in the “Test like you’re wrong” part of the Hypothesis Kit, so let’s unpack it and take it step by step.

It’s all about being interesting

Unfortunately, comparing the value of the key metric in group A with the key metric in group B is not as simple as, for example, measuring the length of two different tables to see which is longer. With the tables, you have 1 measurement from each table. With the experiment, the value of the key metric depends on the behaviour of hundreds of different people, each one using the product at different times of the day, in different contexts, on different devices, with different aims, knowledge and experience. We have to decide whether any difference between the key metrics is due to chance variations in the behaviour of people in each group, or whether an observed difference is greater than anything chance could plausibly cause. To help us decide, we use a method called “null hypothesis significance testing”, which is just a fancy way of saying “is the data more interesting than what chance could reasonably produce?”.

The good news is that you can use sites like ExperimentationHub.com to do the statistical calculations. However, you need a little understanding of what’s going on behind the scenes so that you can correctly set up experiments and analyse their results.
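
To make that a little more concrete, here is a minimal sketch of the kind of calculation involved: a two-proportion z-test comparing the conversion rates of groups A and B using the normal approximation. The function name and the figures in the example are illustrative assumptions, not what any particular site runs behind the scenes.

```python
from math import sqrt, erfc

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """z statistic and two-sided p-value for the difference between two
    conversion rates, under the null hypothesis that A and B are the same."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled conversion rate, assuming the null hypothesis is true
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    # Probability of seeing a difference at least this extreme by chance alone
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Illustrative figures: 12,000/100,000 conversions in A vs 12,250/100,000 in B
z, p = two_proportion_z_test(12_000, 100_000, 12_250, 100_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # a small p-value means chance alone is an unlikely explanation
```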

What are the chances!

Say you make the hypothesis that more people will click a red button because it stands out more against the green background than the current blue button. You run an experiment to compare the button click rate (your key metric) in your B group (the “challenger” with the new red button) and your A group (the “control” with the original blue button). However, unknown to you, you happen to get slightly more people who are red-green colour-blind in your B group than in your A group. How do you decide whether any difference in the key metric is real or is just due to chance?

The null and alternative hypotheses

The hypothesis you make when you design your experiment is called the alternative hypothesis. This is your prediction of how your change will affect your key metric.

Because we can never comprehensively prove a hypothesis is true, we start from the point of view that our change will have no effect and nothing interesting will happen to the data. This is the null hypothesis. It states that any difference between the key metrics of A and B is purely due to chance.

We want to disprove the null hypothesis. If we can run an experiment and show that the difference between A and B is actually greater than any differences chance could plausibly cause, then we can reject the null hypothesis. We can conclude that our change had an effect, and accept our alternative hypothesis.

How do we know what difference chance could plausibly cause? We run a power analysis. This will give us the smallest difference between A and B that we would accept as interesting (or surprising): the smallest change in our key metric that we can claim is greater than any statistical noise chance could plausibly generate.

I got the Power

Before we can run an experiment, we need to decide how many people we want to expose to the test and for how long we want to run it. To do this we need to look at the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. This is called the power of a statistical test.

There are 2 mistakes we can make when deciding whether or not to reject the null hypothesis. To illustrate these, we will use an analogy: we have taken a walk in the forest and we have made the assumption there is no wolf in the forest (this is the null hypothesis). If we see a wolf in front of us then our assumption is wrong and we can reject the null hypothesis. If we see something out of the corner of our eye and claim we saw a wolf when actually there is no wolf then the null hypothesis is true but we incorrectly reject it. This is a false positive (also known as a Type I error) and it means we are declaring an effect that’s not actually there. On the other hand, if there is a wolf in the forest but we don’t notice it, then the null hypothesis is false but we fail to reject it when we should have done. This is a false negative (also known as a Type II error) and it means we fail to detect a difference between A and B, even though it is real.

Unfortunately, these errors are linked. The more we reduce the risk of a false positive, the more likely we are to have a false negative, and vice versa. We can think of this in terms of a medical examination for cancer. If we decide to accept even the slightest indication of cancer, then we risk many people being misdiagnosed as having cancer, when actually they don’t. On the other hand, if we decide to accept only the strongest indication of cancer, then we risk missing many people who actually do have cancer, and who are told they don’t.

To have a high probability of correctly rejecting the null hypothesis, we need to know: what is the smallest change we can measure in our key metric where we are reasonably likely not to have any false positives or false negatives? This is called the Minimum Detectable Effect (MDE). If our key metric changes by less than the MDE then we cannot distinguish the difference from chance, so we cannot reject the null hypothesis.

To calculate the MDE we do a power analysis using an online tool like this one on experimentationhub.com. The power calculation requires 4 inputs:

  1. The number of eligible users per day (the maximum number of people who could see the change)
  2. The percentage of those users who will see the change (the B group)
  3. The current conversion rate (your key metric)
  4. The number of days you will run your experiment.

A power analysis is basically asking the question: what is the minimum change we could detect, given the amount of traffic we are willing to expose to the test and the current value of the key metric we want to improve? Most online power calculators assume a power of 80%. This means that if there really is a change as big as the MDE, there is a 20% probability that our experiment fails to detect it (i.e. a 20% chance of a false negative). We can reduce this probability, but it would require more traffic. Also, as we’ve said above, false negatives are linked to false positives, and most of the time it is better to accidentally miss an effect than to mistakenly claim an effect where there isn’t one.
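
To make the arithmetic concrete, here is a minimal sketch of the kind of calculation a power analysis performs, assuming a 5% two-sided significance level, 80% power, the normal approximation for conversion rates, and that the control group receives all remaining eligible traffic. The function and its name are illustrative, not the calculator's actual code, but with these assumptions it roughly reproduces the example numbers later in this section.

```python
from math import sqrt

Z_ALPHA = 1.96    # critical z value for a 5% two-sided significance level (assumption)
Z_POWER = 0.8416  # z value for 80% power (assumption)

def minimum_detectable_effect(visits_per_day, fraction_seeing_change,
                              conversion_rate, weeks):
    """Relative MDE for a conversion-rate experiment (normal approximation)."""
    days = weeks * 7
    n_b = visits_per_day * fraction_seeing_change * days        # challenger group
    n_a = visits_per_day * (1 - fraction_seeing_change) * days  # control group (assumed to get the rest)
    variance = conversion_rate * (1 - conversion_rate)
    absolute_mde = (Z_ALPHA + Z_POWER) * sqrt(variance * (1 / n_a + 1 / n_b))
    return absolute_mde / conversion_rate  # expressed as a relative change

# The first worked example below: 100,000 visits/day, 50% exposed, 12% conversion, 1 week
print(f"MDE: {minimum_detectable_effect(100_000, 0.50, 0.12, 1):.2%}")  # roughly 1.8%
```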

To MDE or not to MDE

Many people do a power analysis in the opposite way: they calculate the minimum time required to run the experiment in order to be reasonably likely to observe an effect of a given size. However, it is better to fix the length of your experiment in advance and calculate the smallest change in your key metric that you would be able to detect (the MDE). This allows us to ask: what can we learn in 1 or 2 weeks? It puts the focus on learning. It helps us to avoid running experiments for a long time in order to reliably detect a small change, and instead prioritise the most impactful experiments. It also helps guard against power hacking, where a power calculation is manipulated to give a short experiment time by exaggerating the predicted effect size. The risk with power hacking is that the test fails to detect a real change (i.e. a false negative) because it didn’t have enough traffic (it was under-powered).

There are many other reasons to use the power analysis to calculate the MDE, including:

  1. If you state the effect in advance it’s just a guess because you can’t know exactly how your key metric will change.
  2. Fixing the length of the experiment removes the temptation to run the experiment a bit longer to try and get the result you want.
  3. You need to run the experiment for complete business cycles (see below), so however long a power analysis tells you to run the experiment, you’ll need to round the time up or down to the nearest complete business cycle, which means the effect size you can actually detect will no longer match the guess you put into the calculation!

How long should I run an experiment for?

When considering how long to run your experiment you need to remember that the way people use your product changes throughout the day and week. To include all these different behaviours, you need to run the experiment for at least 1 complete business cycle, and if you run it for longer you must always use whole business cycles. For example, a lot of people search for rental cars at the start of the week to check prices, then book the car on Friday, in time for the weekend. If we ran an experiment for 4 days from Sunday – Wednesday we would be missing out on a lot of purchasing behaviour and we would bias the experiment towards browsing behaviour. We need to run the experiment for a full 7 days (midnight to midnight), or 14 days, or 21 days etc. For different products, the business cycle may not be 1 week, so you need to look at your data and understand how people use your product. The longer you run the experiment, the smaller the MDE. However, doubling the time does not halve the MDE, it only reduces it slightly, so it is a case of diminishing returns (see some examples below). In order to stay lean, learn quickly, and iterate, it is best to run an experiment for 1 or 2 business cycles only (although this will depend on the number of visitors per day you have).

What is your business cycle?

A timely note

Most analytics tools treat a “day” as midnight to midnight. If you start your experiment in the afternoon on Monday 5th June and include that day in your analysis then you are only getting part of a day, the later part. This means you are missing data about the behaviour of people earlier on that day, so you are biasing your results and may draw incorrect conclusions. Plus, it means you have fewer visitors, which makes it harder to reliably detect any effect your experiment has had. This is the same for stopping an experiment.

When you analyse your experiment, you need to ensure that each day of your business period has had a full day’s worth of traffic (from midnight to midnight). If you can automatically start and stop your experiment at midnight then you only need to consider the days in your business period (for example if you are scheduling a paid media campaign). However, if you are manually starting and stopping an experiment, you need to start it during the day before your first full day of traffic, and stop it the day after your last full day of traffic. For example, if your business period is 1 week and you started the experiment at 11am on Tuesday 6th June, then you need to stop the experiment on Wednesday 14th June, and use data from Wednesday 7th to Tuesday 13th for the analysis.
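
As a small illustration, that date arithmetic can be sketched in code. The helper below is hypothetical (it isn't part of any tool mentioned here), and June 2017 is assumed so the weekdays line up; it simply encodes the rule above: discard the partial start day, count whole business-cycle days, and stop the day after the last full day.

```python
from datetime import datetime, timedelta

def analysis_window(started_at: datetime, business_cycle_days: int = 7):
    """First full day, last full day, and the day to manually stop the experiment."""
    first_full_day = started_at.date() + timedelta(days=1)  # the partial start day is discarded
    last_full_day = first_full_day + timedelta(days=business_cycle_days - 1)
    stop_day = last_full_day + timedelta(days=1)             # stop the day after the last full day
    return first_full_day, last_full_day, stop_day

# The example above: started at 11am on Tuesday 6th June, with a 1-week business cycle
first, last, stop = analysis_window(datetime(2017, 6, 6, 11, 0))
print(first, last, stop)  # analyse 2017-06-07 to 2017-06-13, stop on 2017-06-14
```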

How many people should I expose to the challenger?

The more people you expose to the challenger, the smaller the MDE. The maximum number of people you can use is 50% of the eligible traffic because your control group (A) would need the other 50% (you should use the same percentage of traffic for both groups). However, this comes with the risk that a large proportion of your traffic would see a version of your product that could potentially degrade your key metric. Using a lower percentage of traffic reduces the risk to your business, and gives stakeholders confidence to test bold changes. It also allows you to run more independent experiments by separating your traffic into cohorts. For example, you could run 10 experiments that don’t overlap if each experiment has 5% traffic per variant:

[Figure: eligible traffic split into non-overlapping cohorts, one pair of 5% variants per experiment]
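
One common way to create non-overlapping cohorts, sketched below purely as an illustration (the article doesn't prescribe an implementation, and the visitor identifier and bucket scheme are assumptions), is to hash a stable visitor ID into a fixed number of equal buckets and dedicate different buckets to different experiments.

```python
import hashlib

def cohort_for(visitor_id: str, n_cohorts: int = 20) -> int:
    """Deterministically map a visitor to one of n_cohorts equal buckets (5% each for 20)."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_cohorts

# With 20 cohorts of 5%, experiment 1 could use cohorts 0 (control) and 1 (challenger),
# experiment 2 could use cohorts 2 and 3, and so on, so the 10 experiments never overlap.
print(cohort_for("visitor-12345"))
```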

It is a balance between the length of the experiment and the percentage of people exposed to the test. Since they both affect the MDE you can choose a value for each that gives you an acceptable MDE. Don’t forget that a larger MDE is ok. It just means you’re aiming for a high impact experiment, and if you don’t detect a large change in your key metric then you’re happy to miss an incremental, low impact change.

Examples of power calculations

How many eligible visits do you have per day? 100,000
What percentage of visits will see the change? 50%
What is your current conversion rate? 12%
How many weeks will you run your experiment? 1

The smallest change you can measure in 1 week is 1.82% (the Minimum Detectable Effect).

How many eligible visits do you have per day? 100,000
What percentage of visits will see the change? 50%
What is your current conversion rate? 12%
How many weeks will you run your experiment? 2 (double the period of the first example)

The smallest change you can measure in 2 weeks is 1.29% (the Minimum Detectable Effect). Note that doubling the number of weeks does not halve the MDE.

How many eligible visits do you have per day? 100,000
What percentage of visits will see the change? 25% (half the percentage of the first example)
What is your current conversion rate? 12%
How many weeks will you run your experiment? 1

The smallest change you can measure in 1 week with 25% of the traffic is 2.10% (the Minimum Detectable Effect). Note that halving the percentage of visits does not double the MDE.

We now have everything we need to fill out the “Test like you’re wrong” section of the Hypothesis Kit:

Design like you're right:

Based on [quantitative/qualitative insight].

We predict that [product change] will cause [impact].

Test like you're wrong:

We will test this by assuming the change has no effect (the null hypothesis) and running an experiment for [X weeks].

If we can measure a [Y%] statistically significant change in [metric] then we reject the null hypothesis and conclude there is an effect.

[X weeks] is the number of business periods you will run your experiment for (you can use [X days] if your business period is very short, as it might be for a social media product, for example). [Y%] is the Minimum Detectable Effect calculated by the power analysis, and [metric] is the name of your key metric.

We have now crafted our hypothesis (based on objective insights), created the experiment to test it, decided how long to run the experiment and what percentage of our traffic we want to expose to it, and run a power analysis to find our Minimum Detectable Effect.

It’s time to run the experiment!


The Hypothesis Kit was developed by Rik Higham and Colin McFarland, with contributions from David Pier, Lukas Vermeer, Ya Xu and Ronny Kohavi. “Design like you’re right, test like you’re wrong” props to Jane Murison and Karl Weick. Original Hypothesis Kit from Craig Sullivan.

Power analysis calculation based on Experiment Calculator by Dan McKinley, adapted by Rik Higham.