Here be Dragons! Er… I mean False Positives.
✓ Thought of an amazing experiment that’s going to increase conversion/retention/etc? Check!
✓ Filled out a Hypothesis Kit statement and done a power calculation? Check!
✓ Run the experiment for a whole business period? Check!
✓ Calculated statistical significance and got a p-value less than 0.05? Check!
☐ It’s a winner! Roll it out? Maybe…
Now’s an important time to stop, take a step back, and ask yourself:
“Have I honestly avoided confirmation bias?”
Think like a scientist
Not one like this
Not even close
Thinking like a scientist just means looking at the data objectively and applying critical thinking to the hypothesis. When you have an experiment that you think is a good idea, and you’ve put a lot of time into it, it’s very easy to latch on to the first positive results and call it a success. This is confirmation bias, and the way to counter it is to think hard about why you might be wrong. Pretend it’s somebody else’s idea and look for things they’ve possibly missed.
A concrete example
We made a change recently at Skyscanner to improve the data flow between our flights results pages and our discovery map. The prices on the map are indicative prices based on previous searches that people have done. When you look at flights on our results page, those prices are fresh from the airlines. There’s a lag of a few minutes before those fresh prices make it to the map (due to the incredibly complex underlying database system), which can be confusing if you’re moving between the results and the map. We’ve removed that lag, so now if you see a price on the results page and go back to the map, you will see that fresh price.
We did the power calculations and ran an A/B experiment for a week. Conversion for people who used the map increased by 3.7% (at 95% confidence). A great result! Suspiciously great…
The map doesn’t get a huge number of visits, and to be in this experiment you needed to have been to the map, gone to the results page and then returned to the map. That’s an even smaller group of people, so I wouldn’t expect conversion for the segment of people who’ve visited the map to increase that much. The experiment had been running for a couple of days longer than the required week, so I had a look at the additional week-long periods within that data (at this point I’m in danger of HARKing, and I certainly need to apply a Bonferroni correction, but that’s a story for a different post). Neither of the subsequent week-long periods showed a statistically significant increase in conversion. I decided to run the experiment again, this time for 2 weeks, to get a larger sample size. This time there was no statistically significant increase.
The original week-long experiment was a false positive!
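For the curious, here is a rough sketch of what that Bonferroni correction does when you peek at several week-long windows. The count of 3 looks is only an illustrative assumption, and both function names are made up for this example. Dividing the threshold by the number of looks is valid however the looks are related. The family-wise error formula assumes the looks are independent, which overlapping windows are not, so treat that number as a rough upper-end guide and not a measurement.

```python
def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold after a Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def familywise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across the comparisons.

    Assumes the comparisons are independent (an idealisation).
    """
    return 1 - (1 - alpha) ** num_comparisons


if __name__ == "__main__":
    looks = 3  # hypothetical: the original week plus two extra windows
    print(f"Uncorrected chance of a false positive: {familywise_error_rate(0.05, looks):.1%}")
    print(f"Each look now needs p < {bonferroni_threshold(0.05, looks):.4f}")
```

In other words, the more windows you check, the stricter each individual check has to be before you get to call anything a winner.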
I was lucky in this case that I knew any effect I saw should be small, and the result we got was suspiciously large. Had the first week shown a 1% statistically significant increase, would I have accepted that without questioning it? I hope not… but it’s been a good reminder that false positives lurk beneath the surface of every experiment!
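If you want to see those lurking false positives for yourself, one option is to simulate A/A experiments, where both groups get exactly the same thing and any "win" is pure noise. The sketch below is not how our experimentation platform works. It assumes a simple pooled two-proportion z-test, and the traffic figures and conversion rate are invented for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z-test; returns the two-sided p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # nobody (or everybody) converted in both groups
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(num_experiments: int, users_per_arm: int,
                        base_rate: float, alpha: float = 0.05,
                        seed: int = 0) -> float:
    """Share of A/A experiments (no real effect) that still reach significance."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_experiments):
        # Both arms share the same true conversion rate: there is nothing to find.
        conv_a = sum(rng.random() < base_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < base_rate for _ in range(users_per_arm))
        if two_sided_p_value(conv_a, users_per_arm, conv_b, users_per_arm) < alpha:
            false_positives += 1
    return false_positives / num_experiments


if __name__ == "__main__":
    # Hypothetical numbers: 1000 experiments, 1000 users per arm, 10% conversion.
    rate = false_positive_rate(num_experiments=1000, users_per_arm=1000, base_rate=0.1)
    print(f"A/A experiments that looked like winners or losers: {rate:.1%}")
```

With a 0.05 threshold you should expect roughly one in twenty of these do-nothing experiments to come out statistically significant, which is exactly why a single significant week deserves a second look.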