If there is a statistically significant difference in your key metric between the challenger and the control (p < 0.05) and you are confident you have done a robust, objective analysis, then the common response is to roll that change out to all users. However, this misses the main point of experimentation: to learn.
“The more tests your team runs, the more ideas they should get for new tests. Data should be generative, not conclusive.” – Sara Critchfield
Experiments don’t tell us whether we should release a feature or choose one growth campaign over another. They tell us what happens to our key metrics if we make a specific change, giving us information about the impact that change has. They give us new insights that increase our understanding (of our customer, product, and market), and help us make decisions and shape future plans. Not every experiment has to be a go/no-go decision. It can be a step that informs the next iteration of the experiment, and there may be many iterations before we consider something good enough to show to all our customers. Alternatively, an experiment may test one of our riskiest assumptions and not be designed for launch at all, instead informing our direction. The knowledge we gain with every experiment moves us closer to our goal. The key to this is rapid tests. What’s the smallest experiment you can run to test your biggest assumption? As Tom Chi, co-founder of Google X, puts it:
“Maximising the rate of learning by minimising the time to try things.” – Tom Chi
Another aspect to consider is how impactful your change has been. Just because an experiment is statistically significant, that doesn’t mean it’s significant from a business point of view. Statistical significance only tells us whether we can consider the change in our key metric to have been caused by our experimental change. It doesn’t tell us whether that will make a material impact on our business. Is a 1% change good enough? That depends on your objective, the numbers involved, and any other consequences of making the change. You may feel pressure to accept a change (and show progress), but there are many things you need to consider. For example, changing a growth marketing campaign creative may involve costs for image licensing and text translations, which carry an associated risk of not being properly localised. Launching features has many hidden consequences too. It adds code that needs maintaining, that may not be completely bug-free or have 100% test coverage, that increases the weight and complexity of your codebase, and that may increase page weight or app size and decrease run-time performance.
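To make the gap between statistical and business significance concrete, here is a minimal sketch (the traffic and conversion numbers are made up) of a standard two-proportion z-test. With a large enough sample, even a lift of roughly 1% relative comes out as highly “significant”, and whether that lift is worth the cost of the change is a separate, business question:

```python
from math import sqrt, erfc

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test on the difference between two conversion rates.

    conv_* : number of conversions, n_* : number of visitors per arm.
    Returns (p_value, relative_lift_vs_control).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal CDF
    return p_value, (p_b - p_a) / p_a

# Hypothetical experiment: 10.0% vs 10.1% conversion on 2M visitors per arm.
# p comes out well below 0.05, yet the relative lift is only ~1%.
p, lift = two_proportion_ztest(200_000, 2_000_000, 202_000, 2_000_000)
```

Whether a 1% relative lift justifies the licensing, localisation, and maintenance costs described above is exactly the judgement the p-value cannot make for you.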
You also need to consider your customers’ experience. Individually, each experimental change may increase key metrics, but together they can create a bloated, Frankenstein product where the overall experience is noisy and confusing. Furthermore, just because an experiment increases a business metric, that doesn’t mean it improves the customer’s experience. For example, you might have put a small blocker in people’s path, like forcing them to register or log in, or trying to upsell optional extras, or you might have introduced some urgency messaging (like “5 other people are looking at this right now”). In the short term, people might tolerate the inconvenience, but over time they could get frustrated and stop using your product, or another product may come along that doesn’t have this friction. As always, this is a tricky balance. Metrics only measure what they measure. Key metrics must be wisely chosen to measure both an improved experience and a long-term gain for the business. This is another reason why usability testing is a powerful ally to experimentation.
There are many good books on choosing the right key metric (among them Lean Analytics by Croll and Yoskovitz) but here are a few key points:
Whatever key metric you choose, it should be what’s right for your product, not what’s easiest to measure. If you can’t easily measure your ideal metric, then look for a leading indicator of it. For example, it’s hard to measure 6-month retention during an experiment, so look for metrics that are good predictors of 6-month retention, like activation or 28-day retention (though suitable predictors will depend on your business). People sometimes struggle to set a key metric for completely new features. Since the feature doesn’t exist in the control, interactions with it cannot be used to compare control and challenger. Clicks, however, are a poor metric: not only are they absolute numbers and a vanity metric, they tell you nothing of the value someone gets from the feature. It is better to focus on the behaviour you want to change and the outcome you want to achieve.
“Not everything that can be counted counts, and not everything that counts can be counted.” - Albert Einstein
You should also consider what metrics you cannot afford to harm. If a test social media creative increases conversion for that product vertical but decreases conversion in one of your other product verticals, would you accept it? If a test log-in wall increases retention but harms revenue, is that beneficial overall? You should have an idea in advance of the trade-offs you are willing to accept. You should also consider having some Overall Evaluation Criteria (OEC). The key metric for your hypothesis measures only one thing, but there will be a number of key metrics that are important for your product overall and for your business. This set of metrics is known as the OEC and applies to all your experiments. They help you keep an eye on the bigger picture, and ideally none of these should decline as a result of your experimental change.
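One way to declare those trade-offs in advance is to write them down as guardrails and check every experiment against them. This is a minimal sketch, not a prescribed method, and the metric names and tolerances are purely hypothetical:

```python
# Hypothetical guardrails, agreed before the experiment runs:
# metric name -> worst relative change vs control we will accept.
GUARDRAILS = {
    "revenue_per_visitor": 0.0,            # no decline tolerated
    "other_vertical_conversion": -0.01,    # up to 1% relative decline
}

def oec_passes(relative_changes):
    """Check observed relative changes against the pre-agreed guardrails.

    relative_changes: dict of metric name -> observed relative change
    vs control. A metric missing from the dict is treated as unchanged.
    """
    return all(
        relative_changes.get(metric, 0.0) >= floor
        for metric, floor in GUARDRAILS.items()
    )
```

The point of the sketch is that the floors are fixed before you see the results, so a winning key metric can’t tempt you into quietly accepting harm elsewhere.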
Sometimes, “do no harm” or “no change” is seen as a valid outcome in itself. We need to be very careful here, though. There are one or two scenarios where this is appropriate: for example, replacing a service or API with a more scalable one, or removing a feature that you believe has no impact. However, why are we making changes if not to improve our customers’ experience and our business’s performance? If you have a good answer, then observing no statistically significant change in your key metric and OEC may be a justifiable reason to accept the challenger. When designing your experiment, though, it is critical that you do not underpower it, so that you have a small enough minimum detectable effect (MDE) to be confident you are doing no harm. You also need to be wary that, if something does change, you do not try to explain it away (i.e. avoid HARKing and the IKEA effect). Generally speaking, if we are making a change we are doing it to make things better, and if we are making an impact on our business, we should be able to measure it. If the change has no measurable effect, it is normally wiser to stick with the original.
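“Do not underpower it” has a real traffic cost, which the standard sample-size approximation for comparing two proportions makes visible. A rough sketch, where the 10% baseline rate is an illustrative assumption and the z-values correspond to the conventional two-sided α = 0.05 and 80% power:

```python
from math import ceil

def sample_size_per_arm(p_base, mde, z_alpha=1.96, z_beta=0.8416):
    """Approximate visitors needed per arm to detect an absolute lift
    of `mde` on baseline rate `p_base`.

    z_alpha = 1.96 is the two-sided 5% significance threshold;
    z_beta = 0.8416 corresponds to 80% power.
    """
    p_new = p_base + mde
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Hypothetical 10% baseline conversion: halving the MDE roughly
# quadruples the traffic each arm needs.
n_one_point = sample_size_per_arm(0.10, 0.01)    # detect a 1-point lift
n_half_point = sample_size_per_arm(0.10, 0.005)  # detect a 0.5-point lift
```

The quadratic blow-up in required traffic as the MDE shrinks is why “no change” verdicts from small, short experiments are so untrustworthy: the experiment may simply have been unable to see the harm.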