A/B testing is a way to determine which of two or more variants (A or B) of a website or product performs better. It’s frequently used as an experimentation methodology because it removes a number of statistical biases and isolates the signal (an increase in a metric) from the noise (random fluctuations in the metric). The power of A/B testing comes from randomization: users are randomly allocated either to the control group A (no change) or to the proposed improvement B.
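The random allocation described above can be sketched as follows. This is a minimal illustration, not the assignment logic of any particular feature-flag system: the function name `assign_variant`, the use of SHA-256, and the `flag_name:user_id` key format are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, flag_name: str, variants=("A", "B")) -> str:
    """Map a user to a variant (hypothetical helper, for illustration only).

    Hashing the flag name together with the user ID gives an assignment that
    is stable for a given user and spread roughly evenly across variants.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and bucket it by variant count.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "example-flag")] += 1
    print(counts)  # the two groups should be of similar size
```

Including the flag name in the hashed key means the same user can land in different groups for different tests, which keeps one test's allocation from mirroring another's.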
Measuring a metric before/after a change (pre/post) sometimes doesn’t work, because many things are happening at the same time. If we launch a marketing campaign AND a feature to improve retention, and we see retention decrease, how do we know if it’s caused by poorly qualified traffic from the marketing campaign or by the feature not performing? By A/B testing features, we can still measure their impact in this situation as both A/B test variants would have seen, on average, the same amount of traffic from existing marketing channels and the new marketing campaign.
A/B testing is a good fit when there are several competing valid options with no clear answer, a clear metric to measure success, and a large enough number of users to test on. Evaluate:
A/B tests take time and resources, and we can only run a few A/B tests at a time to avoid interference between tests. We should reserve A/B testing for high-impact metric optimizations that match the other conditions here.
A/B testing is useful when there are several options to choose from, possibly with competing explanations on why they would work, but no clear winner.
It’s not useful to A/B test versions if we already know the result with high probability. Sometimes, though, we will want to use A/B testing for things we know are important just to quantify the impact (is it worth maintaining this?), as opposed to just validating (yes/no) that there is an impact.
A/B testing requires a clear quantitative way to measure what is “better”. Common advice is to pick a single, clear metric so that there is no ambiguity in deciding if the change
A/B testing requires a large enough number of users to get statistically significant results. The smaller the number of users in the A/B testing cohort, the more the change needs to have a big impact on the metric to be measurable. This is because the signal (the impact of the change) needs to be higher than the noise (the random fluctuation in a metric).
If you want to A/B test over a part of the product that does not receive a lot of traffic and therefore needs a much bigger impact on the metric to be statistically significant, the variant tested generally needs to be much more ambitious and thoughtful (e.g. removing 75% of steps in a sign-up flow instead of changing the color of a button).
For example, retention fluctuates week over week for “random” reasons we don’t understand. The more users, the smaller those “random” fluctuations will be: this will make it possible to decide whether a change caused by an experiment was likely random, or likely caused by the experiment.
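The relationship between cohort size and the smallest measurable change can be made concrete with a rough calculation. The sketch below uses the standard normal approximation for a conversion-style metric (a rate between 0 and 1). The 5% significance level and 80% power are conventional defaults assumed here, not values prescribed by this page, and `minimum_detectable_effect` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_variant,
                              alpha=0.05, power=0.8):
    """Approximate smallest absolute change in a rate that a test can detect.

    Assumes two equally sized groups and a normal approximation; alpha and
    power are conventional defaults, not values from the handbook.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    # Noise: standard error of the difference between the two group rates.
    noise = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)
    return (z_alpha + z_power) * noise


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        mde = minimum_detectable_effect(0.10, n)
        print(f"{n:>7} users per variant -> detectable change of about {mde:.3f}")
```

Because the noise term shrinks with the square root of the number of users, a cohort 100 times larger can detect a change roughly 10 times smaller.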
Running A/B tests on too few users is inefficient (it wastes time setting up the A/B test), inconclusive (the result will be that no conclusion can be drawn) and sometimes counterproductive (when miscommunicated, an insignificant A/B test leads teammates to draw conclusions that aren’t valid and can be hard to debunk). In that case, it’s better to choose another method, such as user interviews.
- Evaluating signup flows, or any flow that results in a measurable action
- Evaluating versions of button placement, CTAs, landing pages
- Evaluating different messaging and positioning statements
The higher the initial conversion percentage, the easier it will be to achieve a sample size quickly, but the tougher it could be to see a meaningful increase.
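The first half of that statement can be illustrated with a standard sample-size approximation for comparing two conversion rates. This is a sketch under assumptions: the 5% significance level, 80% power, and the helper name `users_per_variant` are illustrative choices, and the formula is the usual normal approximation rather than the method of any specific calculator.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect a relative lift.

    Hypothetical helper; assumes a two-sided test, equal group sizes and a
    normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not 0 < p1 < 1 or not 0 < p2 < 1:
        # A high baseline leaves little headroom: the target rate cannot reach 100%.
        raise ValueError("baseline and target rates must be between 0 and 1")
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for baseline in (0.02, 0.10, 0.40):
        n = users_per_variant(baseline, relative_lift=0.10)
        print(f"baseline {baseline:.0%}: about {n} users per variant")
```

For the same relative lift, a higher baseline needs fewer users. The `ValueError` branch reflects the second half of the statement: the closer the baseline is to 100%, the less room there is for a large increase.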
- Launching a new key feature backed by product and UX research: success can be measured directly with an adoption metric (there is an easier way)
- Fixing a frequently reported bug likely does not require A/B testing (no uncertainty)
- Changing the design to build brand trust (no metric)
After you identify a good A/B test candidate:
Communicate good candidates to any relevant parties. This could include other product/engineering teams that the A/B test could impact, such as but not limited to PMs, designers, BizOps. Posting the proposal in a public channel (#product, #product-led, #analytics) is a good way to do this.
- Fill out all the following information in an issue, and add it to the A/B testing project.
- Define the target metric
- Define the A, B (and more) versions
- Define the length of the test, depending on the number of users you need, and check for statistical significance with a calculator (example). Let BizOps know if you need to understand how much existing traffic there is to determine the expected length of the test
- Pick a feature flag name for your A/B test (eg.
- Set up the exact methodology (write the query, build the chart) for how this will be evaluated before the test launches to not introduce any bias in the evaluation
- Label the parent issue of the test with `AB-test` so anyone can see a list of the A/B tests that have been or are being run
- Label all the issues that will go into the test with `AB-test/<flag-name>`. That way anyone can see what changes are in a given A/B test, and what the name of the feature flag is. It will also make it easier to clean up the flag when the test ends. Example.
- WIP: Follow the naming conventions when adding events
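For the step on defining the length of the test, a back-of-the-envelope estimate is the number of users required across all variants divided by the eligible traffic per week. The sketch below assumes traffic is split evenly between variants and stays roughly constant; `estimated_test_weeks` and its parameters are hypothetical names for illustration.

```python
from math import ceil


def estimated_test_weeks(users_per_variant: int, variant_count: int,
                         weekly_eligible_users: int) -> int:
    """Rough test duration in whole weeks (hypothetical helper).

    Assumes an even traffic split and steady weekly traffic.
    """
    if weekly_eligible_users <= 0:
        raise ValueError("weekly_eligible_users must be positive")
    total_users_needed = users_per_variant * variant_count
    return ceil(total_users_needed / weekly_eligible_users)


if __name__ == "__main__":
    # Illustrative numbers only, not real traffic figures.
    weeks = estimated_test_weeks(users_per_variant=6_507, variant_count=2,
                                 weekly_eligible_users=2_000)
    print(f"Estimated duration: {weeks} weeks")
```

Rounding up to whole weeks also helps cover full weekly cycles, since metrics often fluctuate by day of the week.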
Analyze, write a report in a source of truth, and link to that in the original A/B test ticket. You can use Amplitude (link to come) or BigQuery/Sheets to evaluate A/B tests.
We use this calculator to evaluate significance.
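As a rough idea of what such a calculator does for a conversion metric, the sketch below runs a two-sided, two-proportion z-test. It is an assumption that this matches the linked calculator, which may use a different method; the function name and the example numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, p_value) for the difference in conversion rate, B minus A.

    Uses the pooled-variance normal approximation, which is reasonable when
    both groups are large. Hypothetical helper, for illustration only.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 100/1000 conversions in A versus 150/1000 in B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the observed difference is unlikely to be random fluctuation alone. The threshold should be chosen before the test launches, in line with setting up the evaluation methodology in advance.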
Book some time for cleaning up after the A/B test. It’s best to create a ticket for cleanup when starting the test. In particular:
- Remove the feature flag:
- In case of a successful experiment, remove the flag and roll out the changes.
- In case of a failed experiment, remove both the changes and the feature flag.
- Remove data from pings: if the experiment used data in pings and it failed, remove that data. We are very selective about what we include in our pings, so remove any data from pings that is no longer needed.