If we run an experiment without a hypothesis, we can gather some information and get some results, but we will probably have a hard time interpreting them or taking lessons from them, since we don’t know what we are actually trying to determine.
If we try to test two hypotheses at the same time, it will be hard to tell why the metrics and user behaviour changed, since it is not clear which hypothesis and its related solution is responsible for the change.
Even if the significance levels are tempting, the significance level by itself is not enough to decide whether the experiment should stop or continue. Instead, before running the test we should calculate the number of visitors required (a sufficient sample size) and how long the test should run (ideally full weeks, since e-commerce behaviour differs from day to day), so that we can trust the results. Then we need to end the test at this pre-defined time, no matter what the results are, and only then make the decision. If we don’t end the test at the pre-defined time, there is always a great chance that we will pick the wrong winner.
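To see why peeking at significance levels is risky, here is a minimal simulation sketch (assuming Python with NumPy and SciPy; the conversion rate, traffic and number of simulations are illustrative): it runs many A/A tests in which both variants are identical, checks the p-value once per day, and counts how often a test looks significant at some point compared to only checking at the planned end.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

BASELINE = 0.05          # illustrative conversion rate, identical in A and B
VISITORS_PER_DAY = 1000  # per variant, illustrative
DAYS = 28                # four full weeks
N_SIMULATIONS = 2000

false_winners_peeking = 0
false_winners_fixed = 0

for _ in range(N_SIMULATIONS):
    conv_a = conv_b = n_a = n_b = 0
    significant_at_any_peek = False

    for _day in range(DAYS):
        conv_a += rng.binomial(VISITORS_PER_DAY, BASELINE)
        conv_b += rng.binomial(VISITORS_PER_DAY, BASELINE)
        n_a += VISITORS_PER_DAY
        n_b += VISITORS_PER_DAY

        # Daily "peek": chi-squared test on the 2x2 conversion table so far.
        table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        if p_value < 0.05:
            significant_at_any_peek = True

    false_winners_peeking += significant_at_any_peek
    false_winners_fixed += p_value < 0.05  # decision only at the planned end

print(f"'Winner' found when peeking daily: {false_winners_peeking / N_SIMULATIONS:.1%}")
print(f"'Winner' found at the planned end: {false_winners_fixed / N_SIMULATIONS:.1%}")
```

Under these assumptions, the daily-peeking strategy declares a “winner” in a large share of the A/A tests even though the variants are identical, while the fixed-horizon decision stays close to the chosen 5% false positive rate.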
As mentioned in point 3, we need to run our tests with a sufficient sample size and for a sufficient length of time to get reliable and actionable results. That’s why we need enough traffic to reach the required sample size in a reasonable amount of time: waiting months for a test result is not an efficient way to test, learn and improve our product.
It is difficult to give a definitive number for the minimum required traffic, since the required sample size depends on the parameters of the test itself, such as the baseline conversion rate, the minimum detectable effect, the significance level and the statistical power.
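A sample-size calculation, like the calculator linked in the references, typically takes exactly these parameters as inputs. The sketch below (assuming Python with SciPy, with illustrative numbers) shows a standard two-proportion version and how the result translates into a test duration for a given amount of daily traffic:

```python
import math
from scipy.stats import norm

def required_sample_size(baseline_rate, minimum_detectable_effect,
                         alpha=0.05, power=0.8):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Illustrative numbers: 3% baseline conversion rate, and we want to detect
# an absolute uplift of 0.5 percentage points.
n_per_variant = required_sample_size(0.03, 0.005)

# Translate the sample size into a duration, rounded up to full weeks.
visitors_per_day = 4000  # assumed traffic, split across both variants
days_needed = 2 * n_per_variant / visitors_per_day
weeks_needed = math.ceil(days_needed / 7)

print(f"Visitors needed per variant: {n_per_variant}")
print(f"Run the test for at least {weeks_needed} full week(s)")
```

With these illustrative numbers, roughly 20,000 visitors per variant are needed, which is why a low-traffic shop may simply not be able to detect small uplifts within a reasonable time frame.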
To run an A/B test, we don’t need finished and perfectly designed features. What we need are the essentials that will help us test our hypothesis. For example, if we think that an element is confusing, we can simply black it out and easily measure the impact of that element. So, rather than spending lots of time building the perfect feature and design, we should focus on learning fast and cheaply.
If we want to change the test setup after the test has started, we should always either purge the data collected up to the time we made the change or start a new test; otherwise it eventually leads to data pollution.
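As a minimal sketch of what purging the data can look like in practice (assuming the raw experiment events are exported to a file with a timestamp column; the file name, column name and timestamp below are made-up examples):

```python
import pandas as pd

# Hypothetical export of raw experiment events; names are assumptions.
events = pd.read_csv("experiment_events.csv", parse_dates=["timestamp"])

# The moment the test setup was changed (illustrative value).
setup_changed_at = pd.Timestamp("2023-05-15 09:00:00")

# Keep only the events collected after the change, so the analysis
# is based on a single, consistent test setup.
clean_events = events[events["timestamp"] >= setup_changed_at]

print(f"Discarded {len(events) - len(clean_events)} events collected before the change")
```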
In general, we need to make sure that our tests don’t harm site performance through poorly implemented solutions such as heavy CSS or inefficient JavaScript.
This is the last piece of this A/B testing essentials mini-series. I hope you enjoyed it, and thanks to all who stuck around until the end!
References
Evan Miller, Sample Size Calculator: https://www.evanmiller.org/ab-testing/sample-size.html
CXL, 12 A/B Split Testing Mistakes I See Businesses Make All the Time: https://cxl.com/blog/12-ab-split-testing-mistakes-i-see-businesses-make-all-the-time/
16 Mistakes That Will Kill Your A/B Testing (And What You Can Do About Them)