AB testing: It is not as simple as you might think

AB testing is so popular nowadays. Businesses are AB testing everything. However, there is more to it than meets the eye.

Suppose we AB test something and A wins in 5 cases while B wins in 4. Can we safely conclude that A is better than B? I don't think we can. We don't have enough data to draw this conclusion.

To draw a mathematically sound conclusion, we should estimate what is called the standard deviation for both the A and B cases, then make sure that the two resulting ranges don't overlap. Only if the ranges around the two distribution peaks don't overlap can we draw a safe conclusion about the AB test at hand.

How do we estimate the standard deviation? I suggest taking the square root of the number of outcomes, a common rule of thumb for counts of independent events. For 4 outcomes, the standard deviation is 2. Thus, the number of B-outcomes is plausibly anywhere in the range between 4 - 2 = 2 and 4 + 2 = 6. For the 5 A-outcomes, the square root is about 2.2, so the number is somewhere in the range between roughly 3 and 7. The ranges 2 to 6 and 3 to 7 overlap, so no definite conclusion can be drawn.
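
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the square-root rule of thumb described in this post, not a full significance test; the function names count_interval and intervals_overlap are made up for this example.

```python
import math

def count_interval(count, n_sigma=1.0):
    """Return (low, high) for a count, using sqrt(count) as the standard deviation."""
    sigma = math.sqrt(count)
    return count - n_sigma * sigma, count + n_sigma * sigma

def intervals_overlap(a, b):
    """True if two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

a_wins, b_wins = 5, 4  # the outcome counts from the example above
a_range = count_interval(a_wins)
b_range = count_interval(b_wins)

print("A range:", a_range)
print("B range:", b_range)
print("Overlap:", intervals_overlap(a_range, b_range))
```

With 5 and 4 outcomes the ranges overlap, so the check reports no clear winner. With the same ratio but 500 and 400 outcomes the ranges no longer overlap, which is why collecting more data matters.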

In e-commerce there is also the transaction amount to consider, which complicates things even more. What if you have one A-outcome with $1,000 in sales and ten B-outcomes with a $100 order total each (which brings the total to the same $1,000)? Are these two situations equivalent? The short answer is no. At S3 Stores, Inc. we work through some complicated math to figure this out, so we can make judgments like this. You have to if you want to stay in business.