A/B Testing is Expensive

btilly · on Nov 13, 2013

If you want both a strategy for testing with limited data, and information about the kinds of errors you are likely to encounter, you may want to read http://elem.com/~btilly/ab-testing-multiple-looks/part2-limi....

jamiequint · on Nov 13, 2013

Awesome! Thanks for sharing.

ivankirigin · on Nov 13, 2013

Great post!

When you're first starting out, positioning matters. At http://yesgraph.com we've found copy AB tests to produce incredible lift.

For example, people don't want to "invite" contacts. They do want to "email" contacts though. It's the same flow, but a few words triggered massive lift. The reason it is massive is because it is so unoptimized. So it is specifically at the start where such small tests can matter.

jamiequint · on Nov 13, 2013

Absolutely, small tests can make large differences.

However, you only have so many "bullets" to shoot at tests with a small audience so you have to be very picky. Sounds like contact inviting is a key viral feature of YesGraph so it makes a lot of sense to optimize there.

I'm not anti A/B testing at all, but anecdotally a lot of people I talk to about this stuff fire their test bullets on the wrong things and end up with not much to show for it.

A better understanding of psychology goes a long way into intuiting where you may actually be able to see gains, and how to go about achieving them. Sometimes this can be small changes producing large gains, although I have seen large changes produce larger gains more often.

ivankirigin · on Nov 13, 2013

Agreed

hvass · on Nov 13, 2013

Yep.

Dan Siroke, CEO of Optimizely always, always stresses in his presentations it is vital to test. The presentations on his SlideShare are awesome: http://www.slideshare.net/dsiroker

swalling · on Nov 13, 2013

One cheap (in time and money) complement to A/B or multivariate testing that the author doesn't mention is usability testing, specifically remote testing. Before we launch a test, we always run remote usability tests.

Feedback like this should be taken with a grain of salt, since these people are testers, not necessarily like your users in all respects. But it's still really valuable. I've caught numerous errors that test data would not help me understand easily.

Combine remote usability testing through something like usertesting.com with prototyping, and you've got a really rapid way to get feedback on the cheap, even if you don't have enough site visitors to get statistical significance on a reasonable time frame.

insickness · on Nov 13, 2013

If you want to test a sales page prior to product launch, it will be expensive to A/B test because you can't use natural visitors.

I have a site with a few thousand visitors per day. I had a product I was going to release. I was working up to a big product launch, building up the anticipation over my email list and on the site itself. In that case, I couldn't use natural traffic from the site to test the sales page prior to launch. I had to use cold traffic such as adwords to see which version of the page people responded to. I probably spent about $8k on traffic just crafting and A/B testing the sales page.

But it was worth it. It was like a university education in marketing. Marketing can be so counter-intuitive. So many things I expected to work did not work and vice-versa. But once I had the final tested sales page in place it worked and it worked well. I still get about a a 4% conversion. And most importantly, I knew that when I did finally launch, I had a well-tested and solid page that would convert the large initial influx of customers I got from the building up process before the product launch.

ryanglasgow · on Nov 13, 2013

Interesting read, but I would have to disagree. It's not difficult to reach 90% confidence with very a small sample size:

  - Variation A and B each receive 20 visits
  - Variation A receives 10 clicks while variation B receives 5 clicks
  - The confidence interval for Variation A is 90%
  (Source: https://mixpanel.com/labs/split-test-calculator)

Also, I wrote an article titled "Creating Successful Product Flows" that is very relevant to this post: https://medium.com/design-startups/c41ffbce49a1

shawabawa3 · on Nov 13, 2013

Of course if you are A/B testing something which doubles conversions from 25% to 50% (100% improvement) you'll know quickly. However, if you're looking at something which is better by something more realistic like taking conversions from 5% to 5.5%, you're looking at around 10000 visits each for 90% confidence.

ryanglasgow · on Nov 13, 2013

A startup isn't looking to make tiny .5% increment improvements so I don't see how this is relevant. Companies looking to grow a small user base are making significant changes, seeking significant improvements.

insickness · on Nov 13, 2013

Your average well-crafted sales page on the internet has a conversion rate of 2.5%. A 0.5% increment is a HUGE difference. You're lucky if you get a 0.2% increment after an extensive A/B test.

jamiequint · on Nov 13, 2013

Two things here. First, 90% confidence isn't great, I look for 99% confidence in running tests. Second, this assumes there is a lot of stuff you can test that produces 2x gains when in reality the number of things that do that is very small.

Its fair to A/B test things you expect to produce high leverage changes. That was actually part of the point of the article, no small tests. Focus here first, consumer psych helps you figure out where these opportunities are.

Once you get through these big opportunities though even respectable gains (e.g. 10%) take a lot of traffic to measure. For example, seeing a 10% gain in a 50% conversion rate takes around 2500-3000 visits to A/B test at 99% confidence. Seeing a 10% gain in a 10% conversion rate at 99% confidence takes 10 times more traffic than that.

gwern · on Nov 13, 2013

> Two things here. First, 90% confidence isn't great, I look for 99% confidence in running tests.

Why? Why are you so worried about controlling false positives that you're willing to eat a whole bunch of false negatives?*

You're not administering expensive drugs to cancer patients, you're designing a website! If you mistakenly think that green buttons perform better than blue buttons when the actual truth is the null hypothesis that they perform the same, that's not the end of the world.

* and I do mean a whole bunch; in that scenario, moving from alpha=10% to alpha=1% means you increase your false negatives by something like 3x. The power calculations:

    R> power.prop.test(n=20, p1=0.5, p2=0.25, sig.level=0.10)
    ...
              power = 0.4951
    ...
    R>
    R> power.prop.test(n=20, p1=0.5, p2=0.25, sig.level=0.01)
    ...
              power = 0.1646
    ...
    R>
    R> 0.4951/0.1646
    [1] 3.008

insickness · on Nov 13, 2013

There will be times when you make a change to a page and the difference in reception between the two pages is as stark as the situation you described above where out of 20 clicks, one page does twice as well. But most often there is a very minor difference between the click rate of the two pages, like less than 1%. In that case, you need a much larger samples size.

And even if you do get lucky and get a test like the one you described above, chances are, you want to continue to revise the page and make more subtle changes which will mean you need a much larger sample size even to reach the low bar of 90% confidence.

graeme · on Nov 13, 2013

Can someone with expertise comment on this? I once worked in a company where the founders thought that the small samples were adequate. I thought that the calculators were misleading with such small samples sizes, even though they gave "high confidence".

But that was only based on my intuition, not math, and I've never seen anyone give a good discussion of whether "90% confidence" is as definitive as it sounds in the context of a very small sample.

ronaldx · on Nov 13, 2013

It's a bit awkward to give a full answer to this, but this is to the best of my understanding and explained as simply as is reasonable:

A small sample has less statistical 'power' to identify significant differences where they exist. Put another way, a large sample is more likely to give a true significant result than a small sample.

But, if you do see 10% significance(/90% confidence) in a small sample, this is just as good as 10% significance in a large sample. Although the cutoff point will be more rough in a smaller sample, it's a good standard practice to round conservatively to account for this.

10% is unlikely to be considered a good result for statistics in either case - you can engineer a result by doing 10 tests on nothing and there's a danger you would have unknowingly or unconsciously done this, maybe (for example) by not deciding the sample size in advance. However, there's also presumably strong enough evidence against a harmful difference that you aren't likely to lose anything by following these results.

It can be good idea to do numerous small investigative tests as justification for bigger tests - relying on lots of small tests alone requires consideration for multiple testing (e.g. Bonferroni correction).

vasilipupkin · on Nov 13, 2013

"But, if you do see 10% significance(/90% confidence) in a small sample, this is just as good as 10% significance in a large sample". That is not true, strictly speaking. You are assuming that small sample describes the underlying distribution well. But this may not be the case due to non-normality of the distribution itself or potential biases

ronaldx · on Nov 13, 2013

Cool point and I agree.

The sample has to represent the population, that's fundamental. If the sample is so small that it can't characterise the population distribution, then you have a problem anyway. If you're measuring a events that happen 1% of the time (or 99% of the time), a sample of 100 is not nearly enough.

If you chose an appropriate non-parametric test to cover an unknown distribution with a small sample, it maybe would have zero power (impossible to give a significant result)

jfarmer · on Nov 14, 2013

There's no such thing as a "small" or "large" sample size, per se. If you're doing it rigorously, you need to fix both your confidence interval (e.g., 95%) and the effect size you expect to see (e.g., a 50% lift in metric X relative to your control). You can then do some simple math which will tell you what sample size you need before there's only a 5% chance you'll see a 50% lift in metric X if you continue the test. Finally, you run the test until you've sampled that many users and stop the test. If there's a winning variant and it's statistically significant, congrats! If not, go back to square one.

The larger the effect size, the smaller your sample size can be before you reach that conclusion.

Most folks don't fix the desired effect size and instead just create a bunch of variants, start the A/B test, wait for the A/B testing framework to shout "statistically significant!", and then declare a winning variant. If the sample size seems "too small" they might not feel comfortable declaring a winner, so they perfunctorily "get a few more samples." Neither of these are rigorous, so it's a bit pointless to debate about which one is "better."

vasilipupkin · on Nov 13, 2013

small sample sizes are misleading. You probably need at least 100 data points for reasonable significance, but if your data is skewed or has fat tails then most likely much more than that

jtcchan · on Nov 13, 2013

> It's not difficult to reach 90% confidence with very a small sample size:

I think the difficulty in reaching 90% confidence is in designing a challenger that is THAT much better the original (i.e. 10 vs 5). Most split tests are shots in the dark. You'll basically need a design or copy that is doing pretty bad and an a challenger that is a lot better (but not obviously good enough that you use it in the first place).

jtcchan · on Nov 13, 2013

I agree with your points re: A/B testing early on for startups but I don't think the conclusion is that A/B testing is an expensive option, it's more like the wrong option.

Yes, you'll need moderate levels of traffic for split tests to be effective, so if you don't have the traffic or time to wait around, you should be talking to your users.

grinnick · on Nov 13, 2013

I recently wrote a calculator which will tell you how many days it will take to run an A/B test to 90% significance.

You just plug in

1. the number of pageviews your page got in the last month

2. the number of conversions that resulted from those pageviews

http://abtestcalculator.com

peeplaja · on Nov 14, 2013

Here's an article on how to do conversion optimization with little to no traffic http://conversionxl.com/how-to-do-conversion-optimization-wi...

tersiag · on Nov 13, 2013

An alternative could be cloud based eye tracking testing services such as http://www.gazehub.com/ where you can get a lot info about how visitos navigate... and you dont need large volumes of traffic

badman_ting · on Nov 13, 2013

In a lot of cases, I would say it's ridiculously cheap -- you can make some very small changes and get a huge response. But in the context of a site that isn't getting many visitors yet, this make sense.

codexity · on Nov 13, 2013

As a practical matter, with limited data, you can't do A/B testing. You have to just move forward, guessing on each step -- yet carefully tracking whether the change brought an improvement.