A/B testing mistakes

Jasber · on Jan 3, 2013

I recently implemented A/B testing on a client's site using one of these Javascript-based A/B testing tools (but not this one).

I hadn't used one before, so wanted to verify the data would actually be accurate.

I did an A/A test, basically testing the same exact page––expecting the results would be the same.

Not only were the results not the same, but they were off by a wide margin.

Given this, I don't know how I'm supposed to trust any of the data.

Has anyone else had experiences like this? Is A/B testing in Javascript just not as reliable?

jfarmer · on Jan 3, 2013

"I did an A/A test, basically testing the same exact page––expecting the results would be the same."

That's not how A/B testing works. :)

Let's say we want to detect a 1% lift in some metric at 95% confidence and we set up an A/A test. We do the math and it tells us we need to sample 1,000 people to reach 95% confidence on a 1% lift.

If we ran the A/A test 100 times, roughly 5 of them would show a statistically significant difference between the two groups. That's what "95% confidence" means -- it means your false positive rate is 5%. This is called a Type I Error.

http://en.wikipedia.org/wiki/Type_I_and_type_II_errors

You could run a kind of meta-analysis and use the false positive rate as the variable you're measuring to see if there's a statistically significant difference between the %5 false positive rate you expect and the false positive rate the A/B testing software generates in practice.

In this case, your null hypothesis is that the "true alpha" of the A/B testing software is 0.05. You'd sample from among all the 95% confidence tests you run and see whether you can reject the null hypothesis.

btilly · on Jan 4, 2013

There are a number of valid ways to run and analyze A/B tests. Bayesian approaches look rather different than what you're describing.

jfarmer · on Jan 4, 2013

The original commenter was using off-the-shelf A/B testing software, so the odds of it doing anything other than a simple t-test are virtually zero. Not sure that the frequentist vs. Bayesean debate is the most relevant thing for him right now.

I felt it best to leave out nuance that didn't help him understand why his software was showing a statistically significant outcome for an A/A test.

robrenaud · on Jan 4, 2013

You are assuming the framework isn't broken or misconfigured.

jfarmer · on Jan 4, 2013

You're not going to out-pedant me, damnit! :P

Given the original comment, yes, I think it's more likely he just didn't understand why a A/A test might sometimes show a false positive.

Even if the system were misconfigured, there's no reason to think it would manifest as a false positive in an A/A test. There are lots of ways it could manifest.

siddharthdeswal · on Jan 3, 2013

There's a very simple reason for that.

You're seeing a difference between Control and Variation in an A/A test is because a very small number of visitors have been tested. To explain, suppose you toss a coin 10 times and 7 out of those it shows heads. Just based on these 10 tosses, would you conclude that the coin is loaded? Probably not. Suppose you tossed the coin a 100 times, it'll probably show heads maybe 43 or 47 or 51 or 52 times.

Point being, as you toss it more and more, the number of times it shows a heads or tails comes closer and closer to 50% but you need to toss it a large number of times to be fairly certain that it isn't loaded. The more you toss it, the more certain you are. However, you'll only be more and more certain, but never completely certain. VWO works on a similar principle. The more number of times you toss up Control and Variation to visitors, the more certain you become of either being better, worse or equal to each other.

If you'll read the post, the graph shows the fluctuations in the beginning, after which things kind of settle down. In an A/A test, they'll settle down to a very similar conversion rate.

Here's an article from the VWO Knowledgebase that'll help you with running an A/B test correctly http://visualwebsiteoptimizer.com/knowledge/how-to-ideally-r...

chc · on Jan 3, 2013

Some people actually suggest running A/A/B tests just to gauge how much noise is in their numbers, though that requires even more visitors to achieve statistical confidence since they're spread out among more options.

jfarmer · on Jan 3, 2013

I've worked at companies that tried to do this before. It makes no sense and shows the people running the A/B tests don't really understand the statistics behind A/B testing.

If I'm running an A/A test at 95% confidence and a sufficient number of visitors for whatever effect size I'm interested in, then 1 in 20 A/A tests will register a false positive. That's what "95% confidence" means. It does not mean there is "too much noise."

Moreover, in a proper A/B test, the A group and B group need to be independent and identically distributed. So, in an A/A/B test, if the A/A disagree it shouldn't tell you anything about B. That's what "independent" means.

If you want to be more confident you just increase your alpha. alpha=0.05 is already too high for most consumer web apps anyhow, IMO, but go wild. 99% confidence! Woo!

As a rule you want higher confidence when the cost of a mistake is high, e.g., this medicine gives people brain tumors! Oops.

mjw · on Jan 3, 2013

Perhaps you could view this "A/A/B" test as a very crude form of http://en.wikipedia.org/wiki/Bootstrapping_(statistics) method? At least if you're resampling A1 and A2 from a pool A and then doing separate A1/B and A2/B tests and looking at how much the resulting statistic varies between the two runs.

Agreed this is a silly way to go about it, but there better-thought-out bootstrapped confidence tests which could be used if you don't fully trust the distributional assumptions behind (say) the t-test.

jfarmer · on Jan 4, 2013

I wish! The words empirical distribution are music to my ears.

No, usually the rule people use is this: "If A1 and A2 show a statistically significant difference, then do not reject the null hypothesis regardless of A1/B or A2/B."

friendofasquid · on Jan 14, 2013

I do this when I'm not confident I set up the experiment correctly. If the A's differ bit quite a bit, it's more likely I made a mistake than there's is normal statistical variance. I make mistakes daily.

I think that's pretty sensible reason for A/A/B testing. Or A/B/B testing. Whatever you like.

Angostura · on Jan 3, 2013

> You're seeing a difference between Control and Variation in an A/A test is because a very small number of visitors have been tested.

Well, we don't actually know how many samples the OP has, do we?

siddharthdeswal · on Jan 4, 2013

True, we don't. But that is usually the reason for A/A tests throwing up different results (assuming of course, that the tool used to run the test is not flawed).

noelwelsh · on Jan 3, 2013

This is relevant: http://news.ycombinator.com/item?id=4997127

The answer is: it depends. How much data did you collect? How big was the difference you observed?

Javascript is less reliable than server-side tracking. There are lots of issues. Some people have Javascript turned off, though normally this is a small proportion. If people navigate away before the testing harness is loaded you can lose data. If you're testing elements that navigate away from the current page you can lose data if the harness is not implemented correctly (the browser will cancel pending requests when you navigate away).

I can't say what went wrong in your case but there is the potential for lots to go wrong.

marcosdumay · on Jan 4, 2013

Congratulations, if you have statistical relevance on those numbers, you've just clustered your clients into too different populations.

Now, you just need to discover why one of the populations had better results than the other, and act acordingly. Don't forget to do actual A/B tests to verify your hypotesis.

orangethirty · on Jan 3, 2013

May you provide more details? We could use your experience to shed some public light into how to interpret A/B testing data.

karolisd · on Jan 3, 2013

8) Have a hypothesis of what you're testing and control for variables. Run a MVT test if you're changing a lot of things. If the test wins and it's implemented, everyone is happy and people don't ask too many questions. If it loses, what have you learned? Test a hypothesis.

If a client looks at a comp for a test and asks to change something, I always ask them, "What hypothesis are we testing with that change?"