I am not a statistician, but this fails two sniff tests for me. Can someone explain where my intuition is wrong?
1) Yes, you might find that running an A/A test on your data shows a "result" for one of your groups, even though there obviously is none; this (as I see it) is just due to insufficient data. Isn't running an A/A test isomorphic (in a strong sense) to correctly checking the statistical significance of your A/B test, and only believing your A/B test if it does in fact reach a high significance threshold?
2) Re-randomizing your data the way they describe will inherently increase the "orderedness" (decrease the entropy) of your data set, which risks biasing the A and B groups in some way you don't fully understand. It seems like the re-randomization procedure would at best leave the statistical properties of your A and B groups unaltered, and at worst introduce a new bias that makes your result fishy.
I'm sure there's something deeper going on here, but the post doesn't explain it well.
They are saying that the variation in tests can be lowered by including covariates in the selection criteria. Let's say one of the many products I sell is a "driver's license test kit" and I somehow know the age of my site visitors. I have an idea for changing how I advertise the kit on my website. Most of the people who would buy the kit are in a very specific age bracket (16-20 year olds, depending on the legal age in your country), and my site has an even spread of visitor ages. If I randomly split my test pool and it ends up with an imbalance in ages, my results will be skewed. So I can lower the number of tests needed by not forming test groups or control groups where all the 16-20 year olds land on one side. I "know" those splits would be irrelevant, so I just don't run them. If they would have yielded relevant results I'm screwed, but if not, I saved time.
For your 1 and 2, yes and yes. They measure different things. But I can see that "a test population of site visitors with an even age distribution" would yield answers faster than a random distro. You are asserting a belief rather than testing for it, adding a given and so reducing the problem space, trading lower variance in results for higher uncertainty about the significance (because maybe your assertion that age drives interest is wrong, and then all the tests are wrong). P(this test is totally meaningless) goes up, but P(the results are close to true, given that the test is meaningful) also goes up? That's how I read it at least.
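To make the age-bracket example above concrete, here's a minimal sketch of a stratified split; the visitor records, the 16-20 bracket, and the field layout are all assumptions for illustration, not anything from the post. The idea is to shuffle within each age stratum instead of over the whole pool, so both groups end up with roughly the same share of 16-20 year olds.

    import random
    from collections import defaultdict

    # Hypothetical visitor records: (visitor_id, age); the 16-20 bracket is
    # the one assumed to matter for the "driver's license test kit" example.
    visitors = [(i, random.randint(14, 60)) for i in range(1000)]

    def stratified_split(visitors, bracket=(16, 20)):
        """Split visitors into A/B so the key age bracket (and everyone else)
        is divided evenly, instead of trusting a plain random split."""
        strata = defaultdict(list)
        for vid, age in visitors:
            key = "target" if bracket[0] <= age <= bracket[1] else "other"
            strata[key].append((vid, age))

        group_a, group_b = [], []
        for members in strata.values():
            random.shuffle(members)          # random *within* each stratum
            half = len(members) // 2
            group_a.extend(members[:half])
            group_b.extend(members[half:])
        return group_a, group_b

    a, b = stratified_split(visitors)
    in_bracket = lambda g: sum(1 for _, age in g if 16 <= age <= 20)
    print(f"A: {len(a)} visitors, {in_bracket(a)} in 16-20")
    print(f"B: {len(b)} visitors, {in_bracket(b)} in 16-20")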
What I believe the post is recommending is that you attempt to characterize your population, and then select your groupings based on that characterization. The goal is for the distributions of your two groups across all the metrics you can measure to be as close to equal as possible. Some trivial (and obviously extreme) examples from the medical field:
a) If you're doing a study with 50/50 male and female participants, you'd probably reroll your control/active group assignments if they came out far off from 50/50. Hell, you'd probably just split males and females separately.
b) If you had a twin study, the only way to divide up your population that makes sense is by splitting up your twins. Nothing else makes sense.
c) Imagine you wanted to test how Adderall affects the studying/test-taking abilities of a variety of college-aged students. You would do your best to make sure that your control/active populations have similar distributions in your test metric (or a proxy for it, such as IQ) prior to starting the test.
Example c) is the closest to the case the post is talking about. When you're trying your absolute best to maximize the power of your test, you usually want to take into account what you know, or what you assume, about the thing you're testing. In the case of website A/B testing, you're making a (possibly unfounded) assumption that reaction to A is an okay proxy for reaction to B.
This is an assumption - nearly all of statistics is based on assumptions. Nearly all exercises in powering and designing experiments are based on assumptions and iteration.
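One way to read the "reroll" idea from examples a) and c) is as re-randomization with a balance check: keep drawing random splits and only accept one where the groups' baseline distributions line up. A rough sketch, assuming a hypothetical population with a baseline_score field standing in for the test-metric proxy:

    import random
    import statistics

    def reroll_until_balanced(subjects, covariate, tol=1.0, max_tries=10_000):
        """Re-randomize (the "reroll" in example a) until the control/active
        groups have nearly equal means on a baseline covariate, e.g. a prior
        test score or IQ proxy as in example c. The data shape and the
        tolerance are assumptions for illustration."""
        ids = list(range(len(subjects)))
        for _ in range(max_tries):
            random.shuffle(ids)
            half = len(ids) // 2
            control = [subjects[i] for i in ids[:half]]
            active = [subjects[i] for i in ids[half:]]
            gap = abs(statistics.mean(s[covariate] for s in control)
                      - statistics.mean(s[covariate] for s in active))
            if gap <= tol:
                return control, active
        raise RuntimeError("no balanced split found; loosen tol")

    # Hypothetical study population with a baseline score to balance on.
    population = [{"id": i, "baseline_score": random.gauss(100, 15)}
                  for i in range(200)]
    control, active = reroll_until_balanced(population, "baseline_score")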
Great questions. I can't say for sure, but I believe you might be right on 1). When there's low variability within each group, a smaller sample size gives you the same statistical power; that's why pairing is so popular in clinical studies. The trick to solving 2), though, is that you pair first and then randomly assign treatment within each pair. I found this great article [1] on the subject, with an interesting discussion of the history. John Stuart Mill's approach was to stamp out heterogeneity however possible (e.g. breeding the differences out of mice). However, Fisher showed that no matter how hard you tried to stamp out heterogeneity, the list of possible differences could never be fully enumerated - the article says he made a list of all the possible differences in the tea cups used in his famous Lady Tasting Tea experiment [1, 2]. Fisher's contribution was noticing that random assignment could balance out the heterogeneity.
This article [1] puts it succinctly:
"If treatments are randomly assigned, so treatment effects may
be estimated without bias, increasing the sample size and decreasing
the heterogeneity of experimental units have similar
consequences: they reduce the sampling variability of unbiased
estimates."
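To make "pair and then randomly assign within each pair" concrete, here's a small sketch; the subject fields and the choice of age as the matching covariate are just assumptions for illustration:

    import random

    def paired_assignment(subjects, key):
        """Sort subjects by a matching covariate, pair off neighbours, then
        flip a coin within each pair to decide who gets the treatment.
        Pairing soaks up heterogeneity; the coin flip keeps the assignment
        unbiased. A leftover subject (odd count) is simply dropped here."""
        ordered = sorted(subjects, key=lambda s: s[key])
        control, treatment = [], []
        for i in range(0, len(ordered) - 1, 2):
            pair = [ordered[i], ordered[i + 1]]
            random.shuffle(pair)      # the random assignment within the pair
            control.append(pair[0])
            treatment.append(pair[1])
        return control, treatment

    # Hypothetical subjects matched on age; the field names are illustrative.
    subjects = [{"id": i, "age": random.randint(18, 65)} for i in range(100)]
    control, treatment = paired_assignment(subjects, "age")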
I think what you're missing is that, in the real world, sample sizes are limited by time, budget, and recruitment. Statistical power grows slowly with sample size, so fixing potential sampling errors by increasing the sample size is a game of diminishing returns. The more you can do up front to focus your testing on the variable you want to measure, the better.
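As a back-of-the-envelope illustration of how slowly power grows with sample size, here's a normal-approximation power calculation for a two-sample comparison of means; the effect size and alpha are arbitrary assumptions:

    from scipy.stats import norm

    def approx_power(effect_size, n_per_group, alpha=0.05):
        """Normal-approximation power for a two-sample comparison of means
        with standardized effect size d and n subjects per group."""
        z_crit = norm.ppf(1 - alpha / 2)
        return norm.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_crit)

    # For a small effect (d = 0.2), each doubling of n buys less power.
    for n in (100, 200, 400, 800, 1600):
        print(f"n per group = {n:5d}  power ~ {approx_power(0.2, n):.2f}")

With these numbers, doubling from 100 to 200 per group moves power a lot, while doubling from 800 to 1600 barely moves it at all.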