
That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner differs from the winner in every element of a partition, not just in some of them.



Also, doesn't their suggested approach amount to multiple testing? In other words, a kind of p-hacking: https://en.wikipedia.org/wiki/Multiple_comparisons_problem

Edit - and this: http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
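
To put a rough number on it: if each segment is tested at alpha = 0.05 and the segment tests were independent (a simplifying assumption; real segments overlap and correlate), the chance of at least one spurious "winner" grows fast. A quick back-of-envelope sketch in Python:

    # Family-wise error rate for k independent tests at alpha = 0.05
    alpha = 0.05
    for k in (1, 5, 10, 20):
        fwer = 1 - (1 - alpha) ** k
        print(f"{k:2d} segments -> P(at least one false positive) ~ {fwer:.2f}")
    # with 20 segments that's already ~0.64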


Yeah, a good A/B testing framework would either refuse to let you break things down too far or put a large warning on results that aren't significant, but that doesn't always stop the business types from finding some way to show a win.


Yes, I don't think it's possible to observe a Simpson's paradox in a simple conversion test, either.

Simpson’s paradox is about spurious correlations between variables - conversion analysis is pure Bayesian probability.

It shouldn’t be possible to have a group as a whole increase its probability to convert, while having every subgroup decrease its probability to convert - the aggregate has to be an average of the subgroup changes.


Are you sure?

Consider the case where iOS users are more likely to convert than Android users, but you currently have very few iOS users. You then A/B test a new design that imitates iOS, but has awful copy. Both iOS and Android users are less likely to convert, but it attracts more iOS users.

The group as a whole has a higher conversion rate because of the demographic shift, but every subgroup has a lower one.
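
A made-up set of numbers (not from any real test) to show the arithmetic:

    # Hypothetical counts: (conversions, users) per platform and arm
    control   = {"android": (81, 900), "ios": (30, 100)}    # 9% and 30%
    treatment = {"android": (40, 500), "ios": (125, 500)}   # 8% and 25%

    def overall(arm):
        return sum(c for c, _ in arm.values()) / sum(n for _, n in arm.values())

    for seg in ("android", "ios"):
        (cc, cn), (tc, tn) = control[seg], treatment[seg]
        print(f"{seg}: {cc/cn:.1%} -> {tc/tn:.1%}")          # both segments drop
    print(f"overall: {overall(control):.1%} -> {overall(treatment):.1%}")  # 11.1% -> 16.5%

Both platforms convert worse under the treatment, but the treatment's traffic skews toward the high-converting iOS segment, so the blended rate goes up.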


I don't follow. If one bucket has many more iOS users, it seems like you have done a bad job randomizing your treatment?


It could be self-selection happening after you randomized the groups. For example a desktop landing page advertising an app, which might be installed on either mobile operating system.


Simpson's paradox is sometimes about spurious correlations, but the original paradox Simpson wrote about was simply a binary question with 84 subgroups, where 3 or 4 subgroups with the outlying answer had a large enough share of the samples, and a large enough effect, to flip the overall result.


Exactly.

On that topic – what do you do when you observe that in your test results? What's the right way to interpret the data?


Let's consider an example that would be a genuine case of Simpson's Paradox. Suppose you are A/B testing two landing pages, and you want to know which makes more people become habitual users. You partition on whether the user adds at least one friend in their first 5 minutes on the platform.

It might be that landing page A makes people who add a friend in the first 5 minutes more likely to become habitual users, and also makes people who don't add a friend in the first 5 minutes more likely to become habitual users. But page A makes people less likely to add a friend in the first 5 minutes, and people who add a friend in the first 5 minutes are overwhelmingly more likely to become habitual users than people who don't.

So, in this case at least, the aggregate statistics seem most relevant. But the fact that page A loses mainly because it makes people less likely to add a friend in the first 5 minutes is also very interesting; maybe there is some way of combining A and B to get the good qualities of each and avoid the bad qualities of both.
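
Invented numbers (mine, just to make the structure concrete): say 1,000 users land on each page.

    # (habitual users, total users) per page and per "added a friend in first 5 min" subgroup
    data = {
        "A": {"added_friend": (120, 200), "no_friend": (80, 800)},
        "B": {"added_friend": (330, 600), "no_friend": (32, 400)},
    }
    for page, segs in data.items():
        per_seg = {k: f"{h/n:.0%}" for k, (h, n) in segs.items()}
        overall = sum(h for h, _ in segs.values()) / sum(n for _, n in segs.values())
        print(page, per_seg, f"overall {overall:.1%}")

Page A wins inside both subgroups (60% vs 55%, 10% vs 8%) yet loses overall (20.0% vs 36.2%), because A pushes most of its users into the low-converting "didn't add a friend" subgroup.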


With random bucketing happening at the global level for any test, the proper thing to do is to take any segments that show interesting (and hopefully statistically significant) results that differ from the global results and test those segments individually so the random bucketing happens at that segment level.

There are two issues at play here -- one is that the sample sizes for the segments may not be high enough, the other is that the more segments you look at, the greater the probability of finding a false positive.
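
A small simulation makes the second issue concrete (the setup and numbers here are mine, not from the article): give control and treatment the exact same true conversion rate, split each into 10 segments, and count how often at least one segment looks "significant" at p < 0.05.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    k, n, p_true, sims = 10, 2000, 0.05, 2000   # segments, users per segment per arm, true rate, trials
    hits = 0
    for _ in range(sims):
        c = rng.binomial(n, p_true, size=k)      # control conversions per segment
        t = rng.binomial(n, p_true, size=k)      # treatment conversions per segment (same true rate)
        p_pool = (c + t) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (t / n - c / n) / se
        pvals = 2 * norm.sf(np.abs(z))           # two-sided two-proportion z-test
        hits += (pvals < 0.05).any()
    print(f"P(some segment 'wins' by chance) ~ {hits / sims:.2f}")   # roughly 0.4 for 10 segments

Per segment the false-positive rate stays near 5%, but across 10 segments you "find" something around 40% of the time, which is exactly why the interesting segment should be re-tested on its own with fresh, segment-level randomization.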


It can only happen with unequal populations. If you assign people to the control or test group randomly, you're fine (you can use statistical tests to rule out sample bias).
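
For a simple balance check (one common way to do it; the counts below are placeholders), a chi-squared test on the bucket-by-platform table works:

    from scipy.stats import chi2_contingency

    #           android   ios
    counts = [[  4800,    1200],   # control
              [  4650,    1350]]   # treatment
    chi2, p, _, _ = chi2_contingency(counts)
    print(f"chi2 = {chi2:.1f}, p = {p:.3f}")   # a small p hints the two buckets differ in mix

If the p-value is tiny, the buckets probably aren't comparable on that dimension, and a per-segment (or re-randomized) analysis is safer.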


I'm the author of this blog. Thank you for calling this out! I'll update the example to fix this :)


The example is fine; you're calling out confounding variables. Just call them confounders instead of Simpson's paradox.


FWIW, the arithmetic in that example also has a glitch. For the "Mobile" "Control" case, 100/3000 is about 3% rather than 10%.


What it is is confounding.



