That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner differs from the winner in each element of a partition, not just some of them.
Yeah, a good A/B testing framework would either refuse to let you break things down too far or show a prominent warning that the results aren't significant, but that doesn't always stop the business types from trying to find some way to show a win.
Yes, I don’t think it’s possible to observe Simpson’s paradox in a simple conversion test, either.
Simpson’s paradox is about spurious correlations between variables - conversion analysis is pure Bayesian probability.
It shouldn’t be possible to have a group as a whole increase its probability to convert, while having every subgroup decrease its probability to convert - the aggregate has to be an average of the subgroup changes.
Consider the case where iOS users are more likely to convert than Android users, but you currently have very few iOS users. You then A/B test a new design that imitates iOS, but has awful copy. Both iOS and Android users are less likely to convert, but it attracts more iOS users.
The group as a whole has higher conversion because of the demographic shift, but every subgroup has less.
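A toy numeric sketch of that scenario (all numbers invented): both platforms convert worse under the new design, yet the aggregate rate goes up because the traffic mix shifts toward the higher-converting platform.

```python
# (users, conversions) per platform -- made-up numbers
control = {"ios": (100, 40), "android": (900, 90)}    # iOS: 40%, Android: 10%
variant = {"ios": (500, 175), "android": (500, 45)}   # iOS: 35%, Android: 9%

def rate(bucket):
    users = sum(u for u, _ in bucket.values())
    conversions = sum(c for _, c in bucket.values())
    return conversions / users

# Every subgroup converts *less* under the variant...
for platform in control:
    cu, cc = control[platform]
    vu, vc = variant[platform]
    assert vc / vu < cc / cu

# ...yet the aggregate converts *more*, because the variant pulled in
# far more of the high-converting iOS users (demographic shift).
print(rate(control), rate(variant))  # 0.13 vs 0.22
```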
It could be self-selection happening after you randomized the groups. For example, a desktop landing page advertising an app, which might be installed on either mobile operating system.
Simpson's paradox is sometimes about spurious correlations, but the original paradox Simpson wrote about was simply a binary question with 84 subgroups, where 3 or 4 subgroups with the outlying answer had a large enough share of all samples, and a strong enough effect, to flip the aggregate.
Let's consider an example that would be a case of Simpson's Paradox. Suppose you are A/B testing two different landing pages, and you want to know which will make more people become habitual users. You partition on whether the user adds at least one friend in their first 5 minutes on the platform.

It might be that landing page A makes people who add a friend in the first 5 minutes more likely to become habitual users, and it also makes people who don't add a friend in the first 5 minutes more likely to become habitual users. But page A makes people less likely to add a friend in the first 5 minutes, and people who add a friend in the first 5 minutes are overwhelmingly more likely to become habitual users than people who don't.

So, in this case at least, it seems like the aggregate statistics are most relevant. But the fact that page A is bad mainly because it makes people less likely to add a friend in the first 5 minutes is also very interesting; maybe there is some way of combining A and B to get the good qualities of each and avoid the bad qualities of both.
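Here's what that might look like with invented numbers: page A beats page B within each subgroup, but loses overall because it shrinks the high-converting "added a friend" subgroup.

```python
# (users, became_habitual) split by whether a friend was added in
# the first 5 minutes -- hypothetical numbers
page_a = {"added_friend": (100, 70), "no_friend": (900, 135)}  # 70% / 15%
page_b = {"added_friend": (400, 240), "no_friend": (600, 60)}  # 60% / 10%

# A wins within each subgroup...
for seg in page_a:
    au, ah = page_a[seg]
    bu, bh = page_b[seg]
    assert ah / au > bh / bu

# ...but B wins overall: B gets 40% of users into the high-converting
# "added a friend" group, A only 10%.
total = lambda d: sum(h for _, h in d.values()) / sum(u for u, _ in d.values())
print(total(page_a), total(page_b))  # 0.205 vs 0.30
```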
With random bucketing happening at the global level for any test, the proper thing to do is to take any segments that show interesting (and hopefully statistically significant) results that differ from the global results, and re-test those segments individually so the random bucketing happens at the segment level.
There are two issues at play here -- one is that the sample sizes for the segments may not be high enough; the other is that the more segments you look at, the greater the probability of finding a false positive.
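The false-positive inflation is easy to quantify under the simplifying assumption that the segment-level tests are independent, each run at alpha = 0.05:

```python
# Chance of at least one spurious "significant" segment when there is
# no real effect anywhere, assuming independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:2d} segments -> {p_any_false_positive:.1%}")
# 1 segment -> 5%; 10 segments -> ~40%; 20 segments -> ~64%
```

A crude fix is a Bonferroni correction (test each segment at alpha / k), at the cost of statistical power.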
It can only happen with unequal populations. If you assign people to the control or test group randomly, you're fine (you can use statistical tests to rule out sample bias).
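One such check is a sample-ratio-mismatch test: a chi-square goodness-of-fit on the observed bucket sizes against the intended split. A minimal stdlib-only sketch (for 1 degree of freedom the chi-square p-value can be written with `erfc`); the numbers are made up:

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square test (1 df): did the observed split deviate
    from the intended bucketing ratio?"""
    total = n_control + n_treatment
    e_c = total * expected_ratio
    e_t = total - e_c
    chi2 = (n_control - e_c) ** 2 / e_c + (n_treatment - e_t) ** 2 / e_t
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df
    return chi2, p

print(srm_check(50_100, 49_900))  # p ~ 0.53: split looks fine
print(srm_check(50_500, 49_500))  # p ~ 0.002: investigate the bucketing
```

A tiny imbalance can be highly significant at large sample sizes, which is exactly why this check catches broken randomization that eyeballing the counts would miss.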