That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner differs from the winner in each element of a partition, not just some of them.
Yeah, a good A/B testing framework would either refuse to let you break things down too far or show a prominent warning that the results aren't significant, but that doesn't always stop the business types from trying to find some way to show a win.
Yes, I don’t think it’s possible to observe Simpson’s paradox in a simple conversion test, either.
Simpson’s paradox is about spurious correlations between variables - conversion analysis is pure Bayesian probability.
It shouldn’t be possible to have a group as a whole increase its probability to convert, while having every subgroup decrease its probability to convert - the aggregate has to be an average of the subgroup changes.
Consider the case where iOS users are more likely to convert than Android users, but you currently have very few iOS users. You then A/B test a new design that imitates iOS, but has awful copy. Both iOS and Android users are less likely to convert, but it attracts more iOS users.
The group as a whole has higher conversion because of the demographic shift, but every subgroup has less.
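A toy numeric sketch of that scenario (all numbers invented): both platforms convert worse under the new design, yet the aggregate rate goes up because the traffic mix shifts toward the higher-converting platform.

```python
# (users, conversions) per platform -- made-up numbers
control = {"ios": (100, 40), "android": (900, 90)}    # iOS: 40%, Android: 10%
variant = {"ios": (500, 175), "android": (500, 45)}   # iOS: 35%, Android: 9%

def rate(bucket):
    users = sum(u for u, _ in bucket.values())
    conversions = sum(c for _, c in bucket.values())
    return conversions / users

# Every subgroup converts *less* under the variant...
for platform in control:
    cu, cc = control[platform]
    vu, vc = variant[platform]
    assert vc / vu < cc / cu

# ...yet the aggregate converts *more*, because the variant pulled in
# far more of the high-converting iOS users (demographic shift).
print(rate(control), rate(variant))  # 0.13 vs 0.22
```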
It could be self-selection happening after you randomized the groups. For example, a desktop landing page advertising an app, which might be installed on either mobile operating system.
Simpson's paradox is sometimes about spurious correlations, but the original paradox Simpson wrote about was simply a binary question with 84 subgroups, where 3 or 4 subgroups with the outlying answer had a large enough share of all samples, and a strong enough effect, to flip the aggregate.
Let's consider an example that would be a case of Simpson's Paradox. Suppose you are A/B testing two different landing pages, and you want to know which will make more people become habitual users. You partition on whether the user adds at least one friend in their first 5 minutes on the platform.

It might be that landing page A makes people who add a friend in the first 5 minutes more likely to become habitual users, and it also makes people who don't add a friend in the first 5 minutes more likely to become habitual users. But page A makes people less likely to add a friend in the first 5 minutes, and people who add a friend in the first 5 minutes are overwhelmingly more likely to become habitual users than people who don't.

So, in this case at least, it seems like the aggregate statistics are most relevant. But the fact that page A is bad mainly because it makes people less likely to add a friend in the first 5 minutes is also very interesting; maybe there is some way of combining A and B to get the good qualities of each and avoid the bad qualities of both.
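Here's what that might look like with invented numbers: page A beats page B within each subgroup, but loses overall because it shrinks the high-converting "added a friend" subgroup.

```python
# (users, became_habitual) split by whether a friend was added in
# the first 5 minutes -- hypothetical numbers
page_a = {"added_friend": (100, 70), "no_friend": (900, 135)}  # 70% / 15%
page_b = {"added_friend": (400, 240), "no_friend": (600, 60)}  # 60% / 10%

# A wins within each subgroup...
for seg in page_a:
    au, ah = page_a[seg]
    bu, bh = page_b[seg]
    assert ah / au > bh / bu

# ...but B wins overall: B gets 40% of users into the high-converting
# "added a friend" group, A only 10%.
total = lambda d: sum(h for _, h in d.values()) / sum(u for u, _ in d.values())
print(total(page_a), total(page_b))  # 0.205 vs 0.30
```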
With random bucketing happening at the global level for any test, the proper thing to do is to take any segments that show interesting (and hopefully statistically significant) results that differ from the global results, and re-test those segments individually so the random bucketing happens at the segment level.
There are two issues at play here -- one is that the sample sizes for the segments may not be high enough; the other is that the more segments you look at, the greater the probability of finding a false positive.
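The false-positive inflation is easy to quantify under the simplifying assumption that the segment-level tests are independent, each run at alpha = 0.05:

```python
# Chance of at least one spurious "significant" segment when there is
# no real effect anywhere, assuming independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:2d} segments -> {p_any_false_positive:.1%}")
# 1 segment -> 5%; 10 segments -> ~40%; 20 segments -> ~64%
```

A crude fix is a Bonferroni correction (test each segment at alpha / k), at the cost of statistical power.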
It can only happen with unequal populations. If you assign people to the control or test group randomly, you're fine (you can use statistical tests to rule out sample bias).
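One such check is a sample-ratio-mismatch test: a chi-square goodness-of-fit on the observed bucket sizes against the intended split. A minimal stdlib-only sketch (for 1 degree of freedom the chi-square p-value can be written with `erfc`); the numbers are made up:

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square test (1 df): did the observed split deviate
    from the intended bucketing ratio?"""
    total = n_control + n_treatment
    e_c = total * expected_ratio
    e_t = total - e_c
    chi2 = (n_control - e_c) ** 2 / e_c + (n_treatment - e_t) ** 2 / e_t
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df
    return chi2, p

print(srm_check(50_100, 49_900))  # p ~ 0.53: split looks fine
print(srm_check(50_500, 49_500))  # p ~ 0.002: investigate the bucketing
```

A tiny imbalance can be highly significant at large sample sizes, which is exactly why this check catches broken randomization that eyeballing the counts would miss.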