The most fascinating aspect is that once you recognize the issue, you can't simply rely on using just the partitioned data or the aggregate. Be sure to read the "Implications to decision making" section.
The practical significance of Simpson's paradox surfaces in decision making situations where it poses the following dilemma: Which data should we consult in choosing an action, the aggregated or the partitioned? In the Kidney Stone example above, it is clear that if one is diagnosed with "Small Stones" or "Large Stones" the data for the respective subpopulation should be consulted and Treatment A would be preferred to Treatment B. But what if a patient is not diagnosed, and the size of the stone is not known; would it be appropriate to consult the aggregated data and administer Treatment B? This would stand contrary to common sense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.
On the other hand, if the partitioned data is to be preferred a priori, what prevents one from partitioning the data into arbitrary sub-categories (say based on eye color or post-treatment pain) artificially constructed to yield wrong choices of treatments? Pearl[2] shows that, indeed, in many cases it is the aggregated, not the partitioned data that gives the correct choice of action. Worse yet, given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data; with each story dictating its own choice. Pearl[2] considers this to be the real paradox behind Simpson's reversal.
As to why and how a story, not data, should dictate choices, the answer is that it is the story which encodes the causal relationships among the variables. Once we extract these relationships and represent them in a graph called a causal Bayesian network we can test algorithmically whether a given partition, representing confounding variables, gives the correct answer. The test, called "back-door," requires that we check whether the nodes corresponding to the confounding variables intercept certain paths in the graph. This reduces Simpson's Paradox to an exercise in graph theory.
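The kidney-stone reversal the quote refers to can be checked with a few lines of arithmetic. The counts below are the classic Charig et al. figures as I recall them from the Wikipedia table, so treat them as illustrative:

```python
# Kidney stone treatment success counts, per the classic table on
# Wikipedia (reproduced from memory; illustrative only).
# Format: (successes, total patients)
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within each stone size, Treatment A wins...
for stone in ("small", "large"):
    a = rate(*data["A"][stone])
    b = rate(*data["B"][stone])
    print(f"{stone} stones: A={a:.1%}  B={b:.1%}  A better: {a > b}")

# ...but aggregated over stone sizes, the comparison reverses.
totals = {t: (sum(s for s, _ in d.values()), sum(n for _, n in d.values()))
          for t, d in data.items()}
a_all, b_all = rate(*totals["A"]), rate(*totals["B"])
print(f"overall: A={a_all:.1%}  B={b_all:.1%}  A better: {a_all > b_all}")
```

The reversal happens because Treatment A was given mostly to the hard (large-stone) cases and Treatment B mostly to the easy ones, which is exactly the confounding the back-door test is designed to detect.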
To not stand too long on my soapbox, causal Bayesian networks are, I think, the most important tool for high-level statistics users. I wish Pearl were more popular.
Fortunately, Pearl has distilled the highlights of his book _Causality_ into a 50-page paper that makes a great introduction to the modern theory of causation:
The "Berkeley gender bias case" section made it make sense to me. Every single department was more likely to admit a woman than a man, and yet the school as a whole was more likely to admit a man than a woman.
This was because women were applying to more competitive departments, on average, so a lower percentage of women applicants were getting admitted.
> Every single department was more likely to admit a woman than a man
Not quite; you can see several exceptions to this in the table in the article.
The key points of the partitioned data were:
* No department was significantly biased against women.
* Most departments had a statistically significant bias towards women.
It's interesting to note that the departments that ARE biased towards men have more female applicants; I wonder if being in a minority group for that department is an even more important confounding factor.
Another example which recently made the news is comparing education systems of Wisconsin and Texas. Each ethnic group in Texas performs better than in Wisconsin, but Wisconsin has fewer of the low performing groups and therefore has higher test score averages.
In my high school stats class the example that was used was 2 airlines, one of which had better on-time arrival rates at every airport, yet in sum had a worse on-time arrival rate due to where more flights emanated from.
The examples definitely help. The Civil Rights vote did it for me:
* The South voted against it overwhelmingly.
* Democrats voted for it (compared to Republicans)
* But, because the Democrats held the majority of the Southern seats, their overall ratio in favor came out lower. So whichever party held the Southern seats was going to come out as less in favor, compared to the other party.
When I saw the post title I had my fingers crossed hoping "please be a paradox based on something from The Simpsons". Oh well... Interesting nonetheless
This imagined paradox arises when a percentage is provided without the underlying ratio. In this example, if only Bart's 90% in the first week were provided, without the ratio (9:10), the information would be distorted, producing the apparent paradox. Even though Bart's percentage is higher in both the first and second weeks, when the two weeks of articles are combined, Lisa improved a greater proportion overall: 55% of the 110 total articles. Lisa's total proportion of improved articles exceeds Bart's.
So it's important not to rely on percentage data alone, as it is a form of reduction. Rely instead on the ratios, which let you see how much weight each percentage actually carries.
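For concreteness, here's the arithmetic behind that Lisa/Bart example. The weekly counts are reproduced from memory of the Wikipedia article, so treat them as illustrative:

```python
from fractions import Fraction

# Wikipedia's Lisa/Bart editing example (figures reproduced from memory):
# (articles improved, articles edited) per week
lisa = [(60, 100), (1, 10)]
bart = [(9, 10), (30, 100)]

def overall(weeks):
    """Pool the raw counts, not the percentages."""
    return Fraction(sum(s for s, _ in weeks), sum(n for _, n in weeks))

# Bart's percentage is higher in each week...
for w, ((ls, ln), (bs, bn)) in enumerate(zip(lisa, bart), 1):
    print(f"week {w}: Lisa {ls/ln:.0%}, Bart {bs/bn:.0%}")

# ...but pooling the counts, Lisa wins overall: 61/110 vs 39/110.
print(f"overall: Lisa {float(overall(lisa)):.1%}, Bart {float(overall(bart)):.1%}")
```

Averaging the two weekly percentages would hide this entirely, because it throws away how many articles each percentage was computed over.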
On Amazon and review sites, when I search by rating, I always look at the number of reviews and weight accordingly. I remember that IMDB (before it became an Amazon property) implemented a Bayesian posterior mean to cull out the low-review anomalies: https://secure.wikimedia.org/wikipedia/en/wiki/Internet_Movi...
I wish other places like Amazon would implement similar weighting mechanisms to really allow a user to navigate by reviewed quality.
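IIRC the formula IMDB published for its Top 250 at the time was a shrinkage estimator of this kind; the constants below (`m`, `C`) are made-up illustrative values, not IMDB's actual parameters:

```python
# IMDB's (historically published) Top-250 weighting, a Bayesian posterior
# mean under simple assumptions:
#   WR = v/(v+m) * R + m/(v+m) * C
# where R = the item's mean rating, v = its number of votes,
#       m = a chosen minimum-votes threshold, C = mean rating over all items.

def weighted_rating(R, v, m, C):
    """Shrink a raw average toward the global mean C; few votes -> near C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 3.9   # hypothetical site-wide mean rating
m = 25    # hypothetical minimum-votes threshold

# A single 5-star review barely moves the estimate off the prior mean...
print(weighted_rating(5.0, 1, m, C))
# ...while 100 reviews averaging 4.6 mostly keep their face value.
print(weighted_rating(4.6, 100, m, C))
```

The effect is exactly the culling described above: a product with one glowing review ranks near the site average, while a well-reviewed product keeps roughly its raw score.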
I glance at the 5 star reviews then head straight for the 1 stars.
Reviews aren't really ratio based - a single counterexample of the right kind can completely dissuade me. I really don't want "low review anomalies" culled.
I watched the Benoit Mandelbrot TED talk earlier; he had a graph of the S&P stock market index compared with the same index with the five most anomalous trading days removed - it was very different. Stock market models, he said, try to smooth out the rough bits which are hard to handle, but that's really where the meat is.
Sorry, when I said low-review anomalies, I'm talking products that are not reviewed enough (i.e., 5-star from a single review). They likely give an inaccurate view of the product.
Comparing two separate products, of which one has 100 reviews and another has 7, I'll take the 100-review rating as more "interesting" than the 7-review product, even though the 7-review item maybe rated higher.
Simpson's paradox is amazing, it's worth reading all the examples in the article and really understanding it.
Another good source is Martin Gardner's "Aha! Gotcha" book. He presents the paradox as a woman trying to find eligible bachelors at a party. In room 1 her odds are better if she goes for guys with mustaches. In room 2 her odds are also better if she goes for guys with mustaches. But when everyone goes into one room, her odds are better for guys without mustaches. It's incredible.
It's not really that interesting, and the paradoxical/unexpected aspects of it go away once you start looking at everything through the lens of conditional probability theory. e.g. http://uncertainty.stat.cmu.edu/ goes over it as early as chapter 2.
This happens to students on a fairly consistent basis.
Suppose a project is worth 10% of the student's final mark while a midterm is worth 30% and the final exam is worth 60%. If the student performs well on the project, average on the midterm, but poorly on the final exam, their mark is still going to be poor due to the heavier weighting of the exam and midterm over the project.
EDIT: I know it's not a perfect example, but comparing marks between different students based on how they are weighted is essentially the idea.
That's not quite right. The "paradoxical" version would be something like:
Alice and Bob are taking a course where their grades are determined by 5 essays and/or presentations, and they can choose how many of each to do. Alice does one essay where she earns 80%, and 4 presentations where she earns an average of 90%. Bob does 4 essays earning an average of 85%, and one presentation where he earns 95%. For each assignment type Bob has a higher average than Alice, but Alice's overall grade (assuming equal weighting) is higher than Bob's: Alice gets 88% ((80 + 4 * 90)/5) and Bob gets 87% ((4 * 85 + 95)/5).
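Spelling that out with the numbers from the example:

```python
# Alice's and Bob's scores from the example above, as (type, score) pairs.
alice = [("essay", 80)] + [("presentation", 90)] * 4
bob   = [("essay", 85)] * 4 + [("presentation", 95)]

def avg(scores):
    return sum(scores) / len(scores)

def by_type(student, kind):
    return avg([s for t, s in student if t == kind])

# Bob wins within each assignment type...
assert by_type(bob, "essay") > by_type(alice, "essay")                # 85 > 80
assert by_type(bob, "presentation") > by_type(alice, "presentation")  # 95 > 90

# ...but Alice wins overall, because her work concentrates in the
# higher-scoring category.
overall_alice = avg([s for _, s in alice])  # (80 + 4*90)/5 = 88
overall_bob   = avg([s for _, s in bob])    # (4*85 + 95)/5 = 87
print(overall_alice, overall_bob)
```

The reversal is driven entirely by how the five assignments are distributed across the two types, which is the same mechanism as in the admissions and kidney-stone examples.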
This is not the same thing as Simpson's paradox. Specifically, as stated in the wiki, you need a grouping (one that is classically anti-causally related) showing one correlation which, upon combining the data, is reversed overall.
This does not happen in your example. But you do hint at something similar: the weights assigned to the different groups differ, and that is necessary for the paradox to occur.
Is this really the same thing? That just sounds like whoever chose the weightings got the result they wanted - a student who couldn't master the final exam got a poor mark.
Well, the comparison would be between different students. But the weightings essentially account for this difference. A better comparison would probably be different students with different marking schemes.
"A real-life example is the passage of the Civil Rights Act of 1964. Overall, a higher ratio of Republicans voted in favor of the Act than Democrats. However, when the congressional delegations from the northern and southern States are considered separately, a higher ratio of Democrats voted in favor of the act in both regions. This arose because regional affiliation is a very strong indicator of how a congressman or senator voted, whereas party affiliation is a weak indicator."
The chart then shows that the "Northern" House had 316 members vs. the South's 104, and that the "Northern" Senate had 78 members vs. the South's 22.
Uh... the South has Florida, Texas, and California. How can it be represented by less than a quarter of Congress? I mean, the Senate makes a bit more sense due to the tiny states of the northeast and such (though it still seems low), but the House too? Really?
Makes me wonder how exactly they classified South vs. North... Former Confederate states vs. everyone else? Either way, Geography cannot be a major factor.
California is not considered part of the South. It has a different heritage, history, demographic, income distribution, and in fact California's southern border is pretty close in terms of latitude to Louisiana's northern border, while its northerly border is similar to Pennsylvania's northern border.
In US geography, "The South" means something very different to the southern half of the country. Likewise, "The Midwest" is concentrated entirely in the eastern half of the country. "The South" is really the south-east corner of the country.
Is it just confederate states? Well, eleven states seceded, and 22 senators implies eleven states, so probably. Not sure where Kentucky fits in.
As others have noted, 'The South' in the context of 1964 is more likely to have meant ex-confederate states. Even today, nothing west of Texas is considered 'The South', though west of Texas inclusive is often called 'the Southwest'.
And, the South's population was relatively a lot smaller 45+ years ago. In 1960, Texas was the 6th most populous state, Florida 10th, Georgia 16th. Now Texas is 2nd, Florida 4th, Georgia 10th.
http://en.wikipedia.org/wiki/Simpsons_paradox#Implications_t...