
A more general lesson is that correlation and causation are unrelated: the former doesn't imply the latter, and the latter does not imply the former. Just because one thing causes another does not mean it will be correlated with it.

There is no contradiction in subsets having different correlations than the parent set. The apparent "paradox" arises from reading the data causally. The purpose of this lesson is to expose these assumptions in the interpretation of data. Few seem to get the message, though.




If anyone has doubts about the second claim, think about a hash function. The input certainly causes the output, but they are not correlated in a statistical sense.
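To make this concrete, here is a minimal sketch in Python (MD5 picked arbitrarily as the hash; the Pearson helper is hand-rolled to keep the block self-contained). The input fully determines the output, yet the linear correlation between them is essentially zero.

```python
import hashlib

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hash the integers 0..9999: each input causes its output deterministically.
xs = list(range(10000))
ys = [int.from_bytes(hashlib.md5(str(x).encode()).digest()[:4], "big")
      for x in xs]

r = pearson(xs, ys)  # near zero despite total causal determination
```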


Consider a medicine which kills everything with kidneys. It perfectly correlates with killing everything with a liver.

Consider another medicine which kills everything with kidneys, unless they have a liver (eg., which filters it). Now there is no correlation at all with an effect on the kidneys, nor will there ever be (since all animals with one have the other) unless someone deliberately impairs a liver.
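A toy simulation of that second medicine (all details invented): in a population where every animal with kidneys also has a liver, the lethal mechanism is real but produces zero observable association.

```python
import random
random.seed(0)

population = []
for _ in range(1000):
    medicine = random.random() < 0.5
    has_liver = True  # kidneys and livers always co-occur in this population
    # the medicine kills via the kidneys only when no liver filters it out
    dead = medicine and not has_liver
    population.append((medicine, dead))

# Zero deaths: no association between medicine and death is observable,
# even though the causal mechanism genuinely exists.
deaths = sum(dead for _, dead in population)
```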


Consider a medicine that completely cures Alzheimer's.

It also necessarily increases the incidence of other causes of death, as those who won't die of Alzheimer's will die of something else instead.


> A more general lesson is that correlation and causation are unrelated

This is a bit extreme. The tongue-in-cheek variant I like (which I first read about in the book referenced by TFA) is "no correlation without causation". In order for two things to truly co-vary (and not just by accident, or as a consequence of poor data collection/manipulation), there needs to be some causal connection between the two – although it can be quite distant.


> there needs to be some causal connection between the two

Umm.. No, there doesn't... This idea features in the earliest 20th C. writings on statistics, but it's pseudoscience.

If one carves up the whole history of the entire universe into all possible events, then there's likely to be a (near) infinite number of pairs of events which "perfectly" co-vary without any causal connection whatsoever. Indeed, one could find two galaxies that are necessarily causally isolated and find correlated events.

This is, in part, because the properties of two causally independent systems can have indistinguishable distributions -- just by the nature of what a distribution is.

It's this sort of thinking that I was aiming to rule out: really, they have nothing to do with each other. It's early 20th C. frequentist pseudoscience that has given birth to this supposed connection, and it should be thrown out altogether.

Causation is a property of natural systems. Correlation is a property of two distributions. These have nothing to do with each other. If you want to "test" for causation, you need to have a causal theory and a causal analysis in which "correlation" shouldn't feature. If you induce correlation by causal intervention, I'd prefer we gave that a different name ("induced correlation"), which is relevant to causation -- and it's mostly this confusion that those early eugenicists who created statistics were talking about.


> If one carves up the whole history of the entire universe into all possible events, then there's likely to be a (near) infinite number of pairs of events which "perfectly" co-vary without any causal connection whatsoever.

But if they are not linked by a stable causal connection, wouldn't they eventually diverge, if we observe long enough?


> But if they are not linked by a stable causal connection, wouldn't they eventually diverge, if we observe long enough?

I'm not sure why you would think so. All that's required is that the process they follow to generate observables is deterministic or law-like random.

Consider a possible universe where everything is deterministic, and at t=0 infinitely many objects are created, each with some very large number of measurable properties. Some never change, so property p=1,1,1,1,1,1,1, etc. forever. Some change periodically, p=1,0,1,0,1... etc.

Now I don't really see why there wouldn't be an infinite number of such correlated properties of objects with no causal relationship whatsoever.
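A two-line sketch of that point (Python; the two "rules" are invented and deliberately unrelated): two objects whose properties are generated independently, yet co-vary perfectly forever.

```python
def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Object 1's property over time, from one rule:
p1 = [t % 2 for t in range(1000)]
# Object 2's property, from a causally unrelated rule that happens to
# produce the same period-2 pattern:
p2 = [(3 * t) % 2 for t in range(1000)]

r = pearson(p1, p2)  # perfect correlation, zero causal connection
```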

Maybe you want to claim that the actual universe is chaotic over long time horizons, with finite objects, finite properties, etc., and that as t->inf the probability of finding properties which "repeat together" goes to zero. Maybe, but that's a radical claim.

I'd say it's much more likely that, e.g., some electron orbiting some atom somewhere vs. some molecule spinning, etc. will always be correlated -- just because there are so many ways of measuring stuff, and so much stuff, that some measures will by chance always correlate. Maybe, maybe not.

The point is that the world does not conspire to correlate our measures when causation is taking place. We can observe any sort of correlation (including 0) over any sort of time horizon and there can still be no causation.

In practice, this is very common. It's quite common to find some measurable aspects of some systems, over the horizons we measure them, "come together in a pattern" and yet have nothing to do with each other. I regard this as the default, rather than vice versa. At least every scientist should regard it as the default -- and yet much pseudoscience is based on a null hypothesis of no pattern at all.


There are subtleties in what you two are saying that I think are leading to miscommunication.

I think it is better to think about this through mutual information rather than "correlation"[0]; adding DAGs (directed acyclic graphs) also helps, but they are hard to draw here.

If causation exists between A and B, the two must also have mutual information. This is more akin to the vernacular sense of "correlation", which is how I believe you are using it. But statisticians are annoying and restrict "correlation" to mean linear association. In that case, no, causation does not necessitate nor imply linear correlation (/association).
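The distinction can be demonstrated numerically: with y = x² on a symmetric interval, y is entirely caused by x, the linear (Pearson) correlation is near zero, but the mutual information (here a crude plug-in histogram estimate; bin count chosen arbitrarily) is clearly positive.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mutual_info(xs, ys, bins=10):
    # Plug-in MI estimate (in nats) over a bins x bins histogram.
    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    bx = [min(bins - 1, int((x - lox) / (hix - lox) * bins)) for x in xs]
    by = [min(bins - 1, int((y - loy) / (hiy - loy) * bins)) for y in ys]
    n = len(xs)
    cxy, cx, cy = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log((c / n) / ((cx[i] / n) * (cy[j] / n)))
               for (i, j), c in cxy.items())

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(20000)]
ys = [x * x for x in xs]  # y is entirely caused by x

r = pearson(xs, ys)       # linear correlation: near zero by symmetry
mi = mutual_info(xs, ys)  # mutual information: clearly positive
```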

For mjburgess's universe example, I think it may depend on a matter of interpretation as to what is being considered causal here. A trivial rejection is that causation is through physics (they both follow the same physics), so that's probably not what was meant. I also don't really like the example because there's a lot of potential complexity that can lead to confusion[1], but let's think about the DAG. Certainly, tracing causality back, both galaxies converge to a single node (at worst, the Big Bang), right? They all follow physics. So both have mutual information with that node. *BUT* this does not mean that there is an arrow pointing from one branch to the other branch. Meaning that they do not influence one another and are thus not causally related (despite having shared causal "history", if you will).

Maybe let's think of a different bland example. Suppose we have a function f(x) which outputs a truly random discrete value, either 0 or 1 (no bias). Now consider all possible inputs. Does there exist an f(a) = f(b) where a ≠ b? With this example I think we can believe this is true, but you can prove it if you wish. We can even believe the stronger condition that there is no mutual information between a and b. In the same way, if we tracked the "origin" of f(a) and f(b), we would have to come through f (f "causes" f(a) and f(b)), but a and b do not need to be constructed in any way that relates to one another. We can complicate this example further by considering a different arbitrary function g which has a discrete output of [-1,0,1], or some other arbitrary (even the same) output, and following the same process. When doing that, we see no "choke point", and we could even pull a and b from two unrelated sets, so everything is entirely disjoint. Try other variations to add more clarity.

[0] I also corrected mjburgess through this too because a subtle misunderstanding led to a stronger statement which was erroneous https://news.ycombinator.com/item?id=41228512

[1] Not only the physics part but we now have to also consider light cones and what physicists mean by causation



  > correlation and causation are unrelated
This is incorrect (but what followed is correct).

You have extended the meaning of the phrase "correlation does not imply causation" to a stronger claim[0]. The correct way to say this is that "correlation and causation are not necessarily related."

The other way you might determine this was wrong is that "association"[2] always occurs when there is causation. So we have the classic A ⇒ B ⇏ B ⇒ A (A implies B does not imply B implies A), where ordering matters.

Last, we should reference Judea Pearl's Ladder of Causality[1].

[0] Another similar example was given to us by Rumsfeld with respect to the Iraq WMD search, where the error was changing "the absence of proof is not proof of absence" to the much stronger "the absence of evidence is not evidence of absence". It also illustrates why we might want to "nitpick" here https://archive.is/20140823194745/http://logbase2.blogspot.c...

[1] https://web.cs.ucla.edu/~kaoru/3-layer-causal-hierarchy.pdf

[2] Edit for clarity: The reason I (and Pearl) use the word "association" rather than "correlation" is that in statistics "correlation" often refers to a linear relationship. So "association" clarifies that there is mutual information. There might be masked relationships, i.e. non-linear ones. But if we are to use the standard vernacular of "correlation" (what most people think), then we could correctly say "causation implies correlation" (or, more accurately, "causation implies correlation, but not necessarily linear correlation"). And of course, causation implies high mutual information, but high mutual information does not imply causation :) https://stats.stackexchange.com/questions/26300/does-causati...


> "association"[2] always occurs when there is causation

This is incorrect. See Pearl's work itself. Association does not occur when there is a collider. https://en.wikipedia.org/wiki/Collider_(statistics)

Since almost all variables we measure are in uncontrolled environments, in almost all cases there is an opportunity to observe no association despite causation.

I give an example of this above:

> Consider another medicine which kills everything with kidneys, unless they have a liver (eg., which filters it). Now there is no correlation at all with an effect on the kidneys, nor will there ever be (since all animals with one have the other) unless someone deliberately impairs a liver.
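The collider structure behind this example can be simulated directly. A minimal sketch (Python, all variables invented), using XOR as a stand-in mechanism: A causally influences C, but an independent B masks it, so A and C show no marginal association; conditioning on B (the analogue of "deliberately impairing a liver") exposes the causal link.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 100_000
A = [random.randint(0, 1) for _ in range(n)]  # the medicine's kidney effect
B = [random.randint(0, 1) for _ in range(n)]  # independent masking factor
C = [a ^ b for a, b in zip(A, B)]             # outcome: A causes C via XOR

r_marginal = pearson(A, C)  # near zero: causation with no association

# Condition on B == 0 (impair the mask): the causal link appears in full.
A0 = [a for a, b in zip(A, B) if b == 0]
C0 = [c for b, c in zip(B, C) if b == 0]
r_conditional = pearson(A0, C0)
```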


I think our disagreement is coming down to the interpretation and nuance of your example.

Mutual information between random variables is zero iff the two random variables are independent.

In your example, you illustrate that the MI is non-zero. Sure, it is clear that it may appear zero during sampling, but that's a different story. I fully agree that there is an opportunity to observe no association. That is unambiguously accurate. But in this scenario you presumably haven't sampled animals with damaged livers. You can also have bad luck or improper sampling even when the likelihood of sampling the relevant cases is much higher! That doesn't mean that there is no association; it means there's no measured (or observed) association. The difference matters, black swans or not. Especially as experimentalists/analysts, it is critical we remember that our data and experimentation are a proxy, and of what. They too are models. These things are fucking hard, but it's also okay if we make mistakes, and I'd say the experiments are still useful even if they never capture that relationship.
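The point about sampling can be sketched numerically (all numbers invented): if only 1% of animals have an impaired liver, the association between medicine and death is real in the population, but a small study will usually contain no impaired-liver animal at all and so observe nothing.

```python
import random
random.seed(0)

# Hypothetical population: 1% of animals have an impaired liver; the
# medicine is lethal only when the liver fails to filter it out.
population = []
for _ in range(100_000):
    liver_ok = random.random() > 0.01
    medicine = random.random() < 0.5
    dead = medicine and not liver_ok
    population.append((medicine, dead))

# In the full population the association is plain: deaths occur only
# among medicated animals with impaired livers.
deaths_with_medicine = sum(d for m, d in population if m)
deaths_without = sum(d for m, d in population if not m)

# A small study (say 30 animals) has a high chance of containing zero
# impaired-liver animals, and hence of observing no association at all.
small_study = random.sample(population, 30)
```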

If we strengthen your example to the medicine always being (perfectly) filtered out by a liver (even an impaired one) and all animals must have livers, then it does not make your case either. We will be able to prune that from the DAG. The reason being that it does not describe a random variable... (lack of distribution). I think you're right to say that there is still a causal effect, but what's really needed is to extend the distribution we are sampling from to non-animals or at least complete ones. But the point here would be that our models (experiments) are not always sufficient to capture association, not that the association does not exist.

Maybe you are talking from a more philosophical perspective? (I suspect) If we're going down that route, I think it is worth actually opening the can of worms: that there are many causal diagrams that can adequately and/or equally explain data. I don't think we should shy away from this fact (nor the model, which is a subset of this), especially if we're aiming for accuracy. I rather think what we need to do is embrace the chaos and fuzziness of it all. To remember that it is not about obtaining answers, but finding out how to be less wrong. You can defuzz, but you can't remove all fuzz. We need to remember the unfortunate truth of science, that there is an imbalance in the effort of proofs. That proving something is true is extremely difficult if not impossible, but that it is far easier to prove something is not true (a single counter example!). But this does not mean we can't build evidence that is sufficient to fill the gaps (why I referenced [0]) and operate as if it is truth.

I gripe because the details matter. Not to discourage or to say it is worthless, but so we remember what rocks are left unturned. Eventually we will have to come back, so it's far better to keep that record. I'm a firm believer in allowing heavy criticism without rejection/dismissal, as that is required to be consistent with the aforementioned. If perfection cannot exist, it is also wrong to reject for lack of perfection.


I'm not sure what you mean by association here then.

If you mean to say that there are, say, an infinite number of DAGs that adequately explain reality -- and in the simplest, for this liver-kidney case, we don't see association, but in the "True DAG" we do -- then maybe.

But my point is, at least, that we don't have access to this True model. In the context of data analysis, of computing association of any kind, the value we get -- for any reasonable choice of formulae -- is consistent with cause or no cause.

Performing analysis as if you have the true model, and as if the null rival is just randomness, is pseudoscience in my view. Though, more often, it's called frequentism.


  > what you mean by association here then.
Mutual information

  > but in the "True DAG" 
I'm unconvinced there is a "true" DAG and at best I think there's "the most reasonable DAG given our observations." For all practical purposes I think this won't be meaningfully differentiable in most cases, so I'm fine to work with that. Just want to make sure we're on the same page.

  > But my point is, at least, that we dont have access to this True model.
Then we're in agreement, but it's turtles all the way down. Everything is a model and all models are wrong, right? We definitely have more useful models, but there is always a "truer" model.

Why I was pushing against your example is because I think it is important to distinguish lack of association because the data to form the association is missing or unavailable to us (which may be impossibly unavailable; and if we go deep enough, we will always hit this point) vs a lack of association because the two things are actually independent[0]. One can be found via better sampling where the other will never be found (unfortunately indistinguishable from impossibly unavailable information).

  > as-if you have the true model
Which is exactly why I'm making the point. We never have (or even have access to!) the "true" model. Just better models. That's why I say it isn't about being right, but less wrong. Because one is something that's achievable. If you're going to point to one turtle, for this, I think you might as well point to the rest. But there's still things that aren't turtles.

[0] I'll concede to an argument that "at some point" everything is associated, tracing back in time. Though I'm not entirely convinced of this argument, because of meta information.



