Correlation means how well the points fit on the line. It is unfortunate that th...

streptomycin · on Jan 27, 2013

one can tell there is no line which most of the points lie near

No, that's not correct. You can't conclude anything from looking at a big blob because you don't know the density of points at different places in the blob. This is the point the guy you replied to was making.

As an extreme example, imagine a billion data points that fit perfectly on a straight line. Then superimpose a million data points randomly on top of it. What does it look like? A big blob. But almost every point lies exactly on that straight line, so the overall correlation is still very high.
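
For what it's worth, a scaled-down version of this thought experiment is easy to check numerically. The sketch below is only an illustration under assumptions of my own: 100,000 points exactly on y = x plus 100 uniformly random points (the same 1000:1 ratio), and a hand-rolled Pearson function. The names `pearson` and `line_plus_noise` are made up for the example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def line_plus_noise(n_line, n_noise, seed=0):
    """Hypothetical data: n_line points on y = x, n_noise uniform random points."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_line):
        x = rng.random()
        xs.append(x)
        ys.append(x)             # exactly on the line
    for _ in range(n_noise):
        xs.append(rng.random())  # anywhere in the unit square
        ys.append(rng.random())
    return xs, ys

xs, ys = line_plus_noise(100_000, 100)
r = pearson(xs, ys)
print(r)
```

With these assumed sizes the coefficient stays very close to 1, even though the random points, drawn fully opaque on top, are scattered over the whole plotting area.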

randomknowledge · on Jan 28, 2013

Okay, how's this: unless the author of the original blog post is deliberately deceiving the audience by putting a bunch of points on top of each other, one can tell there is no line which most of the points lie near.

I agree that a scatter plot is not the best for showing data with that many data points, but frankly it's kind of irrelevant. The point the author was making was just that it isn't that hard to cook up some data that is not highly correlated, but will be if you bin and average it.
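
A rough sketch of that effect, on synthetic data rather than the blog author's: x is uniform on [0, 1) and y is x plus Gaussian noise, with the noise level (0.92) picked by me so the raw correlation comes out near 0.3. The points are then split into 10 equal-width bins on x and each bin is replaced by its mean. All names and parameters here are assumptions for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def bin_and_average(xs, ys, n_bins):
    """Split points into equal-width bins on x; return per-bin mean x and mean y."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sum_x = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi into last bin
        sum_x[i] += x
        sum_y[i] += y
        count[i] += 1
    mean_x = [sx / c for sx, c in zip(sum_x, count) if c]  # skip empty bins
    mean_y = [sy / c for sy, c in zip(sum_y, count) if c]
    return mean_x, mean_y

rng = random.Random(0)
xs = [rng.random() for _ in range(100_000)]
ys = [x + rng.gauss(0.0, 0.92) for x in xs]  # weak signal, heavy noise

raw_r = pearson(xs, ys)
bx, by = bin_and_average(xs, ys, 10)
binned_r = pearson(bx, by)
print(raw_r, binned_r)
```

Under these assumptions the raw coefficient is roughly 0.3 while the coefficient of the ten bin means is above 0.99. Averaging removes most of the within-bin noise, so the binned figure describes the trend of the means, not how well individual points follow it.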

yummyfajitas · on Jan 28, 2013

It's not a matter of being deliberate. He's just being lazy, using the standard Excel scatterplot. To fix that graph, he needs to carefully choose an opacity, which he didn't do.

See the discussion here: http://news.ycombinator.com/item?id=4027337

I stand by the claim made in my blog post: don't use scatterplots, use a density plot instead.
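
To make the idea concrete, here is a minimal sketch of the counting step behind a density plot, not a plotting recipe. It assumes points in the unit square, a 10x10 grid, and made-up data (100,000 points on y = x plus 100 random ones); `density_grid` is a hypothetical helper name.

```python
import random

def density_grid(xs, ys, n_cells):
    """Count points per cell of an n_cells x n_cells grid over the unit square."""
    grid = [[0] * n_cells for _ in range(n_cells)]
    for x, y in zip(xs, ys):
        col = min(int(x * n_cells), n_cells - 1)
        row = min(int(y * n_cells), n_cells - 1)
        grid[row][col] += 1
    return grid

rng = random.Random(0)
xs, ys = [], []
for _ in range(100_000):     # dense structure: points exactly on y = x
    x = rng.random()
    xs.append(x)
    ys.append(x)
for _ in range(100):         # sparse background scattered everywhere
    xs.append(rng.random())
    ys.append(rng.random())

grid = density_grid(xs, ys, 10)
total = sum(sum(row) for row in grid)
diag_share = sum(grid[i][i] for i in range(10)) / total
print(diag_share)
```

Shading each cell by its count shows where the mass of the data sits, which a scatterplot with fully opaque markers cannot show once points overlap.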

Incidentally, according to a comment the author made, the correlation is actually 0.3. That's far better than his graph suggests.

randomknowledge · on Jan 29, 2013

I agree he could have presented his data better. I also recognize that there is some correlation. Nonetheless, the story was about how averaging can be used to make a correlation appear stronger than it is, and going from 0.3 to 0.99 certainly meets that criterion. I was responding to the claim that the data might already have a strong correlation (presumably on the order of 0.99), and I was arguing that no "natural" strongly correlated data would have a scatter plot like that.

I might add that discussions like these illustrate a broader problem with the Hacker News community. Rather than discussing the facts of the story, many of the posts are instead about the less relevant detail of how the author chose to present them, despite the fact that the author's point is still effectively made (and if the 0.3 were included there wouldn't have been any doubt).

BTW: tmoertel did some analysis on the data: http://news.ycombinator.com/item?id=5125851 and the correlation appears to be around 0.2, which is pretty weak.