
Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.

From the Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
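If you want to check the "nearly identical statistics" claim yourself, here is a minimal sketch in plain Python. The values are the quartet as it is commonly tabulated, so double-check them against the article before relying on them. The names `quartet`, `pearson`, and `summary` are just mine for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet as commonly tabulated (verify against the article).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values for sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand to avoid version-specific helpers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# For each set: mean(x), sample variance(x), mean(y), sample variance(y), correlation.
summary = {
    name: (mean(xs), variance(xs), mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, [round(s, 2) for s in stats])
```

The four printed rows come out almost the same, which is the whole point: none of these numbers tells you that one set is a curve and another is a vertical line with a single outlier.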




There's a fun paper by Autodesk where they make datasets that look whatever way you want them to.

https://www.autodesk.com/research/publications/same-stats-di...


Yes, it's a lesson that is both funny and insightful: it shows how weak basic indicators (mean, standard deviation, correlation) can be, the uses and limits of box plots with quartiles and whiskers, and the benefit of violin plots.

Definitely worth a quick look.


NB: Post author here.

This is great! So fun...will have to use in the future...


While I have your attention...

> For the median and average to be equal, the points less than the median and greater than the median must have the same distribution (i.e., there must be the same number of points that are somewhat larger and somewhat smaller and much larger and much smaller).

[0, 2, 5, 9, 9] has both median and mean = 5, but the two sides don't really have the same distribution.


Totally true...thoughts on how I could rephrase? I guess it's more the "weight" of points greater than and less than the median should be the same, so symmetric distributions definitely have it, asymmetric may or may not. Definitely open to revising...
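One way to make the "weight" idea concrete is a rough sketch like this (not a proof, and the variable names are just illustrative). The mean equals the median exactly when the total distance of the points above the median matches the total distance of the points below it. The [0, 2, 5, 9, 9] example satisfies that without being symmetric.

```python
from statistics import mean, median

data = [0, 2, 5, 9, 9]
m = median(data)

# Distances from the median on each side.
below = sorted(m - x for x in data if x < m)
above = sorted(x - m for x in data if x > m)

# mean - median == (sum(above) - sum(below)) / len(data),
# so equal totals on both sides force mean == median.
balanced = sum(below) == sum(above)

# Symmetry is the stronger condition: the two sides mirror each other point by point.
symmetric = below == above

print("mean:", mean(data), "median:", m)
print("below:", below, "above:", above)
print("balanced:", balanced, "symmetric:", symmetric)
```

So "balanced" is the condition that actually matters for mean = median. Symmetry guarantees it, but it isn't required.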


> symmetric distributions definitely have it, asymmetric may or may not

Doesn't have to be any more complicated than that. It's more a curio than an important point anyway :)


This is excellent information, thank you for posting this! I was not familiar with this example previously, but it is a perfect example of summary statistics not capturing certain distributions well. It's very approachable, even if you had to limit the discussion to mean and variance alone. Bookmarked, and much appreciated.


NB: Post author here.

That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.


>importance of graphing data before analyzing it

Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".


Interesting, thanks for the concept. What's one to do for high dimensional data then?



