Beware the Mean (stephen.sh)
94 points by sjwhitworth on Jan 28, 2020 | 36 comments



Other, related points:

- With heavy tails, the sample mean (i.e. the number you can see) is very likely to underestimate the population mean.

- With heavy enough tails, higher moments like variance (and therefore standard deviation) do not exist at all -- they're infinite.

- Critically: With heavy tails, the central limit theorem breaks down. Sums of heavy-tailed samples converge to a normal distribution so slowly it might not realistically ever happen with your finite data. Any computation you do that explicitly or implicitly relies on the CLT will give you junk results!
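
To make the first and third points concrete, here is a small simulation (my own sketch, not from the article): it draws repeated samples from a Pareto distribution with tail index α = 1.2, whose population mean is finite (α/(α−1) = 6) but whose variance is infinite, and checks how often the sample mean comes in low.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.2                      # heavy tail: mean exists, variance is infinite
true_mean = alpha / (alpha - 1)  # population mean of a Pareto(alpha) with x_min = 1

n_samples, n_trials = 1_000, 10_000
# numpy's pareto() is shifted to start at 0, so add 1 for the classic Pareto on [1, inf)
sample_means = (rng.pareto(alpha, size=(n_trials, n_samples)) + 1).mean(axis=1)

print(f"population mean:            {true_mean:.2f}")
print(f"median of the sample means: {np.median(sample_means):.2f}")
print(f"P(sample mean < true mean): {(sample_means < true_mean).mean():.1%}")
```

Typically more than half of the trials land below the true mean, and the sample means themselves are still strongly right-skewed rather than normal even at n = 1,000, which is the CLT slowdown in the third point.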


Can you elaborate on the part about sample vs population mean?

The way I see it, in these scenarios you aren't looking at a sample mean. There is no reason to sample your customer base to get an estimate of your average revenue; you can calculate it from the entire population.


Your current customer base is just a sample of your total market, for example. Next year’s (larger) customer base will be slightly closer to the whole set (so different) but still just a sample.


So the point is that the current customer base is a skewed sample of the theoretical customer population.

Therefore we shouldn’t look at the current customer mean profit to predict what would happen to profits if we doubled the customer base.


Exactly. These summaries are often used for prediction, which means historic data is used as a sample of the same distribution as future data.

Even when comparing two different historical data sets you have to be careful: if you're doing anything that resembles hypothesis testing (i.e. trying to figure out if something you changed made a difference) you're not really comparing two historical data sets -- you're trying to compare the underlying distributions from which the historical data sets were drawn, but hoping that the historical data are representative samples from those.
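
For what that looks like in practice, here is a minimal permutation-test sketch (my own illustration; the data are made up): it asks how often a random relabelling of the two data sets produces a difference in means at least as large as the observed one, i.e. it treats both sets as samples from a common distribution and sees how surprising the observed split would be.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                    # random relabelling of the pooled data
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        hits += diff >= observed
    return hits / n_permutations               # approximate p-value

# Hypothetical per-customer revenue before and after a change -- one whale skews "before".
before = [10, 12, 9, 11, 300]
after = [11, 13, 10, 12, 14]
print(f"p ~= {permutation_test(before, after):.3f}")
```

The observed gap in means looks huge, but the p-value comes out large, because a single heavy observation can produce a gap of that size under almost any relabelling.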


It's not about the doubling, it's about how the distribution changes as you gather more data and get closer to the actual mean (e.g. if you had EVERY possible customer ever, like cable companies).

Take Tesla, for example. Its first customers were enthusiasts, but the mean of that group is not the mean of all current Tesla customers - they would look very different, and had Tesla optimized for the latter we'd be looking at a very different outcome.


With heavy enough tails even the mean might not exist.


Every non-empty set of numbers has a mean.

Perhaps you mean that, in a multimodal distribution, the mean might not resemble any individual member of the population?


A finite set of numbers has a mean, yes. A probability distribution with heavy enough tails does not have a mean (e.g. the Cauchy distribution).


Sure, but let's not scope creep the conversation. This is in the context of talking about summary stats calculated against sets of discrete observations, not properties of abstract probability distributions. So we're talking about doing some basic arithmetic on a list of numbers, not taking integrals.


If your data comes from a probability distribution that doesn't have a mean, then calculating the mean of your data is basically meaningless (no pun intended). In a Cauchy distribution, for example, the mean of a million datapoints is just as likely to be a thousand units off from the true center of the distribution as any individual datapoint is.

This is not scope creep, and it is not an abstract academic concern. I've actually seen people run into this in practice: "I quadrupled the size of my dataset, why is my sample mean still wildly inaccurate?"

If you're not aware of the properties of abstract probability distributions then your basic arithmetic on a list of numbers may well be completely useless.
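
A minimal sketch of that Cauchy behaviour (mine, for illustration), using a standard Cauchy distribution centred at 0:

```python
import numpy as np

rng = np.random.default_rng(1)

# The standard Cauchy distribution is centred at 0 but has no mean:
# the sample mean never settles down, no matter how much data you collect.
for n in (100, 10_000, 1_000_000):
    sample = rng.standard_cauchy(n)
    print(f"n = {n:>9,}   sample mean = {sample.mean():10.2f}   "
          f"sample median = {np.median(sample):7.3f}")
```

The median homes in on the true centre; the mean keeps jumping around, which is exactly the "quadrupled my dataset, still wildly inaccurate" experience.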


"If your data comes from a probability distribution that doesn't have a mean then calculating the mean of your data is basically meaningless (no pun intended)."

I think this is actually the major takeaway for this sort of discussion. All statistical measures carry with them assumptions about the underlying distribution. When you blindly use one without verifying the underlying distribution, you are asking to be lied to by your statistics.

It took me a while, but I've trained myself to actively ask the question "is this appropriate for the underlying distribution?" whenever I see a "mean" or a "standard deviation". Spoiler: the answer is usually "no"! Central limit theorem notwithstanding, we encounter a lot of non-Gaussian and outright pathological distributions in the real world. The average is useful for more than just Gaussian data, but it means different things for different distributions, most of them quite unlike what our Gaussian-trained intuition suggests. Standard deviation is really Gaussian-only, or at least, if you insist on using it, you ought to pair it with the skew, kurtosis, or other measures of how non-Gaussian your distribution is. Remembering that there's only one way to be Gaussian but an infinity of ways to be non-Gaussian helps too. This video is useful for visualizing that: https://www.youtube.com/watch?v=iwzzv1biHv8
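
A cheap version of that check, as a sketch (assuming SciPy is available; skewness and excess kurtosis near 0 are necessary, though not sufficient, for Gaussian-style intuition to apply):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# A lognormal sample stands in for typical skewed real-world data (revenue, latencies, ...).
data = rng.lognormal(mean=0.0, sigma=1.5, size=50_000)

print(f"mean = {data.mean():.2f}, std = {data.std():.2f}")
print(f"skewness        = {stats.skew(data):6.1f}   (0 for a Gaussian)")
print(f"excess kurtosis = {stats.kurtosis(data):6.1f}   (0 for a Gaussian)")
within_one_sd = np.mean(np.abs(data - data.mean()) < data.std())
print(f"share within one standard deviation of the mean = {within_one_sd:.1%}   (68% for a Gaussian)")
```

When those numbers are far from their Gaussian values, "mean ± standard deviation" stops describing where the data actually sit.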


No, the point is that for certain sets of numbers, calculating the mean does not give you a summary statistic. It just gives the average of a bunch of numbers, not a number that is representative of your population.


This is why using Gaussian models to predict something that's fundamentally fat-tailed (e.g. IQ vs. many forms of societal achievement) is doomed to failure.


To add to the arguments already presented, it's not like you never encounter practical scenarios where a mean does not exist.

A good example would be earthquakes. The commonly used Gutenberg–Richter law [1] says that the number of earthquakes decreases exponentially with the magnitude, while the intensity (released energy) increases exponentially with the magnitude. In seismically active regions the frequency and intensity even appear to be inversely proportional. The result is a model in which there is no average intensity. Sure, this model is bound to break down eventually, but until the first earthquake that quite literally breaks the mould, we likely won't know what the upper bound may be.

[1]: https://en.wikipedia.org/wiki/Gutenberg%E2%80%93Richter_law
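
Roughly, the standard argument (my sketch; b is the Gutenberg–Richter exponent and c the energy-magnitude scaling constant):

```latex
% Event counts fall off exponentially in magnitude m, while the released
% energy grows exponentially in m:
\[
  N(\ge m) \propto 10^{-bm}, \qquad E(m) \propto 10^{cm}.
\]
% Eliminating m gives a power-law (Pareto) tail for the energy,
\[
  \Pr(E > x) \propto x^{-b/c},
\]
% and a Pareto tail with exponent b/c <= 1 has no finite mean.
% Empirically b is about 1 and c about 1.5, so b/c is about 2/3 < 1:
% under this model the "average earthquake energy" diverges.
```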


As an example of your infinite-variance case: sums of Cauchy-distributed samples don't even converge to a normal distribution (the average of any number of Cauchy draws is itself Cauchy distributed).


If the author is seeing this thread: I couldn't find an RSS feed for your site. I don't know if they are difficult to set up, but if it's very little effort, I would appreciate seeing what you post next :)

As for the wariness about the mean: a lot of people are much further behind than thinking about different distributions. Even something you assume is normally distributed needs a mean and a variance! As for visualising, histograms are incredibly underrated tools. You can infer a lot of information just by looking at a distribution.


Sorry, I do need to set this up. It's just a bunch of Markdown at the moment. Thanks for reading!


You could always try to contact him directly:

https://stephen.sh/about


The mean is misleading. The median is misleading. The mode is misleading. Any reduction of a range of data to a single representative datum is misleading.

However, the pushback against providing something more meaningful than a single value can sometimes be quite strong.

I try hard to provide software estimates as probability distributions, but when someone sees a line with a probability peak somewhere around two days (could be really simple), and then a wide hump somewhere around two weeks (if it's not simple, it will mean a significant rewrite), with a very low line between them and then a long, long tail off to several months, it is not well-received.

I can see their point; they're trying to plan things, and the whole system is set up to work with single numbers. If everyone provided probability graphs for their estimates, and we had a tool that could then combine them and deliver the net probability graph of the combined pieces, I expect they'd be a lot more amenable.
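
That combining tool is easy to sketch if every estimate is something you can sample from. Below is a toy Monte Carlo version (entirely made-up task shapes, chosen to look like the "quick, unless it needs a rewrite" estimate described above): sample each task, add the samples, and read off whatever percentiles the planners want.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000  # Monte Carlo samples

def task_estimate(p_simple, simple_days, rewrite_days):
    """Sample completion times for one task: usually quick, sometimes a rewrite.
    (Lognormal humps around each mode -- purely illustrative shapes.)"""
    simple = rng.lognormal(np.log(simple_days), 0.3, N)
    rewrite = rng.lognormal(np.log(rewrite_days), 0.6, N)
    return np.where(rng.random(N) < p_simple, simple, rewrite)

# Three hypothetical tasks done one after another, so the durations add.
project = (task_estimate(0.6, 2, 14)
           + task_estimate(0.8, 1, 10)
           + task_estimate(0.5, 3, 20))

for q in (50, 80, 95):
    print(f"P{q}:  {np.percentile(project, q):5.1f} days")
print(f"mean: {project.mean():5.1f} days")
```

With shapes like these the mean lands above the P50, pulled up by the rewrite tail; the P80 and P95 are the numbers a plan can actually be built on.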


> The mean is misleading. The median is misleading. The mode is misleading. Any reduction of a range of data to a single representative datum is misleading.

For anyone who hasn't seen it before, Anscombe's Quartet is a nice visual of this (and actually goes a bit further, showing reduction in general can be misleading, not just to a single point).

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
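
A quick numeric check of that, as a sketch (it uses seaborn's bundled copy of the quartet, which is fetched from the seaborn-data repository the first time you call it):

```python
import seaborn as sns

# Anscombe's quartet: four datasets with (near-)identical summary statistics
# that look completely different when plotted.
df = sns.load_dataset("anscombe")

for name, g in df.groupby("dataset"):
    print(f"dataset {name}:  mean(x)={g.x.mean():.2f}  var(x)={g.x.var():.2f}  "
          f"mean(y)={g.y.mean():.2f}  var(y)={g.y.var():.2f}  corr={g.x.corr(g.y):.3f}")
```

All four rows print essentially the same numbers; only a plot reveals how different the datasets really are.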


I was just coming here to post that. A very neat (and surprising) thing! :)


> they're trying to plan things

Each divorce starts with a marriage. Each delay starts with a plan.


What is your solution in practice?


Nassim Taleb greatly expands on this point in Antifragile. For a freely available, technical argument, check out Doing Statistics Under Fat Tails [1].

[1]: https://www.fooledbyrandomness.com/FatTails.html


This is a worthy posting, particularly as so much becomes iterative statistics in "A.I." clothing. The two old (slightly hackneyed) counter-examples which are popular in lectures about measures of the central tendency are:

- You're trying to get a sense of the typical income in a room and then Bill Gates wanders in. Suddenly the average income becomes an amount that no one in the room actually earns.

- What is the average number of testicles in the human population? That computed central tendency is quite rare.


The second one doesn't seem that bad to me. The number can still be used to answer common questions like, "Assuming it takes 10 seconds to tickle a testicle, how long do I have to tickle testicles if I want to tickle every testicle in my apartment complex?"

Sure, you have to know what the mean means, but it's still a useful number. Your first example is more indicative of the problem, IMO.


I would quibble some here. When we look at revenue, I agree: ignore the mean. If there's a whole bunch of people not paying you anything, that's OK... Look at the 50th and 90th percentiles.

But profit, and similarly costs? Your mean customer better be profitable, or you won't be. How much the people on the left of the graph cost you is important.

Part of this is definitional, too. Do you include that far left part of the graph where people are not really paying you as a "customer"?


> Part of this is definitional, too. Do you include that far left part of the graph where people are not really paying you as a "customer"?

Exactly. Often when the mean is problematic, it's because there's multiple subgroups in the data with very different behavior, and when you split those up, you can learn a lot more from the numbers.


A “mean” profitable customer would be misleading. You could see that 99% of your customers lose money for you while 1% earn all the profit (aka a fat tail).

The point is you should be careful about reducing huge swaths of data to one datum. It often hides the more interesting insights.


> A “mean” profitable customer would be misleading.

In no place did I suggest "mean profitable customer" as a metric. I sort of said the opposite.

At the same time, in a traditional industry we wouldn't consider people with an evaluation license or sample or who came in to wander and see our wares a "customer". They're a lead and have a cost associated with them.

> The point is you should be careful about reducing huge swaths of data to one datum. It often hides the more interesting insights.

Sure. At the same time, if you dance with 30 numbers looking for insight, pretty soon you're practicing qualitative, wishy-washy innumeracy. The discussion is about KPIs, which are all about "reducing huge swaths of data to one datum" but also absolutely essential to run a real business day to day.

Monitor more than one, and keep your eyes open for where they go wrong, and be ready to change what you do.


The mean is not so bad for many purposes because it is an expectation value.

If you add up your revenue, subtract your expenses, and divide by the number of customers, that gives you a real profit number per customer (conditional, of course, on how you define revenue and expenses). Whether that number is negative or positive is meaningful.

The median, on the other hand, has a different set of problems. If you are running a game like Fate Grand Order you'd better cultivate the guy who spends $70k because he has to "catch them all". The median player probably pays little or nothing, while the guy who sells ero comics at Comiket complains about what it costs to get (say) Saber Bride, but it is worth more to him than it is to the median player.

Mean and median are terrible numbers to use for latency; what drives you nuts when your computer is unresponsive is not the median latency but the 99th-percentile latency.
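
To put numbers on that last point, a small sketch with made-up lognormal "latencies" standing in for real measurements:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up request latencies in milliseconds: mostly fast, with a long slow tail.
latencies_ms = rng.lognormal(mean=np.log(20), sigma=1.2, size=100_000)

print(f"mean:   {latencies_ms.mean():7.1f} ms")
print(f"median: {np.median(latencies_ms):7.1f} ms")
print(f"p99:    {np.percentile(latencies_ms, 99):7.1f} ms")
```

The median looks perfectly healthy while the p99 is an order of magnitude worse, and the p99 is what a user stuck waiting actually experiences.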


> The mean is not so bad for many purposes because it is an expectation value.

Implicit assumptions:

1. The sample mean is an accurate estimator of the expectation.

2. The expectation is a useful number.

Both of these are false surprisingly often; latencies, which you mention, are one example.


I've come across too many people who valued the mean soooooo much in their analysis. Well, some of them made mistakes and the project died. Hypothesis: heavy reliance on the mean increases the probability of failure in the internet industry. This reminds me of PG's essay "Mean People Fail": http://www.paulgraham.com/mean.html

Pun intended :)


The Iranian civilization can draw continuity back to Susa, circa 3000 BC, further back than China. The Mesopotamian and Indian civilizations are older still but broke continuity.


I think you meant to reply to this post https://news.ycombinator.com/item?id=22166846. :)



