> If you must use average, then you should remove the top 5% from the average. 5...

wredue · on Sept 28, 2023

5% is not arbitrary. It removes the massively skewing outliers, who have also seen outsized wage growth compared to everyone else.

Looking at wage movement as slices of where one falls gives a clearer picture.

Additionally, it doesn’t make sense to remove low-wage outliers along with high-wage outliers when averaging, because the median wage is much closer to the low-wage outliers than it is to the high-wage outliers.

For example, if you take the median plus or minus itself as the range for your dataset (if the median is 48,000, then use 0 through 96,000), you’ll be removing more than just the top 5%, and yet this also gives a far far far better picture of the dire economic position and of what is happening with real wage movement.
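A minimal sketch of that "median plus or minus the median" window, using made-up numbers. The wage list and the helper name `mean_within_median_window` are hypothetical, picked only so that the median lands on the 48,000 from the example; nothing here is real wage data.

```python
from statistics import mean, median

# Hypothetical toy wages, invented for illustration only.
wages = [18_000, 25_000, 32_000, 40_000, 48_000,
         55_000, 70_000, 90_000, 5_000_000]

def mean_within_median_window(values):
    """Return (mean of values in [0, 2 * median], number of values dropped)."""
    m = median(values)
    kept = [v for v in values if 0 <= v <= 2 * m]
    return mean(kept), len(values) - len(kept)

window_mean, dropped = mean_within_median_window(wages)
print("median:", median(wages))
print("plain mean:", round(mean(wages)))
print("mean inside 0..2*median:", window_mean, "after dropping", dropped)
```

On this toy list the window drops only the single very large value, and the windowed mean ends up close to the median while the plain mean does not.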

squeaky-clean · on Sept 28, 2023

It's arbitrary because why not 4% or 6%? 5 just happens to be a pleasant number, but there's nothing special about it in an economic sense.
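To see how much the choice of cutoff can matter, here is a rough sketch on invented data. The wage list and the helper `top_trimmed_mean` are assumptions for illustration; how far the result moves between 4%, 5%, and 6% depends entirely on the actual distribution.

```python
from statistics import mean, median

def top_trimmed_mean(values, top_percent):
    """Mean after dropping the highest `top_percent` percent of values (rounded down)."""
    ordered = sorted(values)
    n_drop = len(ordered) * top_percent // 100
    return mean(ordered[: len(ordered) - n_drop])

# Hypothetical toy data: 95 evenly spaced wages plus 5 very large ones.
wages = [20_000 + 500 * i for i in range(95)]
wages += [200_000, 300_000, 500_000, 1_000_000, 5_000_000]

for cut in (4, 5, 6):
    print(f"top {cut}% removed: {top_trimmed_mean(wages, cut):.0f}")
print("median:", median(wages))
```

With these numbers the 4% cut still includes one of the large values, so it lands noticeably higher than the 5% and 6% cuts. The median needs no cutoff at all.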

Gormo · on Sept 29, 2023

Why would you cut off a fixed percentage rather than use a standard deviation range?

ralphm · on Sept 29, 2023

Because wages are non-Gaussian, standard deviation doesn't mean anything in this context.

Gormo · on Oct 9, 2023

Interesting. Most income graphs I've seen appear to be normally distributed -- could you explain what would cause them not to be, or provide any examples that demonstrate non-Gaussian distributions?

But if wages are non-Gaussian, how would trimming the top and bottom 5% off of your sample be any better than using standard deviation ranges to control for outliers? The assumption that outliers are to be found on the top and bottom of your range is one that seems to apply to Gaussian distributions, and doesn't necessarily hold for others, regardless of whether you are using fixed percentage values or standard deviation thresholds. For example, in a bimodal distribution, outliers might be found in the center.

permo-w · on Sept 28, 2023

median is a type of average. the parent commenter means the mean

ryanisnan · on Sept 28, 2023

Is this correct? I know no definition where median can be interpreted as an average.

mlyle · on Sept 29, 2023

e.g. M-w.com:

> 1 a : a single value (such as a mean, mode, or median) that summarizes or represents the general significance of a set of unequal values

Average isn't well defined-- if you have to guess, people probably intend "arithmetic mean" when they say it. But if you're using it in the sense of a single number that represents the typical value of a dataset, you may want the median, mode, or midrange, or even a geometric or harmonic mean depending upon circumstance.
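As a small illustration of how far these "averages" can disagree, here is a sketch using Python's standard `statistics` module on a made-up list. The values are hypothetical and chosen only so the measures differ; midrange is computed by hand because the module has no function for it.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, mode

# Hypothetical toy values, chosen only to make the measures differ.
values = [1, 2, 2, 4, 16]

summaries = {
    "arithmetic mean": mean(values),
    "median": median(values),
    "mode": mode(values),
    "midrange": (min(values) + max(values)) / 2,
    "geometric mean": geometric_mean(values),
    "harmonic mean": harmonic_mean(values),
}

for name, value in summaries.items():
    print(f"{name}: {value:.3f}")
```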

So it's best to use the specific term. And it's best to be charitable and not jump on people saying "average" when the median might be the best measure of central tendency for a task.

rrrix1 · on Sept 30, 2023

HN tangent threads are the best.

Hopefully no one nitpicks my use of "tangent" on a digression of an aside...

llbeansandrice · on Sept 28, 2023

It's frequently used when talking about incomes specifically. [0]

> For example, the average personal income is often given as the median—the number below which are 50% of personal incomes and above which are 50% of personal incomes—because the mean would be higher by including personal incomes from a few billionaires. For this reason, it is recommended to avoid using the word "average" when discussing measures of central tendency and specifically specify which type of measure of average is being used.

[0] https://en.wikipedia.org/wiki/Average

KolenCh · on Sept 29, 2023

The median minimizes the L1 norm and the mean minimizes the L2 norm. (Of the error made between the estimator and the data roughly speaking.)

In this sense yes, median is some kind of average.

But then it is insensitive to outliers, which is the reason it is suggested above.
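A quick brute-force check of the L1/L2 claim on made-up data. The list, the integer candidate grid, and the function names are assumptions for illustration; the grid works here only because both the median and the mean of this list happen to be integers. Strictly, the second function is the squared L2 norm, which has the same minimizer.

```python
from statistics import mean, median

# Hypothetical toy data with one large outlier.
data = [1, 2, 3, 4, 100]

def l1_error(center):
    """Sum of absolute deviations from `center`."""
    return sum(abs(x - center) for x in data)

def l2_error(center):
    """Sum of squared deviations from `center`."""
    return sum((x - center) ** 2 for x in data)

candidates = range(0, 101)  # integer candidate centers 0..100
best_l1 = min(candidates, key=l1_error)
best_l2 = min(candidates, key=l2_error)

print("L1 minimizer:", best_l1, "median:", median(data))
print("L2 minimizer:", best_l2, "mean:", mean(data))
```

The outlier at 100 drags the L2 minimizer far from the bulk of the data, while the L1 minimizer stays at the middle value.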

adolph · on Sept 28, 2023

Given outliers in a skewed dataset, wouldn't median be a worse description than mean without outliers?

https://openstax.org/books/statistics/pages/2-6-skewness-and...

ghufran_syed · on Sept 28, 2023

I think the median works better in this situation because the huge population means that both the outliers and the skewness have much less effect than they do on the mean. The problem with using the mean “without outliers” is that you have to make arbitrary decisions about what data to exclude as an outlier, unlike with the median.

sapiogram · on Sept 28, 2023

The answer is always "it depends", which is exactly why I prefer the median in most cases. Once you choose to use the median, there are no more choices/degrees of freedom - it's just the median. On the other hand, "mean without outliers" requires you to make a subsequent value judgement on what exactly is an "outlier".

adolph · on Sept 28, 2023

> "mean without outliers" requires you to make a subsequent value judgement

Do you think that comparison of outliers to interquartile range is not a relatively objective method of determining outliers?

> The interquartile range is a number that indicates the spread of the middle half, or the middle 50 percent of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1) . . . The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is less than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile. Potential outliers always require further investigation.

https://openstax.org/books/statistics/pages/2-3-measures-of-...
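A minimal sketch of the 1.5 × IQR rule quoted above, on invented numbers. The wage list and the helper `iqr_outliers` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may not match the quartile convention the textbook uses.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points, default method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical toy wages, invented for illustration only.
wages = [18_000, 25_000, 32_000, 40_000, 48_000,
         55_000, 70_000, 90_000, 5_000_000]

print(iqr_outliers(wages))
```

On this list the lower fence falls below zero, so only the high end gets flagged. The multiplier `k` is still a convention, so the rule is repeatable but not free of choices.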