5% is not arbitrary. It removes the outliers that massively skew the mean, and whose wage growth is itself a massive outlier compared to everyone else's.
Looking at wage movement as slices of where one falls gives a clearer picture.
Additionally, the reason it doesn’t make sense to remove low-wage outliers along with high-wage outliers when computing the average is that the median wage is much closer to the low-wage outliers than it is to the high-wage outliers.
For example, if you keep only the values within the median plus or minus the median (if the median is 48,000, then use 0 through 96,000), you’ll be removing more than just the top 5%, and yet this gives a far better picture of the dire economic position and of what is happening with real wage movement.
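A rough sketch of that window, to make the procedure concrete. The wage figures are made up for illustration, and `median_window_mean` is just a hypothetical helper name, not anything standard:

```python
import statistics

def median_window_mean(values):
    """Mean of the values that fall inside [0, 2 * median]."""
    med = statistics.median(values)
    low, high = 0, 2 * med
    kept = [v for v in values if low <= v <= high]
    return statistics.mean(kept), kept

# Hypothetical annual wages, heavily right-skewed by one very high earner.
wages = [22_000, 31_000, 38_000, 44_000, 48_000,
         52_000, 61_000, 75_000, 90_000, 2_500_000]

window_mean, kept = median_window_mean(wages)
print("plain mean: ", statistics.mean(wages))
print("window mean:", round(window_mean))
print("dropped:    ", [v for v in wages if v not in kept])
```

Because wages can't go below zero, the lower bound never removes anything here. Only the high end gets cut.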
Interesting. Most income graphs I've seen appear to be normally distributed -- could you explain what would cause them not to be, or provide any examples that demonstrate non-Gaussian distributions?
But if wages are non-Gaussian, how would trimming the top and bottom 5% off of your sample be any better than using standard deviation ranges to control for outliers? The assumption that outliers are to be found on the top and bottom of your range is one that seems to apply to Gaussian distributions, and doesn't necessarily hold for others, regardless of whether you are using fixed percentage values or standard deviation thresholds. For example, in a bimodal distribution, outliers might be found in the center.
> 1 a
> : a single value (such as a mean, mode, or median) that summarizes or represents the general significance of a set of unequal values
Average isn't well defined. If you have to guess, people probably intend "arithmetic mean" when they say it. But if you're using it in the sense of a single number that represents the typical value of a dataset, you may want the median, mode, or midrange, or even a geometric or harmonic mean, depending on the circumstances.
So it's best to use that term. And it's best to be charitable and not jump on people for saying "average" when the median might be the best measure of central tendency for a task.
It's frequently used when talking about incomes specifically. [0]
> For example, the average personal income is often given as the median—the number below which are 50% of personal incomes and above which are 50% of personal incomes—because the mean would be higher by including personal incomes from a few billionaires. For this reason, it is recommended to avoid using the word "average" when discussing measures of central tendency and specifically specify which type of measure of average is being used.
I think the median works better in this situation because the huge population means that both the outliers and the skewness have much less effect than they do on the mean. The problem with using the mean “without outliers” is that you have to make arbitrary decisions about what data to exclude as an outlier, unlike with the median.
The answer is always "it depends", which is exactly why I prefer the median in most cases. Once you choose to use the median, there are no more choices/degrees of freedom - it's just the median. On the other hand, "mean without outliers" requires you to make a subsequent value judgement on what exactly is an "outlier".
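To illustrate the degrees-of-freedom point, here is a small sketch with made-up wage numbers. `trimmed_mean` and its `trim_fraction` parameter are my own hypothetical names, and symmetric trimming is only one of several possible definitions of "without outliers":

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    data = sorted(values)
    k = int(len(data) * trim_fraction)  # number of values cut per side
    kept = data[k:len(data) - k] if k else data
    return statistics.mean(kept)

# Hypothetical annual wages with one extreme high earner.
wages = [22_000, 31_000, 38_000, 44_000, 48_000,
         52_000, 61_000, 75_000, 90_000, 2_500_000]

# Every cutoff is a judgement call, and each one gives a different answer.
for frac in (0.0, 0.1, 0.2):
    print(f"trim {frac:.0%} per side:", trimmed_mean(wages, frac))

# The median has no such knob to turn.
print("median:", statistics.median(wages))
```

With this data the trimmed mean moves from 296,100 to 54,875 to 53,000 as the cutoff changes, while the median stays at 50,000.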
> "mean without outliers" requires you to make a subsequent value judgement
Do you think that comparison of outliers to interquartile range is not a relatively objective method of determining outliers?
The interquartile range is a number that indicates the spread of the middle half, or the middle 50 percent of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1) . . . The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is less than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile. Potential outliers always require further investigation.
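As a minimal sketch of the 1.5 × IQR rule, assuming made-up wages and the "inclusive" quartile method from Python's `statistics` module (other quartile conventions shift the fences slightly). `iqr_fences` and `potential_outliers` are hypothetical helper names:

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def potential_outliers(values, k=1.5):
    """Values below the lower bound or above the upper bound."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical annual wages with one extreme high earner.
wages = [22_000, 31_000, 38_000, 44_000, 48_000,
         52_000, 61_000, 75_000, 90_000, 2_500_000]

print(iqr_fences(wages))          # lower and upper bounds
print(potential_outliers(wages))  # flags only the 2,500,000 entry
```

Note that the multiplier `k` is still a convention rather than something derived from the data, and with right-skewed values like these the lower bound can fall below zero, so nothing on the low end is ever flagged.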
5% is too arbitrary, just use the median.