TLDR: Because it gives more weight to one big error then to multiple small ones ...

get · on Feb 15, 2018

An example:

Imagine these two predictors:

    Reality: 1 1 1 1 1 9 1 1 1 1
    Predic1: 2 2 2 2 2 2 2 2 2 2
    Predic2: 3 3 3 3 3 6 3 3 3 3

    SumOfErrors(Predic1) is 16
    SumOfErrors(Predic2) is 21

So Predic1 was better then Predic2? No. Because correctly predicting the one outlier shows more predictive power then staying close to the average. Therefore we use SumOfSquerrors:

    SumOfSquerrors(Predic1) is 58
    SumOfSquerrors(Predic2) is 45

This shows that Predic2 is "better" and we are happy :)

scrooched_moose · on Feb 15, 2018

It should be 16 & 21 and 58 & 45.

get · on Feb 15, 2018

True. Fixed. Thanks.

ralusek · on Feb 16, 2018

But what I've never understood is that if your objective is to magnify errors, why not cube it? Why not to a greater power still? If the other benefit is that all negative values to an even power become positive, then why not take the absolute value of the cube? No matter what, the degree to which we magnify errors strikes me as arbitrary.

srean · on Feb 16, 2018

> Noise usually has a gaussian distribution

This belief is often a good indicator that a data scientist is divorced from reality.