Is the point that the cost functions have incompatible gradients around local minima/different local minima?
I think that is part of it: the different cost functions can have different local minima and also different saddle points; ideally even different ridge/valley configurations.
In machine learning there is a well-known technique called Stochastic Gradient Descent (SGD) [1]. There the cost function is the sum of a very large number of terms reflecting how well each element of the training set has been reproduced.
With SGD the optimisation steps use randomly chosen cost functions which are obtained by choosing a random subset of the training set.
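For concreteness, here is a minimal NumPy sketch of that idea for a linear model with an MSE cost: each step estimates the gradient on a random mini-batch rather than on the whole training set. The synthetic data, model, and hyperparameters (`learning_rate`, `batch_size`) are illustrative assumptions, not taken from any particular library or paper.

```python
# Minimal mini-batch SGD sketch for a linear model with MSE loss (NumPy).
# Data, model, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: y = 3x + 1 plus noise.
X = rng.uniform(-1.0, 1.0, size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0              # model parameters
learning_rate = 0.1
batch_size = 32

for step in range(2000):
    # Each step evaluates the cost (and its gradient) on a random subset
    # of the training set rather than on all examples.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx, 0], y[idx]

    pred = w * xb + b
    err = pred - yb          # residuals on this mini-batch

    # Gradients of the mini-batch MSE, mean(err**2).
    grad_w = 2.0 * np.mean(err * xb)
    grad_b = 2.0 * np.mean(err)

    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should approach 3 and 1
```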
I had thought the advantage of SGD was purely in saved computation: by computing the approximate cost function on only a small batch, you incur only a tiny fraction of the computational expense, and if you could afford larger batches it would always help convergence.
This demo writeup makes me realize there may be a benefit from the randomness itself. Different cost functions may have different local minima, different saddles, and different ridges, which helps you avoid getting stuck, or even slowed down, at these points.
I gather that's the idea, but mean squared error and mean absolute error are fairly correlated, so I'm not sure if that would be an advantage or disadvantage.
I'm running an MSE hill-climbing experiment at the moment; I might give it a go and see if it helps.
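As a rough illustration of the kind of experiment being floated here, below is a toy hill-climbing sketch that picks MSE or MAE at random when scoring each proposed step. The synthetic data, step size, and acceptance rule are my own assumptions for the sketch, not the commenter's actual setup.

```python
# Toy hill climbing that randomly alternates between MSE and MAE when
# scoring each proposed step -- a guess at the experiment described
# above, not the commenter's actual setup.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from y = 2x - 0.5 plus noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X - 0.5 + 0.1 * rng.standard_normal(200)

def mse(params):
    w, b = params
    return np.mean((w * X + b - y) ** 2)

def mae(params):
    w, b = params
    return np.mean(np.abs(w * X + b - y))

params = np.zeros(2)   # [w, b]
step_size = 0.05

for _ in range(5000):
    # Pick one of the two (fairly correlated) cost functions at random.
    cost = mse if rng.random() < 0.5 else mae
    candidate = params + step_size * rng.standard_normal(2)
    # Accept the move only if it improves the randomly chosen cost.
    if cost(candidate) < cost(params):
        params = candidate

print(params)  # should approach [2, -0.5]
```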