According to the article, there's a confidence score as well. As long as it is sufficiently predictive of errors, either a tight or a wide distribution is likely acceptable.
We need to see the relationship between confidence and GDT score. If there's a nice relationship, then again everything is great. But... most confidence metrics from neural networks do not have a nice relationship to the primary metric.
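For what it's worth, here is a rough sketch of how one might check that relationship, assuming you had per-target confidence values (scaled to 0..1) paired with GDT scores. The numbers below are made up purely for illustration, and the helper names (`spearman`, `binned_mean_gdt`) are my own, not anything from the article. It computes a rank correlation and the mean GDT per confidence bin; a "nice" relationship would show a high correlation and bin means that rise steadily with confidence.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def binned_mean_gdt(confidence, gdt, n_bins=4):
    """Mean GDT in equal-width confidence bins over [0, 1]; empty bins give None."""
    bins = [[] for _ in range(n_bins)]
    for c, g in zip(confidence, gdt):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the last bin
        bins[idx].append(g)
    return [mean(b) if b else None for b in bins]


# Hypothetical data, NOT real results: one (confidence, GDT) pair per target.
confidence = [0.95, 0.90, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10]
gdt = [92.0, 88.0, 85.0, 60.0, 70.0, 45.0, 50.0, 30.0]

rho = spearman(confidence, gdt)
per_bin = binned_mean_gdt(confidence, gdt)
print(f"Spearman rho: {rho:.3f}")
print("Mean GDT per confidence bin (low to high):", per_bin)
```

A high rank correlation alone isn't the whole story: what matters for the point above is whether low-confidence predictions are reliably the low-GDT ones, which is what the per-bin means are meant to expose.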