In their paper they even used the "weaker" DenseNet-121, rather than the DenseNet-169 they used for MURA (bones). The DenseNet-BC variant I tried is another refinement of the same approach.
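For context, the paper's setup is a DenseNet-121 backbone with a 14-way multi-label head over the ChestX-ray14 pathologies, and swapping in a deeper variant is a one-line change in torchvision (whose DenseNets, as far as I can tell, already use the BC refinement: bottleneck layers plus compression). A minimal sketch, with everything besides the layer counts and label count being illustrative rather than the paper's actual training code:

```python
import torch.nn as nn
from torchvision import models

def chexnet_like(arch="densenet121", num_labels=14):
    """Build a DenseNet backbone with a multi-label classification head.

    arch can be "densenet121" or "densenet169"; torchvision's DenseNets
    follow the BC configuration (1x1 bottlenecks, 0.5 compression).
    """
    model = getattr(models, arch)()  # randomly initialized backbone
    # Replace the 1000-way ImageNet head with a 14-way head, one logit
    # per pathology; train with nn.BCEWithLogitsLoss for multi-label output.
    model.classifier = nn.Linear(model.classifier.in_features, num_labels)
    return model
```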
Those are some sketchy statistics. The evaluation procedure is questionable (each labeler's F1 scored against the majority vote of the other 4 as ground truth? A mean of means?), and the 95% CIs overlap pretty substantially. Even if their bootstrap sampling said the difference is significant, I don't believe them.
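As I read the procedure being questioned here: each of the 5 labelers (4 radiologists plus the model) gets an F1 against the majority vote of the other 4, with bootstrap CIs taken over the test images. A toy sketch of that setup, using random stand-in labels and names of my own choosing, not the paper's code:

```python
# Leave-one-out F1 with percentile-bootstrap CIs over images.
# 5 binary labelers (4 radiologists + the model), 420 test images as in CheXNet.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_labelers, n_images = 5, 420
labels = rng.integers(0, 2, size=(n_labelers, n_images))  # stand-in labels

def loo_f1(labels, i):
    """F1 of labeler i against the majority vote of the other labelers."""
    others = np.delete(labels, i, axis=0)
    # 2-2 ties among the remaining 4 labelers break toward negative here.
    majority = (others.mean(axis=0) > 0.5).astype(int)
    return f1_score(majority, labels[i])

def bootstrap_ci(labels, i, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for labeler i's leave-one-out F1."""
    n = labels.shape[1]
    stats = [loo_f1(labels[:, rng.integers(0, n, size=n)], i)
             for _ in range(n_boot)]  # resample images with replacement
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return loo_f1(labels, i), (lo, hi)

for i in range(n_labelers):
    point, (lo, hi) = bootstrap_ci(labels, i)
    print(f"labeler {i}: F1={point:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Even on this toy version you can see the problem: the labeler being scored also helps define everyone else's ground truth, and once the per-labeler CIs overlap, a significance claim from resampling the same 420 images is doing a lot of work.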
Basically, I see this as "everyone sucks, but the AI maybe sucks a little less than the worst of our radiologists, on average."
https://stanfordmlgroup.github.io/projects/chexnet/