Is it possible to over-fit optimisation methods? I mean, people concentrate so much on coming up with fancy ways of training ResNet really quickly, or with huge batch sizes, or whatever. Maybe the methods themselves don't generalise to other networks.
Yes, over-fitting is definitely something that happens. Here's a paper that demonstrates the problem by building a new test set for CIFAR-10 and re-evaluating published classifiers on it, and the accuracies drop noticeably: https://arxiv.org/abs/1806.00451
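The check itself is simple to reproduce. Here's a minimal sketch of the idea: evaluate the same trained model on the original CIFAR-10 test set and on a freshly collected one (CIFAR-10.1 from that paper's authors, assumed here to be downloaded as the `.npy` files from their repo), then compare. The model checkpoint path is a placeholder for whatever classifier you've trained.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms

def accuracy(model, loader, device="cpu"):
    """Fraction of examples the model classifies correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.size(0)
    return correct / total

# Original CIFAR-10 test set.
tfm = transforms.ToTensor()
orig = datasets.CIFAR10(root="data", train=False, download=True, transform=tfm)
orig_loader = DataLoader(orig, batch_size=256)

# New test set: CIFAR-10.1 ships as two .npy files (uint8 HWC images and
# integer labels); the file names below are assumptions about where you
# saved them after downloading from the authors' repo.
x_new = torch.from_numpy(np.load("cifar10.1_v6_data.npy")).permute(0, 3, 1, 2).float() / 255
y_new = torch.from_numpy(np.load("cifar10.1_v6_labels.npy")).long()
new_loader = DataLoader(TensorDataset(x_new, y_new), batch_size=256)

# Placeholder: load whatever CIFAR-10 classifier you've trained and saved.
model = torch.load("my_cifar_model.pt")

print("original test acc:", accuracy(model, orig_loader))
print("new test acc:     ", accuracy(model, new_loader))
```

If your model's accuracy drops much more on the new set than other models' do, that's a hint you've tuned to the quirks of the original test set rather than the task.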
The flip side is that by having a common problem to tackle, it's much easier to compare and contrast different results and figure out what really works. Many approaches that improve results on CIFAR don't scale to ImageNet, and many of the ImageNet papers don't scale to the even larger datasets that people are using now.
Maybe they do, though. Just a thought.