Is it possible to over-fit optimisation methods? I mean, people concentrate so much on coming up with fancy ways of training ResNet really quickly, or with huge batch sizes, or whatever. Maybe the methods themselves don't generalise to other networks.
Yes, over-fitting is definitely something that happens. Here's a paper that demonstrates the problem by testing CIFAR classifiers on a new version of the input data: https://arxiv.org/abs/1806.00451
The flip side is that by having a common problem to tackle it's much easier to compare and contrast different results to figure out what really works. Many approaches that improve results on CIFAR don't scale to ImageNet, and many of the ImageNet papers don't scale to the even larger datasets that people are using now.
Are there any (unsupervised or minimally supervised) methods yet that learn from non-repeating datasets? As humans we don’t really learn by rereading the same text 35 times or staring at the same picture continuously, so why do all these research papers keep focusing on methods that are basically trying to fit a function over a static, known, labeled dataset?
> As humans we don’t really learn by rereading the same text 35 times or staring at the same picture continuously
Well, we do learn language that way, by hearing the same sounds over and over until we understand words and associations. There are lots of other examples of repetition being needed for learning to sink in as well, from music to math to writing.
Yes, there are certainly settings where you can train effectively without repetition; this is more common in areas like reinforcement learning, where every training point is simulated. While it's theoretically possible for the same random seed to produce the same training point twice, in practice that won't happen without bugs, though some training data will be similar. Augmentation can also ensure you don't see the exact same training example more than once.
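To make the augmentation point concrete, here's a rough sketch (the names `augment` and `training_stream` are just illustrative, and the random crop/flip choices are assumptions, not any particular paper's recipe) of how a fixed dataset can be turned into a stream where the exact same pixels are essentially never presented twice:

```python
import numpy as np

def augment(image, rng):
    """Random horizontal flip plus a random 24x24 crop of a 32x32 image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # flip along the width axis
    top = rng.integers(0, 9)               # valid crop offsets are 0..8
    left = rng.integers(0, 9)
    return image[top:top + 24, left:left + 24, :]

def training_stream(images, rng):
    """Endless stream of freshly augmented samples drawn from a fixed dataset."""
    while True:
        idx = rng.integers(0, len(images))
        yield augment(images[idx], rng)

rng = np.random.default_rng(0)
dataset = np.random.rand(100, 32, 32, 3)   # stand-in for e.g. CIFAR images
stream = training_stream(dataset, rng)
batch = np.stack([next(stream) for _ in range(8)])
print(batch.shape)  # (8, 24, 24, 3)
```

The underlying dataset is still static and labeled; the randomness just means the optimizer sees a slightly different view of each example every time it comes around.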
However, training without repetition doesn't guarantee you will be able to train an amazing model. You are still limited by things like model capacity, label noise, poor feature representation, and the optimizer and learning rate. Empirically, there will be some dataset size beyond which you get diminishing returns in model performance.
Our genes have spent millions of years preconditioning us to expect the world to appear to us in a certain way. We then spend a lifetime refining an emergent cognitive model so that we can navigate it.
Yeah, one shot learning in humans is built on top of enormous model representations of the world built up over many years, so I wouldn’t be surprised if we’re a bit premature in trying to pull it off in ML other than in very simple cases.