Is it possible to over-fit optimisation methods? I mean, people concentrate so much on coming up with fancy ways of training ResNet really quickly, or with huge batch sizes, or whatever. Maybe the methods themselves don't generalise to other networks.
Yes, over-fitting is definitely something that happens. Here's a paper that demonstrates the problem by testing CIFAR classifiers on a new version of the input data: https://arxiv.org/abs/1806.00451
The flip side is that by having a common problem to tackle it's much easier to compare and contrast different results to figure out what really works. Many approaches that improve results on CIFAR don't scale to ImageNet, and many of the ImageNet papers don't scale to the even larger datasets that people are using now.
Are there any (unsupervised or minimally supervised) methods yet that learn from non-repeating datasets? As humans we don’t really learn by rereading the same text 35 times or staring at the same picture continuously, so why do all these research papers keep focusing on methods that are basically trying to fit a function over a static, known, labeled dataset?
> As humans we don’t really learn by rereading the same text 35 times or staring at the same picture continuously
Well, we do learn language that way, by hearing the same sounds over and over until we understand words and associations. There are lots of other examples of repetition being needed for learning to sink in as well, from music to math to writing.
Yes, there are certainly settings where you can train effectively without repetition; this is more common in areas like reinforcement learning, where every training point is simulated. While it's theoretically possible for the same random seed to produce the same training point twice, in practice that won't happen without bugs, though some training data will be similar. Augmentation can also ensure you don't see the exact same training example more than once.
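To make the augmentation point concrete, here's a rough sketch (the names `augment` and `training_stream` are just illustrative, and the random crop/flip choices are assumptions, not any particular paper's recipe) of how a fixed dataset can be turned into a stream where the exact same pixels are essentially never presented twice:

```python
import numpy as np

def augment(image, rng):
    """Random horizontal flip plus a random 24x24 crop of a 32x32 image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # flip along the width axis
    top = rng.integers(0, 9)               # valid crop offsets are 0..8
    left = rng.integers(0, 9)
    return image[top:top + 24, left:left + 24, :]

def training_stream(images, rng):
    """Endless stream of freshly augmented samples drawn from a fixed dataset."""
    while True:
        idx = rng.integers(0, len(images))
        yield augment(images[idx], rng)

rng = np.random.default_rng(0)
dataset = np.random.rand(100, 32, 32, 3)   # stand-in for e.g. CIFAR images
stream = training_stream(dataset, rng)
batch = np.stack([next(stream) for _ in range(8)])
print(batch.shape)  # (8, 24, 24, 3)
```

The underlying dataset is still static and labeled; the randomness just means the optimizer sees a slightly different view of each example every time it comes around.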
However, training without repetition doesn't guarantee you will be able to train an amazing model. You are still limited by things like model capacity, label noise, poor feature representation, and the optimizer and learning rate. Empirically, there will be some dataset size beyond which you get diminishing returns in model performance.
Our genes have spent millions of years preconditioning us to expect the world to appear to us in a certain way. We then spend a lifetime refining an emergent cognitive model so that we can navigate it.
Yeah, one shot learning in humans is built on top of enormous model representations of the world built up over many years, so I wouldn’t be surprised if we’re a bit premature in trying to pull it off in ML other than in very simple cases.