>For instance, local entropy, the loss that Entropy-SGD minimizes, is the solution of the Hamilton-Jacobi-Bellman PDE and can therefore be written as a stochastic optimal control problem, which penalizes greedy gradient descent. This direction further leads to connections between variants of SGD with good empirical performance and standard methods in convex optimization such as inf-convolutions and proximal methods.
You clearly didn't read the article. What are you commenting on, then? It seems to be about your own understanding, since it's certainly not about the article.
My understanding is that the issue is that the full Hessian of the loss is too expensive to compute at each step, relative to the speedup in learning it would buy.
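A rough back-of-the-envelope sketch (my own toy numbers, nothing from the article): the gradient has one entry per parameter, but the full Hessian has n^2 entries, which gets silly even for a small net:

    # Rough cost comparison (toy numbers, not from the article): parameter count n
    # vs. full Hessian size n^2 for a small fully-connected net 784 -> 512 -> 512 -> 10.
    layers = [(784, 512), (512, 512), (512, 10)]
    n = sum(fan_in * fan_out + fan_out for fan_in, fan_out in layers)  # weights + biases
    print(f"parameters (gradient entries): {n:,}")                     # ~670 thousand
    print(f"full Hessian entries:          {n * n:,}")                 # ~4.5e11
    print(f"Hessian memory at fp32:        {n * n * 4 / 1e12:.1f} TB") # ~1.8 TB per step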
I think this is a work in progress? It seems to be in IEEE format, where pages are gold, and one would almost never waste an entire page on references (surveys) and leave another mostly blank. I'll wait for the peer-reviewed version.
This is a horribly titled paper, and I don't necessarily agree with the premise. Since when did we become convinced that deep learning achieves "global optimality", as they put it?
When the network is deep or big enough, local minima tend to have loss close to the global optimum. There is theoretical and empirical evidence for this. The global minimum usually corresponds to overfitting, so what is needed is getting close enough.
In practice, algorithms like stochastic gradient descent have problems distinguishing between saddle points and local minima. There is a hypothesis that in many or most cases the "local minima" where algorithms get stuck are actually saddle points with long plateaus.
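A toy illustration of that plateau effect (my own construction, not from the article):

    # Plain gradient descent on f(x, y) = x^2 - y^4, which has a saddle at the origin.
    # The escape direction y has gradient -4*y^3, vanishingly small near y = 0, so the
    # iterate sits on a long plateau that is easy to mistake for a local minimum.
    x, y, lr = 1.0, 1e-3, 0.01
    for step in range(50001):
        gx, gy = 2 * x, -4 * y**3          # gradient of f
        x, y = x - lr * gx, y - lr * gy
        if step % 10000 == 0:
            print(f"step {step:6d}  x={x:+.2e}  y={y:+.2e}")
    # x collapses to ~0 almost immediately, while y has barely moved after 50k steps.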
This mathematical hand-waving plagued the field of genetic algorithms and its variants in the 90s, until the 'no free lunch' theorem was published out of the Santa Fe Institute. It essentially said (my take on it) that if you have no information about the landscape you are searching, then you can't say much about the algorithm you are selling.
We are talking about continuous multidimensional optimization. We have already selected our bias and the subset of problems we want to solve. Now we are figuring out that, in this domain, there are theoretical reasons that explain why gradient descent works so well over a large number of problems.
Agreed. NFL is general and we need to discuss a specific landscape.
Deep learning is continuous nonlinear multidimensional optimization, yes.
What defines the subset of problems? A general nonlinear mapping from R^n to R^m with n > m?
Or are you limiting it to image classification? Speech recognition? Either would be a subset.
We have empirical evidence that deep learning works, but I'm not confident that we have the mathematical tools to understand why.
It showed that large CNN models, which have far greater capacity than the data they are shown (and could have memorized it), still tend to learn minima that generalize very well.
See Proposition 1:
(i) For any model class F whose model complexity is large enough to memorize any dataset and which includes f∗ possibly at an arbitrarily sharp minimum, there exists (A, S_m) such that the generalization gap is at most epsilon, and
(ii) For any dataset S_m, there exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the generalization gap of f_A(S_m) is at most epsilon.
True. High-dimensional non-linear systems (like deep NNs) are a frontier we are just beginning to explore with empirical success.
However, if I were a VC, my knowledge of the NFL theorem would make me look skeptically at any grandiose claims of miracle learning algorithms. In a sense, it provides me with an upper bound on BS.
>> Global minimum is usually overfitting so what is needed is getting close enough.
If the global minimum is overfitting, then what are local minima doing, exactly? Generalising well? I find that very hard to believe.
Also- we're talking about overfitting in the context of a particular training dataset, right? Because we all know that it doesn't matter how well your model validates against your out-of-sample data, its performance is going to fall by 10% to 20% once it hits the real world and has to deal with truly unseen data.
Yes. Specifically, the common argument has become that flat minima generalize well.
As for the last part, when people talk about overfitting we're typically ignoring distributional shift. Otherwise, models can achieve very good generalization.
I'm skeptical about both statements. I haven't heard the "common argument" you mention and I don't see how ignoring differences in distribution between laboratory and real-world data will improve generalisation.
I find the last bit particularly hard to accept, but I'd like to hear why you say so, on both counts.
Sorry, my last sentence
> Otherwise, models can achieve very good generalization.
was probably unclear. When we talk about generalization, we typically have to assume a lack of distributional shift. Otherwise, if the "truly unseen data" really is drawn from a different distribution (say one where all cats are now called "dogs" and all dogs are now called "cats"), we can get arbitrarily poor performance.
As for the flat minima part, lemme pull up some papers from my notes.
>Although flatness of the loss function has been linked to generalization [Hochreiter and Schmidhuber, 1997], existing definitions of flatness are neither invariant to nodewise re-scalings of ReLU nets nor general coordinate transformations [Dinh et al., 2017] of the parameter space, which calls into question their utility for describing generalization
https://arxiv.org/abs/1703.11008 is also a really promising paper on generalization that basically shows another way to view flatness from a Bayesian perspective (they also show the most impressive generalization bound I've seen yet).
I think it's clear from these papers that the "flat minima generalize well" argument is a common avenue of research with respect to generalization. Many of the generalization papers I've read recently mostly explore different ways of calculating flatness, and pretty much all of them reference and link their work to the 1st and 4th papers I linked above.
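To make the non-invariance point concrete, here's a tiny sketch of the re-scaling trick (my own toy construction, not code from any of those papers): scaling the first layer of a ReLU net by alpha and the second by 1/alpha leaves the function unchanged, but the same parameter perturbation now has a very different effect, so naive parameter-space "flatness" isn't well defined.

    import numpy as np

    # Toy 2-layer ReLU net: f(x) = W2 @ relu(W1 @ x). For alpha > 0, the pair
    # (alpha*W1, W2/alpha) computes exactly the same function, yet a fixed small
    # perturbation of W1 changes the output by ~alpha times less on the rescaled copy.
    rng = np.random.default_rng(0)
    relu = lambda z: np.maximum(z, 0.0)
    W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))
    x = rng.normal(size=8)
    f = lambda A, B: (B @ relu(A @ x))[0]

    alpha = 100.0
    print(np.isclose(f(W1, W2), f(alpha * W1, W2 / alpha)))   # True: identical function

    eps = 0.01 * rng.normal(size=W1.shape)                    # same additive perturbation
    print(abs(f(W1 + eps, W2) - f(W1, W2)))                   # effect on the original net
    print(abs(f(alpha * W1 + eps, W2 / alpha) - f(W1, W2)))   # ~alpha times smaller here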
Thanks for the reply. What do you do, btw? I mean, where does your interest in this subject come from, if I'm not being nosy (again)?
I'll have to have a look at those papers. Schmidhuber and Hochreiter first, obviously- my favourites.
>> When we talk about generalization, we typically have to assume a lack of distributional shift. Otherwise, if the "truly unseen data" really is drawn from a different distribution (say one where all cats are now called "dogs" and all dogs are now called "cats"), we can get arbitrarily poor performance.
Well, the difference in distribution doesn't have to be that radical to lead to egregious mistakes. For example- I don't know if you caught wind of the recent "deep gaydar paper", where two researchers from Stanford trained a deep learning classifier to identify gay men and women from images of their faces?
The data in one experiment consisted of about 12k pictures of users of a US dating site. Shockingly (to me anyway), the number of pictures of gay men was exactly the same as the number of pictures of straight men, and so on for the numbers of pictures of gay and straight women [1]. This is a completely unrealistic, artificial distribution - estimates I've heard about the prevalence of homosexuality in the general population vary from 5% to 10%. The paper itself references a 7% prevalence.
Given such artificial data, I really don't think anyone could train a classifier that would have the same ability to discriminate between categories when exposed to the real world. And we're not talking about "cats" looking like "dogs"- it's a simple binary problem, discriminating between gay/not gay, and the same features are (presumably) associated with the same classes; it's just that the classifier was trained on balanced classes while in the real world the classes are unbalanced.
Of course I can't really say what would happen if their classifier was trained on data with a distribution matching the real world (or their 7% expectation of gay men), but I do know that, with unbalanced classes, it's perfectly possible to train a classifier with perfect recall and rubbish precision (with a ratio of 10% gay men, you can get 1.0 recall with 0.37-ish precision). So with different training distributions you get different classifiers. I mean, duh.
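For a rough sanity check of that number (my own back-of-the-envelope arithmetic, not figures from the paper): with recall and false positive rate held fixed, precision follows directly from the prevalence, so a classifier that looks fine on balanced data can have its precision collapse on a 5-10% population.

    # Precision as a function of prevalence p, with recall (TPR) and false positive
    # rate (FPR) held fixed, i.e. the classifier doesn't change, only the population
    # it is applied to. The FPR below is my own pick, chosen to roughly reproduce
    # the "0.37-ish" figure at 10% prevalence.
    def precision(p, tpr, fpr):
        return (p * tpr) / (p * tpr + (1 - p) * fpr)

    tpr, fpr = 1.0, 0.19
    for p in (0.5, 0.10, 0.07, 0.05):
        print(f"prevalence {p:.0%}: precision {precision(p, tpr, fpr):.2f}")
    # 50%: 0.84, 10%: 0.37, 7%: 0.28, 5%: 0.22 -- perfect recall, collapsing precision.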
So that's the kind of thing I'm talking about. And for large enough problems (e.g. the language domain, where the true problem is probably infinite) that is a real issue, because even with the best intentions and the best sampling techniques you may never get enough data to train a proper discriminator. And even if your laboratory experiments tell you that your discriminator is generalising well against the data you have, but didn't train it on, you have no way to tell what will happen when it encounters data that even you have yet to see.
_________________
[1] That was for users with at least one picture, which is to say, everyone. Decreasing numbers of users had 2, 3 and so on up to 5 pictures, and they were used in different experiments.
I agree that the sort of sampling bias/distributional shift you mention is definitely a real problem in deep learning. It so happens that yesterday I read a blog post that elaborates on this problem more clearly than I can. Although it's still true that generalizing under arbitrary distributional shift is impossible, it does make sense to allow for distributional shift across the natural manifold (the manifold of all "allowable" data). The blog post called this "strong generalization", which isn't a term I'd heard before.
I'm just an undergraduate doing research in computer vision/deep learning. I've been following a lot of the research in deep learning theory because I think it's really interesting and there's been a lot of activity recently. Maybe I'll be able to do research on pure ML theory one day, but my math ability is definitely not up to par yet haha...
Thanks for the article! I'm still trying to find the time to read it, but it looks very interesting and well written. I hadn't heard the term "strong generalisation" before either, but it's a nice concept that is often overlooked in discussions of machine learning.
To me it's a big problem and I don't think anyone has good answers. I once pressed one of my tutors at university (I did an AI Masters) with a question about it. In the end he kind of put his hands up and said that the final arbiter of your system's performance is your client- if they're happy with your model, then that's that. That certainly does not satisfy me. On the other hand, of course, the success of some machine learning systems "in the wild" is indisputable (say, AlphaGo and AlphaZero etc).
Don't be too worried about the maths. I'm doing an AI PhD and my maths was never very good either (although to be fair it's a GOFAI subject, so it's a different kind of maths). You learn to read the maths in papers and understand what's going on- slowly and painfully at first, but you get there in the end. Besides, if you're good with the code, you can use that to help you get your head around the hardest bits. I still have trouble with calculus, but I once implemented an LSTM network in more or less pure Python (no frameworks besides numpy) for one of my classes at uni. I think it even worked properly. My tutor was duly impressed anyway.
Thanks again for the links to the papers I hadn't read and that nice little article- and good luck with your research :)
I think my math is pretty good... for an undergrad at least. The problem is that the barrier to contributing meaningful research in ML theory is higher than in applied ML, haha. Anybody can read an applied paper like the ResNet one and get the main idea.
I do agree that I think strong generalization is a problem that definitely needs to be solved.
Basically, we need empirical results to know what kinds of theoretical models are useful to study. Otherwise, we fall into the trap of assuming that non-convex surfaces are basically impossible to analyze. I think he draws a very cogent comparison to statistical physics.
If you have good maths skills you shouldn't be too worried. In any case, most people's first contributions -in any field- are rarely entirely their own. Look around for papers where the supervisor is the last author- like the Hochreiter & Schmidhuber paper you linked above. That's a clear sign of a more experienced researcher lending a hand to a younger one. And that's the supervisor-supervisee relation in a nutshell, really.
Is there a word, separate from overfitting, that refers to generalization performance in a way that explicitly includes distributional shift? Just curious, as I never realized that overfitting was commonly constrained in the way you describe, and I'd like to go back and relabel my knowledge to better distinguish between your definition of overfitting and a more general version.
Generalization performance is typically defined in roughly this manner.
Assume your training data consist of a sample S of m elements drawn from your full distribution D. Let L(S) be the average loss of your model (hypothesis is also a common term) on S, and L(D) its average loss on D.
Now, we say that training generalizes if the model that minimizes L(S) also achieves a similarly low L(D).
In practice, we detect generalization by taking a separate sample (validation set) S_2 from D and using L(S_2) to approximate L(D).
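In toy code (my own setup, not from any particular paper), the bookkeeping looks roughly like this:

    import numpy as np

    # Toy version of the definitions above: D is the full distribution, S the training
    # sample, S_2 a held-out validation sample. The generalization gap is L(D) - L(S);
    # in practice we approximate L(D) with L(S_2).
    rng = np.random.default_rng(0)

    def draw(m):                            # stand-in for sampling (x, y) ~ D
        x = rng.normal(size=(m, 5))
        y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=m)
        return x, y

    def avg_loss(w, x, y):                  # squared-error loss L(.) of model w
        return float(np.mean((x @ w - y) ** 2))

    xs, ys = draw(50)                       # S: small training sample
    x2, y2 = draw(10_000)                   # S_2: large held-out sample, proxy for D
    w = np.linalg.lstsq(xs, ys, rcond=None)[0]   # fit the model by minimizing L(S)

    print("L(S)   =", avg_loss(w, xs, ys))
    print("L(S_2) =", avg_loss(w, x2, y2))                        # approximates L(D)
    print("gap    =", avg_loss(w, x2, y2) - avg_loss(w, xs, ys))  # estimated generalization gap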
I'm not familiar with any literature on generalization under arbitrary distributional shift between training and test. If you allow arbitrary distributional shift (say we decide that all dogs are now called cats and all cats are now called dogs), it's pretty clear that you can get arbitrarily poor "generalization".
OK, but in many cases there's more that can be said about how well S and S_2 approximate D, which would allow discussion of various forms of bounded or constrained distributional shift. I don't think the only choices are "no distributional shift" and "arbitrary distributional shift", are they? What's the vocabulary for referring to poor generalization performance on datasets with some form of constrained or procedurally introduced distributional shift?
In short, it's when your training data come from different conditions than your real-world data.
This means that the probability distribution your model was trained on is different from the one you are using it on, so a model that did great in cross-validation could still be useless in practice.
Basically, it's when the distribution your classes come from changes between training and test.
For example, if after training your classifier to distinguish between cats and dogs, all of a sudden society decides that all white haired dogs are now called cats, your classifier is clearly going to get "poor" performance.
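If it helps, here's a minimal sketch of that kind of shift (my own toy construction, nothing from the thread's papers):

    import numpy as np

    # Concept-shift toy: the inputs look the same at test time, but the labeling
    # convention changes (cf. "white haired dogs are now called cats"), so p(y | x)
    # itself shifts and accuracy drops even though x is drawn from the same distribution.
    rng = np.random.default_rng(0)

    x_train = rng.normal(size=5000)
    y_train = (x_train > 0.0).astype(int)        # training-time labeling rule
    threshold = 0.0                              # the rule a classifier would learn here

    x_test = rng.normal(size=5000)               # same input distribution...
    y_test = (x_test > 0.8).astype(int)          # ...but some positives are relabeled
    pred = (x_test > threshold).astype(int)

    print("train accuracy:", np.mean((x_train > threshold).astype(int) == y_train))  # 1.0
    print("test accuracy: ", np.mean(pred == y_test))                                # ~0.71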
> Since when did we become convinced deep learning achieves "global optimality" as they put it.
There are a couple of papers over the past few years arguing/proving under various models that the local optima in deep models either are or are within epsilon of the global optima and this is why SGD reaching only local optima results in models that still work well. I assume these results are covered in OP (haven't read it yet, just answering your question).
> There are a couple of papers over the past few years arguing/proving under various models that the local optima in deep models either are or are within epsilon of the global optima
I don't care so much for the “arguing” part, but the “proving” part got me interested. ML methods reliably being able to get epsilon-close to global optima is a BFD. What mathematical technology was used to establish these results? Could you provide links to those papers?
I believe this is in reference to the information bottleneck, the idea that there is a limit on the performance of a particular network. From an information-theory point of view, a deep neural network can be viewed as compressing the input and then generating a code on the other side. So the network is seen as encoding the data, and because of that we can bound it using tools from information theory.
It is not; it actually looks like a very well-written paper, for a review, with references to all the latest developments. Your particular quibble is answered in a section along with references.