Yes. Specifically, the common argument has become that flat minima generalize well.
As for the last part, when people talk about overfitting we're typically ignoring distributional shift. Otherwise, models can achieve very good generalization.
I'm skeptical about both statements. I haven't heard the "common argument" you mention and I don't see how ignoring differences in distribution between laboratory and real-world data will improve generalisation.
I find the last bit particularly hard to accept, but I'd like to hear why you say so, on both counts.
Sorry, my last sentence
> Otherwise, models can achieve very good generalization.
was probably unclear. When we talk about generalization, we typically have to assume a lack of distributional shift. Otherwise, if the "truly unseen data" really is drawn from a different distribution (say one where all cats are now called "dogs" and all dogs are now called "cats"), we can get arbitrarily poor performance.
As for the flat minima part, lemme pull up some papers from my notes.
> Although flatness of the loss function has been linked to generalization
> [Hochreiter and Schmidhuber, 1997], existing definitions of flatness are
> neither invariant to nodewise re-scalings of ReLU nets nor general coordinate
> transformations [Dinh et al., 2017] of the parameter space, which calls into
> question their utility for describing generalization.
https://arxiv.org/abs/1703.11008 is also a really promising paper on generalization that basically shows another way to view flatness from a Bayesian perspective (they also show the most impressive generalization bound I've seen yet).
I think it's clear from these papers that the "flat minima generalize well" argument is a common avenue of research with respect to generalization. Many of the generalization papers I've read recently are mostly about different ways of calculating flatness, and pretty much all of them reference and link their work back to the papers I linked above.
Thanks for the reply. What do you do, btw? I mean, where does your interest
in this subject come from, if I'm not being nosy (again)?
I'll have to have a look at those papers. Schmidhuber and Hochreiter first,
obviously- my favourites.
>> When we talk about generalization, we typically have to assume a lack of
>> distributional shift. Otherwise, if the "truly unseen data" really is drawn
>> from a different distribution (say one where all cats are now called "dogs"
>> and all dogs are now called "cats"), we can get arbitrarily poor performance.
Well, the difference in distribution doesn't have to be that radical, to lead
to egregious mistakes. For example- I don't know if you caught wind of the
recent "deep gaydar paper", where two researchers from Standford trained a
deep learning classifier to identify gay men and women from images of their
faces?
The data in one experiment consisted of about 12k pictures of users of a US
dating site. Shockingly (to me anyway), the number of pictures of gay men was exactly the same as the number of pictures of straight men, and likewise for the pictures of gay and straight women
[1]. This is a completely unrealistic, artificial distribution - estimates
I've heard about the prevalence of homosexuality in the general population
vary from 5% to 10%. The paper itself references a 7% prevalence.
Given such artificial data, I really don't think anyone could train a
classifier that would have the same ability to discriminate between categories
when exposed to the real world. And we're not talking about "cats" looking
like "dogs"- it's a simple binary problem, discriminating between gay/not gay, the same features are (presumably) associated with the same classes,
except a classifier trained on balanced classes and in the real world the classes are
unbalanced.
Of course I can't really say what would happen if their classifier was trained
on data with a distribution matching the real world (or their 7% expectation
of gay men), but I do know that, with unbalanced classes, it's perfectly
possible to train a classifier with perfect recall and rubbish precision (with
a ratio of 10% gay men, you can get 1.0 recall with 0.37 ish precision). So with different training distributions you get different classifiers. I mean, duh.
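To make the arithmetic concrete, here's a rough sketch (the 0.19 false positive rate is a number I've made up to roughly reproduce that 0.37 figure- it's not from the paper). At a fixed recall, precision is completely determined by the false positive rate and the base rate:

    # precision of a hypothetical classifier with a given recall (true positive
    # rate) and false positive rate, applied to a population with a given
    # prevalence of positives
    def precision(recall, false_positive_rate, prevalence):
        tp = recall * prevalence                       # true positives per capita
        fp = false_positive_rate * (1.0 - prevalence)  # false positives per capita
        return tp / (tp + fp)

    # same hypothetical classifier (recall 1.0, FPR ~0.19), different base rates:
    print(precision(1.0, 0.19, 0.5))   # ~0.84 on balanced classes
    print(precision(1.0, 0.19, 0.1))   # ~0.37 with 10% positives
    print(precision(1.0, 0.19, 0.07))  # ~0.28 at the paper's 7% estimate

Same hypothetical classifier, same features- the precision collapses just because the class ratio changed.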
So that's the kind of thing I'm talking about. And for large enough problems
(e.g. the language domain where the true problem is probably infinite) that is
a real issue, because even with the best intentions and the best sampling
techniques you may never get enough data to train a proper discriminator. And
even if your laboratory experiments tell you that your discriminator is
generalising well on the data you have but didn't train it on, you have no way
to tell what will happen when it encounters data that even you haven't seen yet.
_________________
[1] That was for users with at least one picture, which is to say, everyone.
Decreasing numbers of users had 2, 3 and so on up to 5 pictures and they were
used in different experiments.
I agree that the sort of sampling bias/distributional shift you mention is definitely a real problem in deep learning. It so happens that yesterday I read a blog post that elaborates on this problem more clearly than I can. Although it's still true that generalizing under arbitrary distributional shift is impossible, it does make sense to allow for distributional shift across the natural manifold (the manifold of all "allowable" data). The blog post called this "strong generalization", which isn't a term I've heard before.
I'm just an undergraduate doing research in computer vision/deep learning. I've been following a lot of the research in deep learning theory because I think it's really interesting and there's been a lot of activity recently. Maybe I'll be able to do research on pure ML theory one day, but my math ability is definitely not up to par yet haha...
Thanks for the article! I'm still trying to find the time to read it, but it looks very interesting and well written. I hadn't heard the term "strong generalisation" before either, but it's a nice concept that is often overlooked in discussions of machine learning.
To me it's a big problem and I don't think anyone has good answers. I once pressed one of my tutors at university (I did an AI Masters) with a question about it. In the end he kind of put his hands up and said that the final arbiter of your system's performance is your client- if they're happy with your model, then that's that. That certainly does not satisfy me. On the other hand, of course, the success of some machine learning systems "in the wild" is indisputable (say, AlphaGo and AlphaZero etc).
Don't be too worried about the maths. I'm doing an AI PhD and my maths was never very good either (although to be fair it's a GOFAI subject, so it's a different kind of maths). You learn to read the maths in papers and understand what's going on- slowly and painfully at first, but you get there in the end. Besides, if you're good with the code, you can use that to help you get your head around the hardest bits. I still have trouble with calculus, but I once implemented an LSTM network in more or less pure Python (no frameworks, besides numpy) for one of my classes at uni. I think it even worked properly. My tutor was duly impressed anyway.
Thanks again for the links to the papers I hadn't read and that nice little article- and good luck with your research :)
I think my math is pretty good... for an undergrad at least. Problem is that the barrier for contributing meaningful research in ML theory is higher than in applied ML haha. Anybody can read an applied paper like the ResNet one and get the main idea.
I do agree that strong generalization is a problem that definitely needs to be solved.
Basically, we need empirical results to know what kind of theoretical models are useful to study. Otherwise, we fall into the trap of assuming that non-convex loss surfaces are basically impossible to analyze. I think the author draws a very cogent comparison to statistical physics.
If you have good maths skills you shouldn't be too worried. In any case, most people's first contributions -in any field- are rarely entirely their own. Look around for papers where the supervisor is the last author- like the Hochreiter & Schmidhuber paper you linked above. That's a clear sign of a more experienced researcher lending a hand to a younger researcher. And that's the supervisor-supervisee relation in a nutshell, really.
Is there a separate word from overfitting to refer to generalization performance in a way that explicitly includes distributional shift? Just curious as I never realized that overfitting was commonly constrained in the way you describe and would like to go back and relabel my knowledge to better distinguish between your definition of overfitting and a more general version.
Generalization performance is typically defined in roughly this manner.
Assume your training data consists of a sample S of m elements from your full distribution D. Let L(S) be the average loss for your model (hypothesis is also a common term) on S, and L(D) be the average loss for your model on D.
Now, we say that training generalizes if your model that minimizes L(S) also achieves similarly low L(D).
In practice, we detect generalization by taking a separate sample (validation set) S_2 from D and using L(S_2) to approximate L(D).
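To make that concrete, here's a toy sketch (a made-up one-parameter least-squares model, purely illustrative; the "full distribution" D is just stood in for by a very large sample):

    import numpy as np

    rng = np.random.default_rng(0)

    def avg_loss(theta, data):
        # average squared-error loss L(.) of the model y = theta * x on a sample
        x, y = data
        return np.mean((y - theta * x) ** 2)

    def draw(n):
        # draw n elements from the "full distribution" D
        x = rng.normal(size=n)
        y = 2.0 * x + rng.normal(scale=0.5, size=n)
        return x, y

    S   = draw(50)         # training sample of m = 50 elements
    S_2 = draw(50)         # held-out validation sample
    D   = draw(1_000_000)  # large sample standing in for D itself

    x, y = S
    theta = np.sum(x * y) / np.sum(x * x)  # minimizes L(S) for this model class

    print(avg_loss(theta, S))    # L(S): training loss
    print(avg_loss(theta, S_2))  # L(S_2): what we can actually measure
    print(avg_loss(theta, D))    # L(D): what we actually care about
    # training generalizes if L(D) (approximated by L(S_2)) stays close to L(S)

The gap between L(S) and L(S_2)/L(D) is the generalization gap we care about.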
I'm not familiar with any literature of generalization to arbitrary distributional shift between training and test. If you allow arbitrary distributional shift (say we decide that all dogs are now called cats and all cats are now called dogs), it's pretty clear that you can get arbitrarily poor "generalization".
Ok, but in many cases there's more that can be said about how well S and S_2 approximate D, which would allow discussion of various forms of bounded or constrained distributional shift. I don't think the only choices are "no distributional shift" and "arbitrary distributional shift", are they? What's the vocabulary for referring to poor generalization performance on datasets with some form of constrained or procedurally introduced distributional shift?
In short, it's when your training data come from different conditions than your real world data.
This means that the probability distribution your model was trained on is different from the one you are using it on, so a model that did great in cross-validation could still be useless in practice.
Basically, it's when the distribution your classes come from changes between training and test.
For example, if after training your classifier to distinguish between cats and dogs, all of a sudden society decides that all white haired dogs are now called cats, your classifier is clearly going to get "poor" performance.
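Here's a tiny made-up simulation of exactly that kind of shift (all the numbers are arbitrary): a classifier that perfectly learned the training-time labeling rule loses accuracy purely because the rule changed at test time.

    import numpy as np

    rng = np.random.default_rng(1)

    # features for a population of animals
    n = 100_000
    is_dog = rng.random(n) < 0.5
    is_white_haired = rng.random(n) < 0.3

    # the classifier perfectly learned the training-time rule: "dog" iff it's a dog
    prediction = is_dog

    # training-time labels vs. labels after society relabels white-haired dogs as cats
    train_label = is_dog
    test_label = is_dog & ~is_white_haired

    print((prediction == train_label).mean())  # 1.0 under the old labels
    print((prediction == test_label).mean())   # ~0.85: every white-haired dog is now "wrong"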