>For instance, local entropy, the loss that Entropy-SGD minimizes, is the solution of the Hamilton-Jacobi-Bellman PDE and can therefore be written as a stochastic optimal control problem, which penalizes greedy gradient descent. This direction further leads to connections between variants of SGD with good empirical performance and standard methods in convex optimization such as inf-convolutions and proximal methods.
You clearly didn't read the article. What are you commenting on, then? It seems to be about your own understanding, since it's certainly not about the article.
My understanding is that the issue is that the full Hessian of the loss is too expensive to compute at each step, relative to the speedup in learning it would buy.
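A rough back-of-the-envelope sketch (my own toy numbers, nothing from the article): the gradient has one entry per parameter, but the full Hessian has n^2 entries, which gets silly even for a small net:

    # Rough cost comparison (toy numbers, not from the article): parameter count n
    # vs. full Hessian size n^2 for a small fully-connected net 784 -> 512 -> 512 -> 10.
    layers = [(784, 512), (512, 512), (512, 10)]
    n = sum(fan_in * fan_out + fan_out for fan_in, fan_out in layers)  # weights + biases
    print(f"parameters (gradient entries): {n:,}")                     # ~670 thousand
    print(f"full Hessian entries:          {n * n:,}")                 # ~4.5e11
    print(f"Hessian memory at fp32:        {n * n * 4 / 1e12:.1f} TB") # ~1.8 TB per step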
I think this is a work in progress? It seems to be in IEEE format, where pages are gold, and one would almost never waste an entire page on references (surveys) and leave another mostly blank. I'll wait for the peer-reviewed version.
This is a horribly titled paper, and I don't necessarily agree with the premise. Since when did we become convinced that deep learning achieves "global optimality", as they put it?
When the network is deep or big enough, local minima tend to have loss close to the global optimum. There is theoretical and empirical evidence for this. The global minimum usually corresponds to overfitting, so what is needed is getting close enough.
In practice, algorithms like stochastic gradient descent have problems distinguishing between saddle points and local minima. There is a hypothesis that in many or most cases the "local minima" where algorithms get stuck are actually saddle points with long plateaus.
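A toy illustration of that plateau effect (my own construction, not from the article):

    # Plain gradient descent on f(x, y) = x^2 - y^4, which has a saddle at the origin.
    # The escape direction y has gradient -4*y^3, vanishingly small near y = 0, so the
    # iterate sits on a long plateau that is easy to mistake for a local minimum.
    x, y, lr = 1.0, 1e-3, 0.01
    for step in range(50001):
        gx, gy = 2 * x, -4 * y**3          # gradient of f
        x, y = x - lr * gx, y - lr * gy
        if step % 10000 == 0:
            print(f"step {step:6d}  x={x:+.2e}  y={y:+.2e}")
    # x collapses to ~0 almost immediately, while y has barely moved after 50k steps.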
This mathematical hand-waving plagued the field of genetic algorithms and its variants in the 90s, until the 'no free lunch' theorem was published out of the Santa Fe Institute. It essentially said (my take on it) that if you have no information about the landscape you are searching, then you can't say much about the algorithm you are selling.
We are talking about continuous multidimensional optimization. We have already selected our bias and the subset of problems we want to solve. Now we are figuring out that, in this domain, there are theoretical reasons that explain why gradient descent works so well over a large number of problems.
Agreed. NFL is general and we need to discuss a specific landscape.
Deep learning is continuous nonlinear multidimensional optimization, yes.
What defines the subset of problems? A general nonlinear mapping from R^n to R^m with n > m?
Or are you limiting it to image classification? Speech recognition? Either would be a subset.
We have empirical evidence that deep learning works, but I'm not confident that we have the mathematical tools to understand why.
It showed that large CNN models, which have far greater capacity than the data they are shown (and could have memorized it), still tend to learn minima that generalize very well.
See Proposition 1:
(i) For any model class F whose model complexity is large enough to memorize any dataset and which includes f∗ possibly at an arbitrarily sharp minimum, there exists (A, S_m) such that the generalization gap is at most epsilon, and
(ii) For any dataset S_m, there exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the generalization gap of f_A(S_m) is at most epsilon.
True. High-dimensional non-linear systems (like deep NNs) are a frontier we are just beginning to explore with empirical success.
However, if I were a VC, my knowledge of the NFL theorem would make me look skeptically at any grandiose claims of miracle learning algorithms. In a sense, it provides me with an upper bound on BS.
>> Global minimum is usually overfitting so what is needed is getting close enough.
If the global minimum is overfitting, then what are local minima doing, exactly? Generalising well? I find that very hard to believe.
Also- we're talking about overfitting in the context of a particular training dataset, right? Because we all know that it doesn't matter how well your model validates against your out-of-sample data, its performance is going to fall by 10% to 20% once it hits the real world and has to deal with truly unseen data.
Yes. Specifically, the common argument has become that flat minima generalize well.
As for the last part, when people talk about overfitting we're typically ignoring distributional shift. Otherwise, models can achieve very good generalization.
I'm skeptical about both statements. I haven't heard the "common argument" you mention and I don't see how ignoring differences in distribution between laboratory and real-world data will improve generalisation.
I find the last bit particularly hard to accept, but I'd like to hear why you say so, on both counts.
Sorry, my last sentence
> Otherwise, models can achieve very good generalization.
was probably unclear. When we talk about generalization, we typically have to assume a lack of distributional shift. Otherwise, if the "truly unseen data" really is drawn from a different distribution (say one where all cats are now called "dogs" and all dogs are now called "cats"), we can get arbitrarily poor performance.
As for the flat minima part, lemme pull up some papers from my notes.
>Although flatness of the loss function has been linked to generalization [Hochreiter and Schmidhuber, 1997], existing definitions of flatness are neither invariant to nodewise re-scalings of ReLU nets nor general coordinate transformations [Dinh et al., 2017] of the parameter space, which calls into question their utility for describing generalization
https://arxiv.org/abs/1703.11008 is also a really promising paper on generalization that basically shows another way to view flatness from a Bayesian perspective (they also show the most impressive generalization bound I've seen yet).
I think it's clear from these papers that the "flat minima generalize well" argument is a common avenue of research with respect to generalization. Many of the generalization papers I've read recently mostly explore different ways of calculating flatness, and pretty much all of them reference and link their work to the 1st and 4th papers I linked above.
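To make the non-invariance point concrete, here's a tiny sketch of the re-scaling trick (my own toy construction, not code from any of those papers): scaling the first layer of a ReLU net by alpha and the second by 1/alpha leaves the function unchanged, but the same parameter perturbation now has a very different effect, so naive parameter-space "flatness" isn't well defined.

    import numpy as np

    # Toy 2-layer ReLU net: f(x) = W2 @ relu(W1 @ x). For alpha > 0, the pair
    # (alpha*W1, W2/alpha) computes exactly the same function, yet a fixed small
    # perturbation of W1 changes the output by ~alpha times less on the rescaled copy.
    rng = np.random.default_rng(0)
    relu = lambda z: np.maximum(z, 0.0)
    W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))
    x = rng.normal(size=8)
    f = lambda A, B: (B @ relu(A @ x))[0]

    alpha = 100.0
    print(np.isclose(f(W1, W2), f(alpha * W1, W2 / alpha)))   # True: identical function

    eps = 0.01 * rng.normal(size=W1.shape)                    # same additive perturbation
    print(abs(f(W1 + eps, W2) - f(W1, W2)))                   # effect on the original net
    print(abs(f(alpha * W1 + eps, W2 / alpha) - f(W1, W2)))   # ~alpha times smaller here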
Thanks for the reply. What do you do, btw? I mean, where does your interest in this subject come from, if I'm not being nosy (again)?
I'll have to have a look at those papers. Schmidhuber and Hochreiter first, obviously- my favourites.
>> When we talk about generalization, we typically have to assume a lack of distributional shift. Otherwise, if the "truly unseen data" really is drawn from a different distribution (say one where all cats are now called "dogs" and all dogs are now called "cats"), we can get arbitrarily poor performance.
Well, the difference in distribution doesn't have to be that radical to lead to egregious mistakes. For example- I don't know if you caught wind of the recent "deep gaydar paper", where two researchers from Stanford trained a deep learning classifier to identify gay men and women from images of their faces?
The data in one experiment consisted of about 12k pictures of users of a US dating site. Shockingly (to me anyway), the number of pictures of gay men was exactly the same as the number of pictures of straight men, and so on for the numbers of pictures of gay and straight women [1]. This is a completely unrealistic, artificial distribution - estimates I've heard about the prevalence of homosexuality in the general population vary from 5% to 10%. The paper itself references a 7% prevalence.
Given such artificial data, I really don't think anyone could train a classifier that would have the same ability to discriminate between categories when exposed to the real world. And we're not talking about "cats" looking like "dogs"- it's a simple binary problem, discriminating between gay/not gay, and the same features are (presumably) associated with the same classes; it's just that the classifier was trained on balanced classes while in the real world the classes are unbalanced.
Of course I can't really say what would happen if their classifier was trained on data with a distribution matching the real world (or their 7% expectation of gay men), but I do know that, with unbalanced classes, it's perfectly possible to train a classifier with perfect recall and rubbish precision (with a ratio of 10% gay men, you can get 1.0 recall with 0.37-ish precision). So with different training distributions you get different classifiers. I mean, duh.
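For a rough sanity check of that number (my own back-of-the-envelope arithmetic, not figures from the paper): with recall and false positive rate held fixed, precision follows directly from the prevalence, so a classifier that looks fine on balanced data can have its precision collapse on a 5-10% population.

    # Precision as a function of prevalence p, with recall (TPR) and false positive
    # rate (FPR) held fixed, i.e. the classifier doesn't change, only the population
    # it is applied to. The FPR below is my own pick, chosen to roughly reproduce
    # the "0.37-ish" figure at 10% prevalence.
    def precision(p, tpr, fpr):
        return (p * tpr) / (p * tpr + (1 - p) * fpr)

    tpr, fpr = 1.0, 0.19
    for p in (0.5, 0.10, 0.07, 0.05):
        print(f"prevalence {p:.0%}: precision {precision(p, tpr, fpr):.2f}")
    # 50%: 0.84, 10%: 0.37, 7%: 0.28, 5%: 0.22 -- perfect recall, collapsing precision.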
So that's the kind of thing I'm talking about. And for large enough problems (e.g. the language domain, where the true problem is probably infinite) that is a real issue, because even with the best intentions and the best sampling techniques you may never get enough data to train a proper discriminator. And even if your laboratory experiments tell you that your discriminator is generalising well against the data you have, but didn't train it on, you have no way to tell what will happen when it encounters data that even you have yet to see.
_________________
[1] That was for users with at least one picture, which is to say, everyone. Decreasing numbers of users had 2, 3 and so on up to 5 pictures, and they were used in different experiments.
I agree that the sort of sampling bias/distributional shift you mention is definitely a real problem in deep learning. It so happens that yesterday I read a blog post that elaborates on this problem more clearly than I can. Although it's still true that generalizing under arbitrary distributional shift is impossible, it does make sense to allow for distributional shift across the natural manifold (the manifold of all "allowable" data). The blog post called this "strong generalization", which isn't a term I'd heard before.
I'm just an undergraduate doing research in computer vision/deep learning. I've been following a lot of the research in deep learning theory because I think it's really interesting and there's been a lot of activity recently. Maybe I'll be able to do research on pure ML theory one day, but my math ability is definitely not up to par yet haha...
Thanks for the article! I'm still trying to find the time to read it, but it looks very interesting and well written. I hadn't heard the term "strong generalisation" before either, but it's a nice concept that is often overlooked in discussions of machine learning.
To me it's a big problem and I don't think anyone has good answers. I once pressed one of my tutors at university (I did an AI Masters) with a question about it. In the end he kind of put his hands up and said that the final arbiter of your system's performance is your client- if they're happy with your model, then that's that. That certainly does not satisfy me. On the other hand, of course, the success of some machine learning systems "in the wild" is indisputable (say, AlphaGo and AlphaZero etc).
Don't be too worried about the maths. I'm doing an AI PhD and my maths was never very good either (although to be fair it's a GOFAI subject, so it's a different kind of maths). You learn to read the maths in papers and understand what's going on- slowly and painfully at first, but you get there in the end. Besides, if you're good with the code, you can use that to help you get your head around the hardest bits. I still have trouble with calculus, but I once implemented an LSTM network in more or less pure Python (no frameworks besides numpy) for one of my classes at uni. I think it even worked properly. My tutor was duly impressed anyway.
Thanks again for the links to the papers I hadn't read and that nice little article- and good luck with your research :)
I think my math is pretty good... for an undergrad at least. The problem is that the barrier to contributing meaningful research in ML theory is higher than in applied ML, haha. Anybody can read an applied paper like the ResNet one and get the main idea.
I do agree that I think strong generalization is a problem that definitely needs to be solved.
Basically, we need empirical results to know what kinds of theoretical models are useful to study. Otherwise, we fall into the trap of assuming that non-convex surfaces are basically impossible to analyze. I think he draws a very cogent comparison to statistical physics.
If you have good maths skills you shouldn't be too worried. In any case, most people's first contributions -in any field- are rarely entirely their own. Look around for papers where the supervisor is the last author- like the Hochreiter & Schmidhuber paper you linked above. That's a clear sign of a more experienced researcher lending a hand to a younger one. And that's the supervisor-supervisee relation in a nutshell, really.
Is there a word, separate from overfitting, that refers to generalization performance in a way that explicitly includes distributional shift? Just curious, as I never realized that overfitting was commonly constrained in the way you describe, and I'd like to go back and relabel my knowledge to better distinguish between your definition of overfitting and a more general version.
Generalization performance is typically defined in roughly this manner.
Assume your training data consist of a sample S of m elements drawn from your full distribution D. Let L(S) be the average loss of your model (hypothesis is also a common term) on S, and L(D) its average loss on D.
Now, we say that training generalizes if the model that minimizes L(S) also achieves a similarly low L(D).
In practice, we detect generalization by taking a separate sample (validation set) S_2 from D and using L(S_2) to approximate L(D).
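In toy code (my own setup, not from any particular paper), the bookkeeping looks roughly like this:

    import numpy as np

    # Toy version of the definitions above: D is the full distribution, S the training
    # sample, S_2 a held-out validation sample. The generalization gap is L(D) - L(S);
    # in practice we approximate L(D) with L(S_2).
    rng = np.random.default_rng(0)

    def draw(m):                            # stand-in for sampling (x, y) ~ D
        x = rng.normal(size=(m, 5))
        y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=m)
        return x, y

    def avg_loss(w, x, y):                  # squared-error loss L(.) of model w
        return float(np.mean((x @ w - y) ** 2))

    xs, ys = draw(50)                       # S: small training sample
    x2, y2 = draw(10_000)                   # S_2: large held-out sample, proxy for D
    w = np.linalg.lstsq(xs, ys, rcond=None)[0]   # fit the model by minimizing L(S)

    print("L(S)   =", avg_loss(w, xs, ys))
    print("L(S_2) =", avg_loss(w, x2, y2))                        # approximates L(D)
    print("gap    =", avg_loss(w, x2, y2) - avg_loss(w, xs, ys))  # estimated generalization gap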
I'm not familiar with any literature on generalization under arbitrary distributional shift between training and test. If you allow arbitrary distributional shift (say we decide that all dogs are now called cats and all cats are now called dogs), it's pretty clear that you can get arbitrarily poor "generalization".
OK, but in many cases there's more that can be said about how well S and S_2 approximate D, which would allow discussion of various forms of bounded or constrained distributional shift. I don't think the only choices are "no distributional shift" and "arbitrary distributional shift", are they? What's the vocabulary for referring to poor generalization performance on datasets with some form of constrained or procedurally introduced distributional shift?
In short, it's when your training data come from different conditions than your real-world data.
This means that the probability distribution your model was trained on is different from the one you are using it on, so a model that did great in cross-validation could still be useless in practice.
Basically, it's when the distribution your classes come from changes between training and test.
For example, if after training your classifier to distinguish between cats and dogs, all of a sudden society decides that all white haired dogs are now called cats, your classifier is clearly going to get "poor" performance.
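If it helps, here's a minimal sketch of that kind of shift (my own toy construction, nothing from the thread's papers):

    import numpy as np

    # Concept-shift toy: the inputs look the same at test time, but the labeling
    # convention changes (cf. "white haired dogs are now called cats"), so p(y | x)
    # itself shifts and accuracy drops even though x is drawn from the same distribution.
    rng = np.random.default_rng(0)

    x_train = rng.normal(size=5000)
    y_train = (x_train > 0.0).astype(int)        # training-time labeling rule
    threshold = 0.0                              # the rule a classifier would learn here

    x_test = rng.normal(size=5000)               # same input distribution...
    y_test = (x_test > 0.8).astype(int)          # ...but some positives are relabeled
    pred = (x_test > threshold).astype(int)

    print("train accuracy:", np.mean((x_train > threshold).astype(int) == y_train))  # 1.0
    print("test accuracy: ", np.mean(pred == y_test))                                # ~0.71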
> Since when did we become convinced deep learning achieves "global optimality" as they put it.
There are a couple of papers over the past few years arguing/proving under various models that the local optima in deep models either are or are within epsilon of the global optima and this is why SGD reaching only local optima results in models that still work well. I assume these results are covered in OP (haven't read it yet, just answering your question).
> There are a couple of papers over the past few years arguing/proving under various models that the local optima in deep models either are or are within epsilon of the global optima
I don't care so much for the “arguing” part, but the “proving” part got me interested. ML methods reliably being able to get epsilon-close to global optima is a BFD. What mathematical technology was used to establish these results? Could you provide links to those papers?
I believe this is in reference to the information bottleneck, the idea that there is a limit on the performance of a particular network. From an information-theory point of view, a deep neural network can be viewed as compressing the input and then generating a code on the other side. So the network is seen as encoding the data, and because of that we can bound it using tools from information theory.
It is not; it actually looks like a very well-written paper, for a review, with references to all the latest developments. Your particular quibble is answered in a section along with references.