Gradient Descent Models Are Kernel Machines (infoproc.blogspot.com)
161 points by dilap on Feb 8, 2021 | 50 comments



The paper discussed here showed up on reddit a few months back [1]. Another paper appeared shortly after claiming the exact opposite [2]. Some discussion of this contradiction can be found in this part of the thread: [3].

I myself am very interested in [2]. It's fairly dense, but I've been meaning to work through it and the larger Tensor Programs framework ever since it came out.

[1] https://www.reddit.com/r/MachineLearning/comments/k7wj5s/r_e...

[2] https://www.reddit.com/r/MachineLearning/comments/k8h01q/r_w...

[3] https://www.reddit.com/r/MachineLearning/comments/k8h01q/r_w...


Thank you! Number 2 was exactly what I was looking for and I just couldn't find the link.


If you're surprised by this result because you've used kernel machines and didn't find them very good at generalization, keep in mind that this assumes a kernel function that accurately reflects the similarity of input samples. Most work with kernel machines just uses Euclidean distance. For instance, in an image recognition model the kernel would have to rate two images of dogs as more similar to each other than an image of a dog and an image of a cat.

With a sufficiently magical kernel function, indeed you can get great results with a kernel machine. But it's not so easy to write a kernel function for a domain like image processing, where shifts, scales, and small rotations shouldn't affect similarity much. Let alone for text processing, where it should recognize 2 sentences with similar meaning as similar.


This may not have been your intention, but I think you're underselling the utility of this result a bit. Yes, you need a good kernel function to get good results, but this paper literally provides a way to construct such a kernel.

With a way to turn gradient descent into a roughly equivalent kernel machine, you can also turn gradient descent into an embedding: rather than a model that merely seems to learn features, you can recover the actual feature space.

It's also interesting how the resulting kernel depends heavily on the path taken during gradient descent rather than merely on the end result. This is somewhat counterintuitive, since for the equivalent gradient-descent model only the final weights seemingly matter.


> It's also interesting how the resulting kernel depends heavily on the path taken during gradient descent rather than merely on the end result. This is somewhat counterintuitive, since for the equivalent gradient-descent model only the final weights seemingly matter.

I don’t understand. What do you mean here?

I only (quickly) skimmed the article but it says that the limit is equal to a (not the) zero of your gradient.

The result should be the same for every trajectory leading to the same zero. Put another way, it doesn’t depend on the path if it leads to the same zero.

I would say here that not all zeros are equal though. :)


> It's also interesting how the resulting kernel depends heavily on the path taken during gradient descent rather than merely on the end result.

After reading just the linked article, not the original paper, my interpretation is that the kernel isn't calculated during the gradient descent process, but from the final model after training by gradient descent. If that's right, it's not path-dependent.


No, the kernel is built from gradients accumulated during training. The proof depends completely on the particular form of optimization, in this case gradient descent in the limit of infinitesimal "timestepping", and suggests that gradient descent optimization doesn't cleverly pick good internal representations of the data.


> suggests that gradient descent optimization doesn't cleverly pick good internal representations of the data.

How so? The existence of equivalent kernels doesn't imply they aren't clever or good.


The final kernel is an integral taken over the path of the parameters during the gradient descent.

Basically what they're doing is just writing y = y_0 + \int (dy/dt) dt and then rephrasing this equation as a kernel machine, which I guess should also work for other forms of learning where you slowly change the parameters.
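
Spelled out a bit more (notation loosely following the paper; this is only a sketch of the chain of identities under gradient flow, not the full statement):

    % Gradient flow on the total loss L = \sum_i \ell(y_i^*, f_w(x_i)):
    \frac{dw}{dt} = -\sum_i \ell'_i \, \nabla_w f_w(x_i)
    % Chain rule for the output at a query point x:
    \frac{d f_w(x)}{dt} = \nabla_w f_w(x) \cdot \frac{dw}{dt}
                        = -\sum_i \ell'_i \, K^g_t(x, x_i),
    \qquad K^g_t(x, x_i) := \nabla_w f_w(x) \cdot \nabla_w f_w(x_i) \ \text{(tangent kernel)}
    % Integrating along the training trajectory c(t):
    f(x) = f_0(x) - \int_c \sum_i \ell'_i \, K^g_t(x, x_i) \, dt
    % Pulling \ell'_i into a kernel-weighted average \bar{\ell'_i} gives the kernel-machine form:
    f(x) = \sum_i a_i \, K^p(x, x_i) + b, \qquad
    K^p(x, x_i) := \int_c K^g_t(x, x_i) \, dt, \quad a_i = -\bar{\ell'_i}, \quad b = f_0(x)

The path dependence mentioned upthread is exactly the \int_c: both K^p and the a_i are integrals over the whole training trajectory, not functions of the final weights alone.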


Well, we have the sufficiently magical kernel function from the gradient descent process -- seems like we could save a lot of computation by extracting out that function and running it as an SVM rather than holding it independent.
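
Something like this is already doable with the tangent kernel at the final weights (the full path kernel would need the whole training trajectory). A minimal sketch, assuming a toy scalar-output net and scikit-learn's precomputed-kernel SVM; not the paper's construction, just the general idea:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy one-hidden-layer net with scalar output: f(x) = w2 . tanh(W1 x).
    # In practice W1, w2 would be the weights you got from gradient descent.
    d_in, d_hidden = 5, 16
    W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
    w2 = rng.normal(size=d_hidden) / np.sqrt(d_hidden)

    def grad_f(x):
        """Gradient of f(x) w.r.t. all parameters (W1, w2), flattened."""
        h = np.tanh(W1 @ x)                  # hidden activations
        dW1 = np.outer(w2 * (1 - h**2), x)   # d f / d W1
        dw2 = h                              # d f / d w2
        return np.concatenate([dW1.ravel(), dw2])

    # Stand-in labeled data (in practice: the training set the net was fit on)
    X = rng.normal(size=(40, d_in))
    y = (X[:, 0] > 0).astype(int)

    G = np.stack([grad_f(x) for x in X])     # one parameter-gradient per example
    K = G @ G.T                              # tangent-kernel Gram matrix

    svm = SVC(kernel="precomputed").fit(K, y)

    # Predicting on new points needs the kernel against the training examples
    X_test = rng.normal(size=(10, d_in))
    K_test = np.stack([grad_f(x) for x in X_test]) @ G.T
    print(svm.predict(K_test))

Whether that actually saves computation is another question - you still need a parameter gradient at each test point plus dot products against every stored training gradient, which is not obviously cheaper than a forward pass.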


Interesting idea! Is there any research in this direction?


>With a sufficiently magical kernel function, indeed you can get great results with a kernel machine. But it's not so easy to write a kernel function for a domain like image processing, where shifts, scales, and small rotations shouldn't affect similarity much. Let alone for text processing, where it should recognize 2 sentences with similar meaning as similar.

I think the key issue at hand is that a model trained by gradient descent is easier to fit than one built on explicit kernel functions. Someone could absolutely devise a mechanism for backpropagation of errors through kernel functions, but at that point it is basically a neural network.


Wouldn't that (similarity matters) be very useful information to encode in the NN though? Kinda like how you can do non-convolutional deep-learning on images, but it takes exponentially more training data than convolutional deep-learning does.


> If you're surprised by this result because you've used kernel machines and didn't find them very good at generalization,

This statement is a bit strange, because most of the theoretical work around the NTK assumes that the tangent kernel doesn't change appreciably during training. That seemed to make this line of work closer to "extreme learning machines" than to deep learning.

There was a paper from Francis Bach's group which seemed to be saying as much, but that criticism was no match for the meme-fication of this line of research by over-enthusiastic amateurs.


I may or may not be surprised by the result, but I'm definitely not surprised by another 'thing A is thing B' claim in machine learning.

Every Tom, machine-learner, and Harry is an expert at proving to the whole world that thing A is thing B. The only problem is that nobody hires them at a million dollars a year in total compensation.


But the converse is not true in this case, so still interesting.


is an expert == is an "expert"


The form derived in Pedro's paper is not a standard kernel machine - the coefficients depend on the example, and even the kernel-machine form is obtained rather unnaturally, by pushing one kernel term into the coefficient and keeping the other outside of it.


Discussion of the actual paper here:

https://news.ycombinator.com/item?id=25314830

It's a really neat one for people who care about what's going on under the hood, but not immediately applicable to the more applied folks. I saw some good quotes at the time to the tune of "I can't wait to see the papers citing this one in a year or two."


I'm not in this subsection of the industry, so take this with a heaping grain of salt. I'm just using my own biases here.

But I get the sense that this is one of a number of big pieces of the puzzle of how current cutting-edge models can be cut down in operational size. I have a feeling that new methods of creating models will stick around, but any deep learning model can be reduced to a less resource-intensive approximation through findings such as this. And if (IF!) so, a lot of AI hype is going to hit a real post-hype-cycle plateau of productivity, because all this stuff will become easier and more cost-effective to deploy in the real world.


I think it's clear that there's a strong relationship between all of these methods. Papers proving that (infinitely wide) neural networks are GPs came out a year or two ago, for example. Deep learning is more useful in the end because kernel machines and Gaussian processes (classically) store all samples in a dataset and require expensive pairwise comparisons. Deep learning/gradient descent doesn't.


I don't want to seem like I'm minimising (if I may say so) these works and similar ones, but this is not surprising at all, and they're not actually new results. Cf. Donsker-like theorems.


> Gradient Descent Models Are Kernel Machines

... that also happen to actually work.


Is this now considered correct/acceptable terminology? Back in my day optimization techniques weren’t the same thing as models.

You can get point estimates for many completely different models using gradient descent. The article clarifies what is meant in the first line, but it still seems like a very poor choice of words.

Now get off my lawn.


There's some super-interesting recent empirical work from MS Research that makes this perspective look less useful. In short, the behavior of neural networks in a few regimes is hard to square with the kernel interpretation, specifically: ensembling the same network trained multiple times separately improves results compared to training multiple networks concurrently, knowledge distillation works, and self-distillation works. The blog post [1] has a good intuitive explanation of the matter and a link to the paper.

[1] https://www.microsoft.com/en-us/research/blog/three-mysterie...


To add on: Zeyuan Allen-Zhu and Yuanzhi Li had a great theoretical paper earlier last year proving a fundamental discrepancy between the concept classes learnable by kernel methods and deep learning models.

https://arxiv.org/abs/2001.04413


Yannic Kilcher's paper-explained series covers this paper pretty well. I feel like I have a decent understanding of it just after watching the video with a few pauses/rewinds.

https://www.youtube.com/watch?v=ahRPdiCop3E


* for a very specific, nonstandard definition of a kernel machine...


Quoting an expert from the last time this was posted:

So at the end, it rephrased a statement from "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" [https://arxiv.org/abs/1806.07572], albeit in a way which is kind of misleading.

The assertion has been known in the community since at least 2018, if not well before.

I find this article and the buzz around it a little awkward.


Some days I feel that, in the final analysis, everything is just linear regression.


As someone who has studied nonlinear nonconvex optimization, I don't think linear regression is the final word here. In the universe of optimization problems for curve-fitting, the linear case is only one case (albeit a very useful one).

In many cases though, it is often insufficient. The next step is piecewise linearity and then convexity. It is said that convexity is a way to weaken a linearity requirement while leaving a problem tractable.

Many real world systems are nonlinear (think physics models), and often nonconvex. You can approximate them using locally linear functions to be sure, but you lose a lot of fidelity in the process. Sometimes this is ok, sometimes this is not, so it depends on the final application.

It happens that linear regression is good enough for a lot of stuff out there, but there are many places where it doesn't work.


> Many real world systems are nonlinear (think physics models)

Technically, only if you don't zoom in too far: quantum mechanics is linear.


This is a little too simplistic/misleading

- yes, QM is linear in terms of states, vector spaces and operators acting on states (it's just functional analysis [1])

- it’s not linear in terms of how probability distributions and states behave under the action of forces and multi-body interactions (see, eg correlated phenomena like phase transitions, magnetism, superconductivity etc)

[1] https://en.m.wikipedia.org/wiki/Functional_analysis


Or rather, that all statistical methods are dressed up regression.


That's not really correct either. Regression is but a small (and useful) part of statistics -- it's the part of statistics that most engineers use, but statistics goes beyond regression. For instance, sampling theory is not regression. Hypothesis testing is not regression. Causal inference is not regression.

Probability undergirds a lot of statistical methods.


Yes there are non-linear functions...but often these result from combinations of linear functions.


This is not true. Nonlinear functions generally do not result from combinations of linear functions. It's the other way around -- combinations of linear basis functions are often used to approximate nonlinear functions, often in a spline sort of way.

However, this only works (practically) for mildly nonlinear functions. If you have exp, log, sin, or cos functions, for instance, and you're modeling a range where the function is very nonlinear, the approximation errors often become problematic unless you use a really dense cluster of knots. For certain nonlinear systems of equations, this can blow up the size of your problem. It's a tradeoff between the number of linear basis functions you use for the approximation and the error you are willing to accept.
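
A quick numerical illustration of that tradeoff (toy example only: uniform knots, exp on [0, 5], piecewise-linear interpolation via numpy):

    # How fast does the error of a linear spline shrink as we add knots?
    import numpy as np

    x_dense = np.linspace(0.0, 5.0, 10_001)
    f = np.exp

    for n_knots in (5, 10, 20, 40, 80):
        knots = np.linspace(0.0, 5.0, n_knots)
        approx = np.interp(x_dense, knots, f(knots))   # piecewise-linear through the knots
        max_err = np.max(np.abs(approx - f(x_dense)))
        print(f"{n_knots:3d} knots -> max abs error {max_err:.4f}")

Roughly, halving the knot spacing quarters the error, so a sharply curved region eats knots quickly.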



hey, hey!, we can over-fit higher order approximations too. The nerve on some of ya.

(Absolutely just kidding around here)


FC neural nets are iterated linear regression, in some sense.
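
In the sense that each layer applies an affine map (the "regression" part, though not fit by least squares) followed by a pointwise nonlinearity, composed L times:

    f(x) = W_L \, \sigma\big( W_{L-1} \cdots \sigma( W_1 x + b_1 ) \cdots + b_{L-1} \big) + b_L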


given the properly transformed space that may very well be true...


> This result makes it very clear that without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples.

Is this another way of saying that neural networks are just another statistical estimation method and not a path to general artificial intelligence? Or is it saying that problems like self-driving cars are not suited to the current state of the art in AI, since we have to ensure that reality doesn't deviate from the training examples? Or both?

I'd love to understand the "real life" implications of this finding better.


Neither. It's more akin to the fundamental theorem of calculus. If you follow your data from point A to point B along some differentiable path, you can sum up/integrate the steps and make the jump from A to B directly in a single expression (with the jump written in terms of those integrals). It's a super interesting viewpoint on gradient descent and the models that use it, one that could be really useful for looking at and understanding those models abstractly, but it isn't saying anything about suitability for different tasks.


When this was discussed about two months ago, the conclusion I took away was: not very many at the moment, beyond a somewhat formal equivalence.

https://news.ycombinator.com/item?id=25314830


Is it not possible to achieve AGI with a good statistical estimation method?


Depends on whether you think "a human is unlikely to perform well on an example which deviates strongly (as defined by experience) from all training examples" is true.


A human can learn from a single training sample. I can show my 2 year old a cartoon picture of an unfamiliar animal- say, an anteater- and then show him a page of drawings by a different artist in a different style, of several animals, in various poses, and he can immediately point to the anteater. Think about that: a single example, crudely drawn. You and I could do that too. Human beings are masters of learning from training sets of n=1.

I’m not sure why humans are so good at that. I feel like it might be that by spending the first couple years of life picking things up and rotating them and putting them in our mouths, we have some deeper understanding that images of an animal like an anteater are not really two dimensional but are supposed to represent volume and form. That’s my guess. A neural net trained to recognize 2d images will not have that.


I don't think that's a fair comparison. It's not 1 sample. It's 1 sample + existing training in segmentation + existing model for expected animal build + existing model for 2d/3d extrapolation from the drawing + pre-trained capacity for reinforcement learning, etc.

If you start slightly before 2yo, you may find that after they see a cow they will call every large animal a cow for weeks/months - that's closer to one sample without all the models.


It’s not a fair comparison, of course, but you know, I don’t think the public cares as much about whether it’s fair as they care about how well it works. I guess what I’m saying is that to match human performance you’d need all that prior understanding that you just listed. Maybe that’s what’s missing from our models.


I think he is saying that a competition where both sides are allowed to bring a lot of preconceptions into the learning task is an interesting case where people currently beat the machines by a lot.



