>The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning.
The lore I've heard is that most new deep learning training algorithms (optimization algorithms) only work better on particular special cases, and it is hard to do better than the established algorithms in general.
I'm also not sure why you're saying they're applicable beyond deep learning--how do you plan to train a PGM or SVM using Adam?
I'd describe the area more generally as first-order optimization, including methods like acceleration, automatic differentiation, and stochastic approaches. Adam is just one trick for adapting the step-size hyperparameter.
They are usable anywhere derivative-based optimization is usable, which certainly includes SVMs. Since an SVM is a shallow method you don't need much data to train it, and hence don't need scalable optimization methods (it would just be unnecessarily slow), but you certainly could do it if you somehow needed to. Here's the first Google hit for "sgd svm": https://scikit-learn.org/stable/modules/generated/sklearn.li...
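For the curious, here is a minimal sketch of what that looks like in practice, using scikit-learn's SGDClassifier with hinge loss (which gives a linear SVM trained by SGD). The toy dataset and hyperparameters are placeholders, not anything taken from the linked page.

    # Linear SVM trained with SGD: hinge loss in SGDClassifier.
    # Toy data and hyperparameters are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3)
    clf.fit(X, y)
    print(clf.score(X, y))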
The fact that you can't use first-order optimization methods for graphical models is one answer to the question of why everyone doesn't use them. That said, for small graphical models there are deep networks that model them and are trained in the usual way for neural networks. I think this is still an active research area.
Nice, yeah, I'd agree with the vast majority of this. The only thing I'd add is that Adam/gradient methods are still useful in a graphical model, e.g. to get a MAP estimate (and then you can get a rough posterior estimate using variational methods or a Laplace approximation once you find the MAP).

But I agree I wasn't clear about what I mean by graphical models, since I think most people would take that to mean a full MCMC sampling of the posterior and marginalization over hyperparameters. It's useful to understand why people do that and why it's valuable, but many times it is (1) overkill and (2) inspires overconfidence in the result, because once we marginalize over our prior distribution people tend to forget that the prior may have been a complete fudge. I just mean graphical models as a tool for model building, for understanding how different models relate to one another, and as a recipe for deriving a loss function.
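To make the MAP-plus-Laplace point concrete, here is a rough sketch for logistic regression with a Gaussian prior. Plain gradient ascent stands in for Adam, and the data, prior scale, and step size are all made up for illustration.

    # MAP estimate of logistic-regression weights with a Gaussian prior,
    # then a Laplace approximation to the posterior around the MAP.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.5, -2.0, 0.5])
    y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

    prior_prec = 1.0               # Gaussian prior N(0, prior_prec^-1 * I)
    w = np.zeros(3)
    lr = 0.01
    for _ in range(2000):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (y - p) - prior_prec * w   # gradient of the log posterior
        w += lr * grad                          # MAP by gradient ascent

    # Laplace approximation: posterior ~= N(w_map, H^-1), H = negative Hessian at the MAP
    p = 1 / (1 + np.exp(-X @ w))
    H = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(3)
    cov = np.linalg.inv(H)
    print("MAP:", w)
    print("approx posterior sd:", np.sqrt(np.diag(cov)))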
Adam, RMSProp, etc. are just flavors of gradient descent, so they're useful on anything from ResNet to logistic regression. There are more flavors, like natural gradient, that are more useful for smaller problems since they require forming a full Fisher/Hessian-style matrix, but gradient descent is gradient descent. We use Adam in production for logistic regression, not for any particular reason really; it just happens to work.
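As a concrete (if contrived) illustration of "Adam is just a flavor of gradient descent": logistic regression trained with torch.optim.Adam. This isn't anyone's production setup; the data and learning rate are invented.

    # Logistic regression = linear layer + sigmoid, trained with Adam.
    import torch

    torch.manual_seed(0)
    X = torch.randn(500, 4)
    y = (X @ torch.tensor([1.0, -1.0, 0.5, 0.0]) > 0).float()

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        opt.step()
    print(loss.item())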
I'm not the OP, but personally I see NNs as being really useful where the input data is unstructured (such as text or images). The deep approach appears to build better features than a human can, but I'm not convinced deep models are _that_ much better (or indeed better at all) than standard methods for tabular data.
Once upon a time, when I used to hire data people, I'd ask them to tell me about a recent data project. They'd normally mention some kind of complex model, and I'd ask them how much better it was than linear/logistic regression. A really large proportion of candidates (around 50%) couldn't answer this because they'd never compared their approach to anything simpler.
One person told me that linear regression wasn't in the top 10 Kaggle models, so they would never use it.
Oh, so training time is virtually irrelevant to us; if it weren't, we would have to be a lot more careful about optimization methods and possibly about which language to use. We also cannot use NNs for the models we build (we are restricted to LR, but LR has as much model capacity as you need as long as you keep adding feature interaction terms).
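A quick sketch of the "LR with interaction terms" point, using scikit-learn's PolynomialFeatures to add pairwise interactions in front of a logistic regression; the dataset is synthetic and only there to show the mechanics.

    # Plain LR vs. LR on features expanded with pairwise interaction terms.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

    plain = LogisticRegression(max_iter=1000).fit(X, y)
    interacted = make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        LogisticRegression(max_iter=1000),
    ).fit(X, y)
    print(plain.score(X, y), interacted.score(X, y))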
NNs are universal function approximators. They can have arbitrary model capacity, and you can control that to some extent with architecture decisions, loss function/regularization choices, and early stopping, but depending on the problem they can cause more problems than they solve. Usually you don't really know whether your NN will generalize well outside of your train/test distributions, so many times it's better to have a simpler, more predictable model whose behavior you can control.

This is all from my personal experience and is completely moot when we're talking about, e.g., NLP or vision tasks, or situations where you're drowning in data. NNs are super interesting and powerful; I don't mean to suggest otherwise. The mantra is just: "what is the right solution to my problem?" There are lots of great advantages to NNs as well (you can get them to do anything with enough cajoling, and they can solve major headaches you would usually have with, e.g., kernel methods).
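One concrete handle on capacity mentioned above is early stopping; here is a small sketch with scikit-learn's MLPClassifier (the network size, regularization strength, and data are all arbitrary).

    # Early stopping: hold out a validation split and stop when it stops improving.
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 64),
                        alpha=1e-3,              # L2 regularization
                        early_stopping=True,
                        validation_fraction=0.1,
                        random_state=2)
    mlp.fit(X, y)
    print(mlp.n_iter_, mlp.score(X, y))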