Personally I'd advise against both SVM's and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool. Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling.
A strong basis in statistics is certainly a great thing, but that can be maximum likelihood plus Bayes' law (i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method), and that provides the big picture for almost everything.
Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
> Personally I'd advise against both SVM's and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool.
I don’t know, I think it depends on what you mean by Bayesian. I would say understanding loss functions and regularization requires some understanding of Bayesian stats (just knowing that it comes from log p(x|q) + log p(q) and what both of those terms mean).
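To spell out the decomposition mentioned above, here is a short sketch in the same notation (x for the data, q for the parameters). The symbol \hat{q}_{MAP} and the "loss"/"regularizer" labels are added here for illustration and are not from the original comment.

```latex
\begin{aligned}
p(q \mid x) &= \frac{p(x \mid q)\, p(q)}{p(x)} \\
\hat{q}_{\mathrm{MAP}}
  &= \arg\max_q \bigl[ \log p(x \mid q) + \log p(q) \bigr] \\
  &= \arg\min_q \bigl[ \underbrace{-\log p(x \mid q)}_{\text{loss}}
     \; + \; \underbrace{-\log p(q)}_{\text{regularizer}} \bigr]
\end{aligned}
```

The evidence term log p(x) does not depend on q, so it drops out of the optimization.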
> Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling
Can you be more specific here? It sounds like you’re talking about a particular problem or class of methods. PGMs/Bayesian methods can mean basically anything from logistic regression to running HMC on some hierarchical model using 10,000 CPU hours. I just mean aspiring to learn PGMs will force you to quickly learn and gain a deeper understanding of and appreciation for Bayesian stats, and then you can build on that for years and years. But it depends on what you’re interested in doing; there’s a difference between model building and inference. You can spend your whole life using the same loss function and just focus on making your NN architecture better, and you don’t need much Bayesian stats to do that.
> i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method
Huh? Maybe we mean different things by Bayesian — the mode of your posterior seems pretty Bayesian to me!
> Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
Would agree that optimization is an important part of ML/DS, but since nowadays virtually all of the most popular optimization algorithms are available at our fingertips in e.g. pytorch, I would still think it’s better to start with trying to build a fundamental understanding of how to frame problems. But that’s colored by my own experience and background; people’s priorities should be different depending on what they want to do.
To elaborate on the point: When doing probabilistic modeling, whether one realizes or not, there typically is an underlying Bayesian formulation which explains what one is doing. Now, that might be well-aligned with the problem of interest (or not), and being clear on the fundamentals helps understand that, and also to compose distinct ideas which make sense together in the context of the problem. eg: see my comments below, in the context of linear regression from a Bayesian perspective.
Also, while "scaling" with data is a very hip thing these days, for most problems of interest it is very difficult/expensive to get lots of data (or afford compute). Further, humans often have very useful domain-models which are worth encoding into the structure of the model. This also helps nicely mix together a conventional "software" modeling with probabilistic aspects (for those who weren't aware, this flows towards what is called "probabilistic programming", and recent developments have made significant progress towards methods which work for an "intermediate" dimensionality, if not "large" dimensionality).
--
@astrophysician: Feels nice to see expressed so clearly, a perspective that I share! Feel free to get in touch if you'd like to discuss ML.
Additionally, you should probably learn a (little) R (which you can get from this book). This is not because R is a wonderful language (though I'm pretty fond of it myself) but because it's a great tool for the communication and expression of statistical methods.
Most good stats books (the kind that will help you actually be good at ML) tend to be written in mathematics, in R, or in both. Given that you're already a programmer, R will probably make it easier for you to learn a bunch of this stuff (and the docs for R functions tend to point towards useful literature).
I actually travelled the other way (i.e. from stats to code) and I found R very very helpful. Of course, your mileage may vary, but the link above is probably the best single book that you could read to start learning ML.
Thank you! Providence wills that I have R studio installed to make a wordcloud, but I installed it without actually knowing what it is. (Just followed a tutorial to get my wordcloud :)
So thanks for the recommendation, looking forward to reading it!
By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
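A minimal numeric sketch of the ridge/MAP correspondence described above, assuming a Gaussian likelihood with noise std `sigma` and a zero-mean Gaussian prior with std `tau` on the weights. The data, the values of `sigma` and `tau`, and all variable names are made up for illustration; the penalty is fixed by hand rather than chosen by cross validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth
sigma, tau = 0.5, 1.0                 # assumed noise std and prior std
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2 in closed form.
# Under the assumptions above the matching penalty is lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: minimize the negative log posterior
#   ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2)
# by plain gradient descent.
def neg_log_posterior_grad(w):
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

w_map = np.zeros(d)
for _ in range(5000):
    w_map -= 1e-3 * neg_log_posterior_grad(w_map)

print(w_ridge)
print(w_map)   # agrees with w_ridge up to optimizer tolerance
```

Putting a prior on `tau` itself and integrating over it is the step that this sketch deliberately does not take.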
I agree it's an awkward definition but it does differentiate between the easy tricks and the tough approaches Bayesian researchers actually work on.
Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere. For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
> By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
Ah I see yea, if you’re talking about hyperpriors and marginalization and MCMC I totally agree; these are really valuable in science, and this is the “full Bayesian solution” but with many caveats, one big one being that it’s very easy to be overly confident in the results of some unwieldy hierarchical model and ignore the fact that the priors (or hyperpriors) are often times fudge factors because it’s so damn hard to articulate your prior belief as a bona fide distribution. A lot of times you run into a “garbage in garbage out” problem and it might not be obvious right away since we are probably busy patting ourselves on the back and popping champagne bottles because we’ve gotten our MCMC chains to converge.
> For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
Yea right — nothing wrong with cross validation in a lot of cases. Since yea we could go with the “full Bayesian solution” and marginalize over some distribution of our hyperprior, but that very well might not give you anything more than cross validation (or maybe will give you something worse if you have a bad hyperprior).
> Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere.
Agreed, my only argument is that it’s helpful to know where all of these terms come from in a general sense rather than treat each one as some ad hoc solution.
> For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
Agree and that’s exactly the sort of valuable insight that’s helpful to have (though maybe this one isn’t really an insight from Bayesian stats).
> As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
Sure, if you know MAP and what it means, then you know what a posterior means and you know that every model you fit and loss function you use comes from that, and you know if you care about more than a mode or mean of the posterior you can use lots of tools to deal with that. I’ve never really encountered MCMC in the work I do and I don’t think it would really bring any additional insight into what we’re doing, but sometimes we care about uncertainty and Laplace approximation or variational inference can do the job just fine.
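As a rough illustration of the Laplace approximation mentioned above, here is a one-parameter sketch: find the posterior mode, then use the curvature of the negative log posterior at that mode as the inverse variance of a Gaussian. The coin-flip data and the flat prior are assumptions chosen so that the exact posterior (a Beta distribution) is available for comparison.

```python
import math

heads, tails = 40, 20            # hypothetical coin-flip data
a, b = heads + 1, tails + 1      # Beta(a, b) posterior under a flat prior

# MAP estimate: the mode of the Beta(a, b) posterior.
theta_map = (a - 1) / (a + b - 2)

# Second derivative of the negative log posterior
#   -(a - 1) * log(theta) - (b - 1) * log(1 - theta)
# evaluated at the mode.
curvature = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map) ** 2

# Laplace approximation: Gaussian centered at the mode with variance 1 / curvature.
laplace_std = math.sqrt(1.0 / curvature)

# Exact posterior standard deviation of Beta(a, b), for comparison.
exact_std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

print(theta_map, laplace_std, exact_std)
```

With this much data the two standard deviations are close; with few observations or a mode near 0 or 1 the Gaussian approximation would be noticeably worse.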
> By scaling problems I basically the fact that monte carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
While I would agree MCMC and its variants are certainly not a good idea a lot of the time because they require a lot of time and computation, very rarely they are (you want insights from your data to use internally or something and you really want to explore implications of many different assumptions), and HMC etcetera can deal with very high dimensional settings (> millions of parameters). But generally I agree, if you’re talking about “scalable solutions” it usually means MCMC is not really what you want to be doing.
How does understanding loss functions and regularization require understanding Bayesian statistics? Those notions are literally part of linear regression theory.
Only if you use them as a black box. What is considered l2 norm in regression theory corresponds to a Gaussian likelihood, whose log-likelihood becomes quadratic. And one typically might not have any priors (which is subtly misleading, because the "flat" prior is highly dependent on the chosen representation for the underlying degree of freedom). Note that regularizations are just log-priors.
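For concreteness, the correspondences referred to above can be written out as follows. This is a sketch with notation introduced here, not taken from the comment: f_w is the model, sigma the noise scale, tau and b the prior scales.

```latex
\begin{aligned}
-\log \prod_i \mathcal{N}\!\left(y_i \mid f_w(x_i), \sigma^2\right)
  &= \frac{1}{2\sigma^2} \sum_i \bigl(y_i - f_w(x_i)\bigr)^2 + \text{const}
  && \text{(squared-error loss)} \\
-\log \mathcal{N}\!\left(w \mid 0, \tau^2 I\right)
  &= \frac{1}{2\tau^2} \lVert w \rVert_2^2 + \text{const}
  && \text{(L2 penalty)} \\
-\log \prod_j \mathrm{Laplace}\!\left(w_j \mid 0, b\right)
  &= \frac{1}{b} \lVert w \rVert_1 + \text{const}
  && \text{(L1 penalty)}
\end{aligned}
```

Swapping the Gaussian likelihood for a heavier-tailed one changes the first line and gives a robust loss in the same way.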
Without an understanding of the underlying Bayesian formulation, linear regression theory might look like a vast and somewhat ad-hoc collection of separate ideas (eg: robust loss functions, and the many different kinds of regularizations), but seen in the correct light, it is easy to start with a general formulation and specialize it nicely to your problem. Working that way, you can often design a good solution for your problem without searching through handbooks for possible pre-defined methods. You can also combine multiple ideas very easily. eg: A couple of weeks ago, working from first principles I rediscovered what is called "Lavrentyev regularization": https://en.wikipedia.org/wiki/Tikhonov_regularization#Lavren...
Regularization is used by mathematicians and engineers with a different theoretical perspective. A regular solution is one that is smooth or well-behaved in some sense. The motivation to choose the penalty is based on desired properties of the solution. L2 regularization leads to the least-length solution, L1 is chosen if one wants the sparsest solution, and so on. It is a bit less of a deep motivation, but then the decision of which model to choose is always somewhat ad hoc.
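A small sketch of the "least-length solution" point, assuming an underdetermined linear system with made-up numbers. It uses the pseudoinverse to get the minimum-norm solution and checks that a very small L2 penalty lands on the same answer. The sparsest (L1) case needs a different solver and is not shown.

```python
import numpy as np

# Two equations, four unknowns: infinitely many exact solutions.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
b = np.array([1.0, 2.0])

# Minimum Euclidean norm solution via the pseudoinverse.
x_min = np.linalg.pinv(A) @ b

# Another exact solution: add a component from the null space of A.
z = np.ones(4)
null_part = z - np.linalg.pinv(A) @ (A @ z)
x_other = x_min + null_part

# L2-regularized least squares with a tiny penalty, written in the
# equivalent form A^T (A A^T + lam I)^{-1} b for numerical stability.
lam = 1e-6
x_ridge = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(2), b)

print(np.linalg.norm(x_min), np.linalg.norm(x_other))
print(x_ridge)   # approaches x_min as lam goes to zero
```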
A lot of tools in ML might be used in other places with other interpretations, and the intuition for L1 and L2 that you describe is not wrong at all, but (1) ML/DS is usually done in a statistical context so I would argue that it’s a good idea to understand that formalism, and (2) that intuition doesn’t sound like it would help you build more complex statistical models, whereas understanding where L1/L2 come from in a Bayesian context definitely would help you understand what you would need to do to form a regularization term for e.g. a probability, or how to edit your loss to learn uncertainty. It also helps you understand what not to do and why not.
All of this is opinion for the most part, and if you feel there is more to learn from alternative interpretations, fine, but the suggestion is to understand the fundamentals of what you’re doing, and you’re usually doing statistics in ML/DS (whether you know it or not). Also, understanding Bayesian stats will make your life easier and it will make it easier to understand lots of other ML concepts in a unified way rather than in an ad hoc way: “minimum length solution” or “sparse solution” is what I mean by ad hoc. Both of those things are true and important, but they’re ad hoc.
I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
And one can do statistics without being a Bayesian of course.
> I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
This is the number one misunderstanding when it comes to Bayesian stats. Priors are hard, priors are often bullshit, priors are often the source of a “garbage in garbage out” problem, absolutely. I don’t mean to suggest Bayesian stats as something magical (magical thinking will get you in trouble). But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors. That’s just not true: priors are an unavoidable fact of life. If you’re not explicit about your prior, it means you’re still using one but not being upfront about it. So I would agree that priors are difficult and problematic, but they are also unavoidable, and I would not say they’re “ad hoc”. I would also say it’s important to understand what they are.
> I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
I don’t know what you mean by “deterministic”...do you mean “frequentist”? If so I would disagree completely. Frequentist and Bayesian views are equivalent except for philosophy, and frequentist stats are taught at all levels of school until grad school (at least in my experience) and I think that’s a huge mistake. What do you mean by “nice to have not a necessity”? If you’re solving a statistical question, stats are a necessity. Other fields are the nice-to-have intuition. I would agree however that sometimes you’re solving a NON-stats problem in which case have at it with whatever field makes sense.
> And one can do statistics without being a Bayesian of course.
Again, fine, I agree you can use the same math with a different philosophy, the philosophy is up to you, but if you think somehow you can do inference without priors I’m sorry but that’s wrong. In my experience “Frequentist” usually has meant Bayesian but with a flat prior (please comment if you have a counter example).
In summary: study what you want, and lots of perspectives bring more understanding, absolutely. But I stand by the importance of understanding Bayesian stats for doing ML. Even if you don’t like Bayesian stats, it’s still important to understand what is going on. Also I should be clear by “Bayesian” I mean nothing more than understanding what posteriors, priors, and likelihoods are, not a hierarchical model with MCMC or something.
> But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors.
I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematically equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
As for statistics without being a Bayesian, I did mean frequentist, though that may not cover everyone. You can even use a prior distribution that is estimated from data (people commonly do that with the naive Bayes method), whatever you want to call such a person. I wouldn't call them a Bayesian. You can simply view it as applying the chain rule of probability to get a more convenient form of your maximum likelihood equation.
> I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematical equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
Oh totally, I agree with that.
> As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
> As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
Yea I agree with this too, at least in principle. No issue with solving a lot of problems from a non statistical perspective since many times statistics is not the clear “right” choice. E.g. understanding that L1 regularization corresponds to a “Laplace prior” doesn’t give you that much deeper of an understanding of what you’re doing, since most people use L1 regularization to encourage sparsity. Also, if you’re more comfortable with a non-stats perspective on things, no problem approaching problems in the way you prefer.
Summary: I agree with everything you’ve said here. All that’s left is I think a difference of opinion about how important it is to understand the Bayesian perspective and I think that likely comes down to (1) the types of problems you typically work on, and (2) personal preference. I find personally that understanding the Bayesian interpretation is extremely helpful for building a deeper understanding of a wide variety of ML algorithms but I totally concede this is not necessarily a hard truth. So I stand by my advice, but will definitely agree that there are alternatives. I took the route of understanding ML without Bayesian stats first — really didn’t understand or know Bayesian stuff for a decent amount of time after I got into ML. I’ve found the Bayesian perspective has helped tremendously but that’s just me.
I get it, understanding the Bayesian perspective on linear regression requires understanding Bayesian statistics. But linear regression is not solely a Bayesian technique.
You’re right, but in ML/DS (and I would argue, even in other contexts you might not consider ML/DS) you usually are doing things in a Bayesian context (“what can I say about my model parameters given my observations and assumptions”). If you’re only doing OLS with maybe a L1 or L2 regularization, it’s not critical you understand the Bayesian interpretation, but it helps, and as soon as you start venturing out into other ideas and messier/biased data and want to modify something, you’ll need a Bayesian understanding of what you’re doing.
>The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning.
The lore I've heard is that most new deep learning training algorithms (optimization algorithms) only work better on particular special cases, and it is hard to do better than the established algorithms in general.
I'm also not sure why you're saying they're applicable beyond deep learning--how do you plan to train a PGM or SVM using Adam?
I'd more generally describe the area as first order optimization, including methods like acceleration, automatic differentiation, stochastic approaches. Adam is just one trick for determining a hyperparameter.
They are usable everywhere derivative-based optimization is usable. Which certainly means SVM's, though since it's a shallow method you don't need much data to train it, and hence don't need a scalable optimization method (it would just be unnecessarily slow). But you certainly could do it if you somehow needed to. Here's the first hit on google for "sgd svm": https://scikit-learn.org/stable/modules/generated/sklearn.li...
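To make the "you certainly could do it" point concrete, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the regularized hinge loss, written in plain NumPy. It does not reproduce whatever the truncated link points to; the data, learning rate `lr`, and penalty `lam` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated clusters with labels -1 and +1 (made-up data).
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])

w, b = np.zeros(2), 0.0
lr, lam = 0.01, 0.01   # assumed step size and L2 penalty

for epoch in range(20):
    for i in rng.permutation(len(y)):
        margin = y[i] * (X[i] @ w + b)
        # Subgradient of (lam / 2) * ||w||^2 + max(0, 1 - margin)
        grad_w = lam * w
        grad_b = 0.0
        if margin < 1:
            grad_w = grad_w - y[i] * X[i]
            grad_b = -y[i]
        w -= lr * grad_w
        b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
print(w, b, accuracy)
```

One sample per update is what makes this "stochastic"; swapping the update rule for a momentum or adaptive variant leaves the rest unchanged.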
The fact that you can't use first order optimization methods for graphical models is one answer to the question of why everyone doesn't use them. Though for small models there are deep networks which model them and are trained as per usual for neural networks. I think this is still an active research area.
Nice yea I would agree with the vast majority of this, only thing I would add is that Adam/gradient methods are still useful in a graphical model e.g. to get a MAP estimate (and then you can get a rough posterior estimate using variational methods or Laplace approximation once you find the MAP). But I agree I wasn’t clear about what I mean when I say graphical models since I think most people would understand graphical models to mean a full MCMC sampling of the posterior and marginalization over hyperparameters. I would say it’s useful to understand why people do that and why that is useful, but many times that is (1) overkill and (2) inspires overconfidence in the result because once we marginalize over our prior distribution people tend to forget that our prior may have been a complete fudge. I just mean graphical models as a tool for model building, understanding how different models relate to one another, and as a recipe for deriving a loss function.
Adam, RMSProp, etc. are just flavors of gradient descent so they’re useful on anything from ResNet to logistic regression. There are more flavors like natural gradient that are more useful for smaller problems since they require a curvature matrix (the Fisher information, which plays a Hessian-like role), but gradient descent is gradient descent. We use Adam in production for logistic regression, not for any particular reason really, just happens to work.
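A minimal sketch of Adam applied to plain logistic regression, to show that nothing about the update is specific to deep networks. This is a from-scratch NumPy version with the commonly used default decay rates, not the production setup described above; the data and the function name `adam_logreg` are made up for illustration.

```python
import numpy as np

def adam_logreg(X, y, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit logistic regression weights (last entry is the bias) with Adam."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    m = np.zeros_like(w)   # running mean of gradients
    v = np.zeros_like(w)   # running mean of squared gradients
    for t in range(1, steps + 1):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (p - y) / len(y)      # gradient of mean log loss
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)          # bias correction
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

rng = np.random.default_rng(0)
# Two overlapping clusters with labels 0 and 1 (made-up data).
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
               rng.normal(+1.0, 1.0, size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

w = adam_logreg(X, y)
pred = (np.hstack([X, np.ones((len(X), 1))]) @ w) > 0
accuracy = np.mean(pred == (y == 1))
print(w, accuracy)
```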
I'm not the OP, but personally I see NN's as being really really useful where the input data is unstructured (such as text or images). The deep approach (appears to) build better features than a human can, but I'm not convinced that they are _that_ much better (or indeed at all) than standard methods for tabular data.
Once upon a time, when I used to hire data people, I'd ask them to tell me about a recent data project. They'd normally mention some kind of complex model, and I'd ask them how much better it was than linear/logistic regression. A really large proportion of candidates (around 50%) couldn't answer this because they'd never compared their approach to anything simpler.
One person told me that linear regression wasn't in the top 10 Kaggle models, so they would never use it.
Oh so training time is virtually irrelevant to us and if it weren’t we would have to be a lot more careful about optimization methods and possibly which language to use. We also cannot use NN for the models we build (we are restricted to LR, but LR has as much model capacity as you need as long as you include more and more feature interaction terms).
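A toy sketch of the capacity point about interaction terms, using the XOR pattern: logistic regression on the two raw features cannot separate it, while adding the single product feature x1*x2 makes it separable. The helper names and the gradient-descent settings are assumptions for illustration.

```python
import numpy as np

def fit_logreg(F, y, steps=500, lr=1.0):
    """Logistic regression by batch gradient descent on the mean log loss."""
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w -= lr * F.T @ (p - y) / len(y)
    return w

def accuracy(F, y, w):
    return np.mean((F @ w > 0) == (y == 1))

# XOR-style data: the label is 1 exactly when the two inputs differ in sign.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
ones = np.ones((4, 1))

F_raw = np.hstack([X, ones])                        # x1, x2, bias
F_int = np.hstack([X, X[:, :1] * X[:, 1:], ones])   # x1, x2, x1*x2, bias

acc_raw = accuracy(F_raw, y, fit_logreg(F_raw, y))
acc_int = accuracy(F_int, y, fit_logreg(F_int, y))
print(acc_raw, acc_int)
```

The model is still linear in its parameters; all of the extra capacity comes from the hand-built feature.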
NN’s are universal function approximators. They can have arbitrary model capacity, and you can sort of control that with architecture decisions, loss function/regularization choices, and early stopping, but depending on the problem they can cause more problems than they solve. Usually you don’t really know if your NN will generalize well outside of your train/test distributions, so many times it’s better to have a simpler, more predictable model that you can control the behavior of. This is all from my personal experience and is completely moot when we’re talking about e.g. NLP or vision tasks or situations where you’re drowning in data. NNs are super interesting and powerful, don’t mean to suggest otherwise but the mantra is: “what is the right solution to my problem”. Lots of great advantages to NN’s as well (you can get them to do anything with enough cajoling and they can be solutions to major headaches you would usually have in e.g. kernel methods).