A linear regression model can be written and trained as a neural net: it has a loss function, all of that. Most if not all ML problems can be formulated as modelling a probability distribution.
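For instance, here's a minimal sketch of that equivalence (the toy data, learning rate, and step count are all made up for illustration): linear regression trained exactly like a one-neuron network, by gradient descent on a squared-error loss.

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3 * X + 2 + rng.normal(scale=0.1, size=100)

# A "network" with a single linear neuron: weight w, bias b
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    err = w * X + b - y
    # Gradient descent on the mean squared error loss
    w -= lr * 2 * np.mean(err * X)
    b -= lr * 2 * np.mean(err)

print(w, b)  # converges to roughly 3 and 2
```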
That’s too reductive—ML models are statistical models. Statistical models have parameters, and in the general case, you choose the parameters with some kind of optimization algorithm.
If you play fast and loose with your definition of “ML”, you’ll end up defining it so that any statistical model is an ML model… in which case, why even bother using two different terms?
ML models are, broadly speaking, the more complicated ones with more parameters, where the behavior of the models is not really known without training. I’m sure you could nitpick to death any definition I give, but that’s fine.
I am sure there are people who teach data science classes and look at it in that "reductive" way.
From the viewpoint of engineering, scikit-learn provides the same interface to linear regression that it supplies to many other models. Huggingface provides an interface to models that is similar in a lot of ways, but I think a 'regression' in that it doesn't provide the bare minimum of model selection facilities needed to reliably make calibrated models. There are many problems where you could use either linear regression or a much more complex model. When it comes to "not known without training", I'm not sure how much of that is the limit of what we know right now and how much is fundamental, as in "we can't really know what a computer program with unrestricted loops will do" (the halting problem) or "we can't predict what side of the Sun Pluto is going to be on in 30 million years" (deterministic chaos).
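To make the "same interface" point concrete, here's a rough sketch (the specific estimators and synthetic data are just examples): a linear model and a boosted ensemble plug into exactly the same estimator and model-selection machinery.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The same fit/predict interface and the same cross-validation machinery
# apply whether the model is linear regression or a boosted ensemble
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
```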
(The first industrial model trainer I built was a simultaneously over- and under-engineered mess, like most things in this industry... I didn't appreciate scikit-learn's model selection facilities, and even though the data scientists I worked with had a book understanding of them, they didn't really put them to work.)
There’s a pedagogical reason to teach things with a kind of reductive definition. It makes a lot of sense.
I remember getting cornered by somebody in a statistics class and interrogated about whether I thought neural networks were statistical techniques. In that situation I’ll only answer yes, they are statistical techniques. As far as I can tell, a big chunk of what we do with machine learning is create complicated models with a large number of parameters. We’re not creating something other than statistical models. We just draw a kind of cultural line between traditional statistics and machine learning techniques.
Now that I think about it, maybe if you asked me about the line between traditional statistics and machine learning, I would say that in traditional statistics, you can understand the parameters.
> in traditional statistics, you can understand the parameters.
I also think that this is the key differentiator between ML and stats.
Statistical models can be understood formally, which means that not only do we know how each parameter affects the predictions, we also know what their estimation uncertainties are, under which assumptions, and how to check that these assumptions are satisfied (there's a sketch of this below). Usually, we value these models not only because they're predictive but also because they're interpretable.
In ML there is neither the luxury nor the interest in doing this; all we want is something that predicts as well as possible.
So the difference is not the model itself but what you want to get out of it.
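As a concrete sketch of that contrast (toy data, with statsmodels standing in for the "stats" side): the interesting output of an OLS fit is the parameter table itself, with standard errors and confidence intervals attached to each coefficient, whereas a typical ML workflow only ever looks at the predictions.

```python
import numpy as np
import statsmodels.api as sm

# Toy data with known coefficients, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# The fit's output of interest is the parameter table itself:
# estimates, standard errors, t-statistics, confidence intervals
result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())
print(result.conf_int(alpha=0.05))  # 95% intervals for each coefficient
```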
I've heard professors call ML a form of applied statistics, and I think it's fair to call ML a subfield of statistics that deals with automatically generating statistical models with computers.
I'm questioning the classification of a single maximum likelihood parameter estimation via analytical optimization as ML, because it involves no methods beyond calculus and no models other than the system itself, whose parameters we are estimating.
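To spell out what I mean by "analytical optimization" (a sketch assuming a linear model with Gaussian noise): the maximum likelihood estimate falls out of calculus in closed form, with no training loop at all.

```python
import numpy as np

# Toy linear system with known parameters, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Under Gaussian noise, maximizing the likelihood is least squares, and
# setting the gradient to zero gives the normal equations in closed form
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true, with no iterative search
```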
Perhaps I'm wrong, but the power of NNs is in an unknown intermediate representation between the data (the measurements, in estimation terms) and the prediction. An EKF (extended Kalman filter) has no such black box.
Which, now that I've written it, agrees with the top-level comment, so I rescind my question.