One thing that I discovered recently which surprised me (while taking the Udacity SDC) is how effective and resilient these "older" ML algorithms can be. Neural networks were always my go-to method for most of the classification or regression problems in my small side projects. But I've now learned that with the minimal dataset I have (<5K samples), linear regression, SVM, or decision trees are the way to go. I got higher accuracy and it's about 10X faster in terms of computational time!
Yes, SVMs are still great models. The advantage neural nets have over them is that they can do automatic feature extraction. By the time you get to the last layer of a neural net, you are basically just doing a simple logistic classification, but the features coming in have been learned by all of the previous layers.
I've even seen people use pretrained ImageNet classifiers, chop off the last layer and use an SVM as the actual classifier, and it works very well for some problems.
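If it helps, here's a minimal sketch of that trick, assuming torchvision (for an ImageNet-pretrained ResNet-18) and scikit-learn are available; X_train/X_test and y_train/y_test are stand-ins for your own preprocessed image tensors and labels:

    # Pretrained CNN as a frozen feature extractor, linear SVM as the classifier.
    import torch
    import torchvision.models as models
    from sklearn.svm import LinearSVC

    resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()   # chop off the last layer; outputs 512-d features
    resnet.eval()

    @torch.no_grad()
    def extract_features(batch):      # batch: (N, 3, 224, 224), ImageNet-normalized
        return resnet(batch).numpy()

    # X_train/X_test: preprocessed image tensors; y_train/y_test: labels (placeholders).
    clf = LinearSVC(C=1.0)
    clf.fit(extract_features(X_train), y_train)
    print("accuracy:", clf.score(extract_features(X_test), y_test))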
Sort of. They can readily do really basic feature engineering along the lines of nonlinear transformations of the input. With a bit more work, they can do spatial feature engineering (e.g., convolutional nets), and with a bit more foresight and planning they can learn the kinds of complex "hidden Markov process" style features you typically use in natural language processing.
But, as far as I'm aware, they can't necessarily do a great job with things like irregular time series (which are a huge chunk of big data), so you're still stuck doing some of that basic feature engineering. And I hesitate to characterize the fancier architectures like LSTMs as a turnkey solution for feature engineering, considering how much thought, effort, and pre-existing knowledge and theory about what the engineered features should look like had to go into designing them in the first place. So I feel like the "they can learn their own features" thing is a bit overhyped.
On the other hand, SVMs don't scale as well as neural networks do, because training has computational complexity between O(n^2) and O(n^3) [1], where n is the number of samples in the training set. So if you plan to add more data later, you may eventually run into scaling problems with an SVM.
We found great success with the LIBLINEAR SVM implementation [1], though: extremely good runtime performance, to the point that it changes what's feasible at scale, with predictive performance acceptably close to libSVM with the RBF kernel on a large cheminformatics dataset:
As can be seen in Fig. 5 [2] of the paper, a dataset size that took ~1 week with libSVM (actually, the parallel piSVM implementation) on 64 cores took less than a minute with LIBLINEAR, which runs on just one core.
With an RBF kernel, you are stuck solving the dual problem, which has O(n^2 * m) complexity. With the linear kernel, you can solve the primal problem, which scales as O(n * m).
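To make that concrete, here's a rough scikit-learn sketch (the synthetic dataset is just a stand-in, timings are illustrative): SVC with an RBF kernel goes through libSVM and the dual problem, while LinearSVC goes through LIBLINEAR and, with dual=False, solves the primal directly:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=20000, n_features=100, random_state=0)

    for name, clf in [("SVC, RBF kernel (libSVM, dual)", SVC(kernel="rbf", C=1.0)),
                      ("LinearSVC (LIBLINEAR, primal)", LinearSVC(C=1.0, dual=False))]:
        t0 = time.time()
        clf.fit(X, y)
        print(f"{name}: {time.time() - t0:.1f}s, train accuracy {clf.score(X, y):.3f}")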
Scikit-learn uses NumPy, which uses BLAS, which can be provided by nvBLAS [1]. I don't know what it takes to get that set up, or how much of a performance boost it gives for SVMs, though.
I'm curious where the idea that SVMs are "older" than neural networks comes from. The SVM Wikipedia page says they were published by Vapnik & Chervonenkis in 1963, while neural networks date back at least to Rosenblatt's work in 1958, if not earlier.
I think that SVMs were seen in the late 1990s as replacements for three-layer networks. This was because the kernel trick allowed the creation of high-dimensional decision surfaces over large (for those days) training sets by optimisation. Because of the restrictions on computing power and data collection in those days, the idea of very large neural networks was underexplored; most people believed that a very broad network was required to capture detailed learned classifiers and that it was impractical to train such classifiers. The idea of deep networks was not widely considered because it was thought that these would be infeasible to train, and they seemed (to me at least) to be, until we found out about stochastic gradient descent, initialization, transfer learning, distributed computing and GPUs.

So SVMs became very fashionable, and many people said that they were basically the end state of supervised machine learning. This made people look more at unsupervised learning, apart from some people in Canada and Scotland (and various others too!). Now people think SVMs are old because the old people that they know used to do things with SVMs. Neural networks are new because now you can do things with them that are quite unexpected.
Your history is a little mixed up. LeNet-5 had 7 layers, for example (late 1990s).
Regardless, neural nets fell out of favor because they were seen as overcomplicated: even when they were out of favor they were still competitive in accuracy with other methods, but you could get similar performance with simpler methods like SVMs.
Now, the history lesson. All this stuff feels fairly new.
It feels like it's younger than you are.
Here's the history of it.
Vapnik immigrated from the Soviet Union to the United States in about 1991.
Nobody ever heard of this stuff before he immigrated.
He actually had done this work on the basic support vector idea in his Ph.D. thesis at Moscow University in the early '60s.
But it wasn't possible for him to do anything with it, because they didn't have any computers they could try anything out with.
So he spent the next 25 years at some oncology institute in the Soviet Union doing applications.
Somebody from Bell Labs discovers him, invites him over to the United States where, subsequently, he decides to immigrate.
In 1992, or thereabouts, Vapnik submits three papers to NIPS, the Neural Information Processing Systems conference.
All of them were rejected.
He's still sore about it, but it's motivating.
So around 1992, 1993, Bell Labs was interested in hand-written character recognition and in neural nets.
Vapnik thinks that neural nets-- what would be a good word to use?
I can think of the vernacular, but he thinks that they're not very good.
So he bets a colleague a good dinner that support vector machines will eventually do better at handwriting recognition than neural nets.
And it's a dinner bet, right?
It's not that big of a deal.
But as Napoleon said, it's amazing what a soldier will do for a bit of ribbon.
So that makes the colleague, who's working on this handwritten recognition problem, decide to try a support vector machine with a kernel in which n equals 2, just slightly nonlinear, and it works like a charm.
Was this the first time anybody tried a kernel?
Vapnik actually had the idea in his thesis but never thought it was very important.
As soon as it was shown to work in the early '90s on the handwriting recognition problem, Vapnik resuscitated the idea of the kernel and began to develop it, and it became an essential part of the whole approach of using support vector machines.
So the main point about this is that it was 30 years in between the concept and anybody ever hearing about it.
It was 30 years between Vapnik's understanding of kernels and his appreciation of their importance.
And that's the way things often go, great ideas followed by long periods of nothing happening, followed by an epiphanous moment when the original idea seemed to have great power with just a little bit of a twist.
And then, the world never looks back.
And Vapnik, who nobody ever heard of until the early '90s, becomes famous for something that everybody knows about today who does machine learning.
Rosenblatt's perceptron has little to do with neural nets. Geoffrey Hinton regrets coining the name "multi-layer perceptron" precisely because they're really unrelated.
Hm, yeah, I suppose you could argue it's not really a neural network unless you involve the idea of hidden layers. And according to Wikipedia, backpropagation wasn't a thing until 1975, so you might have a point.
"It's not really a neural network" is an understatement. Perceptrons are about as related to neural nets as linear regression is (note that you can "train" linear regression with stochastic gradient descent).
If you have good features there is little advantage to a complex model.
In production ML there are still many applications for random forests, linear models, or SVMs. I prefer random forests because they require less preprocessing, are super fast to train, and make it easy to explain feature importances.
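As a rough sketch of what that looks like in practice (scikit-learn assumed, with one of its bundled datasets standing in for real data):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                        random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)                    # no scaling or encoding needed
    print("test accuracy:", rf.score(X_test, y_test))

    # Rank features by impurity-based importance.
    ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                    key=lambda t: -t[1])
    for name, imp in ranked[:5]:
        print(f"{name}: {imp:.3f}")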
>But I've now learned that with the minimal dataset I have (<5K samples), linear regression, SVM, or decision trees are the way to go.
Decision trees are prone to overfitting, and especially so on small datasets. A random forest is a good substitute that's become standard practice.
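A quick illustration of that point (scikit-learn, synthetic data): an unconstrained single tree versus a random forest, compared by cross-validation on a small, noisy dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               flip_y=0.1, random_state=0)

    for name, clf in [("single decision tree", DecisionTreeClassifier(random_state=0)),
                      ("random forest", RandomForestClassifier(n_estimators=200,
                                                               random_state=0))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean CV accuracy {scores.mean():.3f}")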