One thing that I discovered recently which surprised me (while taking the Udacity SDC) is how effective and resilient these "older" ML algorithms can be. Neural networks were always my go-to method for most of the classification or regression problems in my small side projects. But I've now learned that with the minimal dataset I have (<5K samples), linear regression, SVM, or decision trees are the way to go. I got higher accuracy and it's about 10X faster in terms of computational time!
Yes, SVMs are still great models. The advantage neural nets have over them is that they can do automatic feature extraction. By the time you get to the last layer of a neural net, you are basically just doing a simple logistic classification, but the features coming in have been learned by all of the previous layers.
I've even seen people use pretrained ImageNet classifiers, chop off the last layer and use an SVM as the actual classifier, and it works very well for some problems.
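If it helps, here's a minimal sketch of that trick, assuming torchvision (for an ImageNet-pretrained ResNet-18) and scikit-learn are available; X_train/X_test and y_train/y_test are stand-ins for your own preprocessed image tensors and labels:

    # Pretrained CNN as a frozen feature extractor, linear SVM as the classifier.
    import torch
    import torchvision.models as models
    from sklearn.svm import LinearSVC

    resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()   # chop off the last layer; outputs 512-d features
    resnet.eval()

    @torch.no_grad()
    def extract_features(batch):      # batch: (N, 3, 224, 224), ImageNet-normalized
        return resnet(batch).numpy()

    # X_train/X_test: preprocessed image tensors; y_train/y_test: labels (placeholders).
    clf = LinearSVC(C=1.0)
    clf.fit(extract_features(X_train), y_train)
    print("accuracy:", clf.score(extract_features(X_test), y_test))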
Sort of. They can readily do really basic feature engineering along the lines of nonlinear transformations of the input. With a bit more work, they can do spatial feature engineering (e.g., convolutional nets), and with a bit more foresight and planning they can learn the kinds of complex "hidden Markov process" style features you typically use in natural language processing.
But, as far as I'm aware, they can't necessarily do a great job with things like irregular time series (which are a huge chunk of big data), so you're still stuck doing some of that basic feature engineering. And I hesitate to characterize the fancier architectures like LSTMs as a turnkey solution for feature engineering, considering how much thought, effort, and pre-existing knowledge and theory about what the engineered features should look like had to go into designing them in the first place. So I feel like the "they can learn their own features" thing is a bit overhyped.
On the other hand, SVMs don't scale as well as neural networks do, because training has computational complexity between O(n^2) and O(n^3) [1], where n is the number of samples in the training set. So if you plan to add more data later, you may eventually run into scaling problems with an SVM.
We found great success with the LIBLINEAR SVM implementation [1], though: extremely good runtime performance, to the point that it changes what's feasible at scale, with predictive performance acceptably close to libSVM with the RBF kernel on a large cheminformatics dataset:
As can be seen in Fig. 5 [2] of the paper, a dataset size that took ~1 week with libSVM (actually, the parallel piSVM implementation) on 64 cores took less than a minute with LIBLINEAR, which runs on just one core.
With an RBF kernel, you are stuck solving the dual problem, which has O(n^2 * m) complexity. With the linear kernel, you can solve the primal problem, which scales as O(n * m).
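To make that concrete, here's a rough scikit-learn sketch (the synthetic dataset is just a stand-in, timings are illustrative): SVC with an RBF kernel goes through libSVM and the dual problem, while LinearSVC goes through LIBLINEAR and, with dual=False, solves the primal directly:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=20000, n_features=100, random_state=0)

    for name, clf in [("SVC, RBF kernel (libSVM, dual)", SVC(kernel="rbf", C=1.0)),
                      ("LinearSVC (LIBLINEAR, primal)", LinearSVC(C=1.0, dual=False))]:
        t0 = time.time()
        clf.fit(X, y)
        print(f"{name}: {time.time() - t0:.1f}s, train accuracy {clf.score(X, y):.3f}")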
Scikit-learn uses NumPy, which uses BLAS, which can be provided by nvBLAS [1]. I don't know what it takes to get that set up, or how much of a performance boost it gives for SVMs, though.
I'm curious where the idea that SVMs are "older" than neural networks comes from. The SVM Wikipedia page says they were published by Vapnik & Chervonenkis in 1963, while neural networks date back at least to Rosenblatt's work in 1958, if not earlier.
I think that SVMs were seen in the late 1990s as replacements for three-layer networks. This was because the kernel trick allowed the creation of high-dimensional decision surfaces over large (for those days) training sets by optimisation. Because of the restrictions on computing power and data collection in those days, the idea of very large neural networks was underexplored; most people believed that a very broad network was required to capture detailed learned classifiers and that it was impractical to train such classifiers. The idea of deep networks was not widely considered because it was thought that these would be infeasible to train, and they seemed (to me at least) to be, until we found out about stochastic gradient descent, initialization, transfer learning, distributed computing and GPUs.

So SVMs became very fashionable, and many people said that they were basically the end state of supervised machine learning. This made people look more at unsupervised learning, apart from some people in Canada and Scotland (and various others too!). Now people think SVMs are old because the old people that they know used to do things with SVMs. Neural networks are new because now you can do things with them that are quite unexpected.
Your history is a little mixed up. LeNet-5 had 7 layers, for example (late 1990s).
Regardless, neural nets fell out of favor because they were seen as overcomplicated: even when they were out of favor they were still competitive in accuracy with other methods, but you could get similar performance with simpler methods like SVMs.
Now, the history lesson. All this stuff feels fairly new.
It feels like it's younger than you are.
Here's the history of it.
Vapnik immigrated from the Soviet Union to the United States in about 1991.
Nobody ever heard of this stuff before he immigrated.
He actually had done this work on the basic support vector idea in his Ph.D. thesis at Moscow University in the early '60s.
But it wasn't possible for him to do anything with it, because they didn't have any computers they could try anything out with.
So he spent the next 25 years at some oncology institute in the Soviet Union doing applications.
Somebody from Bell Labs discovers him, invites him over to the United States where, subsequently, he decides to immigrate.
In 1992, or thereabouts, Vapnik submits three papers to NIPS, the Neural Information Processing Systems conference.
All of them were rejected.
He's still sore about it, but it's motivating.
So around 1992, 1993, Bell Labs was interested in hand-written character recognition and in neural nets.
Vapnik thinks that neural nets-- what would be a good word to use?
I can think of the vernacular, but he thinks that they're not very good.
So he bets a colleague a good dinner that support vector machines will eventually do better at handwriting recognition than neural nets.
And it's a dinner bet, right?
It's not that big of a deal.
But as Napoleon said, it's amazing what a soldier will do for a bit of ribbon.
So that makes the colleague, who's working on this handwritten recognition problem, decide to try a support vector machine with a kernel in which n equals 2, just slightly nonlinear, and it works like a charm.
Was this the first time anybody tried a kernel?
Vapnik actually had the idea in his thesis but never thought it was very important.
As soon as it was shown to work in the early '90s on the handwriting recognition problem, Vapnik resuscitated the idea of the kernel and began to develop it, and it became an essential part of the whole approach of using support vector machines.
So the main point about this is that it was 30 years in between the concept and anybody ever hearing about it.
It was 30 years between Vapnik's understanding of kernels and his appreciation of their importance.
And that's the way things often go, great ideas followed by long periods of nothing happening, followed by an epiphanous moment when the original idea seemed to have great power with just a little bit of a twist.
And then, the world never looks back.
And Vapnik, who nobody ever heard of until the early '90s, becomes famous for something that everybody knows about today who does machine learning.
Rosenblatt's perceptron has little to do with neural nets. Geoffrey Hinton regrets coining the name "multi-layer perceptron" precisely because they're really unrelated.
Hm, yeah, I suppose you could argue it's not really a neural network unless you involve the idea of hidden layers. And according to Wikipedia, backpropagation wasn't a thing until 1975, so you might have a point.
"It's not really a neural network" is an understatement. Perceptrons are about as related to neural nets as linear regression is (note that you can "train" linear regression with stochastic gradient descent).
If you have good features there is little advantage to a complex model.
In production ML there are still many applications for random forests, linear models, or SVMs. I prefer random forests because they require less preprocessing, are super fast to train, and make it easy to explain feature importances.
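As a rough sketch of what that looks like in practice (scikit-learn assumed, with one of its bundled datasets standing in for real data):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                        random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)                    # no scaling or encoding needed
    print("test accuracy:", rf.score(X_test, y_test))

    # Rank features by impurity-based importance.
    ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                    key=lambda t: -t[1])
    for name, imp in ranked[:5]:
        print(f"{name}: {imp:.3f}")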
>But I've now learned that with the minimal dataset I have (<5K samples), linear regression, SVM, or decision trees are the way to go.
Decision trees are prone to overfitting, and especially so on small datasets. A random forest is a good substitute that's become standard practice.
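A quick illustration of that point (scikit-learn, synthetic data): an unconstrained single tree versus a random forest, compared by cross-validation on a small, noisy dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               flip_y=0.1, random_state=0)

    for name, clf in [("single decision tree", DecisionTreeClassifier(random_state=0)),
                      ("random forest", RandomForestClassifier(n_estimators=200,
                                                               random_state=0))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean CV accuracy {scores.mean():.3f}")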