For learners it's confusing to see nonlinear decision boundaries coming out of linear and logistic regression; IMO a note about the feature expansion should be added.
Good point, I've updated my post. For linear and logistic regression there's a cubic expansion of the features (which is how they can fit curved boundaries). The relevant JavaScript code is on lines 91 and 96.
PS: It can be changed to "linear" or "quadratic" as well.
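For anyone curious what that expansion actually looks like: here's a minimal sketch of my own (not the demo's exact code) that maps a 2D point to every term up to third order, which is what lets a "linear" model trace curved boundaries in the original space:

    // Sketch of cubic feature expansion for a 2D point:
    // (x, y) -> [x, y, x^2, xy, y^2, x^3, x^2*y, x*y^2, y^3]
    function expandCubic(x, y) {
      return [
        x, y,                       // linear terms
        x * x, x * y, y * y,        // quadratic terms
        x * x * x, x * x * y,       // cubic terms
        x * y * y, y * y * y
      ];
    }

A linear model fit on these nine features is still linear in its parameters, but its decision boundary in the original (x, y) plane can curve.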
It's a JavaScript library I put together a long time ago for dealing with datasets and machine learning algorithms. It was used for some of my own personal projects and was never polished for a public release (although I'm considering it now).
I'm intrigued also. Definitely some sort of machine-learning-related library, anyway. I found something related to it, but it doesn't really have any substantial information either:
Yes, k-NN is theoretically one of the best ML algorithms in the sense that it will find the closest items in the training set. For classification or finding similar-looking items it is great. However, it has pretty poor running times when evaluating unseen data (http://nlp.stanford.edu/IR-book/html/htmledition/time-comple...). This is the opposite of something like neural networks, which take a while to train but then evaluate very quickly. For real-world use the training times matter to an extent, but in a web app or real-time application the latency from k-NN is just impractical.
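To see why the latency is bad: a plain, unindexed k-NN query has to scan every stored example, so each prediction costs O(n) distance computations. A rough sketch, assuming 2D points and squared Euclidean distance:

    // Naive k-NN query: one distance computation per training
    // example, per prediction. This linear scan is the latency
    // problem mentioned above.
    function knnPredict(train, query, k) {
      var neighbors = train.map(function (ex) {
        var dx = ex.x - query.x, dy = ex.y - query.y;
        return { label: ex.label, dist: dx * dx + dy * dy };
      });
      neighbors.sort(function (a, b) { return a.dist - b.dist; });
      // Majority vote among the k closest examples.
      var votes = {};
      for (var i = 0; i < k && i < neighbors.length; i++) {
        var l = neighbors[i].label;
        votes[l] = (votes[l] || 0) + 1;
      }
      return Object.keys(votes).sort(function (a, b) {
        return votes[b] - votes[a];
      })[0];
    }

"Training" is just storing the examples, which is why training is instant but every query pays the full cost.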
That's why we developed the "Boundary Forest" algorithm: a nearest-neighbor-type algorithm with generalization at least as good as k-NN's that responds to queries very quickly.
It maintains trees of examples that let it train and answer test queries in time logarithmic in the number of stored examples, which can be much smaller than the total number of training samples. It thus keeps k-NN's very fast training time, works as an online algorithm, and can be used for regression as well as classification.
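For readers wondering what a query against one of those trees might look like: here's my own rough sketch of the greedy descent the description above suggests, not the authors' actual implementation, assuming each node stores a training point and a list of children:

    // Hypothetical sketch of a single boundary-tree query: greedily
    // descend toward whichever node is closest to the query point.
    function queryTree(node, q, dist) {
      while (true) {
        var best = node, bestD = dist(node.point, q);
        node.children.forEach(function (c) {
          var d = dist(c.point, q);
          if (d < bestD) { best = c; bestD = d; }
        });
        if (best === node) return node; // no child is closer: stop
        node = best;                    // descend and repeat
      }
    }

Training can then be similarly cheap: query the tree with a new example, and only attach the example as a new child when the returned node would predict it wrongly, so the tree stays small and queries stay fast.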
I wouldn't call it theoretically the best. It is affected by outliers and doesn't do any generalization at training time. That latter point raises the question of whether it deserves the name "learning" at all. I would say linear models typically make for a better learning algorithm; I wouldn't know what to call "the best" algorithm, but it might be deep learning nowadays.
These visualisations are great but misleading regarding the performance of these classifiers. In practice you don't have a lot of data in a small number of dimensions (2 in this case); you have a little bit of data in zillions of dimensions. Think of classifying a 100x100 pixel RGB image: that's 3x100x100 = 30,000-dimensional data. You may not even have one training sample per class per dimension. Generalizing from comparatively little data to a very high-dimensional space is the true difficulty of machine learning. Unfortunately you can't easily visualize that.
I'm a little surprised the neural network comes up with a straight line and linear regression doesn't; I thought linear regression would do that by definition (e.g. on two normal groups).
Some discussion of methods, i.e. how many hidden layers/nodes the neural network uses, would probably help make sense of it.
Looking at the code (http://jsfiddle.net/wybiral/3bdkp5c0/light/), it seems they are expanding the features to include all second- and third-order terms (options.expansion = cubic), which is why linear regression does not come up with a straight line.
It's using the X and Y locations of the dots as training data. Each algorithm is trained on (x,y)->color in an attempt to build up a rule for predicting what color an unseen (x,y) pair would be. The hypothesis it builds is then used to color the background so that you can see the decision boundary.
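In other words, the background is just the trained classifier evaluated on a grid of points. A minimal sketch of the idea (my own; model.predict and colorFor are hypothetical names, not the demo's API):

    // Evaluate the trained model at every grid cell and fill that
    // cell with the predicted class's color. The visible edge
    // between colors is the decision boundary.
    function drawBoundary(ctx, model, width, height, step) {
      for (var px = 0; px < width; px += step) {
        for (var py = 0; py < height; py += step) {
          var label = model.predict([px / width, py / height]);
          ctx.fillStyle = colorFor(label); // map label -> color
          ctx.fillRect(px, py, step, step);
        }
      }
    }

A smaller step gives a sharper boundary at the cost of more predictions per frame.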
I accidentally left k-means in there as an option, and it doesn't make much sense in the context of this example. So, yeah, it's a bit of a bug. Realistically, linear regression doesn't make much sense to include either, but it still kinda works.
Any visualization of these algorithms in 2 dimensions (with cubic feature expansion!) is completely misleading if you intend to work on a real problem with many dimensions. Also, for those asking for execution times, those would be horribly misleading as well.
The dataset is quite small and you have a fast machine. On my laptop, a 7-year-old Core 2, there's a slight delay when running some of the heavier algorithms (e.g. the neural net or SVM on the island data set).