For learners it's confusing to see nonlinear decision boundaries coming out of linear and logistic regression; IMO a note about the feature expansion should be added.
Good point, I've updated my post. For linear and logistic regression there's a cubic expansion of the features (which is how they can fit curved boundaries). The relevant JavaScript code is on lines 91 and 96.
PS: It can be changed to "linear" or "quadratic" as well.
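For anyone curious what that expansion actually looks like: here's a minimal sketch of my own (not the demo's exact code) that maps a 2D point to every term up to third order, which is what lets a "linear" model trace curved boundaries in the original space:

    // Sketch of cubic feature expansion for a 2D point:
    // (x, y) -> [x, y, x^2, xy, y^2, x^3, x^2*y, x*y^2, y^3]
    function expandCubic(x, y) {
      return [
        x, y,                       // linear terms
        x * x, x * y, y * y,        // quadratic terms
        x * x * x, x * x * y,       // cubic terms
        x * y * y, y * y * y
      ];
    }

A linear model fit on these nine features is still linear in its parameters, but its decision boundary in the original (x, y) plane can curve.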
It's a JavaScript library I put together a long time ago for dealing with datasets and machine learning algorithms. It was used for some of my own personal projects and was never polished for a public release (although I'm considering it now).
I'm intrigued also. Definitely some sort of machine-learning-related library, anyway. I found something related to it, but it doesn't really have any substantial information either:
Yes, k-NN is theoretically one of the best ML algorithms in the sense that it will find the closest items in the training set. For classification or finding similar-looking items it is great. However, it has pretty poor running times when evaluating unseen data (http://nlp.stanford.edu/IR-book/html/htmledition/time-comple...). This is the opposite of something like neural networks, which take a while to train but then evaluate very quickly. For real-world use the training times matter to an extent, but in a web app or real-time application the latency from k-NN is just impractical.
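To see why the latency is bad: a plain, unindexed k-NN query has to scan every stored example, so each prediction costs O(n) distance computations. A rough sketch, assuming 2D points and squared Euclidean distance:

    // Naive k-NN query: one distance computation per training
    // example, per prediction. This linear scan is the latency
    // problem mentioned above.
    function knnPredict(train, query, k) {
      var neighbors = train.map(function (ex) {
        var dx = ex.x - query.x, dy = ex.y - query.y;
        return { label: ex.label, dist: dx * dx + dy * dy };
      });
      neighbors.sort(function (a, b) { return a.dist - b.dist; });
      // Majority vote among the k closest examples.
      var votes = {};
      for (var i = 0; i < k && i < neighbors.length; i++) {
        var l = neighbors[i].label;
        votes[l] = (votes[l] || 0) + 1;
      }
      return Object.keys(votes).sort(function (a, b) {
        return votes[b] - votes[a];
      })[0];
    }

"Training" is just storing the examples, which is why training is instant but every query pays the full cost.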
That's why we developed the "Boundary Forest" algorithm: a nearest-neighbor-type algorithm with generalization at least as good as k-NN's that responds to queries very quickly.
It maintains trees of examples that let it train and answer test queries in time logarithmic in the number of stored examples, which can be much smaller than the total number of training samples. It thus keeps k-NN's very fast training time, works as an online algorithm, and can be used for regression as well as classification.
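For readers wondering what a query against one of those trees might look like: here's my own rough sketch of the greedy descent the description above suggests, not the authors' actual implementation, assuming each node stores a training point and a list of children:

    // Hypothetical sketch of a single boundary-tree query: greedily
    // descend toward whichever node is closest to the query point.
    function queryTree(node, q, dist) {
      while (true) {
        var best = node, bestD = dist(node.point, q);
        node.children.forEach(function (c) {
          var d = dist(c.point, q);
          if (d < bestD) { best = c; bestD = d; }
        });
        if (best === node) return node; // no child is closer: stop
        node = best;                    // descend and repeat
      }
    }

Training can then be similarly cheap: query the tree with a new example, and only attach the example as a new child when the returned node would predict it wrongly, so the tree stays small and queries stay fast.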
I wouldn't call it theoretically the best. It is affected by outliers and doesn't do any generalization at training time. That latter point raises the question of whether it deserves the name "learning" at all. I would say linear models typically make for a better learning algorithm; I wouldn't know what to call "the best" algorithm, but it might be deep learning nowadays.
These visualisations are great but misleading regarding the performance of these classifiers. In practice you don't have a lot of data in a small number of dimensions (2 in this case); you have a little bit of data in zillions of dimensions. Think of classifying a 100x100 pixel RGB image: that's 3x100x100 = 30,000-dimensional data. You may not even have one training sample per class per dimension. Generalizing from comparatively little data to a very high-dimensional space is the true difficulty of machine learning. Unfortunately you can't easily visualize that.
I'm a little surprised the neural network comes up with a straight line and linear regression doesn't; I thought linear regression would do that by definition (e.g. on two normal groups).
Some discussion of methods, i.e. how many hidden layers/nodes the neural network uses, would probably help make sense of it.
Looking at the code (http://jsfiddle.net/wybiral/3bdkp5c0/light/), it seems they are expanding the features to include all second- and third-order terms (options.expansion = cubic), which is why linear regression does not come up with a straight line.
It's using the X and Y locations of the dots as training data. Each algorithm is trained on (x,y)->color in an attempt to build up a rule for predicting what color an unseen (x,y) pair would be. The hypothesis it builds is then used to color the background so that you can see the decision boundary.
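In other words, the background is just the trained classifier evaluated on a grid of points. A minimal sketch of the idea (my own; model.predict and colorFor are hypothetical names, not the demo's API):

    // Evaluate the trained model at every grid cell and fill that
    // cell with the predicted class's color. The visible edge
    // between colors is the decision boundary.
    function drawBoundary(ctx, model, width, height, step) {
      for (var px = 0; px < width; px += step) {
        for (var py = 0; py < height; py += step) {
          var label = model.predict([px / width, py / height]);
          ctx.fillStyle = colorFor(label); // map label -> color
          ctx.fillRect(px, py, step, step);
        }
      }
    }

A smaller step gives a sharper boundary at the cost of more predictions per frame.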
I accidentally left k-means in there as an option, and it doesn't make much sense in the context of this example. So, yeah, it's a bit of a bug. Realistically, linear regression doesn't make much sense to include either, but it still kinda works.
Any visualization of these algorithms in 2 dimensions (with cubic feature expansion!) is completely misleading if you intend to work on a real problem with many dimensions. Also, for those asking for execution times, those would be horribly misleading as well.
The dataset is quite small and you have a fast machine. On my laptop, a 7-year-old Core 2, there's a slight delay when running some of the heavier algorithms (e.g. the neural net or SVM on the island data set).