Here's an interesting series of tutorials serving as an intro to data mining / machine learning. I've read through several of them; they are very beginner friendly but still useful.
Good write-up, but IMO these kinds of posts are not useful for most people (despite there being many just like them) - the 30,000-foot view is too much information for a beginner but too basic for someone who is knowledgeable in the field. In my ten years of being fairly immersed in machine learning there was maybe a one-month period where an article like this would have been very useful, and that was a couple of years in, after I had mastered the basics of the more useful techniques (like generalized linear models and SVMs) and was ready to learn what else was out there. Honestly, learning the high-level overview was not hard at that point.
This table of contents is beginning to seem very familiar (I've done Google's internal machine-learning course and looked at several other courses/tutorials on the net), but what I'm wondering is: how do you do the more basic operations of feature and parameter selection? When I've tried using machine learning in my daily work, I usually top out at "not quite good enough", largely because the features I end up choosing have a lot of noise in their data. Meanwhile, more experienced experts (those who have been doing this for 10+ years) produce much better results in much shorter time periods because they use totally off-the-wall features that I would never have considered. When they've presented at these classes, they've said that feature selection is really the vast majority of the work in practical machine learning. How do you develop an intuition about which features will be useful? Are there any resources that are more like "A practitioner's guide to machine learning", talking about the stuff that's useful in practice and not just in theory?
> This table of contents is beginning to seem very familiar (I've done Google's internal machine-learning course and looked at several other courses/tutorials on the net), but what I'm wondering is: how do you do the more basic operations of feature and parameter selection?
I have thought about this problem for a long time. Initially, I thought it was a mere matter of finding clever Machine Learning algorithms that could auto-magically detect features for you. Others think it is a matter of throwing a bunch of subject matter experts at the problem and building feature vectors. I have worked with both philosophies of people, and I think the answer lies somewhere in the middle. And it is a fucking hard problem.
You need to do some clever algorithmic work. Maybe use PCA or your favorite dimensionality reduction trick, but also be clever enough (mathematically, that is) to realize when PCA is brittle and when not to use it. You also need enough understanding of the subject to ask the right questions of your subject matter experts. It is easy to get five people in a room and ask them to describe a song. It is much harder to find the right way of asking them: is this the kind of song you would dance to while drunk on three beers in a podunk town in the Midwest? It takes experience to build that sort of intuition.
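To make the brittleness point concrete, here is a minimal sketch of my own (the synthetic data and scikit-learn usage are assumptions, not something from this thread): PCA chases variance, not relevance, so one badly scaled noisy feature can dominate the components.

    # Minimal sketch: PCA picks directions of maximum variance, so an
    # unscaled noisy feature can swamp a meaningful one. Toy data.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    informative = rng.normal(0, 1, size=(500, 1))    # small scale, meaningful
    noise = rng.normal(0, 1000, size=(500, 1))       # huge scale, pure noise
    X = np.hstack([informative, noise])

    # Without scaling, the first principal component is ~[0, 1]:
    # PCA "discovers" the noise column.
    print(PCA(n_components=1).fit(X).components_)

    # After standardizing, the noise no longer dominates (both features
    # now have unit variance, so neither swamps the other).
    X_std = StandardScaler().fit_transform(X)
    print(PCA(n_components=1).fit(X_std).components_)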
The only thing I can recommend (pun unintended) is to keep solving these problems and focus on the domain(s) you are deeply passionate about. Don't spread yourself too thin trying to become a general-purpose Machine Learner. That is the only way to get to building beautiful end-to-end products.
Experience, as there is no canonical feature set that applies to all problems. Feature sets instead tend to be domain-specific.
The usual iteration loop for a machine-learning system is:
1) Try features.
2) Evaluate your system. If you are happy with your scores/metrics, stop.
3) Do error analysis/run diagnostics. Go to 1.
The last step is where I usually find inexperienced people lacking. You need to examine your errors and find commonalities among them.
For example, is there a certain class of error that hints at a certain feature you haven't tried? Do your existing features provide decent results to begin with? Or maybe you should just get more data? Whatever path you take, it needs to be defensible.
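A minimal sketch of that loop, under my own assumptions (the dataset and model are arbitrary stand-ins, not anything from this thread):

    # Sketch of the try-features / evaluate / error-analysis loop.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1) Try features (here: just the raw feature set as a baseline).
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    # 2) Evaluate. If the score is good enough, stop.
    pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))

    # 3) Error analysis: pull out the misclassified examples and look
    #    for commonalities that hint at a missing feature or bad data.
    errors = X_test[pred != y_test]
    print(len(errors), "errors; feature means of the error cases:")
    print(errors.mean(axis=0))
    # ...then go back to step 1 with a defensible change.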
So, in terms of general advice:
1) Realise it's an iteration loop, not a fixed recipe.
2) Be able to justify why you are trying whatever it is you are trying.
With time, you'll develop intuition for given domains.
So it's not just me! The stuff people do in Kaggle competitions is mind-blowing. They build models within models within models, create weird interaction terms from their features, and transform them with different functions, although most of that is to go from 94% accuracy to 97% accuracy...
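For a flavor of what those interaction terms and transforms look like in code, here's a toy sketch of my own (not taken from any actual Kaggle solution; the column meanings are invented):

    # Toy Kaggle-style feature engineering: pairwise interaction terms
    # plus a log transform for a skewed feature. Data is invented.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X = rng.uniform(1, 100, size=(5, 2))  # pretend columns: price, quantity

    # interaction_only=True adds products of feature pairs
    # (price * quantity) without the pure squares.
    poly = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False)
    interactions = poly.fit_transform(X)

    # A log transform to tame a skewed feature is another common trick.
    log_price = np.log1p(X[:, 0])

    features = np.column_stack([interactions, log_price])
    print(features.shape)  # (5, 4): price, quantity, price*quantity, log1p(price)

The "models within models" part is usually stacking: one model's out-of-fold predictions become input features for the next model up.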
Unless you already have a massive number of features, can't you just throw stuff at the wall until something sticks? And even then, you could run PCA on your features to reduce dimensionality.
I have thought about this problem, and I think the better approach is something far more domain-specific. Start with an application in mind, see which machine learning algorithms are being used for it, and just try to get them working. Once you have something working, notice the strengths of those algorithms for that particular domain. There is usually some nice learning theory to explain them.
Then move on to another domain and repeat. As you continue doing this, patterns will emerge across the domains. This is usually the best time to start working through a machine learning textbook.
When you take this approach you defer a chunk of the theory in favor of practical examples, but you can immediately start applying the algorithms and getting results.
There was a post on HN linking to a blog post that contained a list of free machine learning/data mining books. I'm wondering if someone can post the link to it; I am unable to find it through search.
If you can afford it (both financially and in terms of math background), Bishop is a really great choice. Almost everything you need to know is in it. I have it and just love it!
But he goes quite deep into the mathematical explanations (which is a strength; there is no better way to learn and understand), meaning you have to be willing to work on your math for this book.
It depends on your math background, but I've found Yaser S. Abu-Mostafa's Learning From Data the best, although Christopher M. Bishop's Pattern Recognition and Machine Learning is the gold standard. ;-)
I wrote a series of didactic posts actually deriving and implementing some of these algorithms, with long-term plans to get to the heftier ones like SVMs and boosting.
After learning some new CSS features (ironically, some web 3.0 gimmickry), I believe I have fixed the tablet issue. Can you view it acceptably well on a tablet now?
It seems every week now there is a new guide to Machine Learning coming out. Is it just that I have started noticing them, or is there a real resurgence of popularity for these algorithms?
Why are there so many posts about Machine Learning for beginners? All these posts basically say the same things (I'm part of the problem, as I did the same thing some time ago).
My explanation is that it's easy to write about and it feels good afterwards.
There are so many machine learning algorithms out there that it is hard to decide where to start. I know it isn't realistic to learn them one by one; no one has that much energy. Any suggestions for someone who loves this field but doesn't know much about it yet? Much appreciated.
Main website: http://guidetodatamining.com/chapter-1/
The actual PDFs: http://guidetodatamining.com/guide/ch1/DataMining-ch1.pdf
http://guidetodatamining.com/guide/ch2/DataMining-ch2.pdf
http://guidetodatamining.com/guide/ch3/DataMining-ch3.pdf
http://guidetodatamining.com/guide/ch4/DataMining-ch4.pdf
http://guidetodatamining.com/guide/ch5/DataMining-ch5.pdf
http://guidetodatamining.com/guide/ch6/DataMining-ch6.pdf
http://guidetodatamining.com/guide/ch7/DataMining-ch7.pdf