Here's an interesting series of tutorials serving as an intro to data mining / machine learning. I've read through several of them; they are very beginner friendly but still useful.
Good write-up, but IMO these kinds of posts are not useful for most people (despite there being many just like them) - the 30,000-foot view is too much information for a beginner but too basic for someone who is knowledgeable in the field. In my ten years of being fairly immersed in machine learning there was maybe a one-month period where an article like this would have been very useful, and that was a couple of years in, after I had mastered the basics of the more useful techniques (like generalized linear models and SVMs) and was ready to learn what else was out there. Honestly, learning the high-level overview was not hard at that point.
This table of contents is beginning to seem very familiar (I've done Google's internal machine-learning course and looked at several other courses/tutorials on the net), but what I'm wondering is: how do you do the more basic operations of feature and parameter selection? When I've tried using machine learning in my daily work, I usually top out at "not quite good enough", largely because the features I end up choosing have a lot of noise in their data. Meanwhile, more experienced experts (those who have been doing this for 10+ years) produce much better results in much shorter time periods because they use totally off-the-wall features that I would never have considered. When they've presented at these classes, they've said that feature selection is really the vast majority of the work in practical machine learning. How do you develop an intuition about which features will be useful? Are there any resources that are more like "A practitioner's guide to machine learning", talking about the stuff that's useful in practice and not just in theory?
> This table of contents is beginning to seem very familiar (I've done Google's internal machine-learning course and looked at several other courses/tutorials on the net), but what I'm wondering is: how do you do the more basic operations of feature and parameter selection?
I have thought about this problem for a long time. Initially, I thought it was a mere matter of finding clever Machine Learning algorithms that could auto-magically detect features for you. Others think it is a matter of throwing a bunch of subject matter experts at the problem and building feature vectors. I have worked with both philosophies of people, and I think the answer lies somewhere in the middle. And it is a fucking hard problem.
You need to do some clever algorithmic work. Maybe use PCA or your favorite dimensionality reduction trick, but also be clever enough (mathematically, that is) to realize when PCA is brittle and when not to use it. You also need enough understanding of the subject to ask the right questions of your subject matter experts. It is easy to get five people in a room and ask them to describe a song. It is much harder to find the right way of asking them: is this the kind of song you would dance to while drunk on three beers in a podunk town in the Midwest? It takes experience to build that sort of intuition.
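To make the brittleness point concrete, here is a minimal sketch of my own (the synthetic data and scikit-learn usage are assumptions, not something from this thread): PCA chases variance, not relevance, so one badly scaled noisy feature can dominate the components.

    # Minimal sketch: PCA picks directions of maximum variance, so an
    # unscaled noisy feature can swamp a meaningful one. Toy data.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    informative = rng.normal(0, 1, size=(500, 1))    # small scale, meaningful
    noise = rng.normal(0, 1000, size=(500, 1))       # huge scale, pure noise
    X = np.hstack([informative, noise])

    # Without scaling, the first principal component is ~[0, 1]:
    # PCA "discovers" the noise column.
    print(PCA(n_components=1).fit(X).components_)

    # After standardizing, the noise no longer dominates (both features
    # now have unit variance, so neither swamps the other).
    X_std = StandardScaler().fit_transform(X)
    print(PCA(n_components=1).fit(X_std).components_)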
The only thing I can recommend (pun unintended) is to keep solving these problems and focus on the domain(s) you are deeply passionate about. Don't spread yourself too thin trying to become a general-purpose Machine Learner. That is the only way to get to building beautiful end-to-end products.
Experience, as there is no canonical feature set that applies to all problems. Feature sets instead tend to be domain-specific.
The usual iteration loop for a machine-learning system is:
1) Try features.
2) Evaluate your system. If you are happy with your scores/metrics, stop.
3) Do error analysis/run diagnostics. Go to 1.
The last step is where I usually find inexperienced people lacking. You need to examine your errors and find commonalities among them.
For example, is there a certain class of error that hints at a certain feature you haven't tried? Do your existing features provide decent results to begin with? Or maybe you should just get more data? Whatever path you take, it needs to be defensible.
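A minimal sketch of that loop, under my own assumptions (the dataset and model are arbitrary stand-ins, not anything from this thread):

    # Sketch of the try-features / evaluate / error-analysis loop.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1) Try features (here: just the raw feature set as a baseline).
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    # 2) Evaluate. If the score is good enough, stop.
    pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))

    # 3) Error analysis: pull out the misclassified examples and look
    #    for commonalities that hint at a missing feature or bad data.
    errors = X_test[pred != y_test]
    print(len(errors), "errors; feature means of the error cases:")
    print(errors.mean(axis=0))
    # ...then go back to step 1 with a defensible change.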
So, in terms of general advice:
1) Realise it's an iteration loop, not a fixed recipe.
2) Be able to justify why you are trying whatever it is you are trying.
With time, you'll develop intuition for given domains.
So it's not just me! The stuff people do in Kaggle competitions is mind-blowing. They build models within models within models, create weird interaction terms from their features, and transform them with different functions, although most of that is to go from 94% accuracy to 97% accuracy...
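For a flavor of what those interaction terms and transforms look like in code, here's a toy sketch of my own (not taken from any actual Kaggle solution; the column meanings are invented):

    # Toy Kaggle-style feature engineering: pairwise interaction terms
    # plus a log transform for a skewed feature. Data is invented.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X = rng.uniform(1, 100, size=(5, 2))  # pretend columns: price, quantity

    # interaction_only=True adds products of feature pairs
    # (price * quantity) without the pure squares.
    poly = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False)
    interactions = poly.fit_transform(X)

    # A log transform to tame a skewed feature is another common trick.
    log_price = np.log1p(X[:, 0])

    features = np.column_stack([interactions, log_price])
    print(features.shape)  # (5, 4): price, quantity, price*quantity, log1p(price)

The "models within models" part is usually stacking: one model's out-of-fold predictions become input features for the next model up.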
Unless you already have a massive number of features, can't you just throw stuff at the wall until something sticks? And even then, you could run PCA on your features to reduce dimensionality.
I have thought about this problem, and I think the better approach is something far more domain-specific. Start with an application in mind, see which machine learning algorithms are being used for it, and just try to get them working. Once you have something working, notice the strengths of those algorithms for that particular domain. There is usually some nice learning theory to explain them.
Then move on to another domain and repeat. As you continue doing this, patterns will emerge across the domains. This is usually the best time to start working through a machine learning textbook.
When you take this approach you defer a chunk of the theory in favor of practical examples, but you can immediately start applying the algorithms and getting results.
There was a post on HN linking to a blog post that contained a list of free machine learning/data mining books. I'm wondering if someone can post the link to it; I am unable to find it through search.
If you can afford it (both financially and in terms of math background), Bishop is a really great choice. Almost everything you need to know is in it. I have it and just love it!
But he goes quite deep into the mathematical explanations (which is a strength; there is no better way to learn and understand), meaning you have to be willing to work on your math for this book.
It depends on your math background, but I've found Yaser S. Abu-Mostafa's Learning From Data the best, although Christopher M. Bishop's Pattern Recognition and Machine Learning is the gold standard. ;-)
I wrote a series of didactic posts actually deriving and implementing some of these algorithms, with long-term plans to get to the heftier ones like SVMs and boosting.
After learning some new CSS features (ironically, some web 3.0 gimmickry), I believe I have fixed the tablet issue. Can you view it acceptably well on a tablet now?
It seems every week now there is a new guide to Machine Learning coming out. Is it just that I have started noticing them, or is there a real resurgence of popularity for these algorithms?
Why are there so many posts about Machine Learning for beginners? All these posts basically say the same things (I'm part of the problem, as I did the same thing some time ago).
My explanation is that it's easy to write about and it feels good afterwards.
There are so many machine learning algorithms out there that it is hard to decide where to start. I know it isn't realistic to learn them one by one; no one has that much energy. Any suggestions for someone who loves this field but doesn't know much about it yet? Much appreciated.
Main website: http://guidetodatamining.com/chapter-1/
The actual PDFs: http://guidetodatamining.com/guide/ch1/DataMining-ch1.pdf
http://guidetodatamining.com/guide/ch2/DataMining-ch2.pdf
http://guidetodatamining.com/guide/ch3/DataMining-ch3.pdf
http://guidetodatamining.com/guide/ch4/DataMining-ch4.pdf
http://guidetodatamining.com/guide/ch5/DataMining-ch5.pdf
http://guidetodatamining.com/guide/ch6/DataMining-ch6.pdf
http://guidetodatamining.com/guide/ch7/DataMining-ch7.pdf