Mining of Massive Datasets (mmds.org)
104 points by luu on Nov 3, 2014 | 16 comments



I'm taking this course right now, and I'm a little ambivalent about it. It covers various machine learning algorithms, which one can learn anywhere, but it also talks about how to deal with them in a massive-data context. The pragmatic tools needed to wrangle large amounts of data so that you can apply your usual ML algorithms are very nice to see.

That said, I don't feel like I'm learning concepts. So far, the techniques have felt like: break up the data into chunks this way, apply a bunch of hash functions that way, this is what ended up working for this particular problem. I guess if you work in the field, the tools you're exposed to will inspire things in your own work, and you'll feel more like you're building a general framework.

The homeworks are terrible. There are no mandatory programming assignments, and the one optional one does nothing to work up gradually to applying the stuff they teach you. It's just: here's a massive zip that won't fit on your hard drive, here's an uninteresting computational question to answer about it, go for it.

The remaining (basic) homeworks are quizzes, and they're incredibly tedious. (There are advanced homeworks as well, but they haven't been that inspiring.) One recent homework was really just a rehash of some high-school linear algebra; another involved doing computations with a bunch of different points. The points weren't provided in a list; they were drawn onto a JPEG, so you had to copy all of them down manually. That's the kind of course it's been.

It's a very lightweight course, which is nice if you're working. If your basic math skills are good and you already have some familiarity with ML and distributed computing, 5 hrs/wk is enough to watch the videos (at 2x speed, plus occasionally hitting the 10-second fast-forward) and do the basic homeworks.


Forgive my ignorance, but are you taking this as a MOOC? If so, what were you expecting?


A good course. There are MOOCs that are good courses. This one is so-so. I don't understand your question.


I guess I consider it harsh to judge a book based on the MOOC that uses it.


That would be harsh. Who's doing that?


We are currently using this book for a course at the Leiden Institute of Advanced Computer Science. It's pretty up to date.

It covers LSH, cosine similarity, and Jaccard similarity, as well as recommender systems applicable to the Netflix challenge and so forth.
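For anyone new to these terms: Jaccard similarity is just the size of the intersection over the size of the union of two sets. A minimal Python sketch (the `shingles` helper and the example strings are my own illustration, not from the book):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| (0.0 when both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def shingles(text: str, k: int = 3) -> set:
    """The set of k-character shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Two near-duplicate strings share most of their shingles.
s1 = shingles("mining massive datasets")
s2 = shingles("mining of massive datasets")
sim = jaccard(s1, s2)
```

The book's LSH machinery (MinHash signatures, banding) exists to estimate this quantity without comparing every pair of sets directly.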


I am a student highly interested in data mining. Do you think that book would be a good start? What prerequisites do you think it needs?


It's a good book, and the entry level is not that high. However, you probably want to have some kind of basis in maths (algebra and such), and you'll want to know some of the data-mining terminology. But it's free and open, so check it out. Also, the slides are very helpful.


There's a related course on Coursera taught by the book's authors. https://class.coursera.org/mmds-001/lecture


I guess the focus of this book is on computing rather than on the underlying math (statistics). Is the math in this book still up to date? I.e., are these the methods still used in practice?


A pretty good resource, but I'm not sure where the large-scale part is, other than Chapter 12.


Chapters 2 and 3 go through MapReduce and LSH, which are used for large datasets where comparing all-with-all is impossible. Chapter 4 goes through streams, where you take one item at a time and fit your model to it (so instead of optimizing, for example, an SVM on the whole dataset, you stream it one example after another). Chapter 9 also includes "online" recommendation algorithms, and Chapter 11 is dimensionality reduction.
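The streaming idea can be sketched as one-pass SGD: update the model on each example as it arrives, never holding the full dataset in memory. This is my own toy illustration (hinge loss, i.e. a linear SVM, on a synthetic stream), not code from the book:

```python
import random

def sgd_hinge(stream, dim, lr=0.01, lam=1e-4):
    """One pass of SGD on hinge loss: a linear SVM trained one example
    at a time. `stream` yields (x, y) pairs with y in {-1, +1}."""
    w = [0.0] * dim
    for x, y in stream:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        for i in range(dim):            # shrink weights (L2 regularization)
            w[i] *= (1 - lr * lam)
        if margin < 1:                  # hinge-loss subgradient update
            for i in range(dim):
                w[i] += lr * y * x[i]
    return w

# Toy stream: 2-D points labeled by the sign of their first coordinate.
random.seed(0)
points = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2000)]
data = [(x, 1 if x[0] > 0 else -1) for x in points]

w = sgd_hinge(iter(data), dim=2)
acc = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) == y
          for x, y in data) / len(data)
```

The point is only that the learner touches each example once and keeps O(dim) state, which is what makes the streaming setting workable at scale.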

Sidenote: a nice way to reduce dataset size for clustering is to construct coresets from the original dataset [0]; it is possible to create a coreset in parallel using MapReduce. After this, k-means will produce a very good approximation.

[0] http://las.ethz.ch/files/feldman11scalable.pdf


I'm currently taking this course on Coursera. We're only halfway through, but I think I can share a few thoughts on it.

Pros

- Faculty: Like most MOOCs, MMDS is taught by some of the best faculty in the field. I'd been an avid follower of Anand Rajaraman's blog [0] before I joined this course, and I have to say the instructors' enthusiasm is infectious and their command of the material is markedly evident.

- Difficulty: MMDS is a CS graduate-level course (CS246) at Stanford. That means the topics are not trivial, the lectures are dense, and you as a student are expected to invest significant time in understanding the material. Since this is hard, grasping the concepts and getting the quizzes right is quite gratifying. A few lectures each week are tagged as advanced, and students who view them and answer all the advanced questions get a certificate of distinction (not quite relevant, but it might provide the necessary incentive/motivation for a few students).

- Material: The syllabus and the topics covered in this course are extremely relevant for anyone aspiring to work in data mining / machine learning. Having done Andrew Ng's ML course, I find this course acts as a perfect supplement and covers a lot of practical aspects of implementing the algorithms on massive datasets. For example, a recent lecture talked about how the BFR algorithm [1] for finding clusters works better than k-means on a very large dataset.

- Book: The accompanying MMDS book is just awesome, and the lectures build on its content and examples. For someone who finds the book a bit too challenging (probably because their math is a bit rusty), the lectures make the material quite approachable.

Cons

- Theoretical: The course is primarily theoretical in both its presentation and its exercises. This is not to say that algorithms are presented without examples, but the examples are trivial and do not illustrate the issues with implementing or applying the algorithms to real-life datasets.

- Programming Assignments: In sharp contrast to Andrew Ng's course, there are no compulsory programming assignments. The exercises are all quizzes that check how well you have understood the concepts. There is just one programming assignment, and it is optional.

Overall, I'm really liking the course. The professors make a point of citing industry examples wherever relevant (the PageRank algorithm and Google's implementation of it were covered over three lectures), which is a welcome change from other CS courses. Along with the book, I believe the course is a wonderful primer on the field of data mining.
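For the curious, the core of PageRank is a simple power iteration with a teleport ("taxation") factor. A minimal sketch (the tiny three-page graph is made up for illustration; this ignores the sparse-matrix and striping tricks the course covers for web-scale graphs):

```python
def pagerank(links, beta=0.85, iters=50):
    """Power iteration for PageRank with teleport probability 1 - beta.
    `links` maps each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - beta) / n for v in nodes}   # teleport share
        for v, outs in links.items():
            if outs:                               # split rank along out-links
                share = beta * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:                                  # dead end: spread uniformly
                for u in nodes:
                    new[u] += beta * rank[v] / n
        rank = new
    return rank

# Tiny 3-page web: a -> b, a -> c, b -> c, c -> a
r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Here "c" ends up ranked above "b" because it collects links from both "a" and "b", which is the intuition the lectures build on before scaling it up.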

[0] - http://anand.typepad.com/datawocky/

[1] - http://www.dmi.unict.it/~apulvirenti/agd/BFR98.pdf


Thanks for your review. One question. You say: "Difficulty: MMDS is a CS graduate level course (CS246) from Stanford."

But the post itself says: "The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites."

Any thoughts about the discrepancy?


In my opinion, MOOCs tend to underplay prerequisites. The content in the initial classes led to a bit of a furore among the students on exactly this topic. Evidently, a lot of students found the notation and mathematics (e.g. algebra, matrices, eigenvalues, calculus) very hard to understand.

In response, one of the faculty, Jeff Ullman, categorically stated in the forums that this course is taught to graduate students at Stanford (CS 2xx is grad level) and that an undergraduate grounding in mathematics is a prerequisite. Although most of the mathematics covered in the course appears in a typical undergrad class, IMO the overall content is quite advanced and worthy of a graduate class.

That being said, the forums (and the book) are quite helpful, and if you put in enough time, you will sail through.

Hope this answers your question!


Agreed. I'm taking the course at the moment and enjoying it quite a bit, but for me it is rather too dense - I could do with it having been spread over twice as many weeks. I probably won't complete it in time, but that's no big deal.



