
While learning Hadoop is a good idea, if you're bored in grad school, you're doing it wrong.


Agreed 100%. Grad school is a time to explore all kinds of CS fields (or various topics within a field you're interested in). Right now I wish I had 28-hour days to mess around with GPU computing, compiler design, language theory, and random machine learning projects.

EDIT: it's also worth pointing out that Hadoop is not the be-all and end-all of large-scale data analysis. Many problems do not fit the map-reduce paradigm and are better suited to other frameworks (I'm looking into GraphLab once I get some free time, for example).
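For anyone who hasn't written one: the map-reduce paradigm boils down to two functions, a mapper that emits (key, value) pairs and a reducer that folds all values for one key. Here's a minimal in-memory word-count sketch in plain Java (no Hadoop on the classpath; class and method names are my own) just to show the shape of the paradigm:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map phase: turn one input record (a line) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: combine all values collected for one key.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: map every record, shuffle (group by key), then reduce each group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        shuffled.forEach((k, vs) -> counts.put(k, reduce(k, vs)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

The limits the parent mentions fall out of this shape: everything must decompose into independent per-record maps followed by per-key reduces, so iterative or graph-structured computations (the kind GraphLab targets) end up awkwardly chained across many full map-reduce passes.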


Hahaha, I hear you. "Me too"s are frowned upon here, but I could not resist. There are so many things I want to learn that I wish sleep were optional.

I know Hadoop is the poster child of all things good, but its API really puts me to sleep. Not a big fan. Add to that the fact that Google's implementation is (or at least used to be) 4-6 times faster for similarly sized processing jobs and more fun to code in (the latter may be entirely subjective). The funny part is that Google's clusters back then were made of weaker machines! I don't know how it is now.

EDIT: It is indeed in C++, but that alone cannot fully explain the discrepancy. Java can be slower, but it shouldn't be that much slower. I am sure caching, memory footprint, and, as you said, data access latencies and overall design play a big role. If I remember correctly, UIUC has an open-source MapReduce framework written in C++, and they claim a similar speedup over Hadoop.


The Hadoop API is a pain. Well, it is fine when doing vanilla map-reduce, but once you step away from that, things get ugly real fast with the zoo of Jobs, Tasks, Contexts, the two mapred/mapreduce APIs that do the same thing, plus Hadoop 0.23-specific calls. I haven't touched Hadoop 1.0 yet, though I doubt they have cleaned up all the mess. Hope they streamline the API by 2.0.


I haven't used Google's software. Hadoop is in Java; if Google's implementation is in C or C++, that could explain some of the performance gap.

4-6 times is crazy, though. Are you sure it wasn't just data access latencies?


You're right, but I imagine most grad students will inevitably go through a "doing it wrong" phase. Mine lasted close to a year.



