Map Reduce for the People (weatherby.net)
27 points by rjurney on April 9, 2009 | 8 comments



In my experience, the biggest barrier to dealing with huge data volumes isn't the cost of supercomputer time - it's the time it takes to:

(a) curate and prepare the data

(b) develop a model to analyze it and adapt that model to whatever platform I'm working with, e.g. Hadoop or MPI

(c) get the results into a presentable form.

Compared to the hours of labor involved in these three activities, supercomputer time isn't really that expensive. It can be leased if you know where to look.


> if you know where to look.

Wasn't that the point of the article? Supercomputing is now accessible to "the people" so part of the problem is close to being solved.


Yeah, no doubt cheaper cycles are a good thing. So help me out here - do you know of supercomputing (MPI or Hadoop) apps where the bottleneck was finding time on a computer? I think development cost is still the bottleneck.

In bio and geo, anyone with a grant can get access with little difficulty. I guess "people with grants" covers most of the people here.


Could you help me out here - which is easier to get started developing with, MPI or Hadoop?

It's not just about cheap processor time, it's about ease of use. Hadoop is a platform that does much of the hard part for you, giving you an enormous head start on solving problems in parallel, because all you really have to think about is the algorithm you're executing. Thinking in MapReduce is still hard, but it's a whole lot easier than thinking in MPI, and it takes a lot less time to get real computations happening on your data.
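
To make "thinking in MapReduce" concrete, here is a minimal word-count job in the Hadoop Streaming style, where the mapper and reducer are ordinary programs reading and writing tab-separated key/value lines on stdin/stdout. This is an illustrative sketch rather than anything from the thread; the file name and the map/reduce command-line switch are assumptions.

    #!/usr/bin/env python
    # wordcount.py - a minimal word count in the Hadoop Streaming style.
    # The mapper emits ("word", 1) pairs; the framework sorts them by key,
    # so the reducer sees all counts for a given word together and sums them.
    import sys
    from itertools import groupby

    def mapper(stream):
        for line in stream:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(stream):
        pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        # Run as "wordcount.py map" or "wordcount.py reduce".
        mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)

The same two functions can be tested locally with a shell pipeline (map, sort, reduce) before being handed to the Hadoop Streaming jar as the -mapper and -reducer programs, which is part of what makes the model approachable.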

Cloud computing time could be MORE expensive than supercomputer time. That doesn't matter. The point is that this stuff is now accessible to a much wider group than MPI was, and they're learning the skills to crunch gobs of data.

It's pretty clear that you're a 'smart kid' and that both MPI and MapReduce are pretty easy for you. That's not the case for most people. You're talking about hard scientists; I'm talking about normal developers.


You have a good point, but for what it's worth, here's the practical example that is the basis of my opinion on this. I had a very large corpus of text and I wanted to construct a word association graph. This involves multiplying together lots of sparse 10^5 x 10^9 matrices that are too large to fit in memory on a single machine, i.e. the computation is out of core. Getting together 6 computers and setting up Hadoop and HDFS took me about 2 days starting from nothing.

Figuring out how to do all of these out-of-core sparse matrix manipulations honestly took me about a week of tinkering before I had anything worth even trying. I'm certainly not an expert at this stuff, so maybe a better coder could have done it a lot faster, but that's my problem.
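
For context on the example above, the heart of a word-association computation can be phrased as a single MapReduce pass that counts co-occurring word pairs, which is one way to realize the sparse matrix product being described. The sketch below is illustrative only, not the commenter's code; it assumes one document per input line and simple whitespace tokenization.

    #!/usr/bin/env python
    # cooccur.py - illustrative co-occurrence counting, Hadoop Streaming style.
    # Mapper: emit a count of 1 for every distinct unordered word pair
    # appearing in the same document.
    # Reducer: sum the counts per pair, yielding the sparse word-association
    # matrix in coordinate form (word_a, word_b, count), one triple per line.
    import sys
    from itertools import combinations, groupby

    def mapper(stream):
        for line in stream:
            words = sorted(set(line.split()))        # distinct words in the document
            for a, b in combinations(words, 2):      # each unordered pair once
                print(f"{a},{b}\t1")

    def reducer(stream):
        pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            a, b = key.split(",", 1)
            print(f"{a}\t{b}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)

The two days of cluster setup buy exactly this division of labor: Hadoop splits the corpus, sorts the pairs, and reruns failed tasks, while the pairing and summing logic is all the user has to write.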

What would be a real advance in bringing massive-scale data analysis to "the people" would be something like a Hadoop-backed R or MATLAB that makes doing things like this completely automagic.
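
As a contrast, if the data did fit in memory, the whole computation above collapses to a single sparse matrix product in an environment like SciPy (or MATLAB/R). The toy snippet below only illustrates the notation that a "Hadoop R or MATLAB" would have to scale up; the two-document corpus is a stand-in.

    # In-memory version of the word-association computation, for contrast:
    # build a binary term-document matrix A (rows = words, columns = documents);
    # A @ A.T then gives word-by-word co-occurrence counts in one line.
    import numpy as np
    from scipy.sparse import csr_matrix

    docs = ["the cat sat on the mat", "the dog sat on the log"]   # toy corpus
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    rows, cols = zip(*[(index[w], j) for j, d in enumerate(docs) for w in set(d.split())])
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(vocab), len(docs)))

    C = (A @ A.T).toarray()   # co-occurrence counts; diagonal holds document frequencies
    print(vocab)
    print(C)

The wished-for tool would keep this one-line notation but plan the product as distributed map and reduce jobs behind the scenes.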


Regarding the real advance: that stuff is exactly the coming 'revolution' that I talk about :) VisiCalc kind of sucked, but Excel is pretty good. It's still very early, but the way forward has become clear.


I think this is the leading edge of the next revolution. MapReduce, Hadoop, and cloud CPU cycles are the building blocks, but we still need tools that make it easier for scientists to operate on their datasets without having to be hard-core coders. I wouldn't be surprised to see MATLAB or an offering from Wolfram take on that role.


The examples were from science because they struck me as the most powerful, but I think this will go well beyond science to many problems. It started with search, after all.



