"If homicide, suicide or resigining the job is not an option, how would you solve it?"
I would create an n-ary tree (n is arbitrary, maybe 100) out of coworkers. The 'leaves' would produce an array of ten numbers based on the blogs assigned to them. Then each node in the tree would take n arrays and sum them into one array, until I, the root, have the final array of ten numbers.
First: this is much simpler than what is described in the blog. Second: the 'grouper' in his example is a total bottleneck. (And in fact the reducers are also bottlenecks in his case, while my system scales practically to infinity.)
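A rough sketch of that tree, assuming each leaf already knows how to turn a blog into a digit from 0 to 9 (the `score` function below is a placeholder for whatever the blog's task actually computes):

```python
# Hypothetical sketch of the aggregation tree described above: leaf
# workers produce a ten-element count array from their blogs, and each
# internal node (and finally the root) sums its children's arrays.

def count_blogs(blogs, score):
    """Leaf work: score() is an assumed helper returning 0..9 per blog."""
    counts = [0] * 10
    for blog in blogs:
        counts[score(blog)] += 1
    return counts

def aggregate(child_arrays):
    """Internal node: element-wise sum of up to n child arrays."""
    total = [0] * 10
    for counts in child_arrays:
        for i, c in enumerate(counts):
            total[i] += c
    return total
```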
Before I clicked this, I knew it would be some variant on word count. Same as every other "mapreduce tutorial" written by someone who doesn't actually use it for real work...
Big counting problems naturally make a lot of sense for MR: join the like keys in the reduce and add them up, and it's beautifully parallel the whole way through. So they're the canonical example.
More complex logic happens, of course, but problems that "fit" MR tend to follow a pattern similar to counting.
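For reference, the canonical counting shape boils down to something like this (a toy, single-process sketch, not any particular framework's API):

```python
from collections import defaultdict

# Toy word count: map emits (word, 1), reduce sums the values per key.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

print(reduce_phase(map_phase(["the cat sat", "the dog sat"])))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```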
It's too bad because it could take one more step and do something slightly cooler.
I only do counting in MapReduce too, but in my case it was to run Expectation-Maximization on mixtures of Gaussians to cluster a data set of 16 million documents. Although computing the Gaussians isn't entirely trivial, all you do at the end is sum up and normalize.
It's very cool and isn't much more than counting (replace the map step). In fact the paper I took it from (and the whole Mahout-on-Hadoop effort) is a collection of machine learning algorithms that are just big summations at the reduce step.
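To illustrate why that still fits the counting pattern, here is a stripped-down sketch of one EM iteration in that shape (isotropic Gaussians with fixed variance, purely illustrative and not the Mahout code):

```python
import numpy as np

def gaussian_pdf(x, mean, var=1.0):
    """Unnormalized isotropic Gaussian density (fixed variance for brevity)."""
    d = x - mean
    return np.exp(-0.5 * np.dot(d, d) / var)

def map_step(doc_vec, means, weights):
    """Per-document: responsibilities plus weighted sufficient statistics."""
    resp = np.array([w * gaussian_pdf(doc_vec, m) for w, m in zip(weights, means)])
    resp /= resp.sum()
    return resp, resp[:, None] * doc_vec

def reduce_step(mapped):
    """Reduce: just sum the per-document statistics, then normalize."""
    resp_sum = sum(r for r, _ in mapped)
    weighted_sum = sum(wx for _, wx in mapped)
    return resp_sum / resp_sum.sum(), weighted_sum / resp_sum[:, None]
```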
As an aside, it would be cool to have several "stories" like this for simple map-reduce. I've found that sometimes it takes multiple attempts for the concept to get across.
Isn't the grouper unnecessary in this example? It seems to me the mappers could divide their text into 10 sheets themselves, since they're already determining the key, which later determines the sheet.
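Something along these lines, presumably (with `compute_key` standing in for however the example derives the key):

```python
# Sketch of mapper-side partitioning: since the mapper already computes
# each record's key, it can append the record directly to the sheet that
# key maps to, making a separate grouping pass unnecessary.
def map_and_partition(text, compute_key, num_sheets=10):
    sheets = [[] for _ in range(num_sheets)]
    for word in text.split():
        sheets[compute_key(word)].append((word, 1))
    return sheets
```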
Being a multi-threading newbie (my wife says I have a one-track mind), I actually enjoyed the article. However, it's pretty clear that this is only one member of a larger class of algorithms related to multi-processing.
Is anyone aware of a good primer on the subject?
Generally speaking, it sounds like a lot of the divide-and-conquer CS algorithms could potentially use this approach, assuming other constraints (I/O and network volume) aren't an issue...
Map - Split a problem up and distribute to workers
Reduce - Gather and aggregate/collate the results from the workers
That's it. That's the big thing to understand. The fact that everyone keeps looking for some new way to describe MapReduce, I think, adds to the mystique and to the belief that it is somehow profoundly complex.
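To underline how small the idea is, it is literally the built-in map and reduce, minus the distribution:

```python
from functools import reduce

# The whole idea in plain Python: map turns each piece of work into a
# partial result, reduce folds the partial results into one answer.
lengths = map(len, ["split", "the", "problem", "up"])
total = reduce(lambda a, b: a + b, lengths, 0)
print(total)  # 17
```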
Aside from the many typos, I don't see how this is useful. MapReduce is an absurdly simple concept. Once you litter an explanation with analogies or scenarios, you distract the reader and hinder comprehension.
Am I to think about parsing text on a blog? Bad CEOs? Google and programs? What?
I suppose, but there are better ways people can spend their time. This article is rife with misspellings and typos, which essentially shows disrespect for readers. Second, it's 1300+ words split into 6 "chapters" to say, poorly, what can be said in two sentences. I would not recommend it, especially on a site like HN (which purports to lean towards a higher standard).
In fact, as a more general rule, I would say that when an explanation resorts to contrived scenarios or analogies, it is lazy, almost always confusing, and usually an indication of the writer's own ignorance.
It's like if the UPS shipper had to deliver an elephant and twelve penguins, one of which suffered from gastroenteritis. By driving the truck using the hybrid energy recovery system, just how much conflict-gas that was fueled by the death of thousands of soldiers would a Catholic priest yield?
Not only is the algorithm described in the article complex for the simple task, but it is also incorrect. It has a single bottleneck: the grouper. It is interesting that I've been downmodded for providing a simpler but correct algorithm.
(I don't know too much about MapReduce (almost nothing), but I like designing distributed algorithms for fun.)
He did a decent job of explaining how the different parts of the process relate to each other in the real-world situation of parallelizing MR. The shuffle (grouper) actually happens per reducer in many implementations, using hash-bucketed outputs from the mappers, but that's an optimization.
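The hash-bucketing being referred to is roughly this (a one-liner sketch, not any specific framework's partitioner):

```python
# Typical partitioning trick: route every value for a given key to the
# reducer that owns hash(key) % num_reducers, so grouping happens as a
# per-reducer merge instead of one global step.
def partition(key, num_reducers):
    return hash(key) % num_reducers
```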