Mapreduce Bash Script (last.fm)
47 points by danw on April 6, 2009 | 5 comments



I guess Hadoop should rethink its "user interface" to the programmer if a bash script can sometimes be handier. A lot of great code has so little taste in its interface that I simply refuse to use it. I'm not saying that's the case with Hadoop, but "the interface with the programmer" and "it should be simple to do simple things" are somewhat missing from the culture of many project leaders.


One thing I've always wondered about MapReduce is what it's actually used for. I mean, what kind of data do you put in, and what do you aim to get out?


You put in lists, and you get lists out :p

Imagine a key-value store/database. Each key is a word, and the value is a list of webpage keys; those keys in turn map to the webpage contents.

Get every value for the word "hacker", get every value for the word "news", intersect those lists (distributing the computation, or DTC for short), then fetch the webpages for that intersection. Now you have the webpages that contain both "hacker" and "news".

Key -> Value (word, webpage ids)

hacker -> page_1,page_255,page_600,page_5041

news -> page_5,page_600,page_1001,page_5041

(so, intersect == page_600,page_5041)

Key -> Value (webpage ids, contents)

page_600 -> "hacker news new threads comments leaders"

page_5041 -> "where I can find news for hackers"
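
Just to make the map/reduce part concrete, here's a rough single-machine sketch in Python of how those two tables could be built and queried. The page contents are made-up toy data (slightly adapted so a naive word split still matches "hacker"), and a real Hadoop job would run the map and reduce functions as separate distributed tasks:

    from collections import defaultdict

    # Toy crawled pages, loosely adapted from the example above (not real data).
    pages = {
        "page_600": "hacker news new threads comments leaders",
        "page_5041": "hacker news from around the web",
        "page_5": "local news and weather",
    }

    def map_phase(page_id, contents):
        # Emit one (word, page_id) pair per word on the page.
        for word in contents.lower().split():
            yield word, page_id

    def reduce_phase(word, page_ids):
        # Collect every page id seen for a word -> one posting list per word.
        return word, sorted(set(page_ids))

    # Map over every page, then group by key (this grouping is the "shuffle" step).
    grouped = defaultdict(list)
    for page_id, contents in pages.items():
        for word, pid in map_phase(page_id, contents):
            grouped[word].append(pid)

    # Reduce to get the inverted index: word -> list of page ids.
    index = dict(reduce_phase(w, ids) for w, ids in grouped.items())

    # The query "hacker news" is then just an intersection of two posting lists.
    print(set(index["hacker"]) & set(index["news"]))
    # -> {'page_600', 'page_5041'} (set display order may vary)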

Now let's sort these webpages. Take the relevancy algorithm and apply it to your list of webpages (DTC), so now you have another list. Now take the list of urls that the user has "banned" (think Google's results wiki) and remove them from the list (DTC). Now take the content from the webpages, select a snippet where the words "hacker" and "news" appear, and wrap them in bold tags (you guessed it... DTC).

The thing with so-called MapReduce is that this kind of distribution is made somewhat easier. You map your data and you reduce it, repeating as many times as you want, each time distributing the computation. I think I read somewhere in the past that a single Google query can use up to 100 machines.
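
As a loose illustration of that "map, reduce, repeat" idea, here is what the follow-on passes described above might look like in plain Python, with each step standing in for another distributed pass; relevancy(), banned, and the page data are all placeholders, not anything from a real system:

    # Output of the previous job: pages matching "hacker news" (toy data).
    matches = [
        ("page_600", "hacker news new threads comments leaders"),
        ("page_5041", "hacker news from around the web"),
    ]

    banned = {"page_5041"}            # hypothetical user-banned page ids
    query_terms = {"hacker", "news"}

    def relevancy(contents):
        # Placeholder scoring: count how many query terms appear on the page.
        return sum(term in contents.lower() for term in query_terms)

    def highlight(contents):
        # Wrap query terms in bold tags for the result snippet.
        return " ".join(f"<b>{w}</b>" if w.lower() in query_terms else w
                        for w in contents.split())

    # Each step below would be its own map (or map+reduce) pass on the cluster;
    # here they are just ordinary list operations run one after another.
    scored   = [(pid, text, relevancy(text)) for pid, text in matches]
    filtered = [row for row in scored if row[0] not in banned]
    ranked   = sorted(filtered, key=lambda row: row[2], reverse=True)
    results  = [(pid, highlight(text)) for pid, text, _ in ranked]

    for pid, snippet in results:
        print(pid, snippet)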


I suggest checking out the first video here: http://www.cloudera.com/hadoop-training-basic. It's a little more than 20 minutes long, and the lecturer talks about the problems that MapReduce and HDFS (Hadoop Distributed File System) are designed to solve.

If you're super curious about MapReduce and are down to spend a few hours learning about and playing around with Hadoop, definitely check out the rest of the videos (you can skip the ones about Hive) and work through the first two exercises -- the virtual machine they provide makes it very easy to implement and run your first MapReduce job. Doing this will answer your question better than any explanation can. If you'd like to go even further and learn how Hadoop works under the hood, buy the Rough Cuts version of Hadoop: The Definitive Guide here: http://oreilly.com/catalog/9780596521998/.


Anything that can be divided and conquered in a batch process, especially where there is a ton of distributed data, and you can do the computation on the machines where it's stored.

Building search indexes is one of the greatest uses of it. You put in (say) a ton of crawled webpages, you generate link graphs and other internal representations from that, then feed that in again to generate rankings, etc.
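
For instance, the link-graph step might be one map/reduce pass along these lines (a plain Python sketch with invented crawl data; in Hadoop the map and reduce would run as separate tasks over files in HDFS):

    from collections import defaultdict

    # Toy crawl output: page id -> list of pages it links to (invented data).
    crawled = {
        "page_1": ["page_2", "page_3"],
        "page_2": ["page_3"],
        "page_3": ["page_1"],
    }

    def map_links(page_id, outlinks):
        # Invert each edge: emit (target, source) so the reduce step can
        # gather all inbound links for every page.
        for target in outlinks:
            yield target, page_id

    def reduce_links(target, sources):
        # One record per page: who links to it, plus a crude inlink count
        # that a later ranking job could take as its input.
        return target, {"inlinks": sorted(sources), "inlink_count": len(sources)}

    grouped = defaultdict(list)
    for page_id, outlinks in crawled.items():
        for target, source in map_links(page_id, outlinks):
            grouped[target].append(source)

    link_graph = dict(reduce_links(t, s) for t, s in grouped.items())
    print(link_graph)
    # {'page_2': {'inlinks': ['page_1'], 'inlink_count': 1},
    #  'page_3': {'inlinks': ['page_1', 'page_2'], 'inlink_count': 2},
    #  'page_1': {'inlinks': ['page_3'], 'inlink_count': 1}}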



