
You put lists, and you take lists :p

Imagine a key-value store/database. Each key is a word and its value is a list of webpage IDs; each of those IDs is in turn a key whose value is that webpage's contents.

Get every value for the word "hacker", get every value for the word "news", intersect those two lists (distributing the computation, "DTC" from here on), then fetch the webpages for the intersection. Now you have the webpages that contain both "hacker" and "news" (a toy version follows the example below).

Key -> Value (word, webpage ids)

hacker -> page_1,page_255,page_600,page_5041

news -> page_5,page_600,page_1001,page_5041

(so, intersect == page_600,page_5041)

Key -> Value (webpage ids, contents)

page_600 -> "hacker news new threads comments leaders"

page_5041 -> "where I can find news for hackers"
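Here is a minimal sketch of those two lookups in Python, with plain dicts standing in for the key-value store. word_index, pages, and search are made-up names, and the data is just the example above, not anyone's real schema:

    # word -> list of webpage ids (the inverted index)
    word_index = {
        "hacker": ["page_1", "page_255", "page_600", "page_5041"],
        "news":   ["page_5", "page_600", "page_1001", "page_5041"],
    }

    # webpage id -> contents
    pages = {
        "page_600":  "hacker news new threads comments leaders",
        "page_5041": "where I can find news for hackers",
    }

    def search(query):
        """Return contents of every page whose id appears under all query words."""
        words = query.lower().split()
        ids = set(word_index.get(words[0], []))
        for word in words[1:]:
            ids &= set(word_index.get(word, []))   # the intersection step
        return {pid: pages[pid] for pid in ids if pid in pages}

    print(search("hacker news"))
    # -> page_600 and page_5041, as above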

Now let's sort these webpages. Take the relevancy algorithm and apply it to your list of webpages (DTC), and now you have another list. Next take the list of URLs the user has "banned" (think Google's SearchWiki) and remove those from the list (DTC). Finally take the contents of the remaining webpages, select a snippet where the words "hacker" and "news" appear, and wrap them in bold tags (you guessed it... DTC).
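A rough sketch of that pipeline under the same assumptions; relevancy(), highlight(), build_results(), and banned_urls are all hypothetical, the toy score just counts query-word occurrences, and everything runs in one process where the real thing would be farmed out across machines:

    import re

    def relevancy(contents, words):
        # Toy score: total occurrences of the query words.
        return sum(contents.lower().count(w) for w in words)

    def highlight(contents, words):
        # Wrap every query-word match in bold tags.
        pattern = re.compile("|".join(map(re.escape, words)), re.IGNORECASE)
        return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", contents)

    def build_results(hits, words, banned_urls):
        # hits: page_id -> contents, from the intersection step
        ranked = sorted(hits.items(),
                        key=lambda kv: relevancy(kv[1], words),
                        reverse=True)                               # sort (DTC)
        kept = [(p, t) for p, t in ranked if p not in banned_urls]  # ban list (DTC)
        return [(p, highlight(t, words)) for p, t in kept]          # snippets (DTC)

    build_results(
        {"page_600": "hacker news new threads comments leaders",
         "page_5041": "where I can find news for hackers"},
        ["hacker", "news"],
        banned_urls={"page_1001"},
    )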

The thing with so-called MapReduce is that it makes this kind of distribution easier. You map your data, then you reduce, repeating as many times as you want, each pass distributing the computation. I think I read somewhere that a single Google query can touch up to 100 machines.
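For flavor, a toy in-process map/reduce pass that builds the word -> page-id index from the example. map_phase, reduce_phase, and map_reduce are invented names; in a real deployment the map calls, the shuffle, and the reduce groups would each be spread over many machines:

    from collections import defaultdict

    def map_phase(page_id, contents):
        # Emit one (word, page_id) pair per word on the page.
        for word in contents.lower().split():
            yield word, page_id

    def reduce_phase(word, page_ids):
        # Collapse all pairs for one word into a deduplicated posting list.
        return word, sorted(set(page_ids))

    def map_reduce(pages):
        grouped = defaultdict(list)
        for page_id, contents in pages.items():
            for word, pid in map_phase(page_id, contents):
                grouped[word].append(pid)          # the "shuffle" step
        return dict(reduce_phase(w, ids) for w, ids in grouped.items())

    index = map_reduce({
        "page_600":  "hacker news new threads comments leaders",
        "page_5041": "where I can find news for hackers",
    })
    # index["news"] == ["page_5041", "page_600"]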



