Imagine a key-value store/database. Each key is a word, and its value is a list of webpage IDs; those IDs are in turn keys in a second store whose values are the webpage contents.
Get every value for the word "hacker", get every value for the word "news", intersect the two lists (distributing the computation, or DTC for short), then fetch the webpages for that intersection. Now you have the webpages that contain both "hacker" and "news" (there's a little sketch of this right after the example data below).
Key -> Value (word, webpage ids)
hacker -> page_1,page_255,page_600,page_5041
news -> page_5,page_600,page_1001,page_5041
(so the intersection == page_600, page_5041)
Key -> Value (webpage ids, contents)
page_600 -> "hacker news new threads comments leaders"
page_5041 -> "where I can find news for hackers"
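Here's a minimal sketch of those two lookups in Python. The names (word_index, pages, search) are made up for illustration, and the in-memory dicts stand in for what would really be a distributed store:

    # The two key-value stores from the example above, as plain dicts.
    word_index = {
        "hacker": ["page_1", "page_255", "page_600", "page_5041"],
        "news":   ["page_5", "page_600", "page_1001", "page_5041"],
    }
    pages = {
        "page_600":  "hacker news new threads comments leaders",
        "page_5041": "where I can find news for hackers",
    }

    def search(*words):
        # Intersect the posting lists; in a real system each lookup (and
        # the intersection itself) could run on different machines (DTC).
        ids = set(word_index.get(words[0], []))
        for word in words[1:]:
            ids &= set(word_index.get(word, []))
        # Fetch the contents for the surviving page ids.
        return {page_id: pages[page_id] for page_id in sorted(ids)}

    print(search("hacker", "news"))
    # {'page_5041': 'where I can find news for hackers',
    #  'page_600': 'hacker news new threads comments leaders'}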
Now let's sort these webpages. Take your relevancy algorithm and apply it to the list of webpages (DTC), which gives you a ranked list. Next take the list of URLs the user has "banned" (think Google's SearchWiki results) and remove them from the list (DTC). Finally take the contents of each webpage, select a snippet where the words "hacker" and "news" appear, and wrap them in bold tags (you guessed it... DTC).
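Rough Python for those three steps, with a toy scoring function standing in for the real relevancy algorithm (everything here is illustrative, not how Google actually does it):

    import re

    pages = {
        "page_600":  "hacker news new threads comments leaders",
        "page_5041": "where I can find news for hackers",
    }

    def toy_score(page_id):
        # Placeholder for a real relevancy algorithm.
        return len(pages[page_id])

    def highlight(text, words):
        # Wrap each query word in bold tags; each page could be processed
        # on a different machine (DTC again).
        for word in words:
            text = re.sub(f"({re.escape(word)})", r"<b>\1</b>", text,
                          flags=re.IGNORECASE)
        return text

    banned = {"page_5041"}                       # the user's "banned" list
    hits = ["page_600", "page_5041"]             # the intersection from before

    hits.sort(key=toy_score, reverse=True)       # 1. sort by relevancy
    hits = [p for p in hits if p not in banned]  # 2. drop banned pages
    for p in hits:                               # 3. bold the query words
        print(highlight(pages[p], ["hacker", "news"]))
    # <b>hacker</b> <b>news</b> new threads comments leaders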
The thing with the so-called MapReduce is that it makes this kind of distribution easier. You map your data, then you reduce, ad infinitum or as many times as you want, each round distributing the computation. I think I read somewhere that a single Google query can touch up to 100 machines.
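For a feel of the shape, here's a toy single-process MapReduce that builds the word -> page ids index from above; a real framework would shard both the map and reduce phases across many machines:

    from collections import defaultdict

    def map_phase(page_id, contents):
        # Emit a (word, page_id) pair for every word on the page.
        for word in contents.split():
            yield word, page_id

    def reduce_phase(word, page_ids):
        # Collapse all pairs for one word into a single posting list.
        return word, sorted(set(page_ids))

    docs = {
        "page_600":  "hacker news new threads comments leaders",
        "page_5041": "where I can find news for hackers",
    }

    # Shuffle: group the mapped pairs by key before reducing.
    grouped = defaultdict(list)
    for page_id, contents in docs.items():
        for word, pid in map_phase(page_id, contents):
            grouped[word].append(pid)

    index = dict(reduce_phase(w, pids) for w, pids in grouped.items())
    print(index["news"])  # ['page_5041', 'page_600']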