Apparently I am either horrible at explaining myself or you are being deliberately obtuse. The argument about intermediate data size is a sufficient but not necessary argument to show that the author has no point to make.
First, let's show that I can write a job that produces more data than it inputs. I have a map of user to score, and want to compute pairwise summed scores for every user pair. This is clearly O(n^2) outputs. Computing pairwise scores is a common algorithm for recommender systems. (I realize in practice you generally do not compute the entire space because it will be too slow. However, your output will be closer to n^2 than n, i.e., it will be much larger than your input.)
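To make this concrete, here's a rough sketch of such a job in Hadoop's Java API. The single-reducer funnel and the tab-separated "user<TAB>score" input format are my assumptions, chosen for brevity, not efficiency; a real job would partition the pair space across reducers:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairwiseScores {
      // Funnel every "user<TAB>score" record to a single key.
      public static class FunnelMapper extends Mapper<Object, Text, NullWritable, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(NullWritable.get(), value);
        }
      }

      // Emit a summed score for every user pair: O(n) in, O(n^2) out.
      public static class PairReducer extends Reducer<NullWritable, Text, Text, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          List<String[]> users = new ArrayList<>();
          for (Text v : values) users.add(v.toString().split("\t"));
          for (int i = 0; i < users.size(); i++) {
            for (int j = i + 1; j < users.size(); j++) {
              double sum = Double.parseDouble(users.get(i)[1])
                         + Double.parseDouble(users.get(j)[1]);
              ctx.write(new Text(users.get(i)[0] + "," + users.get(j)[0]),
                        new Text(Double.toString(sum)));
            }
          }
        }
      }
    }

The mapper passes n records through untouched; the reducer emits one record per pair, which is exactly the quadratic blowup I'm describing.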
Now, this is a complicated example. A simpler example is "I want to compute aggregations on all the fields in my log file." If you have N fields, Hadoop is going to shuffle and sort roughly N copies of the data. I.e., you will be producing lots of intermediate data, almost certainly more than the input size, just by using MapReduce (the code doing this merge sort is not your code, it's Hadoop's).
But again, the point you keep missing and conveniently ignore when you quote my posts is that the point of MapReduce on AWS is not about data locality (obviously) but about downstream flexibility. I can run 100 jobs, in parallel, on 10k machines, and output much more data than I input, without running a cluster of my own, and I get to pay by the hour. I am isolated from other devs and spin the machines down when finished. If you buy the argument that this is a useful feature (as any EMR customer would attest to) then this too is a separate, more pragmatic reason why the author has no idea what he is talking about when he says all EMR customers are a cargo cult.
> First, let's show that I can write a job that produces more data than it inputs. I have a map of user to score, and want to compute pairwise summed scores for every user pair. This is clearly O(n^2) outputs. Computing pairwise scores is a common algorithm for recommender systems. (I realize in practice you generally do not compute the entire space because it will be too slow. However, your output will be closer to n^2 than n, i.e., it will be much larger than your input.)
But that's exactly it - you admit yourself that this is a contrived example! For all but trivial numbers of users, N^2 storage requirements would be impractical for any system; no programmer in their right mind would take such an approach.
And this is the point that you seem to be missing: the question is not whether you can come up with some contrived scenario where there is vastly more intermediate data than input data, but whether you can come up with a realistic, practical example - and apparently you cannot.
> A simpler example is "I want to compute aggregations on all the fields in my log file."
In this case the amount of intermediate data is no greater than the amount of output data.
> But again, the point you keep missing and conveniently ignore when you quote my posts is that the point of MapReduce on AWS is not about data locality (obviously) but about downstream flexibility. I can run 100 jobs, in parallel, on 10k machines, and output much more data than I input, without running a cluster of my own, and I get to pay by the hour. I am isolated from other devs and spin the machines down when finished.
These are benefits of EC2, not of map-reduce. The critique was specific to map-reduce, not of the entire principle of on-demand computing.
> If you buy the argument that this is a useful feature (as any EMR customer would attest to)
Strawman. You're attempting to conflate the general benefits of on-demand computing with the specific benefits of EMR. The author was criticizing EMR in particular, not the entire principle of on-demand computing.
> then this too is a separate, more pragmatic reason why the author has no idea what he is talking about when he says all EMR customers are a cargo cult.
And yet you still haven't been able to cite any realistic, non-contrived examples of where EMR is the appropriate tool for the job.
Ok, it's clear now that you have horrible reading comprehension and/or have never used Hadoop in practice.

If you perform an aggregation on 10 fields in a log file, a common pattern, you are going to be generating more intermediate data than your log files during the shuffle. (Hadoop is generating that data in this case, not your code.)

If you generate a dataset that has pairwise scores for users, even if you are only sampling 1% of them, then you are generating way more data than your input. (You are generating this data in this example, not Hadoop, which is why I included it.) Say N = 1M and you want to compute the top 1% of pairwise scores (pruned via some locality-sensitive hashing). Guess what: you are computing N^2 * 0.01 = 10B records. Perfectly tractable for a big data system.

You seem too busy yelling at me to even read what I am writing, since I stated this clearly: "(I realize in practice you generally do not compute the entire space because it will be too slow. However, your output will be closer to n^2 than n, i.e., it will be much larger than your input.)"
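If it helps, here is roughly what that LSH pruning step looks like as a Hadoop mapper. The 64-dimensional feature vectors, the 16-bit signature, and the input format are all illustrative assumptions on my part, not something from this thread:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Bucket users by a random-hyperplane signature so only nearby users get paired.
    public class LshBucketMapper extends Mapper<Object, Text, Text, Text> {
      private static final int BITS = 16;  // signature length (assumed)
      private static final int DIM = 64;   // feature dimension (assumed)
      private static final double[][] PLANES = new double[BITS][DIM];
      static {
        Random rng = new Random(42);  // fixed seed: every mapper draws the same planes
        for (double[] plane : PLANES) {
          for (int d = 0; d < DIM; d++) plane[d] = rng.nextGaussian();
        }
      }

      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // assumed input line: "user<TAB>f1,f2,...,f64"
        String[] parts = value.toString().split("\t");
        String[] feats = parts[1].split(",");
        StringBuilder sig = new StringBuilder(BITS);
        for (double[] plane : PLANES) {
          double dot = 0;
          for (int d = 0; d < DIM; d++) dot += plane[d] * Double.parseDouble(feats[d]);
          sig.append(dot >= 0 ? '1' : '0');
        }
        ctx.write(new Text(sig.toString()), value);
      }
    }

Users sharing a signature land in the same reducer bucket, where pairwise scoring proceeds as in my first example; the output is still on the order of N^2 times the collision rate, i.e., far larger than the input.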
You say: "These are benefits of EC2, not of map-reduce. The critique was specific to map-reduce, not of the entire principle of on-demand computing." This is patently false. The criticism of the author is that people using EMR are a cargo cult and do not understand the purpose of map reduce. Ie, they should not be using EMR, and he is so clever for realizing the folly of their ways. My claim is the author does not understand the purpose of EMR. By definition, EMR brings with it the benefits of EC2. I don't see how this is a strawman.
At least I am sure now that it's not my communication skills: I illustrated two perfectly fine examples and you glossed right over them without understanding them. I've already spent enough time on this and argued in good faith, and at this point I can at least be confident that you have no interest in understanding what I am saying and instead want to childishly prove me wrong on the internet.