
MapReduce is a useful tool in some contexts, but the hype surrounding it always felt like people patting themselves on the back for re-inventing relational algebra.



The big hype was always due to the ability to shed large Oracle-based data warehouses. When Hadoop was at the height of its hype in 2010, Oracle was charging 150k per CPU core per year for a RAC cluster license.

Considering that Oracle is not in fact magic, you still needed a lot of cores, which meant that a large number of firms were spending 7-8 figures annually on Oracle licenses. MapReduce/Hadoop was the first widely accepted alternative that didn’t involve spending outrageous license fees and instead involved outrageous hardware expenditure.


Enlightenment moment.

Con: OSS is less optimized than the proprietary solution, requiring bigger hardware

Pro: OSS allows you to buy bigger hardware, use all of it without logical restrictions, and scale infinitely beyond the arbitrary point you were locked into with licensing.

And then the new-found efficiency frees up time to discover/identify $(x,)xxx,xxx+ in manual work that can also now be done with your new-found compute...

Wow. Way to prevent us from progressing beyond the industrial revolution.

($catchup_speed++)


And there's also an added cost of engineers supporting all that hardware and the Hadoop components running on it.


My interpretation is that this is exactly why it’s so brilliant.

It’s conceptually incredibly simple for the end user, but it encapsulates optimizing processing across a distributed file system, fault tolerance, shuffling key-value pairs, job stage planning, handling intermediates, etc.

Hadoop is a big data framework that reduced the level of competence required to write data pipelines because it was able to hide a massive amount of complexity behind the map-reduce abstraction.

I’d even argue that Hive, Snowflake, and other SQL data warehouses have taken this idea further, where most SQL primitives can be implemented as map-reduce derivatives. With this next level of abstraction, DBAs and non-engineers are writing map-reduce computations.

I think my point is that abstractions like map-reduce have had a democratizing effect on who can implement high-scale data processing, and their value is that they took something incredibly complex and made it simple.
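To make that concrete, here's a toy sketch in plain Python (not Hadoop's real API, and obviously not how Hive actually compiles queries) of how a SQL-style "SELECT word, COUNT(*) ... GROUP BY word" decomposes into a map, a shuffle, and a reduce:

    from collections import defaultdict

    def map_fn(record):
        # emit (key, value) pairs; the framework normally partitions these
        for word in record.split():
            yield word, 1

    def reduce_fn(key, values):
        # sees every value for one key, but only after the shuffle
        yield key, sum(values)

    def run_mapreduce(records, map_fn, reduce_fn):
        # the shuffle: group mapped values by key (Hadoop does this across
        # machines with sorting and spilling; here it's just in memory)
        groups = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):
                groups[key].append(value)
        for key, values in groups.items():
            yield from reduce_fn(key, values)

    print(dict(run_mapreduce(["to be or not to be"], map_fn, reduce_fn)))
    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

All the end user writes is the two small functions; the distribution, fault tolerance, and shuffling hide behind the equivalent of run_mapreduce.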


I agree with this. As soon as the MapReduce paper came out, people were criticizing it for a lack of novelty, claiming that so-and-so has been using these same techniques for years. And of course those critics are still around saying the same things. But I think there's a reason we keep going back to these techniques, and I think it's because they repeatedly prove to be practical and effective.


It reminds me more of Timescale’s continuous aggregates and the aggregation indexes of Firebolt, the new Snowflake slayer.


Isn't the entire field of software just people patting themselves on the back for some extremely basic relational algebra?

I don't know what I should be proud of when I learn something new; it all seems extremely basic as soon as I learn it.


I wouldn’t say that any more than I’d say the soccer World Cup is a bunch of people being worshipped for jogging around passing a ball.

It’s pretty easy to simplify things down until they sound unimpressive.


It's kind of like math; all math concepts are impossible until you figure out why they're actually easy.


It was more about scaling than about relational algebra though.


The scale argument was really overblown and presented without caveats. If you have Google-scale computations, where it is not unlikely that at least 1 in 10k machines serving that request will bite the dust during the query, then of course MapReduce makes sense.

However, in most other cases there are now far better alternatives (although tbh I'm not sure how many were around when MapReduce was introduced).

The main limitation of MapReduce is the barriers imposed by the shuffle stage, and at the end of the reduce when chaining together multiple MapReduce operations. Dataflow frameworks remove these barriers to varying degrees, which often lowers latency and can improve resource utilization.
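To illustrate, a self-contained toy in Python (not any real framework's API): each mapreduce call below is a full barrier, and job 2 can't start mapping until job 1's shuffle and reduce have completely finished.

    from collections import defaultdict
    from itertools import chain

    def mapreduce(records, map_fn, reduce_fn):
        shuffled = defaultdict(list)
        # barrier 1: the shuffle needs *all* map output before any reduce runs
        for key, value in chain.from_iterable(map(map_fn, records)):
            shuffled[key].append(value)
        return [reduce_fn(k, vs) for k, vs in shuffled.items()]

    docs = ["to be or not to be", "to map is to reduce"]

    # job 1: word counts
    counts = mapreduce(
        docs,
        lambda doc: [(w, 1) for w in doc.split()],
        lambda word, ones: (word, sum(ones)),
    )
    # barrier 2: job 1 is fully materialized before job 2 can begin

    # job 2: group words by how often they occur
    by_count = mapreduce(
        counts,
        lambda pair: [(pair[1], pair[0])],
        lambda n, words: (n, sorted(words)),
    )
    print(by_count)
    # [(4, ['to']), (2, ['be']), (1, ['is', 'map', 'not', 'or', 'reduce'])]

A dataflow engine can instead stream job 1's reduce output into job 2's map as it becomes available, rather than waiting at each of those points.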


Exactly this. I remember in 2007 being able to process TBs of data on commodity hardware with Hadoop. You got decent throughput and decent fault tolerance out of the box, wrapped in Java that many average software developers (yours truly included) were comfortable with. You could scale data and people.

It dramatically reduced the cost of entry for many ad-tech applications.


> It dramatically reduced the cost of entry for many ad-tech applications.

You say that as if it was unequivocally a point in its favour.


Apart from Google, which has a patent related to their 2004 paper, I don't know how much people are trying to "take credit" for map-reduce. I'm certainly not. But I do think the approach of running map-reduce continuously in realtime is interesting and worth sharing. And I hope that some folks will be interested enough to try it out, either with Flow or in a system of their own design, and report on how it goes for them.
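For anyone curious what "continuously" means mechanically, here's a rough sketch of the general idea in plain Python (not Flow's actual API): instead of re-running a batch reduce over all the data, you keep the reduced value per key and fold each new record in as it arrives.

    from collections import defaultdict

    def map_fn(event):
        # emit (key, value) pairs for each incoming record
        yield event["user"], 1

    state = defaultdict(int)  # per-key reduced value, maintained incrementally

    def on_event(event):
        for key, value in map_fn(event):
            state[key] += value  # the "reduce" applied continuously

    for e in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
        on_event(e)
    print(dict(state))  # {'a': 2, 'b': 1}

The hard parts, of course, are durable state, repartitioning, and exactly-once updates, which a toy like this skips entirely.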



