A Guide to Python Frameworks for Hadoop (cloudera.com)
93 points by laserson on Jan 7, 2013 | 15 comments



Former[1] primary mrjob maintainer here, thanks for the shout-out! I'd like to make a couple of notes and corrections. In particular, thank you for recommending mrjob for EMR usage; it's something we've made a point of trying to be the best at.

First of all, all of these frameworks use Hadoop Streaming. As mentioned in the mrjob 0.4-dev docs [2] (pardon the run-on sentence):

"Although Hadoop is primarly designed to work with JVM code, it supports other languages via Hadoop Streaming, a special jar which calls an arbitrary program as a subprocess, passing input via stdin and gathering results via stdout."

mrjob's role is to give you a structured way to write Hadoop Streaming jobs in Python (or, recently, any language). When your task runs, it takes input and produces output the same way your raw Python example does, except that mrjob runs each line through a deserialization function before passing it to the methods you've defined, and serializes the output again on the way out. It picks which code to run based on command line arguments such as --mapper, --reducer, etc. The methods of [de]serialization are declared declaratively in your class, as you showed.
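For readers who haven't seen mrjob before, here's a minimal sketch (mine, not from the post) of what such a job looks like; MRJob, the protocol classes, and the command-line behavior are mrjob's, and the protocol attributes shown are just its defaults written out explicitly:

    from mrjob.job import MRJob
    from mrjob.protocol import JSONProtocol, RawValueProtocol

    class MRWordCount(MRJob):
        # These are mrjob's defaults, spelled out for clarity.
        INPUT_PROTOCOL = RawValueProtocol   # each input line is passed through as-is
        INTERNAL_PROTOCOL = JSONProtocol    # mapper -> reducer data is JSON-encoded
        OUTPUT_PROTOCOL = JSONProtocol      # final output is JSON-encoded

        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        # mrjob decides whether to act as the mapper or the reducer based
        # on the command line arguments Hadoop Streaming passes it.
        MRWordCount.run()

Hadoop Streaming runs this same script as both the mapper and the reducer subprocess; the protocol attributes are what do the [de]serialization on either side.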

So why did you find mrjob to be slower than bare Hadoop Streaming? I don't know! In theory, you're running approximately the same code between the mrjob and the bare Python versions of your script. If anyone has time to dig into this and find out where that time is being spent, I would be grateful. Results should be sent to the issue tracker on the Github page [3].

Feel free to ask clarifying questions. I realize I may not be explaining this effectively to people unfamiliar with the ins and outs of Python MapReduce frameworks.

I'm thinking of organizing the 2nd "mrjob hackathon" in the near future, so please ping me if you're interested in contributing to an easy-to-handle lots-of-low-hanging-fruit OSS project. (Particularly if you're a Rubyist, because we have an experimental way to use Ruby with it.)

[1] mrjob is maintained by Yelp, where I worked until recently. It's still under active development, though it's slowed somewhat since Dave and I moved on.

[2] http://mrjob.readthedocs.org/en/latest/guides/concepts.html#...

[3] http://www.github.com/yelp/mrjob/issues/new


One last thing I forgot to mention: there was a typedbytes support pull request for mrjob at one point, but it was far enough behind the dev branch that it wasn't practical to merge. It's possible (probable, even) that typedbytes support will make it into a future release if an interested party puts in the time. I would have done it myself if I had been able to make head or tail of the typedbytes documentation.

Also, if you're just starting out or want to look over the docs, I'd recommend using the dev version hosted on readthedocs instead of the PyPI version, as the author did: http://mrjob.readthedocs.org/en/latest/index.html


I only did a bit of profiling on mrjob, but it seemed that serialization/deserialization was significantly slowing down the overall computation.


Here's the code that's doing the actual line parsing: https://github.com/Yelp/mrjob/blob/master/mrjob/protocol.py#...

It just splits on tab for input and re-joins on tab for output, with some extra logic for lines without tabs.
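Paraphrasing (this is a sketch, not the actual source), the read/write pair behaves roughly like this:

    # Rough paraphrase of RawProtocol's behavior, not the real implementation.
    class RawProtocolSketch(object):

        def read(self, line):
            key_value = line.split('\t', 1)
            if len(key_value) == 1:
                # No tab on the line: treat the whole line as the key.
                key_value.append(None)
            return tuple(key_value)

        def write(self, key, value):
            # Drop a None value so a key-only record round-trips cleanly.
            return '\t'.join(x for x in (key, value) if x is not None)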

EDIT: The example in the post is using JSON for communication between intermediate steps, while the Hadoop Streaming example is using a custom delimiter format. So this isn't really a fair comparison; the mrjob example could just as easily use the same efficient intermediate format.


Here is the profiling output: https://gist.github.com/4478737

The input protocol is RawProtocol, which simply splits on tab. After that, though, mrjob defaults to using JSON internally, and that's causing a lot of the slowdown.


So mrjob isn't slower at all! You just chose to use JSON for your intermediate steps in the mrjob example, instead of the delimiter format you used in the raw Python example. I believe the mrjob docs do say what the defaults are. RawProtocol can be used for intermediate steps just fine.
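For example (assuming string keys and values), switching the intermediate steps to the raw tab-separated format is a one-line declaration on the job class:

    from mrjob.job import MRJob
    from mrjob.protocol import RawProtocol

    class MRWordCount(MRJob):
        # Pass mapper -> reducer data as raw tab-separated strings instead
        # of the default JSON; this assumes keys and values are plain strings.
        INTERNAL_PROTOCOL = RawProtocol

        def mapper(self, _, line):
            for word in line.split():
                yield word, '1'

        def reducer(self, word, counts):
            # Values arrive as strings under RawProtocol, so convert here.
            yield word, sum(int(c) for c in counts)

    if __name__ == '__main__':
        MRWordCount.run()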

Please either mention this difference in your post or update the code and conclusions. If there's a place in the documentation where we should mention optimizations or details like this, I'd be interested to know.

I should have thought of this before. Oh well.


My goal was to use mrjob's features, not strip them down for performance. I find mrjob's natural use of JSON very appealing in terms of user experience: it means that keys can be more complex types without the user having to figure out the best way to encode them by hand. I make it clear in the post that this is an appealing property of mrjob, and that it's the reason for the slowdown. As is, the code will not work with RawProtocol internally because the key is a tuple of words.


I can't find the part of the post where you explain that JSON parsing is the reason for the slowdown; you just say mrjob itself is slower. While I agree that mrjob's defaults encourage the use of JSON, I think it's unfair to blame the framework for the lack of optimization, given that the bare Python example could just as easily have used JSON.

One real issue with mrjob is that it assumes you're only going to have one key and one value. It isn't straightforward to use multiple key fields. The workaround is to write a custom protocol (which, btw, is very simple [1]) that uses the line up to the first tab as the key, and the rest of the line as the value, probably splitting it on tab as well and passing it through as a tuple. If we had made multipart keys simpler to use, maybe you would have chosen to use a more efficient format.
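As a sketch of that workaround (the class name is made up; read and write are the interface mrjob expects of a protocol object):

    # Hypothetical custom protocol: everything before the first tab is the
    # key, and the remaining tab-separated fields come through as a tuple.
    class FirstTabKeyProtocol(object):

        def read(self, line):
            key, _, rest = line.partition('\t')
            return key, tuple(rest.split('\t')) if rest else ()

        def write(self, key, value):
            # Re-join the key and the value fields with tabs for output.
            return '\t'.join([key] + list(value))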

Anyway, the main part I take issue with is:

"mrjob seems highly active, easy-to-use, and mature...but it appears to perform the slowest."

That's just not true. It would be fair to say that optimizing jobs with multipart keys isn't straightforward and therefore encourages non-optimal code, but that's moot if you're just using one key and one value, as most people do.

I'm really not trying to dump on you here. I liked the post! I would just prefer that it was more precise about these things.

[1] http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs...

EDIT: If anyone's thinking about downvoting this guy (someone did), don't. This is a discussion in good faith.


Slightly tangential, but I'd like to shamelessly plug my HDFS client library (with nice Python bindings)[0].

If you want to access files on HDFS from your Python tasks outside of your typical map / shuffle inputs from the streaming API, it might come in handy. It doesn't go through the JVM (the library is in C), so it might save a little latency for short Python tasks.

[0]: https://github.com/cemeyer/hadoofus

Also, I'm pretty new to publishing my own open source libraries. If people would be so kind, I'd love some constructive criticism. Thanks HN!


The main thing you need to do is publish your Python package to PyPI. This should fill in the gaps in your Python packaging knowledge: http://guide.python-distribute.org/


It's a C library, though — is that acceptable for PyPI? Thanks for the feedback!


Yep - if it's a Python library that can be 'python setup.py install'ed, then it belongs on PyPI. That includes modules that require C extensions.
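A minimal setup.py for a package that builds a C extension might look something like this (names are placeholders, not hadoofus's actual layout):

    from setuptools import setup, Extension

    setup(
        name='mypackage',                     # placeholder project name
        version='0.1.0',
        description='Python bindings for a C library',
        packages=['mypackage'],               # the pure-Python wrapper package
        ext_modules=[
            Extension(
                'mypackage._native',          # the compiled extension module
                sources=['src/binding.c'],    # C sources to compile
                libraries=['mylib'],          # external C library to link against
            ),
        ],
    )

With that in place, 'python setup.py sdist upload' (or a tool like twine) gets it onto PyPI.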


The only framework I've used of the ones discussed is Hadoop Streaming with Python. For our use case (rapid prototyping of statistical analytics on several-terabyte structured data), it worked perfectly and was almost frictionless.

As the article calls out, we had to detect the boundaries between keys manually, but that didn't add much complexity. We called into Python through Hive scripts. Our group had little prior experience: some had used Python before, and no one had used Hive or Hadoop much (or at all), yet we were all productive within a day of ramp-up time. I'm positive we were more productive than if we'd implemented everything in Java, even for the people more experienced in Java than Python.
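For anyone curious, the boundary detection amounts to something like this generic sketch of a Streaming reducer (not our actual code):

    #!/usr/bin/env python
    # Sketch of a Hadoop Streaming reducer that detects key boundaries by
    # hand: Hadoop sorts the mapper output by key, so a change in the key
    # column means the previous key's records are finished.
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, _, value = line.rstrip('\n').partition('\t')
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))
            current_key = key
            total = 0
        total += int(value)

    if current_key is not None:
        print('%s\t%d' % (current_key, total))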

If I have any future projects requiring similar analysis, I'd like to use Hive with Hadoop Streaming and Python again.


A while back I did some tests comparing Hadoop Streaming to the Dumbo library. I have to agree: if your MapReduce jobs aren't terribly complicated and don't require chaining, it's best to just write them against the raw Streaming library.


You can work with HBase via Streaming as well, using this library: https://github.com/vanship82/hadoop-hbase-streaming



