
Out of curiosity, is anyone using Spark in production? We're evaluating whether we should invest in Hadoop or Spark. They're certainly not mutually exclusive, but I would rather invest fully in Spark than have infrastructure split between Spark and Hadoop.



Spark has a much nicer API than Hadoop and can, in theory, give you significant speedups on iterative in-memory workloads. On the other hand, I've had a terrible experience with the stability and debuggability of previous versions of Spark. They do seem to be rapidly improving, so it's hard to recommend anything other than trying it out and seeing if Spark works for your needs.
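
For concreteness, the speedup on iterative workloads comes largely from caching: the working set is pinned in memory once and reused across passes, instead of being re-read from storage on every iteration as in classic MapReduce. A minimal sketch, assuming Spark's Scala API (the input file and the toy computation are made up):

    import org.apache.spark.SparkContext._ // RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

        // hypothetical input file with one number per line; cache() pins it in memory
        val points = sc.textFile("points.txt").map(_.toDouble).cache()

        var estimate = 0.0
        for (_ <- 1 to 10) {
          val current = estimate // capture as a val so the closure ships cleanly
          // each pass re-reads the cached RDD from memory, not from disk
          val gradient = points.map(p => current - p).mean()
          estimate = current - 0.5 * gradient
        }
        println(s"estimate after 10 passes: $estimate")
        sc.stop()
      }
    }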

(Also, you probably know this already, but many people don't really have data big enough to necessitate a distributed framework. If your datasets are measured in gigabytes, you can do everything more simply on one machine and/or with a traditional database.)


We are aware of at least 500 production use cases. However, since it is open source software, there are a lot more that we don't know about.

You can find some public ones here:

http://spark-summit.org/east/2015/agenda

http://spark-summit.org/2014

https://cwiki.apache.org/confluence/display/SPARK/Powered+By...


I believe one of the key advantages of Spark over Hadoop is being able to run the full stack in a small environment (a single machine) and do all the coding there, without needing a cluster just for development.
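
For what it's worth, the only thing that usually has to change between a laptop and a cluster is the master URL. A minimal sketch, assuming Spark's Scala API (the app name, input file, and cluster address are made up):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object DevSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("dev-sketch")
          .setMaster("local[*]")             // single machine, one worker thread per core
          // .setMaster("spark://host:7077") // point at a standalone cluster instead

        val sc = new SparkContext(conf)
        val counts = sc.textFile("events.log") // hypothetical local input file
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }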


This is why I like Cascading [1]. It has a higher-level API on top of Hadoop, and it also works in local mode with little change. I've actually used it to do transformation work from local files (CSV), join them into structured documents, and dump them into ArangoDB. I liked it so much I wrote a third-party library to work with ArangoDB in Hadoop [2].

[1] cascading.org/
[2] https://github.com/deusdat/guacaphant


This is a big sell for me. There are some small catches to it. For example, operations like reduceByKey require an associative function, but a non-associative one may not actually cause problems until the dataset becomes large enough to be split across multiple workers. So in my team, our testers specifically check that each reduce function's associativity has been demonstrated in a unit test.
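
The check itself can be very plain: take a few sample values and assert that the grouping doesn't matter. A minimal sketch (the reduce function and sample values are made up):

    // hypothetical reduce function under test
    val sumDurations: (Long, Long) => Long = _ + _

    // associativity: f(f(a, b), c) must equal f(a, f(b, c)) for representative samples
    val samples = Seq((1L, 2L, 3L), (10L, 0L, 7L), (5L, 5L, 5L))
    samples.foreach { case (a, b, c) =>
      assert(sumDurations(sumDurations(a, b), c) == sumDurations(a, sumDurations(b, c)),
             s"reduce function is not associative for ($a, $b, $c)")
    }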


You can also run MapReduce in local mode on a single machine.


Yep, we're currently using Spark 1.1 for aggregating log data in one big batch per day, and I'm also experimenting with Spark Streaming. I'd say go with Spark if it's greenfield development: the API is far nicer, and it's far less fiddly to work with.

Spark uses a lot of Hadoop under the covers, so you still benefit from that ecosystem.

What I really like about it is that it can be easily unit and integration tested.


Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?


Well, we unit test each of the functions, which are typically mapping, reducing, or filtering functions.

The only thing special we do here is explicitly test that any reduce function which is required to be associative is actually associative, due to a dumb bug I wrote once.

With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.
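
Roughly, the shape of such a test, assuming Spark's Scala API and JUnit 4 (the processing function and the test data are made up for illustration):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.{After, Before, Test}

    class ProcessingIntegrationTest {
      private var sc: SparkContext = _

      @Before def setUp(): Unit = {
        sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
      }

      @After def tearDown(): Unit = sc.stop()

      // hypothetical RDD-in / RDD-out processing function standing in for the real one
      private def aggregateByUser(events: RDD[(String, Int)]): RDD[(String, Int)] =
        events.reduceByKey(_ + _)

      @Test def aggregatesCountsPerUser(): Unit = {
        val input = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
        val result = aggregateByUser(input).collect().toMap
        assert(result == Map("alice" -> 4, "bob" -> 2))
      }
    }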

We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.

We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system, and S3 is one the Hadoop ecosystem supports), and just verify the output against expectations derived from crunching that same subset via traditional means.
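
The S3 part is just another path scheme as far as Spark is concerned. A minimal sketch, assuming a Spark 1.x build with the Hadoop s3n:// filesystem (the bucket, paths, and aggregation are made up):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object EndToEndCheck {
      def main(args: Array[String]): Unit = {
        // the master comes from spark-submit / the cluster's defaults here
        val sc = new SparkContext(new SparkConf().setAppName("e2e-check"))
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // hypothetical subset of production logs, tab-separated with a user id in the first column
        val subset = sc.textFile("s3n://example-bucket/logs/2015-01-subset/")
        val counts = subset.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
        counts.saveAsTextFile("s3n://example-bucket/e2e-output/run-001/")
        sc.stop()
      }
    }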

Hope that helps! I found Spark very easy to get up and running with: you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to start one on AWS.

[1]: http://spark.apache.org/docs/1.2.0/programming-guide.html#in...


Thanks. That's very helpful.


We are using Spark in production at a very large enterprise. One thing that has really helped us is Spark Job Server from Ooyala. I love the way you can share SparkContexts and just track what is happening:

http://engineering.ooyala.com/blog/open-sourcing-our-spark-j...
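
For anyone curious what a jobserver job looks like: instead of writing your own main() and context setup, you implement the server's job trait and it hands you a (possibly shared) SparkContext. A rough sketch, assuming the older spark.jobserver.SparkJob API (the names here are illustrative and the exact signatures vary by version):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    object WordCountJob extends SparkJob {
      // reject a request up front if the expected config parameter is missing
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("input.string")) SparkJobValid
        else SparkJobInvalid("missing input.string")

      // the SparkContext is handed in by the job server and may be shared across jobs
      override def runJob(sc: SparkContext, config: Config): Any =
        sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
    }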


The PR referenced in their blog post is a 404; the actual project code is here:

https://github.com/spark-jobserver/spark-jobserver#readme


We plan to... so far so good. Not in production yet, though. We focus more on GraphX, which is even less mature than Spark. We feel pretty comfortable with Spark core...



