
Out of curiosity, is anyone using Spark in production? We're evaluating whether we should invest in Hadoop or Spark. They're certainly not mutually exclusive, but I would rather invest fully in Spark than have infrastructure split between Spark and Hadoop.



Spark has a much nicer API than Hadoop and can, in theory, give you significant speedups on iterative in-memory workloads. On the other hand, I've had a terrible experience with the stability and debuggability of previous versions of Spark. They do seem to be rapidly improving, so it's hard to recommend anything other than trying it out and seeing if Spark works for your needs.
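
For concreteness, the speedup on iterative workloads comes largely from caching: the working set is pinned in memory once and reused across passes, instead of being re-read from storage on every iteration as in classic MapReduce. A minimal sketch, assuming Spark's Scala API (the input file and the toy computation are made up):

    import org.apache.spark.SparkContext._ // RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

        // hypothetical input file with one number per line; cache() pins it in memory
        val points = sc.textFile("points.txt").map(_.toDouble).cache()

        var estimate = 0.0
        for (_ <- 1 to 10) {
          val current = estimate // capture as a val so the closure ships cleanly
          // each pass re-reads the cached RDD from memory, not from disk
          val gradient = points.map(p => current - p).mean()
          estimate = current - 0.5 * gradient
        }
        println(s"estimate after 10 passes: $estimate")
        sc.stop()
      }
    }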

(Also, you probably know this already, but many people don't really have data big enough to necessitate a distributed framework. If your datasets are measured in gigabytes, you can do everything more simply on one machine and/or with a traditional database.)


We are aware of at least 500 production use cases. However, since it is open source software, there are a lot more that we don't know about.

You can find some public ones here:

http://spark-summit.org/east/2015/agenda

http://spark-summit.org/2014

https://cwiki.apache.org/confluence/display/SPARK/Powered+By...


I believe one of the key advantages of Spark over Hadoop is being able to run the full stack in a small environment (a single machine) and do all the coding there, without needing a cluster just for development.
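
For what it's worth, the only thing that usually has to change between a laptop and a cluster is the master URL. A minimal sketch, assuming Spark's Scala API (the app name, input file, and cluster address are made up):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object DevSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("dev-sketch")
          .setMaster("local[*]")             // single machine, one worker thread per core
          // .setMaster("spark://host:7077") // point at a standalone cluster instead

        val sc = new SparkContext(conf)
        val counts = sc.textFile("events.log") // hypothetical local input file
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }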


This is why I like Cascading [1]. It has a higher-level API on top of Hadoop, and it also works in local mode with little change. I've actually used it to do transformation work from local files (CSV), join them into structured documents, and dump them into ArangoDB. I liked it so much I wrote a third-party library to work with ArangoDB in Hadoop [2].

[1] cascading.org/
[2] https://github.com/deusdat/guacaphant


This is a big sell for me. There are some small catches to it. For example, operations like reduceByKey require an associative function, but a non-associative one may not actually cause problems until the dataset becomes large enough to be split across multiple workers. So in my team, our testers specifically check that each reduce function's associativity has been demonstrated in a unit test.
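
The check itself can be very plain: take a few sample values and assert that the grouping doesn't matter. A minimal sketch (the reduce function and sample values are made up):

    // hypothetical reduce function under test
    val sumDurations: (Long, Long) => Long = _ + _

    // associativity: f(f(a, b), c) must equal f(a, f(b, c)) for representative samples
    val samples = Seq((1L, 2L, 3L), (10L, 0L, 7L), (5L, 5L, 5L))
    samples.foreach { case (a, b, c) =>
      assert(sumDurations(sumDurations(a, b), c) == sumDurations(a, sumDurations(b, c)),
             s"reduce function is not associative for ($a, $b, $c)")
    }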


You can also run MapReduce in local mode on a single machine.


Yep, we're currently using Spark 1.1 for aggregating log data in one big batch per day, and I'm also experimenting with Spark Streaming. I'd say go with Spark if it's greenfield development: the API is far nicer, and it's far less fiddly to work with.

Spark uses a lot of Hadoop under the covers, so you still benefit from that ecosystem.

What I really like about it is that it can be easily unit and integration tested.


Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?


Well, we unit test each of the functions, which are typically mapping, reducing, or filtering functions.

The only thing special we do here is explicitly test that any reduce function which is required to be associative is actually associative, due to a dumb bug I wrote once.

With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.
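
Roughly, the shape of such a test, assuming Spark's Scala API and JUnit 4 (the processing function and the test data are made up for illustration):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.{After, Before, Test}

    class ProcessingIntegrationTest {
      private var sc: SparkContext = _

      @Before def setUp(): Unit = {
        sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
      }

      @After def tearDown(): Unit = sc.stop()

      // hypothetical RDD-in / RDD-out processing function standing in for the real one
      private def aggregateByUser(events: RDD[(String, Int)]): RDD[(String, Int)] =
        events.reduceByKey(_ + _)

      @Test def aggregatesCountsPerUser(): Unit = {
        val input = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
        val result = aggregateByUser(input).collect().toMap
        assert(result == Map("alice" -> 4, "bob" -> 2))
      }
    }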

We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.

We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system, and S3 is one the Hadoop ecosystem supports), and just verify the output against expectations derived from crunching that same subset via traditional means.
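
The S3 part is just another path scheme as far as Spark is concerned. A minimal sketch, assuming a Spark 1.x build with the Hadoop s3n:// filesystem (the bucket, paths, and aggregation are made up):

    import org.apache.spark.SparkContext._ // pair-RDD implicits for Spark 1.x
    import org.apache.spark.{SparkConf, SparkContext}

    object EndToEndCheck {
      def main(args: Array[String]): Unit = {
        // the master comes from spark-submit / the cluster's defaults here
        val sc = new SparkContext(new SparkConf().setAppName("e2e-check"))
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // hypothetical subset of production logs, tab-separated with a user id in the first column
        val subset = sc.textFile("s3n://example-bucket/logs/2015-01-subset/")
        val counts = subset.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
        counts.saveAsTextFile("s3n://example-bucket/e2e-output/run-001/")
        sc.stop()
      }
    }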

Hope that helps! I found Spark very easy to get up and running with: you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to start one on AWS.

[1]: http://spark.apache.org/docs/1.2.0/programming-guide.html#in...


Thanks. That's very helpful.


We are using Spark in production at a very large enterprise. One thing that has really helped us is Spark Job Server from Ooyala. I love the way you can share SparkContexts and just track what is happening:

http://engineering.ooyala.com/blog/open-sourcing-our-spark-j...
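
For anyone curious what a jobserver job looks like: instead of writing your own main() and context setup, you implement the server's job trait and it hands you a (possibly shared) SparkContext. A rough sketch, assuming the older spark.jobserver.SparkJob API (the names here are illustrative and the exact signatures vary by version):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    object WordCountJob extends SparkJob {
      // reject a request up front if the expected config parameter is missing
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("input.string")) SparkJobValid
        else SparkJobInvalid("missing input.string")

      // the SparkContext is handed in by the job server and may be shared across jobs
      override def runJob(sc: SparkContext, config: Config): Any =
        sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
    }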


The PR referenced in their blog post is a 404; the actual project code is here:

https://github.com/spark-jobserver/spark-jobserver#readme


We plan to... so far so good. Not in production yet, though. We focus more on GraphX, which is even less mature than Spark. We feel pretty comfortable with Spark core...



