
Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?



Well, we unit test each of the functions we hand to Spark, which are typically mapping, reducing, or filtering functions.
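In sketch form, one of those tests looks something like this (the function and test names are made up for illustration, not our real code):

    import org.junit.Test
    import org.junit.Assert.assertEquals

    class ParseEventTest {
      // Hypothetical mapping function: turns a raw log line into a
      // (userId, bytes) pair; used later inside rdd.map(parseEvent).
      def parseEvent(line: String): (String, Long) = {
        val fields = line.split("\t")
        (fields(0), fields(1).toLong)
      }

      @Test
      def parsesATabSeparatedLine(): Unit = {
        assertEquals(("user42", 1024L), parseEvent("user42\t1024"))
      }
    }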

The only special thing we do here is explicitly test that any reduce function that's required to be associative actually is associative, due to a dumb bug I wrote once.
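A rough sketch of that associativity check, assuming a hypothetical combine function and a handful of sample values:

    import org.junit.Test
    import org.junit.Assert.assertEquals

    class CombineStatsTest {
      // Hypothetical reduce function passed to rdd.reduce(combine); Spark may
      // apply it in any grouping across partitions, so it must be associative.
      def combine(a: (Long, Long), b: (Long, Long)): (Long, Long) =
        (a._1 + b._1, a._2 max b._2)

      @Test
      def combineIsAssociative(): Unit = {
        val samples = Seq((1L, 5L), (2L, 3L), (7L, 7L), (0L, 0L))
        for (x <- samples; y <- samples; z <- samples) {
          // f(f(x, y), z) must equal f(x, f(y, z)) for every triple.
          assertEquals(combine(combine(x, y), z), combine(x, combine(y, z)))
        }
      }
    }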

With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.

We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.
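Putting those pieces together, a stripped-down version of such an integration test looks roughly like this (the processing function and test data are placeholders, not our real pipeline):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    class BytesPerUserIntegrationTest {
      // Stand-in for the real processing function: takes an RDD, returns an RDD.
      def bytesPerUser(events: RDD[(String, Long)]): RDD[(String, Long)] =
        events.reduceByKey(_ + _)

      private var sc: SparkContext = _

      @Before
      def setUp(): Unit = {
        // local[2] runs the driver plus two worker threads inside the test JVM.
        sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("integration-test"))
      }

      @After
      def tearDown(): Unit = {
        sc.stop()
      }

      @Test
      def sumsBytesPerUser(): Unit = {
        val input = sc.parallelize(Seq(("alice", 10L), ("bob", 5L), ("alice", 7L)))
        val result = bytesPerUser(input).collect().toMap
        assertEquals(Map("alice" -> 17L, "bob" -> 5L), result)
      }
    }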

We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system; S3 is one the Hadoop ecosystem supports), and just verify the output against expectations derived from crunching that same subset via traditional means.
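As a rough sketch of that end-to-end run (the bucket names, paths, and processing are placeholders; credentials come from the Hadoop configuration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Submitted to the cluster with spark-submit, which supplies the master URL.
    object EndToEndCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("end-to-end-check"))
        // Read the production subset from S3 via the Hadoop S3 filesystem support.
        val lines = sc.textFile("s3n://our-bucket/prod-sample/2015-01/")
        val counts = lines
          .map(line => line.split("\t")(0))
          .map(user => (user, 1L))
          .reduceByKey(_ + _)
        // Write results back to S3 for comparison against independently computed expectations.
        counts.saveAsTextFile("s3n://our-bucket/test-output/2015-01/")
        sc.stop()
      }
    }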

Hope that helps! I found Spark very easy to get up and running with; you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to spin one up on AWS.

[1]: http://spark.apache.org/docs/1.2.0/programming-guide.html#in...


Thanks. That's very helpful.



