
Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?



Well, we unit test each of the functions we hand to Spark, which are typically mapping, reducing, or filtering functions.
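In sketch form, one of those tests looks something like this (the function and test names are made up for illustration, not our real code):

    import org.junit.Test
    import org.junit.Assert.assertEquals

    class ParseEventTest {
      // Hypothetical mapping function: turns a raw log line into a
      // (userId, bytes) pair; used later inside rdd.map(parseEvent).
      def parseEvent(line: String): (String, Long) = {
        val fields = line.split("\t")
        (fields(0), fields(1).toLong)
      }

      @Test
      def parsesATabSeparatedLine(): Unit = {
        assertEquals(("user42", 1024L), parseEvent("user42\t1024"))
      }
    }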

The only special thing we do here is explicitly test that any reduce function that's required to be associative actually is associative, due to a dumb bug I wrote once.
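A rough sketch of that associativity check, assuming a hypothetical combine function and a handful of sample values:

    import org.junit.Test
    import org.junit.Assert.assertEquals

    class CombineStatsTest {
      // Hypothetical reduce function passed to rdd.reduce(combine); Spark may
      // apply it in any grouping across partitions, so it must be associative.
      def combine(a: (Long, Long), b: (Long, Long)): (Long, Long) =
        (a._1 + b._1, a._2 max b._2)

      @Test
      def combineIsAssociative(): Unit = {
        val samples = Seq((1L, 5L), (2L, 3L), (7L, 7L), (0L, 0L))
        for (x <- samples; y <- samples; z <- samples) {
          // f(f(x, y), z) must equal f(x, f(y, z)) for every triple.
          assertEquals(combine(combine(x, y), z), combine(x, combine(y, z)))
        }
      }
    }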

With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.

We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.
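Putting those pieces together, a stripped-down version of such an integration test looks roughly like this (the processing function and test data are placeholders, not our real pipeline):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    class BytesPerUserIntegrationTest {
      // Stand-in for the real processing function: takes an RDD, returns an RDD.
      def bytesPerUser(events: RDD[(String, Long)]): RDD[(String, Long)] =
        events.reduceByKey(_ + _)

      private var sc: SparkContext = _

      @Before
      def setUp(): Unit = {
        // local[2] runs the driver plus two worker threads inside the test JVM.
        sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("integration-test"))
      }

      @After
      def tearDown(): Unit = {
        sc.stop()
      }

      @Test
      def sumsBytesPerUser(): Unit = {
        val input = sc.parallelize(Seq(("alice", 10L), ("bob", 5L), ("alice", 7L)))
        val result = bytesPerUser(input).collect().toMap
        assertEquals(Map("alice" -> 17L, "bob" -> 5L), result)
      }
    }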

We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system; S3 is one the Hadoop ecosystem supports), and just verify the output against expectations derived from crunching that same subset via traditional means.
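As a rough sketch of that end-to-end run (the bucket names, paths, and processing are placeholders; credentials come from the Hadoop configuration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Submitted to the cluster with spark-submit, which supplies the master URL.
    object EndToEndCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("end-to-end-check"))
        // Read the production subset from S3 via the Hadoop S3 filesystem support.
        val lines = sc.textFile("s3n://our-bucket/prod-sample/2015-01/")
        val counts = lines
          .map(line => line.split("\t")(0))
          .map(user => (user, 1L))
          .reduceByKey(_ + _)
        // Write results back to S3 for comparison against independently computed expectations.
        counts.saveAsTextFile("s3n://our-bucket/test-output/2015-01/")
        sc.stop()
      }
    }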

Hope that helps! I found Spark very easy to get up and running with; you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to spin one up on AWS.

[1]: http://spark.apache.org/docs/1.2.0/programming-guide.html#in...


Thanks. That's very helpful.



