This is a really great release, I'm excited to start playing with DataFrames!
One question if anyone from Databricks reads this: what about GraphX? Will it get the same level of attention that SQL, MLlib and Spark core have received recently? E.g. is adding support for Gremlin (or any other graph query language) on the roadmap? Is an R API for GraphX planned?
p.s. when is GraphX planned to exit Alpha?
in any case, great product, nice usage of Akka and Scala, and a very intuitive API. I feel lucky to be working with it on a daily basis.
I guess the DataFrame API needs to be spelled out for me. Does this mean RDDs will be deprecated in the future if the new DataFrame is faster? As someone whose gateway drug into programming was R, it's been fun to watch data frames spread across programming languages. I'm a huge fan of Python's Pandas library and very interested in Spark.
The DataFrame is an evolution of the RDD model, where Spark knows explicit schema information. The core Spark RDD API is very generic and assumes nothing about the structure of the user's data. This is powerful, but ultimately the generic nature imposes limits on how much we can optimize.
DataFrames impose just a bit more structure: we assume that you have a tabular schema, named fields with types, etc. Given this assumption, Spark can optimize a lot of internal execution details, and also provide slicker APIs to users. It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.
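To give a flavour of what that buys you, here's a minimal Scala sketch (the people.json file and its fields are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("df-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // The schema (field names and types) is inferred from the JSON,
    // so the optimizer can see exactly what the query touches.
    val people = sqlContext.jsonFile("people.json")

    people.filter(people("age") > 21)
          .groupBy("age")
          .count()
          .show()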
Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it really is necessary to drop into that lower level API. But I do anticipate that within a year or two most Spark applications will let DataFrames do the heavy lifting.
In fact, DataFrames and RDDs are completely interoperable; either can be converted to the other. This means that even if you don't want to use DataFrames you can benefit from all of the cool input/output capabilities they have, even just to create regular old RDDs.
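Continuing the sketch above, the round trip looks roughly like this (Person is just a made-up case class):

    import sqlContext.implicits._

    case class Person(name: String, age: Int)

    // RDD -> DataFrame: an ordinary RDD of case classes picks up a schema via toDF()
    val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
    val df  = rdd.toDF()

    // DataFrame -> RDD: drop back to the lower-level API whenever you need it
    val rows  = df.rdd                   // RDD[Row]
    val names = rows.map(_.getString(0)) // back to plain RDD operations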
Out of curiosity, is anyone using Spark in production? We're evaluating whether we should invest in Hadoop or Spark. They're certainly not mutually exclusive, but I would rather invest fully in Spark than have infrastructure split between Spark and Hadoop.
Spark has a much nicer API than Hadoop and can, in theory, give you significant speedups on iterative in-memory workloads. On the other hand, I've had a terrible experience with the stability and debuggability of previous versions of Spark. They do seem to be improving rapidly, so it's hard to recommend anything other than trying it out and seeing if Spark works for your needs.
(Also, you probably know this already, but many people don't really have data big enough to necessitate a distributed framework. If your datasets are counted in gigabytes, you can do everything more simply on one machine and/or with a traditional database)
I believe one of the key advantages of Spark over Hadoop is being able to run the full stack in a small environment (a single machine) and do all the coding there, without needing a cluster just for development.
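For instance, the only thing that really changes between a laptop and a cluster is the master URL (a minimal sketch; the file name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" runs everything in a single JVM using all cores;
    // for a real cluster you'd point the master at e.g. spark://... or YARN instead.
    val conf = new SparkConf().setAppName("dev-job").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val wordCounts = sc.textFile("sample.csv")
      .flatMap(_.split(","))
      .map(field => (field, 1))
      .reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)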
This is why I like Cascading [1]. It provides a higher-level API on Hadoop, and it also works in local mode with little change. I've actually used it to do transformation work from local files (CSV), join them into structured documents and dump them into ArangoDB. I liked it so much I wrote a third-party library for working with ArangoDB in Hadoop [2].
This is a big sell for me. There are some small catches, though. For example, for operations requiring an associative function (such as reduceByKey), the need for associativity may not show up until the data set becomes large enough to be split across multiple workers, so on my team our testers specifically check that each reduce function's associativity has been demonstrated in a unit test.
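To make that concrete, here's a made-up illustration (assuming an existing SparkContext sc): subtraction is not associative, so reduceByKey with it looks fine on one partition but becomes partition-dependent once the data is split:

    val data = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)))

    // One partition: values are combined in a single order, so it always "works"
    data.coalesce(1).reduceByKey(_ - _).collect()

    // Several partitions: partial results are merged in whatever grouping the
    // shuffle produces, so (a - b) - c vs a - (b - c) can give different answers
    data.repartition(3).reduceByKey(_ - _).collect()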
Yep, we're using Spark 1.1 currently for aggregating log data on a one-big-bang-per-day basis, and I'm currently experimenting with Spark Streaming as well. I'd say go with Spark if it's greenfield dev; the API is far nicer, and it's far less fiddly to work with.
Spark uses a lot of Hadoop under the covers, so you still benefit from that ecosystem.
What I really like about it is that it can be easily unit and integration tested.
Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?
Well, we unit test each of the individual functions, which typically do mapping, reducing or filtering.
The only thing special we do here is explicitly test that any reduce function which is required to be associative is actually associative, due to a dumb bug I wrote once.
With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.
We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.
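In case it's useful, the shape of it is roughly this (a minimal sketch with made-up names):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    // Production code: pure RDD-in, RDD-out, so it knows nothing about where data lives
    object LogAggregator {
      def errorCountsByHost(lines: RDD[String]): RDD[(String, Int)] =
        lines.filter(_.contains("ERROR"))
             .map(line => (line.split(" ")(0), 1))
             .reduceByKey(_ + _)
    }

    class LogAggregatorTest {
      private var sc: SparkContext = _

      @Before def setUp(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      @After def tearDown(): Unit = sc.stop()

      @Test def countsErrorsPerHost(): Unit = {
        val input  = sc.parallelize(Seq("host1 ERROR boom", "host1 INFO ok", "host2 ERROR bad"))
        val result = LogAggregator.errorCountsByHost(input).collect().toMap
        assertEquals(Map("host1" -> 1, "host2" -> 1), result)
      }
    }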
We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system, and S3 is one that the Hadoop ecosystem supports), and we just verify the output against expectations derived from crunching that same subset via traditional means.
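The S3 bit is really just a path scheme going through the Hadoop filesystem layer. Reusing the made-up LogAggregator from the sketch above, it looks something like:

    // Credentials come from the Hadoop config (fs.s3n.awsAccessKeyId /
    // fs.s3n.awsSecretAccessKey); the bucket and paths here are made up.
    val logs    = sc.textFile("s3n://my-bucket/logs/2015-03-01/*")
    val results = LogAggregator.errorCountsByHost(logs)

    results.map { case (host, n) => host + "\t" + n }
           .saveAsTextFile("s3n://my-bucket/output/2015-03-01")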
Hope that helps! I found Spark very easy to get up and running with, you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to start a cluster on AWS.
We are using Spark in production at a very large enterprise. One thing that has really helped us is Spark Job Server from Ooyala. I love the way you can share SparkContexts and just track what is happening.
We plan to... so far so good. Not in production yet, though. We focus more on GraphX, which is even less mature than Spark. We feel pretty comfortable with Spark core...
Spark on YARN has been very stable and zero hassle for us. Run one command and it is deployed around the cluster and up and running. For performance, Spark absolutely destroys stock MapReduce/Tez, especially if, like us, you have a cluster with lots of basically unused RAM.
And Spark SQL, Spark Shell are both fantastic additions to the Hadoop ecosystem.