This is a really great release, I'm excited to start playing with DataFrames!
One question if anyone from Databricks reads this: what about GraphX? Will it get the same level of attention that SQL, MLlib and Spark core have received recently? E.g. is adding support for Gremlin (or any other graph query language) on the roadmap? Is an R API for GraphX planned?
p.s. when is GraphX planned to exit Alpha?
in any case, great product, nice usage of Akka and Scala, and a very intuitive API. I feel lucky to be working with it on a daily basis.
I guess the DataFrame API needs to be spelled out for me. Does this mean RDDs will be deprecated in the future if the new DataFrame is faster? As someone whose gateway drug into programming was R, it's been fun to watch data frames spread across programming languages. I'm a huge fan of Python's Pandas library and very interested in Spark.
The DataFrame is an evolution of the RDD model, where Spark knows explicit schema information. The core Spark RDD API is very generic and assumes nothing about the structure of the user's data. This is powerful, but ultimately the generic nature imposes limits on how much we can optimize.
DataFrames impose just a bit more structure: we assume that you have a tabular schema, named fields with types, etc. Given this assumption, Spark can optimize a lot of internal execution details, and also provide slicker APIs to users. It turns out that a huge fraction of Spark workloads fall into this model, especially since we support complex types and nested structures.
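To give a flavour of what that buys you, here's a minimal Scala sketch (the people.json file and its fields are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("df-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // The schema (field names and types) is inferred from the JSON,
    // so the optimizer can see exactly what the query touches.
    val people = sqlContext.jsonFile("people.json")

    people.filter(people("age") > 21)
          .groupBy("age")
          .count()
          .show()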
Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it really is necessary to drop into that lower level API. But I do anticipate that within a year or two most Spark applications will let DataFrames do the heavy lifting.
In fact, DataFrames and RDDs are completely interoperable; either can be converted to the other. This means that even if you don't want to use DataFrames you can benefit from all of the cool input/output capabilities they have, even just to create regular old RDDs.
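Continuing the sketch above, the round trip looks roughly like this (Person is just a made-up case class):

    import sqlContext.implicits._

    case class Person(name: String, age: Int)

    // RDD -> DataFrame: an ordinary RDD of case classes picks up a schema via toDF()
    val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
    val df  = rdd.toDF()

    // DataFrame -> RDD: drop back to the lower-level API whenever you need it
    val rows  = df.rdd                   // RDD[Row]
    val names = rows.map(_.getString(0)) // back to plain RDD operations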
Out of curiosity, is anyone using Spark in production? We're evaluating whether we should invest in Hadoop or Spark. They're certainly not mutually exclusive, but I would rather invest fully in Spark than have infrastructure split between Spark and Hadoop.
Spark has a much nicer API than Hadoop and can, in theory, give you significant speedups on iterative in-memory workloads. On the other hand, I've had a terrible experience with the stability and debuggability of previous versions of Spark. They do seem to be improving rapidly, so it's hard to recommend anything other than trying it out and seeing if Spark works for your needs.
(Also, you probably know this already, but many people don't really have data big enough to necessitate a distributed framework. If your datasets are counted in gigabytes, you can do everything more simply on one machine and/or with a traditional database)
I believe one of the key advantages of Spark over Hadoop is being able to run the full stack in a small environment (a single machine) and do all the coding there, without needing a cluster just for development.
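For instance, the only thing that really changes between a laptop and a cluster is the master URL (a minimal sketch; the file name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" runs everything in a single JVM using all cores;
    // for a real cluster you'd point the master at e.g. spark://... or YARN instead.
    val conf = new SparkConf().setAppName("dev-job").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val wordCounts = sc.textFile("sample.csv")
      .flatMap(_.split(","))
      .map(field => (field, 1))
      .reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)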
This is why I like Cascading [1]. It provides a higher-level API on Hadoop, and it also works in local mode with little change. I've actually used it to do transformation work from local files (CSV), join them into structured documents and dump them into ArangoDB. I liked it so much I wrote a third-party library for working with ArangoDB in Hadoop [2].
This is a big sell for me. There are some small catches, though. For example, for operations requiring an associative function (such as reduceByKey), the need for associativity may not show up until the data set becomes large enough to be split across multiple workers, so on my team our testers specifically check that each reduce function's associativity has been demonstrated in a unit test.
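To make that concrete, here's a made-up illustration (assuming an existing SparkContext sc): subtraction is not associative, so reduceByKey with it looks fine on one partition but becomes partition-dependent once the data is split:

    val data = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)))

    // One partition: values are combined in a single order, so it always "works"
    data.coalesce(1).reduceByKey(_ - _).collect()

    // Several partitions: partial results are merged in whatever grouping the
    // shuffle produces, so (a - b) - c vs a - (b - c) can give different answers
    data.repartition(3).reduceByKey(_ - _).collect()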
Yep, we're using Spark 1.1 currently for aggregating log data on a one-big-bang-per-day basis, and I'm currently experimenting with Spark Streaming as well. I'd say go with Spark if it's greenfield dev; the API is far nicer, and it's far less fiddly to work with.
Spark uses a lot of Hadoop under the covers, so you still benefit from that ecosystem.
What I really like about it is that it can be easily unit and integration tested.
Interesting! Could you say more about the unit and integration testing? E.g., what sort of things you find it useful to test, and particular toolkits or approaches you like?
Well, we unit test each of the individual functions, which typically do mapping, reducing or filtering.
The only thing special we do here is explicitly test that any reduce function which is required to be associative is actually associative, due to a dumb bug I wrote once.
With the integration test, we have structured our code so that the processing occurs in a function that takes an RDD and returns an RDD. We then start a SparkContext in local mode[1], create an RDD with test data and can easily test that our processing produces the correct results.
We're just using JUnit for this, as we're largely a Java shop, so while we're coding in Scala for the cleaner API, we haven't jumped into the Scala ecosystem fully.
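In case it's useful, the shape of it is roughly this (a minimal sketch with made-up names):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    // Production code: pure RDD-in, RDD-out, so it knows nothing about where data lives
    object LogAggregator {
      def errorCountsByHost(lines: RDD[String]): RDD[(String, Int)] =
        lines.filter(_.contains("ERROR"))
             .map(line => (line.split(" ")(0), 1))
             .reduceByKey(_ + _)
    }

    class LogAggregatorTest {
      private var sc: SparkContext = _

      @Before def setUp(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      @After def tearDown(): Unit = sc.stop()

      @Test def countsErrorsPerHost(): Unit = {
        val input  = sc.parallelize(Seq("host1 ERROR boom", "host1 INFO ok", "host2 ERROR bad"))
        val result = LogAggregator.errorCountsByHost(input).collect().toMap
        assertEquals(Map("host1" -> 1, "host2" -> 1), result)
      }
    }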
We also run end-to-end tests on our cluster using a subset of production data stored on S3 (Spark workers have to read/write from a distributed file system, and S3 is one that the Hadoop ecosystem supports), and we just verify the output against expectations derived from crunching that same subset via traditional means.
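The S3 bit is really just a path scheme going through the Hadoop filesystem layer. Reusing the made-up LogAggregator from the sketch above, it looks something like:

    // Credentials come from the Hadoop config (fs.s3n.awsAccessKeyId /
    // fs.s3n.awsSecretAccessKey); the bucket and paths here are made up.
    val logs    = sc.textFile("s3n://my-bucket/logs/2015-03-01/*")
    val results = LogAggregator.errorCountsByHost(logs)

    results.map { case (host, n) => host + "\t" + n }
           .saveAsTextFile("s3n://my-bucket/output/2015-03-01")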
Hope that helps! I found Spark very easy to get up and running with, you can do a lot of experimentation in your IDE, and when you want to try a cluster, it ships with some convenience scripts that make it very easy to start a cluster on AWS.
We are using Spark in production at a very large enterprise. One thing that has really helped us is Spark Job Server from Ooyala. I love the way you can share SparkContexts and just track what is happening.
We plan to... so far so good. Not in production yet, though. We focus more on GraphX, which is even less mature than Spark. We feel pretty comfortable with Spark core...
Spark on YARN has been very stable and zero hassle for us. Run one command and it is deployed around the cluster and up and running. For performance, Spark absolutely destroys stock MapReduce/Tez, especially if, like us, you have a cluster with lots of basically unused RAM.
And Spark SQL, Spark Shell are both fantastic additions to the Hadoop ecosystem.