It's not clear how much of this improvement comes from reductions in cloud pricing rather than from algorithms and design decisions. They've documented things like using Netty to cut network latency, avoiding GC pauses, and getting better at tuning Spark, but it'd be interesting if the team could go back and rerun the benchmark on the same infrastructure as their 2014 entry, for a code-vs-code comparison that separates engineering improvements from economies of scale.
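For what it's worth, I don't know how they avoided GC specifically, but at least part of that angle is exposed as ordinary Spark configuration. A minimal sketch (not their actual setup; the off-heap size here is made up):

    // Hypothetical sketch: push sort/shuffle buffers off the JVM heap so
    // they're invisible to the garbage collector. Values are illustrative.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cloudsort-sketch")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.memory.offHeap.enabled", "true")  // keep buffers out of GC's reach
      .config("spark.memory.offHeap.size", "8g")       // per-executor off-heap budget (made up)
      .getOrCreate()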
I definitely agree that it'd be great to decouple the improvements in software from the drops in cloud pricing. In practice it's pretty difficult, because the Nanjing U/Alibaba team also spent a lot of time optimizing the software specifically for the AliCloud environment, and those optimizations might not carry over to Amazon EC2, which was the environment of the 2014 record.
This is a great task for a rigorous academic paper!
I agree, it's hard to understand the changes. Actually found some additional info as I was writing this, though: http://sortbenchmark.org/
The old record used 330 r3.4xlarge machines, which actually cost about the same today. It looks like the old record used a LOT more RAM than this one (40,260 GB vs 3,152 GB) with a similar CPU count (5,280 vs 4,728 cores). Having the RAM doesn't mean it was used, but unless they were using all of it I don't see why they would pick the more expensive r3 instances.
Edit: Follow-on, I think the cost savings are definitely in the low RAM usage. You can't really get that many cores in AWS without more than 3 times as much RAM.
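Quick back-of-the-envelope on the RAM-per-core gap, using the numbers above:

    // RAM per core for the two records, from the figures quoted above.
    val oldGbPerCore = 40260.0 / 5280.0  // 2014 record: ~7.6 GB/core (RAM-heavy r3.4xlarge)
    val newGbPerCore = 3152.0 / 4728.0   // 2016 record: ~0.67 GB/core
    println(f"2014: $oldGbPerCore%.2f GB/core, 2016: $newGbPerCore%.2f GB/core")

That's more than a 10x difference in RAM per core.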
Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+ years ago. We've still got a long way to go in OSS land to compete with the big boys.
When I asked why BigQuery doesn't do these sorts, the answer came straight from the post "Nobody really wants a huge globally-sorted output. We haven’t found a single use case for the problem as stated."
Do you think you could ask someone and find out the cluster sizes they used for those sorts? They mention "With the largest cluster at Google under our control", but it would be more interesting to have an idea of actual numbers, even if just an order of magnitude.
I'd argue this (performance/cost) is exactly the right metric to measure. One of the biggest benefits of the cloud is elasticity: in the on-premises world you have to provision for peak demand, and most of the time cluster utilization is pretty low.
The one-off cost is very high, and there is a higher ongoing maintenance cost as well. Most organizations are moving to the cloud because, in general, it makes more economic sense.
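A toy illustration of why low utilization hurts on-prem economics (all numbers are hypothetical):

    // Toy model: on-prem pays for peak capacity 24/7; an elastic cloud
    // deployment pays roughly for what it actually uses.
    val peakNodes    = 100    // cluster size needed at peak (made up)
    val utilization  = 0.20   // average utilization (made up)
    val nodeHour     = 0.50   // assume the same $/node-hour for both (made up)
    val hoursPerYear = 24 * 365

    val onPrem = peakNodes * nodeHour * hoursPerYear               // pay for peak all year
    val cloud  = peakNodes * utilization * nodeHour * hoursPerYear // pay for used capacity
    println(f"on-prem: $$$onPrem%.0f/yr, elastic: $$$cloud%.0f/yr") // 5x gap at 20% utilization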
I'm really not seeing most organizations moving to the cloud. It only makes economic sense for the smallest companies/organisations that can't afford maintenance costs. Most bigger ones won't risk losing their core data, and many can't go to the cloud for various reasons like legality, speed, and redundancy.
The title has been changed; the original post didn't reference CloudSort, and I'm not sure it included Apache Spark either. It was something along the lines of "New record, 1TB at $1.44".