Setting a new world record in CloudSort with Apache Spark (databricks.com)
86 points by rxin on Nov 15, 2016 | 21 comments



It's not clear how much of this improvement comes from reductions in pricing rather than from algorithms and design decisions. They've documented things like using Netty to reduce network latency, avoiding GC, and getting better with Spark, but it'd be interesting if the team could go back and rerun the benchmark on the same infrastructure as their 2014 run for a code-vs-code comparison, to separate engineering improvements from economies of scale.


I definitely agree that it'd be great to decouple the software improvements from the drops in cloud pricing. In practice it's pretty difficult, because the Nanjing U/Alibaba team also spent a lot of time optimizing the software specifically for the AliCloud environment, and those optimizations might not carry over to Amazon EC2, which was the environment of the 2014 record.

This is a great task for a rigorous academic paper!

Disclaimer: I wrote the blog post.


Is Spark faster than MemSQL?


MemSQL is a transactional database (system of record).

Spark is a way of processing data, ideally stored in a system of record (Hive/HDFS/S3/MemSQL etc).

They're not the same.
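To make the distinction concrete, here's a minimal PySpark sketch (the bucket, paths, and column names are invented for illustration): the data lives in a system of record, and Spark is just the engine that processes it.

    # Hypothetical example: data lives in S3 (the system of record);
    # Spark pulls it in, processes it, and writes a result back.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    orders = spark.read.parquet("s3a://example-bucket/orders/")
    users = spark.read.parquet("s3a://example-bucket/users/")

    # The processing (a join) happens in Spark, not in the store.
    result = orders.join(users, on="user_id", how="inner")
    result.write.parquet("s3a://example-bucket/orders_with_users/")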


There are similarities. A database is also a way of processing data.

For the kinds of processing both Spark and MemSQL do (e.g. a join operation), is Spark faster than MemSQL?


I agree, it's hard to understand the changes. I actually found some additional info as I was writing this, though: http://sortbenchmark.org/

The old record was set on 330 r3.4xlarge machines, which would actually cost about the same today. It looks like the old record used a LOT more RAM than this one (40,260 GB vs 3,152 GB) with a similar number of CPUs (5,280 vs 4,728 cores). Having the RAM doesn't mean it was all used, but unless they were using it, I don't see why they would have picked the more expensive r3 instances.

Edit: Following on, I think the cost savings are definitely in the low RAM usage. You can't really get that many cores on AWS without also getting more than three times as much RAM.
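A quick back-of-the-envelope check on the RAM-per-core gap, using only the figures quoted above (a sketch, not official numbers):

    # RAM per core for the two configurations mentioned above.
    old_ram_gb, old_cores = 40260, 5280   # 2014 record: 330 r3.4xlarge
    new_ram_gb, new_cores = 3152, 4728    # 2016 record

    print(old_ram_gb / old_cores)  # ~7.6 GB of RAM per core
    print(new_ram_gb / new_cores)  # ~0.67 GB of RAM per core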


Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+ years ago. We've still got a long way to go in OSS land to compete with the big boys.


This post tells the "History of massive-scale sorting experiments at Google"

- https://cloud.google.com/blog/big-data/2016/02/history-of-ma...

When I asked why BigQuery doesn't do these sorts, the answer came straight from the post: "Nobody really wants a huge globally-sorted output. We haven’t found a single use case for the problem as stated."

These accomplishments are awesome nevertheless!

Disclaimer: I'm Felipe Hoffa, and I work for Google (http://twitter.com/felipehoffa).


Do you think you could ask someone and find out the cluster sizes they used for those sorts? They mention "With the largest cluster at Google under our control", but it would be more interesting to have an idea of actual numbers, even if just an order of magnitude.


I could ask - but then I wouldn't be able to publish unpublished numbers on my own (if I want to keep my job).

:)


A price record, not a performance one.

Also, seeing how expensive it is to sort 100 TB ($144), you have to wonder why it wouldn't be better to do it on your own hardware.


Databricks is a cloud service. Publishing an on-premises approach would not benefit their business?


Which also makes it more of an ad than anything.


I'd argue this (performance/cost) is exactly the right metric to measure. One of the biggest benefits of the cloud is elasticity. In the on-premises world, one would have to provision for peak demand, and most of the time the cluster utilization rate is pretty low.


But in the cloud it's a cost for every iteration, versus a one-off cost for on-premises hardware. It doesn't take very long for on-premises to get cheaper.


The one-off cost is very high, and there are higher ongoing maintenance costs as well. Most organizations are moving to the cloud because, in general, it makes more economic sense.


I'm really not seeing most organizations move to the cloud. It only makes economic sense for the smallest companies/organisations that can't afford the maintenance costs. Most bigger ones won't risk losing their core data, and many can't go to the cloud for reasons like legal requirements, speed, and redundancy.


Take a look at the customers featured at re:Invent. Computing as a "utility" is the future, and the future is arriving fast.


I got excited, and then I saw that this was for sorting, not storage...


Apache Spark is agnostic to the storage layer.
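For instance, the same DataFrame read call works against different backends just by switching the URI scheme (placeholder paths; the matching connector still has to be on the classpath):

    # Same API, different storage layers -- only the URI scheme changes.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df_hdfs = spark.read.parquet("hdfs:///data/events/")        # HDFS
    df_s3 = spark.read.parquet("s3a://example-bucket/events/")  # Amazon S3
    df_local = spark.read.parquet("file:///tmp/events/")        # local disk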


The title has been changed; the original post didn't reference CloudSort, and I'm not sure if it included Apache Spark either. It was something along the lines of "New record, 1TB at $1.44".

Hence my confusion.



