It's not clear how much of this improvement comes from reductions in cloud pricing rather than from algorithms and design decisions. They've documented things like using Netty to cut network latency, avoiding GC pauses, and getting better at tuning Spark, but it'd be interesting if the team could go back and rerun the benchmark on the same infrastructure as their 2014 entry, for a code-vs-code comparison that separates engineering improvements from economies of scale.
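For what it's worth, I don't know how they avoided GC specifically, but at least part of that angle is exposed as ordinary Spark configuration. A minimal sketch (not their actual setup; the off-heap size here is made up):

    // Hypothetical sketch: push sort/shuffle buffers off the JVM heap so
    // they're invisible to the garbage collector. Values are illustrative.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cloudsort-sketch")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.memory.offHeap.enabled", "true")  // keep buffers out of GC's reach
      .config("spark.memory.offHeap.size", "8g")       // per-executor off-heap budget (made up)
      .getOrCreate()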
I definitely agree that it'd be great to decouple the improvements in software from the drops in cloud pricing. In practice it's pretty difficult, because the Nanjing U/Alibaba team also spent a lot of time optimizing the software specifically for the AliCloud environment, and those optimizations might not carry over to Amazon EC2, which was the environment of the 2014 record.
This is a great task for a rigorous academic paper!
I agree, it's hard to understand the changes. Actually found some additional info as I was writing this, though: http://sortbenchmark.org/
The old record used 330 r3.4xlarge machines, which actually cost about the same today. It looks like the old record used a LOT more RAM than this one (40,260 GB vs 3,152 GB) with a similar CPU count (5,280 vs 4,728 cores). Having the RAM doesn't mean it was used, but unless they were using all of it I don't see why they would pick the more expensive r3 instances.
Edit: Follow-on, I think the cost savings are definitely in the low RAM usage. You can't really get that many cores in AWS without more than 3 times as much RAM.
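Quick back-of-the-envelope on the RAM-per-core gap, using the numbers above:

    // RAM per core for the two records, from the figures quoted above.
    val oldGbPerCore = 40260.0 / 5280.0  // 2014 record: ~7.6 GB/core (RAM-heavy r3.4xlarge)
    val newGbPerCore = 3152.0 / 4728.0   // 2016 record: ~0.67 GB/core
    println(f"2014: $oldGbPerCore%.2f GB/core, 2016: $newGbPerCore%.2f GB/core")

That's more than a 10x difference in RAM per core.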
Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+ years ago. We've still got a long way to go in OSS land to compete with the big boys.
When I asked why BigQuery doesn't do these sorts, the answer came straight from the post "Nobody really wants a huge globally-sorted output. We haven’t found a single use case for the problem as stated."
Do you think you could ask someone and find out the cluster sizes they used for those sorts? They mention "With the largest cluster at Google under our control", but it would be more interesting to have an idea of actual numbers, even if just an order of magnitude.
I'd argue this (performance/cost) is exactly the right metric to measure. One of the biggest benefits of the cloud is elasticity: in the on-premises world you have to provision for peak demand, and most of the time cluster utilization is pretty low.
The one-off cost is very high, and there is a higher ongoing maintenance cost as well. Most organizations are moving to the cloud because, in general, it makes more economic sense.
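A toy illustration of why low utilization hurts on-prem economics (all numbers are hypothetical):

    // Toy model: on-prem pays for peak capacity 24/7; an elastic cloud
    // deployment pays roughly for what it actually uses.
    val peakNodes    = 100    // cluster size needed at peak (made up)
    val utilization  = 0.20   // average utilization (made up)
    val nodeHour     = 0.50   // assume the same $/node-hour for both (made up)
    val hoursPerYear = 24 * 365

    val onPrem = peakNodes * nodeHour * hoursPerYear               // pay for peak all year
    val cloud  = peakNodes * utilization * nodeHour * hoursPerYear // pay for used capacity
    println(f"on-prem: $$$onPrem%.0f/yr, elastic: $$$cloud%.0f/yr") // 5x gap at 20% utilization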
I'm really not seeing most organizations moving to the cloud. It only makes economic sense for the smallest companies/organisations that can't afford maintenance costs. Most bigger ones won't risk losing their core data, and many can't go to the cloud for various reasons like legality, speed, and redundancy.
The title has been changed; the original post didn't reference CloudSort, and I'm not sure it included Apache Spark either. It was something along the lines of "New record, 1TB at $1.44".