Hacker News new | past | comments | ask | show | jobs | submit login

Ah, so you're saying as this is network bound, you want to launch much fewer wimpy nodes in exchange for a few bigger ones (those w/ SSDs -- over 1PB worth!). If it was memory-bound, you wouldn't care how they're spread as long as it was still in-memory.

Another way of looking at this is performance per watt or dollar. The r3.2 has 60GB, so comparing to that, Spark cost the same ~$1.4K while giving a 3X speedup. (Or on host that charges per minute, it'd be the same performance at 3X cheaper.)

This is me not knowing the space: would MR (or more modern things like Tez) perform worse on this HW setup, or is this a reminder that hardware/config tuning matters?




Yes absolutely. If I can get a single machine with 200TB of SSDs, that'd have been great :)

But as soon as we have more than 1 node, then having more nodes is better. We can actually demonstrate this quantitatively. We are required to replicate the output, which means 100 TB of data would generate 100 TB of network for replication, and 100 * (N-1)/N TB of network for shuffle, where N = num nodes. That is the overall 200 - 100/N in network. Assuming each node has 1GB/s network, then you'd need (200 - 100/N) / N seconds just to transfer the data across network, i.e. the optimal number of nodes is the one that gives you the lowest (200/N - 100/N^2).

The problem is with last year's MR run was that it wasn't saturating the network at all. It had roughly 1.5GB/s/node HDDs, and the overall network throughput was probably around 20MB/s/node when they were using 10Gbps network (I'm assuming only half of the time is doing network. If they were doing network the full time, then network throughput was at ~10MB/s/node).

Had we been using the same HDDs last year's entry had, our map phase would slow down by about 2x, and the reduce phase shouldn't change. This would mean a total run time of less than 30 mins on ~200 nodes. Still way better.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: