
I can chime in and say that memory bandwidth is our primary bottleneck on servers at my current company.



The only situation where I imagine that could happen is if you need to apply a small number of instructions to a massive data set that's fully loaded in memory. What sort of application are you running? If you can say, obviously.


For sufficiently large values of "small". An 8-core CPU @ 3 GHz runs 24 billion clock cycles per second in aggregate, and it can execute multiple instructions in each of those cycles.

Two-channel memory systems, assuming DDR4-2400, can do about 40 GB/sec. Threadripper is about double that (as is Skylake-X, but NOT Kaby Lake-X). The new Skylake Xeons are 6-channel (about 120 GB/sec) and the new AMD Epyc is 8-channel (about 160 GB/sec).

Assuming a perfectly sequential access pattern and something simple like a = b + c (which reads 16 bytes and writes 8 bytes, so 24 bytes of traffic per operation), you can run about 1.6 billion of those a second at 40 GB/sec.
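
A minimal sketch of that kind of bandwidth-bound loop (a STREAM-style "add" kernel; the array size is made up for illustration):

    /* Sequential "add" kernel: each iteration reads 16 bytes (b[i], c[i])
       and writes 8 bytes (a[i]), i.e. 24 bytes of memory traffic per element. */
    #include <stdlib.h>

    #define N (1u << 26)  /* illustrative: ~512 MiB per array of doubles */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + c[i];  /* at ~40 GB/s: 40e9 / 24 ≈ 1.6e9 iterations/sec */
        return (int)a[0];        /* keep the result live */
    }

No matter how fast the cores are, that loop tops out at roughly 1.6 billion elements/sec on a 40 GB/sec machine.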

So to not be memory bound you need to run an extra 15 instructions per memory operation (24 billion instructions/sec ÷ 1.6 billion operations/sec)... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If you run fewer than that, you are memory bound.

Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next: say a database index, binary tree, or linked list. Instead of getting 8 bytes per transfer at 2400 MT/s you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs 8 bytes per 70 ns, i.e. 168 times worse.
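
For illustration, the dependent-load pattern looks like this (a sketch, assuming the nodes are already scattered through a heap much larger than cache):

    /* Pointer chasing: the address of the next load isn't known until the
       current load returns, so each step pays the full ~70 ns main-memory
       latency instead of streaming at the DDR transfer rate. */
    struct node { struct node *next; };

    struct node *chase(struct node *p, long steps) {
        while (steps--)
            p = p->next;  /* serialized: no prefetch, no overlap */
        return p;         /* return the result so the loop isn't optimized away */
    }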

Suddenly instead of needing 15 extra instructions you need about 2,500 instructions per memory load (15 × 168 ≈ 2,500), all without an extra cache miss.

So as you can see it can be quite easy to be memory limited. Sure, some things do an amazing amount of calculation on very little data. But many things are data intensive, which justifies the large-RAM, high-memory-bandwidth machines that make up pretty much all servers shipped today. Memory bandwidth is expensive (CPU package, pins, sockets, motherboard traces, additional motherboard layers, power, etc.), but well justified in many cases.


You are again conflating high bandwidth and low latency.

> So to not be memory bound you need to run an extra 15 instructions per memory operation (24 billion instructions/sec ÷ 1.6 billion operations/sec)... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If you run fewer than that, you are memory bound.

Yes, if you do operations in a linear access pattern like this, performance will be bound by bandwidth. This is the situation I was referring to above.

> Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next: say a database index, binary tree, or linked list. Instead of getting 8 bytes per transfer at 2400 MT/s you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs 8 bytes per 70 ns, i.e. 168 times worse.

> Suddenly instead of needing 15 extra instructions you need about 2,500 instructions per memory load (15 × 168 ≈ 2,500), all without an extra cache miss.

> So as you can see it can be quite easy to be memory limited.

No, in this case you will not be limited by bandwidth, but by latency. Having more bandwidth will do nothing, because at 8 bytes per 70 ns you're only moving about 109 MiB/s. If 100% of the memory accesses are cache misses (they won't be) and the application uses all cores, then yes, doubling the number of memory channels will double the multi-threaded performance (unless channel count already equals core count), although single-threaded performance will stay unchanged. Additionally, for this particular workload you could get away with relatively low-frequency RAM, which won't significantly affect latency but will lower the total bandwidth (still way higher than 109 MiB/s) and will be cheaper.


Er, when I say "memory limited" you say I'm wrong because it's latency limited rather than bandwidth limited. I think we are violently agreeing: latency limited is just one specific form of memory limited.

In my testing (to my surprise) it turns out that throughput keeps increasing up to about 2 times the number of memory channels. So with 8 memory channels, throughput keeps increasing up to 16 threads, which upon reflection makes sense. It generally takes 25-40 ns for a load to miss through L1, L2, and L3 and reach the memory controller. So with 16 outstanding misses and 8 channels you end up with all 8 channels busy and 8 more misses queued in the memory controller, and throughput approximately doubles compared to just 8 threads.
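
The kind of microbenchmark I mean looks roughly like this (a sketch using pthreads; the sizes, step counts, and shuffle are illustrative, not the exact test I ran):

    #include <pthread.h>
    #include <stdlib.h>

    struct node { struct node *next; };

    enum { NODES = 1 << 23, STEPS = 1 << 26, THREADS = 16 };

    /* Build a randomly permuted ring (Sattolo's algorithm) so hardware
       prefetchers can't predict the next address. ~64 MiB per ring. */
    static struct node *make_ring(void) {
        struct node *n = malloc(NODES * sizeof *n);
        for (int i = 0; i < NODES; i++)
            n[i].next = &n[i];
        for (int i = NODES - 1; i > 0; i--) {
            int j = rand() % i;
            struct node *t = n[i].next;
            n[i].next = n[j].next;
            n[j].next = t;
        }
        return n;
    }

    /* Each thread chases its own chain, so T threads keep up to T misses
       in flight: with THREADS = 16 and 8 channels, 8 misses are being
       serviced while 8 more wait queued in the memory controller. */
    static void *worker(void *arg) {
        struct node *p = arg;
        for (long i = 0; i < STEPS; i++)
            p = p->next;
        return p;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, worker, make_ring());
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;  /* time the run while varying THREADS to see the scaling */
    }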

In any case, I agree that single thread performance isn't improved by multiple channels and that latency limited workloads get a small fraction of the potential memory bandwidth.


> The only situation

There are many more:

- Caches/databases that should keep as much data in memory as possible (e.g. a 32 GB Elasticsearch instance actually gains a lot from "fast memory")

- In-Memory Databases

- In-memory RPC/message brokers, which of course have memory bandwidth as a limiting factor.

In all these cases memory bandwidth can be an important factor.


For all the cases you mention, the critical factor is the product of the average transaction size and the transaction count per second. As long as this value is smaller than the RAM bandwidth, the application will not be RAM bandwidth-bound.
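
To put made-up numbers on it: a broker moving 100k messages/sec at 4 KB each needs 100,000 × 4,096 bytes ≈ 0.4 GB/sec, two orders of magnitude below even a dual-channel desktop's ~40 GB/sec.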

Generally speaking, databases are kept in memory to minimize latency, not to maximize throughput. Bandwidth is not really a problem; having to update 10 GB/sec of a database would be highly unusual. Having to fetch data from random positions on a disk or SSD is much more common.

As for the message broker, it's not clear to me why bandwidth would "of course" be the limiting factor.


That's exactly the case.

I run dedicated gameservers. We preload "phases" which you can affect and transition to/from as a player.



