How much bandwidth does the L2 have to give, anyway?

BeeOnRope · on Nov 20, 2018

The code to reproduce my results can be found at [1] (Linux only for now, but porting is welcome). Basically `run-all.sh` generates the results and `plot-all.sh` plots them.

It would be especially interesting to see Zen2 results since that's the first AMD chip with 256-bit wide loads executed in a single op, so the first one that can do better than 16 bytes per cycle from any level of the cache.

[1] https://github.com/travisdowns/uarch-bench/tree/master/scrip...

pedrocr · on Nov 21, 2018

> Basically `run-all.sh` generates the results and `plot-all.sh` plots them.

I've done things like this in the past and it's awesome. It would be great if more and more research was done like this so reproducing and extending results was much easier. Congratulations.

BeeOnRope · on Nov 21, 2018

Thanks!

So I started to do that for myself, to save time and make everything reproducible. I found that when I wanted to make a small change to anything I'd have to go back and dig up the old command lines, and sometimes I couldn't reproduce my old results.

Recording the result generation in the script, including things like turning the prefetchers on and off, made things reproducible for me. It also encouraged experimentation, since it was very easy to regenerate all the results after any change.

Then, once you make that script for yourself, a nice side effect is that everyone else can use it too and you can skip a lot of the description of your method: the script is a kind of self-documenting explanation of how you got your results.
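To make the idea concrete, here is a minimal sketch of such a driver, not the actual `run-all.sh`. It runs a fixed list of steps and keeps a record of the exact command lines. The step names and the placeholder commands are hypothetical; real steps would invoke the benchmark and the plotting tool.

```python
import json
import subprocess
import sys


def run_steps(steps):
    """Run each (name, argv) step in order and record exactly what was run."""
    records = []
    for name, argv in steps:
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        records.append({"name": name, "argv": argv, "stdout": done.stdout})
    return records


# Hypothetical steps: the placeholder commands only print a line each.
STEPS = [
    ("benchmark", [sys.executable, "-c", "print('size_kib,cycles_per_line')"]),
    ("plot", [sys.executable, "-c", "print('plotted')"]),
]

if __name__ == "__main__":
    # The JSON record doubles as documentation of how the results were made.
    print(json.dumps(run_steps(STEPS), indent=2))
```

Because every command line lives in `STEPS`, rerunning after a small change means editing one entry rather than digging up old shell history.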

pedrocr · on Nov 21, 2018

That was my experience as well. Experimentation becomes much easier, and so does any analysis you want to rerun every year/month/etc. as more data comes out. It's also great when you have computationally expensive steps: you can issue a build-the-world command at the end of the day and have the computer regenerate everything overnight.

This was my version of that:

https://github.com/pedrocr/codecomp

Ruby made for a good way to combine scripting with more declarative, build-like tools (e.g., rake). I also automated the plotting like you did by embedding R snippets. There's probably space for a good framework for this, to be the Rails of scientific workflows, as it were. Integrate nicely with R/LaTeX/etc. for bonus points. Maybe a procrastinating PhD student somewhere will make a name for himself doing this :)

Tuna-Fish · on Nov 21, 2018

Can't Zen1 do 2x 16-byte load ops in one cycle, for 32 bytes per cycle from the L1?

BeeOnRope · on Nov 21, 2018

Yes, it can. That was a typo, I meant to say "... can do better than 32 bytes per cycle".
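The arithmetic behind these figures is just load ops issued per cycle times bytes per load. A minimal sketch using only the numbers quoted in this exchange (the function name is made up for illustration):

```python
def peak_load_bytes_per_cycle(loads_per_cycle, bytes_per_load):
    """Upper bound on load bandwidth: loads issued per cycle times load width."""
    return loads_per_cycle * bytes_per_load


# Zen1 as described above: two 16-byte load ops per cycle.
zen1_l1 = peak_load_bytes_per_cycle(2, 16)

# One 256-bit (32-byte) load per cycle only matches that figure, so beating
# 32 bytes per cycle needs more than one such load per cycle.
one_wide_load = peak_load_bytes_per_cycle(1, 32)

print(zen1_l1, one_wide_load)  # 32 32
```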

wiz21c · on Nov 21, 2018

FTA :

>>> unless somehow your workload really, really wants optimized L2 access.

It's been 20 years since I last optimized assembly code at that level (think VTune, pipelining, etc.). What kind of workload needs that kind of optimisation nowadays?

cbzoiav · on Nov 21, 2018

Drivers. High frequency / low latency trading. Networking/telecoms equipment. Core routines (e.g., if 10% of your workload is running a single block and this runs at scale, then your savings can far outweigh the cost). Cheap electronics produced at scale (if you sell enough of them, saving 1c on hardware outweighs the engineering effort).

BeeOnRope · on Nov 21, 2018

HPC applications for sure. Optimizing the core loops of anything that runs on a supercomputing cluster pays back pretty quickly.

zbjornson · on Nov 21, 2018

Neat trick! I'm interested to know if this holds up on server configurations or is unique to client. In addition to the published architecture differences between the two (size and exclusive layout), the prefetcher seems to behave differently in the forward and reverse directions on SKL-SP.

BeeOnRope · on Nov 21, 2018

Yes, it also applies on server architectures that derive from Skylake (aka SKX), and I've tested it there. However, it is probably of less utility there since currently all Skylake server architectures support AVX-512, which lets you load an entire 64-byte cache line in a single load, and SKX can do these at a rate of about 1 per cycle - so you can already do better than the technique described here simply by using AVX-512.

As I mentioned near the end of the wiki page, this might still be useful in scenarios where you don't want to use AVX-512 for some reason.

servrite · on Nov 21, 2018

One reason to avoid copious AVX-512 instructions is that using them heavily is guaranteed to cause the processor to reduce its clock rate (see the Optimization manual for the crossover points when workloads make heavy, or even medium, use of AVX-512 or, in some cases, AVX2).
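A rough way to reason about that trade-off: bandwidth in bytes per second is bytes per cycle times clock rate, so wider loads only win if the clock does not drop too far. The sketch below uses made-up numbers throughout (48 bytes per cycle for the narrower code path and a 3.0 GHz clock are placeholders, not measurements from the article or the manual).

```python
def bandwidth_gb_per_s(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times cycles per nanosecond gives GB/s."""
    return bytes_per_cycle * clock_ghz


def break_even_clock_ghz(wide_bytes_per_cycle, narrow_bytes_per_cycle, narrow_clock_ghz):
    """Clock at which the wide code path merely matches the narrow one."""
    return narrow_bytes_per_cycle * narrow_clock_ghz / wide_bytes_per_cycle


# Hypothetical inputs: 64 B/cycle with wide loads, 48 B/cycle without them,
# and 3.0 GHz when no wide instructions are in use.
threshold = break_even_clock_ghz(64, 48, 3.0)
print(threshold)  # 2.25
```

With these placeholder inputs, the wide-load version comes out ahead only while the reduced clock stays above 2.25 GHz.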