> The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.
Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.
That tool is nifty, but I couldn't really figure out why it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:
Unrolled 4x:
uiCA 6.00
llvm-mca 6.20
Regular:
uiCA 2.00
llvm-mca 2.40
LLVM pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still has a meaningful gain as well.
Sorry, mixed things up - store has a limit of one 32-byte store per cycle (load is what has two 32-byte/cycle). Of additional note is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (esp. with it pointing at L3; or it might not be important, I don't know how exactly it interacts with things).
The uiCA link was just to show how out-of-order execution works; the exact timings don't apply as they're for Tiger Lake, not Zen 4. My assertion being that your diagram of spaces between instructions is completely meaningless with OoO execution, as those empty spaces will be filled with instructions from another iteration of the loop or other surrounding code.
Clang is extremely unroll-happy in general; from its perspective, it's ~free to do, has basically zero downsides (other than code size, but noone's benchmarking that too much) and can indeed maybe sometimes improve things slightly.
Is there a limit to how much the OoO execution buffer can handle? I don't want to dig up the benchmarks again, but I remember for very long calculations (like computing a bunch of exp(x[i]), which boil down to about 40 AVX512 instructions), the unrolling+reordering was very effective.
Zen 4 has two 32-entry schedulers for SIMD, which I think is the relevant buffer here (many of my numbers in this thread, including these, are coming from https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...), which would get mostly exhausted by one iteration of a 40-instruction loop. Whereas the 5-instruction body here should easily manage many iterations.
I think, at least. Now I'm at the limit of my knowledge, it took me a bit to figure out whether the reorder buffer or scheduler is the relevant thing here, and I'm still not too sure. Though, either way, 32 is the smallest number of all for Zen 4, and still fits 6 iterations of the 5-instruction loop and should happily reorder them. (and the per-iteration-overhead scalar code goes to separate schedulers)
Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.
That tool is nifty, but I couldn't really figure out why it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:
Unrolled 4x: uiCA 6.00 llvm-mca 6.20
Regular: uiCA 2.00 llvm-mca 2.40
LLVM pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still has a meaningful gain as well.