> The store will still take up throughput even if nothing depends on it right no...

dzaima · 2024-07-29T20:20:50 1722284450

Sorry, I mixed things up: stores are limited to one 32-byte store per cycle (it's loads that get two 32-byte accesses per cycle). Also worth noting is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (especially with it pointing at L3; or it might not be important, I don't know exactly how it interacts with things).
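To put rough numbers on that, here is a back-of-the-envelope sketch using only the figures quoted above (one 32-byte store per cycle, two 32-byte loads per cycle, 64 store-queue entries of 32 bytes). The function name and the 4096-byte buffer are made up for illustration, and real throughput depends on much more than these peak rates.

```python
# Peak rates as quoted in the comment above; everything else is illustrative.
STORE_BYTES_PER_CYCLE = 32        # one 32-byte store per cycle
LOAD_BYTES_PER_CYCLE = 2 * 32     # two 32-byte loads per cycle
STORE_QUEUE_ENTRIES = 64
STORE_QUEUE_ENTRY_BYTES = 32


def min_cycles(num_bytes, bytes_per_cycle):
    """Lower bound on cycles to move num_bytes at a given peak rate (ceiling division)."""
    return -(-num_bytes // bytes_per_cycle)


# Total data the store queue can hold in flight at once.
store_queue_bytes = STORE_QUEUE_ENTRIES * STORE_QUEUE_ENTRY_BYTES

# Hypothetical 4096-byte buffer: best-case cycles to write it vs. read it.
store_cycles = min_cycles(4096, STORE_BYTES_PER_CYCLE)
load_cycles = min_cycles(4096, LOAD_BYTES_PER_CYCLE)

print(store_queue_bytes, store_cycles, load_cycles)
```

So under these assumptions the store side is the tighter limit by a factor of two, and the store queue covers about 2 KiB of pending writes.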

The uiCA link was just to show how out-of-order execution works; the exact timings don't apply, as they're for Tiger Lake, not Zen 4. My point is that your diagram of the gaps between instructions is completely meaningless with OoO execution, as those gaps will be filled with instructions from another iteration of the loop or from other surrounding code.
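A toy model can illustrate the gap-filling argument. This is only a sketch under made-up assumptions, not a model of Zen 4 or Tiger Lake: each loop iteration is a chain of dependent instructions with a fixed latency, iterations are independent of each other, the machine issues at most one instruction per cycle, and `window` (a hypothetical parameter) is how many consecutive iterations it may pick from.

```python
def simulate(num_iters, chain_len, latency, window):
    """Cycles until the last result is ready on a toy 1-issue-per-cycle machine."""
    next_idx = [0] * num_iters   # next instruction to issue in each iteration
    ready_at = [0] * num_iters   # earliest cycle that instruction may issue
    oldest = 0                   # oldest iteration with unissued instructions
    cycle = 0
    finish = 0
    while oldest < num_iters:
        # Issue the oldest ready instruction among the iterations in the window.
        for it in range(oldest, min(oldest + window, num_iters)):
            if next_idx[it] < chain_len and ready_at[it] <= cycle:
                next_idx[it] += 1
                ready_at[it] = cycle + latency
                finish = max(finish, cycle + latency)
                break
        # Retire iterations that have issued all of their instructions.
        while oldest < num_iters and next_idx[oldest] == chain_len:
            oldest += 1
        cycle += 1
    return finish


# 4 iterations, each a chain of 3 dependent instructions with 4-cycle latency.
no_overlap = simulate(4, 3, 4, window=1)    # only one iteration visible at a time
with_overlap = simulate(4, 3, 4, window=4)  # all four iterations visible

print(no_overlap, with_overlap)
```

With a window of one iteration the latency gaps stay empty; with all four iterations visible, the same gaps are filled by instructions from the other iterations and the total cycle count drops sharply.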

Clang is extremely unroll-happy in general; from its perspective, unrolling is ~free to do, has basically zero downsides (other than code size, but no one's benchmarking that too much), and can indeed sometimes improve things slightly.

kolbe · 2024-08-01T15:08:30 1722524910

Is there a limit to how much the OoO execution buffer can handle? I don't want to dig up the benchmarks again, but I remember that for very long calculations (like computing a bunch of exp(x[i]), each of which boils down to about 40 AVX512 instructions), the unrolling+reordering was very effective.

dzaima · 2024-08-02T00:51:17 1722559877

Zen 4 has two 32-entry schedulers for SIMD, which I think is the relevant buffer here, and which would get mostly exhausted by one iteration of a 40-instruction loop, whereas the 5-instruction body here should easily fit many iterations. (Many of my numbers in this thread, including these, are coming from https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...)

I think, at least. I'm at the limit of my knowledge now; it took me a bit to figure out whether the reorder buffer or the scheduler is the relevant thing here, and I'm still not too sure. Either way, 32 is the smallest number of all for Zen 4, and it still fits 6 iterations of the 5-instruction loop (6 × 5 = 30 entries), which it should happily reorder. (The per-iteration-overhead scalar code goes to separate schedulers.)
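The capacity arithmetic above can be spelled out as a quick sketch. It assumes, as the comment does, that scheduler entries are the limiting resource and that every instruction in the loop body occupies one entry; the helper name is made up for illustration.

```python
SCHEDULER_ENTRIES = 32   # one Zen 4 SIMD scheduler, per the figure quoted above
NUM_SCHEDULERS = 2       # two such schedulers


def iterations_in_flight(entries, instrs_per_iter):
    """Whole loop iterations that fit in `entries` scheduler slots."""
    return entries // instrs_per_iter


# 5-instruction loop body from this thread vs. the ~40-instruction exp() body.
small_one = iterations_in_flight(SCHEDULER_ENTRIES, 5)
small_both = iterations_in_flight(NUM_SCHEDULERS * SCHEDULER_ENTRIES, 5)
large_one = iterations_in_flight(SCHEDULER_ENTRIES, 40)
large_both = iterations_in_flight(NUM_SCHEDULERS * SCHEDULER_ENTRIES, 40)

print(small_one, small_both, large_one, large_both)
```

Under these assumptions the short body leaves the hardware several iterations to reorder without any unrolling, while the 40-instruction body fits about one iteration at best, which is consistent with the earlier observation that manual unrolling helped there.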