It kinda depends. I wouldn't be surprised if properly optimized AVX2 could get the same performance, since it looks like the operation is memory-bottlenecked.
Nah, AVX-512 is the more performant design thanks to its support for masking. It doesn't, in fact, depend on anything. Anyone who rates AVX2 as equal to or better than AVX-512 has never used either of them.
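For anyone wondering what "masking" buys you in practice, here is a minimal sketch, assuming x86 with GCC or Clang. The function names (`add_scalar`, `add_avx512`, `add_arrays`) are made up for illustration; the `target` attribute and `__builtin_cpu_supports` are compiler-specific. It only shows the leftover elements being handled with one masked load/store instead of a scalar cleanup loop, and makes no claim about speed:

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar reference, also the fallback when AVX-512F is not available. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* AVX-512 version (hypothetical helper): full 16-float blocks first,
 * then one masked iteration for the remaining 1..15 elements. */
__attribute__((target("avx512f")))
static void add_avx512(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
    if (i < n) {
        /* Low (n - i) bits set: only those lanes are read and written. */
        __mmask16 k = (__mmask16)((1u << (n - i)) - 1u);
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(dst + i, k, _mm512_add_ps(va, vb));
    }
}

/* Runtime dispatch so the same binary also runs on CPUs without AVX-512. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    if (__builtin_cpu_supports("avx512f"))
        add_avx512(dst, a, b, n);
    else
        add_scalar(dst, a, b, n);
}
```

Masked-off lanes are never touched in memory, so the tail iteration does not read or write past the end of the arrays. Whether this actually beats a well-tuned AVX2 loop on a memory-bound workload is something you would have to measure.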
Which is why I'm really looking forward to Intel's upcoming AVX10 / APX extensions, created to alleviate the P/E core issues they had in Alder Lake (really, what the fuck was Intel thinking back then?). It's still inferior to full AVX-512 since you don't have 512-bit wide registers, but you still get 32 256-bit registers to play with, plus the rich instruction set of AVX-512 (with all that masking and gather/scatter ops and that jazz…), so it's much better than AVX2.
Now, hearing that the upcoming Ryzen is also going with the hybrid P/E core approach, I wonder if AMD is preparing to adopt these instructions as well…
The thing about Alder Lake's AVX-512 is (well, was, but I still occasionally run it with AVX-512 enabled for some specific tasks) that Alder Lake is the only consumer-grade CPU that supports AVX-512 FP16. It's very performant for ML tasks. Not like a real GPU, obviously, but much, much easier to use.
Besides, Intel does not guarantee that AVX10 will be compatible with AVX512/256.