The code is unfortunately not (yet) open source. The CPU with 50x is an SKX Gold...

menaerus · 2025-01-14T11:12:42

I'd be curious whether you measured the 50x on a single-core implementation, or is the algorithm distributed across multiple cores?

I ask because you say that the results are similar to Zen4, which would sorta imply that you ran and measured a single-core implementation? In multi-core load-store workloads, Intel loses a lot of bandwidth compared to Zen3/4/5, since there's a lot of contention going on due to Intel's cache architecture.