Hacker News new | past | comments | ask | show | jobs | submit login

The code is unfortunately not (yet) open source. The CPU with 50x is an SKX Gold, and it is similar for Zen4. I compute this ratio as #FMA * 4 / total system memory bandwidth. We are indeed not fully memBW bound :)



I'd be curious if you measured 50x on a single core implementation or is the algorithm distributed to multiple cores?

I ask because you say that the results are similar to Zen4 so this would sorta imply that you run and measure single-core implementation? Intel in multi-core load-store looses a lot of bandwidth when compared to Zen3/4/5 since there's a lot of contention going on due to Intel cache architecture.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: