i'm using an m1 laptop, so my sgemm implementation relies on the arm neon vector extension, and i saw literally no performance improvement when i added prefetching, which could either be a result of the chip's own speculative optimisations or an issue in my implementation
but i have noticed some ppl saying that adding prefetching on x86 gives little performance improvement for sgemm too, bc it's a scenario where the next memory address that's gonna be used is easy for the hardware to predict
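For anyone following along, here is a minimal sketch of what "adding prefetching" to an sgemm inner loop can look like. It is plain scalar C rather than the Neon kernel described above, and `sgemm_prefetch` and `PREFETCH_AHEAD` are hypothetical names made up for illustration. `__builtin_prefetch` is the GCC/Clang builtin and is only a hint to the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in floats; an assumption, tune per machine. */
#define PREFETCH_AHEAD 64

/* C += A * B, all row-major: A is m x k, B is k x n, C is m x n. */
void sgemm_prefetch(size_t m, size_t n, size_t k,
                    const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    const size_t b_len = k * n;

    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            const float a = A[i * k + p];
            const size_t b_off = p * n;
            for (size_t j = 0; j < n; ++j) {
#if defined(__GNUC__) || defined(__clang__)
                /* Hint: fetch B a fixed distance ahead, for reading (0),
                   with high temporal locality (3). Stay inside the array. */
                if (b_off + j + PREFETCH_AHEAD < b_len)
                    __builtin_prefetch(&B[b_off + j + PREFETCH_AHEAD], 0, 3);
#endif
                C[i * n + j] += a * B[b_off + j];
            }
        }
    }
}
```

Note that B is walked strictly sequentially here. That is exactly the kind of access pattern a hardware prefetcher can already detect on its own, which would be consistent with the explicit hint making no measurable difference.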
Probably preaching to the choir here, but if you want high-performance sgemm on M-series, you should consider using the system-provided BLAS. It will invoke the dedicated matmul hardware on Apple CPUs, which has access to significantly higher memory and compute bandwidth than Neon. SIMD-based matmul on M1 only makes sense if your matrices are small and you need the result immediately (e.g. geometric transformations).
I hope that Apple will implement SVE/SME one day, which would allow us to leverage this hardware directly...
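As a rough illustration of the "system-provided BLAS" route, here is a sketch that calls `cblas_sgemm` from Apple's Accelerate framework. The wrapper name `sgemm_system` and the non-Apple fallback loop are assumptions added so the snippet builds anywhere. Whether a given call actually lands on the matmul hardware is up to the library, not something this code controls.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h> /* link with: -framework Accelerate */
#endif

/* C = A * B, all row-major: A is m x k, B is k x n, C is m x n.
   sgemm_system is a hypothetical wrapper name, not part of any library. */
void sgemm_system(int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    assert(m > 0 && n > 0 && k > 0);
#ifdef __APPLE__
    /* Standard CBLAS call: alpha = 1, beta = 0, no transposes.
       Leading dimensions are the row lengths in row-major layout. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
#else
    /* Portable fallback so the sketch still compiles off macOS;
       this is a naive loop, not a tuned kernel. */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[(size_t)i * k + p] * B[(size_t)p * n + j];
            C[(size_t)i * n + j] = acc;
        }
    }
#endif
}
```

On macOS this would be built with something like `cc demo.c -framework Accelerate` (file name hypothetical).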
wait, are you suggesting apple has some hardware level GEMM instructions that are undocumented or smth? also, what do you mean by system-provided BLAS? their accelerate framework?
I am talking about the matrix/vector coprocessor (AMX). You can find some reverse-engineered documentation here: https://github.com/corsix/amx
On M3 a single matrix block can achieve ~1 TFLOPS on DGEMM; I assume it will be closer to 4 TFLOPS for SGEMM. The Max variants have two such blocks. I didn't do precise benchmarking myself, but switching Python/R matrix libraries to use Apple's BLAS results in a 5-6x perf improvement on matrix-heavy code for me.
yep. they have a neural engine that is separate from the CPU and GPU that does really fast matmuls https://github.com/hollance/neural-engine. it's basically completely undocumented and for programs like blas it has more horsepower than all the cores combined.