Hacker News

yeah, that's probably it

i'm using an m1 laptop, so my sgemm implementation relies on the arm neon vector extensions, and i saw literally no performance improvement when i added prefetching, which could either be a result of the cpu's hardware prefetchers already covering it or an issue in my own implementation

but i have noticed some people saying that adding software prefetching on x86 gives little performance improvement for sgemm too, because the access pattern is regular enough that the hardware prefetcher can easily predict the next memory address that's gonna be used
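To make the discussion concrete, here is a minimal sketch of what "adding prefetching" to an sgemm inner loop looks like. The function name `sgemm_prefetch` and the loop structure are my own illustration, not the commenter's code; `__builtin_prefetch` is the GCC/Clang builtin, and on hardware with a good stride prefetcher the hint is often redundant, which matches the observation above.

```c
#include <assert.h>

/* Naive row-major sgemm kernel (C += A*B for n x n matrices) with an
   explicit software prefetch of the next row of B. Whether the prefetch
   helps at all depends on the CPU's hardware prefetchers; on many chips
   this regular streaming pattern is already predicted perfectly. */
static void sgemm_prefetch(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            /* Hint the next row of B into cache: read access, high
               temporal locality. This may compile to a no-op. */
            if (k + 1 < n)
                __builtin_prefetch(&B[(k + 1) * n], 0, 3);
            float a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```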




Probably preaching to the choir here, but if you want high-performance sgemm on M-series, you should consider using the system-provided BLAS. It will invoke the dedicated matmul hardware on Apple CPUs, which has access to significantly higher memory and compute bandwidth than Neon. SIMD-based matmul on M1 only makes sense if your matrices are small and you need the result immediately (e.g. geometric transformations).
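For reference, "using the system-provided BLAS" here just means calling the standard CBLAS entry point; on macOS you include Accelerate and link with `-framework Accelerate`, and the call is dispatched to Apple's BLAS. A hedged sketch, with a tiny stand-in implementation (my own, supporting only the row-major no-transpose case used below) so the example also compiles and runs off-macOS:

```c
#include <assert.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   /* Apple's system BLAS (CBLAS interface) */
#else
/* Portable stand-in with the CBLAS sgemm signature, only so this sketch
   runs elsewhere; on Linux you would link a real CBLAS (e.g. OpenBLAS). */
enum CBLAS_ORDER { CblasRowMajor = 101 };
enum CBLAS_TRANSPOSE { CblasNoTrans = 111 };
static void cblas_sgemm(enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE ta,
                        enum CBLAS_TRANSPOSE tb, int m, int n, int k,
                        float alpha, const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc) {
    (void)order; (void)ta; (void)tb;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * lda + p] * B[p * ldb + j];
            C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
        }
}
#endif

/* C = A * B for square row-major matrices via the BLAS sgemm entry point. */
static void matmul(int n, const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}
```

The point of the comment above is that this one call, unlike a hand-rolled NEON kernel, can reach the dedicated matmul hardware.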

I hope that Apple will implement SVE/SME one day which will allow us to leverage this hardware directly...


wait, are you suggesting apple has some hardware level GEMM instructions that are undocumented or smth? also, what do you mean by system-provided BLAS? their accelerate framework?


I am talking about the matrix/vector coprocessor (AMX). You can find some reverse-engineered documentation here: https://github.com/corsix/amx

On M3 a single matrix block can achieve ~1 TFLOPS on DGEMM; I assume it will be closer to 4 TFLOPS for SGEMM. The Max variants have two such blocks. I didn't do precise benchmarking myself, but switching Python/R matrix libraries to use Apple's BLAS resulted in a 5-6x perf improvement on matrix-heavy code for me.


yep. they have a neural engine that is separate from the cpu and gpu and does really fast matmuls https://github.com/hollance/neural-engine. it's basically completely undocumented, and for workloads like blas it has more horsepower than all the cores combined.

meteor lake and zen 4 (ryzen 8000) also both have npus, which are similar. https://www.pcmag.com/news/the-meteor-lake-npu-meet-intels-d...


What I had in mind is not the neural engine but the AMX (Apple's matrix/long-vector coprocessor).



