For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is essentially random.
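To make that concrete, here's a toy sketch (mine, not from the project) of what "activation sparsity" buys you: if most of a token's ReLU outputs are exactly zero, the matching columns of the down-projection never have to be read at all, so the access pattern is skewed rather than uniformly random. The sizes and the ~90% inactive rate below are just illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 4096, 11008                    # hypothetical LLaMA-7B-ish FFN sizes
    W_down = rng.standard_normal((d_model, d_ff)).astype(np.float32) * 0.02

    # Pretend ~90% of neurons are inactive for this token (a made-up rate,
    # roughly the ballpark reported for ReLU-based models).
    h = np.maximum(rng.standard_normal(d_ff).astype(np.float32) - 1.3, 0.0)
    active = np.flatnonzero(h)
    print(f"active neurons: {active.size}/{d_ff} ({active.size / d_ff:.0%})")

    # Dense path touches every weight; sparse path reads only the active columns.
    y_dense = W_down @ h
    y_sparse = W_down[:, active] @ h[active]
    print("max abs diff:", np.abs(y_dense - y_sparse).max())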
Hopefully the "cold" neurons eventually get offloaded to the IGP instead?
Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... that seems like it would only help older Macs with AMD dGPUs, unless I am missing something?
On the question of Apple Silicon and Metal, the only thing I can think of is that they could still split the cold neurons out to the CPU (via Accelerate) and keep the hot ones on the GPU, using both at once. The speedup is likely smaller when unified memory already means there's no copying of data between CPU and GPU. Still, it would be great to use even more of the chip's capabilities simultaneously. To avoid thermal throttling, they could run the CPU side on the efficiency cores only (I think this is what Game Mode does).
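For what it's worth, here's roughly how I picture that hot/cold split (my own NumPy sketch with made-up sizes and firing statistics, not the project's actual code): the frequently-firing neurons stay resident on the GPU, the long tail is handled on the CPU side, and each token's FFN work is partitioned between the two.

    import numpy as np

    rng = np.random.default_rng(1)
    d_model, d_ff, hot_frac = 4096, 11008, 0.2     # illustrative sizes only

    W_up = rng.standard_normal((d_ff, d_model)).astype(np.float32) * 0.02
    firing_rate = rng.beta(0.5, 2.0, size=d_ff)    # stand-in for profiled activation stats
    hot = np.argsort(firing_rate)[-int(hot_frac * d_ff):]    # top 20% -> "GPU" pool
    cold = np.setdiff1d(np.arange(d_ff), hot)                 # the rest -> "CPU/Accelerate" pool

    x = rng.standard_normal(d_model).astype(np.float32)

    # "GPU" side: dense matvec over the small, always-resident hot slice.
    h_hot = np.maximum(W_up[hot] @ x, 0.0)

    # "CPU" side: the cold neurons (in the real thing, presumably only those a
    # predictor expects to fire), which is the sparse, cache-unfriendly part.
    h_cold = np.maximum(W_up[cold] @ x, 0.0)

    h = np.zeros(d_ff, dtype=np.float32)
    h[hot] = h_hot
    h[cold] = h_cold
    print("hot neurons:", hot.size, "cold neurons:", cold.size)

(On unified memory both pools read the same physical RAM, so any win would have to come from genuinely running the CPU and GPU in parallel, not from avoiding copies.)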
That doesn't make much sense to me. The GPU's energy per task is much lower than even the E-cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.