For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is so much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is essentially random.
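To make that concrete, here's a toy sketch (mine, not from the project) of what "activation sparsity" buys you: if most of a token's ReLU outputs are exactly zero, the matching columns of the down-projection never have to be read at all, so the access pattern is skewed rather than uniformly random. The sizes and the ~90% inactive rate below are just illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 4096, 11008                    # hypothetical LLaMA-7B-ish FFN sizes
    W_down = rng.standard_normal((d_model, d_ff)).astype(np.float32) * 0.02

    # Pretend ~90% of neurons are inactive for this token (a made-up rate,
    # roughly the ballpark reported for ReLU-based models).
    h = np.maximum(rng.standard_normal(d_ff).astype(np.float32) - 1.3, 0.0)
    active = np.flatnonzero(h)
    print(f"active neurons: {active.size}/{d_ff} ({active.size / d_ff:.0%})")

    # Dense path touches every weight; sparse path reads only the active columns.
    y_dense = W_down @ h
    y_sparse = W_down[:, active] @ h[active]
    print("max abs diff:", np.abs(y_dense - y_sparse).max())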
Hopefully the "cold" neurons eventually get offloaded to the IGP instead?
Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... that seems like it would only help older Macs with AMD dGPUs, unless I am missing something?
On the question of Apple Silicon and Metal, the only thing I can think of is that they could still split the cold neurons out to the CPU (via Accelerate) and keep the hot ones on the GPU, using both at once. The speedup is likely smaller when unified memory already means there's no copying of data between CPU and GPU. Still, it would be great to use even more of the chip's capabilities simultaneously. To avoid thermal throttling, they could run the CPU side on the efficiency cores only (I think this is what Game Mode does).
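For what it's worth, here's roughly how I picture that hot/cold split (my own NumPy sketch with made-up sizes and firing statistics, not the project's actual code): the frequently-firing neurons stay resident on the GPU, the long tail is handled on the CPU side, and each token's FFN work is partitioned between the two.

    import numpy as np

    rng = np.random.default_rng(1)
    d_model, d_ff, hot_frac = 4096, 11008, 0.2     # illustrative sizes only

    W_up = rng.standard_normal((d_ff, d_model)).astype(np.float32) * 0.02
    firing_rate = rng.beta(0.5, 2.0, size=d_ff)    # stand-in for profiled activation stats
    hot = np.argsort(firing_rate)[-int(hot_frac * d_ff):]    # top 20% -> "GPU" pool
    cold = np.setdiff1d(np.arange(d_ff), hot)                 # the rest -> "CPU/Accelerate" pool

    x = rng.standard_normal(d_model).astype(np.float32)

    # "GPU" side: dense matvec over the small, always-resident hot slice.
    h_hot = np.maximum(W_up[hot] @ x, 0.0)

    # "CPU" side: the cold neurons (in the real thing, presumably only those a
    # predictor expects to fire), which is the sparse, cache-unfriendly part.
    h_cold = np.maximum(W_up[cold] @ x, 0.0)

    h = np.zeros(d_ff, dtype=np.float32)
    h[hot] = h_hot
    h[cold] = h_cold
    print("hot neurons:", hot.size, "cold neurons:", cold.size)

(On unified memory both pools read the same physical RAM, so any win would have to come from genuinely running the CPU and GPU in parallel, not from avoiding copies.)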
That doesn't make much sense to me. The GPU's energy per task is much lower than even the E-cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.