
>On the consumer side, Apple silicon is also quite capable.

I am not sure that is true. A glance (or a long stay) at the r/LocalLLaMA subreddit basically shows a bunch of frustrated CPU users trying their absolute best to get anything to work at useful speeds.

When you can get an Nvidia GPU for a few hundred dollars, or a full-blown gaming laptop with a 4050 and 6 GB of VRAM for $900, it's hard to call CPU-based AI capable.

Heck, we don't have GPUs at work, and CPU-based inference is just not reasonable without using tiny models and waiting around. We ended up requesting GPU machines.

I think there is a 'this is technically possible', and there is a 'this is really nice'. Nvidia has been really nice to use; CPU has been miserable and frustrating.




Actually, llama.cpp running on Apple silicon uses the GPU (via Metal compute shaders) for LLM inference. Token generation is also heavily memory-bandwidth bottlenecked. High-end Apple silicon offers about 400 GB/s to 800 GB/s, comparable to an NVIDIA RTX 4090, which has roughly 1,000 GB/s of memory bandwidth. Not to mention that Apple silicon has a unified memory architecture and comes in high-memory configurations (128 GB, up to 192 GB), which is necessary to run large LLMs like Llama 3 70B, which takes roughly 40–75 GB of RAM to work reasonably.
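
For a rough sense of why decode speed tracks memory bandwidth, here is a back-of-envelope sketch. The bandwidth figures and the assumption that every generated token streams the full set of weights once are illustrative simplifications, not benchmarks:

    # Rough, memory-bandwidth-bound estimate of decode speed (tokens/s).
    # Assumes each generated token streams all model weights from memory once;
    # ignores KV-cache traffic and compute, so treat results as upper bounds.
    def decode_tokens_per_sec(params_b, bytes_per_param, bandwidth_gb_s):
        model_gb = params_b * bytes_per_param   # e.g. 70B at 4-bit ~= 35 GB
        return bandwidth_gb_s / model_gb

    # Illustrative bandwidth numbers (assumptions, check your own hardware):
    for name, bw in [("M2 Max ~400 GB/s", 400),
                     ("M2 Ultra ~800 GB/s", 800),
                     ("RTX 4090 ~1000 GB/s", 1000)]:
        print(name, round(decode_tokens_per_sec(70, 0.5, bw), 1),
              "tok/s for a 4-bit 70B model")

By that estimate a 4-bit 70B model tops out somewhere around 11–29 tok/s on those machines, which matches the intuition that bandwidth, not raw compute, is the ceiling for single-stream generation.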


[flagged]


I use it all the time?


The number of people running Llama 3 70B on Nvidia gaming GPUs is absolutely tiny. You're going to need at least two of the highest-end 24 GB VRAM GPUs, and even then you're still reliant on 4-bit quantization with almost nothing left for your context window.
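
The arithmetic, under hedged assumptions about a Llama-3-70B-style architecture (80 layers, GQA with 8 KV heads, head dim 128, fp16 KV cache), looks roughly like this:

    # Back-of-envelope VRAM check for a 70B model on 2x 24 GB GPUs (48 GB total).
    # Architecture numbers below are assumptions for a Llama-3-70B-like model.
    params_b = 70
    weight_gb = params_b * 0.5              # 4-bit weights ~= 35 GB: too big for one 24 GB card

    layers, kv_heads, head_dim = 80, 8, 128
    kv_bytes_per_token = layers * kv_heads * head_dim * 2 * 2  # K and V, fp16
    ctx = 8192
    kv_cache_gb = kv_bytes_per_token * ctx / 1e9

    print(f"weights: {weight_gb:.0f} GB, KV cache @ {ctx} ctx: {kv_cache_gb:.1f} GB")
    print(f"left on 48 GB for activations, buffers, longer context: "
          f"{48 - weight_gb - kv_cache_gb:.1f} GB")

The remainder gets eaten quickly by the split across two cards, runtime overhead, and activation buffers, and the KV cache grows linearly with context length, which is why long contexts get tight.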


The cognitive dissonance here.

70B models aren't better than 7B models outside roleplay. The logic all sucks the same. No one even cares about 70B models.


I don't think NVIDIA's reign will last long. The recent AI resurgence is not even a decade old. We can't expect the entire industry to shift overnight, but we are seeing rapid improvements in the capability of non-GPU hardware to run AI workloads. The architecture change has been instrumental for this, and Apple is well positioned to move the field forward, even if their current-gen hardware is lacking compared to traditional GPUs. Their silicon is not even 5 years old, yet it's unbeatable for traditional workloads and power efficiency, and competitive for AI ones. What do you think it will be capable of 5 years from now? Same for Groq and other NPU manufacturers. Betting on NVIDIA doesn't seem like a good long-term strategy, unless they also shift their architecture.



