Even the M3 Max seems to be slower than my 3090 for LLMs that fit onto the 3090, but it’s hard to find comprehensive numbers. The primary advantage is that you can spec out more memory with the M3 Max to fit larger models, but with the exception of CodeLlama-70B today, it really seems like the trend is for models to be getting smaller and better, not bigger. Mixtral runs circles around Llama2-70B and arguably ChatGPT-3.5. Mistral-7B often seems fairly close to Llama2-70B.
Microsoft accidentally leaked that ChatGPT-3.5-Turbo is apparently only 20B parameters.
24GB of VRAM is enough to run ~33B parameter models, and enough to run Mixtral (which is a MoE, which makes direct comparisons to “traditional” LLMs a little more confusing.)
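For a rough sense of why 24GB is about the cutoff for ~33B models, here's a back-of-the-envelope sketch. The ~4.5 bits/weight and the 2GB overhead figure are my own assumptions for a typical 4-bit quant plus KV cache, not measurements:

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# Assumptions: ~4.5 bits/weight (roughly a 4-bit GGUF-style quant)
# plus ~2 GB for KV cache and runtime buffers.

def vram_estimate_gb(n_params_billion, bits_per_weight=4.5, overhead_gb=2.0):
    weight_gb = n_params_billion * bits_per_weight / 8  # billions of bytes ~= GB
    return weight_gb + overhead_gb

print(f"33B: ~{vram_estimate_gb(33):.0f} GB")  # ~21 GB -> fits in 24 GB
print(f"70B: ~{vram_estimate_gb(70):.0f} GB")  # ~41 GB -> does not fit
```

(Mixtral muddies this kind of comparison because its total parameter count is much larger than the parameters active per token, so memory needs track the total while speed tracks the active set.)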
I don’t think there’s a clear answer of what hardware someone should get. It depends. Should you give up performance on the models most people run locally in hopes of running very large models, or give up the ability to run very large models in favor of prioritizing performance on the models that are popular and proven today?
The M3 Max is actually less than ideal because its memory bandwidth peaks at 400 GB/s. What you really want is an M1 or M2 Ultra, which offers up to 800 GB/s (for comparison, the RTX 3090 runs at 936 GB/s). A Mac Studio suitable for running 70B models at speeds fast enough for real-time chat can be had for ~$3K.
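For intuition on why bandwidth is the headline number: single-stream token generation is roughly bandwidth-bound, since every generated token requires reading (more or less) all of the weights once. So an upper bound on tokens/sec is bandwidth divided by the quantized model size. A quick sketch, where the ~40GB figure for a 4-bit 70B quant is my assumption:

```python
# Rule of thumb: decode speed is capped by memory bandwidth / weight size.
# Bandwidth figures are the ones quoted above; the ~40 GB size for a
# 4-bit 70B quant is an assumption, and real throughput will be lower.

MODEL_SIZE_GB = 40  # assumed: 70B model at ~4.5 bits/weight

for name, bw_gb_s in [("M3 Max", 400), ("M1/M2 Ultra", 800), ("RTX 3090", 936)]:
    print(f"{name:12s} ~{bw_gb_s / MODEL_SIZE_GB:.0f} tok/s theoretical ceiling")
```

Even as a loose ceiling, it shows why the Ultra's 800 GB/s is what makes 70B chat comfortable, while 400 GB/s is borderline.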
The downside of Apple's hardware at the moment is that the training ecosystem is very much focused on CUDA; llama.cpp has an open issue about Metal-accelerated training: https://github.com/ggerganov/llama.cpp/issues/3799 - but no work on it so far. This is likely because training at any significant scale requires enough compute that it's currently almost always better to do it in the cloud, where, again, CUDA is the well-established ecosystem, and it's cheaper and easier for datacenter operators to scale. But, in principle, much faster training on Apple hardware should be possible, and eventually someone will get it done.
Yep, I seriously considered a Mac Studio a few months ago when I was putting together an “AI server” for home usage, but I had my old 3090 just sitting around, and I was ready to upgrade the CPU on my gaming desktop… so then I had that desktop’s previous CPU. I just had too many parts already, and it deeply annoys me that Apple won’t put standard, user-upgradable NVMe SSDs on their desktops. Otherwise, the Mac Studio is a very appealing option for sure.