
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (by having the memory on-package with CPU), with its large memory bandwidth, actually is. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inferencing capabilities at the edge.

Edited - sorry, meant on-package.




> unified memory model (by having the memory on-package with CPU)

That's not what "unified memory model" means.

It means that the CPU and GPU (and ANE!) have access to the same banks of memory, unlike PC GPUs that have their own memory, separated from the CPU's by the PCIe bottleneck (as fast as that is, it's still much slower than direct shared DRAM access).

It allows the hardware more flexibility in how the single pool of memory is allocated across devices, and faster sharing of data across devices. (Throughput/latency depends on the internal system bus ports and how many each device has access to.)
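
As a rough sketch of why that matters (a toy estimate; the ~32 GB/s PCIe 4.0 x16 and ~400 GB/s unified-memory figures are assumptions, not measurements of any particular machine):

  # Toy estimate: staging a buffer across PCIe vs. touching it in shared DRAM.
  PCIE4_X16_GBPS = 32       # assumed theoretical peak for PCIe 4.0 x16, GB/s
  UNIFIED_DRAM_GBPS = 400   # assumed, e.g. a Max-class Apple SoC, GB/s
  buffer_gb = 8             # hypothetical chunk of data shared between CPU and GPU

  print(f"copy over PCIe first: {buffer_gb / PCIE4_X16_GBPS * 1000:.0f} ms")     # ~250 ms
  print(f"read in place (UMA):  {buffer_gb / UNIFIED_DRAM_GBPS * 1000:.0f} ms")  # ~20 ms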

The Apple M-series chips also have the memory on-package with the CPU (technically an SoC, "System-on-Chip"), but that provides different benefits.


Having separate GPU memory also has its benefits. Once the data makes it through the PCIe bus, graphics memory typically has much higher bandwidth, which also doesn't need to be shared with the CPU.


The benefit of having an ecosystem of discrete GPUs is that CPUs can get away with low-bandwidth memory. This is great if you want motherboards with a socketed CPU and socketed RAM that are compatible with the whole range of product segments.

CPUs don't really care about memory bandwidth until you get to extreme core counts (Threadripper/Xeon territory). Mainstream desktop and laptop CPUs are fine with just two channels of reasonably fast memory.

This would bottleneck an iGP, but those are always weak anyway. The PC market has told users who need more to get a discrete GPU and to pay the extra costs involved with high bandwidth soldered memory only if they need it.

The calculation Apple has made is different. You'll get exactly what you need as a complete package. You get CPU, GPU, and the bandwidth you need to feed both as a single integrated SoC all the way to the high end. Modularity is something PC users love but doing away with it does have advantages for integration.


An M2 Ultra has 800GB/s of memory bandwidth, an Nvidia 4090 has 1008GB/s. Apple have chosen to use relatively little system memory at unusually high bandwidth.


Most x86 machines have integrated GPUs and hardware-wise are UMA.


integrated GPUs are not powerful in comparison to dedicated GPUs


Generally so, but the gap is closing (like Apple is showing).


Old becomes new: the SGI O2 had a unified memory model (off-chip) for performance reasons.

Not a CS guy, but it seems to me that NUMA-like architectures have to come back: large RAM on-chip (balancing the thermal budget between number of cores and RAM), a much larger RAM off-chip, and even more RAM through a fast interconnect, all on a single kernel image. Like the Origin 300 had.


UMA in the SGI machines (and gaming consoles) made sense because all the memory chips at that time were equally slow, or fast, depending on how you wanna look at it.

PC hardware split the video memory from system memory once GDDR became so much faster than system RAM, but GDDR has too high latency for CPUs and DDR has too low bandwidth for GPUs, so the separation played to each one's strengths and still does to this day. Unifying it again, like with AMD's APUs, means either compromises for the CPU or for the GPU. There's no free lunch.

Currently, AMD APUs on the PC use unified DDR RAM, so CPU performance is top but GPU/NPU performance is bottlenecked. If they were to use unified GDDR RAM like in the PS5/Xbox, then GPU/NPU performance would be top and CPU performance would be bottlenecked.
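
To put rough numbers on that gap, here's a back-of-envelope sketch (the DDR5-5600 and 18 GT/s GDDR6 figures are typical values I'm assuming, not tied to a specific product):

  def peak_gbps(bus_bits, gigatransfers_per_s):
      # peak bandwidth = bus width in bytes * transfer rate
      return bus_bits / 8 * gigatransfers_per_s

  print(peak_gbps(128, 5.6))   # dual-channel DDR5-5600: ~89.6 GB/s
  print(peak_gbps(256, 18.0))  # 256-bit GDDR6 at 18 GT/s: ~576 GB/s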


>Unifying it again, like with AMD's APUs, means either compromises for the CPU or for the GPU. There's no free lunch.

I think the lunch here (it still ain't free) is that RAM speed means nothing if you don't have enough RAM in the first place, and this is a compromise solution to that practical problem.


The real problem is a lack of competition in GPU production.

GDDR is not very expensive. You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered. Instead, please pay triple or quadruple the price of a high end gaming GPU to get a model with double the memory and basically the same core.

The level of markup is astonishing. I can go from 8GB to 16GB on AMD for $60, but going from 24GB to 48GB costs $3000. And nvidia isn't better.


This isn't my area but won't it be quite expensive to use that GDDR? The PHY and the controller are complicated and you've got max 2GB devices, so if you want more memory you need a wider bus. That requires more beachfront and therefore a bigger die. That must make it expensive once you go beyond what you can fit on your small, cheap chip. Do their 24GB+ cards really use GDDR?*

And you need to ensure you don't shoot yourself in the foot by making anything (relatively) cheap that could be useful for AI...

*Edit: wow yeh, they do! A 384-bit interface on some! Sounds hot.


The upgrade is actually remarkably similar.

RX7600 and RX7600XT have 8GB or 16GB attached to a 128-bit bus.

RX7900XTX and W7900 have 24GB or 48GB attached to a 384-bit bus.

Neither upgrade changes the bus width, only the memory chips.

The two upgraded models even use the same memory chips, as far as I can tell! GDDR6, 18000MT/s, 2GB per 16 pins. I couldn't confirm chip count on the W7900 but it's the same density.


Perhaps I should have been clearer and said "if you want more memory than the 16GB model you need a wider bus" but this confirms what I described above.

  - cheap gfx die with a 128-bit memory interface.
  - vastly more expensive gfx die with a 384-bit memory interface.
Essentially there's a cheap upgrade option available for each die, swapping the 1GB memory chips for 2GB memory chips (GDDR doesn't support a mix). If you have the 16GB model and you want more memory, there are no bigger memory chips available, so you need a wider bus, which is going to cost significantly more to produce, and hence they charge more.

As a side note, I would expect the GDDR chips to be x32, rather than x16.


> Essentially there's a cheap upgrade option available for each die

Hmm, I think you missed what my actual point was.

You can buy a card with the cheap die for $270.

You can buy a card with the expensive die for $1000.

So far, so good.

The cost of just the memory upgrade for the cheap die is $60.

The cost of just the memory upgrade for the expensive die is $3000.

None of that $3000 is going toward upgrading the bus width. Maybe one percent of it goes toward upgrading the circuit board. It's a huge market segmentation fee.

> As a side note, I would expect the GDDR chips to be x32, rather than x16.

The pictures I've found show the 7600 using 4 ram chips and the 7600 XT using 8 ram chips.


I did miss your point - I hadn't understood that the hefty price tag was just for the memory. Thank you for not biting my head off! That is pretty mad. I'm struggling to think of any explanation other than yours. Even the few supporting upgrades that may be required for the extra memory (power supplies etc.) wouldn't be anywhere near that cost.

I couldn't find much to actually see how many chips are used, but on https://www.techpowerup.com/review/sapphire-radeon-rx-7600-x... it says H56G42AS8DX-014, which is a x32 part (https://product.skhynix.com/products/dram/gddr/gddr6.go?appT...). But either way it can't explain that pricing!


The first two pictures on that page show 4 RAM chips on each side of the board.

https://media-www.micron.com/-/media/client/global/documents...

This document directly talks about splitting a 32 data bit connection across two GDDR6 chips, on page 7.

"Allows for a doubling of density. Two 8Gb devices appear to the controller as a single, logical 16Gb device with two 16-bite wide channels."

Do that with 16Gb chips and you match the GPUs we're talking about.
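
A quick sketch of the capacity arithmetic that implies, assuming x32 chips at 2GB each (chip counts are mine, inferred from the cards discussed above):

  def vram_gb(bus_bits, gb_per_chip=2, clamshell=False):
      # each GDDR6 chip presents a 32-bit interface, so chips = bus_bits / 32;
      # clamshell doubles the chip count on the same bus width
      chips = (bus_bits // 32) * (2 if clamshell else 1)
      return chips * gb_per_chip

  print(vram_gb(128))                  # 8 GB  (4 chips)
  print(vram_gb(128, clamshell=True))  # 16 GB (8 chips)
  print(vram_gb(384))                  # 24 GB (12 chips)
  print(vram_gb(384, clamshell=True))  # 48 GB (24 chips)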


Of course - clamshell mode! They don't connect half the data bus from each chip. Also explains how they fit it on the card so easily (and cheaply).


Yeah, though a way to connect fewer data pins to each chip doesn't particularly have to be clamshell, it just requires a little bit of flexibility in the design.


what exactly do you expect them to do about the routing? 384b gpus are as big as memory buses get in the modern era and that gives you 24gb capacity or 48gb when clamshelled. Higher-density 3GB modules that would allow 36GB/72GB have been repeatedly pushed back, which has kinda screwed over consoles as well - they are in the same bind with "insufficient" VRAM on MS and while sony is less awful they didn't give a VRAM increase with PS5 Pro either.

GB202 might be going to 512b bus which is unprecedented in the modern era (nobody has done it since the Hawaii/GCN2 days) but that's really as big as anybody cares to route right now. What do you propose for going past that?

Ironically AMD actually does have the capability to experiment and put bigger memory buses on smaller dies. And the MCM packaging actually does give you some physical fanout that makes the routing easier. But again, ultimately there is just no appetite for going to 512b or 768b buses at a design level, for very good technical reasons.

like it just is what it is - GDDR is just not very dense, this is what you can get out of it. Production of HBM-based gpus is constrained by stacking capacity, which is the same reason people can't get enough datacenter cards in the first place.

Higher-density LPDDR does exist, but you still have to route it, and bandwidth goes down quite a bit. And that's the Apple Silicon approach, which solves your problem but unfortunately a lot of people just flatly reject any offering from the Fruit Company for interpersonal reasons.


> what exactly do you expect them to do about the routing? 384b gpus are as big as memory buses get in the modern era and that gives you 24gb capacity or 48gb when clamshelled.

I'm not asking for more than that. I'm asking for that to be available on a mainstream consumer model.

Most of the mid-tier GPUs from AMD and Nvidia have 256-bit memory buses. I want 32GB on that bus at a reasonable price. Let's say $150 or $200 more than the 16GB version of the same model.

I appreciate the information about higher densities, though.


> You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered.

Apple unified memory is the easiest way to get exactly that. There is a markup on memory upgrades but it's quite reasonable, not at all extreme.


But I can't put that GPU into my existing machine, so I'm still paying $3000 extra if I don't want that Apple machine to be my computer.


>if you don't have enough RAM in the first place

Enough RAM for what task exactly? System RAM is plentiful and cheap nowadays (unless you buy Apple). I got a new laptop with 32GB RAM for about 750 Euros. But for the poor APU, the speeds are too low for high-end gaming or LLM training.


Enough RAM for LLMs. There are GPUs faster than the M2 Ultra but they can't run large LLMs at all, which makes that speed a moot point for LLM use-cases.


I suspect there are difficulties with DRAM latency and/or signal integrity with APUs and RAM-expandable GPUs. ALUs are wasted if you'd be stalling deep SIMD pipelines all the time.


The fancier and more expensive SGIs had higher-bandwidth non-UMA memory systems on the GPUs.

The thing about bandwidth is that you can just make wider buses with more of the same memory chips in parallel.


> Old becomes new

I disagree. This is like pointing at a 2024 model hatchback and saying "old becomes new" because you can cite a hatchback from 50 years ago.

There's a bevy of ways to isolate or combine memory pools, and mainstream hardware has consistently used many of these methods the entire time.


Several PC graphics standards attempted to offer high-speed access to system memory (though I think only VLB offered direct access to the memory controller at the same speed as the CPU). Not having to have dedicated GPU memory has obvious advantages, but it's hard to get right.


Note that while UMA is great in the sense that it allows large LLM models to be run at all, M-series chips aren't faster[1] when the model fits in VRAM.

  1: screenshot from[2]: https://www.igorslab.de/wp-content/uploads/2023/06/Apple-M2-ULtra-SoC-Geekbench-5-OpenCL-Compute.jpg
  2: https://wccftech.com/apple-m2-ultra-soc-isnt-faster-than-amd-intel-last-year-desktop-cpus-50-slower-than-nvidia-rtx-4080/


The problem is you're limited to 24 GB of VRAM unless you pay through the nose for datacenter GPUs, whereas you can get an M-series chip with 128 GB or 192 GB of unified memory.


Surely! The point is that they're not million-times-faster magic chips that will make NVIDIA bankrupt tomorrow. That's all. A laptop with up to 128GB of "VRAM" is a great option, absolutely no doubt about that.


They are powerful, but I agree with you, it's nice to be able to run Goliath locally, but it's a lot slower than my 4070.


That's OpenCL compute; LLM models ideally should be hitting the neural accelerator, not running on generalized GPU compute shaders.


Apple's memory is on-package, not on-die.


Apple does not make on-die RAM.


My understanding is the unified RAM on the M-series die does not contribute significantly to their performance. You get a little bit better latency but not much. The real value to Apple is likely that it greatly simplifies the design, since you don't have to route out tons of DRAM signaling and power management on your logic board. Might make DRAM training easier too, but that's way beyond my expertise.


So why are you proposing Apple did that, if not for performance, as you claim? They waste a lot of silicon on those extra memory controllers - basically a comparable amount to all the CPU cores in an M1 Max.


It also provides the GPU with a serious RAM allocation - a 64GB M3 chip comes with more RAM available to the GPU than a 4090.


edit: I was wrong.


> low latency results each frame

What does that do to your utilization?

I've been out of this space for a while, but in game dev any backwards information flow (GPU->CPU) completely murdered performance. "Whatever you do, don't stall the pipeline." Instant 50%-90% performance hit. Even if you had to awkwardly duplicate calculations on the CPU, it was almost always worth it, and not by a small amount. The caveat to "almost" was that if you were willing to wait 2-4 frames to get data back, you could do that without stalling the pipeline.

I didn't think this was a memory architecture thing, I thought it was a data dependency thing. If you have to finish all calculations before readback and if you have to readback before starting new calculations, the time for all cores to empty out and fill back up is guaranteed to be dead time, regardless of whether the job was rendering polygons or finite element calculations or neural nets.

Does shared memory actually change this somehow? Or does it just make it more convenient to shoot yourself in the foot?

EDIT: or is the difference that HPC operates in a regime where long "frame time" dwarfs the pipeline empty/refill "dead time"?


It was probably because my workloads were relatively small that I could get away with a 90Hz read on the CPU side. I'll need to dig deeper into it. The metrics I was seeing showed 200-300 microseconds of GPU time for physics calculations and, within the same frame, the CPU reading from that buffer. Maybe I'm wrong, need to test more.


> edit: I was wrong.

If the only reason you were "wrong" was because you intuitively understood that it wasn't worth a large amount of valuable human time to save a small amount of cheap machine time, you were right in the way that matters (time allocation) and should keep it up :)


You can do that on both HIP and CUDA through e.g. hipHostMalloc and the CUDA equivalent (not officially supported on the AMD APUs but works in practice). With a discrete GPU, the GPU will access memory across PCIe, but on an APU it will go full speed to RAM as far as I can tell.


What Apple has is theoretically great on paper, but it fails to live up to expectations. What's the point of having the RAM for running an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or M5 addresses.


It's a 25W processor. How will it ever live up to a 400W GPU? Also you can't even run large models on a single 4090, but you can on M-series laptops with enough RAM.

The fact a laptop can run 70B+ parameter models is a miracle, it's not what the chip was built to do at all.


It's a valid comparison in the very limited sense of "I have $2000 to spend on a way to run LLMs, should I get an RTX4090 for the computer I have or should I get a 24GB MacBook", or "I have $5000, should I get an RTX A6000 48GB or a 96GB MacBook".

Those comparisons are unreasonable in a sense, but they are implied by statements like the GP's "Hence a lot of people in the local LLaMA community are really going after high-memory Macs".


The answer depends on what you plan to do with it.

Do you need to do fine tuning on a smaller model and need the highest inference performance with smaller models? Are you planning to use it as a lab to learn how to work with tools that are used in Big Tech (i.e. CUDA)? Or do you just want to do slow inference on super huge models (e.g Grok)?

Personally, I chose the Nvidia route because as a backend engineer, Macs aren’t seriously used in datacenters. The same frameworks I use to develop on a 3090 are transferable to massive, infiniband-connected clusters with TB of VRAM.


I think there are two communities:

- the "hobbyists" with $5k GPUs

- People that work in the industry that never used "not mac", or even if they did - explaining to IT that you need a PC with an RTX A6000 48GB instead of a Mac like literally everyone else in the company is a losing battle.


There is also an important third group:

- people that work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle. So you just submit your request for an RTX A6000 48GB to be added to your existing workstation.

Those people are the intended target customer of the A6000, and there are a lot of them.


While there are many people "that work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle", I think these companies either don't care about AI or get into it by acquiring a startup (that runs on Macs).

Companies that run Windows and AD are too busy making sure you move your mouse every 5 minutes while you're on the clock. At least that is my experience.


There are some genuine, non-Dilbert-esque uses for AD, where there really isn't a viable, similarly performant alternative.


No, it is not a valid comparison to make between an entire laptop and a single PC part. "The computer I have" is doing a ton of heavy lifting here.


Sorta yes, sorta no.

You're certainly right that with a macbook you get a whole computer, so you're getting more for your money. And it's a luxury high-end computer too!

But personally, I've never seen anyone step directly from not-even-having-a-PC to buying a 4090 for $1800. Folks that aren't technically inclined by and large stick with hosted models like ChatGPT.

More common in my experience is for technical folks with, say, an 8GB GPU to experiment with local ML, decide they're interested in it, then step up to a 4090 or something.


"Should I upgrade what I have or buy something new" is a completely normal everyday decision. Of course it doesn't apply to everyone since it presumes you have something compatible to upgrade, but it is a real decision lots of people are making


But it's also assuming everyone has a desktop PC capable of this stuff that can be upgraded.


The problem is that it extends to both Mac Studio and Mac Pro.


> The fact a laptop can run 70B+ parameter models is a miracle

Most laptops with 64+GB of RAM can run a 70B model at 4-bit quantization. It’s not a miracle, it’s just math. M2 can do it faster than systems with slower memory bandwidth.
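
Roughly (the overhead figure below is an assumption; the exact total depends on the quantization format and context length):

  params = 70e9
  bits_per_weight = 4.5              # 4-bit weights plus scales/zero-points (assumed)
  weights_gb = params * bits_per_weight / 8 / 1e9   # ~39 GB
  kv_and_runtime_gb = 4              # assumed KV cache + runtime buffers
  print(weights_gb + kv_and_runtime_gb)  # ~43 GB: fits in 64GB of RAM, not in 24GB of VRAM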


That completely depends on your expectations and uses.

I have a gaming rig with a 4080 with 16GB of RAM and it can't even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized. Yeah it's fast when something fits on it, but I don't see much point in very fast generation of bad output. A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM, and for ~$300 less, and it's a whole computer instead of a single part that still needs a computer around it. As for my 4080, that GPU and the surrounding computer get you half the VRAM capacity at greater cost than the Mac.

If you're building a rig with multiple GPUs to support many users or for internal private services and are willing to drop more than $3k then I think the equation swings back in favor of Nvidia, but not until then.


> I have a gaming rig with a 4080 with 16GB of RAM and it can’t even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized.

Which Mixtral? What does “heavily quantized” mean?

> A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM

Not if the 4090 is in a computer with non-trivial system RAM (which it kind of needs to be), since models can be split between GPU and CPU. The M1 Max might have better performance than the 4090 machine with 24GB of VRAM for models that take >24GB and <= 32GB, but assuming the 4090 is in a machine with 16GB+ of system RAM, it will be able to run bigger models than the M1 Max, as well as outperforming it for models requiring up to 24GB of RAM.

The system I actually use is a gaming laptop with a 16GB 3080Ti and 64GB of system RAM, and Mixtral 8x7B @ 4-bit (~30GB RAM) works.


Yeah this.

Also, as an anecdote, my daily driver machine is a bottom-end M2 Mac mini because I am a cheap ass. I paid less for it than for the 4070 card in my desktop PC. The M2 Mac does a dehaze from RAW in Lightroom in 18 seconds. My 4070 takes 9 seconds. So the GPU is twice as fast, but the Mac has a whole free computer stuck to it.


Just buy another 16GB of RAM for $80...


Running your larger-than-your-GPU-VRAM LLM model on regular DDR ram will completely slaughter your token/s speed to the point that the Mac comes out ahead again.


Depends on what you're doing. Just chatting with the AI?

I'm getting about 7 tokens per second for Mistral with the Q6_K quant on a bog-standard Intel i5-11400 desktop with 32G of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in). A 2-year-old low-end CPU that goes for, what, $150 these days? As far as I'm concerned that's conversational speed. Pop in some 8-core modern CPU and I'm betting you can double that, without even involving any GPU.
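
That lines up with a simple bandwidth-bound estimate - a sketch assuming the whole model is streamed from RAM once per generated token, with approximate figures:

  # decode is memory-bound: each new token reads roughly every weight once,
  # so tokens/s is capped near (usable memory bandwidth) / (model size)
  ddr4_3200_dual_channel_gbps = 51.2   # 2 channels * 8 bytes * 3.2 GT/s
  mistral_7b_q6_k_gb = 5.9             # approximate quantized model size (assumed)
  print(ddr4_3200_dual_channel_gbps / mistral_7b_q6_k_gb)  # ~8.7 tokens/s upper bound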

People way overestimate what they need in order to play around with models these days. Use llama.cpp and buy that extra $80 worth of RAM and pay about half the price of a comparable Mac all in. Bigger models? Buy more RAM, which is very cheap these days.

There's a $487 special on Newegg today with an i7-12700KF, motherboard and 32GB of RAM. Add another $300 worth of case, power supply, SSD and more RAM and you're under the price of a MacBook Air. There's your LLM inference machine (not for training, obviously), which can run even the 70B models at home at acceptable conversational speed.


GPU ram?


Buying an M3 Max with 128GB of RAM will underperform any consumer NVIDIA GPU, but it will be able to load larger models in practice, just slowly.

I think for the M-series chips to aggressively target GPU inference or training, Apple would need a strategy that increases the speed of the RAM to start to match GDDR6 or HBM3, or uses those directly.


You summed up my point better than I did


The performance of ollama on my M1 Max is pretty solid - and it does things that my 2070 GPU can't do because of memory.


Not that I don’t believe you but the 2070 is two generations and 5 years old. Maybe a comparison to a 4000 series would be more appropriate?


The M1 Max is also 2 generations old, and ~3 years old at this point. Seems like a fair comparison to me.


Maybe it's controversial, but I don't think comparing 5nm mobile hardware from 2021 is a fair fight against 12nm desktop hardware from 2018.

And performance-wise, the 2070 still wins out by a ~33% margin: https://browser.geekbench.com/opencl-benchmarks


For this comparison the generation of chip doesn't really matter, because the LLM decode (which is the costly step) barely uses any of the compute and just needs the model weights to fit in memory.


I found it interesting that the Apple M3 scored nearly identically to the Radeon 780M. I know the memory bandwidth is slower, but you can add two 32GB SODIMMs to the AMD APU for short money.


The 4000 series still has a bigger gap in how much of a generational leap that product was.

The M3 Max has something like 33% faster overall graphics performance than the M1 Max (average benchmark) while the 4090 is something like 138% faster than the 2080Ti.

Depending on which 2070 and 4070 models you compare the difference is similar, close to or exceeding 100% uplift.


Googling power draw, the 4090 goes up to 450W whilst the 2080 Ti was at 250W; adjusting for power consumption, the increase is somewhere around 32%. Some architectural gains and probably optimizations in chipset workings, but we're not seeing as many amazing generational leaps anymore, regardless of manufacturer/designer.
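
A quick check of that arithmetic, using the ~138% uplift and power figures above:

  speedup = 2.38            # 4090 vs 2080 Ti, ~138% faster (figure from above)
  power_ratio = 450 / 250   # board power draw figures from above
  print((speedup / power_ratio - 1) * 100)  # ~32% better performance per watt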


I’m still seeing over a 100% uplift comparing mobile to mobile on Nvidia products: https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...

As far as desktop products go, power consumption is irrelevant.


Well, you know that it would still be able to do more than a 4000-series GPU from Nvidia, because you can have more system memory in a Mac than you can have video RAM in a 4000-series GPU.


Yes, obviously I’m aware that you can throw more RAM at an M-series GPU.

But of course that’s only helpful for specific workflows.


One thing I would consider is usage throttling on a MacBook Pro. Would repeated LLM usage run into throttling?

No idea what specifically everyone is pulling their performance data from or what task(s).

Here is a video to help visualize the differences with a maxed-out M3 Max vs a 16GB M1 Pro vs a 4090 on Llama 2 7B/13B/70B: https://youtu.be/jaM02mb6JFM

Here’s a Reddit comparison of a 4090 vs an M2 Ultra 96GB with tokens/s:

https://old.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...

M3 Pro memory BW: 150 GB/s. M3 Max 10/30: 300 GB/s. M3 Max 12/40: 400 GB/s.

“Llama models are mostly limited by memory bandwidth. RTX 3090 has 935.8 GB/s, RTX 4090 has 1008 GB/s, M2 Ultra has 800 GB/s, M2 Max has 400 GB/s, so the 4090 is 10% faster for Llama inference than the 3090 and more than 2x faster than the Apple M2 Max. Using exllama (https://github.com/turboderp/exllama) you can get 160 tokens/s on a 7B model and 97 tokens/s on a 13B model, while the M2 Max has only 40 tokens/s on a 7B model and 24 tokens/s on a 13B. Memory bandwidth cap is also the reason why llamas work so well on CPU (…)

buying a second GPU will increase memory capacity to 48GB but has no effect on bandwidth, so 2x 4090 will have 48GB VRAM, 1008 GB/s bandwidth, and 50% utilization”


Hmm. I'm running decent LLMs locally (deepseek-coder:33b-instruct-q8_0, mistral:7b-instruct-v0.2-q8_0, mixtral:8x7b-instruct-v0.1-q4_0) on my MacBook Pro and they respond pretty quickly. At least for interactive use they are fine and comparable to Anthropic Opus in speed.

That MacBook has an M3 Max and 64GB RAM.

I'd say it does live up to my expectations, perhaps even slightly exceeds them.


what are you talking about, it's literally the fastest single core retail CPU globally, and multicore is close too - https://browser.geekbench.com/processor-benchmarks


I would take those benchmarks with a grain of salt, given that it shows a 96-core Epyc losing in multi-thread to a 64-core Epyc and a 32-core Xeon.



