I know it's not ideal, but I am really rooting for Intel's effort in graphics. From launch nearly a year ago to today, the driver team has been knocking out issues and slowly but surely working toward parity with AMD and Nvidia. From what I hear, the team is really going for broke on getting ML workloads fully supported. When Intel shipped a 16GB card at that price point, you knew they were really trying; it would be cool if Battlemage (the next generation) shipped with 32GB for larger models.
I think you'll be in for Another Massive Disappointment (AMD).
The software/driver woes in their graphics division have been going on for years at this point with no clear path forward, just some statements about how they plan to fix it in the future (e.g. how they handled the geohot AMD rant back in June).
It's not just ML either - on the gaming side, some people swallow the Nvidia price premium for the better drivers.
I made irrational purchasing decisions in the past to support the underdog, likening it to using Firefox. Unfortunately with AMD you just get a worse product.
What's annoying is that things don't have to be this way. If our chipmakers like Apple, AMD and Intel were able to actually work together, we could make a proper GPGPU library that puts CUDA in the grave. Nvidia is betting on these companies being driven further apart though, and they just won't stop being proven right.
We need FAANG to put aside their differences to fund an alternative, or else FAANG will suffer the consequences.
> I think you'll be in for Another Massive Disappointment (AMD).
Nah, I pretty much gave up on AMD when the 7000 series launch was imminent and there was still no real ML support for the 6000 series. Not that I was even expecting anything. With the way Intel Arc is coming along, any ML support for Radeon is just a bonus. I refuse to pay the Nvidia tax until it becomes clear there is no other option again. Nvidia just really rubbed me the wrong way with the latest generation.
Nvidia’s drivers are of horrid quality. They've basically been on a break-fix-break-fix-break-fix cycle with Destiny 2 and Sea of Thieves, with the broken versions dropping a locked 144FPS PC to ~50FPS.
I'm a loyal Intel customer because their stuff is, in general, straight up less janky and more practically reliable than AMD's stuff. I agree that Intel knows how to do software in ways that AMD simply doesn't.
But if Intel "wins", it means yet another instance of corporate consolidation, where the majority holder of one market (CPUs) destroys a whole different market (GPUs) by integrating and price dumping their products into the majority/monopoly market.
It'll be really hard for AMD's Radeon cards and Nvidia itself to compete against Intel if Intel aggressively price dumps its GPUs and uses typical megacorporation tactics to push out and destroy competition.
Why would you want that? Why not root for an independent GPU company? Where does this love for megaconglomerates consolidating whole markets come from?
> It'll be really hard for AMD's Radeon cards and Nvidia itself to compete against Intel if Intel aggressively price dumps its GPUs and uses typical megacorporation tactics to push out and destroy competition.
AMD's market cap has been higher than Intel's for the last year, and NVIDIA is more than seven times higher. Apple is twice again as big. Surely this logic goes the other way. It's Intel and AMD and ARM Ltd. that are at the mercy of the big players dumping their way to dominance.
But obviously that's not happening. The competitive landscape in this industry is just fine. More products means more innovation and lower prices and bigger markets. That's just capitalism 101, really.
> Where does this love for megaconglomerates consolidating whole markets come from?
Nowhere, because that's just a strawman. I'm sure people here would love to see a GPU startup launch a product. But we're talking about Arc in this thread, and that seems like a good product too.
It's a measure (probably the single best one) of the financial resources a company can bring to bear on a problem. The contention upthread was about "price dumping" and "megacorporation tactics", which surely implies that big companies can do bad things to smaller players with their bigger resources. And it turns out that Intel is a small player in this particular field, as it happens. The logic just doesn't work.
To me it's a mystery that CPUs/motherboards still have expandable RAM; integrated RAM (if you don't skimp and go for the cheap stuff) can be very fast - for example the Steam Deck has quad-channel 5500 MT/s LPDDR5, very console-like.
What is it that makes expandable RAM inherently slower? Is it the connectors raised off the board, adding physical distance to the traces?
The mechanical connector layout forces sub-optimal signalling paths.
You're forced to make everything slower to make things consistent.
That being said, a more optimized connector, board layout and form factor could make large-capacity, lower-latency modules possible. Not sure, for example, if there is anything out there besides Dell's proposed CAMM form factor, and I don't remember that one being promoted as "faster" so much as "smaller".
The main reason is speed. For example, DDR5 has a transfer rate of about 50GB/s. The GDDR6X used in GPUs has a transfer rate of about 1TB/s. Nvidia's just-announced H200 AI processors have HBM3e memory that they report can go as fast as 10TB/s.
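The gap is mostly bus width times per-pin transfer rate. A rough sketch of the arithmetic (the bus widths and data rates below are typical retail figures, not tied to any particular board):

    # Rough peak bandwidth: bytes/s = (bus width in bits / 8) * transfer rate.
    def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
        return bus_width_bits / 8 * mega_transfers_per_s / 1000  # GB/s

    # Dual-channel DDR5-6400 on a desktop board: 2 x 64-bit channels.
    print(peak_bandwidth_gb_s(128, 6400))    # ~102 GB/s (~51 GB/s per channel)

    # GDDR6X on a 384-bit bus at 21,000 MT/s (a high-end consumer GPU).
    print(peak_bandwidth_gb_s(384, 21000))   # ~1008 GB/s, i.e. roughly 1 TB/s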
Lisa Su is the CEO, not the owner, of AMD. It's a public company. If you're implying she's not competing with Nvidia because it's all in the family, well... that's kinda out there.
That software experience does not look as bad as I was expecting.
That performance is shockingly good, am I missing something? Comparable to a Titan RTX? Yes, that card is 4 years older but it cost an order of magnitude more, has almost twice the die area, and has twice the GDDR6 bus width. The A770 theoretical FP16 is apparently a bit higher (39.2T vs 32.6T) but in gaming workloads the Arc cards far under-perform their theoretical performance compared to competitors. I guess ML is less demanding on the drivers, or perhaps some micro-architectural thing. In gaming performance the Titan RTX is about 70% faster than the A770.
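For what it's worth, those theoretical FP16 numbers fall straight out of shader count times clock; a rough sketch assuming the commonly cited spec-sheet figures (4096 FP32 lanes at ~2.4 GHz for the A770, 4608 CUDA cores at ~1.77 GHz boost for the Titan RTX, both with 2x-rate FP16):

    def fp16_tflops(fp32_lanes, boost_ghz):
        # lanes * 2 FLOPs per FMA * clock (GHz), doubled again for 2x-rate FP16
        return fp32_lanes * 2 * boost_ghz * 2 / 1000

    print(fp16_tflops(4096, 2.4))   # Arc A770:  ~39.3 TFLOPS
    print(fp16_tflops(4608, 1.77))  # Titan RTX: ~32.6 TFLOPS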
I'm really looking forward to the next generation now.
From what I can understand, gaming is a strange workload, and Nvidia and AMD have a 20+ year head start on Intel there. Deep learning and AI workloads are probably pretty straightforward in how they're implemented and run, but game developers do all sorts of hacky things to make their games work for the sake of meeting a given deadline (and ultimately, release date). The 20-year advantage of AMD and Nvidia comes from optimizing their graphics drivers for all of the edge-case games as they came out (this is why there may be a driver released specifically for "The Last of Us" on Nvidia cards, for example). Intel has a massive backlog of games to work through and optimize for, and I imagine they also have to figure out which games to prioritize on their optimization todo list.
Many of the DirectX issues on these cards can be almost entirely mitigated with DXVK. It's pretty well-documented at this point, but the Vulkan translation of older DX titles is better than using Intel's native DirectX driver.
For Linux gamers, DXVK is how you'd be playing the game regardless. It's a first-gen card that's not a terrible prospect for the right sort of tinkerer.
As you mentioned, the Titan RTX is ancient when it comes to DL/ML use - if that has a 4-fold performance advantage over Arc, it's not looking great. It's not like you can cram 10 Arcs into a computer to make up for the gap.
For me, the calculus is simple: is the platform maker ensuring their work is merged upstream? If not, what’s the point? The field moves way too fast to rely on forks. Relying on Intel’s comparatively unmaintained PyTorch fork (stuck on PyTorch 1.10) is foolish. Hell, Apple, despite their reputation, has done a better job working with the PyTorch organization than Intel.
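Upstream support is what makes vendor-agnostic code possible without pinning to a fork. A minimal sketch of what I mean (note: the intel_extension_for_pytorch import and the "xpu" device are Intel's out-of-tree extension, so treat that branch as an assumption about your install):

    import torch

    def pick_device():
        # CUDA: in-tree, works with stock upstream PyTorch wheels.
        if torch.cuda.is_available():
            return torch.device("cuda")
        # Apple MPS: merged upstream, available in stock PyTorch on Apple Silicon.
        if torch.backends.mps.is_available():
            return torch.device("mps")
        # Intel XPU: at the time of this thread it still required the
        # out-of-tree intel_extension_for_pytorch package, not plain upstream.
        try:
            import intel_extension_for_pytorch  # noqa: F401
            if hasattr(torch, "xpu") and torch.xpu.is_available():
                return torch.device("xpu")
        except ImportError:
            pass
        return torch.device("cpu")

    x = torch.randn(8, 8, device=pick_device())
    print(x.device)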
I was in the market for a new GPU and I debated between a 4090 and a 4070. The 4070 is fine for the type of work and gaming I do, but I was thinking about the additional RAM. Then I realized that the 4070 has more VRAM than most consumers do, so many models will be optimized for that, and at some point someone will come out with a high-RAM, neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small. I don't know a ton about this stuff so who knows if that was the right decision… but I'm thinking along the same lines as you.
> at some point someone will come out with a high-RAM, neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small
These already exist; they're just disproportionately expensive compared to consumer cards. You can get an Nvidia A100 with 80GB of VRAM on eBay for about $14,000.
You can also buy used enterprise cards on eBay. There may not be many 32GB+ cards up yet, but they'll be on there eventually. The big issue is how to cool these cards, though: both AMD and Nvidia ship them with fanless heatsinks because the servers they get packed into provide airflow via their loud, ultra-fast, potentially digit-severing fans, so people salvaging these cards have to find their own cooling solution and fit them into a given case.
I'm surprised nobody's yet making custom coolers for them (complete coolers with heatsink/fans, not gluing a 3d-printed fan shroud to the front) and re-selling them. Seems like a niche gap in the market.
As far as AMD cards are concerned, you can at least reflash some of them. Nvidia cards... well, I've only seen videos where people claim to have a given driver working, so there might be underhanded means involved.
Also, 24GB is good because it's just barely big enough for the (currently unreleased) Llama V2 34B.
In Llama V1, 7B was clever but kinda dumb, 13B was still short-sighted, but 33B was a big jump to "disturbingly good" territory. 7B/13B finetunes required very specific prompting, but 33B would still give good responses even outside the finetuning format.
16GB is probably perfect for a ~13B model with a very long context and better (5 bit?) quantization, or just partially offloading a ~33B model.
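Rough napkin math behind those size claims (weights only; KV cache and per-layer overhead come on top, so treat these as lower bounds):

    # VRAM for the weights alone: parameters * bits per weight / 8 bytes.
    def weights_gb(params_billions, bits_per_weight):
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weights_gb(13, 5))   # ~8.1 GB:  13B at 5-bit leaves room for long context on a 16GB card
    print(weights_gb(33, 4))   # ~16.5 GB: 33B at 4-bit doesn't quite fit in 16GB, hence partial offload
    print(weights_gb(33, 5))   # ~20.6 GB: comfortable on a 24GB card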
Thanks! That's useful to know. Unfortunately it looks like a lot of them are non-commercial which rules them out for me but there's a couple that look like they might work.
Does it have to be exclusively VRAM to be useful, or is shared RAM also useful here? I have a system that reports 8GB dedicated to the GPU with another 15GB that is shared, and thus a total GPU memory of 23GB.
Driver shared memory is effectively useless. The way it accesses CPU memory is (for now) extremely slow.
First some background: llama is divided into prompt ingestion code, and "layers" for actually generating the tokens.
There are different offloading schemes, but what llama.cpp specifically does is map the layers to different devices. For example, ~7GB of the model could reside on the 8GB GPU, and the other ~9GB would live on the CPU.
During runtime, the prompt is ingested by the GPU all at once (which doesn't take much VRAM), and then the layers are run sequentially for each word. So roughly the first half of the layers for a word run on your GPU and the last half on your CPU, and that alternation repeats until all the words are generated.
The beauty of llama.cpp is that its CPU token generation is (compared to other llama runtimes) extremely fast. So offloading even half or two thirds of a model to a decent CPU is not so bad.
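For reference, that split is a single knob. A hedged example using the llama-cpp-python bindings (the model path is a placeholder, and the right n_gpu_layers value depends on how much VRAM you actually have):

    from llama_cpp import Llama

    # Put roughly the first 30 layers on the GPU; the remaining layers run on
    # the CPU for every generated token, exactly the split described above.
    llm = Llama(
        model_path="./llama-33b.q4_K_M.gguf",  # placeholder path
        n_gpu_layers=30,                       # tune to whatever fits in VRAM
        n_ctx=2048,
    )

    out = llm("Q: Why is partial offloading still usable?\nA:", max_tokens=64)
    print(out["choices"][0]["text"])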
If anything, 24GB is probably the sweet spot for what gets optimized for, since many local LLM enthusiasts are running 3090s and 4090s (and all of them want more VRAM).
> Then I realized that the 4070 has more VRAM than most consumers do
The M1 and M2 Apple processors might change that equation.
The 32GB and 64GB MacBooks can run inference on a lot of these big models.
They might be quite a bit slower in TFLOPS than a 4070, but if they can run a model that the 4070 can't due to limited RAM, then they come out the winner.
That's true, and you do make a disclaimer about performance, but often when this is brought up, people either haven't actually tried MPS or are unaware that a GPU isn't required for inference.
The long and short of it is, right now, CUDA beats TFLOPS.
Now, for training, $10K for an M2 Ultra with 192GB of memory could break even somewhere around a dozen models at current costs, iff the compute is not saturated. The limit of memory bandwidth is only a function of the compute; it doesn't scale infinitely.
I do also have a MacBook Pro with 32 gigs of Ram and an M2 Studio with 64 gigs, so if a great model comes out that supports unified memory, I should be good there!
Why are GPU vendors deciding how much RAM the GPU gets? This feels so 1980s. GenAI workloads have radically variable memory footprints: some folks want 256GB per card so they have a shot at trying something out, and other folks want 8GB per card to maximize throughput.
To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.
> To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.
The problem is latency. Apple can do what they do in terms of performance because they place the RAM extremely close to the SoC and don't have sockets, which present their own RF interference challenges.
If you want to run modular, unified RAM, it gets really messy really fast.
> Why are gpu vendors deciding how much ram the gpu gets?
...because they make them?
You can go buy 256GB systems if you want, at extreme cost. You can also use an 8GB card for whatever you want. What are they doing wrong?
> To add complication, I really don’t want to deal with copying 256GB models between cpu/card.
For deployment, it doesn't really matter. You just memory-map it to the CPU or keep it hot-loaded in GPU memory. With a unified memory model you're still limited by your disk speed, so the performance difference probably wouldn't come out great either way.
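A sketch of what "just memory-map it" means on the CPU side (numpy used purely for illustration, and model.bin is a placeholder; runtimes like llama.cpp do the equivalent with mmap internally):

    import numpy as np

    # Map the weight file into the address space instead of copying it; pages
    # are faulted in from disk (or the OS page cache) only when touched.
    weights = np.memmap("model.bin", dtype=np.float16, mode="r")  # placeholder file

    # Touching a slice pulls in only those pages. A warm page cache makes this
    # nearly free; a cold read is still bounded by disk speed, as noted above.
    print(weights[:8])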
Intel makes microchips, but they do not control how much memory I put in them. There are no 256 GB GPUs available today outside of multi-GPU setups. Would love to see giant NUMA GPU boards.
To my knowledge, this is pretty much how InfiniBand works on datacenter-scale systems like the GH200. The incentive and upsell for bringing this to consumer systems seems weak to me.
At the end of the day it's limited by the PCIE lanes it is attached to, no different from a 2nd GPU.
Just saw one yesterday: 128GB on PCIe 5.0 x8, which is a maximum of 32 GB/s in and out, whereas the 4090 has 1 TB/s of memory bandwidth.
IMO, for inference, dual-socket EPYC (24 channels of DDR5) is the way to go, and CXL can theoretically let you bump up the bandwidth further (assuming you can optimize the software properly). llama.cpp already seems to have some NUMA issues on 2P servers, though.
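Rough numbers behind that comparison, using nominal spec figures (PCIe 5.0 is about 3.94 GB/s usable per lane after 128b/130b encoding; a DDR5-4800 channel is 64 bits wide):

    # CXL expander on PCIe 5.0 x8: the link itself is the ceiling.
    print(3.94 * 8)        # ~31.5 GB/s, i.e. the "32 GB/s in and out" above

    # Dual-socket EPYC, 24 channels of DDR5-4800: 8 bytes * 4.8 GT/s per channel.
    print(8 * 4.8 * 24)    # ~922 GB/s aggregate, within striking distance of a 4090's ~1 TB/s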
At least Intel has some non-x86 hardware for deep learning now. Between killing Nervana on the verge of release and Habana not making a splash, they are looking for a win.
On the latest earnings call, Intel said they have over $1B in their "pipeline of opportunities" for AI, with Gaudi being the "lion's share" of that. Certainly not bad.
I don't know what's in development, but they do have a new chip that started shipping this year, and until recently it was the only ML-oriented product Intel had that was competitive in its segment.
We need ML GPU workload standardization and more options to choose from. It sucks being held hostage by Nvidia. There are no ML GPUs available in my country, and the waitlist runs into years.