Testing Intel’s Arc A770 GPU for Deep Learning (christianjmills.com)
203 points by T-A on Aug 9, 2023 | 81 comments



I know it's not ideal, but I am really rooting for Intel's graphics effort. From launch nearly a year ago to today, the driver team has been knocking out driver issues and slowly but surely trying to achieve parity with AMD and Nvidia. From what I hear, the team is really going for broke on trying to get ML workloads fully supported. When Intel shipped a 16GB card at that price point, you knew they were really trying. It would be cool if Battlemage (the next gen.) shipped with 32GB for larger models.


> From what I hear, the team is really going for broke on trying to get ML workloads fully supported.

I've been waiting on AMD to do that. Embarrassingly, it looks like Intel will get there faster than AMD.


I think you'll be in for Another Massive Disappointment (AMD).

The software/driver woes in their graphics division have been going on for years at this point with no clear path forward, just some statements about how they plan to fix it in the future (e.g. their handling of the geohot AMD rant back in June).

It's not just ML either - on the gaming side, some people swallow the Nvidia price premium for the better drivers.


Agreed, Nvidia's dominance is annoying.

I made irrational purchasing decisions in the past to support the underdog, likening it to using Firefox. Unfortunately with AMD you just get a worse product.

This space sorely needs more competition.


What's annoying is that things don't have to be this way. If our chipmakers like Apple, AMD and Intel were able to actually work together, we could make a proper GPGPU library that puts CUDA in the grave. Nvidia is betting on these companies being driven further apart though, and they just won't stop being proven right.

We need FAANG to put aside their differences to fund an alternative, or else FAANG will suffer the consequences.


> I think you'll be in for Another Massive Disappointment (AMD).

Nah, I pretty much gave up on AMD when the 7000 series launch was imminent and there was still no real ML support for the 6000 series. Not that I was even expecting anything. With the way Intel Arc is coming along, any ML support for Radeon is just a bonus. I refuse to pay the Nvidia tax until it becomes clear there is no other option again. Nvidia just really rubbed me the wrong way with the latest generation.


for games I think the driver gap has narrowed quite a bit


Nvidia’s drivers are of horrid quality. They've basically been break-fix-break-fix-break-fix on Destiny 2 and Sea of Thieves, with the broken versions dropping a locked 144FPS PC to ~50FPS.

They seem to employ zero regression testing.


Maybe Intel has less baggage and can target what they know people actually use?


My opinion is that Intel has excellent software culture compared to other hardware companies. It might be one reason.


I'm a loyal Intel customer because their stuff is, in general, straight up less janky and more practically reliable than AMD's stuff. I agree that Intel knows how to do software in ways that AMD simply doesn't.


But if Intel "wins", it means yet another instance of corporate consolidation, where the majority market holder of one market (CPUs) destroys a whole different market (GPUs) by integrating and price dumping their products into the majority/monopoly market.

It'll be really hard for AMD Radeon cards and Nvidia itself to compete against Intel if they aggressively price dump their GPUs and use typical megacorporation tactics to push out and destroy the competition.

Why would you want that? Why not root for an independent GPU company? Where does this love for megaconglomerates consolidating whole markets come from?


> It'll be really hard for AMD Radeon cards and Nvidia itself to compete against Intel if they aggressively price dump their GPUs and use typical megacorporation tactics to push out and destroy the competition.

AMD's market cap has been higher than Intel's for the last year, and NVIDIA is more than seven times higher. Apple is twice again as big. Surely this logic goes the other way. It's Intel and AMD and ARM Ltd. that are at the mercy of the big players dumping their way to dominance.

But obviously that's not happening. The competitive landscape in this industry is just fine. More products means more innovation and lower prices and bigger markets. That's just capitalism 101, really.

> Where does this love for megaconglomerates consolidating whole markets come from?

Nowhere, because that's just a strawman. I'm sure people here would love to see a GPU startup launch a product. But we're talking about Arc in this thread, and that seems like a good product too.


Why do you think market cap has non-tangential relevancy here?


It is a common metric for the power of a corporation, since it is the amount of money their stock is worth in total. What do you propose instead?


It's a measure (probably the single best one) of the financial resources a company can bring to bear on a problem. The contention upthread was about "price dumping" and "megacorporation tactics", which surely implies that big companies can do bad things to smaller players with their bigger resources. And it turns out that Intel is a small player in this particular field, as it happens. The logic just doesn't work.


It's a mystery to me why graphics cards have hardwired RAM. For CPUs/motherboards RAM has been extensible for decades.


Speed/signaling issues.

To me it's a mystery that CPUs/motherboards still have extensible RAM; integrated RAM (if you don't skimp and go for cheap RAM) can be very fast - for example, the Steam Deck has quad-channel 5500 MT/s LPDDR5, very console-like.


What is it that makes extensible RAM inherently slower? Is it the connectors raised off the board, adding physical distance to the traces?


The mechanical connector layout forces sub-optimal signalling paths.

You're forced to make everything slower to make things consistent.

That being said, a more optimized connector, board layout and form factor could make large-capacity, lower-latency modules possible. I'm not sure there is anything out there besides Dell's proposed CAMM form factor, though, and I don't remember that one being specifically promoted as "faster", mostly "smaller".


The main reason is speed. For example, DDR5 has a transfer rate of about 50GB/s. The GDDR6X used in GPUs has a transfer rate of about 1TB/s. Nvidia's just-announced H200 AI processors have HBM3e memory that they report can go as fast as 10TB/s.
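
Back of the envelope, those figures all fall out of bus width times data rate. A rough sketch (the bus widths and speeds below are typical ballpark configurations I'm assuming, not exact product specs):

    # peak bandwidth (GB/s) ~= bus width in bytes * transfer rate in MT/s / 1000
    def bandwidth_gb_s(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s / 1000

    print(bandwidth_gb_s(64, 6400))     # one DDR5-6400 module     -> ~51 GB/s
    print(bandwidth_gb_s(384, 21000))   # 384-bit GDDR6X @ 21 Gbps -> ~1008 GB/s
    print(bandwidth_gb_s(6144, 6400))   # a wide HBM3e package     -> ~4900 GB/s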


Do you have any thoughts on how oneAPI compares to the Nsight tools?


If anything, it's good to get a new competitor to shake things up. It's not good to have a literal family duopoly. Both CEOs are blood relatives.


Lisa Su is the CEO, not the owner, of AMD. It's a public company. If you're implying she's not competing with Nvidia because it's all in the family, well..... that's kinda out there.


They're distant cousins. Both from the same city, both moved to the US and crushed it, but both families were large and then had lots of kids.


This can be molded into a generational power grab thriller. Think Godfather trilogy, but both Su and Huang are sort of Sith lords.

The family started with nothing, multiplied and played the strategic long game to conquer the AI market to enable Skynet.

But Intel are the Jedi hacker good guys.


It can be, as long as you are willing to take a slight detour from reality.


Yes. This is exactly right.


> trying to achieve parity with AMD and Nvidia

Interesting to put AMD and Nvidia at the same level. From experience developing GPGPU applications, they are not even close.


I mean, if Nvidia’s GPGPU performance was like 100 “units” and AMD’s was like 40 units... Intel was at 1 unit. The gap was far larger.


That software experience does not look as bad as I was expecting.

That performance is shockingly good, am I missing something? Comparable to a Titan RTX? Yes, that card is 4 years older but it cost an order of magnitude more, has almost twice the die area, and has twice the GDDR6 bus width. The A770 theoretical FP16 is apparently a bit higher (39.2T vs 32.6T) but in gaming workloads the Arc cards far under-perform their theoretical performance compared to competitors. I guess ML is less demanding on the drivers, or perhaps some micro-architectural thing. In gaming performance the Titan RTX is about 70% faster than the A770.

I'm really looking forward to the next generation now.

https://www.techpowerup.com/gpu-specs/titan-rtx.c3311

https://www.techpowerup.com/gpu-specs/arc-a770.c3914
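
For what it's worth, those theoretical FP16 numbers fall out of shader count times clock. A rough sketch using the TechPowerUp specs linked above (and ignoring the Titan's tensor cores, which have a much higher FP16 rate):

    # theoretical FP16 TFLOPS ~= shaders * 2 (FMA) * boost GHz * FP16:FP32 ratio / 1000
    def fp16_tflops(shaders, boost_ghz, fp16_ratio=2):
        return shaders * 2 * boost_ghz * fp16_ratio / 1000

    print(fp16_tflops(4096, 2.4))    # Arc A770  -> ~39.3
    print(fp16_tflops(4608, 1.77))   # Titan RTX -> ~32.6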


From what I can understand, gaming is a strange workload that Nvidia and AMD have a 20+ year head start on Intel with. Deep learning and AI workloads are probably pretty straightforward as far as how they're implemented and run, but game developers do all sorts of hacky things to make their games work for the sake of meeting a given deadline (and ultimately, release date). The 20-year advantage of AMD and Nvidia comes from optimizing their graphics drivers for each edge-case game as it came out (this is why there may be a "The Last of Us"-specific driver for Nvidia cards, for example). Intel has a massive backlog of games to work through and optimize for, and they also have to figure out which games to prioritize on their optimization todo list, I imagine.


Many of the DirectX issues on these cards can be almost entirely mitigated with DXVK. It's pretty well-documented at this point, but the Vulkan translation of older DX titles is better than using Intel's native DirectX driver.

For Linux gamers, DXVK is how you'd be playing the game regardless. It's a first-gen card that's not a terrible prospect for the right sort of tinkerer.


As you mentioned, the Titan RTX is ancient when it comes to DL/ML use - if that has 4x the performance of Arc, it's not looking great. It's not like you can cram 10 Arcs into a computer to make up for the gap.


> Titan RTX is ancient when it comes to DL/ML use

I just checked ebay sold listings and Titan RTX cards are selling consistently for between $800 - $1600.


* 50% more GDDR6 bus width


For me, the calculus is simple: is the platform maker ensuring their work is merged upstream? If not, what’s the point? The field moves way too fast to rely on forks. Relying on Intel’s comparatively unmaintained PyTorch fork (stuck on PyTorch 1.10) is foolish. Hell, Apple, despite their reputation, has done a better job working with the PyTorch organization than Intel.


Ship something reasonably priced with 32gb+ and I'd be interested!


I was in the market for a new GPU and debated between a 4090 and a 4070. The 4070 is fine for the type of work and gaming I do, but I was thinking about the additional RAM. Then I realized that the 4070 has more VRAM than most consumers have, so many models will be optimized for that, and at some point someone will come out with a high-RAM neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small. I don’t know a ton about this stuff so who knows if that was the right decision… but I’m thinking along the same lines as you.


> someone will come out with a high-RAM neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small

These already exist, they're just disproportionately expensive compared to consumer cards. You can get an Nvidia A100 with 80GB VRAM on eBay for about $14,000.

3.33x the memory for 9x the price.
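
Quick back of the envelope, assuming a ~$1,600 street price for a 24GB 4090 as the consumer baseline (that price is my assumption, not from the listings above):

    # rough $/GB of VRAM comparison; the 4090 price is an assumption
    a100_price, a100_gb = 14_000, 80
    rtx4090_price, rtx4090_gb = 1_600, 24

    print(a100_gb / rtx4090_gb)        # ~3.3x the memory
    print(a100_price / rtx4090_price)  # ~8.8x the price
    print(a100_price / a100_gb)        # $175 per GB
    print(rtx4090_price / rtx4090_gb)  # ~$67 per GB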


You can also buy used enterprise cards on eBay. There may not be many 32GB+ cards up yet, but they'll be on there eventually. The big issue is how to cool these cards, though: both AMD and Nvidia make them with fanless heatsinks, because the servers they get packed into provide cooling via their loud, ultra-fast, potentially digit-severing fans. People who salvage these cards have to find their own solution to cool them and fit them in a given case.


I'm surprised nobody's yet making custom coolers for them (complete coolers with heatsink/fans, not gluing a 3d-printed fan shroud to the front) and re-selling them. Seems like a niche gap in the market.


How do purchasers get drivers? Don't they require a subscription, or are they pirated?


As far as AMD drivers are concerned, you can at least reflash some cards. Nvidia cards... well I've only seen videos where people say they have a given driver so there might be underhanded means involved.


Or you could get an A6000 for $4k with 48GB of VRAM.


Is 48GB enough to run 70B models?


Yes, with 4 bit quantization.
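
Rough math (a sketch; real 4-bit formats spend closer to 4.5-5 bits per weight once you count scales, plus some KV cache - both figures below are assumptions):

    # approximate VRAM for a 70B model at "4-bit" quantization
    params = 70e9
    bits_per_weight = 4.5          # assumption: q4-style quant incl. overhead
    weights_gb = params * bits_per_weight / 8 / 1e9   # ~39 GB
    kv_cache_gb = 2                # assumption: modest context length
    print(weights_gb + kv_cache_gb)   # ~41 GB, fits in 48 GB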


Also, 24GB is good because it's just barely big enough for the (currently unreleased) Llama V2 34B.

In Llama V1, 7B was clever but kinda dumb, 13B was still short sighted, but 33B was a big jump to "disturbingly good" territory. 7B/13B finetunes required very specific prompting, but 33B would still give good responses even outside the finetuning format.

16GB is probably perfect for a ~13B model with a very long context and better (5 bit?) quantization, or just partially offloading a ~33B model.


I don't know, maybe I'm desensitized by GPT4 but I'm running Llama2 70B and I certainly wouldn't call it 'disturbingly good'.


Base Llama is not good at all, but the finetunes are quite good at their niches.

Llama in general is not great for code completion and fact retrieval, from what I have tried


Check out the Hugging Face leaderboard and try one of the finetunes.

I can't run llama2 70B - but I remember llama1-Guanaco30B was a major leap over the base llama30B model for me.


Thanks! That's useful to know. Unfortunately it looks like a lot of them are non-commercial which rules them out for me but there's a couple that look like they might work.


It's rather outdated. Search for "70b" on Hugging Face, sort by date, and you will find more cutting-edge models.

And you can run this locally with enough combined RAM + VRAM.


Does it have to be exclusively VRAM to be useful, or is shared RAM also useful here? I have a system that reports 8GB dedicated to the GPU with another 15GB that is shared, and thus 23GB of GPU memory.


Driver shared memory is effectively useless. The way it accesses CPU memory is (for now) extremely slow.

First some background: llama is divided into prompt ingestion code, and "layers" for actually generating the tokens.

There are different offloading schemes, but what llama.cpp specifically does is map the layers to different devices. For example, ~7GB of the model could reside on the 8GB GPU, and the other ~9GB would live on the CPU.

During runtime, the prompt is ingested by the GPU all at once (which doesn't take much VRAM), and then for each token the layers are run sequentially. So the first ~half of the layers would run on your GPU and the last half on your CPU, and the alternation repeats until all the tokens are generated.

The beauty of llama.cpp is that its cpu token generation is (compared to other llama runtimes) extremely fast. So offloading even half or two thirds of a model to a decent CPU is not so bad.
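
In practice that split is just one knob; for example, with the llama-cpp-python bindings it's the n_gpu_layers argument. A sketch, assuming a GPU-enabled build, with the model file and layer count as placeholders:

    # sketch: partial GPU offload with llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-13b.q4_k_m.gguf",  # hypothetical local model file
        n_gpu_layers=20,   # first ~20 layers live on the GPU, the rest run on the CPU
        n_ctx=4096,
    )
    out = llm("Q: Why is the sky blue? A:", max_tokens=32)
    print(out["choices"][0]["text"])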


Thank you for the explanation


My compromise:

A 3090.

Ampere will be well supported until it's very obsolete, because of the A100.


If anything, 24GB is probably the sweet spot for what's optimised for as many local LLM enthusiasts are running 3090s and 4090s (and all want more VRAM).


> Then I realized that the 4070 has more VRAM than most consumers have

The M1 and M2 Apple processors might change that equation.

The 32gb and 64gb MacBooks can run inference on a lot of these big models.

They might be quite a bit slower in TFLOPS than a 4070, but if they can run the model while the 4070 can't run it due to limited RAM, then they come out the winner


That's true, and you do make a disclaimer about performance, but often when this is brought up, people either haven't actually tried MPS or are unaware that a GPU isn't required for inference.

The long and short of it is, right now, CUDA beats TFLOPS.

Now for training, $10K for an M2 with 192GB memory could break even somewhere around a dozen models with current costs, iff the compute is not saturated. The limit of memory bandwidth is only a function of the compute; it doesn't scale infinitely.


Yes, I could run Llama 65B on my M1 Max with 64GB of RAM.


I do also have a MacBook Pro with 32 gigs of Ram and an M2 Studio with 64 gigs, so if a great model comes out that supports unified memory, I should be good there!


I just ordered a 3090, but I would have bought a 32GB A770 instead without even blinking.

The card is 256-bit... Intel could have done this, just like the 16GB RTX 4060 Ti with its 128-bit bus.
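
Roughly how that works, as I understand it: GDDR6 chips are 32 bits wide, so a 256-bit bus means eight chips, and a clamshell layout (two chips per channel, which is how the 16GB 4060 Ti does it on 128 bits) would double the capacity:

    # capacity ~= (bus width / 32 bits per GDDR6 chip) * GB per chip * sides
    bus_bits, gb_per_chip = 256, 2            # assuming 16 Gbit (2 GB) chips
    print(bus_bits // 32 * gb_per_chip)       # single-sided: 16 GB (the shipping A770)
    print(bus_bits // 32 * gb_per_chip * 2)   # clamshell:    32 GB (hypothetical)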


Very hopeful for unified/configurable memory.

Why are gpu vendors deciding how much ram the gpu gets? This feels so 1980s. GenAI workloads have radically variable memory footprints, some folks want 256GB per card so they have a spitting shot at trying something out - and other folks want 8GB per card to maximize throughput.

To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.


> To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.

The problem is latency. Apple can do what they do in terms of performance because they place the RAM extremely close to the SoC and don't have sockets that present their own RF interference challenges.

If you want to run modular, unified RAM, it gets really messy really fast.

[1] https://www.ifixit.com/Guide/Macbook+Air+(M2+2022)+Logic+Boa...


> Why are gpu vendors deciding how much ram the gpu gets?

...because they make them?

You can go buy 256gb systems if you want, at extreme cost. You can also use an 8gb card for whatever you want too. What are they doing wrong?

> To add complication, I really don’t want to deal with copying 256GB models between cpu/card.

For deployment, it doesn't really matter. You just memory-map it to the CPU or keep it hot-loaded in GPU memory. With a unified memory model you're still limited by your disk speed, so the performance difference probably wouldn't come out great either way.
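
Memory-mapping in that sense is nearly a one-liner; a sketch with numpy, where the file name, dtype and shape are placeholders:

    # sketch: memory-map a weights file so only touched pages get read from disk
    import numpy as np

    weights = np.memmap("model_weights.bin", dtype=np.float16,
                        mode="r", shape=(70_000, 8192))
    row = np.array(weights[123])   # pulls in just that slice via the page cache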


Intel makes microchips, but they do not control how much memory I put in them. There are no 256 GB GPUs available today outside of multi-GPU setups. Would love to see giant NUMA GPU boards.


> Would love to see giant NUMA GPU boards.

To my knowledge, this is pretty much how InfiniBand works on datacenter-scale systems like the GH200. The incentive and upsell for bringing this to consumer systems seems weak to me.


CXL is that, but a compromise. Practically, that means an APU.

Intel/AMD are reportedly coming out with wide M1-Pro like chips.


At the end of the day it's limited by the PCIe lanes it is attached to, no different from a second GPU.

Just saw one yesterday: 128GB over PCIe 5.0 x8, which is a maximum of 32 GB/s in and out, whereas the 4090 has 1 TB/s of memory bandwidth.

IMO, for inference, dual-socket EPYC (24-channel DDR5) is the way to go, and CXL can theoretically let you bump up the bandwidth in that situation (assuming you can optimize the software properly). Already in llama.cpp there seem to be some issues with NUMA on 2P servers.
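
The back-of-the-envelope numbers, assuming 4800 MT/s DDR5 per channel (that speed is my assumption):

    # rough bandwidth comparison
    pcie5_x8 = 8 * 4                   # ~4 GB/s per PCIe 5.0 lane -> ~32 GB/s
    epyc_2p  = 24 * 8 * 4800 / 1000    # 24 ch * 8 B * 4800 MT/s   -> ~922 GB/s
    rtx4090  = 1008                    # GB/s of GDDR6X, for comparison
    print(pcie5_x8, epyc_2p, rtx4090)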


At least Intel has some non x86 hardware for deep learning now. Between killing Nervana on the verge of release and Habana not making a splash, they are looking for a win.


On the latest earnings call, Intel said they have over $1B in their "pipeline of opportunities" for AI, with Gaudi being the "lion's share" of that. Certainly not bad.


> Habana not making a splash

Yeah this is sad.

Intel is supposedly turning it into a tensor core-ish accelerator on Falcon Shores... And, hopefully, lower end Arc cards.


They also bought (and killed) Movidius, iirc


Movidius's death is greatly exaggerated.

Movidius AI accelerators are going to be there as part of every single Meteor Lake chip.


I don't know what's in development, but they do have a new chip that started shipping this year, and until recently it was the only ML-oriented product Intel had that was competitive in its segment.


Been waiting on Keem Bay for so long it's now a running joke.


We need ML GPU workload standardization, and more options to choose from. It sucks being held hostage by Nvidia. There are no ML GPUs available in my country, and the waitlist is years long.


Intel Arc is not a particularly big worry for training with mixed fp32/fp16.

On LLM inference we’ve got a baseball game.


For the price, being within spitting distance of a Titan seems pretty damn good.


I nearly bricked a machine trying to set up an A770 for deep learning. Many of the same issues the author reports.



