I know it's not ideal, but I am really rooting for Intel's effort in graphics. From launch nearly a year ago to today, the driver team has been knocking out issues and slowly but surely working toward parity with AMD and Nvidia. From what I hear, the team is really going for broke on getting ML workloads fully supported. When Intel shipped a 16GB card at that price point, you knew they were really trying; it would be cool if Battlemage (the next generation) shipped with 32GB for larger models.
I think you'll be in for Another Massive Disappointment (AMD).
The software/driver woes in their graphics division have been going on for years at this point with no clear path forward, just some statements about how they plan to fix it in the future (e.g. how they handled the geohot AMD rant back in June).
It's not just ML either - on the gaming side, some people swallow the Nvidia price premium for the better drivers.
I made irrational purchasing decisions in the past to support the underdog, likening it to using Firefox. Unfortunately with AMD you just get a worse product.
What's annoying is that things don't have to be this way. If our chipmakers like Apple, AMD and Intel were able to actually work together, we could make a proper GPGPU library that puts CUDA in the grave. Nvidia is betting on these companies being driven further apart though, and they just won't stop being proven right.
We need FAANG to put aside their differences to fund an alternative, or else FAANG will suffer the consequences.
> I think you'll be in for Another Massive Disappointment (AMD).
Nah, I pretty much gave up on AMD when the 7000 series launch was imminent and there was still no real ML support for the 6000 series. Not that I was even expecting anything. With the way Intel Arc is coming along, any ML support for Radeon is just a bonus. I refuse to pay the Nvidia tax until it becomes clear there is no other option again. Nvidia just really rubbed me the wrong way with the latest generation.
Nvidia’s drivers are of horrid quality. They've basically been on a break-fix-break-fix-break-fix cycle with Destiny 2 and Sea of Thieves, with the broken versions dropping a locked 144FPS PC to ~50FPS.
I'm a loyal Intel customer because their stuff is, in general, straight up less janky and more practically reliable than AMD's stuff. I agree that Intel knows how to do software in ways that AMD simply doesn't.
But if Intel "wins", it means yet another instance of corporate consolidation, where the majority holder of one market (CPUs) destroys a whole different market (GPUs) by integrating and price dumping their products into the majority/monopoly market.
It'll be really hard for AMD's Radeon cards and Nvidia itself to compete against Intel if Intel aggressively price dumps its GPUs and uses typical megacorporation tactics to push out and destroy competition.
Why would you want that? Why not root for an independent GPU company? Where does this love for megaconglomerates consolidating whole markets come from?
> It'll be really hard for AMD's Radeon cards and Nvidia itself to compete against Intel if Intel aggressively price dumps its GPUs and uses typical megacorporation tactics to push out and destroy competition.
AMD's market cap has been higher than Intel's for the last year, and NVIDIA is more than seven times higher. Apple is twice again as big. Surely this logic goes the other way. It's Intel and AMD and ARM Ltd. that are at the mercy of the big players dumping their way to dominance.
But obviously that's not happening. The competitive landscape in this industry is just fine. More products means more innovation and lower prices and bigger markets. That's just capitalism 101, really.
> Where does this love for megaconglomerates consolidating whole markets come from?
Nowhere, because that's just a strawman. I'm sure people here would love to see a GPU startup launch a product. But we're talking about Arc in this thread, and that seems like a good product too.
It's a measure (probably the single best one) of the financial resources a company can bring to bear on a problem. The contention upthread was about "price dumping" and "megacorporation tactics", which surely implies that big companies can do bad things to smaller players with their bigger resources. And it turns out that Intel is a small player in this particular field, as it happens. The logic just doesn't work.
To me it's a mystery that CPUs/motherboards still have expandable RAM; integrated RAM (if you don't skimp and go for the cheap stuff) can be very fast - for example the Steam Deck has quad-channel 5500 MT/s LPDDR5, very console-like.
What is it that makes expandable RAM inherently slower? Is it the connectors raised off the board, adding physical distance to the traces?
The mechanical connector layout forces sub-optimal signalling paths.
You're forced to make everything slower to make things consistent.
That being said, a more optimized connector, board layout and form factor could make large-capacity, lower-latency modules possible. Not sure, for example, if there is anything out there besides Dell's proposed CAMM form factor, and I don't remember that one being promoted as "faster" so much as "smaller".
The main reason is speed. For example, DDR5 has a transfer rate of about 50GB/s. The GDDR6X used in GPUs has a transfer rate of about 1TB/s. Nvidia's just-announced H200 AI processors have HBM3e memory that they report can go as fast as 10TB/s.
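The gap is mostly bus width times per-pin transfer rate. A rough sketch of the arithmetic (the bus widths and data rates below are typical retail figures, not tied to any particular board):

    # Rough peak bandwidth: bytes/s = (bus width in bits / 8) * transfer rate.
    def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
        return bus_width_bits / 8 * mega_transfers_per_s / 1000  # GB/s

    # Dual-channel DDR5-6400 on a desktop board: 2 x 64-bit channels.
    print(peak_bandwidth_gb_s(128, 6400))    # ~102 GB/s (~51 GB/s per channel)

    # GDDR6X on a 384-bit bus at 21,000 MT/s (a high-end consumer GPU).
    print(peak_bandwidth_gb_s(384, 21000))   # ~1008 GB/s, i.e. roughly 1 TB/s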
Lisa Su is the CEO, not the owner, of AMD. It's a public company. If you're implying she's not competing with Nvidia because it's all in the family, well... that's kinda out there.
That software experience does not look as bad as I was expecting.
That performance is shockingly good, am I missing something? Comparable to a Titan RTX? Yes, that card is 4 years older but it cost an order of magnitude more, has almost twice the die area, and has twice the GDDR6 bus width. The A770 theoretical FP16 is apparently a bit higher (39.2T vs 32.6T) but in gaming workloads the Arc cards far under-perform their theoretical performance compared to competitors. I guess ML is less demanding on the drivers, or perhaps some micro-architectural thing. In gaming performance the Titan RTX is about 70% faster than the A770.
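For what it's worth, those theoretical FP16 numbers fall straight out of shader count times clock; a rough sketch assuming the commonly cited spec-sheet figures (4096 FP32 lanes at ~2.4 GHz for the A770, 4608 CUDA cores at ~1.77 GHz boost for the Titan RTX, both with 2x-rate FP16):

    def fp16_tflops(fp32_lanes, boost_ghz):
        # lanes * 2 FLOPs per FMA * clock (GHz), doubled again for 2x-rate FP16
        return fp32_lanes * 2 * boost_ghz * 2 / 1000

    print(fp16_tflops(4096, 2.4))   # Arc A770:  ~39.3 TFLOPS
    print(fp16_tflops(4608, 1.77))  # Titan RTX: ~32.6 TFLOPS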
I'm really looking forward to the next generation now.
From what I can understand, gaming is a strange workload, and Nvidia and AMD have a 20+ year head start on Intel there. Deep learning and AI workloads are probably pretty straightforward in how they're implemented and run, but game developers do all sorts of hacky things to make their games work for the sake of meeting a given deadline (and ultimately, release date). The 20-year advantage of AMD and Nvidia comes from optimizing their graphics drivers for all of the edge-case games as they came out (this is why there may be a driver released specifically for "The Last of Us" on Nvidia cards, for example). Intel has a massive backlog of games to work through and optimize for, and I imagine they also have to figure out which games to prioritize on their optimization todo list.
Many of the DirectX issues on these cards can be almost entirely mitigated with DXVK. It's pretty well-documented at this point, but the Vulkan translation of older DX titles is better than using Intel's native DirectX driver.
For Linux gamers, DXVK is how you'd be playing the game regardless. It's a first-gen card that's not a terrible prospect for the right sort of tinkerer.
As you mentioned, the Titan RTX is ancient when it comes to DL/ML use - if that has a 4-fold performance advantage over Arc, it's not looking great. It's not like you can cram 10 Arcs into a computer to make up for the gap.
For me, the calculus is simple: is the platform maker ensuring their work is merged upstream? If not, what’s the point? The field moves way too fast to rely on forks. Relying on Intel’s comparatively unmaintained PyTorch fork (stuck on PyTorch 1.10) is foolish. Hell, Apple, despite their reputation, has done a better job working with the PyTorch organization than Intel.
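Upstream support is what makes vendor-agnostic code possible without pinning to a fork. A minimal sketch of what I mean (note: the intel_extension_for_pytorch import and the "xpu" device are Intel's out-of-tree extension, so treat that branch as an assumption about your install):

    import torch

    def pick_device():
        # CUDA: in-tree, works with stock upstream PyTorch wheels.
        if torch.cuda.is_available():
            return torch.device("cuda")
        # Apple MPS: merged upstream, available in stock PyTorch on Apple Silicon.
        if torch.backends.mps.is_available():
            return torch.device("mps")
        # Intel XPU: at the time of this thread it still required the
        # out-of-tree intel_extension_for_pytorch package, not plain upstream.
        try:
            import intel_extension_for_pytorch  # noqa: F401
            if hasattr(torch, "xpu") and torch.xpu.is_available():
                return torch.device("xpu")
        except ImportError:
            pass
        return torch.device("cpu")

    x = torch.randn(8, 8, device=pick_device())
    print(x.device)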
I was in the market for a new GPU and I debated between a 4090 and a 4070. The 4070 is fine for the type of work and gaming I do, but I was thinking about the additional RAM. Then I realized that the 4070 has more VRAM than most consumers do, so many models will be optimized for that, and at some point someone will come out with a high-RAM, neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small. I don't know a ton about this stuff so who knows if that was the right decision… but I'm thinking along the same lines as you.
> at some point someone will come out with a high-RAM, neural-network-only card, which would end up being better than the 4090 anyway because 24 gigs of VRAM is still quite small
These already exist; they're just disproportionately expensive compared to consumer cards. You can get an Nvidia A100 with 80GB of VRAM on eBay for about $14,000.
You can also buy used enterprise cards on eBay. There may not be many 32GB+ cards up yet, but they'll be on there eventually. The big issue is how to cool these cards, though: both AMD and Nvidia ship them with fanless heatsinks because the servers they get packed into provide airflow via their loud, ultra-fast, potentially digit-severing fans, so people salvaging these cards have to find their own cooling solution and fit them into a given case.
I'm surprised nobody's yet making custom coolers for them (complete coolers with heatsink/fans, not gluing a 3d-printed fan shroud to the front) and re-selling them. Seems like a niche gap in the market.
As far as AMD cards are concerned, you can at least reflash some of them. Nvidia cards... well, I've only seen videos where people claim to have a given driver working, so there might be underhanded means involved.
Also, 24GB is good because it's just barely big enough for the (currently unreleased) Llama V2 34B.
In Llama V1, 7B was clever but kinda dumb, 13B was still short-sighted, but 33B was a big jump to "disturbingly good" territory. 7B/13B finetunes required very specific prompting, but 33B would still give good responses even outside the finetuning format.
16GB is probably perfect for a ~13B model with a very long context and better (5 bit?) quantization, or just partially offloading a ~33B model.
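Rough napkin math behind those size claims (weights only; KV cache and per-layer overhead come on top, so treat these as lower bounds):

    # VRAM for the weights alone: parameters * bits per weight / 8 bytes.
    def weights_gb(params_billions, bits_per_weight):
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weights_gb(13, 5))   # ~8.1 GB:  13B at 5-bit leaves room for long context on a 16GB card
    print(weights_gb(33, 4))   # ~16.5 GB: 33B at 4-bit doesn't quite fit in 16GB, hence partial offload
    print(weights_gb(33, 5))   # ~20.6 GB: comfortable on a 24GB card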
Thanks! That's useful to know. Unfortunately it looks like a lot of them are non-commercial which rules them out for me but there's a couple that look like they might work.
Does it have to be exclusively VRAM to be useful, or is shared RAM also useful here? I have a system that reports 8GB dedicated to the GPU with another 15GB that is shared, and thus a total GPU memory of 23GB.
Driver shared memory is effectively useless. The way it accesses CPU memory is (for now) extremely slow.
First some background: llama is divided into prompt ingestion code, and "layers" for actually generating the tokens.
There are different offloading schemes, but what llama.cpp specifically does is map the layers to different devices. For example, ~7GB of the model could reside on the 8GB GPU, and the other ~9GB would live on the CPU.
During runtime, the prompt is ingested by the GPU all at once (which doesn't take much VRAM), and then the layers are run sequentially for each word. So roughly the first half of the layers for a word run on your GPU and the last half on your CPU, and that alternation repeats until all the words are generated.
The beauty of llama.cpp is that its CPU token generation is (compared to other llama runtimes) extremely fast. So offloading even half or two thirds of a model to a decent CPU is not so bad.
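For reference, that split is a single knob. A hedged example using the llama-cpp-python bindings (the model path is a placeholder, and the right n_gpu_layers value depends on how much VRAM you actually have):

    from llama_cpp import Llama

    # Put roughly the first 30 layers on the GPU; the remaining layers run on
    # the CPU for every generated token, exactly the split described above.
    llm = Llama(
        model_path="./llama-33b.q4_K_M.gguf",  # placeholder path
        n_gpu_layers=30,                       # tune to whatever fits in VRAM
        n_ctx=2048,
    )

    out = llm("Q: Why is partial offloading still usable?\nA:", max_tokens=64)
    print(out["choices"][0]["text"])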
If anything, 24GB is probably the sweet spot for what gets optimized for, since many local LLM enthusiasts are running 3090s and 4090s (and all of them want more VRAM).
> Then I realized that the 4070 has more VRAM than most consumers do
The M1 and M2 Apple processors might change that equation.
The 32GB and 64GB MacBooks can run inference on a lot of these big models.
They might be quite a bit slower in TFLOPS than a 4070, but if they can run a model that the 4070 can't due to limited RAM, then they come out the winner.
That's true, and you do make a disclaimer about performance, but often when this is brought up, people either haven't actually tried MPS or are unaware that a GPU isn't required for inference.
The long and short of it is, right now, CUDA beats TFLOPS.
Now, for training, $10K for an M2 Ultra with 192GB of memory could break even somewhere around a dozen models at current costs, iff the compute is not saturated. The limit of memory bandwidth is only a function of the compute; it doesn't scale infinitely.
I do also have a MacBook Pro with 32 gigs of Ram and an M2 Studio with 64 gigs, so if a great model comes out that supports unified memory, I should be good there!
Why are GPU vendors deciding how much RAM the GPU gets? This feels so 1980s. GenAI workloads have radically variable memory footprints: some folks want 256GB per card so they have a shot at trying something out, and other folks want 8GB per card to maximize throughput.
To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.
> To add complication, I really don’t want to deal with copying 256GB models between cpu/card. We should have unified memory.
The problem is latency. Apple can do what they do in terms of performance because they place the RAM extremely close to the SoC and don't have sockets, which present their own RF interference challenges.
If you want to run modular, unified RAM, it gets really messy really fast.
> Why are gpu vendors deciding how much ram the gpu gets?
...because they make them?
You can go buy 256GB systems if you want, at extreme cost. You can also use an 8GB card for whatever you want. What are they doing wrong?
> To add complication, I really don’t want to deal with copying 256GB models between cpu/card.
For deployment, it doesn't really matter. You just memory-map it to the CPU or keep it hot-loaded in GPU memory. With a unified memory model you're still limited by your disk speed, so the performance difference probably wouldn't come out great either way.
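A sketch of what "just memory-map it" means on the CPU side (numpy used purely for illustration, and model.bin is a placeholder; runtimes like llama.cpp do the equivalent with mmap internally):

    import numpy as np

    # Map the weight file into the address space instead of copying it; pages
    # are faulted in from disk (or the OS page cache) only when touched.
    weights = np.memmap("model.bin", dtype=np.float16, mode="r")  # placeholder file

    # Touching a slice pulls in only those pages. A warm page cache makes this
    # nearly free; a cold read is still bounded by disk speed, as noted above.
    print(weights[:8])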
Intel makes microchips, but they do not control how much memory I put in them. There are no 256 GB GPUs available today outside of multi-GPU setups. Would love to see giant NUMA GPU boards.
To my knowledge, this is pretty much how InfiniBand works on datacenter-scale systems like the GH200. The incentive and upsell for bringing this to consumer systems seems weak to me.
At the end of the day it's limited by the PCIE lanes it is attached to, no different from a 2nd GPU.
Just saw one yesterday: 128GB on PCIe 5.0 x8, which is a maximum of 32 GB/s in and out, whereas the 4090 has 1 TB/s of memory bandwidth.
IMO, for inference, dual-socket EPYC (24 channels of DDR5) is the way to go, and CXL can theoretically let you bump up the bandwidth further (assuming you can optimize the software properly). llama.cpp already seems to have some NUMA issues on 2P servers, though.
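Rough numbers behind that comparison, using nominal spec figures (PCIe 5.0 is about 3.94 GB/s usable per lane after 128b/130b encoding; a DDR5-4800 channel is 64 bits wide):

    # CXL expander on PCIe 5.0 x8: the link itself is the ceiling.
    print(3.94 * 8)        # ~31.5 GB/s, i.e. the "32 GB/s in and out" above

    # Dual-socket EPYC, 24 channels of DDR5-4800: 8 bytes * 4.8 GT/s per channel.
    print(8 * 4.8 * 24)    # ~922 GB/s aggregate, within striking distance of a 4090's ~1 TB/s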
At least Intel has some non-x86 hardware for deep learning now. Between killing Nervana on the verge of release and Habana not making a splash, they are looking for a win.
On the latest earnings call, Intel said they have over $1B in their "pipeline of opportunities" for AI, with Gaudi being the "lion's share" of that. Certainly not bad.
I don't know what's in development, but they do have a new chip that started shipping this year, and until recently it was the only ML-oriented product Intel had that was competitive in its segment.
We need ML GPU workload standardization and more options to choose from. It sucks being held hostage by Nvidia. There are no ML GPUs available in my country, and the waitlist runs into years.