And yet people seem to work just fine with ML on AMD GPUs when they aren’t think...

FeepingCreature · 2024-06-17T13:50:35 1718632235

I have a 7900 XTX. There's a known firmware crash issue with ComfyUI. It's been reported like a year ago. Every rocm patch release I check the notes, and every release it goes unfixed. That's not to go into the intense jank that is the rocm debian repo. If we need DL at work, I'll recommend Nvidia, no question.

logicchains · 2024-06-17T06:20:40 1718605240

Which AMD GPUs? Most consumer AMD GPUs don't even support ROCm.

JonChesterfield · 2024-06-17T08:06:59 1718611619

Debian, Arch and Gentoo have ROCm built for consumer GPUs. Thus so do their derivatives. Anything gfx9 or later is likely to be fine and gfx8 has a decent chance of working. The https://github.com/ROCm/ROCm source has build scripts these days.

At least some of the internal developers largely work on consumer hardware. It's not as solid as the enterprise gear but it's also very cheap so overall that seems reasonable to me. I'm using a pair of 6900XT, with a pair of VII's in a backup machine.

For turn key proprietary stuff where you really like the happy path foreseen by your vendor, in classic mainframe style, team green is who you want.

paulmd · 2024-06-17T14:34:59 1718634899

> For turn key proprietary stuff where you really like the happy path foreseen by your vendor

there really was no way for AMD to foresee that people might want to run GPGPU workloads on their polaris cards? isn't that a little counterfactual to the whole OpenCL and HSA Framework push predating that?

Example: it's not that things like Bolt didn't exist to try and compete with Thrust... it's that the NVIDIA one has had three updates in the last month and Bolt was last updated 10 years ago.

You're literally reframing "having working runtime and framework support for your hardware" as being some proprietary turnkey luxury for users, as well as an unforeseeable eventuality for AMD. It wasn't a development priority, but users do like to actually build code that works etc.

That's why you got kicked to the curb by Blender - your OpenCL wasn't stable even after years of work from them and you. That's why you got kicked to the curb by Octane - your Vulkan Compute support wasn't stable enough to even compile their code successfully. That's the story that's related by richg42 about your OpenGL driver implementation too - that it's just paper features and resume-driven development by developers 10 years departed all the way down.

The issues discussed by geohotz aren't new, and they aren't limited to ROCm or deep learning in general. This is, broadly speaking, the same level of quality that AMD has applied to all its software for decades. And the social-media "red team" loyalism strategy doesn't really work here, you can't push this into "AMD drivers have been good for like 10 years now!!!" fervor when the understanding of the problems are that broad and that collectively shared and understood. Every GPGPU developer who's tried has bounced off this AMD experience for literally an entire generation running now. The shared collective experience is that AMD is not serious in the field, and it's difficult to believe it's a good-faith change and interest in advancing the field rather than just a cashgrab.

It's also completely foreseeable that users want broad, official support for all their architectures, and not one or two specifics etc. Like these aren't mysteries that AMD just accidentally forgot about, etc. They're basic asks that you are framing as "turnkey proprietary stuff", like a working opencl runtime or a working spir-v compiler.

What was it linus said about the experience of working with NVIDIA? That's been the experience of the GPGPU community working with AMD, for decades. Shit is broken and doesn't work, and there's no interest in making it otherwise. And the only thing that changed it is a cashgrab, and a working compiler/runtime is still "turnkey proprietary stuff" they have to be arm-twisted into doing by Literally Being Put On Blast By Geohotz Until It's Fixed. "Fuck you, AMD" is a sentiment that there is very valid reasons to feel given the amount of needless suffering you have generated - but we just don't do that to red team, do we?

But you guys have been more intransigent about just supporting GPGPU, no matter what framework, please just pick one, get serious and start working already than NVIDIA ever was about wayland etc. You've blown decades just refusing to ever shit or get off the pot (without even giving enough documentation for the community to just do it themselves). And that's not an exaggeration - I bounced off the AMD stack in 2012, and it wasn't a new problem then either. It's too late for "we didn't know people wanted a working runtime or to develop on gaming cards" to work as an excuse, after decades of overt willing neglect it's just patronizing.

Again, sorry, this is ranty, it's not that I'm upset at you personally etc, but like, my advice as a corporate posture here is don't go looking for a ticker-tape parade for finally delivering a working runtime that you've literally been advertising support for for more than a decade like it's some favor to the community. These aren't "proprietary turnkey features" they're literally the basics of the specs you're advertising compliance with, and it's not even just one it's like 4+ different APIs that have this problem with you guys that has been widely known, discussed in tech blogs etc for more than a decade (richg42). I've been saying it for a long time, so has everyone else who's ever interacted with AMD hardware in the GPGPU space. Nobody there cared until it was a cashgrab, actually half the time you get the AMD fan there to tell you the drivers have been good for a decade now (AMD cannot fail, only be failed). It's frustrating. You've poisoned the well with generations of developers, with decades of corporate obstinance that would make NVIDIA blush, please at least have a little contrition about the whole experience and the feelings on the other side here.

JonChesterfield · 2024-06-17T22:08:33 1718662113

You're saying interesting things here. It's not my perspective but I can see how you'd arrive at it. Worth noting that I'm an engineer writing from personal experience, the corporate posture might be quite divergent from this.

I think Cuda's GPU offloading model is very boring. An x64 thread occasionally pushes a large blob of work into a stream and sometime later finds out if it worked. That does however work robustly, provided you don't do anything strange from within the kernel. In particular allocating memory on the host from within the kernel deadlocks the kernel unless you do awkward things with shuffling streams. More ambitious things like spawning a kernel from a kernel just aren't available - there's only a hobbled nested lifetime thing available. The volta threading model is not boring but it is terrible, see https://stackoverflow.com/questions/64775620/cuda-sync-funct...

HSA puts the x64 cores and the gpu cores on close to equal footing. Spawning a kernel from a kernel is totally fine and looks very like spawning one from the host. Everything is correctly thread safe so calling mmap from within a kernel doesn't deadlock things. You can program the machine as a large cluster of independent cores passing messages to one another. For the raw plumbing, I wrote https://github.com/jonchesterfield/hostrpc. That can do things like have an nvidia card call a function on an amd one. That's the GPU programming model I care about - not passing blobs of floating point math onto some accelerator card, I want distributed graph algorithms where the same C++ runs on different architectures transparently to the application. HSA lends itself to that better than Cuda does. But it is rather bring your own code.

That is, I think the more general architecture amdgpu is shipping is better than the specialised one cuda implements, despite the developer experience being rather gnarlier. I can't express the things I want to on nvptx at all so it doesn't matter much that simpler things would work more reliably.

Maybe more relevant to your experience, I can offer some insight into the state of play at AMD recently and some educated guesses at the earlier state. ATI didn't do compute as far as I know. Cuda was announced in 2006, same year AMD acquired ATI. Intel Core 2 was also 2006 and I remember that one as the event that stopped everyone buying AMD processors. Must have been an interesting year to be in semiconductors, was before my time. So in the year cuda appears, ATI is struggling enough to be acquired, AMD mortgages itself to the limit to make the acquisition and Intel obsoletes AMD's main product.

I would guess that ~2007 marked the beginning of the really bad times for AMD. Even if they could guess what cuda would become they were in no position to do anything about it. There is scar tissue still evident from that experience. In particular, the games console being the breadwinner for years can be seen in some of the hardware decisions, and I've had an argument with someone whose stance was that semi-custom doesn't need a feature so we shouldn't do it.

What turned the corner is the DoE labs being badly burned by reliance on a single vendor for HPC. AMD proposed a machine which looks suspiciously like a lot of games consoles with the power budget turned way up and won the Frontier bid with it. That then came with a bunch of money to try to write some software to run on it which in a literal sense created the job opening I filled five years back. Intel also proposed a machine which they've done a hilariously poor job of shipping. So now AMD has built a software stack which was razor focused on getting the DoE labs to sign the cheques for functionally adequate on the HPC machines. That's probably the root of things like the approved hardware list for ROCm containing the cards sold to supercomputers and not so much the other ones.

It turns out there's a huge market opportunity for generative AI. That's not totally what the architecture was meant to do but whatever, it likes memory bandwidth and the amdgpu arch does do memory bandwidth properly. The rough play for that seems to be to hire a bunch of engineers and buy a bunch of compiler consultancies and hope working software emerges from that process, which in fairness does seem to be happening. The ROCm stack is irritating today but it's a whole different level of QoI relative to before the Frontier bring up.

Note that there's no apology nor contrition here. AMD was in a fight to survive for ages and rightly believed that R&D on GPU compute was a luxury expense. When a budget to make it work on HPC appeared it was spent on said HPC for reasonable fear that they wouldn't make the stage gates otherwise. I think they've done the right thing from a top level commercial perspective for a long time - the ATI and Xilinx merges in particular look great.

Most of my colleagues think the ROCm stack works well. They use the approved Ubuntu kernel and a set of ROCm libraries that passed release testing to iterate on their part of the stack. I suspect most people who treat the kernel version and driver installation directions as important have a good experience. I'm closer to the HN stereotype in that I stubbornly ignore the binary ROCm release and work with llvm upstream and the linux driver in whatever state they happen to be in, using gaming cards which usually aren't on the supported list. I don't usually have a good time but it has definitely got better over the years.

I'm happy with my bet on AMD over Nvidia, despite the current stock price behaviour making it a serious financial misstep. I believe Lisa knows what she's doing and that the software stack is moving in the right direction at a sufficient pace.

sorenjan · 2024-06-17T23:15:32 1718666132

> I think Cuda's GPU offloading model is very boring.

> That is, I think the more general architecture amdgpu is shipping is better than the specialised one cuda implements, despite the developer experience being rather gnarlier.

This reminded me of that Twitter thread that was linked on HN yesterday, specifically the part about AMD's "true" dual core compared to Intel's "fake" dual core.

> We did launch a “true” dual core, but nobody cared. By then Intel’s “fake” dual core already had AR/PR love. We then started working on a “true” quad core, but AGAIN, Intel just slapped 2 dual cores together & called it a quad-core. How did we miss that playbook?! AMD always launched w/ better CPUs but always late to mkt. Customers didn’t grok what is fake vs real dual/quad core. If you do cat /proc/cpu and see cpu{0-3} you were happy.

https://news.ycombinator.com/item?id=40696384

What is the currently available best way to write GPGPU code to be able to ship a single install.exe to end users that contains compiled code that runs on their consumer class AMD, Nvidia, and Intel graphics cards? Would AdaptiveCpp work?

JonChesterfield · 2024-06-18T07:44:44 1718696684

Shipping compiled code works fine if you have a finite set of kernels. Just build everything for every target, gzip the result and send it out. People are a bit reluctant to do that because there are lots of copies of essentially the same information in the result.

I suspect every solution you'll find which involves sending a single copy of the code will have a patched copy of llvm embedded in said install.exe, which ideally compiles the kernels to whatever is around locally at install time, but otherwise does so at application run time. It's not loads of fun deriving a program from llvm but it has been done a lot of times now.

sorenjan · 2024-06-18T11:47:30 1718711250

> Shipping compiled code works fine if you have a finite set of kernels. Just build everything for every target, gzip the result and send it out. People are a bit reluctant to do that because there are lots of copies of essentially the same information in the result.

That's kind of the point, you have to build everything for a lot of different targets. And what happens a year from now when the client have bought the latest GPU and wants to run the same program on that? Not having an intermediary compile target like RTX is a big downside, although I guess it didn't matter for Frontier.

I can't find any solution, AdaptiveCpp seems like the best option but they say Windows support is highly experimental because they depend on a patched llvm, and they only mention OpenMP and Cuda backends anyway. Seems like Cuda is still the best Windows option.

JonChesterfield · 2024-06-18T13:39:58 1718717998

There's a degree of moving the goalposts there.

Shipping some machine code today to run on a GPU released tomorrow doesn't work anywhere. Cuda looks like it does provided someone upgrades the cuda installation on the machine after the new GPU is released because the ptx is handled by the cuda runtime. HSAIL was meant to do that on amdgpu but people didn't like it.

That same trick would work on amdgpu - compile to spir-v, wait for a new GPU, upgrade the compiler on the local machine, now you can run that spir-v. The key part is the installing a new JIT which knows what the new hardware is, even if you're not willing to update the program itself. Except that compile to spir-v is slightly off in the weeds for compute kernels at present.

It's tempting to view that as a non-issue. If someone can upgrade the cuda install on the machine, they could upgrade whatever program was running on cuda as well. In practice this seems to annoy people though which is why there's a gradual move toward spir-v, or to shipping llvm IR and rolling the die on the auto-upgrade machinery handling it. Alternatively ship source code and compile it on site, that'll work for people who are willing to apt install a new clang even if they aren't willing to update your program.

sorenjan · 2024-06-18T15:53:05 1718725985

> Cuda looks like it does

And that's what matters. It might be seen like moving the goal posts by someone who knows how it works in the background and what kind of work is necessary to support the new architecture, but that's irrelevant to end users. Just like end users didn't care about "true" or "fake" dual cores.

> If someone can upgrade the cuda install on the machine, they could upgrade whatever program was running on cuda as well.

No, because that would mean that all GPGPU developers have to update their code to support the new hardware, instead of just the runtime taking care of it. I think you're more focused on HPC, data centers, and specialized software with active development and a small user base, but how would that work if I wanted to run a image processing program, video encoder, game, photogrammetry, etc, and the developer have lost interest in it years ago? Or if I have written some software and don't want to have to update it because there's a new GPU out? And isn't the cuda runtime installed by default when installing the driver, which auto updates?

> there's a gradual move toward spir-v

It was introduced 9 years ago, what's taking so long?

> Alternatively ship source code and compile it on site, that'll work for people who are willing to apt install a new clang

Doesn't seem to work well on Windows since you need to use a patched Clang, and some developers have a thing about shipping their source code.

On the whole both the developer and the Windows user experience is still very unergonomic, and I really expected the field to have progressed further by now. Nvidia is rightly reaping the reward from their technical investments, but I still hope for a future where I can easily run the same code on any GPU. But I hoped for that 15 years ago.

paulmd · 2024-06-20T02:19:06 1718849946

I mean, you say there was “just no money” but AMD signed a deal over three years ago to acquire Xilinx for $50b. They’ve been on an acquisition spree in fact. Just not anything related to gpgpu, because that wasn’t a priority.

Yes, after you spend all your money there’s nothing left. Just like after refusing the merger with nvidia and then spending all the cash buying ati there was nothing left. Times were very tough, ATI and consoles kept the company afloat after spending all their money overpaying for ATI put you there in the first place. Should have done the merger with nvidia and not depleted your cash imo.

More recently could easily have spent 0.5% of the money you spent on Xilinx and 10x’d your spend on GPGPU development for 10 years instead. That was 2020-2021 - it’s literally been 5+ years since things were good enough to spend $50 billion on a single acquisition.

You also spent $4b on stock buybacks in 2021... and $8 billion in 2022... and geohotz pointed out your runtime still crashed on the sample programs on officially-supported hardware/software in 2023, right?

Like the assertion that a single dime spent in any other fashion than the way it happened would have inevitably led to AMD going under while you spend an average of tens of billions of dollars a year on corporate acquisitions is silly. Maybe you legitimately believe that (and I have no doubt times were very very bad) but I suggest that you’re not seeing the forest for the trees there. Software has never been a priority and it suffered from the same deprioritizing as the dGPU division and Radeon generally (in the financial catastrophe in the wake of the ATI debacle). Raja said it all - gpus were going away, why spend money on any of it? You need some low end stuff for apus, they pulled the plug on everything else. And that was a rational, albeit shortsighted, decision to keep the company afloat. But that doesn’t mean it’s the only course which could have done that, that’s a fallacy/false logic.

https://youtu.be/590h3XIUfHg?t=1956

Nothing drives this home more than than this very interview with Lisa Su where she is repeatedly probed around her priorities and it always comes back to hardware. Even when she is directly asked multiple sequential questions probing at her philosophy around software, she literally flatly says hardware is what’s important and refuses to even say it’s a priority today. “I don’t know that. I am a hardware person”, and that’s how she’s allocated resources too.

https://news.ycombinator.com/item?id=40703420

I agree Xilinx is a very important piece for building systems, and it’s led to a lot of innovation around your packaging etc, but it’s also been a massive miss on systems engineering too. Nvidia has picked PHYs with higher bandwidth-density per era and built the hardware to scale the system up, as well as the software to enable it all. The network is hardware too, there is much more to hardware than just single-package performance and AMD is falling behind at most of it, even with Xilinx.

Again, though, from a developer perspective every single person’s formative experience with gpgpu has been getting excited to try opencl, APP, HSA, ROCm, or HIP, and then AMD shattering it, followed by “wow CUDA just works”. You guys have been an awful company to work with from the dev side.

And again, I’m sure you do look at it in terms of “we target the commercial stuff because they pay us money to” but you really should be thinking in terms of where your addressable market is. Probably 70% of your market (including iGPUs) is using GCN era gaming hardware. You simply choose not to target these users. This includes a large number of current-gen products which you continue to sell into the market btw - those 5700G are Vega. Another 25% or so is using rdna1/2 and you didn’t target these until very recently (still not on Linux, the dominant platform for this). And you are continuing to add numbers to this market with your zen4 APUs - you simply choose not to support these on ROCm at all, in fact.

Your install base is backwards-weighted, and in fact is only becoming moreso as you lose steam in the desktop market - 85% of dGPUs sold this gen were the competition's. And literally going into this generation you completely kicked all legacy hardware to the curb anyway, except for Radeon vii, the weird fragile card that nobody bought because it was worse than a 2080 at a higher price a year later. You have no long-term prospects because you stubbornly refuse to support the hardware that is going to be accessible to the people who want to write the next 20 years of software for you, you are literally targeting the inverse of your install base.

Sorry to say it, and again not mad at you personally etc, but the other comments are right that this is seemingly a problem that everyone can see except for the people who work at AMD. The emperor has no software.

It is hard to think of a more perfect example of the Innovator's Dilemma: you targeted the known, stable markets while a smaller, more agile, more focused (in R&D spend, etc) competitor visibly created a whole new segment, that you continued to ignore because there still wasn't Big Money in it yet. It's a tale as old as time, and it always comes from a place of execs making safe, justifiable, defensible decisions that would have been completely rational if things had gone differently and GPGPU compute didn't become a thing. But it's not a defense to the question of "why did you miss the boat and what are you going to do differently going forward", either. Understanding why the decision was made doesn't mean it's a good one.

One of the points I've been making is that the reason the mining booms keep coming back, and AI booms, etc is that "dense compute" has obviously been a thing for a while now. Not just HPC, not just crypto, but there are lots of things which simply need extreme arithmatic/bandwidth density above all else, and as various fields discover that need we get surges into the consumer gaming market. Those are the wakeup calls, on top of HPC and AI markets visibly and continuously expanding for a decade now, while AMD watched and fiddled. And again, as of 5+ years ago AMD was not nearly so destitute they couldn't start moving things forward a bit - you were so destitute you were looking to start acquisitions to the tune of tens of billions a year, tens of billions of dollars in stock buybacks, etc.

I just flatly reject the idea that this was the best posible allocation of every dollar within the company such that this was flatly not achievable. You had money, you just didn't want to spend it on this. Especially when the CEO is Hardware Mafia and not the Software Gang.

(and I'm nodding intentionally to the "Bomber Mafia" in WW2 - yeah, the fighter aircraft probably can't escort the bombers into germany, you're right, and that's mostly because the bomber mafia blocked development of drop tanks, but hindsight is 20/20 and surely it seemed like a rational decision at the time!)

https://www.youtube.com/watch?v=I7aGC6Sp8zQ

I also frankly think there is a real concerning problem with AMD and locus-of-control, it's a very clear PTSD symptom both for the company and the fans. Some spat with Intel 20 years ago didn't make AMD spend nearly a hundred billion dollars on acquisitions and stock buybacks instead of $100m on software. Everything constantly has to tie back to someone else rather than decisions that are being made inside the company - you guys are so battered and broken that (a) you can't see that you're masters of your own destiny now, and (b) that times are different now and you both have money to spend now and need to spend it. You are the corporate equivalent of a grandma eating rotten food despite having an adequate savings/income, because that's how things were during the formative years for you. You have money now, stop eating rotten food, and stop insisting that eating rotten food is the only way to survive. Maybe 20 years ago, but not today.

I mean, it's literally been over 20 years now. At what point is it fair to expect AMD leadership to stand by their own decisions in their own right? Will we see decisions made in 2029 be justified with "but 25 years ago..."? 30 years? More? It's a problem with you guys: if the way you see it is nothing is ever your responsibility or fault, then why would you ever change course? Which is exactly what Lisa Su is saying there. I don't expect a deeply introspective postmortem of why they lost this one, but at least a "software is our priority going forward" would be important signaling to the market etc. Her answer isn't that, her answer is everything is going great and why stop when they're winning. Except they're not.

paulmd · 2024-06-20T16:59:11 1718902751

it's also worth pointing out that you have abdicated driver support on those currently-sold Zen2/3 APUs with Vega as well... they are essentially legacy-support/security-update-only. And again, I'm sure you see it as "2017 hardware" but you launched hardware with it going into 2021 and that hardware is still for sale, and in fact you continue to sell quite a few Zen2/3 APUs in other markets as well.

if you want to get traction/start taking ground, you have to actually support the hardware that's in people's PCs, is what I'm saying. The "we support CDNA because it is a direct sale to big customers who pay us money to support it" is good for the books, but it leads to exactly this place you've found yourselves in terms of overall ecosystem. You will never take traction if you don't have the CUDA-style support model both for hardware support/compatibility and software support/compatibility.

it is telling that Intel, who is currently in equally-dire financial straits, is continuing to double-down on their software spending. At one point they were running -200% operating margins on the dGPU division, because they understand the importance. Apple understands that a functional runtime and a functional library/ecosystem are table stakes too. It literally, truly is just an AMD problem, which brings us back to the vision/locus-of-control problems with the leadership. You could definitely have done this instead of $12 billion of stock buybacks in 2021/2022 if you wanted to, if absolutely nothing else.

(and again, I disagree with the notion that every single other dollar was maximized and AMD could not have stretched themselves a dollar further in any other way - they just didn't want to do that for something that was seen as unimportant.)

lhl · 2024-06-17T07:59:31 1718611171

ROCm 6.0 and 6.1 list RDNA3 (gfx1100) and RDNA2 (gfx1030) in their supported architectures list: https://rocm.docs.amd.com/en/latest/compatibility/compatibil...

Although "official" / validated support^ is only for PRO W6800/V620 for RDNA2 and RDNA3 RX 7900's for consumer. Based on lots of reports you can probably just HSA_OVERRIDE_GFX_VERSION override for other RDNA2/3 cards and it'll probably just work. I can get GPU-accelerate ROCm for LLM inferencing on my Radeon 780M iGPU for example w/ ROCm 6.0 and HSA_OVERRIDE_GFX_VERSION=11.0.0

(In the past some people also built custom versions of ROCm for older architectures (eg ROC_ENABLE_PRE_VEGA=1) but I have no idea if those work still or not.)

^ https://rocm.docs.amd.com/projects/install-on-linux/en/lates...