I wish AMD would just drop ROCm at this stage and focus on SYCL. The rocRAND/hipRAND woes in this article, if anything, show ROCm in a better light than it deserves; here it at least worked and performed within the same ballpark as CUDA. Often it simply does not work at all, or if it works it's behind by a lot more. At work I simply gave up on our 4x Radeon Pro W6800 workstation because launching TensorFlow with more than 1 GPU would cause a kernel panic every time, and AMD engineers never offered a fix other than "reinstall Ubuntu".
ROCm feels like such a half-assed product, one that (to me at least) seems to have been made to tick a box and look cool in corporate presentations. It's not made with the proper mindset to compete against CUDA. Lisa Su claims they're doubling down on ROCm, but to me it feels like they're falling behind relative to Nvidia, not catching up.
Banding together with Intel to support SYCL would, in my opinion:
1. Ensure there's a lot more momentum behind a single, cross-platform, industry-standard competitor
2. Entice other industry heavyweights like MSFT, Qualcomm, ARM etc to also take the cross-platform solutions more seriously
3. Encourage heavy investment into the developer experience and tooling for the cross-platform solution
SYCL is a terrible option. The performance of any application on SYCL is currently quite poor. Why drop ROCm (used on the world's largest supercomputer?) for an even more unproven product whose roadmap AMD has no control over?
> Banding together with Intel to support SYCL would in my opinion
Except that Intel has more control over SYCL and has repeatedly hurt AMD products with anticompetitive behavior in the past. Why would AMD permit their software to be controlled by a competitor?
AMD is executing with ROCm NOW, with multiple Top500 supercomputer wins and deployments at major cloud vendors with the MI300X. Yes, the software needs to improve, but the practice of throwing out software to start over is not a good strategy.
AMD has far more momentum in the data center than INTC at present.
When it comes to comparisons with SYCL, HIP is much closer to the spirit of SYCL than ROCm is. Both aim to help you write a single codebase that'll run across multiple hardware platforms. For now though, the trajectory of SYCL appears much more promising to me than HIP's. HIP is already split into two parts for CPU and GPU, which is baffling, and neither part seems to receive much love from AMD.
> The performance of any application on SYCL is currently quite poor.
SYCL can get pretty much equivalent kernel performance to, e.g., CUDA. Try looking at SYCL performance papers on arXiv; see [1], for example.
That isn't to say that SYCL code is optimised on every platform without tweaking - you do still need to put effort into target-specific optimizations to get the best performance, just as you would in CUDA or HIP.
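To give a flavour of what SYCL code looks like, here's a minimal vector-add sketch (SYCL 2020 with unified shared memory; my own illustration, not something taken from [1]). The kernel body is plain C++, and the target-specific tuning mentioned above is about things like work-group sizes and memory access patterns rather than the language itself:

    #include <sycl/sycl.hpp>

    int main() {
        constexpr size_t n = 1 << 20;
        sycl::queue q;  // default selector; the device could be an Nvidia, AMD, or Intel GPU depending on the backend

        // Unified shared memory keeps the example short; buffers/accessors are the other option.
        float *a = sycl::malloc_shared<float>(n, q);
        float *b = sycl::malloc_shared<float>(n, q);
        float *c = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // Single-source: this lambda is compiled for whatever device the queue targets.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            c[i] = a[i] + b[i];
        }).wait();

        sycl::free(a, q);
        sycl::free(b, q);
        sycl::free(c, q);
        return 0;
    }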
> Why drop ROCm (used on the world's largest supercomputer?)
Some of the world's largest supercomputers / HPC applications do use SYCL for AMD! The application I'm most aware of for this is GROMACS. As to why? Because having three versions of the same code using different programming APIs is a big maintenance burden.
> Some of the world's largest supercomputers / HPC applications do use SYCL for AMD! The application I'm most aware of for this is GROMACS. As to why? Because having three versions of the same code using different programming APIs is a big maintenance burden.
The fact that GROMACS is unwilling to drop CUDA support to stand fully behind SYCL is very telling.
I wouldn't expect them to drop CUDA support, even if SYCL is a viable alternative:
* The CUDA backend is mature, featureful, and significant effort has been invested into optimising it on Nvidia hardware. One does not simply throw away a performant, well-validated HPC code!
* Nvidia GPUs dominate the GPGPU market - AMD's do not.
* The SYCL backend is still very new in comparison (they even state in the docs to pay extra attention to validation) and doesn't have Nvidia-specific optimisations yet. Why prioritise reimplementing what already exists?
It's important to understand that with these kinds of libraries, it's one thing to get everything working, and quite another to have it all be fast. Given finite resources, AMD are (rightfully) focused on the latter, for a narrower set of use cases - namely big customers with big pockets, using the latest compute cards.
See, for example, the world's fastest supercomputer. That's a whole lot more than a presentation tick box.
As someone doing relatively small scale work, on gaming cards, you just aren't the target user. While these kinds of users are well represented on forums, they're a rounding error in terms of actual $.
Switching to SYCL makes no sense. They need to unseat Nvidia, and specifically CUDA. The whole point of HIP is to clone CUDA, make it easy for users to switch, and piggyback on Nvidia's success.
I used to work in this space (not at AMD lol), and generally I don't think the comments on HN are fair to AMD. Obviously ROCm has a long way to go (CUDA and libs are some truly amazing work), but it's going.
Don't know about the libs, but CUDA itself is a C++ dialect targeted at a family of quirky array processors (GPUs). That takes some dev work by compiler hackers but doesn't seem amazing. Am I missing something? It's of course still easy to mess up such a project by not bringing enough clue, and AMD might have done that with ROCm. Is that what happened, or is it something different?
CUDA is more than a C++ dialect, which is a big thing people keep missing with all those "CUDA replacements".
Until CUDA 3.0 it was similar to OpenCL, a C dialect; afterwards it became a C and C++ dialect, with PTX as the common infrastructure.
PGI targeted PTX with their C, C++, and, very relevant for HPC, Fortran compilers.
PGI was acquired by NVidia, and its compilers became the main set of CUDA compilers.
Given PTX, many other languages started targeting CUDA as well: Java, .NET, Haskell, and Julia, at the very least.
NVidia is now invested in a Python JIT for CUDA as well.
So yeah, while C++20 is the main language in CUDA, there is also a whole ecosystem of programming languages that the "CUDA replacements" keep ignoring.
In these CUDA vs ROCm comparisons I think they mostly compare the C++ dialects. And it's not even particularly the language implementation where ROCm is weak, but rather the whole toolchain. Am I mistaken?
Well, the C++ segment is important, and from what I gather ROCm is failing at that. AMD would be a lot better off if the C++ part worked, even if the other parts don't.
>As someone doing relatively small scale work, on gaming cards, you just aren't the target user.
He didn't use gaming cards. He bought the expensive $1.5k "workstation" graphics cards.
What exactly is one going to use these GPUs for, other than GPGPU? Play video games? Really?
>While these kinds of users are well represented on forums, they're a rounding error in terms of actual $.
There are only three thousand developers who have ever committed to PyTorch. Compared to a datacenter contract, that is a rounding error in terms of sales. Hence AMD shouldn't waste time on letting people independently work on AMD support for PyTorch. They should only work on PyTorch whenever there is a big contract.
However, switching to SYCL makes no sense, because Mesa is getting SYCL support. The likelihood that the Mesa drivers cause a kernel panic is much lower.
One is that ROCm is the entire compute stack - driver, libraries, compilers, languages - while SYCL is a programming language. SYCL is more like one more language that could be built on ROCm than a replacement for it.
The second is that the entire target market wrote all their stuff in CUDA and looks totally unwilling to change to anything else. That makes HIP look like a good idea and SYCL not so much. No one (commercially significant) is likely to port their code to SYCL to benefit from Intel's GPUs.
It's worth saying that AMD bet heavily on cross-platform technology. Initially OpenCL, but take a look at the companies behind HSA as well. I'd say that doesn't appear to have worked at all - they've got the hassle of working with external specifications, and other companies have given up on the specs, so really no one is winning there. I think Triton is a cross-company thing currently being implemented.
Finally, Intel specifically is moribund. Definitely don't want to be relying on them to do anything.
On the bright side, the wide array of GPU programming languages are all very similar to each other from the perspective of the compiler stack, e.g. they all run through the same LLVM back end. So if someone decides they like SYCL on amdgpu enough to write it, a lot of existing code is there to help. All open source, in case it's someone outside of AMD that wants to write it.
I feel this is falling into the rewrite trap - rather than continuing to develop an existing but imperfect system, throwing it all away for something that is unproven and currently in a notably worse state.
What about ROCm makes it fundamentally, by design, that much worse? And what about SYCL makes it better? ROCm is already based on many standard shared OSS projects, like LLVM and Clang - often the same ones SYCL stacks are based on. What about SYCL makes it so superior that it'll make up for throwing all the current work away?
> What about ROCm makes it fundamentally, by design, that much worse? And what about SYCL makes it better?
The fact that SYCL compiles device code to SPIR-V instead of a device-specific ISA like HIP does, for instance. That decision alone is what causes ROCm to support only a very narrow range of cards, which significantly hampers adoption.
Those are implementation details. SYCL and HIP are programming languages (more specifically, C++ dialects); ROCm is an implementation of HIP that emits machine code, and Intel's DPC++/icpx is an implementation of SYCL that emits SPIR-V.
There exists a HIP implementation that also compiles to SPIR-V, chipStar. And one can in theory write a SYCL compiler that generates machine code (maybe with openSYCL's HIP/ROCm backend?).
The woes of the article's author and of various posters in this thread are also side effects of "implementation details". I don't think anyone is critical of HIP as a programming language, as it's just a CUDA clone. Pretty much all complaints (crashes, perf issues, poor device compatibility) regarding HIP/ROCm pertain to poor implementation and execution, including the directly-emitting-device-specific-machine-code problem. AMD can't wriggle out of this one with that excuse.
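To make the "CUDA clone" point concrete, here's a rough HIP sketch of my own (not from the article): apart from the hip-prefixed runtime calls, it uses the same vocabulary and launch syntax as CUDA, which is why hipify-style porting is mostly mechanical.

    #include <hip/hip_runtime.h>

    // Same __global__ / blockIdx / threadIdx model as CUDA.
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *d_x = nullptr;
        hipMalloc((void **)&d_x, n * sizeof(float));     // cudaMalloc -> hipMalloc
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   // triple-chevron launch, as in CUDA
        hipDeviceSynchronize();                          // cudaDeviceSynchronize -> hipDeviceSynchronize
        hipFree(d_x);                                    // cudaFree -> hipFree
        return 0;
    }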
I very much agree - and I think some people misunderstood what I was trying to say too, which was that I don't think switching to SYCL would fix any of the problems that ROCm currently has.
I personally feel many of the problems with ROCm are due to lack of investment and management priority - there is no "technical" solution to that problem. And swapping out parts of the stack will just exacerbate those problems, as now they'd need to rebuild everything from scratch with the same insufficient resources.
It's really sensitive to the (Linux) driver/compiler/library versions all lining up. If you get the blessed Ubuntu kernel version and use it with the exact corresponding driver from dkms and the matching userspace version it's reasonably solid.
If you use the driver in Linux upstream, whether you have a good time or a terrible one varies with the age of the userspace libraries you're pairing with it.
I don't know what to do about this setup. I think it's an artifact of gearing testing towards "releases" of the entire stack with specific HPC operating systems in mind.
To do this right, each component would need to be individually solid and deal with version/interface slip relative to other components, which means the cross product of testing patterns rapidly becomes infeasible. I'm thinking about it but don't see a solution yet.
SYCL is practically comparable to OpenCL, which is already getting unofficial support (including on older hardware that ROCm does not support) via Mesa's RustiCL. Additionally, clspv and Mesa's Zink are both exploring OpenCL (and potentially SYCL)-on-Vulkan, though that option involves lots of quirks since the standards have evolved independently (and are even based on different varieties of SPIR-V).
AMD has a really terrible reputation for dropping old and introducing new APIs on a whim. To the extent that a lot of people in the field are completely burned out on AMD. Dropping yet another API after promising that this is "the one" would completely kill any chance of ever being taken seriously in the field.
A challenge with this is that all current AMD GPU support in SYCL compilers (DPC++ with Codeplay's oneAPI for AMD GPUs, and AdaptiveCpp) is built atop ROCm / HIP.
If AMD were to move away from ROCm, they would have to adopt some other API for SYCL to use as a backend. Potentially this could be the Level Zero spec from the UXL Foundation, but that is not the same as SYCL.
However, no amount of changing between APIs will help with troublesome drivers, OS compatibility issues, or documentation. Those issues are orthogonal.
> At work I simply gave up on our 4x Radeon Pro W6800 workstation because launching TensorFlow with more than 1 GPU would cause a kernel panic every time, and AMD engineers never offered a fix other than "reinstall Ubuntu".
Unfortunately, it's not always easy to get an issue in front of the right people. The W6800 is an officially supported GPU and a kernel panic is a serious bug. I'd like to try to help get this issue properly triaged. Could you send me a link to your report?
Unfortunately I no longer work there so I sadly cannot give you a detailed report. What I do remember is that this was a Threadripper 5995WX system and based on what AMD reps told us at the time the issue was related to some unexpected interaction between the amdgpu driver and the PCIe implementation on the motherboard we were using.
AMD's attempted responses go all the way back to 2007, when CUDA first debuted, starting with "Close to Metal" (https://en.wikipedia.org/wiki/Close_to_Metal). They've had nearly 20 years to fix the situation and have failed to do so. Maybe some third-party player like Lamini AI will do what they couldn't and get acquired for it.
The thing about modern AI is that the operations involved (e.g. dense matmuls) are a lot simpler and more GPU-friendly than what you'd find in typical HPC applications. This means you can get pretty close to peak hardware performance using high-level languages like Python or OpenAI's Triton. I think it's unlikely that the push to improve ROCm's standard libraries will come from an AI-focused startup.
Even with AMD working hard on specific benchmarks for marketing purposes, they could not get close to peak hardware performance on their brand new chips.
That's more or less my experience with AMD, only worse. A critical thing, too, is burned developers like myself.
I'm looking forward to Intel v. NVidia. Arc A770 is a pretty serious competitor. It's the lowest-cost way to run OPT-175B.
Given a 7-slot motherboard, $270 * 7 = $1890 for 112GB of VRAM in one computer. That's sweet. Compute speed would be on par with a top-of-the-line NVidia workstation GPU.
Three of those are enough to run the largest open-source LLMs at around $9000.
We're just drivers + libraries + documentation away, and Intel is not bad at drivers + libraries + documentation.
$9k for three of those? I'm afraid your math ain't mathing.
7-slot motherboards are IMO Threadripper territory. The CPU alone will set you back at least $1.5k, and a motherboard to go along with that is another $1k. That's not including the stack of DDR5 you'll likely throw into it, plus the PSU / case / coolers to feed it power and evacuate the heat from this monstrosity.
That's at least $3000 for the base system before you add any GPUs.
And at some point I'd like someone more enlightened to comment on what the memory bandwidth is likely to be and how exactly you'd distribute the PCIe lanes...
PCIe bandwidth is a limiting factor. For brevity, I did not write an essay.
The rest doesn't change the big picture. I spec'ed this out before, and your numbers are a good bit off if you're willing to cut corners. The cheapest way to do the rest is a refurbished server and slightly older parts. That's <<$1000.
New parts are a little bit more, but nowhere near $3k. The motherboard can be found for under $500 when a deal comes up; I found the Supermicro MBD-M12SWA-TF-O for that. PCIe risers and cooling can be done on the cheap if you're willing to DIY. Etc. Power draw is under 3kW, so a pair of 1.5kW PSUs will do that, for under $300. It's a mess (you're not getting a formal case), but it works.
Very similar systems were built during the bitcoin boom. Look some of those up, and see how they got costs down. That was all about margins.
I would like someone more enlightened to comment on PCIe issues as well.
True, bitcoin miners have probably figured out quite a few things to which I'm fairly oblivious. I always regarded that movement as a complete waste of computing power. Turns out at least some of their stuff may turn out to be useful for LLMs...
As for the specs, I went the new-Threadripper way simply because that's, AFAIK, where you'll get the best balance of PCIe 5.0 lanes (for the memory bandwidth between the GPUs) and power.
I see how you could spec this out with some used Xeon, but for one I really wouldn't like the headache, and for another, you're probably getting PCIe 2.0/3.0 at best (if you're trying to cut costs that much).
That Supermicro you mention seems to be an interesting compromise. The board is selling second-hand for about ~700 EUR in the EU. The Threadripper that goes into it... I don't know, I can't find them used, but I'm guessing something under 1k EUR as well.
Interesting price, but then on top of the risk of using old parts, instead of partitioning PCIe 5.0 (and maybe having the option to go 4.0 and double the count, since your GPUs are PCIe 4.0 anyway), you're now splitting plain 4.0 lanes across your GPUs. Half the bandwidth, though it may end up not mattering all that much. Given that most LLMs are memory-size / memory-bandwidth limited, I probably wouldn't like to risk this. But that's just me...
* The Supermicro was on sale for around $500 a few weeks ago. That's an actual price new on Newegg. I had it in my cart (not with intent to purchase now, but spec'ing out for a future purchase).
* I've found proper refurbished to be at least as reliable as new, if bought through appropriate channels. What you're looking for is off-lease. Companies lease computers for e.g. 3 years, and then return them to Dell. Those computers ran long enough to be past infant mortality. That very much differs from consumer returns / repairs, which were returned because they were unreliable. I'd never buy a refurbished Inspiron, but I very much would buy a refurbished Latitude.
* Part of it depends on what you're doing. I'm very much not performance-limited for most of the things I do. If a model takes 2x as long to run, I'll wait twice as long. No big deal. If a model can't run at all, that's where there's a huge difference.
A card like the A770, which is around $20/GB, is very exciting to me, if the drivers / documentation / etc. lands.
I'll also mention: The scaling here isn't quite as it seems:
1) Two 16GB cards with half the bandwidth per card will have the same theoretical bandwidth as one 32GB card in a faster lane. Ditto for bandwidth to VRAM. Once you start doing the math there, the cheap cards look very good, indeed!
2) If big chunks of data need to flow between cards, you're using much more of that bandwidth. The cheap cards look a lot worse.
A lot of that depends on what you're doing, and if I haven't made it clear yet, I don't quite know what I'm doing :)
If AMD can significantly undercut Nvidia on perf/$ like they did to Intel with Zen 1, developers will fall over themselves to improve the software. When the performance price-point is low enough, the pain will be worth it.
ATX power supplies cap out at 2kW or so, and single-slot GPUs require water cooling, which moves your price estimate significantly. It's also difficult to plumb in adjacent cards. I think the realistic maximum is four ~300W cards per system, and even that needs water.
Your point stands - GPU compute on commodity hardware is very cheap. $10k per 4U quad GPU box is probably the ballpark.
Intel as a whole seems to be in pretty big trouble and rumor is that the GPU org is a dumpster fire in particular. I'd bet it's more likely than not that intel's GPU experiment ends within a year or two, unfortunately.
Which is a crying shame because the space could really use more competition, instead of less.
“Please note the library is being actively developed, and is known to be incomplet; it might also be incorrekt and there could be a few bad bugs lurking.”
I use exactly the same verbiage in my draft documents. I think it's a reference to C++ standards draft documents, which have that on the cover. I don't know if that's the original usage though, it's just where I saw it first.
This was exactly my experience with it too. It's moved on a bit since then but when I looked at rocFFT a couple of years ago, the documentation was really poor and it was missing features.
When I switched from FFTW to cuFFT many years ago (~2015), the transition was very smooth, the documentation was great, and all features were supported. They even put in a shim FFTW-compatible header file so that you didn't need to rewrite your code to make it work (leaving some performance on the table).
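For anyone curious, the drop-in path looked roughly like the sketch below: switch the header and link against the cufftw shim instead of FFTW. This is a generic illustration, assuming the shim covers the particular calls your code uses.

    #include <cufftw.h>   // was <fftw3.h>; link against the cufftw/cufft libraries instead of fftw3f
    #include <complex>
    #include <vector>

    int main() {
        const int n = 1024;
        std::vector<std::complex<float>> in(n, {1.0f, 0.0f}), out(n);

        // Same single-precision FFTW-style plan API as before; the transform now runs through cuFFT.
        fftwf_plan plan = fftwf_plan_dft_1d(
            n,
            reinterpret_cast<fftwf_complex *>(in.data()),
            reinterpret_cast<fftwf_complex *>(out.data()),
            FFTW_FORWARD, FFTW_ESTIMATE);

        fftwf_execute(plan);
        fftwf_destroy_plan(plan);
        return 0;
    }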
I'm convinced that Julia is how the moat will be crossed. There are some pretty incredible GPU packages for it (I'm looking at you KernelAbstractions.jl). The Python science community seems more than happy to carry on focusing on NVIDIA and are a lost cause.
I somewhat don't blame them: the MI300X might be miles ahead and all, but AMD are not only oblivious to the desktop market (you know, where new ideas are prototyped) but also seemingly actively hostile to it[1]. NVIDIA has people doing somewhat interesting things with a 3060 (which can eventually graduate to a 4090 or even an H100), while AMD don't want to hear about it unless you have a "pro" GPU. Definitely a case of dollar-wise and penny-foolish.
Documentation. The python example is a bit on the nose. I've not had good times with ROCm's documentation either.
Can anyone point to an example of good documentation for a big software system where they can also sketch how that was achieved? E.g. CUDA's docs are pretty good, but I've no idea how they came to be or how they stay up to date. LLVM's docs are a small amount of handwritten webpages which correlate with reality to some extent, the source for which lives in the same repo as the code.
I have an idea that it needs to combine programmers writing some things, some testing infra to notice things like internal links that 404, and some non-developers writing things.
I started trying to document one of my own systems as freeform notes in Obsidian, and while it kind of works at the time, it diverges from reality pretty quickly - and that's without anyone else working on either the docs or the system.
AMD cards + ROCm are used in top supercomputers for (non-deep learning) HPC. Why is this the case?
I understand that AMD GPUs offer better cost efficiency for FP32 & FP64 FLOPS, RAM, and wattage. But if ROCm is such a half-baked piece of software, shouldn't that advantage be gone? What drives AMD adoption in the HPC space, then?
I assume a deal like that involves a lot of direct support and optimization by ROCm team for the workload the supercomputer is running, rather than generic awesomeness at everything.
That’s been my experience with most large enterprise software, really. Amazing at its most critical competency, core dumps when I try something a little unusual.
Nvidia has pretty epic margins in the datacenter space, so some of it will be that. Epic because their software advantage in AI lets them charge a premium, so why bother competing for the absolute cheapest $/FLOPS contract.
More generally, LLM training is weird because it's a supercomputing workload being executed by Silicon Valley devs. So they want to use open-source frameworks, find answers on Stack Overflow, use cloud providers, etc. And it's new enough that random PhDs can invent stuff like Flash Attention without being employed by Nvidia.
Normal supercomputer users don't care about any of that.
> AMD cards + ROCm are used in top supercomputers for (non-deep learning) HPC. Why is this the case?
Public bidding processes are different and rely much more heavily on price alone. As such you can expect them to pick different choices than the private market.
This can end up picking outright duds (see Aurora).
There is certainly a requirement that prevents payout on the contract for those supers if performance numbers aren’t met. Further, those performance numbers are from real apps, so the system WILL be useful. Not getting paid is not an option, so performance/usability will come.
How does that contractual obligation translate to technical implementation? Do those supercomputers get an optimized version of ROCm to fulfill said obligation?
Absolutely. You usually end up with a software stack where NOTHING can be updated, most of the stuff is forked with custom patches, and the learnings aren't reusable elsewhere because the code is full of comments like "// replace this with hardcoded constant 59843 because it prevents crash on HPC machine".
It's a good marketing metric, but probably counterproductive to AMD's long-term success in the field. They're spending engineering time building something they're unlikely to be able to translate into other fields.
The vendor iterates until the client gets a usable environment, even if that means 50 forks of different libraries with custom patches that, in the end, work only on that one system.
ROCm works better in HPC than on consumer systems because it was initially written for HPC and arguably is still tested with HPC in mind. ROCm development was substantially funded by the Frontier supercomputer and the DoE developers get to complain directly at AMD developers when they're not happy.
The DoE doesn't like having single suppliers for their tech. AMD/HPE's Frontier bid was a bit of a gamble from the DoE - it wasn't remotely obvious whether they'd be able to deliver it or not, and was against a background of Intel broadly failing to deliver Aurora - whereas nvidia was a known safe bet. However placing a big order with Intel and another one with AMD was their best shot at getting away from being wholly reliant on nvidia.
Frontier shipped. It's a real thing, people run code on it. I'd guess the DoE labs talk to each other to some extent and thus AMD ended up winning the El Capitan bid. That means AMD has a flagship HPC machine that sales people can point to and a big ongoing revenue stream to continue funding development from. It looks like the DoE plan to have two HPC vendors has worked.
Aurora seems to be somewhat in existence now, but is less compelling as a story other potential customers might want to copy. I'm curious whether Intel ends up making a loss on it.
Part of the reason is that at the scale of a country like the US, if you are going to have several supercomputers built in the coming years, you want to diversify to avoid putting all your eggs in the same basket (and to avoid giving preferential treatment to a single vendor).
So, as long as they can get a given set of applications to run above a given performance threshold, alternative vendors have a shot.
I hate ROCm so much I can't even describe how much I suffered because of this POS software. It wasn't good 3 years ago, but it was manageable; now I don't even know how they managed to make it worse.
I really just wished my employer would give up on AMD for GPUs.
I had a look 3 years ago and the compatibility matrix was an absolute shitshow: older cards not supported at all, newer cards not supported at all, and support for those in the middle was full of holes depending on extension, version, etc.
I couldn't even figure out what I needed to buy, so I didn't.
At this point, rather than chasing CUDA/cuDNN compatibility, it would seem more productive for AMD to be targeting high-level language support. Forget CUDA compatibility, and instead support work such as Mojo and MLIR.
It seems that in an ideal world PyTorch support for AMD wouldn't rely on ROCm, but rather be based on high level code compiled to MLIR with AMD target support, with this same MLIR representation supporting Mojo and any other languages such as Julia that want optimized AMD support.
I mean there's OpenCL which has widespread support and is pretty much about as pleasant to write as OpenGL. It has existed for a very, very long time but people hate writing in it so they just defaulted to CUDA and ignored everyone but NVIDIA.
Bigger than it taking a lot of boilerplate is the issue that there was never sufficiently widespread and bug-free support for mixing CPU and GPU code in OpenCL in the way CUDA allows.
So you end up having to do a lot more work to port stuff over. When porting stuff to CUDA, you can share a lot of code. Simplifies the task so much.
Yep. That was the point I was making. Much like OpenGL, writing OpenCL is absolutely miserable, so people just stuck with less portable alternatives that didn't feel like pulling teeth to write.
> Doesn't it take, like, 100 lines of code to call a kernel in OpenCL? Compared to Nvidia's 1?
If you are using a single line of code to execute a CUDA kernel then you are accepting a huge amount of default behavior.
I mean, this is really, really nice and you get some pretty good performance. You're going to outgrow that single line of code very quickly. I think you're beyond it by the time you get done with Nvidia tutorial #2.
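For context, here's roughly what that "single line" looks like and the defaults it quietly accepts (a generic saxpy sketch, not anything from the article):

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));  // host data and copies omitted for brevity
        cudaMalloc((void **)&d_y, n * sizeof(float));

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;

        // The "one line": implicitly the default stream and 0 bytes of dynamic shared memory.
        saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
        // Spelled out, the same launch is:
        //   saxpy<<<blocks, threads, 0 /*dyn. shared mem*/, 0 /*default stream*/>>>(n, 2.0f, d_x, d_y);
        // Outgrowing it means picking streams, shared memory sizes, and occupancy-aware
        // block dimensions yourself.

        cudaDeviceSynchronize();
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }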
The rocRAND library did not have any real documentation at all until 2023. It's still pretty barebones, but the updates in the article regarding the Python API seem to suggest that this is a work in progress.
Are there use cases for LLMs assisting in developing this software? Basically, I'm wondering if LLMs for developing a GPU API exist and how they can accelerate development such that this "moat" becomes more of a river that others can join.
I'm struggling to see how that helps given the current problem they have. What they have is several extremely half-assed implementations of somewhat different APIs that are basically the same, and abysmal documentation. Pointing an LLM at it would make yet another half-assed implementation of yet another different API, where what documentation it had would be convincingly wrong a bunch of the time.
What they need is:
1) Make one API [1] that builds out of the box, works really well, and gets the best out of their hardware right away. Bonus points if it's not a total pig to use.
2) Take the thing they built for step 1 and make a straightforward Python binding for it that works for TensorFlow and PyTorch. [2]
3) Have a beer on me. Seriously I would buy them one.
Ideally there's an optional step 0, which is to stop making all kinds of press statements saying they are taking this seriously, and to stop paying sloppy journalists to say they are catching up in the race against Nvidia, if they can't actually get their act together to do steps 1 and 2.
[1] In C, C++, or Rust, so it's easy to embed in Python and other languages, etc.
[2] By "work" here I mean the definition of done is that a normal person can do `pip install tensorflow` and it will actually use an AMD GPU with hardware acceleration right away. Not some bullshit where you have to jump through a bunch of hoops and apply a whole lot of custom patches, and it kinda sorta sometimes works but mostly just either falls back to CPU or crashes half the time.
It does seem like they need a reset, and given that need for a reset, a new starting point: how do they move quickly? I was thinking that an LLM could assist in creating a coherent approach, helping create code, documentation, etc. (i.e. dogfooding their "product") - which I would think could make this development more rapid.