I wish AMD would just drop ROCm at this stage and focus on SYCL. The rocRAND/hipRAND woes in this article, if anything, show ROCm in a better light than it deserves; here it at least worked and performed within the same ballpark as CUDA. Often it simply does not work at all, or if it works it's behind by a lot more. At work I simply gave up on our 4x Radeon Pro W6800 workstation because launching TensorFlow with more than 1 GPU would cause a kernel panic every time, and AMD engineers never offered a fix other than "reinstall Ubuntu".
ROCm feels like such a half-assed product, one that (to me at least) seems to have been made to tick a box and look cool in corporate presentations. It's not made with the proper mindset to compete against CUDA. Lisa Su claims they're doubling down on ROCm, but to me it feels like they're falling behind relative to Nvidia, not catching up.
Banding together with Intel to support SYCL would, in my opinion:
1. Ensure there's a lot more momentum behind a single, cross-platform, industry-standard competitor
2. Entice other industry heavyweights like MSFT, Qualcomm, ARM etc to also take the cross-platform solutions more seriously
3. Encourage heavy investment into the developer experience and tooling for the cross-platform solution
SYCL is a terrible option. The performance of any application on SYCL is currently quite poor. Why drop ROCm (used on the world's largest supercomputer?) for an even more unproven product whose roadmap AMD has no control over?
> Banding together with Intel to support SYCL would in my opinion
Except that Intel has more control over SYCL and has repeatedly hurt AMD products with anticompetitive behavior in the past. Why would AMD permit their software to be controlled by a competitor?
AMD is executing with ROCm NOW, with multiple Top500 supercomputer wins and deployments at major cloud vendors with the MI300X. Yes, the software needs to improve, but the practice of throwing out software to start over is not a good strategy.
AMD has far more momentum in the data center than INTC at present.
When it comes to comparisons with SYCL, HIP is much closer to the spirit of SYCL than ROCm is. Both aim to help you write a single codebase that'll run across multiple hardware platforms. For now though, the trajectory of SYCL appears much more promising to me than HIP's. HIP is already split into two parts for CPU and GPU, which is baffling, and neither part seems to receive much love from AMD.
> The performance of any application on SYCL is currently quite poor.
SYCL can get pretty much equivalent kernel performance to, e.g., CUDA. Try looking at SYCL performance papers on arXiv; see [1], for example.
That isn't to say that SYCL code is optimised on every platform without tweaking - you do still need to put effort into target-specific optimizations to get the best performance, just as you would in CUDA or HIP.
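To give a flavour of what SYCL code looks like, here's a minimal vector-add sketch (SYCL 2020 with unified shared memory; my own illustration, not something taken from [1]). The kernel body is plain C++, and the target-specific tuning mentioned above is about things like work-group sizes and memory access patterns rather than the language itself:

    #include <sycl/sycl.hpp>

    int main() {
        constexpr size_t n = 1 << 20;
        sycl::queue q;  // default selector; the device could be an Nvidia, AMD, or Intel GPU depending on the backend

        // Unified shared memory keeps the example short; buffers/accessors are the other option.
        float *a = sycl::malloc_shared<float>(n, q);
        float *b = sycl::malloc_shared<float>(n, q);
        float *c = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // Single-source: this lambda is compiled for whatever device the queue targets.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            c[i] = a[i] + b[i];
        }).wait();

        sycl::free(a, q);
        sycl::free(b, q);
        sycl::free(c, q);
        return 0;
    }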
> Why drop ROCm (used on the world's largest supercomputer?)
Some of the world's largest supercomputers / HPC applications do use SYCL for AMD! The application I'm most aware of for this is GROMACS. As to why? Because having three versions of the same code using different programming APIs is a big maintenance burden.
> Some of the world's largest supercomputers / HPC applications do use SYCL for AMD! The application I'm most aware of for this is GROMACS. As to why? Because having three versions of the same code using different programming APIs is a big maintenance burden.
The fact that GROMACS is unwilling to drop CUDA support to stand fully behind SYCL is very telling.
I wouldn't expect them to drop CUDA support, even if SYCL is a viable alternative:
* The CUDA backend is mature, featureful, and significant effort has been invested into optimising it on Nvidia hardware. One does not simply throw away a performant, well-validated HPC code!
* Nvidia GPUs dominate the GPGPU market - AMD's do not.
* The SYCL backend is still very new in comparison (they even state in the docs to pay extra attention to validation) and doesn't have Nvidia-specific optimisations yet. Why prioritise reimplementing what already exists?
It's important to understand that with these kinds of libraries, it's one thing to get everything working, and quite another to have it all be fast. Given finite resources, AMD are (rightfully) focused on the latter, for a narrower set of use cases - namely big customers with big pockets, using the latest compute cards.
See, for example, the world's fastest supercomputer. That's a whole lot more than a presentation tick box.
As someone doing relatively small scale work, on gaming cards, you just aren't the target user. While these kinds of users are well represented on forums, they're a rounding error in terms of actual $.
Switching to SYCL makes no sense. They need to unseat Nvidia, and specifically CUDA. The whole point of HIP is to clone CUDA, make it easy for users to switch, and piggyback on Nvidia's success.
I used to work in this space (not at AMD lol), and generally I don't think the comments on HN are fair to AMD. Obviously ROCm has a long way to go (CUDA and libs are some truly amazing work), but it's going.
Don't know about the libs, but CUDA itself is a C++ dialect targeted at a family of quirky array processors (GPUs). That takes some dev work by compiler hackers but doesn't seem amazing. Am I missing something? It's of course still easy to mess up such a project by not bringing enough clue, and AMD might have done that with ROCm. Is that what happened, or is it something different?
CUDA is more than a C++ dialect, which is a big thing people keep missing with all those "CUDA replacements".
Until CUDA 3.0 it was similar to OpenCL, a C dialect; afterwards it became a C and C++ dialect, with PTX as the common infrastructure.
PGI targeted PTX with their C, C++, and, very relevant for HPC, Fortran compilers.
PGI was acquired by NVidia, and its compilers became the main set of CUDA compilers.
Given PTX, many other languages started targeting CUDA as well: Java, .NET, Haskell, and Julia, at the very least.
NVidia is now invested in a Python JIT for CUDA as well.
So yeah, while C++20 is the main language in CUDA, there is also a whole ecosystem of programming languages that the "CUDA replacements" keep ignoring.
In these CUDA vs ROCm comparisons I think they mostly compare the C++ dialects. And it's not even particularly the language implementation where ROCm is weak, but rather the whole toolchain. Am I mistaken?
Well, the C++ segment is important, and from what I gather ROCm is failing at that. AMD would be a lot better off if the C++ part worked, even if the other parts don't.
>As someone doing relatively small scale work, on gaming cards, you just aren't the target user.
He didn't use gaming cards. He bought the expensive $1.5k "workstation" graphics cards.
What exactly is one going to use these GPUs for, other than GPGPU? Play video games? Really?
>While these kinds of users are well represented on forums, they're a rounding error in terms of actual $.
There are only three thousand developers who have ever committed to PyTorch. Compared to a datacenter contract, that is a rounding error in terms of sales. Hence AMD shouldn't waste time on letting people independently work on AMD support for PyTorch. They should only work on PyTorch whenever there is a big contract.
However, switching to SYCL makes no sense, because Mesa is getting SYCL support. The likelihood that the Mesa drivers cause a kernel panic is much lower.
One is that ROCm is the entire compute stack - driver, libraries, compilers, languages - while SYCL is a programming language. SYCL is more like one more language that could be built on ROCm than a replacement for it.
The second is that the entire target market wrote all their stuff in CUDA and looks totally unwilling to change to anything else. That makes HIP look like a good idea and SYCL not so much. No one (commercially significant) is likely to port their code to SYCL to benefit from Intel's GPUs.
It's worth saying that AMD bet heavily on cross-platform technology. Initially OpenCL, but take a look at the companies behind HSA as well. I'd say that doesn't appear to have worked at all - they've got the hassle of working with external specifications, and other companies have given up on the specs, so really no one is winning there. I think Triton is a cross-company thing currently being implemented.
Finally, Intel specifically is moribund. Definitely don't want to be relying on them to do anything.
On the bright side, the wide array of GPU programming languages are all very similar to each other from the perspective of the compiler stack, e.g. they all run through the same LLVM back end. So if someone decides they like SYCL on amdgpu enough to write it, a lot of existing code is there to help. All open source, in case it's someone outside of AMD that wants to write it.
I feel this is falling into the rewrite trap - rather than continuing to develop an existing but imperfect system, throwing it all away for something that is unproven and currently in a notably worse state.
What about ROCm makes it fundamentally, by design, that much worse? And what about SYCL makes it better? ROCm is already based on many standard shared OSS projects, like LLVM and Clang - often the same ones SYCL stacks are based on. What about SYCL makes it so superior that it'll make up for throwing all the current work away?
> What about ROCm makes it fundamentally, by design, that much worse? And what about SYCL makes it better?
The fact that SYCL compiles device code to SPIR-V instead of a device-specific ISA like HIP does, for instance. That decision alone is what causes ROCm to support only a very narrow range of cards, which significantly hampers adoption.
Those are implementation details. SYCL and HIP are programming languages (more specifically, C++ dialects); ROCm is an implementation of HIP that emits machine code, and Intel's DPC++/icpx is an implementation of SYCL that emits SPIR-V.
There exists a HIP implementation that also compiles to SPIR-V, chipStar. And one can in theory write a SYCL compiler that generates machine code (maybe with openSYCL's HIP/ROCm backend?).
The woes of the article's author and of various posters in this thread are also side effects of "implementation details". I don't think anyone is critical of HIP as a programming language, as it's just a CUDA clone. Pretty much all complaints (crashes, perf issues, poor device compatibility) regarding HIP/ROCm pertain to poor implementation and execution, including the directly-emitting-device-specific-machine-code problem. AMD can't wriggle out of this one with that excuse.
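To make the "CUDA clone" point concrete, here's a rough HIP sketch of my own (not from the article): apart from the hip-prefixed runtime calls, it uses the same vocabulary and launch syntax as CUDA, which is why hipify-style porting is mostly mechanical.

    #include <hip/hip_runtime.h>

    // Same __global__ / blockIdx / threadIdx model as CUDA.
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *d_x = nullptr;
        hipMalloc((void **)&d_x, n * sizeof(float));     // cudaMalloc -> hipMalloc
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   // triple-chevron launch, as in CUDA
        hipDeviceSynchronize();                          // cudaDeviceSynchronize -> hipDeviceSynchronize
        hipFree(d_x);                                    // cudaFree -> hipFree
        return 0;
    }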
I very much agree - and I think some people misunderstood what I was trying to say too, which was that I don't think switching to SYCL would fix any of the problems that ROCm currently has.
I personally feel many of the problems with ROCm are due to lack of investment and management priority - there is no "technical" solution to that problem. And swapping out parts of the stack will just exacerbate those problems, as now they'd need to rebuild everything from scratch with the same insufficient resources.
It's really sensitive to the (Linux) driver/compiler/library versions all lining up. If you get the blessed Ubuntu kernel version and use it with the exact corresponding driver from dkms and the matching userspace version it's reasonably solid.
If you use the driver in Linux upstream, whether you have a good time or a terrible one varies with the age of the userspace libraries you're pairing with it.
I don't know what to do about this setup. I think it's an artifact of gearing testing towards "releases" of the entire stack with specific HPC operating systems in mind.
To do this right, each component would need to be individually solid and deal with version/interface slip relative to other components, which means the cross product of testing patterns rapidly becomes infeasible. I'm thinking about it but don't see a solution yet.
SYCL is practically comparable to OpenCL, which is already getting unofficial support (including on older hardware that ROCm does not support) via Mesa's RustiCL. Additionally, clspv and Mesa's Zink are both exploring OpenCL (and potentially SYCL)-on-Vulkan, though that option involves lots of quirks since the standards have evolved independently (and are even based on different varieties of SPIR-V).
AMD has a really terrible reputation for dropping old and introducing new APIs on a whim. To the extent that a lot of people in the field are completely burned out on AMD. Dropping yet another API after promising that this is "the one" would completely kill any chance of ever being taken seriously in the field.
A challenge with this is that all current AMD GPU support in SYCL compilers (DPC++ with Codeplay's oneAPI for AMD GPUs, and AdaptiveCpp) is built atop ROCm / HIP.
If AMD were to move away from ROCm, they would have to adopt some other API for SYCL to use as a backend. Potentially this could be the Level Zero spec from the UXL Foundation, but that is not the same as SYCL.
However, no amount of changing between APIs will help with troublesome drivers, OS compatibility issues, or documentation. Those issues are orthogonal.
> At work I simply gave up on our 4x Radeon Pro W6800 workstation because launching TensorFlow with more than 1 GPU would cause a kernel panic every time, and AMD engineers never offered a fix other than "reinstall Ubuntu".
Unfortunately, it's not always easy to get an issue in front of the right people. The W6800 is an officially supported GPU and a kernel panic is a serious bug. I'd like to try to help get this issue properly triaged. Could you send me a link to your report?
Unfortunately I no longer work there so I sadly cannot give you a detailed report. What I do remember is that this was a Threadripper 5995WX system and based on what AMD reps told us at the time the issue was related to some unexpected interaction between the amdgpu driver and the PCIe implementation on the motherboard we were using.
AMD's attempted responses go all the way back to 2007, when CUDA first debuted, starting with "Close to Metal" (https://en.wikipedia.org/wiki/Close_to_Metal). They've had nearly 20 years to fix the situation and have failed to do so. Maybe some third-party player like Lamini AI will do what they couldn't and get acquired for it.
The thing about modern AI is that the operations involved (e.g. dense matmuls) are a lot simpler and more GPU-friendly than what you'd find in typical HPC applications. This means you can get pretty close to peak hardware performance using high-level languages like Python or OpenAI's Triton. I think it's unlikely that the push to improve ROCm's standard libraries will come from an AI-focused startup.
Even with AMD working hard on specific benchmarks for marketing purposes, they could not get close to peak hardware performance on their brand new chips.
That's more or less my experience with AMD, only worse. A critical thing, too, is burned developers like myself.
I'm looking forward to Intel v. NVidia. Arc A770 is a pretty serious competitor. It's the lowest-cost way to run OPT-175B.
Given a 7-slot motherboard, $270 * 7 = $1890 for 112GB of VRAM in one computer. That's sweet. Compute speed would be on par with a top-of-the-line NVidia workstation GPU.
Three of those are enough to run the largest open-source LLMs at around $9000.
We're just drivers + libraries + documentation away, and Intel is not bad at drivers + libraries + documentation.
$9k for three of those? I'm afraid your math ain't mathing.
7-slot motherboards are IMO Threadripper territory. The CPU alone will set you back at least $1.5k, and a motherboard to go along with that is another $1k. That's not including the stack of DDR5 you'll likely throw into it, plus the PSU / case / coolers to feed it power and evacuate the heat from this monstrosity.
That's at least $3000 for the base system before you add any GPUs.
And at some point I'd like someone more enlightened to comment on what the memory bandwidth is likely to be and how exactly you'd distribute the PCIe lanes...
PCIe bandwidth is a limiting factor. For brevity, I did not write an essay.
The rest doesn't change the big picture. I spec'ed this out before, and your numbers are a good bit off if you're willing to cut corners. The cheapest way to do the rest is a refurbished server and slightly older parts. That's <<$1000.
New parts are a little bit more, but nowhere near $3k. The motherboard can be found for under $500 when a deal comes up; I found the Supermicro MBD-M12SWA-TF-O for that. PCIe risers and cooling can be done on the cheap if you're willing to DIY. Etc. Power draw is under 3kW, so a pair of 1.5kW PSUs will do that, for under $300. It's a mess (you're not getting a formal case), but it works.
Very similar systems were built during the bitcoin boom. Look some of those up, and see how they got costs down. That was all about margins.
I would like someone more enlightened to comment on PCIe issues as well.
True, bitcoin miners have probably figured out quite a few things to which I'm fairly oblivious. I always regarded that movement as a complete waste of computing power. Turns out at least some of their stuff may turn out to be useful for LLMs...
As for the specs, I went the new-Threadripper way simply because that's, AFAIK, where you'll get the best balance of PCIe 5.0 lanes (for the memory bandwidth between the GPUs) and power.
I see how you could spec this out with some used Xeon, but for one I really wouldn't like the headache, and for another, you're probably getting PCIe 2.0/3.0 at best (if you're trying to cut costs that much).
That Supermicro you mention seems to be an interesting compromise. The board is selling second-hand for about ~700 EUR in the EU. The Threadripper that goes into it... I don't know, I can't find them used, but I'm guessing something under 1k EUR as well.
Interesting price, but then on top of the risk of using old parts, instead of partitioning PCIe 5.0 (and maybe having the option to go 4.0 and double the count, since your GPUs are PCIe 4.0 anyway), you're now splitting plain 4.0 lanes across your GPUs. Half the bandwidth, though it may end up not mattering all that much. Given that most LLMs are memory-size / memory-bandwidth limited, I probably wouldn't like to risk this. But that's just me...
* The Supermicro was on sale for around $500 a few weeks ago. That's an actual price new on Newegg. I had it in my cart (not with intent to purchase now, but spec'ing out for a future purchase).
* I've found proper refurbished to be at least as reliable as new, if bought through appropriate channels. What you're looking for is off-lease. Companies lease computers for e.g. 3 years, and then return them to Dell. Those computers ran long enough to be past infant mortality. That very much differs from consumer returns / repairs, which were returned because they were unreliable. I'd never buy a refurbished Inspiron, but I very much would buy a refurbished Latitude.
* Part of it depends on what you're doing. I'm very much not performance-limited for most of the things I do. If a model takes 2x as long to run, I'll wait twice as long. No big deal. If a model can't run at all, that's where there's a huge difference.
A card like the A770, which is around $20/GB, is very exciting to me, if the drivers / documentation / etc. lands.
I'll also mention: The scaling here isn't quite as it seems:
1) Two 16GB cards with half the bandwidth per card will have the same theoretical bandwidth as one 32GB card in a faster lane. Ditto for bandwidth to VRAM. Once you start doing the math there, the cheap cards look very good, indeed!
2) If big chunks of data need to flow between cards, you're using much more of that bandwidth. The cheap cards look a lot worse.
A lot of that depends on what you're doing, and if I haven't made it clear yet, I don't quite know what I'm doing :)
If AMD can significantly undercut Nvidia on perf/$ like they did to Intel with Zen 1, developers will fall over themselves to improve the software. When the performance price-point is low enough, the pain will be worth it.
ATX power supplies cap out at 2kW or so, and single-slot GPUs require water cooling, which moves your price estimate significantly. It's also difficult to plumb in adjacent cards. I think the realistic maximum is four ~300W cards per system, and even that needs water.
Your point stands - GPU compute on commodity hardware is very cheap. $10k per 4U quad GPU box is probably the ballpark.
Intel as a whole seems to be in pretty big trouble and rumor is that the GPU org is a dumpster fire in particular. I'd bet it's more likely than not that intel's GPU experiment ends within a year or two, unfortunately.
Which is a crying shame because the space could really use more competition, instead of less.
“Please note the library is being actively developed, and is known to be incomplet; it might also be incorrekt and there could be a few bad bugs lurking.”
I use exactly the same verbiage in my draft documents. I think it's a reference to C++ standards draft documents, which have that on the cover. I don't know if that's the original usage though, it's just where I saw it first.
This was exactly my experience with it too. It's moved on a bit since then but when I looked at rocFFT a couple of years ago, the documentation was really poor and it was missing features.
When I switched from FFTW to cuFFT many years ago (~2015), the transition was very smooth, the documentation was great, and all features were supported. They even put in a shim FFTW-compatible header file so that you didn't need to rewrite your code to make it work (leaving some performance on the table).
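For anyone curious, the drop-in path looked roughly like the sketch below: switch the header and link against the cufftw shim instead of FFTW. This is a generic illustration, assuming the shim covers the particular calls your code uses.

    #include <cufftw.h>   // was <fftw3.h>; link against the cufftw/cufft libraries instead of fftw3f
    #include <complex>
    #include <vector>

    int main() {
        const int n = 1024;
        std::vector<std::complex<float>> in(n, {1.0f, 0.0f}), out(n);

        // Same single-precision FFTW-style plan API as before; the transform now runs through cuFFT.
        fftwf_plan plan = fftwf_plan_dft_1d(
            n,
            reinterpret_cast<fftwf_complex *>(in.data()),
            reinterpret_cast<fftwf_complex *>(out.data()),
            FFTW_FORWARD, FFTW_ESTIMATE);

        fftwf_execute(plan);
        fftwf_destroy_plan(plan);
        return 0;
    }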
I'm convinced that Julia is how the moat will be crossed. There are some pretty incredible GPU packages for it (I'm looking at you KernelAbstractions.jl). The Python science community seems more than happy to carry on focusing on NVIDIA and are a lost cause.
I somewhat don't blame them: the MI300X might be miles ahead and all, but AMD are not only oblivious to the desktop market (you know, where new ideas are prototyped) but also seemingly actively hostile to it[1]. NVIDIA has people doing somewhat interesting things with a 3060 (which can eventually graduate to a 4090 or even an H100), while AMD don't want to hear about it unless you have a "pro" GPU. Definitely a case of dollar-wise and penny-foolish.
Documentation. The python example is a bit on the nose. I've not had good times with ROCm's documentation either.
Can anyone point to an example of good documentation for a big software system where they can also sketch how that was achieved? E.g. CUDA's docs are pretty good, but I've no idea how they came to be or how they stay up to date. LLVM's docs are a small amount of handwritten webpages which correlate with reality to some extent, the source for which lives in the same repo as the code.
I have an idea that it needs to combine programmers writing some things, some testing infra to notice things like internal links that 404, and some non-developers writing things.
I started trying to document one of my own systems as freeform notes in Obsidian, and while it kind of works at the time, it diverges from reality pretty quickly - and that's without anyone else working on either the docs or the system.
AMD cards + ROCm are used in top supercomputers for (non-deep learning) HPC. Why is this the case?
I understand that AMD GPUs offer better cost efficiency for FP32 & FP64 FLOPS, RAM, and wattage. But if ROCm is such a half-baked piece of software, shouldn't that advantage be gone? What drives AMD adoption in the HPC space, then?
I assume a deal like that involves a lot of direct support and optimization by ROCm team for the workload the supercomputer is running, rather than generic awesomeness at everything.
That’s been my experience with most large enterprise software, really. Amazing at its most critical competency, core dumps when I try something a little unusual.
Nvidia has pretty epic margins in the datacenter space, so some of it will be that. Epic because their software advantage in AI lets them charge a premium, so why bother competing for the absolute cheapest $/FLOPS contract.
More generally, LLM training is weird because it's a supercomputing workload being executed by Silicon Valley devs. So they want to use open-source frameworks, find answers on Stack Overflow, use cloud providers, etc. And it's new enough that random PhDs can invent stuff like Flash Attention without being employed by Nvidia.
Normal supercomputer users don't care about any of that.
> AMD cards + ROCm are used in top supercomputers for (non-deep learning) HPC. Why is this the case?
Public bidding processes are different and rely much more heavily on price alone. As such you can expect them to pick different choices than the private market.
This can end up picking outright duds (see Aurora).
There is certainly a requirement that prevents payout on the contract for those supers if performance numbers aren’t met. Further, those performance numbers are from real apps, so the system WILL be useful. Not getting paid is not an option, so performance/usability will come.
How does that contractual obligation translate to technical implementation? Do those supercomputers get an optimized version of ROCm to fulfill said obligation?
Absolutely. You usually end up with a software stack where NOTHING can be updated, most of the stuff is forked with custom patches, and the learnings aren't reusable elsewhere because the code is full of comments like "// replace this with hardcoded constant 59843 because it prevents crash on HPC machine".
It's a good marketing metric, but probably counterproductive to AMD's long-term success in the field. They're spending engineering time building something they're unlikely to be able to translate into other fields.
The vendor iterates until the client gets a usable environment, even if that means 50 forks of different libraries with custom patches that, in the end, work only on that one system.
ROCm works better in HPC than on consumer systems because it was initially written for HPC and arguably is still tested with HPC in mind. ROCm development was substantially funded by the Frontier supercomputer and the DoE developers get to complain directly at AMD developers when they're not happy.
The DoE doesn't like having single suppliers for their tech. AMD/HPE's Frontier bid was a bit of a gamble from the DoE - it wasn't remotely obvious whether they'd be able to deliver it or not, and was against a background of Intel broadly failing to deliver Aurora - whereas nvidia was a known safe bet. However placing a big order with Intel and another one with AMD was their best shot at getting away from being wholly reliant on nvidia.
Frontier shipped. It's a real thing, people run code on it. I'd guess the DoE labs talk to each other to some extent and thus AMD ended up winning the El Capitan bid. That means AMD has a flagship HPC machine that sales people can point to and a big ongoing revenue stream to continue funding development from. It looks like the DoE plan to have two HPC vendors has worked.
Aurora seems to be somewhat in existence now, but is less compelling as a story other potential customers might want to copy. I'm curious whether Intel ends up making a loss on it.
Part of the reason is that at the scale of a country like the US, if you are going to have several supercomputers built in the coming years, you want to diversify to avoid putting all your eggs in the same basket (and to avoid giving preferential treatment to a single vendor).
So, as long as they can get a given set of applications to run above a given performance threshold, alternative vendors have a shot.
I hate ROCm so much I can't even describe how much I suffered because of this POS software. It wasn't good 3 years ago, but it was manageable; now I don't even know how they managed to make it worse.
I really just wished my employer would give up on AMD for GPUs.
I had a look 3 years ago and the compatibility matrix was an absolute shitshow: older cards not supported at all, newer cards not supported at all, and support for those in the middle was full of holes depending on extension, version, etc.
I couldn't even figure out what I needed to buy, so I didn't.
At this point, rather than chasing CUDA/cuDNN compatibility, it would seem more productive for AMD to be targeting high-level language support. Forget CUDA compatibility, and instead support work such as Mojo and MLIR.
It seems that in an ideal world PyTorch support for AMD wouldn't rely on ROCm, but rather be based on high level code compiled to MLIR with AMD target support, with this same MLIR representation supporting Mojo and any other languages such as Julia that want optimized AMD support.
I mean there's OpenCL which has widespread support and is pretty much about as pleasant to write as OpenGL. It has existed for a very, very long time but people hate writing in it so they just defaulted to CUDA and ignored everyone but NVIDIA.
Bigger than it taking a lot of boilerplate is the issue that there was never sufficiently widespread and bug-free support for mixing CPU and GPU code in OpenCL in the way CUDA allows.
So you end up having to do a lot more work to port stuff over. When porting stuff to CUDA, you can share a lot of code. Simplifies the task so much.
Yep. That was the point I was making. Much like OpenGL, writing OpenCL is absolutely miserable, so people just stuck with less portable alternatives that didn't feel like pulling teeth to write.
> Doesn't it take, like, 100 lines of code to call a kernel in OpenCL? Compared to Nvidia's 1?
If you are using a single line of code to execute a CUDA kernel then you are accepting a huge amount of default behavior.
I mean, this is really, really nice and you get some pretty good performance. You're going to outgrow that single line of code very quickly. I think you're beyond it by the time you get done with Nvidia tutorial #2.
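For context, here's roughly what that "single line" looks like and the defaults it quietly accepts (a generic saxpy sketch, not anything from the article):

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));  // host data and copies omitted for brevity
        cudaMalloc((void **)&d_y, n * sizeof(float));

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;

        // The "one line": implicitly the default stream and 0 bytes of dynamic shared memory.
        saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
        // Spelled out, the same launch is:
        //   saxpy<<<blocks, threads, 0 /*dyn. shared mem*/, 0 /*default stream*/>>>(n, 2.0f, d_x, d_y);
        // Outgrowing it means picking streams, shared memory sizes, and occupancy-aware
        // block dimensions yourself.

        cudaDeviceSynchronize();
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }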
The rocRAND library did not have any real documentation at all until 2023. It's still pretty barebones, but the updates in the article regarding the Python API seem to suggest that this is a work in progress.
Are there use cases for LLMs assisting in developing this software? Basically, I'm wondering if LLMs for developing a GPU API exist and how they can accelerate development such that this "moat" becomes more of a river that others can join.
I'm struggling to see how that helps given the current problem they have. What they have is several extremely half-assed implementations of somewhat different APIs that are basically the same, and abysmal documentation. Pointing an LLM at it would make yet another half-assed implementation of yet another different API, where what documentation it had would be convincingly wrong a bunch of the time.
What they need is:
1) Make one API [1] that builds out of the box, works really well, and gets the best out of their hardware right away. Bonus points if it's not a total pig to use.
2) Take the thing they built for step 1 and make a straightforward Python binding for it that works for TensorFlow and PyTorch. [2]
3) Have a beer on me. Seriously I would buy them one.
Ideally there's an optional step 0, which is to stop making all kinds of press statements saying they are taking this seriously, and to stop paying sloppy journalists to say they are catching up in the race against Nvidia, if they can't actually get their act together to do steps 1 and 2.
[1] In C, C++, or Rust, so it's easy to embed in Python and other languages, etc.
[2] By "work" here I mean the definition of done is that a normal person can do `pip install tensorflow` and it will actually use an AMD GPU with hardware acceleration right away. Not some bullshit where you have to jump through a bunch of hoops and apply a whole lot of custom patches, and it kinda sorta sometimes works but mostly just either falls back to CPU or crashes half the time.
It does seem like they need a reset, and given that need for a reset, a new starting point: how do they move quickly? I was thinking that an LLM could assist in creating a coherent approach, helping create code, documentation, etc. (i.e. dogfooding their "product") - which I would think could make this development more rapid.