> AMD diverged their GPU architecture development into separate CDNA and RDNA lines specialized for compute and graphics respectively.
Ooooh, is that why the consumer cards don't do compute? Yikes, I was hoping that was just a bit of misguided segmentation but this sounds like a high level architecture problem, like an interstate without an on ramp. Oof.
Typically, software developers only support a single GPGPU API, and that API is nVidia CUDA.
Technically, consumer AMD cards are awesome at compute. For example, UE5 renders triangle meshes with compute instead of graphics https://www.youtube.com/watch?v=TMorJX3Nj6U Moreover, AMD cards often outperform nVidia equivalents because nVidia prioritized ray tracing and DLSS over compute power and memory bandwidth.
The issue is, no tech company is interested in adding a D3D or Vulkan backend to AI libraries like PyTorch. nVidia is not doing that because they are happy with the status quo. Intel and AMD are not doing that because both hope to replace CUDA with their own proprietary equivalents, instead of an open GPU API.
> The issue is, no tech company is interested in adding a D3D or Vulkan backend to AI libraries like PyTorch.
AMD is interested in making PyTorch take advantage of their chips, and has done so.
> we are delighted that the PyTorch 2.0 stable release includes support for AMD Instinct™ and Radeon™ GPUs that are supported by the ROCm™ software platform.
When NVIDIA is charging 1000% profit margins for a GPU aimed at use as an AI accelerator, you can expect competitors to be willing to do what it takes to move into that market.
I know you are just quoting the tomshardware article, but sheesh, "1000% profit margin" is nonsensical. If the cost of goods is X and they sell it for 10X, then it is a 90% profit margin, not a 1000% profit margin.
Once when I was in an RPG shop I visited frequently, some youths came in and said "We want to order 5 figures." After a short pause they added "Can we get them cheaper, because we're buying 5?" The shop owner said "Fine, I'll give you one for free." They said "No, the other owner promised us 10 percent". After a small pause the shop owner smiled and said "Ok, I'll stick with the promise, you get 10 percent."
Percentages suck. They feel almost designed to confuse. Just about anything is better, really. Doesn't help that people using percentages either use them intentionally to confuse their victims - think shop owners, advertisers, banks, scammers, especially in finance- and insurance-adjacent areas - or just don't realize how easy it is for everyone, including themselves, to lose track of the base/referent.
I've come to see the use of percentages as a warning sign.
Great satire. I think the greatest satire is the kind that sounds totally real, the kind that makes the reader think twice. But "I've come to see the use of percentages as a warning sign." gave you away.
They're both percentages. Using the wrong term isn't a math failure, like in your example.
"markup" versus "margin" is an arbitrary choice, not something people should be expected to intuit. And they're usually close in value, so people are less likely to be informed or reminded of the two terms and which is which.
All the companies that are now screaming hellfire because of Nvidia's market maker position are also the companies that gave Nvidia a warchest filled with billions of dollars. How is AMD supposed to compete when the whole market is funding their rival?
AMD was almost bankrupt 10 years ago. Basically AMD had to bet the whole company on either their next CPU or their next GPU; they chose the CPUs, put all their effort into that, and it paid off (the Zen architecture).
It has only been the last 4 or 5 years that AMD has had any real money to put into their GPU/AI accelerator sector, and that seems to be developing quite well now (though they seem to be mostly interested in supercomputer/massive data center deployments for now).
GPU languages shouldn't be proprietary. At one point this was a shared understanding and lots of companies were behind initiatives like OpenCL. Meanwhile progress in higher-level non-C++ GPU languages has stayed in niches without big industry backing, and we're stuck with CUDA, which is bad both in terms of language design and market-wise.
the problem with opencl was that AMD's runtime was so buggy and incomplete that developers usually had to build a whole separate pathway just for AMD. And if you're doing that, you're not portable anyway, so why bother forcing NVIDIA through a more-mismatched generalized API just for the sake of purported "portability"?
it's turtles (broken AMD technical leadership) all the way down.
this is simply not a field AMD cared about, whether you think they were right or wrong (based on their financials or otherwise). and now that it has turned into a cash fountain, everyone wishes it had gone differently. "should have bought bitcoin" but instead of pollution funbux it's investing into your own product.
This is an AMD-centric view, but the problem is much wider; it includes other OS platforms (eg Android & Apple platforms) and other hardware vendors (eg Intel, mobile chipsets) as well. After all, game APIs also fell victim to the same kind of "code to the driver" circumstance.
I agree that correctness and robustness of the vendor implementations went poorly, but it would have been fixable if there had been commitment between vendors; the infighting leading to Apple's departure must not have helped that side either.
In the end I think making the high-level compilers and APIs part of the proprietary driver stack's responsibility is unsustainable. Microsoft seems to have a better model, or there could even be a standard JIT code format, emitted by open source stacks shared across platforms and consumed by drivers. The latter approach would also better support the development of nicer GPU languages.
> In the end I think making the high-level compilers and APIs part of the proprietary driver stack's responsibility is unsustainable. Microsoft seems to have a better model, or there could even be a standard JIT code format, emitted by open source stacks shared across platforms and consumed by drivers. The latter approach would also better support the development of nicer GPU languages.
OK, let's talk MS.
What NVIDIA has done with PTX is basically the same thing as MSIL/CIL. There is a meta-ISA/meta-language that has been commonly agreed on, and as long as you invoke no tokens that the receiving driver doesn't understand, the legacy compiler can parse it and emit executable (perhaps not optimal) code.
The legacy hardware stays on a specific driver and a specific CUDA support level. The CUDA support level is everything, that is your entire "feature profile". It's not GFX1030/GFX1031, it's RDNA3.0/3.1/whatever. NVIDIA has been able to maintain their feature support (not implementation details) in a monotonically increasing sequence.
Additionally, they maintain a guarantee that a spec 3.1 compiler can also compile and execute 3.0 code. "Upwards family correctness" I guess. Again, it doesn't have to be optimal, but the contract is that there are no tokens you don't understand. A Tegra 7.0 compiler can compile desktop 7.0 CUDA (byte)code etc.
Legacy driver support doesn't change, so, you can always read PTX, you can always read anything that's compiled for your CUDA Capability Level even if it's from a future toolkit that wasn't released at the time. PTX is always the ultimate "emit code from 2020 and run on the 2007 gpu that only understands CUDA 1.3" relief valve though. If you don't write code that does stuff that's illegal in 1.3... it'll compile and emit PTX, and the Tesla driver will parse and run it.
Anyway, let's talk MS. MS could come up with MSIL/CIL. NVIDIA can come up with PTX. Why can't AMD do this? What is unique about GCN/RDNA as a language that you can't come up with an intermediate language representation?
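For concreteness, the "CUDA support level" discussed above is just a queryable property of the device; a minimal sketch using PyTorch's CUDA wrapper (assuming a CUDA build and at least one NVIDIA GPU present):

    import torch

    major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6) on Ampere
    print(f"compute capability sm_{major}{minor}")

    # PTX emitted for an older capability (say compute_70) can be JIT-compiled
    # by the driver for this newer GPU; binaries built only for a capability
    # newer than the device supports will not load.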
I agree that AMD could have done a proprietary intermediate instruction format, and with good execution it would have made their users' lives easier.
But the actual big opportunity - the one that could make the whole GPU programming field far more dev-friendly and could make GPU vendor competition work in favour of users - would be doing the GPU IL in a cross-vendor way, driven by someone other than a single GPU vendor who is playing the proprietary lock-in game (or is certain to play it after gaining sufficient market position).
even better, APUs. how about those vega APUs in everyone's AMD laptops and desktops?
oh, right, those are EOL'd, security updates only/separate driver package with a quarterly release cycle, even though they're still on store shelves...
just to emphasize: it is crazy to me that at this moment of frenzy over AI/ML, that AMD is not making better use of their iGPUs. The Vega iGPU is really powerful compared to its contemporaries, it's not WMMA or DP4a capable (iirc) but it still can brute force a lot more than you'd think, it certainly is capable of doing at least a little bit of inference.
Remember that the "AI core" in the new meteor lake and hawk point stuff is not really all that gangbusters either... it's a low-power sidekick, for the same types of stuff that smartphones have been doing with their AI cores for ages. Enhancing the cameraphone (these cameras would be complete shit without computational/AI enhancement). Recognizing gestures, recognizing keywords for voice assistant activation.
AMD's pitch for the AI core is enhancing game AI. Windows 11/12 assistant. That type of stuff.
Vega can absolutely just brute-force its way through that stuff, and it gets more people onto the platform and developing for AMD. It is crazy that if nothing else they aren't at least making sure the APUs can run shit.
And again, it's pretty damn unethical imo to be pulling support from products that are still actively marketed and sold. That's a cheesy move. AMD dropped support for Terascale before NVIDIA dropped support for Fermi, and NVIDIA went back and added Vulkan support to all their older stuff during the pandemic too. Then AMD dropped the 28nm GCN families while NVIDIA was still supporting Maxwell. And then they dropped Polaris and Vega, and NVIDIA is still supporting Maxwell (albeit I expect them to drop it very soon imo).
The open driver is great under linux because it bypasses AMD's craptacular support. But AMD doesn't support consumer GPUs in ROCm under Linux, and de-facto they don't seem to support ROCm on the open driver in the first place anyway, you have to use AMDGPU-PRO (according to geohot's investigations).
This is such a crazy miss. Yes, it's not a powerhouse, but in the era of Win12 moving to AI everything, and games moving to AI-driven computer opponents, etc - Vega can do that ok, and it at least would give people something to open the door and get them developing.
If there's a contender for breakout against CUDA, sadly it really seems to be Apple. They've got APUs with PS5-level bandwidth and unified memory, and that's just M1 Max, and they have Metal, and it's supported everywhere across their ecosystem. It lets people dip their toes into AI/ML if nothing else, and lets people dip their toes into metal development. That's the kind of environment that NVIDIA spent a decade fostering, and it's also not a coincidence that the second stop for all these data scientists playing with models is not ROCm but their apple laptops. llama.cpp and so on. Everyone likes the hardware they have in their gaming PC or in their laptop, and it's an absolute miss for AMD to not make themselves available in that fashion when they already have the market penetration. Crazy.
>The open driver is great under linux because it bypasses AMD's craptacular support. But AMD doesn't support consumer GPUs in ROCm under Linux, and de-facto they don't seem to support ROCm on the open driver in the first place anyway, you have to use AMDGPU-PRO (according to geohot's investigations).
While they don't officially support any consumer GPU aside from 7900XT(X) and VII, I haven't encountered any issues using it on a 6700XT with the open source drivers, pretty much the only tinkering required was to export HSA_OVERRIDE_GFX_VERSION=10.3.0. It was quite a pleasant surprise after never getting my RX480 to work even though it was officially supported back in the day.
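For reference, the same workaround expressed in a few lines of Python - a sketch assuming a ROCm build of PyTorch and an RDNA2 card like the 6700 XT; the override has to be in the environment before the ROCm runtime initializes:

    import os

    # Report gfx1031 (RX 6700 XT) as gfx1030 so the prebuilt ROCm kernels load.
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

    import torch  # ROCm builds of PyTorch expose the GPU via the torch.cuda API

    print(torch.cuda.is_available())      # True if the override took effect
    print(torch.cuda.get_device_name(0))  # e.g. "AMD Radeon RX 6700 XT"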
Vulkan Compute backends for numerical compute (as typified by both OpenCL and SYCL) are challenging; you can look at the clspv project https://github.com/google/clspv for the nitty-gritty details. The lowest-effort path so far is most likely via some combination of ROCm/HIP (for hardware that AMD bothers to support themselves) and the Mesa project's RustiCL backend (for everything else).
> Vulkan Compute backends for numerical compute (as typified by both OpenCL and SYCL) are challenging
Microsoft has an offline dxc.exe compiler which compiles HLSL to SPIR-V. Also, DXVK has a JIT compiler which recompiles DXBC byte code to SPIR-V. Both technologies are old, stable and reliable; for example, DXVK’s JIT compiler is a critical software component of the Steam Deck console.
> The lowest-effort path so far is most likely
I agree that’s most likely to happen, but the outcome is horrible from a consumer PoV.
Mesa is Linux-only, Rust is too hard to use for the vast majority of developers (myself included), AMD will never support older cards with ROCm, and we now have a third discrete GPU vendor, Intel.
Why would you want OpenCL? Pretty sure D3D11 compute shaders gonna be adequate for a Torch backend, and they even work on Linux with Wine: https://github.com/Const-me/Whisper/issues/42 Native Vulkan compute shaders would be even better.
Why would you want a unified address space? At least in my experience, it’s often too slow to be useful. DMA transfers (CopyResource in D3D11, copy command queue in D3D12, transfer queue in VK) are implemented by dedicated hardware inside GPUs, and are way more efficient.
OpenCL is stricter with the results of floating point operations, and makes different assumptions with respect to memory aliasing. Whether or not this is important in the AI domain I don't know.
> Why would you want a unified address space?
A unified address space doesn't always imply that the memory can be accessed from anywhere (although that might also be supported with some memory allocation mechanisms), and you still may have to copy between host and device memory. But it makes it much easier to have pointers in your GPU kernels, instead of having to deal with objects like OpenCL buffers.
My laptop has a single GPU inside a Ryzen 5 5600U, i.e. unified memory; all consoles also have unified memory. These devices are fine with the traditional GPU programming model, where shaders only have access to well-shaped pieces of memory accessible through resource views or UAVs.
CPU-style memory access in GPU kernels is technically possible (CUDA did it) but unfortunately rather hard. The feature requires hardware support inside GPUs (you need pointers, they need to be 64 bits, you need 64-bit integer arithmetic instructions), and it is therefore not going to work on many current GPUs. It becomes harder to compile and optimize GPU-running code. On devices without physically unified memory, the performance of these kernels is gonna suck.
Luckily, none of that is needed to implement D3D11 or Vulkan backend for PyTorch. PyTorch is not a general purpose GPGPU runtime, it’s merely a high-level library which manipulates tensors and implements a few BLAS routines operating on them. It’s easy to allocate a GPU resource for each tensor being manipulated.
Vulkan backend for PyTorch exists. It's mostly tested on Android, but it's there. PyTorch maintainers though are reluctant to advertise that as "support" because complete, reliable support for the zillion 'operators' PyTorch includes is quite a different challenge.
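For anyone curious what that looks like in practice, a rough sketch of the prototype workflow (assuming a PyTorch build compiled with USE_VULKAN=1; the model and file name here are just placeholders, and the exact API has shifted between releases):

    import torch
    from torch import nn
    from torch.utils.mobile_optimizer import optimize_for_mobile

    # Trivial placeholder model; any scriptable nn.Module works the same way.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()

    # Script the model, then rewrite it for the (prototype) Vulkan backend.
    scripted = torch.jit.script(model)
    vulkan_model = optimize_for_mobile(scripted, backend="vulkan")
    vulkan_model._save_for_lite_interpreter("model_vulkan.ptl")

    # On a Vulkan-enabled build, tensors can be moved to the backend directly:
    x = torch.rand(1, 3, 224, 224)
    y = vulkan_model(x.to("vulkan")).cpu()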
SYCL (closely related to oneAPI) isn't single-vendor-controlled. It's a Khronos open standard. If you take a look at the spec, you'll see contributions from various universities, Qualcomm, Huawei, Argonne, Altera, and AMD (Xilinx I think). Intel just adopted it (and bought Codeplay, the original contributor).
oneAPI is a set of SYCL related standards. Originally that was Intel, but now that's owned by the UXL foundation, which is part of the Linux foundation.
Couple times in the past I wanted to port open source ML models from CUDA/Python to a better technology stack. I have ported Whisper https://github.com/Const-me/Whisper/ and Mistral https://github.com/Const-me/Cgml/ to D3D11. I don’t remember how much time I spent, but given both were unpaid part-time hobby projects, probably under 160 hours / each.
These software projects were great for validating the technology choices, but note I only did the bare minimum to implement those specific ML models. Implementing a complete PyTorch backend is gonna involve dramatically more work. I can’t even estimate how much more because I’m not an expert in Python or these Python-based ML libraries.
To go on a tangent, I note your custom 'BCML1' 5bit per weight compression codec and your optimised hand-coded AVX2 to encode it... was that really needed? Are the weights encoded on every startup? Why not do it once and save to disk?
Not really, that code only runs while importing the PyTorch format. See readme for the frontend app: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... When loading the model from *.cgml, the model file already contains compressed tensors. That’s how that file is only 4.55 GB, versus 13.4 GB in the original model.
> was that really needed?
For desktops with many CPU cores, a simpler scalar version would probably work equally well. Still, low-end computers don’t always have many cores to use by these background encoding tasks. Also, CPU usage on laptops translates to battery drain.
> Typically, software developers only support a single GPGPU API, and that API is nVidia CUDA.
There is very little NVIDIA and CUDA off of x86_64 and maybe OpenPOWER architectures. So I disagree. Also, OpenCL, despite being kind of a "betrayed standard", enjoys quite a lot of popularity even on x86_64 (sometimes even with NVIDIA hardware) - even if it is not as popular there.
> AMD cards often outperform nVidia equivalents
Can you link to benchmarks or other analysis supporting this claim? This has not been my impression in recent years, though I don't routinely look at high-end AMD hardware.
> because nVidia prioritized ray tracing and DLSS over compute power and memory bandwidth.
Note that for applications which bottleneck on FP16 compute (many ML workloads) or FP64 compute (many traditional HPC workloads: numerical solvers, fluid dynamics, etc), the 7900 XTX even outperforms the 4090 which costs $1600.
In ML workloads, usually the FP16 operations are matrix operations. On RDNA3, these execute at the same rate as normal shader/vector operations, but on Nvidia RTX cards there are Tensor cores which accelerate them. The Ada whitepaper lists 48.7 shader TFlops (not 43 because boost vs base clock), and 195 TFlops for FP16 Tensor with FP16 Accumulate. That's 4 times faster than regular, and almost double what the XTX lists!
Ampere and newer also have native sparsity support which means that you can skip over the zeroes 'for free', which Nvidia uses to market double the TFlops, which is kind of misleading imo. But the 195 TFlops are even before sparsity is included!
I'm not sure if the 93 TFlops (120 with boost clocks) on AMD are with FP16 or FP32 accumulation, as with FP32 accumulation the 4080 slows down significantly and gets much closer with 97.5 TFlops.
Intel Xe-HPG (used in the Arc A cards) also offers very aggressive matrix acceleration via XMX, with 137.6 FP16 TFlops at base clock, vs. 17.2 FP32 TFlops.
You’re comparing general-purpose computations to a proprietary feature with limited applicability. For example, in my Mistral implementation the most expensive compute shader is handling matrices compressed with a custom codec unknown to tensor cores.
This is not an apples-to-apples comparison. We use FLOPS numbers to compare processors of different architectures because they’re the same FLOPS everywhere.
Neither is FLOPS. RDNA3 FLOPS numbers are greatly inflated, because they are only applicable to fairly specific VOPD and WMMA instructions, the former being for two MADs at the same time, and the latter being applicable in the same cases as tensor cores.
Besides, it should be possible to use the tensor cores combined with the codec: you can use vector hardware to decode matrix values to fp16/fp32, then yeet those into the tensor cores. Although most of the time will probably be spent on the decoding part, assuming you're doing matrix-vector multiplication and not matrix-matrix (which might be different with MoE models like Mixtral 8x7B?)
The only important numbers are processing power (TFlops) and memory bandwidth (GB/s).
If your compute shader doesn’t approach the theoretical limit of either computations or memory, it doesn’t mean there’s anything wrong with the GPU. Here’s an incomplete list of possible reasons:
● Insufficient parallelism of the problem. Some problems are inherently sequential.
● Poor HLSL programming skills. For example, a compute shader with 32 threads/group wastes 50% of the compute units on most AMD GPUs; the correct number for AMD is 64 threads/group, or a multiple of 64. BTW, nVidia and Intel are fine with 64 threads/group; they run 1 thread group as 2 wavefronts, which does not waste any resources.
● The problem being too small to compensate for the overhead. For example, CPUs multiply two 4x4 matrices in a small fraction of the time it takes to dispatch a compute shader for that. You’re gonna need much larger matrices for GPGPU to win.
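The dispatch-overhead point is easy to demonstrate with a quick benchmark; a sketch in Python/PyTorch, assuming a CUDA or ROCm build with a GPU present (the same effect applies to D3D11/Vulkan dispatches):

    import time
    import torch

    def bench_matmul(device, n, iters=200):
        a = torch.rand(n, n, device=device)
        b = torch.rand(n, n, device=device)
        if device != "cpu":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            a @ b
        if device != "cpu":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    # For 4x4 matrices the per-dispatch overhead dominates and the CPU wins easily;
    # for 4096x4096 the work amortizes the launch cost and the GPU pulls far ahead.
    for n in (4, 4096):
        print(n, "cpu:", bench_matmul("cpu", n), "gpu:", bench_matmul("cuda", n))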
OpenCL 2.0 was ignored because implementing shared virtual memory was mandatory, but it simply isn't possible to implement without cache coherence all the way into the GPU memory. Now that Intel has developed their own GPUs, they have gone out of their way to completely ignore SVM and have gone with their own proprietary USM extension of OpenCL; the Intel driver developers themselves said SVM is difficult to get right. Intel and AMD were the only ones implementing OpenCL 2.0, and the SVM implementation by AMD was basically unusable.
OpenCL 3.0 got rid of the mandatory features of OpenCL 2.0 that nobody was willing to implement.
Basically a wall of text that boils down to Intel and AMD never delivering anything in OpenCL that would make developers actually care.
They only needed to make something half as usable as CUDA, in polyglot support and graphical development tooling for GPGPU.
I was once in a Khronos panel webinar on HPC, where no one on the panel understood the relevance of Fortran support in OpenCL; then no wonder Fortran workloads get deployed with CUDA.
OpenCL has had a C++ variant for a while. But it is like you said: the consortium members, especially NVIDIA, did not do things properly - they did them either half-heartedly or in an intentionally broken manner.
As LLMs and AI take over as the heated topic of discussion on HN, it is hard not to notice the tendency to blame Nvidia for everything, both hardware and software.
I think we should be blaming the OS platform actors and user communities for letting the lunatics run the asylum (hw vendors dictating sw). And then MS & Apple both went and did their own proprietary stuff.
AMD's products have generally failed to catch traction because their implementations are halfassed and buggy and incomplete (despite promising more features, these are often paper features or resume-driven development from now-departed developers). all of the same "developer B" stuff from openGL really applies to openCL as well.
that's the reason blender kicked their openCL implemention to the curb after years of futzing with it. it wasn't portable because AMD's openCL was so broken they had to have their own special codepath anyway, and it didn't perform well and wasn't worth the effort.
AMD has left a trail of abandoned code and disappointed developers in their wake. These two repos are the same thing for AMD's ecosystem and NVIDIA's ecosystem, how do you think the support story compares?
in the last few years they have (once again) dumped everything and started over; ROCm supported essentially no consumer cards and rotated support rapidly even in the CDNA world. It offers no binary compatibility story: it has to be compiled for specific chips within a generation, not even just "RDNA3" but "Navi 31 specifically". Etc etc. And nobody with consumer cards could access it until like, six months ago, and that still is only on windows; consumer cards are not even supported on linux (!).
That's fine for non-commercial and HPC, but it makes it difficult to imagine ever shipping commercial products to end-users on it. I guess you... ship as source, and your users have to figure out ROCm and compile it in-situ? Or ship a binary for every single GPU to ever release (including NVIDIA, Intel, ....)
This is on top of the actual problems that still remain, as geohot found out. Installing ROCm is a several-hour process that will involve debugging the platform just to get it to install, and then you will probably find that the actual code demos segfault when you run them.
AMD's development processes are not really open, and actual development is silo'd inside the company with quarterly code dumps outside. The current code is not guaranteed to run on the actual driver itself, they do not test it even in the supported configurations, because all the development is happening on internal builds etc, not release code.
oh and AMD themselves isn't using the open driver of course... they only test and validate on AMDGPU-PRO.
it hasn't got traction because it's a low-quality product and nobody can even access it and run it anyway, let alone ship commercial products on it. and now everyone regrets it because it's turned into a money fountain.
AMD consumer cards do compute, but do not have a developed or well supported ecosystem (ROCm is a mess).
Misguided segmentation or a high level architecture problem however, it is not. A specialized product will perform its specialty better than a general purpose product. There's minimal demand for a card that's good at both compute and gaming. These people exist, but are few compared to those who only care about one or the other.
The benefit of splitting GCN into RDNA and CDNA was immediate. Comparing the Radeon VII (GCN 5) vs RX 5700 XT (RDNA 1), you'll see that they're fairly evenly matched on gaming, trading blows with the Radeon VII slightly ahead when averaged. The RX 5700 XT takes a very heavy loss on compute benchmarks. Both are TSMC 7nm, but the RX 5700 XT has fewer shaders (2560 vs 3840), a smaller die (251 vs 311 mm2), and lower power consumption (225 vs 300 w) which shows how much more efficient it is at gaming. That much lower power consumption and noise, and a couple hundred dollars cheaper made it a much more compelling card for gamers.
CDNA cards are apparently missing components needed for gaming, like render output units. Hence, they have no official support for DirectX, OpenGL, or Vulkan. I've never seen anyone get one of these working for gaming. In exchange, their compute performance is so good that a number of companies are buying these cards over Nvidia's, despite the overwhelming CUDA ecosystem. In 2013, a GCN based supercomputer made the top 100. That is the one and only GCN based system to ever make the top 100. Compare to now, where 8 of the top 10 most energy efficient super computers use CDNA accelerators, as does the number 1 fastest supercomputer outright.
> These people exist, but are few compared to those who only care about one or the other.
They're few, but they're very important, because 5 years from now, they will be the ones working for the companies that AMD will want to sell their compute cards to.
The reasons you cite may still be more important. I'm not saying AMD made a bad decision. I'm just saying that neglecting the path from hobbyist to professional (or student to professional) is not insignificant.
They added support for high end Radeon cards 2 months ago. ROCm is "eventually" coming to RDNA in general, but it's a slow process, which is more or less in line with how AMD has approached ROCm from the start (targeting a very small subset of compute and slowly expanding it with each major version).
I am getting the impression their (AMD's) understanding of software/DevX is not as good as Nvidia's. There seems to be some sort of inertia in getting the software side moving - as if, now that they are behind, they are not willing to make any mistakes, so they hold off on doing anything decisive - instead opting to continue in status quo mode while perhaps adding a few more resources.
The issue is that RDNA and CDNA are different so when an enthusiast makes a fast RDNA code it doesn't mean it would work well on CDNA and vice-versa. Not sure why AMD had to go this route, only high-end pros will write software for CDNA and they won't get any mindshare.
Someone writing for the ROCm platform will be writing HIP, and then hipcc will compile that down into actual machine code targeting the given architecture. HIP is pretty platform agnostic, so very little device-specific optimization tends to go into idiomatic HIP.
They're fundamentally a hardware company (as per Lisa Su's CV) which didn't cotton on to the fact that CUDA was the killer. I remember how @Bridgman kept playing a rearguard action on Phoronix, his job being to keep devs on board. A losing battle.
I kinda understand it, because 80s/90s-era hardware people intrinsically think hardware is the top dog in the stack, and that's where all of AMD's management comes from, including Su.
Koduri understood that Nvidia was killing AMD because the consumer cards could run CUDA. So he pushed through, against Lisa Su, the Radeon VII, which until very recently, and for many years, was the only consumer card that ROCm supported. He was basically fired shortly thereafter, and the RVII, which was a fantastic card, was shut down sharpish. Then they brought in Wang, who crystallised the consumer/pro segmentation.
Now they're furiously trying to backpedal, and it's too late. There are multiple would-be competitors, but basically the only one worth talking about is AAPL and Metal.
I live in Toronto and knew people who worked for ATI back in the 1990s-2000s. The management structure was fucked. In those days it was generally regarded that ATI could and did produce better hardware than NVIDIA, but ATI’s drivers were horrible and only evened out near the end of the hardware’s lifetime. But the end of the lifetime was far too late and all the benchmarks would have long been done on the buggy drivers, driving people to NVIDIA.
I kid you not, but the software side of ATI’s reviews and incentives literally worked like this: QA got rewarded for finding bugs, and then the software engineers were rewarded for fixing them. You can imagine what happened next. The bitch of it is that it was overall just a symptom of the problem that the company as a whole didn’t prioritize software. NVIDIA knew that the drivers and the software efficiency of using the hardware were just as important as the hardware, if not more so (the stories from the 1990s of how John Carmack used various hacks with hardware are a good example of that).
I'd really not call the Radeon VII a "fantastic card" by any stretch of the imagination, outside of specialized use cases like FP64 compute. Its only selling points are that and the enormous 1 TB/s of memory bandwidth paired with 16 GB, which was large for the time - a friend summarized it as "I can push 1 TB/s of garbage to nowhere!".
Sure, if those things are relevant to you, it was a killer card, but most consumers judge GPUs first and foremost by their gaming performance, in which the VII was a total dud. The 5700 XT was nearly as fast in those with far less hardware because getting proper utilization on GCN/CDNA is a pain compared to RDNA.
RDNA2 completed the gaming oriented flip with Infinity Cache (which is also useful outside of games, but still has limited application in GPGPU) which allowed them to offer competitive gaming performance to Nvidia with far less hardware, a big difference from the Vega era.
> I'd really not call the Radeon VII a "fantastic card" by any stretch of the imagination
An argument about whether the Radeon VII was a "fantastic card" misses the point; it could have been any other GPU. OP's main point was that AMD had a years-long policy of not officially supporting any of their consumer-grade GPUs for computing applications, and as a result AMD lost critical opportunities during this period - even a small market share is better than none. OP mentioned the Radeon VII because it was the only exception, and it only happened because there was strong support from some groups within AMD - which didn't last long. After RDNA was released, computing support on consumer GPUs was put on pause again until several months ago. There's a big difference between "not optimally designed for computing" and "no official support for computing at all".
The outcome is that it created a great barrier to entry for developers and power users with desktop GPUs. You have to figure out everything by yourself, which is not supposed to be particularly difficult since it's the norm in the FOSS world. The problem is that, in a community-driven project, you would have received help or at least a hint from your peers or from project maintainers who are familiar with the system. Not so for AMD: if something doesn't work, no AMD developers will provide any help or even hints - and by not providing support, the "community" never really got established. The ROCm GitHub Issues page was (is?) basically a post-apocalyptic world abandoned by AMD, full of user self-help threads with no input from developers. Almost nothing works and nobody knows why. Even if the problem reported is a general one, it would still be ignored or closed if you're not running an officially-supported system and GPU.
One time I needed to understand a hardware feature of an AMD GPU, and I eventually had to read ROCm and Linux kernel source code before finding an answer. For most users, that's completely unacceptable.
> Sure, if those things are relevant to you, it was a killer card
Disclosure: I use the Radeon VII for bandwidth-heavy simulations so I have a pro-Radeon VII bias. But I don't think my bias affected my judgement here.
Also it's basically an Mi50 in a trench coat.
You could buy a 32 GiB version labeled MI60, initially with a single mini-DP port, which IMO makes it classify as a "graphics card" rather than a "compute accelerator".
Looks like the split was around 2016? Which could make sense given the state of crypto at the time. One problem that hit nvidia more than AMD is consumer cards getting vacuumed up by crypto farms. AMD making a conscious split effectively isolated their compute cards from their gamer cards.
That being said, I can't imagine this is something that's been good for adoption of AMD cards for compute tasks. The wonderful thing about CUDA is you don't need a special accelerator card to develop cuda code.
The split had more to do with GCN being very compute heavy and AMD's software completely failing to take advantage of that. GCN got transitioned into CDNA, and got heavily reworked for graphics as RDNA. This is why while RDNA is nowhere near as good at compute oriented features, it tends to outperform NVIDIA in pure rasterization.
In contrast, current NVIDIA GPUs are very compute heavy architectures (to the point that they often artificially limit certain aspects of their consumer cards other than fp64 to not interfere with the workstation cards), and they take great advantage of it all with their various features combining software and hardware capabilities.
Correct. Whether or not a card has official GPGPU support is irrelevant for its use in mining; you can use pure Direct3D or OpenGL for mining if you really want to. AMD's decision not to support consumer GPGPU only hampers other, useful, non-mining applications.
Ironically, mining is probably the most optimized GPGPU application a regular end-user can run on AMD GPUs. One reason was that GCN was a computing-first GPU, but also that miners have a strong motivation to go through the trouble of overcoming AMD's lack of computing support. They would write the entire application in OpenCL 1.0 and optimize it to extract the last bit of performance in spite of its difficulties - the hashing kernels used for mining also tend to be much simpler than other applications, so it's feasible to do so. Meanwhile, other people who wanted a decent programming environment would've picked CUDA without looking back.
If I remember correctly, AMD GPU cards were selling very well to crypto miners. The split wasn't because of that. They did it because gaming and computing have different hardware optimization requirements, and splitting allowed them to get better, more competitive GPUs and compute cards on the market.
I was quite surprised to find that at least one of the top500 systems is based on AMD GPUs; it seems to me not bringing that to the consumer market was a conscious choice made some time ago.
They're far behind Nvidia in gaming, so they try to compete on price to get market share back.
Eg the 7900 XTX competes with the 4090 in rastering but is between 30 and 50% cheaper, and the 7900 XT vs 4080 and 7800 XT vs 4070 are all winning by a decent amount on pricing. Frankly this focus has allowed them to catch back up in a field they've been behind in for more than a decade now.
The rise of RT and other compute tools in gaming (DLSS and co) is hurting this strategy, but their APUs being in both the PS5 AND the Xbox Series keeps them in the race; game engines and game developers need to make it work on AMD tech.
With Nvidia's pricing and memory choices this generation, it's the first time in a while that buying an AMD card for gaming is the better choice at almost every price point (unless you can afford the 4090).
> Ooooh, is that why the consumer cards don't do compute?
No, not really. Sure, the consumer cards are based on a less suitable architecture, so they won't be as good, but it's not like it's not a general purpose GPU architecture anymore.
The real problem is that ROCm has made the very weird technical decision of shipping GPU machine code, rather than bytecode. Nvidia has their own bytecode NVPTX, and Intel just uses SPIR-V for oneAPI.
This means that ROCm support for a GPU architecture requires shipping binaries of all the libraries (such as rocBLAS, rocDNN, rocsparse, rocFFT) for all architectures.
As the cherry on top, on paper one GPU family consists of different architectures!
The RX 6700 XT identifies as a different 'architecture' (gfx1031) than the RX 6800 XT (gfx1030). These are 'close enough' that you can tell ROCm to use gfx1030 binaries on gfx1031 and it'll usually just work, but still.
I feel like this is a large part of the reason AMD is so bad about supporting older GPUs in ROCm. It'd cost a bit of effort, sure, but I feel like the enormous packages you'd get are a bigger issue. (This is another factor that kills hobbyist ROCm, because anyone with a Pascal+ GPU can do CUDA, but ROCm support is usually limited to newer and higher end cards.)
> As the cherry on top, on paper one GPU family consists of different architectures! The RX 6700 XT identifies as a different 'architecture' (gfx1031) than the RX 6800 XT (gfx1030). These are 'close enough' that you can tell ROCm to use gfx1030 binaries on gfx1031 and it'll usually just work, but still.
As far as I can tell, the gfx1030 and gfx1031 ISAs are identical in all but name. LLVM treats them exactly the same and all tests for the ROCm math libraries will pass when you run gfx1030 code on gfx1031 GPUs using the override.
The Debian package for HIP has a patch so that if the user has a gfx1031 GPU and there are no gfx1031 code objects available, it will fall back to using gfx1030 code objects instead. When I proposed that patch upstream, it was rejected in favour of adding a new ISA that explicitly supports all the GPUs in a family. That solution is more complex and is taking longer to implement, but will more thoroughly solve that problem (as the solution Debian is using only works well when there's one ISA in the family that is a subset of all the others).
There was a ship-bytecode idea branded HSAIL, associated with opencl finalisers in some way. I'm not sure what killed that - would speculate that the compiler backend was annoyingly slow to run as a JIT. It's somewhat conceptually popular under the spirv branding now.
Some of the ROCm libraries are partially written in assembly. Hopefully they all have C reference implementations that can be compiled for the other architectures.
My guess is that the AMD strategy of only shipping libraries for some cards, plus open source, means that ROCm-ish stacks as built by Linux distros - who are likely to build the libs for all the architectures and not just the special few - can be expected to drive people away from the official releases.
I don't think AMD will be able to compete with NVidia in the near future, because many scientists developing core libraries in the ML/AI space are getting free or heavily discounted GPUs from NVidia. If they had to buy GPUs with their own or grant money, and pay the same price as regular consumers, the situation might be a different one.
IMHO the way NVidia is infiltrating academics and academia, is highly unethical.
It is important to keep in mind that Nvidia started investing resources and time in this 10+ years ago - CUDA was introduced in 2007, when none of the current ML/AI flows even existed. And they kept waiting, betting the company a number of times that the market would "show up" for the products they had built.
In the past few years that has happened, as reflected in the stock price. Other players are basically a decade behind them, and with all of the current hype and AI/ML workflows becoming mainstream, I think it is next to impossible for anyone to catch up.
> Other players are basically a decade behind them
I disagree. NVDA aren't in the position they are because they had to work 10 years to get there. They're in that position because it took the market that long to develop, and they were there the whole time.
But the main factor is demand. You can't rush that, no matter how early you start. And second-mover advantage is a real thing. CUDA is a double-edged sword. Everyone has an incentive to break NVDA's dominance, and ultimately their moat will become commoditized, especially as dominant architectures like Transformers drive the vast majority of demand. People will simply optimize for that.
Can AMD catch up to them quickly on the training / R&D side? Doubtful.
Can AMD catch up or even surpass them on the inference side? Absolutely.
There is also a lot of bad blood towards AMD in the space. I know several people who spent a lot of time trying to support both Nvidia and AMD GPUs back in the early days, only to have AMD stop supporting APIs and make their code useless. Their CUDA code on the other hand just kept working on generation after generation of new Nvidia cards.
I'm not sure how accurate this is. I work at a university supporting researchers doing all the typical "AI" research. LLMs, CV, etc. All NVIDIA discounts is the A5000 card for EDU. Maybe one other card they aren't interested in (L40?). Most are buying A6000 and up at consumer prices from companies like Exxact and Supermicro.
I don't think I've seen a single researcher get a free GPU since the V100 era. (DGX-1) system.
I think this is indeed more of a 2009-2014 thing, and even then it was more "we give you a dozen K40s for your cluster" than "we give you, an individual researcher, a GPU". The stories I've heard of individual researchers asking for free stuff always ended in "no".
I don't see anything unethical about seeding universities (as an organization) with hardware tbh. That's just... grants. Even grants often come in the form of "we fund someone who's doing research on something that's related to us/using our hardware+software".
> Compute has been outpacing memory for decades. Like CPUs, GPUs have countered this with increasingly sophisticated caching strategies.
I'd say it's rather the contrary. Unlike CPUs, GPUs don't attempt to directly counter this. By accepting higher latencies, they have let themselves parallelize much more widely (or wildly) relative to CPUs - and the high number of parallel pseudo-threads provides a "latency hiding" effect.
This effect is illustrated for example, in this presentation on optimizing GPU code:
GPUs do deal with memory in other ways than parallelism! That's why GPUs tend to offer huge register files (up to 256 architectural registers per thread on RDNA1!) and local memories (up to 64KB LDS per workgroup on RDNA1). This means lots of work can be done purely in registers and LDS, and trips to global memory are way rarer compared to CPUs where global memory contain everything except 16 or so architectural registers.
Even then, global memory is an issue. And not just because of latency, but also bandwidth! That's why RDNA2 and Ada both added a ton of last-level cache, not to better hide latency (although it's a welcome addition), but mostly as a 'bandwidth amplifier'.
I think register files, shared memory and even texture memory patterns are basically the GPU's way of having fully programmable caches, as opposed to cache being a grey box like on CPUs. Overall I think for HPC code it works out better. Looking at vectorized and cache-optimized CPU code gives me the shivers...
Mostly due to the above I think Intel's years long push to try to have people just run their CPU-optimized OpenMP codes on their Xeon Phis was the downfall of that architecture - it just can't win against a data-parallel-first framework like CUDA, mainly because there the data parallel compute thread can be programmed just as such, rather than trying to get the compiler to behave exactly as you want (and in the end giving up and dropping down to assembly-like SSE instructions).
I wonder why Intel didn't take the OpenCL route farther. Even plain-vanilla, and certainly with some vendor extensions, it could have made it quite easy for programmers to manage to get the hardware to play nicely.
Also, on an unrelated note: IIANM I believe Intel now sorta-kinda offers the ability to set some cache aside to be close to fully programmable. But I forgot exactly what the mechanism for that is called.
You mean for CPUs? Check out ISPC, it's essentially a lightweight data-parallel C for CPUs (that has been ported to Intel GPUs too!), and I think also works fine on even some non-x86 architectures. I'm not sure if there's any mature CPU OpenCL runtimes.
Intel also still supports OpenCL as a first-class target on their GPUs, which is more than I can say for Nvidia or AMD.
OpenCL was a decent effort, but without Nvidia on board it was IMO too little too late - by the time that became a thing, NVDA already was entrenched in the accelerator compute software space. Even worse when Intel/AMD didn't offer a good story to also target CPU vector units with it - that would have created a strong enough ecosystem to target.
> Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.
> The traditional means to improve performance in processors include dividing instructions into substeps so the instructions can be executed partly at the same time (termed pipelining), dispatching individual instructions to be executed independently, in different parts of the processor (superscalar architectures), and even executing instructions in an order different from the program (out-of-order execution).[1] These methods all complicate hardware (larger circuits, higher cost and energy use) because the processor must make all of the decisions internally for these methods to work.
The most prominent example of VLIW processors was the Itanic, er, Itanium.
It, er, didn't work out well.
Hence Itanic.
Their premise was that the compiler could figure out dependencies sufficiently statically so that they could put multiple sequential and some divergent execution paths in the same instruction. It turned out the compilers couldn't actually do this, so processors figure out dependencies and parallelizable instructions dynamically from a sequential instruction stream.
Which is a lot of work, a lot of chip resources and a lot of energy. And only works up to a point, after which you hit diminishing returns. Which is where we appear to be these days.
VLIW is also still heavily used in DSPs and similar. And arguably the dual-issue instructions in both AMD's and Nvidia's latest GPUs are a step back towards that.
It feels like a lot of these things are cyclical - the core ideas aren't new and phase in and out of favor as other parts of architecture design changes around and increases or decreases benefits accordingly.
This is less cyclical and more common sense changes.
AMD used to have 5-wide VLIW. Theoretical performance was massive, but NOTHING could take advantage of 5-wide parallelism consistently. When they switched to VLIW-4, essentially zero performance per CU was lost. GCN went the complete opposite direction and got rid of all the VLIW in exchange for tons of flexibility.
It turns out that (as we've known for a very long time) in-order parallelism has diminishing returns. If you look at in-order CPUs for example, two generic execution ports can be used something like 80% of the time. Moving to 3 in-order drops the usage of the third to something like 15-30% and the fourth is single digits of usage.
Adding a second port should guarantee lots of use, but only if the second unit is kept flexible. RDNA3 can only use the second port for a handful of operations which means that it can't be used anywhere near that 80% metric for most applications. Future versions of RDNA and CDNA with more flexible second ports should start to be able to leverage those extra compute units (if they decide it's worth the transistors).
It is not so much that it is cyclical than VLIW being good for numerical/DSP stuff with more predictable memory access and terrible for general computation. Itanium was ok at floating point code, but terrible at typical pointer chasing loads.
I'd argue that the Hexagon DSPs shipped with all of Qualcomm's phone SoCs, which a large number of us have in our pockets right now, are probably a more prominent example.
Itanium's problem wasn't so much divergent control flow as not being able to handle unpredictable latencies on load instructions without halting all execution entirely. There were some features that were, in theory, supposed to deal with this but using them was where the magic compiler problem came in.
For DSPs you can usually predict memory latency so this isn't an issue and VLIW works great there even with some amount of branching.
5 points for the Luddite (me) that said AMD would leverage their knowledge of chiplets and bus fabrics to make a comeback in AI. I won't pretend to be able to read this article, or read in general, just wanted to plant my flag.
I hear, read and use the term on a daily basis, and I would say that it emerged in my workplaces around five years ago, and came into common use around two.