So for reference, this geohot is George Hotz who has a company Tiny Corp [0] in the space. Among a long list of things, he had a moment in the spotlight recently for a long rant [1] where he "gave up on AMD" because their drivers sucked more than he expected. The situation is reasonably complex - AMD have some programmers working on ROCm who seemed to be delivering standard fare when that is really not what AMD needs right now (I personally suspect there is/was a PM in a key position who didn't "get it", although I am not sure what "it" is either). As far as I know it got the attention of Lisa Su.
I'm cheering him on, even though his complaints were a little melodramatic. My experience is the driver technically supports everything I could possibly want. The problem is if I spend an evening trying to do anything with OpenCL or ROCm the kernel hard-locks and I go to bed early. If the problem inside is what it looks like from the outside (repeating myself, a key manager somewhere just doesn't get the space) they really need some pressure from grumpy customers like George to realign their software development process.
That context might be related to this particular case. I see a suspicious folder called "crash" in this repo.
NVidia is making boatloads of money because their driver works and they have a software library called CUDA that accelerates neural networks.
Nobody expects AMD to match them, but George thought they could at least write a GPU driver. If AMD can get that driver out, then George could provide a competitor to CUDA (for neural networks only). They'd both make boatloads of money.
However, AMD was less capable than expected and their drivers were too buggy to run neural networks like those needed for the MLPerf benchmark. So now, it appears that AMD, Tinybox, and investors like me won't be making boatloads of money.
>However, AMD was less capable than expected and their drivers were too buggy to run neural networks like those needed for the MLPerf benchmark. So now, it appears that AMD, Tinybox, and investors like me won't be making boatloads of money.
This is where the melodrama kicks in.
They reverse-reversed course less than a week later and now they're back on AMD again.
He talks about in his streams how horrible the communication he’s received from AMD is. From what I listened to, it seemed like that was more of why he was giving up initially. Why do a bunch of free work for a massive company that won’t even communicate with you? Especially when he’s doing them a massive favor for next to no investment on their end?
There's a chance they (maybe specifically the lawyers) know something we don't. I mean, maybe they are absurdly incompetent in listening to feedback, while at the same time achieving technically great things in hardware. But after so many years and seeing all the AI money going to the competitor... That seems less and less likely every day.
"Organizationally unable to make competent software, perfectly able to make great hardware" seems to be the common case with hardware companies, if not de facto standard. Exceptions are rare.
Apart from "trying to implement this will cost us more in CUDA API copying lawsuits than it could earn", I don't know.
But it's not just that they can't make competent software. It's that everyone tells them they should try, that it looks like a pile of money ready to pick up, that people try doing it on their own... and AMD does nothing. They're not even taking the chance to fail/succeed. Can you imagine that Lisa Su doesn't get asked about this at least once a week?
One individual doesn't suffer from the problem of being pulled in multiple different directions by multiple people. A company is not typically led by one person "dictator-style" but instead groups of people who try to make decisions together, sometimes not agreeing.
Sure, but I don't think this is "too many cooks in the kitchen," I think it's the opposite: hardware companies tend to be structurally incapable of spending as much as they should on software because everyone in the hardware space has the same bias. The economics of the space select for it in the short term and against it in the long term, creating the neverending foot-gun party we observe.
AMD has simply never invested into software. Their code has been atrocious since even before AMD bought ATI. ATI "Catalyst Control Center" was their consumer driver code before Vista and into Windows 7 IIRC, and that was utter trash. Granted, nVidia's drivers were ALSO trash back then, accounting for literally 65% of ALL Vista BSODs.
nVidia decided to redouble their efforts, and now they might still crash occasionally, but are largely way better at driver stability and they brought CUDA into the world at the same time.
AMD decided that shitty software didn't seem to stop them from selling GPUs, and they were also too busy desperately surviving a decade of Intel anti-competitive practices that nearly killed the company, so they bet everything on Ryzen. They also worked to make pretty good physical GPU hardware. Meanwhile, their GPUs still couldn't run Blender as fast as a similarly specced nVidia card because their OpenCL implementation was god awful. They ran at literally half the render speed of a similar nVidia GPU. It was stuck on OpenCL 1.x the whole time, because the 2.x implementation was literally broken. They nearly didn't have ANY hardware render solution for an update to the Blender rendering engine in 3.x because OpenCL 1.x literally couldn't do what they wanted, and ROCm is a joke. AMD engineers helped put together an emergency/late breaking fix to create a HIP implementation, and that works, at least mostly.
My pet theory is that not only does AMD not give a fuck about software, but they saw how nVidia was struggling with market segmentation from consumer cards being effective compute cards, and didn't want to run into those same struggles if they had a real CUDA competitor. Instead, they got to rub nVidia's face into the dirt with their GPUs that had way more VRAM, and not worry that it would chew into their professional GPU profit margins, because you can't compute on consumer cards. Oh, I forgot to mention, the whole time this nonsense is going on, AMD is pushing really hard to get their professional GPUs into Supercomputer clusters, and has several premier supercomputer implementations where their GPUs have no problem being used for top level compute tasks, almost like they CAN actually write GPU compute software and just don't give it to consumers.
Slight mistake in your description: CUDA is an out of date API that was replaced by Khronos's own official compute APIs. Khronos is a standards consortium that Nvidia is a founding member of.
Although the marketing department at Nvidia still pushes for greenfield CUDA codebases, no new code should be written in it, and they should opt for open source international standards only. Khronos APIs are implemented by over 120 vendors.
From my past experience in Khronos, NVIDIA is indeed a member and they sent decent guys to the meetings -- but only for strategic reasons, rather than "advocating for open standards" as you described. My experiences there actually told me the opposite: they will never drop CUDA. Objectively they also have an incentive to keep it: fighting with 100-ish companies to ratify something is always slower than rolling out a feature in an ecosystem you have total control of.
How many standards has Khronos endorsed over the years/decades on compute? From an uneducated and external view, it seems every 2-5 years there is a new standard.
Ultimately 2. OpenCL, and Vulkan (via its compute shader).
Sycl's job isn't that; it's meant to abstract implementations of common components across different kinds of hardware, and doesn't force you into any particular style of impl. As in, I could write a component for Sycl for my GPU in OpenCL and what Sycl would abstract away from the consumer of my component would be the entire usage of OpenCL itself; but I could also write a component for a DSP that uses an entirely closed source SDK for that hardware and is entirely opaque, and a Sycl user could use that impl for that function if they owned that DSP (instead of a CPU-based or GPU-based impl).
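To make the "same function, different opaque impls" idea concrete, here is a toy Python sketch of that kind of dispatch. It is not SYCL and uses no SYCL API; the registry, the `register`/`run` helpers, the `scale2` operation and the backend names are all made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Impl = Callable[[List[float]], List[float]]

# Hypothetical registry: operation name -> {backend name -> implementation}.
_IMPLS: Dict[str, Dict[str, Impl]] = {}


def register(op: str, backend: str) -> Callable[[Impl], Impl]:
    """Decorator that files an implementation under (op, backend)."""
    def deco(fn: Impl) -> Impl:
        _IMPLS.setdefault(op, {})[backend] = fn
        return fn
    return deco


@register("scale2", "cpu")
def _scale2_cpu(xs: List[float]) -> List[float]:
    # Plain CPU fallback.
    return [2.0 * x for x in xs]


@register("scale2", "dsp")
def _scale2_dsp(xs: List[float]) -> List[float]:
    # Stands in for a call into an opaque, closed source vendor SDK.
    return [x + x for x in xs]


def run(op: str, xs: List[float], available: Iterable[str],
        preference: Tuple[str, ...] = ("dsp", "gpu", "cpu")) -> Tuple[str, List[float]]:
    """Pick the first preferred backend the user owns and that has an impl."""
    owned = set(available)
    impls = _IMPLS[op]
    for backend in preference:
        if backend in owned and backend in impls:
            return backend, impls[backend](xs)
    raise LookupError(f"no implementation of {op!r} for {sorted(owned)}")
```

The caller only ever names the operation; whether the work lands on the CPU path or the vendor-SDK path depends on what hardware they say they have, e.g. `run("scale2", [1.0, 2.0], {"cpu", "dsp"})` picks the `dsp` impl.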
Also, Vulkan's compute doesn't replace OpenCL (not even in the sense that Vulkan, as a graphics API, replaces OpenGL). They're different levels of abstraction. Most Vulkan games are written almost entirely in compute shaders (ex: the powerhouse that is the Doom 2016 and Doom Eternal engines; and why they perform so fucking amazingly on paltry hardware like the original revision Xbox One, or hell, even the Switch).
In addition, I almost consider DX12 a flavor of Vulkan. Same job, written largely by the same people from the same companies, but instead of being OpenGL C-dialect flavored, it's D3D C++-dialect flavored, but they both have entirely equivalent APIs that often call the same driver internals and produce nearly identical MIR. Microsoft did this on purpose to reflect the nature of how modern GPUs are almost entirely software renderers, sans certain parts of the texture units.
You'll note that AMD (and Nvidia, but that's expected) is unfortunately missing from the members list for UXL, which seems to be based on Intel's oneAPI which in turn is based in part on Khronos's SYCL:
> Our mission:
> • Build a multi-architecture multi-vendor software ecosystem for all accelerators.
> • Unify the heterogeneous compute ecosystem around open standards.
> • Build on and expand open-source projects for accelerated computing.
If AMD got their act together they would go all in on open standards together with other actors in the segment, but maybe they really want to see if their ROCm can repeat CUDA's success and lock-in? I just want accelerators to become commodities and interchangeable like CPUs: let me write my code once and deploy on whatever hardware the user has, preferably without having to compile multiple versions.
Nope, they basically open sourced their API, which is kind of a worthless proposition. You would think this is malicious, trying to pretend to be open source, but the whole thing is done so ineptly that it is just incompetence. Likely management has no idea what open source means and thinks releasing API code means open source and is good enough for customers to use your GPU.
More concretely, what they open sourced calls closed source functions that run on the GPU which is where all the bugs are, and causes all the crashes.
> More concretely, what they open sourced calls closed source functions that run on the GPU which is where all the bugs are, and causes all the crashes.
In particular, what they did was open source the drivers but not the firmware, and then when the firmware has bugs nobody outside the company can fix it because it's not open source.
Yes, that's a problem, but it's also a bit of an unfair dunk.
After all, your CPU almost certainly runs firmware with all the same issues. Not to mention all other GPUs from all other vendors.
Despite all of the noise-making, I don't think anybody has really convincingly made the case that the firmware quality is the main issue here, as opposed to the quality of the rest of the stack.
> After all, your CPU almost certainly runs firmware with all the same issues.
CPUs don't really have firmware in the same sense. CPUs expose basic primitives (machine instructions). Modern CPUs will translate them to micro-ops in microcode instead of executing them directly, but the mapping is relatively simple and bugs are correspondingly uncommon.
The GPU firmware is doing something complicated, and therefore buggy, but you're required to use their complicated buggy closed source code because it also doesn't directly expose simpler primitives you could use to build your own alternative to it.
> Despite all of the noise-making, I don't think anybody has really convincingly made the case that the firmware quality is the main issue here, as opposed to the quality of the rest of the stack.
The rest of the stack should be an abstraction layer that makes programmers not have to care what kind of hardware is under them, the same as they generally don't have to care if their CPU is from AMD or Samsung. But then you need the code to translate from that abstraction to the hardware, and that code needs to work. Which can either be accomplished by the vendor providing working code, or by the vendor providing sufficient documentation and source code for someone else to write working code. Providing neither of these is not effective.
What AMD should be doing at this point is both. Open source the existing firmware so that people who are trying to use their hardware right now have the ability to fix any problems they encounter themselves, while the company uses the money they now have to address their existing technical debt, which nobody reasonably expects to happen overnight, but which will happen faster when it's not only the company but also the users working on solving the problem.
idk, but Twitter under early Musk leadership sounded like anything but a bureaucratic swamp and he still absolutely and predictably failed. However, Twitter was a pretty different problem domain to what he was initially known for
There's a difference between fixing technical problems and cultural problems. Hotz has proven to be a wizard at technical ones, while his skill with fixing cultural issues remains to be seen.
Any long-lasting solution requiring continuous investments is going to require more impact on the people side than the technical side. Hiring maybe one or two technical wizard asshats at a company may work for a bit but can’t scale in that eventually asshats will either drive away enough potential contributors or clash with another asshat. Any company actually needing said asshats to conduct business has a form of organizational tech debt in that they’ll need to spend a lot of resources managing around said persons in the end while said person’s ego continues to grow convinced they’re primarily responsible for successes.
>I'm cheering him on, even though his complaints were a little melodramatic.
I've really soured on him when he literally went "I'm going to fix twitter search, hire me elon" and then went to public saying "Hey, want an internship? Fix it, and then MIT license your code(so he could use it lol), and maybe we can talk - oh btw I have no real authority to give anyone an internship."
He might be a brilliant programmer and problem solver, but that incident left a very sour "grifter" vibe.
Yeah, I was feeling similar vibes back when Lex Fridman interviewed him for some self driving stuff back in the day (...I've soured on Lex too since he's not really an interviewer/journalist so much as a host/promoter and rarely gets to anything interesting in his talks imo).
His work for jailbreaking/actually being able to own the devices we pay for back in the iPhone+PS3 days was and is still inspiring, but I think being so smart and encountering so much bullshit in his life kind of led him to the 'rockstar coder I know better than you regardless of domain' personality, and that leaves him both a bit grating and underequipped to drive consensus. He's still a great wrench to throw into the bullshit machine though, and the guy is smarter and more accomplished with computers than I'm ever likely to be.
This is a great take. We need more people that are willing and able to jump in and wreck new bullshit machines as they crop up in diverse industries. They can even keep their boastful grandstanding attitude as compensation.
What if the "wants to wreck bullshit machines" trait often co-occurs with the "grandstanding attitude" trait? You'd be unnecessarily restricting the applicant pool because you don't like how they express themselves, which seems pretty dumb given that we're being flooded by bullshit machines.
We're surrounded by assholes. It's up to you which one you'd rather deal with, bullshit machines or assholes. I'd rather we have neither, but between the two I'd rather bullshit machines because then I don't have to deal with assholes. You are free to choose dealing with assholes and fewer bullshit machines though.
> He might be a brilliant programmer and problem solver, but that incident left a very sour "Grifter" vibes.
Your algorithm for "who is a grifter" is buggy. You can tell someone is not a grifter based on what code they ship, not what words they say ("Cypherpunks write code" being the mantra).
Failure doesn't prove someone is a grifter, btw. Not acknowledging failure is what makes one a grifter. A grifter would continue to draw a Twitter salary without actually fixing anything, and there were many such cases before Musk took over! Instead, Hotz quit upon realizing that he couldn't do what he said and publicly acknowledged that he couldn't fix Twitter search.
This is also why his companies/projects have bounties instead of "internships". Their code is open source and actually writing code is a simple filter that eliminates people who don't write code but are good at _saying_ that they will. By not even guaranteeing a paid internship in exchange for a pull request, he ensures a contributor's incentives are aligned with improving the technology as opposed to gaming a hiring process. And he pays cash for bounties so you can accurately evaluate whether it's worth your time to contribute.
In other words, he learned from his failure at Twitter that if Twitter's code had been open he would have been able to make an accurate assessment of whether he could fix it by actually submitting a PR!
I soured on him a bit more when I watched some of his stream yesterday and he was ranting about diversity being a root of evil in Silicon Valley. Struck me that he hasn’t matured much since he was slinging mud at Sony and Apple.
He's given me grifter vibes for a while, so it's nice to see that others are catching on too.
Feels like the usual trend of someone actually skilled in one specific thing (hacking) letting the social media notoriety get to his head and thinking he's just as much of a genius at everything else.
They do need to rework their development processes. It’s worth noting though that there are two (kinda three) different drivers with very different quality and different teams behind them.
There’s the binary driver on Windows and the fglrx driver for Linux, which contains the full OpenGL stack and all the proprietary stuff. That’s been maintained and extended since the ATI days by a team primarily based in Markham. It’s garbage. It’s been garbage since the 90s. This is not new or surprising. It’s unstable, poorly maintained and just generally a bad time. It was also the reason OpenCL required an X session run as root with no access control and sometimes a stub DVI cable on the card for many years. It is a source of endless sadness and despair.
The other driver is the HSA/ROCM driver on Linux, which can be used with the amdgpu upstream kernel driver and is itself upstream. The amdgpu driver is 2d only, but has been developed in the open, and stable, for a long, long time. It’s maintained by a completely separate set of teams mostly in (last I knew) Germany and Austin. The HSA driver has likewise been immensely more stable than the binary drivers since it was introduced. There are potential problems with it, but I’ve managed to recover every single one I’ve run into without a reboot, and I’ve been doing GPU compute with AMD, NVIDIA and intel since 2008. When it was the binary driver, we needed reboots every time an AMD GPU locked up. I’m not saying there’s no problem here, but he’s reporting problems I’ve never seen in thousands of hours of high-load compute both personally and for work with the driver one ought to use for ROCM. There are other problems, but HSA/amdgpu drivers have not been high on that list for me. Possibly because I was so used to how bad the other ones were I guess…
Aren't the relevant 3D bits in userspace (not the driver)?
Isn't the concept of 3D APIs limited to stuff like Mesa, with the kernel space AMDGPU driver providing only the primitives on which the API can then be built? I could be wrong, but that's how I understood it.
In things like Fglrx, the OpenGL stack was part of the driver proper IIRC.
But granted, I haven't heard of or seen FGLRX in literally more than a decade, and describing AMDGPU as 2D only is misleading or pedantic.
AMDGPU enables the whole hardware sans the video encoders. 3D is standard OpenGL, provided by Mesa.
It's fully accelerated, incl. video decoding & HDCP. Full stack sans the card firmware is open source.
The video decoder block is completely independent from HDCP block on the silicon to be able to provide open access to these parts too.
I don't know why they can't/don't open source video encoder parts. Probably some 3rd party royalties, but I'm not sure.
If you want another 3D API, you can directly build it on top of the driver. Nothing prevents that. Plus, ROCm packages are landing to "free/main" part of Debian for quite some time.
It's mostly Mesa. The kernel - user space division is kinda like it would be in a microkernel OS: the kernel provides hardware access and manages permissions and resource (time, memory, energy) allocation. User space does the rest.
The kernel part is not really that small but you get the idea.
This is wrong for two reasons. One, he claims that AMD's enterprise GPUs are also unreliable. Two, he is comparing vs. 4090 which is a consumer GPU and CUDA is very reliable as it is on all of Nvidia's consumer hardware.
I watched geohot streams and if he can go for an hour without a need to restart the computer due to a crashed kernel driver/locked up card, it's a miracle (various computers, one or multi-card setup).
And he is not trying to crash the driver, just making the most basic stuff work.
I seriously think there is a deep HW bug that caused the difference in performance between RDNA3 presentation and reality and they had to do some horrible hacks to make it work.
Either that or the card has never seen a tester for compute. Which is strange, considering they added it as a "supported" option.
> The problem is if I spend an evening trying to do anything with OpenCL or ROCm the kernel hard-locks and I go to bed early.
In a race against NVIDIA where every comment on here is about just how much better CUDA is than any alternative, why doesn't literally what you just said "also" have the attention of Lisa Su?
I have had multiple replies from Lisa Su on both Twitter and by e-mail. It doesn't help.
AMD is structurally incapable of fixing these issues. They don't even have a 7900XTX in CI. Crashing the firmware is so trivial there's no way anyone fuzzed anything.
They debug by the application, adding mitigations at many layers of the stack to make it work. While this strategy is fine for the 20 mainstream games that come out per year, unless you root cause issues and have a good CI, you'll never build a stable GPU for general compute.
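For anyone wondering what "fuzzing in CI" would even look like at its most basic, here is a minimal Python sketch. It is only an illustration of the idea (seeded random inputs, stop at the first failure, keep the failing input so it can be replayed and root-caused); `random_command`, `fuzz` and `toy_target` are hypothetical names, and the toy target merely simulates a device that chokes on certain packets. Nothing here talks to a real GPU or reflects AMD's actual tooling.

```python
import random
from typing import Callable, Optional, Tuple


def random_command(rng: random.Random, max_len: int = 16) -> bytes:
    """Build one random byte string standing in for a command packet."""
    length = rng.randint(1, max_len)  # inclusive on both ends
    return bytes(rng.randrange(256) for _ in range(length))


def fuzz(target: Callable[[bytes], None], seed: int,
         iterations: int) -> Optional[Tuple[int, bytes]]:
    """Feed seeded random packets to `target`.

    Returns (iteration, packet) for the first packet that makes the
    target raise, or None if every iteration passes. The fixed seed
    makes a failure reproducible on the next CI run.
    """
    rng = random.Random(seed)
    for i in range(iterations):
        packet = random_command(rng)
        try:
            target(packet)
        except Exception:
            return i, packet
    return None


def toy_target(packet: bytes) -> None:
    """Hypothetical device stand-in: 'hangs' on long packets starting with 0xFF."""
    if len(packet) > 4 and packet[0] == 0xFF:
        raise RuntimeError("simulated hang")
```

The point of returning the exact failing packet is the root-causing part: you replay that one input against the device until you understand why it fails, instead of adding a mitigation for whichever application happened to trigger it.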
You read stories in the glory days of Microsoft where they would bend over backwards to ensure backwards compatibility [0]. Going from that to what you say is depressing. It’s actually hard for me to accept your statement because such a failure would not only require QA/the PM to be totally ignorant of current testing best practices, but also require HR to have catastrophically misidentified both those people and the people who supervise them.
The shocking thing about AMD in the last 5 years isn't that they got Ryzen to work, it's that they didn't screw it up within two releases. There are no other technology companies in the world who have a will to fail so ingrained in their corporate DNA as does AMD.
Yikes. For context, the 7900XTX and 7900XT are the only consumer GPUs that AMD officially support running AI workloads on, and the XTX is the one they recommend out of the two. Other cards in the same generation and any from previous generations are officially not supported or tested and may not work. So it's not like they have a particularly large amount of hardware that needs to be included in their testing environment.
Genuine question for you then, why are you investing your time reversing their firmware if it sounds like you think they're failing/will fail? Why not just "adopt" the golden child/golden standard NVIDIA now for your usecase, reverse that (if needed) and move on? Genuinely curious what you gain from trying to make AMD work/investing in AMD on your end here. Is it that you see a value proposition through the "muck" of crashy-kernels that I'm missing?
Did you ever consider using NVIDIA? I thought the machine you were planning to sell originally had RTX 6000s.
Also, isn’t the real problem with the 7900 XTX the lack of interconnect? I really appreciate the long list of stuff you are agitating AMD about. Infinity Fabric and blower fans would be great too. Another POV is none of that is ever going to happen, and everyone will have to wait for a Chinese GPU manufacturer to create real competition. I can’t recall a single time in my life a non-Chinese company - electronics or otherwise - delivered the “same” features for a lower price.
> Why not just "adopt" the golden child/golden standard NVIDIA now for your usecase, reverse that (if needed) and move on?
Oh oh, I will answer you with the business side of the thing: because Nvidia sells exactly the same thing as Tinycorp. Also, Nvidia has shown that they will crush you if they get a whiff that you are doing something they don't like, for example, speaking with a competitor. With a business partner that has you by the balls like that, it is sound business strategy not to depend on them.
I thought he sort of gave up and is going to build his stuff on Nvidia in the end? It's wild. I'd want AMD to present some sort of alternative if only because you can't get any high-mem chips these days but they seem to have some massive organizational uselessness on this front.
AMD GPUs are fantastic value for money when it comes to running mainstream PC games. Once you step outside that narrow box they become infuriating. My 7900XTX is probably the last AMD GPU I ever buy. It's just horrible at all the other things I want to use it for, especially and gallingly, game console emulation. For the first few months I had it I actually kept my old GTX1070 plugged into my motherboard because the eight year old $400 GPU was so much better at half my use cases than the $900 AMD card.
His thesis is that owning a good software platform is a prerequisite for a successful AI hardware company. He is (or was) using AMD as a stepping stone to create and flesh out that software platform (tinygrad) before starting custom ASIC development, instead of going the other way around with a costly ASIC first and trying to develop a software platform after, or trying to push support for an ASIC into someone else's platform such as PyTorch. Not sure I agree but it's an interesting idea.
It's also worth noting that tinygrad is bypassing as many layers of AMD's software platform as they can get away with (MIOpen, ROCm, and even the userspace GPU driver).
If the AMD code makes it into the kernel, like their graphics drivers did, then we've won because it will just work on every kernel update.
Compared to now, where AMD drivers just don't work and NVidia drivers only work with the right blobs installed, I'd be willing to take a pay cut for 6 months to have 10 years of painless development.
That said I jumped ship to NVidia since I need to work and not spend all my time trying to get servers to the point that they can work.
I’m going to remind everyone that AMD and NVIDIA are atypically competition-avoidant, but I don’t want to defend a generalization that broad, so let’s be specific: Hopper and MI300 are on paper just straight-up competitive, both super scarce, high-margin cards that really love PyTorch (which really loves NVIDIA; even TPU is not the first among equals).
But all the MI300 is going to supercomputer and adjacent things, and all the Hopper is going to giant tech LLM type stuff, it’s not the same people bidding on those bins of those.
Oh, and their respective CEOs are closely blood-related and trivially on a first-name, if not family-gathering, basis.
The people rooting for George generally think this isn’t real capitalism. We think it’s a trend towards the failure to enforce anti-trust laws.
>I’m going remind everyone that AMD and NVIDIA are atypically competition avoidant, but I don’t want to defend a generalization that broadcast let’s be specific: Hopper and MI300 are on paper just straight up competitive, both super scarce, high-margin cards that really love PyTorch, (which really loves NVIDIA, even TPU is, not the fist among equals).
> But all the MI300 is going to supercomputer and adjacent things, and all the Hopper is going to giant tech LLM type stuff, it’s not the same people bidding on those bins of those.
They clearly compete fiercely, every feature NVIDIA adds to its consumer GPUs gets an equivalent version by AMD. Previously AMD has introduced features that have caught on enough to force NVIDIA to implement them. AMD's offerings often force NVIDIA to discontinue certain products or to drop prices.
The reason MI300 is going to supercomputing and Hopper to AI is the self fulfilling prophecy that AMD's software still sucks for AI, thus they have to target general supercomputing.
>Oh, and their respective CEOs are closely blood-related and in a trivially first name if not family gathering basis.
This is mostly just a meme, they're distant relatives, which is not too surprising considering they're both from Taiwan. Jensen's maternal uncle is Lisa's grandfather, with said maternal uncle being the eldest of 12 siblings, 18 years older than Jensen's mother. I barely know the names of the eldest of my aunts, who have a similar age gap, let alone have a close relationship with them, because that age gap and number of siblings means that even my mother barely knows them.
This is a contentious debate of which I merely think everyone interested in this should be aware. I included my position, which is shared by many, that this is ridiculous if not illegal.
I hope I didn't confuse anyone as to there being a different and dramatically better funded side of the argument, given that everyone on HN is squarely the target audience for the PR blitz on this. Everyone knows all the arguments for why we're all supposed to say "This is fine. Everything is great here.".
A much milder position than "cousins are running companies that seem to be coordinating" is: "55.58% Net Profit Margin last quarter isn't consistent with a functioning market". Those are Wintel margins, those are "get more than noticed by the DoJ" margins.
Throw in everything `ROCm` down to the "hangs randomly doing ostensibly supported things on a mainstream platform like Ubuntu 22.04 with a modern kernel" being pretty clearly under-resourced for an organization that can do Zen4 and its basically flawless firmware / UEFI / driver / etc. story?
AMD can build a 7900XT that is a joy to use for people who don't need 100GB of VRAM or FP8 training. They'd sell a zillion of them at margins that even EPYC would envy, and they've had people on their ass in public about it for coming up on a few years now.
NVIDIA can build a prosumer card, and continue to be the GPGPU vendor that all of us had no compunctions about listing as one of our favorite companies in tech (with a few exceptions like the Linux fiasco ages ago) even a couple of years ago. The "holy shit you can do all this on a 3090-Ti and a normal tech worker can save up enough to buy one pretty easily" days were imperfect, but I still really liked NVIDIA, or at least enough not to be desperately looking for other options as my default posture.
Both companies can easily make a pile, look like the good guys, draw hackers into the ecosystem, and enjoy both the wads of cash and love that everyone would be throwing at them. The competition would be better for both teams! It just wouldn't turn into a short-term asset, it would turn into a long-term asset: it would be "our team is in true fighting form on this", and you can't report that next quarter. And the parts of AMD and NVIDIA that have serious pressure on them, that are in fighting shape? Those teams are red hot, they're killing it.
I'm building a bare metal cloud service provider entirely around MI300x. Anyone (within US export restrictions) can have reasonably priced access to them. You get full access to the cards (not API based). If you take an entire machine (or cluster of them), you get BMC level access. It is as if you're sitting at the machine yourself.
In other words, I'm building my own supercomputer, and making it public.
Thank you for two things: first, thank you for correcting my error in saying "all the MI300" is going to supercompute. I am aware that a few groups are doing awesome stuff with making the platform available to mortals.
Second, thank you for being one of the people trying to make the platform available to mortals. As Sheryl used to say when someone did the right thing: "You're doing God's work."
Is there a link or a mailing list or something that I can watch for how to get access when it becomes available? I'd like to have `HYPER // MODERN // AI` support the platform well.
edit: It's in your profile. I will check it out tonight. Keep it up!
We started the business last year, as a proof of concept before MI300x was even released. After a lot of work to build the necessary relationships, we got a box of MI300x, deployed it into our data center, and immediately onboarded a very large customer.
Now that we've completed that PoC phase of the business, we just closed many millions in additional funding that will go purchase many more GPUs. Thanks to our hard work and excellent investors, it is actually happening!
While we wait for more GPUs, we are donating time on the box to anyone who'd like to run benchmarks [0] and publish unbiased blog posts (with repeatable source code). Our hope is that once people see how well these perform, they will consider porting/running their code on our systems.
If they don't perform we will be transparent about that as well. I'll have a good set of data to bring back to AMD/SMCI to try to resolve any issues. My guess is that AMD will take that a lot more seriously than someone trying to make consumer hardware work, in an enterprise setting. This isn't a knock on George's ambitions, nor the need for AMD to make consumer products work better with ROCm/AI. I just feel that AMD has limited resources today and they have to focus on one thing first, which is what they are clearly doing in their responses to him.
We can save the debate about whether AMD can walk and chew gum at the same time :)
The important part is that you’re a (potentially very big) part of the solution: if you can get serious adoption via a seamless experience, that’s a major win for free, fair, functioning markets via robust competition sparingly but well-refereed. You’ve won a fan and evangelist at the conceptual level, and I’m sure you’ll be hearing from me about wanting to make sure the stuff I’m moonlighting on supports the platform well, at which time I’d happily do an unbiased write-up: I’ve got a bias for functioning markets, not one particular vendor.
Thanks for the productive dialog and the fan support. Greatly appreciated!
> We can save the debate about whether AMD can walk and chew gum at the same time :)
It is a huge risk for my business, so it is something that I'm taking very seriously. In a past life, I've run a lot of consumer AMD GPUs, I'm well aware of their positives, and negatives. My feeling is that George didn't factor this risk into his business, and then went into panic mode when he ran into issues. This is exactly why I'm going the "enterprise" route first.
> I’ve got a bias for functioning markets, not one particular vendor.
I strongly concur with this, which is why AMD won't be our only offering. I'd love to get other hardware (including, but not limited to, Nvidia and all others). That said, we're just focused on starting with creating functioning markets first.
One thing everyone should keep in mind about this NVIDIA/AMD battle right now: CUDA has been published for 16 years, it's been a huge push by NVIDIA to do GPGPU computation. I remember seeing it as a new thing in university back then, after the advanced shaders that were only available on NVIDIA.
NVIDIA pretty rightfully has the lead there, because they worked on and invested in it for something like 20 years (you could do pretty advanced shaders on NVIDIA pre-CUDA).
It only started to pay off recently, and especially with the AI hype (GPU mining was nice, too).
Now everybody is looking at the profits and goes like "OMG, I want a part of that cake!", either by competing (AMD / Intel) or by paying less for the cards (basically everyone else in the AI space).
But you have to catch up to 16 years of pretty solid software and ecosystem development. And that's only going to work if you have good enough hardware. NVIDIA did the hard work here. They have earned this lead.
I am saying this as someone who would rather not buy NVIDIA. I really wish I could soon throw 1-2 7900XTX into a machine and use it for LLMs without issues. But I would also bet that it takes at least a few more years to catch up, even with the massive global interest.
Yes, this is the counterpoint to the “55.58% Net Profit Margin last quarter isn't consistent with a functioning market” thread above. Sure, selling shovels during a gold rush is very profitable… and Nvidia invested a long time in building the best shovels, for a lot of years where the net profit of doing so was intensely negative. They built the prospector community up and sponsored the development of geological science that spurred advancement of knowledge and practice - using their products, of course.
(The Michelin star model of hardware sales - did you know that Michelin actually makes money from that book!? They’re not doing it because they’re financially disinterested, can you believe that!?!?)
Anyway, net profit is more like 15% in a normal year. Recently it is actually lower; Ada is already lower margin than Pascal, for example.
It is only this high because Nvidia finally struck gold - and they spent a lot of effort and money that might never have paid off.
The RDNA docs are just the tip of the iceberg. In order to get to the compute engines (what you actually care about) you have to go through 3 layers of user-space cruft, a kernel driver, and then a firmware layer running half a dozen separate components of the GPU that "manage" the RDNA compute engines. Apparently, these components run on at least 3 different ISAs (some ARM, some F32, and some on RS64) where 2 of them aren't really documented at all. All of this is what he's trying to bypass and talk to the RDNA cores as directly as possible. Modern consumer GPUs are complex beasts.
I guess George's 5-hour poking and googling around session, fruitless in terms of the actual results (making compute on AMD's consumer RDNA3 chips stable), is newsworthy on HN as of now.
I used to be an AMD diehard fan. Bought their processors and GPUs for close to 20 years until I simply couldn't take it anymore - 5 years ago I started buying NVIDIA GPUs with Intel CPUs and honestly never looked back. Sure I pay more, but it's worth paying more for reliability and software support. Anything else and I'm losing money out of ideology to give to a business that doesn't have its act together.
It's always intriguing to see other people's takes. I'm in nearly the complete opposite boat: in recent years I've switched to AMD and it feels like all of my hardware problems have gone away.
Heaven is using AMD cards for graphics and NVidia for computation.
Hell is the reverse.
I have an extra Radeon VII in my ML work station because I don't have to fight nvidia graphics drivers to get it to work. I have the NVidia cards so I don't have to fight AMD drivers to get ML drivers to work.
I'm assuming GP runs Windows and you run Linux. NVidia's proprietary drivers are known to be a lot more stable than AMD's. But on the other hand the open source driver and software stack for AMD (partially shared with Intel) is much more stable than NVidia's on Linux.
I'm surprised to read such a take given pretty much every enthusiast I know went through the reverse cycle over the last 5 years when it comes to CPUs. I don't know anyone right now who is running Intel on their desktop computer. Even on mobile Intel seems to be losing ground. Only the GPU department is an uphill battle for AMD, where they can't compete with any of NVIDIA's high-end offerings.
However, observing Intel's cards from an average consumer's perspective, they really showed what difference good software makes when it comes to having great performance hardware.
Meanwhile my 7800x3d and 7900xt sip power compared to the new Intel chips pulling 400W on the processor alone. I also have 20GB of VRAM future-proofing me for some time.
Imagine being a fan of a company when your options are VERY limited ((A or B) or C).
Making an informed decision at the time you want to buy/upgrade something is a relatively simple task these days.
As someone who basically grew up with Intel/Nvidia but is now running AMD/AMD, while I could agree with you on the GPU side that AMD still has more work to do, I would hard disagree on the CPU side.
Especially with the X3D line, AMD CPUs can absolutely smoke Intel and, at worst, come within a single percentage point of its performance in usually single-thread-bound scenarios.
So why does AMD act like they don't care/can't care about this? Is this such a specialized area that the programmers are hard to find, or are they being poached by Nvidia, or is something else going on? If the latter, what?
Maybe that's where they run the HDCP/DRM stuff? That's the most obvious application of a secure enclave on a GPU that I can think of, and would also explain why they won't (or can't) open it up.
[0] https://tinygrad.org/
[1] https://www.youtube.com/watch?v=Mr0rWJhv9jU - as I recall he ran the demo suite and it crashed.