AMD Instinct MI300 Data Center APU – 146B Transistors, Shipping H2’23 (anandtech.com)
70 points by kristianp on March 1, 2023 | 40 comments



That is amazing and incredible, but it still won't make up for the fact that ROCm is a garbage fire and OpenCL is slow. Nvidia has more than good hardware.


Arch is packaging ROCm natively in testing, and PyTorch has official-ish ROCm builds now (and maybe Arch will build those too).

But you should also check out MLIR, which, for instance, is used very successfully here and is benching way faster than ROCm: https://github.com/nod-ai/SHARK/
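
For reference, a quick way to tell whether a given PyTorch install is actually a ROCm build (a minimal sketch; the version string is just an example):

    import torch

    # ROCm builds expose a HIP version string here; CUDA builds report None.
    print(torch.version.hip)  # e.g. "5.4.22803" on a ROCm wheel
    # ROCm reuses the torch.cuda namespace, so the usual checks still apply:
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))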


Debian also has a lot of ROCm-related packages, from the base libraries up to application software. Looks like AMD is pushing to make it widely available and easy to work with.


I think we should clarify that this push is for making it easy to use, not to develop. They are trying to capture the Python-based ML community, but care a lot less about people who actually want to write code in ROCm than Nvidia cares about CUDA coders.


And the datacenter cards (which are even less accessible than an A100) are still the first-class citizens, while CUDA performance on desktop cards is quite excellent.


ROCm works okay for me in casual use; what issues do you have?


ROCm's software stack runs on what feels like Ubuntu LTS only (good luck on Debian or derivatives!); you need very specific PyTorch versions that work with ROCm (no running the bleeding edge easily); and running an old UEFI version (e.g. 2 or more years old) often causes a no-boot/no-POST situation.

ROCm is a fragile stack that works on a very small subset of AMD's hardware with 3 blessed Linux distros. It doesn't have to be this way: most APUs and GPUs from AMD should be able to run ROCm even on Debian, Nix, or a distro of the user's choosing.
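
For what it's worth, the usual community workaround on officially unsupported consumer GPUs is to override the reported gfx target before anything initializes the HIP runtime. A sketch (assuming an RDNA2 card that isn't on the support list; this is unsupported and can crash or silently miscompute):

    import os

    # Must be set before torch (or anything else) initializes HIP.
    # "10.3.0" maps to gfx1030; many RDNA2 cards can run gfx1030 kernels.
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

    import torch
    print(torch.cuda.is_available())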

Thankfully some Instinct hardware is super underpriced, so if you're willing to make the effort you can run things really fast for rather cheap :D


HIP/ROCm has been packaged in Debian - all I had to do was `apt install libamdhip64-{5,dev}; pip install torch` and then Stable Diffusion would run, on an amdgpu 6600 XT, with no third-party repos at all.
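
A minimal smoke test for that kind of setup, if anyone wants to verify the GPU is actually doing the work (note: "cuda" is just PyTorch's device name, which maps to HIP on ROCm builds):

    import torch

    # On ROCm builds of PyTorch, "cuda" is the HIP device.
    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(1024, 1024, device=dev)
    b = torch.randn(1024, 1024, device=dev)
    print((a @ b).device)  # should print cuda:0 if the GPU is in use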

It should be fine for AMD's packages to only work on their own blessed distros; that's normal behaviour for upstream. Downstream distros will then repackage it to make it a good citizen in their distro.


Those packages are only in Debian testing, and I think RDNA2 GPUs are the only consumer-level GPUs with reasonably good support in ROCm (and then probably only some of them). AMD also dragged their feet for a long time on supporting RDNA after it launched, and actually managed to break support for the only consumer GPUs that had worked reasonably well with ROCm before that (the RX 570 and similar) quite a while before RDNA support finally arrived.


And God help you if you want to program in actual ROCm rather than using a library like PyTorch or BLAS. There are basically no good manuals or usage guides, and the accompanying tooling is much harder to use than the CUDA equivalents (if they exist at all). On top of that, nobody has any examples outside of major libraries or HPC code (which were largely written by developers at AMD).



Not discounting any of that, but it seems like a similar situation to early versions of CUDA. Those really sucked to install.


I would argue CUDA is still painful when you find yourself in some sort of edge case... The thought of using a stack that is worse gives me the creeps.


> runs on what feels like Ubuntu LTS only

It surely runs in the Cray environment (I assume still based on SuSE) on Frontier. There is ROCm packaging in Fedora/EPEL, provided by an AMD employee, but I don't know anything about it.


Where did you buy the Instinct hardware and what's a good price?


When I tried it, eBay was the place to go, and the price was about 1/10th the Nvidia equivalent when comparing compute power. You pay for it in time instead.


Maybe comparing raw specs, not accounting for software running slower, if at all. You pay for it in time, but also sanity.


Are you saying OpenCL is architecturally slow, or that a particular implementation is slow?

Moritz is able to extract theoretical peak hardware perf out of his OpenCL code.


146B transistors - we are probably only a few years away from 1 trillion in a single design. Pretty awe-inspiring.


If MI300 counts at 146 billion transistors, then I think we can count Cerebras' Wafer Scale Engine 2. It sits at 2.6 trillion transistors.

https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...


You make a good point!


I love the monolith of WSE2, but you are correct, we are probably 5-8 years away from 1T transistors in a single "package" using chiplets and advanced packaging.

Could be way sooner.
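
Back-of-the-envelope, assuming transistor counts per package keep doubling roughly every two years:

    from math import log2

    # Doublings needed to get from 146B to 1T transistors:
    doublings = log2(1e12 / 146e9)  # ~2.8
    print(doublings * 2)            # ~5.5 years at ~2 years per doubling

which lands inside the 5-8 year guess, and chiplets/advanced packaging could pull it in sooner.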


Flash memory chips passed 1 trillion transistors a long time ago.

The first one was a 512GB Samsung flash chip.


Sure, there's a ton of compute chips and gobs of HBM... but honestly the super interesting, weird challenge here, besides making this monster boot, is connecting all the chips.

Will CPU & GPU chiplets talk with each other? Or will all communication travel through glue chips like the IO die? Where does the HBM attach? Is there RAM too? On RDNA3, AMD has 6 small Memory Cache Dies (and a huge Graphics Compute Die in the center)... will we see something similar, with HBM attached to individual GCDs?


I'm going to guess that the 6nm chiplets are analogous to the EPYC IO die, and have the IO, memory controllers, and last level cache. The compute dies are stacked on the IO dies, which are stacked on an interposer. The interposer connects the IO dies to each other and to the HBM. The HBM is not stacked on the chiplets because I don't think any current HBM supports that and AMD considers it a future technology: https://www.techpowerup.com/305060/amd-envisions-stacked-dra...


This is a fun theory! I like the view of the IO dies connecting to each other across the interposer. That interposer is going to be moving a lot of traffic!


HBM requires a pretty expensive and dense silicon interposer, which should be sufficient to connect the IO dies. They're probably just using an Infinity Fabric network between the dies anyway, so it's pretty high-level.


Given that AMD has been making combined CPU+GPUs for over a decade, I think it's safe to assume that the CPU(s) and GPU(s) will have an efficient communication path with shared access to memory.


> According to AMD, MI300 is comprised of 9 5nm chiplets, sitting on top of 4 6nm chiplets. The 5nm chiplets are undoubtedly the compute logic chiplets – i.e. the CPU and GPU chiplets – though a precise breakdown of what’s what is not available. A reasonable guess at this point would be 3 CPU chiplets (8 Zen 4 cores each) paired with possibly 6 GPU chiplets; though there's still some cache chiplets unaccounted for. Meanwhile, taking AMD’s “on top of” statement literally, the 6nm chiplets would then be the base dies all of this sits on top of. Based on AMD’s renders, it looks like there’s 8 HBM3 memory stacks in play, which implies around 5TB/second of memory bandwidth, if not more.

I can't get this to add up when I look at the photo of the chip. Maybe this is the lower layer we have photographed? The photo shows a central cluster of 4 huge chiplets, and then 2+2 big+small chiplets next to them.

At first I was trying to imagine this as the top layer, the 9 5nm chiplets, and just couldn't square it.

Maybe the 2+2 are all cache/HBM, and this is the 4 6nm chiplets we're seeing? I wonder if that means the 9 compute chiplets are buried in the stack, but the words "on top" keep making me think otherwise. Or perhaps this photo isn't a complete chip and is missing the top dies?

I'm still more excited to know about how this all connects than anything else. There's such a huge amount of switching required to connect this all. This feels like such a radical change for Infinity Fabric, such a huge step function.
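
On the quoted ~5TB/second figure, the arithmetic checks out if you assume 1024-bit HBM3 stacks at launch-era pin speeds (rough numbers, not from the article):

    # Each HBM3 stack has a 1024-bit interface; early HBM3 runs ~5.2-6.4 Gbps/pin.
    stacks, bus_bits = 8, 1024
    for gbps_per_pin in (5.2, 6.4):
        tb_per_s = stacks * bus_bits * gbps_per_pin / 8 / 1000
        print(f"{gbps_per_pin} Gbps/pin -> {tb_per_s:.1f} TB/s")
    # ~5.3 TB/s at the low end, ~6.6 TB/s at the high end: "5TB/s, if not more"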


Can we get an APU with 16 or 32GB of HBM to build tiny PCs? That and an SSD and some IO would make an awesome SFF. Of course do this all at low power so air cooling is possible.


An APU with 24 or 32GB would be really awesome; this would be perfect for a game console (new Steam Machine?). The extra memory is very useful as the GPU and CPU would be one. Similarly, the M1/M2 Mac mini could, hardware-wise, also be a great game console. So we are maybe not that far away.


Sadly AMD/Intel iGPUs (which AMD calls APUs) are limited to around 83GB/sec, of which you'd be lucky to see 50GB/sec in practice.

Apple has 100, 200, 400, and 800GB/sec flavors that fit in laptops (100-400GB/sec), the Mac mini (100-200GB/sec), or the Mac Studio (400-800GB/sec). Configurations up to 96GB of RAM are available.

Apple even manages low power, air cooling, and pretty quiet operation. I do wish Arm, AMD, or Intel would do something similar. Maybe a future AMD part will have a better memory bus, like the one they ship in the PS5 and Xbox Series X. Or maybe Nvidia will ship a smaller version of Grace+Hopper that fits in an SFF.
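
The gap really is mostly bus width. Rough peak-bandwidth arithmetic (transfer rates are typical figures, not spec quotes):

    # Peak bandwidth = bus width in bytes x transfer rate.
    def gb_per_s(bus_bits, megatransfers):
        return bus_bits / 8 * megatransfers / 1000

    print(gb_per_s(128, 5200))   # ~83 GB/s: dual-channel DDR5-5200 (typical PC)
    print(gb_per_s(128, 6400))   # ~102 GB/s: M2-class 128-bit LPDDR5
    print(gb_per_s(512, 6400))   # ~410 GB/s: M1 Max-class 512-bit LPDDR5
    print(gb_per_s(256, 14000))  # ~448 GB/s: PS5's 256-bit GDDR6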


That would be really great… if AMD had a working software stack that supported consumer (as opposed to datacenter) APUs. I'm writing this on a machine with a Ryzen APU… and it's pretty much unusable for compute.

And that's not a hardware problem (the hardware specs would be nice enough) but AMD not giving a shit about having a working software stack, especially if it doesn't fall into one of the two domains: "gamers" or "datacenter".


Which APU do you have?


Too expensive.

PC OEMs don't even like moderately sized IGPs, as they largely rejected Intel's Broadwell Iris eDRAM chips, Kaby Lake-G, and Van Gogh (the Steam Deck chip[1]):

[1] https://videocardz.com/newz/amd-mobile-apus-for-2021-2022-de...


I was trying to shop various higher-end Intel iGPUs. Generally the premiums were so high that a low-end Nvidia was much faster AND cheaper. They were somewhat lower power, but it was a pretty narrow space where Intel had the advantage.


The only IGP-heavy gaming device around these days is basically the Steam Deck :(.


Not to mention the Xbox Series X and PS5; both use flavors of AMD APUs/iGPUs. Sad that cheap consoles have better memory systems than any normal AMD/Intel box short of a Xeon, Epyc, or Threadripper.


No mention of SRAM. Does anyone know how much is present in total across all caches?


I want that in my next laptop :) It seems to be an Apple M1/M2 competitor.




