So this is the secret sauce for speeding up AI stuff, right? Is memory bandwidth the thing you run up against when training large language models or diffusion models, or is there other stuff? What would a perfect machine look like for generating the next Stable Diffusion or ChatGPT5? They train on 8x H100 machines with HBM3 right now, is that correct?
Computational fluid dynamics and implicit finite element analysis are still severely memory-bandwidth limited on our 128-core Zen 3 dual-socket servers. More bandwidth is always welcome.
Zen 3, as in Milan? Upgrade to Genoa and you'll see a very nice bump in performance thanks to DDR5 support with 12 memory channels, plus the added AVX-512 instruction set.
The problem is, unless you reorder and optimize your matrices to make the prefetcher happy, you end up throwing away a lot of data, because you don't access memory in the linear fashion the prefetchers assume.
So, having more channels is always nice, but this "more cowbell" approach doesn't bring the performance you expect in some cases.
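A toy loop shows the effect (the sizes and stride here are made up for illustration, nothing to do with any actual solver): both traversals below read the same number of useful doubles, but the strided one touches a fresh cache line per element and defeats the linear-stream prefetcher, so it pulls roughly 8x the bytes from DRAM.

```cpp
// Illustration only: linear vs. cache-line-strided traversal of a large array.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 26;   // ~64M doubles, ~512 MB (made-up size)
    std::vector<double> a(n, 1.0);

    auto time_sum = [&](std::size_t step) {
        double sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        // Cover every element either way; with step > 1 each pass uses only
        // one double out of every 64-byte cache line it drags in.
        for (std::size_t start = 0; start < step; ++start)
            for (std::size_t i = start; i < n; i += step)
                sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("step=%zu sum=%.1f time=%lld ms\n", step, sum, (long long)ms);
    };

    time_sum(1);   // linear: prefetched cache lines are fully used
    time_sum(8);   // strided by one cache line: ~8x the DRAM traffic for the same work
    return 0;
}
```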
More channels (and the fact that each DDR5 DIMM is basically two channels) is a major win that doesn't depend on prefetching.
It is perhaps underappreciated that CPU cores have gotten a lot more "thready" even in the absence of SMT, since each core has hundreds of instructions in flight. You know, kind of like how GPU cores (CUs, SMs) do...
I have a similar simulation code, which can saturate the cores pretty efficiently (it was running at >1.7M integrations/sec/core and scaling linearly on last decade's hardware).
But when you access square submatrices inside a 3000x3000 matrix, the CPU prefetcher brings in whole rows, and you discard most of them to get to another part of the matrix, so that bandwidth is wasted.
Instead, you can rearrange your matrices to make the prefetcher happy and not throw away what it brings in, because you make it fetch what is going to be used next.
When I was doing the last analysis runs, it was running with ~20% cache trashing, meaning I was wasting 20% of the data I pulled in regardless. This is a huge loss, considering I was saturating memory bandwidth before the cores.
Which is an interesting aside, but unless I'm missing something, the amount of wastage due to the prefetcher isn't at all proportional to the number of available memory channels, so it's a non sequitur.
No, the wastage is because I'm not accessing data in a "linear" fashion if I lay out the matrices naively (I'm actually pulling n-by-n square submatrices from the big matrix), and the prefetcher brings in data according to its baked-in locality rules, but its assumptions don't fit my access pattern. So 20 percent of the data is thrown away.
If I change how I fill and lay out my matrices, I can align my data with what the prefetcher assumes, and far less data will be trashed in the process.
I didn't do that because it needed a codebase-wide refactoring, and my code was already 30x faster than the reference implementation with better accuracy, so we didn't pursue that avenue.
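For the curious, the layout change would look roughly like this (a sketch with made-up tile sizes, not the actual BEM code): store the big dense matrix as contiguous b-by-b tiles, so reading a square sub-block is one linear stream the prefetcher can follow instead of a hop across widely separated rows.

```cpp
// Sketch of a tiled (blocked) dense-matrix layout. Illustrative sizes only.
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t N = 3000;  // full matrix dimension
constexpr std::size_t B = 50;    // tile size; B divides N, so 60 tiles per side

// Naive row-major layout: element (r, c) lives at r*N + c, so a B-by-B block
// spans B separate 3000-element rows.
inline double get_rowmajor(const std::vector<double>& m, std::size_t r, std::size_t c) {
    return m[r * N + c];
}

// Tiled layout: the matrix is stored tile by tile, each tile contiguous.
inline double get_tiled(const std::vector<double>& m, std::size_t r, std::size_t c) {
    const std::size_t tiles_per_row = N / B;
    const std::size_t tile_index = (r / B) * tiles_per_row + (c / B);
    const std::size_t offset_in_tile = (r % B) * B + (c % B);
    return m[tile_index * (B * B) + offset_in_tile];
}

// Summing one B-by-B block in the tiled layout is a single contiguous run of
// B*B doubles, i.e., exactly what the hardware prefetcher expects.
double sum_block_tiled(const std::vector<double>& m, std::size_t tile_row, std::size_t tile_col) {
    const std::size_t tiles_per_row = N / B;
    const double* tile = &m[(tile_row * tiles_per_row + tile_col) * (B * B)];
    double s = 0.0;
    for (std::size_t i = 0; i < B * B; ++i) s += tile[i];
    return s;
}

int main() {
    std::vector<double> tiled(N * N, 1.0);                       // all-ones matrix, tiled storage
    std::printf("block sum = %f\n", sum_block_tiled(tiled, 10, 20));  // expect B*B = 2500
    std::printf("element (123,456) = %f\n", get_tiled(tiled, 123, 456));
    return 0;
}
```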
My matrices are dense. I'm working with boundary elements. But there are libraries optimized for efficient storage of and access to sparse matrices (e.g., Eigen has both sparse and dense sub-libraries). Also, there's a whole literature concerned with sparse matrices and operations on them.
However, my knowledge is limited on the sparse side. If you want, I can try to dig into it a bit.
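For reference, the dense/sparse split in Eigen looks roughly like this (a minimal sketch with made-up sizes and values, not tied to any particular BEM/FEM code):

```cpp
// Minimal sketch of Eigen's dense vs. sparse sub-libraries. Illustration only.
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <iostream>
#include <vector>

int main() {
    // Dense: every entry is stored, as you would want for BEM-style dense blocks.
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(4, 4);
    Eigen::VectorXd x = Eigen::VectorXd::Ones(4);
    Eigen::VectorXd y_dense = A * x;
    std::cout << "dense A*x:\n" << y_dense << "\n";

    // Sparse: only nonzeros are stored (compressed storage), built from
    // (row, col, value) triplets.
    std::vector<Eigen::Triplet<double>> entries = {
        {0, 0, 2.0}, {1, 1, 3.0}, {2, 2, 4.0}, {3, 0, -1.0}};
    Eigen::SparseMatrix<double> S(4, 4);
    S.setFromTriplets(entries.begin(), entries.end());
    Eigen::VectorXd y_sparse = S * x;
    std::cout << "sparse S*x:\n" << y_sparse << "\n";
    return 0;
}
```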
I guess money is the rate-determining step here. A secondary issue is that cores are also getting faster and then need more bandwidth. The solution is a low core count, maybe 2-3 cores per channel, more servers, and a serious interconnect, but the cost…
I've heard that running CFD on the CPU is still quite popular and prevalent, despite GPU implementations being widely available. What is it about the algorithm that makes GPUs less suitable? Model sizes? Branching? Data (non-)locality?
> What would a perfect machine look like for generating the next Stable Diffusion or ChatGPT5?
A recent Microsoft paper proposed that the "perfect" inference machine is a ton of large-ish, modestly clocked, SRAM-heavy chips piped together on a motherboard.
There is no RAM. The model weights are split up between the chips, and operations are pipelined like they already are in LLMs that are split into "layers". And there is no expensive interposer or crazy interconnect like Cerebras uses. The chips are just wired together on a cheap motherboard, as the bandwidth between layers is relatively modest.
...It seems like this would work for training too. But what do I know? If it doesn't, the next "ideal" architecture is probably Cerebras's wafer-scale machines. And I am thinking they make use of TSMC's TSV tech (like AMD's X3D chips) to cram more SRAM onto a single wafer.
There's also some fuss asserting that doing a bunch of matrix multiplication is an inefficient way to emulate physical neural networks, so maybe future architectures will be radically different?
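As a toy model of the pipelining idea (the chip count, layer sizes, and names here are all made up for illustration, not taken from the paper): each "chip" keeps one layer's weights in its own local, SRAM-like storage, and the only traffic between chips is the small activation vector handed to the next stage.

```cpp
// Toy model of layer-pipelined inference across SRAM-resident "chips".
#include <cstddef>
#include <cstdio>
#include <vector>

struct Chip {
    // One layer's weight matrix, resident on this chip; it never moves off-chip.
    std::vector<std::vector<float>> weights;  // [out][in]

    // Receive activations from the previous chip, produce activations for the next.
    std::vector<float> forward(const std::vector<float>& in) const {
        std::vector<float> out(weights.size(), 0.0f);
        for (std::size_t o = 0; o < weights.size(); ++o) {
            for (std::size_t i = 0; i < in.size(); ++i) out[o] += weights[o][i] * in[i];
            if (out[o] < 0.0f) out[o] = 0.0f;  // ReLU, just to have a nonlinearity
        }
        return out;
    }
};

int main() {
    const std::size_t width = 8;           // tiny layer width for the toy model
    std::vector<Chip> board(4);            // four "chips" wired in a chain
    for (auto& chip : board)
        chip.weights.assign(width, std::vector<float>(width, 0.1f));

    std::vector<float> activations(width, 1.0f);
    for (const auto& chip : board)         // activations flow chip -> chip -> chip
        activations = chip.forward(activations);

    std::printf("output[0] = %f\n", activations[0]);
    return 0;
}
```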
I've always thought it logical to integrate RAM & CPU.
Look over the course of history: there's often a roughly fixed ratio between CPU performance & RAM bandwidth (+ size), and CPU/memory <-> graphics interconnect speed. Let any of those factors get too far out of whack, and you've got a machine with poor bang/$.
GPU integration is old now. OK, I've read there's a mismatch between the silicon processes used for logic (like CPU/GPU, etc.) and memory? But hey, this is 2023. Chiplets? Stacked dies? Or just do the best with available tech? It's not like there are no ICs that integrate RAM & compute.
With the right architecture, the "memory wall" could be (mostly) a thing of the past. With enough cores & on-chip RAM bandwidth, maybe specialized GPU processing wouldn't even be needed anymore? For example:
Who knows? IBM Research just released info on their updated compute-in-memory chip [0]. Intriguing, but it doesn't provide a ton of insight into whether or not it will be commercially viable any time soon.
Computing in the actual storage banks seems a bit over-the-top to me, but I've always wondered about computing in the "currently open row". It seems like an obvious optimization candidate for fast memset() and memcpy(): set that row from a source other than the storage bank, selectively load or store it, and shift its contents around.
Two papers/presentations from recent weeks on:
doing just that (LUT at the row buffer) https://youtu.be/9t1FJQ6nNw4;
doing parallel sense in NAND for computing AND between rows (which they call pages, but in DRAM they're called rows. It's the same concept, though.) https://youtu.be/QWL2Rw_VhHg
> For this CAS latency, HBM (Observation 6) offers a significant advantage over regular DRAMs. This is because HBM has tCCD = 1 or 2 compared to tCCD >= 4 for DRAM. Accordingly, HBM can reduce the CAS latency component to at least half of its DRAM value.
DRAM itself hasn't really changed much for decades, and the length of the CAS latency is determined by physics, so it's probably comparable to any other DRAM. Once you have the row open, though, you can burst it out almost instantly over the very wide bus HBM gives you.
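Some back-of-the-envelope numbers for that last point, assuming generic HBM2-class figures (a 1024-bit interface per stack at roughly 2 Gbit/s per pin; ballpark values, not taken from the quoted paper):

```cpp
// Ballpark arithmetic for "burst it out almost instantly over the very wide bus".
// Assumed HBM2-class figures; illustration only.
#include <cstdio>

int main() {
    const double pins = 1024.0;                               // interface width per stack, bits
    const double bits_per_pin_per_sec = 2.0e9;                // ~2 Gbit/s per pin
    const double bytes_per_sec = pins * bits_per_pin_per_sec / 8.0;  // ~256 GB/s per stack

    const double row_bytes = 1024.0;                          // assume a 1 KB open row
    const double burst_ns = row_bytes / bytes_per_sec * 1e9;

    std::printf("per-stack bandwidth: ~%.0f GB/s\n", bytes_per_sec / 1e9);
    std::printf("time to stream a %.0f-byte open row: ~%.1f ns\n", row_bytes, burst_ns);
    // ~4 ns to drain the row, versus tens of ns of tRCD + CAS just to open it,
    // which is the physics-bound part that HBM cannot make disappear.
    return 0;
}
```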