So this is the secret sauce for speeding up AI stuff, right? Is memory bandwidth the thing you run up against when training large language models or diffusion models, or is there other stuff? What would a perfect machine look like for generating the next Stable Diffusion or ChatGPT5? They train on 8x H100 machines with HBM3 right now, is that correct?
Computational fluid dynamics and implicit finite element analysis are still severely memory-bandwidth limited on our 128-core Zen 3 dual-socket servers. More bandwidth is always welcome.
Zen 3, as in Milan? Upgrade to Genoa and you'll see a very nice bump in performance thanks to DDR5 support with 12 memory channels, plus the added AVX-512 instruction set.
The problem is, unless you reorder and optimize your matrices to make the prefetcher happy, you end up throwing away a lot of data, because you don't access memory in the linear fashion the prefetchers assume.
So, having more channels is always nice, but this "more cowbell" approach doesn't bring the performance you expect in some cases.
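A toy loop shows the effect (the sizes and stride here are made up for illustration, nothing to do with any actual solver): both traversals below read the same number of useful doubles, but the strided one touches a fresh cache line per element and defeats the linear-stream prefetcher, so it pulls roughly 8x the bytes from DRAM.

```cpp
// Illustration only: linear vs. cache-line-strided traversal of a large array.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 26;   // ~64M doubles, ~512 MB (made-up size)
    std::vector<double> a(n, 1.0);

    auto time_sum = [&](std::size_t step) {
        double sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        // Cover every element either way; with step > 1 each pass uses only
        // one double out of every 64-byte cache line it drags in.
        for (std::size_t start = 0; start < step; ++start)
            for (std::size_t i = start; i < n; i += step)
                sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("step=%zu sum=%.1f time=%lld ms\n", step, sum, (long long)ms);
    };

    time_sum(1);   // linear: prefetched cache lines are fully used
    time_sum(8);   // strided by one cache line: ~8x the DRAM traffic for the same work
    return 0;
}
```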
More channels (and the fact that each DDR5 DIMM is basically two channels) is a major win that doesn't depend on prefetching.
It is perhaps underappreciated that CPU cores have gotten a lot more "thready" even in the absence of SMT, since each core has hundreds of instructions in flight. You know, kind of like how GPU cores (CUs, SMs) do...
I have a similar simulation code, which can saturate the cores pretty efficiently (it was running at >1.7M integrations/sec/core and scaling linearly on last decade's hardware).
But when you access square submatrices inside a 3000x3000 matrix, the CPU prefetcher brings in whole rows, and you discard most of them to get to another part of the matrix, so that bandwidth is wasted.
Instead, you can rearrange your matrices to make the prefetcher happy and not throw away what it brings in, because you make it fetch what is going to be used next.
When I was doing the last analysis runs, it was running with ~20% cache trashing, meaning I was wasting 20% of the data I pulled in regardless. This is a huge loss, considering I was saturating memory bandwidth before the cores.
Which is an interesting aside, but unless I'm missing something, the amount of wastage due to the prefetcher isn't at all proportional to the number of available memory channels, so it's a non sequitur.
No, the wastage is because I'm not accessing data in a "linear" fashion if I lay out the matrices naively (I'm actually pulling n-by-n square submatrices from the big matrix), and the prefetcher brings in data according to its baked-in locality rules, but its assumptions don't fit my access pattern. So 20 percent of the data is thrown away.
If I change how I fill and lay out my matrices, I can align my data with what the prefetcher assumes, and far less data will be trashed in the process.
I didn't do that because it needed a codebase-wide refactoring, and my code was already 30x faster than the reference implementation with better accuracy, so we didn't pursue that avenue.
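For the curious, the layout change would look roughly like this (a sketch with made-up tile sizes, not the actual BEM code): store the big dense matrix as contiguous b-by-b tiles, so reading a square sub-block is one linear stream the prefetcher can follow instead of a hop across widely separated rows.

```cpp
// Sketch of a tiled (blocked) dense-matrix layout. Illustrative sizes only.
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t N = 3000;  // full matrix dimension
constexpr std::size_t B = 50;    // tile size; B divides N, so 60 tiles per side

// Naive row-major layout: element (r, c) lives at r*N + c, so a B-by-B block
// spans B separate 3000-element rows.
inline double get_rowmajor(const std::vector<double>& m, std::size_t r, std::size_t c) {
    return m[r * N + c];
}

// Tiled layout: the matrix is stored tile by tile, each tile contiguous.
inline double get_tiled(const std::vector<double>& m, std::size_t r, std::size_t c) {
    const std::size_t tiles_per_row = N / B;
    const std::size_t tile_index = (r / B) * tiles_per_row + (c / B);
    const std::size_t offset_in_tile = (r % B) * B + (c % B);
    return m[tile_index * (B * B) + offset_in_tile];
}

// Summing one B-by-B block in the tiled layout is a single contiguous run of
// B*B doubles, i.e., exactly what the hardware prefetcher expects.
double sum_block_tiled(const std::vector<double>& m, std::size_t tile_row, std::size_t tile_col) {
    const std::size_t tiles_per_row = N / B;
    const double* tile = &m[(tile_row * tiles_per_row + tile_col) * (B * B)];
    double s = 0.0;
    for (std::size_t i = 0; i < B * B; ++i) s += tile[i];
    return s;
}

int main() {
    std::vector<double> tiled(N * N, 1.0);                       // all-ones matrix, tiled storage
    std::printf("block sum = %f\n", sum_block_tiled(tiled, 10, 20));  // expect B*B = 2500
    std::printf("element (123,456) = %f\n", get_tiled(tiled, 123, 456));
    return 0;
}
```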
My matrices are dense. I'm working with boundary elements. But there are libraries optimized for efficient storage of and access to sparse matrices (e.g., Eigen has both sparse and dense sub-libraries). Also, there's a whole literature concerned with sparse matrices and operations on them.
However, my knowledge is limited on the sparse side. If you want, I can try to dig into it a bit.
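For reference, the dense/sparse split in Eigen looks roughly like this (a minimal sketch with made-up sizes and values, not tied to any particular BEM/FEM code):

```cpp
// Minimal sketch of Eigen's dense vs. sparse sub-libraries. Illustration only.
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <iostream>
#include <vector>

int main() {
    // Dense: every entry is stored, as you would want for BEM-style dense blocks.
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(4, 4);
    Eigen::VectorXd x = Eigen::VectorXd::Ones(4);
    Eigen::VectorXd y_dense = A * x;
    std::cout << "dense A*x:\n" << y_dense << "\n";

    // Sparse: only nonzeros are stored (compressed storage), built from
    // (row, col, value) triplets.
    std::vector<Eigen::Triplet<double>> entries = {
        {0, 0, 2.0}, {1, 1, 3.0}, {2, 2, 4.0}, {3, 0, -1.0}};
    Eigen::SparseMatrix<double> S(4, 4);
    S.setFromTriplets(entries.begin(), entries.end());
    Eigen::VectorXd y_sparse = S * x;
    std::cout << "sparse S*x:\n" << y_sparse << "\n";
    return 0;
}
```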
I guess money is the rate-determining step here. A secondary issue is that cores are also getting faster and then need more bandwidth. The solution is a low core count, maybe 2-3 cores per channel, more servers, and a serious interconnect, but the cost…
I've heard that running CFD on the CPU is still quite popular and prevalent, despite GPU implementations being widely available. What is it about the algorithm that makes GPUs less suitable? Model sizes? Branching? Data (non-)locality?
> What would a perfect machine look like for generating the next Stable Diffusion or ChatGPT5?
A recent Microsoft paper proposed that the "perfect" inference machine is a ton of large-ish, modestly clocked, SRAM-heavy chips piped together on a motherboard.
There is no RAM. The model weights are split up between the chips, and operations are pipelined like they already are in LLMs that are split into "layers". And there is no expensive interposer or crazy interconnect like Cerebras uses. The chips are just wired together on a cheap motherboard, as the bandwidth between layers is relatively modest.
...It seems like this would work for training too. But what do I know? If it doesn't, the next "ideal" architecture is probably Cerebras's wafer-scale machines. And I am thinking they make use of TSMC's TSV tech (like AMD's X3D chips) to cram more SRAM onto a single wafer.
There's also some fuss asserting that doing a bunch of matrix multiplication is an inefficient way to emulate physical neural networks, so maybe future architectures will be radically different?
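As a toy model of the pipelining idea (the chip count, layer sizes, and names here are all made up for illustration, not taken from the paper): each "chip" keeps one layer's weights in its own local, SRAM-like storage, and the only traffic between chips is the small activation vector handed to the next stage.

```cpp
// Toy model of layer-pipelined inference across SRAM-resident "chips".
#include <cstddef>
#include <cstdio>
#include <vector>

struct Chip {
    // One layer's weight matrix, resident on this chip; it never moves off-chip.
    std::vector<std::vector<float>> weights;  // [out][in]

    // Receive activations from the previous chip, produce activations for the next.
    std::vector<float> forward(const std::vector<float>& in) const {
        std::vector<float> out(weights.size(), 0.0f);
        for (std::size_t o = 0; o < weights.size(); ++o) {
            for (std::size_t i = 0; i < in.size(); ++i) out[o] += weights[o][i] * in[i];
            if (out[o] < 0.0f) out[o] = 0.0f;  // ReLU, just to have a nonlinearity
        }
        return out;
    }
};

int main() {
    const std::size_t width = 8;           // tiny layer width for the toy model
    std::vector<Chip> board(4);            // four "chips" wired in a chain
    for (auto& chip : board)
        chip.weights.assign(width, std::vector<float>(width, 0.1f));

    std::vector<float> activations(width, 1.0f);
    for (const auto& chip : board)         // activations flow chip -> chip -> chip
        activations = chip.forward(activations);

    std::printf("output[0] = %f\n", activations[0]);
    return 0;
}
```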
I've always thought it logical to integrate RAM & CPU.
Look over the course of history: there's often a roughly fixed ratio between CPU performance & RAM bandwidth (+ size), and CPU/memory <-> graphics interconnect speed. Let any of those factors get too far out of whack, and you've got a machine with poor bang/$.
GPU integration is old now. OK, I've read there's a mismatch between the silicon processes used for logic (like CPU/GPU, etc.) and memory? But hey, this is 2023. Chiplets? Stacked dies? Or just do the best with available tech? It's not like there are no ICs that integrate RAM & compute.
With the right architecture, the "memory wall" could be (mostly) a thing of the past. With enough cores & on-chip RAM bandwidth, maybe specialized GPU processing wouldn't even be needed anymore? For example:
Who knows? IBM Research just released info on their updated compute-in-memory chip [0]. Intriguing, but it doesn't provide a ton of insight into whether or not it will be commercially viable any time soon.
Computing in the actual storage banks seems a bit over-the-top to me, but I've always wondered about computing in the "currently open row". It seems like an obvious optimization candidate for fast memset() and memcpy(): set that row from a source other than the storage bank, selectively load or store it, and shift its contents around.
Two papers/presentations from recent weeks on:
doing just that (LUT at the row buffer) https://youtu.be/9t1FJQ6nNw4;
doing parallel sense in NAND for computing AND between rows (which they call pages, but in DRAM they're called rows. It's the same concept, though.) https://youtu.be/QWL2Rw_VhHg
> For this CAS latency, HBM (Observation 6) offers a significant advantage over regular DRAMs. This is because HBM has tCCD = 1 or 2 compared to tCCD >= 4 for DRAM. Accordingly, HBM can reduce the CAS latency component to at least half of its DRAM value.
DRAM itself hasn't really changed much for decades, and the length of the CAS latency is determined by physics, so it's probably comparable to any other DRAM. Once you have the row open, though, you can burst it out almost instantly over the very wide bus HBM gives you.
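Some back-of-the-envelope numbers for that last point, assuming generic HBM2-class figures (a 1024-bit interface per stack at roughly 2 Gbit/s per pin; ballpark values, not taken from the quoted paper):

```cpp
// Ballpark arithmetic for "burst it out almost instantly over the very wide bus".
// Assumed HBM2-class figures; illustration only.
#include <cstdio>

int main() {
    const double pins = 1024.0;                               // interface width per stack, bits
    const double bits_per_pin_per_sec = 2.0e9;                // ~2 Gbit/s per pin
    const double bytes_per_sec = pins * bits_per_pin_per_sec / 8.0;  // ~256 GB/s per stack

    const double row_bytes = 1024.0;                          // assume a 1 KB open row
    const double burst_ns = row_bytes / bytes_per_sec * 1e9;

    std::printf("per-stack bandwidth: ~%.0f GB/s\n", bytes_per_sec / 1e9);
    std::printf("time to stream a %.0f-byte open row: ~%.1f ns\n", row_bytes, burst_ns);
    // ~4 ns to drain the row, versus tens of ns of tRCD + CAS just to open it,
    // which is the physics-bound part that HBM cannot make disappear.
    return 0;
}
```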