JEDEC publishes GDDR7 graphics memory standard (jedec.org)
152 points by ksec 11 months ago | 105 comments



One interesting thing to note is that all the high speed interconnect standards (GDDR, PCIE, USB, Ethernet) are moving (or have already moved) to RF-style non-binary communication (as opposed to bumping up clock speeds or increasing pin count). I wonder what the next steps will be for interconnects - full-blown baseband style transceivers with QAM perhaps?


Generally the limiting factor for approaching the asymptote you observe is power. There is a figure of merit with units of joules per bit that gets worse - much worse - as you start to get fancier with your modulation.

Most of the fancier schemes take advantage of the fact that traditional binary signalling has excess noise margin, i.e., it was throwing away energy to start with, and they encode extra bits within that energy budget. But to maintain noise margins as you cram more bits in, you have to increase the energy per bit.

The other half is that the physical layer implementation that does the encoding and decoding consumes more energy because it's doing more computation. This also figures into the energy-per-bit metric if you're being honest about your comparisons (and because it is not always clear whether this is included in a given metric, you find papers where people cherry-pick numbers to make their case). This number can become quite big, because the baseline of a binary tx/rx is so low compared to effectively running a DAC/ADC and phase recovery system.
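A crude illustration of that bookkeeping (a Python sketch; every number below is made up purely to show the shape of the comparison, not a figure for any real PHY):

    def energy_per_bit_pj(driver_power_w, codec_power_w, data_rate_gbps):
        """Total pJ/bit, counting both the line drivers and the
        encode/decode (DAC/ADC, equalization, clock recovery) power."""
        return (driver_power_w + codec_power_w) / (data_rate_gbps * 1e9) * 1e12

    # Plain binary (NRZ) link: cheap PHY, modest rate (made-up numbers)
    print(energy_per_bit_pj(0.5, 0.2, 25))   # 28 pJ/bit
    # Multi-level link: double the rate, but the codec power grows faster
    print(energy_per_bit_pj(0.8, 1.5, 50))   # 46 pJ/bit

The second link moves twice the bits but spends more energy on each one, which is exactly the trade the fancier modulations have to justify.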

What you find is that QAM or more elaborate schemes are certainly possible, but they can consume more power than the CPU just to keep the link idle and trained. The real art is picking the implementation and developing new circuit tricks we hadn't thought of before to wring out a little more bandwidth without killing the power budget.


Would power actually be an issue for the 200 W+ GPUs these memories will likely be used with?


Power on-die is always an issue because many chips end up limited by thermal dissipation.


This is probably a stupid question but I don’t claim any knowledge here. Even though interconnect standards are moving to non-binary communication, doesn’t the signal eventually need to be converted back to a bitstream at its destination? Does this just push the bottleneck around, or do I just not understand the problem being solved? It’s almost certainly the latter and I’d love to understand more.


The important distinction is between on-silicon interconnects and off-silicon. On silicon, the interconnect density is much higher, so it's feasible to get high bandwidth by having lots of "channels" (wide busses). This becomes less desirable off chip (on the PCB) for several reasons:

1) The interconnect density is much lower, so such a large bus becomes physically large.

2) The bus is larger and less precise, which makes maintaining skew between channels more and more challenging.

3) The parasitic capacitance/inductance (i.e. energy storage) of the bus grows as the bus gets physically larger, meaning each channel needs a relatively large driver circuit (which costs expensive silicon area) and dissipates more power to drive correctly, and even more power if the speed is increased.

Increasing the symbol complexity of each channel does more than just move the bottleneck around, because it allows fewer chip-to-chip interconnects to carry more data.
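A rough sketch of that trade-off (Python; the 256 Gbps target and 16 Gbaud per-pin symbol rate are illustrative numbers, not from any standard):

    import math

    def pins_needed(target_gbps, symbol_rate_gbaud, levels):
        """Pins needed to hit a target data rate, given a per-pin symbol
        rate and the number of signalling levels (PAM-N)."""
        bits_per_symbol = math.log2(levels)   # PAM-2 -> 1, PAM-3 -> ~1.58, PAM-4 -> 2
        return math.ceil(target_gbps / (symbol_rate_gbaud * bits_per_symbol))

    for levels in (2, 3, 4):
        print(f"PAM-{levels}: {pins_needed(256, 16, levels)} pins")

Same symbol rate, same target bandwidth, fewer pins as the symbols get richer.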

I don't work in this regime, but as a layman I'm not convinced using full QAM for on-board chip to chip interconnects makes sense. One major advantage you natively have over the RF case is you can be easily coherent (shared clock). Throwing this away to do carrier recovery introduces a lot of complexity and potentially reduces the available bandwidth. Assuming you transmit without a carrier, can you have "baseband" QAM without a separate I and a Q signal? If you transmit an I and Q signal separately, does that not just become the same thing as two PAM-32 signals?


There are a few drawbacks to transmitting QAM as two separate signals:

- one needs 2 signals instead of one (2x total bandwidth)

- it requires each channel's bandwidth to extend down to DC, which has many other challenges

If one modulates the signal to shift it away from DC, the "negative/mirror" frequencies also shift, which means the bandwidth has now doubled.

A QAM signal still has double the bandwidth of an equivalent PAM one but pays for it by encoding two PAM signals.

Of course, Discrete Multitone Modulation puts QAM to shame for non-flat channels, as it can adapt to them near-perfectly. Not likely to happen for high speed interconnects in our lifetime. I suspect photonics will happen first.


> On silicon, the interconnect density is much smaller

Did you mean higher?


Yes, sorry: the pitch is smaller, the density is higher.


A serializer/deserializer (serdes) is used to convert between high-speed serial I/O outside the chip (e.g. 100 Gbps) and lower-clocked parallel signals inside the chip (e.g. 64 bits at 1.5 GHz). Using serial protocols reduces the cost and thickness of cables while parallel wires are cheaper inside the chip.
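A toy illustration of that rate bookkeeping (Python; the 1.5625 GHz figure is chosen only so the example lands on exactly 100 Gbps, and a real link adds line-coding overhead such as 64b/66b on top):

    def serial_line_rate_gbps(parallel_width_bits, parallel_clock_ghz):
        """Serial line rate implied by a parallel on-chip interface,
        ignoring line-coding overhead."""
        return parallel_width_bits * parallel_clock_ghz

    # 64-bit parallel bus at ~1.5625 GHz on-chip <-> ~100 Gbps serial on the wire
    print(serial_line_rate_gbps(64, 1.5625))   # 100.0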


Not a stupid question - you can think of the problem by analogy with RF engineering. You have very high performance digital logic and precise clocks on the chip that you can use to encode/decode (convolve/deconvolve) bits into waveform signals and time those signals before they leave the chip at minimal latency/power expense. Once the bits are off the chip, you have no such resource and are dealing with all kinds of impedance and noise issues, which is why there are separate circuits/logic dedicated to training and calibration of the encoding parameters of the signals sent over the wire in DRAM chips.

This more complex encoding scheme is just the next level in that process, indeed moving it closer to techniques used in RF engineering.


You’re right that the signals have to be converted back into bits at the destination. Basically, this solves the problem of pumping those bits at high speed across traces on a circuit board vs within a chip. The longer a signal has to go, the harder it is to maintain its integrity.


10GBASE-T Ethernet uses 128-DSQ modulation (LDPC coded). Each symbol (one of 128) is sent over 2 cycles.

These complex modulations have a huge drawback though: latency. You need to pack, randomize, encode and modulate at the transceiver, and undo all of that plus equalization at the receiver. Feed-forward equalization especially is a huge latency source.


Some numbers here: https://wavesplitter.com/technical-column/latency-and-power-...

> Cable latency is about 5ns/m, so 2.6 µs is equivalent to latency of a 520m cable. This latency might become a big issue for the HPC-based applications.


10GBASE-T is rated for up to 100m of cable. I don't think these numbers include any cable latency; it's just the PHY latency itself.


Sharp signal edges are hard to get at higher frequencies - it seems quite natural


Optical communication already uses QAM, so that's probably the next step.


> I wonder what the next steps will be for interconnects

PCB manufacturers start offering FR-4 with strands of single-mode fiber embedded in it?


At the physical layer, digital/binary doesn't exist. It's all signals.


Technically, it's the opposite - all measurable physical things are only measurably digital (i.e. measurable in discrete units). But for all practical purposes, that threshold is low enough that it doesn't really matter for current signal transport standards.


Doubling the channels from GDDR6 sounds good; the speed of light isn't changing, so at least we can handle more parallelism with the same latency.


The ratio of light speed to the area of the universe is so stupidly small that I'm convinced our simulation is determining how low the speed of light can get before interstellar travel is outright impossible.


With sufficiently advanced technology, travelling between stars will be more of a transfer of your consciousness using interstellar WiFi (remember to set TCP_NODELAY!) rather than transporting slow and heavy atoms. You just get transferred from one biological substrate to another. None of the time dilation, all of the space exploration.


>will be more of a transfer of your consciousness

Sounds like a copy, not a transfer. If you didn't physically transport the atoms, the matter, you would end up with two duplicates living in different places and times, and with different ways of thinking after the copy, as their lived experiences diverge from that moment.

That is, unless you exterminate the original with each copy. It should also be considered that each copy may lose information and degrade (signal integrity through distance, number of travels, and so on).


> unless you exterminate the original with each copy

The problem of synchronization is gonna be particularly nasty in this case.


The Two Generals Problem suddenly has very personal consequences.


sure, but I don't even know how to differentiate between copying my brain and being in a coma, or just falling asleep really deeply. waking up sometimes feels like enough of a discontinuity


Idk I'd say that the clone theory also applies to our every waking moment, each fraction of a second we're no longer the same person, even in the same body.

And most certainly after sleep.


How do you ensure continuity from frame of reference (observer consciousness?)


I don't know if there is true continuity. As we age our memories become memories of memories. Just as we don't see the nose between our eyes I think our brains turn a stuttered disjointed consciousness into what seems like a continuous stream.

It's like walking into a room and forgetting why you walked in there.


> all of the space exploration.

How does the receiving technology get built? Surely at least someone will have to go there the first time, and they will have to take the long way. It will still be quite a problem to get to a system 10k light years away.


You just have to first hack (or maybe even just ask nicely) another suitable species (or their technological artifacts) wherever you want to go, and have them create the biological substrate and download/upload mechanism on their end. This limits travel to already inhabited corners of the universe, but that's better than nothing I suppose.

The tricky thing is that hacking is usually an iterative process, and these iterations are going to be an extreme exercise in patience.

Actually, another tricky thing: how do you know that the other end is actually cooperating? If the aliens are dicks they could give you the thumbs up while having zero intention to reconstitute your consciousness. If you wanted to round-trip some brave soul as a means of verifying everything works, they could just send one of their own minds back instead, just for the fun of wreaking havoc.


> The tricky thing is that hacking is usually an iterative process, and these iterations are going to be an extreme exercise in patience.

No kidding! On the first try you accidentally end up causing a revolution because the targets/specimens ended up learning about the scientific method, gunpowder, and other dangerous things instead of just getting a proper advanced consciousness installed. So now all you can do is try to shape said species' technological progress towards building the correct technology that you can hijack for your own purposes when ready.

"Just be patient"


> You just have to first hack … another suitable species (or their technological artifacts) wherever you want to go, and have them create the biological substrate and download/upload mechanism on their end.

I presume humans are the result of such a hack a few billion years ago.

Interstellar travel requires patience, at least to get beyond the initial latency.


My inner sci-fi geek tells me that by this time, we discover faster than light travel, only it isn't compatible with life as we know it.

So we ship off these receivers to circumvent that limitation. Instead of travelling ourselves, we can send off our consciousness to inhabit a human-life analog to explore.

What that does to your psyche, and your body in limbo, are probably good material for a story, if it hasn't already been written.


My inner geek tells me it's more likely humans will plug themselves into the matrix because it'll be far more receptive to technological advances than actual exploration.

At best, you'll throw a bunch of nanoprobes everywhere to get new entropy into the system.


There was a bad sci fi movie that sort of incorporated this, The Beyond (https://m.imdb.com/title/tt5723416/)


I think Altered Carbon had these disposable sleeves one could rent to attend a remote meeting.


Is YC accepting applications for interstellar body rental stations like Hertz is for cars? I'd bootstrap it, but I think this requires venture scale funding.


Well, Hertz is selling off their whole Extraterrestrial Vehicle (EV) fleet, so it’s probably not profitable enough for VCs.


In elementary school I read a kids' novel, called My Trip to Alpha I, that had precisely this as a McGuffin. The main character travels to visit his aunt and uncle by "Voya-Code", in which his consciousness is transmitted to an android body on the destination planet.


When stuff like this is brought up I'm always reminded of the Stephen King short story The Jaunt[0]

0. https://en.m.wikipedia.org/wiki/The_Jaunt


Who says we’re not doing that already and just calling it dreams?


What have you been eating before sleep? My dreams happen on Earth!


Why is it Earth and not an Earth-like planet populated by other people who are also travelling when they sleep?

What if there were 100 billion of us in total?


Well how do you know you're on earth when you wake up? What if you just pick a random earth-like planet each time, and don't realise it's a new life?


How do you know it is you who wakes up each morning and not just a new life with your memories and feelings who can't tell the difference?

Rip yesterday me.


Good point. I agree


kinda easy to disprove, just think about it for a sec.


But that would still be limited by the speed of light, right?


Of course, everything is. Doubling the speed of light means your network packets get there twice as fast, but accelerating matter to relativistic speeds, which is also limited by the speed of light, gets less marginal utility from that doubling once you account for the energy needed for acceleration/deceleration and for time dilation.


You're thinking of latency vs bandwidth/throughput. You might not improve on the latency part (speed of light), but you can increase the bandwidth (amount of data transferred per unit of time) just like a highway with more lanes can carry more people without increasing the individual speed of cars. You might even decrease car speed and still get an improved throughput overall.


Maybe quantum entanglement, where the original "portals" would be set up around the universe at the speed of light, but then data could henceforth be transferred between the portals at the speed of entanglement.


Yeah, but you still need to get some bodies out there.


I’m of the view that you can possibly duplicate consciousness but you can never send “me”. I’m stuck on the consciousness I’ve got. If you tried to upload my consciousness somewhere I’d still be sitting here like “hey look there’s another one of me”, but I’d not experience some shift in perspective myself.


No, you are your consciousness. The self exists only in the story the mind tells itself, so both versions would think they are the original you.

Besides, the serialisation process is a form of quantum measurement. Depending on how coarse-grained it is, there might be no way to take a snapshot without modifying you (maybe the measurement process turns the original brain matter into soup).


They would think they are the original you, but I think the GP was saying that the original perspective would continue on the original consciousness-continuity/body/hardware.

Cloning a hard drive can produce the same data, but without any networking, there's no reason for the original machine to know anything from the perspective of the new one


I hope it will be possible to digitize my brain one neuron at a time, preserving the continuity of my consciousness.


So by sleeping and breaking your consciousness stream, you die, and a new consciousness wakes up thinking it is you. With your memories and physical makeup.


Need to Ship of Theseus your neurons with digital equivalents. Every cell in your body is replaced within seven years, so can be a gradual digitalization.


Exactly what I was thinking. Note however that most neurons in your brain never get replaced naturally. An interesting extension of this idea is to replace one biological neuron with two digital copies, doing the same thing in parallel. Eventually two brain copies can be created with shared or duplicated consciousness.


I find this a scary topic, like touching a hot stove. Try as hard as I can but I can't figure out (and overall nobody has so far) how the "self" experience works.


This is one of fundamental plot elements of the Altered Carbon novels.


TCP would be fun. "Oh I missed this packet" there goes another thousand years.


Or is it the ratio of your lifespan to the age of the universe? The universe is only about 3x bigger instantaneously than it is when traversed at lightspeed. The ratio of the age of the universe to your expected lifespan is about 8 orders of magnitude.


That only matters if we're the most special species in the universe. I'm not ready to make that leap of faith.

On the other hand, there exist two spots in the universe so far separated from each other that observers in both spots will never be able to see, affect, or pass information between each other, simply because they're too far apart and the speed of light is too slow. That's silly.


"speed of light can get before interstellar travel is outright impossible"

Time is relative to your frame of reference. If you travel to Proxima Centauri at 99.99% c, it will take you 22 days from your point of reference sitting in the spaceship, which is quite acceptable. On Earth, 4.24 years have passed. So, your family and friends grow old quite fast when you do interstellar travel without them; hence, it's better to take them with you.


"Ratio of light speed to the area of the universe" does not determine how far you can travel in a set amount of time, because time dilation exists.


>the speed of light isn't changing

Oh but the speed of the signal does depend quite a lot on the transmission medium. In Cat-6, signals travel at about 2/3 c. I can't find a quick reference for on-die or motherboard interconnects. If you had optical interconnects travelling through vacuum in a silicon chip, that's a full 50% faster (as in lower travel time for one bit over a distance) than most copper Ethernet.

https://en.wikipedia.org/wiki/Velocity_factor
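Putting the velocity-factor point into numbers (Python; the 2/3 c figure for Cat-6 is from above, the ~0.5 c figure for an FR-4 board trace is my assumption):

    C = 299_792_458  # speed of light in vacuum, m/s

    def propagation_ns(distance_m, velocity_factor):
        """One-way signal travel time in nanoseconds."""
        return distance_m / (velocity_factor * C) * 1e9

    print(propagation_ns(1.0, 0.66))    # Cat-6-ish copper: ~5 ns per metre
    print(propagation_ns(0.05, 0.5))    # 5 cm FR-4 trace (assumed ~0.5 c): ~0.33 ns
    print(propagation_ns(0.05, 1.0))    # same 5 cm in vacuum: ~0.17 ns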


> Oh but the speed of the signal does depend quite a lot on the transmission medium

And this (actually, phase velocity) is what makes refraction a thing.

I knew that we could slow down light to subsonic speeds, but TIL we can put it at a complete standstill! Amazing.

https://en.wikipedia.org/wiki/Speed_of_light#In_a_medium


Just checking my intuition: could we still get a speedup via pipelined execution and branch prediction?

From what i see reading https://en.wikipedia.org/wiki/Multi-channel_memory_architect... different channels could, in theory, be used "autonomously of each other".


Controllers kind of do that. At the end of the day, it's what makes designing a memory controller so difficult (and I'm not even talking about the Phy, those things are straight up cursed!) We see these eye popping numbers for maximum potential bandwidths, but the reality is a bit more complicated. There's a lot that goes on behind the scenes with opening and closing memory banks, refreshes, and general read and write latencies. Unoptimized prediction algorithms (as they are programmable) can result in losing _half_ of your performance.


Pipelining helps, usually. You can have 10 stages run at 3 GHz instead of 1 stage at 300 MHz. Up until you have a branch misprediction and have to dump the pipeline.

Memory channels are independent; however, generally all cores use all channels. There are two common designs. Intel has 4 dies, with 2 memory channels per die, so 2 channels are closer/lower latency than the other 6. AMD has multiple chiplets, but a single memory-controller die with 12 channels, so all cores have the same latency to all channels.

Generally Intel has lower latencies to 25% of the channels, but AMD has more throughput (bandwidth or random IOPs).

One thing that surprised me is that for maximum throughput you want cache misses queued to the memory controller: at least twice the number of memory channels. These days missing in L1/L2/L3 is often approximately half the total memory latency. So on an Intel Xeon you want at least 16 misses (per socket), on AMD at least 24 misses (per socket).

So on Intel you could tune things (and the NUMA support helps) to prefer the local channels. Most OSs help, and C calls like numa_alloc_local() allow local control.

For memory intensive codes I have found the best scaling when there's 2 cores per channel. Of course most codes are pretty friendly.
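That rule of thumb is essentially Little's law. A rough sketch (Python, with assumed round numbers rather than measurements of any particular CPU) of how sustained bandwidth scales with the misses you keep in flight:

    def sustained_bandwidth_gbs(misses_in_flight, latency_ns, line_bytes=64):
        """Little's law: throughput = outstanding work / latency.
        With 64-byte cache lines, bytes per ns comes out directly in GB/s."""
        return misses_in_flight * line_bytes / latency_ns

    # Assumed ~40 ns of DRAM-side latency (roughly half of an ~80 ns total miss)
    for n in (8, 16, 24, 48):
        print(n, sustained_bandwidth_gbs(n, latency_ns=40))

The more misses you can keep queued, the more of the channels' aggregate bandwidth you can actually sustain, hence the "at least" in the rule above.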


thank you for the explanation!


> with the same latency

I highly doubt that. With on-die ECC and the ridiculously complicated PAM3 encoding/decoding, I would bet that latency is going to increase over GDDR6.


As the press release says, PAM3 sends 3 bits in 2 bit times whereas NRZ would require 3 bit times to send 3 bits. That, coupled with the increased WCK for shorter bit times, suggests latency shouldn't necessarily increase.

I assume by "the ridiculously complicated PAM3 encoding/decoding" you are referring to section 2.9.3,

"The total burst transfer payload per channel is encoded using 23 x 11b7S and 1 x 3b2S for the data, 6 x 3b2S for the CRC and 1 x 2b1S for the SEV/PSN, it adds up to the 176 PAM3 symbols that can be allocated for a 16 burst over 11 data lines."

That does seem complicated, using 3 different encodings (11b7S, 3b2S, 2b1S) in 1 burst.


> PAM3 sends 3 bits in 2 bit times whereas NRZ would require 3 bit times to send 3 bits

Yes, but the transactions are still 16 WCK half cycles (beats) just like in GDDR6. The designers opted for a narrower bus (per channel, and more of them) rather than shorter transactions. So that doesn't save any time. I didn't find anything on the WCK rates, but it looks like they're pretty similar to GDDR6 based on all of the examples I was able to find. So I'm not convinced of much of a time savings there either.

Now, latency numbers are measured in units of tCK, not WCK, and with GDDR6 those were pretty long relative to the time it took to actually send the transaction (2 tCK). I'm not too familiar with the internals of the DRAM, but I assume that the process of loading the data into and out of the DRAM cells is a bit involved if it takes that much time. If that were to be sped up, then we could see improvements in latency, but I'm not holding my breath.


Unless Nvidia can somehow massively increase memory capacity, it is looking bleak for them in the consumer AI inference space. From the left, Apple has a fully integrated SoC with insane memory bandwidth and capacity; from the right, AMD is tackling the FLOPs advantage using the XDNA AI Engines they got from their Xilinx acquisition, and they are going to open source the compiler for those AI Engines. The only competitive advantage Nvidia has left is its high memory bandwidth, but even that is being threatened by Strix Point, so they will need to adopt GDDR7 with 32 GB to 64 GB of VRAM fast or they will become irrelevant except for training. Oh, and by the way, AMD GPUs will stay completely irrelevant for AI, which explains why they didn't want to waste so much time on ROCm for consumer GPUs. Nobody is going to buy those for AI anymore by the end of the year.


It's not clear that the "consumer (local) AI inference space" is a real market. Ultimately Nvidia has access to all the same technologies as their competitors and better software so anything they can do Nvidia can do better... if they want to.


> Unless Nvidia can somehow massively increase memory capacity, it is looking bleak for them in the consumer AI inference space.

Who cares about consumer inference? The money is in training because that's where utterly insane amounts of compute capacity are needed, and as long as no competition comes even close to CUDA, NVIDIA has a cash cow to milk.

As for potential competition, Apple doesn't sell their silicon to anyone else, and AMD lacks the available manufacturing capacity (keep in mind, they also supply the game console market), the driver/tooling quality and, most importantly, developer trust/ecosystem quality. Everyone is using CUDA.


AMD was the first to introduce consumer HBM cards.


Well, on "consumer" cards that were AMDs desperate attempt to stay relevant in that space by rebadging their enterprise compute-focused design for consumers. They haven't used HBM in any of their actual consumer designs unfortunately.


Are you sure?

https://www.techpowerup.com/gpu-specs/radeon-r9-fury-x.c2677

> AMD has paired 4 GB HBM memory with the Radeon R9 FURY X, which are connected using a 4096-bit memory interface.


Is there any reason they couldn't make a card with lots of GDDR5, for example? Isn't it "just" memory chips that you could put a lot of on a board?


They do. The reason Nvidia doesn't do it for consumer cards is that they do it for their server cards and raise the price 40x.


I mean... CUDA. nVidia is fine.


Too big to fail? Sure, they have an edge in being the current de facto go-to solution, but that can change very rapidly if they rest on their laurels.


Still interested to see if a newer gen of HBM will ever be on retail GPUs again


I just want memory that can dynamically change on the basis of workload so it can be great for graphics and great for system performance.


Just got my hands on the spec and read through it for a bit. A couple interesting things I noticed that were notable changes from GDDR6.

Obviously, the big news is PAM3 signaling, and on die ECC. These aren't all that new, as NVIDIA's GDDR6X used PAM4 signaling (at lower frequencies than traditional GDDR6) and an unnamed DRAM vendor had GDDR6 DRAMs with on die ECC, though at the cost of having annoyingly high read and write latencies. Thankfully, only the DQ (data) pins use PAM3 signaling.

I'm going to do my best to explain how this works in GDDR7, since I need to understand this for my work:

If you don't know what PAM3 is, it stands for "Pulse Amplitude Modulation, 3 levels". Traditional communication could be thought of as PAM2, since there's a level for 0 and a level for 1, but we usually call it NRZ, for "Non-Return to Zero". There's a bit of nuance, since not all binary data communication is NRZ, but that's a different discussion. Now you might ask, why PAM3 and not PAM4? With 4 levels you can transfer 2 bits, and that seems much easier to work with. Well, it's because we hate ourselves, and we hate you. That's why.

For context, a GDDR6 channel uses 16 DQ (data) pins + 2 EDC (Error Detection/Correction) pins and 16 transfers for a transaction of 256 data bits and 32 CRC bits. GDDR6X (from my understanding) does the same thing, but since it's PAM4, it sends 2 bits per transfer with (I think) half the number of transfers.

GDDR7 on the other hand has 11 DQ pins of PAM3 signaling. Since PAM3 is 3-state {-1, 0, 1}, these are defined as "symbols" or "trits" (I really hate the word "trit".) There are 3 separate encoding methodologies (yikes) used in this protocol: 11b7S, 3b2S, 2b1S. These essentially determine how many bits of data you can encode in a given number of symbols. 11b7S means 11 bits of data encoded in 7 symbols. 11 bits of data has 2048 unique combinations, and 7 symbols have 2187 unique combinations (3^7). 3b2S is 3 bits (8 combinations) encoded in 2 symbols (9 combinations). And 2b1S is a misnomer, but it's application specific to the Poison and Severity flag bits; the Severity flag takes precedence, so the unrepresented combination is invalid.

Like GDDR6, there are still 16 transfers, but the DQ bus width is reduced to 11 DQ pins, plus a parity error pin. With this, we get a total of 176 symbols per transaction. When decoded, this allows us to send 256 bits of data, 2 bits for Poison/Severity flags (this is the 2b1S symbol), and 18 bits of CRC. The encoded side, though, is a bit of a doozy: 163 symbols for the data, 1 symbol (2b1S) for the Poison/Severity flags, and 12 symbols (3b2S) for the CRC. If that's not confusing enough, the 163 data symbols don't even all use the same encoding. The first 161 symbols encode 253 bits as 23 sets of 11b7S (11 bits in 7 symbols each). The remaining 3 bits are encoded in the remaining 2 symbols as a 3b2S set.
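A quick sanity check of that symbol budget (Python; this just replays the arithmetic above and in the spec excerpt quoted elsewhere in the thread):

    # Encoding groups check: the codewords have to fit in the trit patterns
    assert 2**11 <= 3**7   # 11b7S: 2048 codewords into 2187 patterns of 7 trits
    assert 2**3  <= 3**2   # 3b2S: 8 codewords into 9 patterns of 2 trits

    # Per transaction: 23 x 11b7S + 1 x 3b2S for data, 6 x 3b2S for CRC,
    # and 1 x 2b1S for the Poison/Severity flags.
    groups  = [(23, 11, 7), (1, 3, 2), (6, 3, 2), (1, 2, 1)]   # (count, bits, symbols)
    bits    = sum(n * b for n, b, s in groups)
    symbols = sum(n * s for n, b, s in groups)
    print(bits, symbols)   # 276 bits (256 data + 2 flags + 18 CRC) in 176 symbols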

How this is mapped to each pin is outlined in the spec. I'll take a look at that part later, as I have had enough unfriendly math for the day.

On a different note, there were some other notable changes that I noticed:

They also shrunk the CA (Command Address) bus width from 10 pins down to 5 and run it twice as fast relative to the CK and WCK clocks. Instead of 2 cycles of 10 bits, it's 4 cycles of 5 bits. The 5 bits are split into "Row" [0:2] and "Column" [3:4] bits. It was a bit weird at first glance, but it's actually kind of nice. The commands make a lot more sense now. Interestingly enough, it looks like the CABI (Command Address Bus Inversion) and CAPAR (Command Address Parity) are part of the 20 bit CA command now. Though I'm not sure how useful a CABI is if it's once every 4 cycles. Not sure how much power you're really saving.
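A toy sketch of that command framing as I read it (Python; the bit ordering and which end is CA0 are my guesses for illustration, not taken from the spec):

    def split_ca_command(cmd_20b):
        """Slice a 20-bit CA command into 4 transfers of 5 bits each,
        then split each transfer into 3 'row' and 2 'column' bits."""
        assert 0 <= cmd_20b < 2**20
        transfers = []
        for i in range(4):
            chunk = (cmd_20b >> (15 - 5 * i)) & 0b11111   # 5 CA bits per cycle (ordering assumed)
            row, col = chunk >> 2, chunk & 0b11           # 3 "row" bits, 2 "column" bits
            transfers.append((row, col))
        return transfers

    print(split_ca_command(0b10110_01001_11100_00011))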

There are 64 mode registers instead of 16. Still 12 bits wide though. Wait, WTF? This (kind of) made sense in GDDR6, since the register address and the data fit nicely into 16 bits of the command (4 address, 12 data) and you could use the other 4 bits as the command ID for MRS. Now we have 12 bit wide registers, 6 bit wide addresses, and the MRS command is a double length command because it doesn't use the column bits. This is really weird. Registers 0-31 are defined by the spec, 32-47 are reserved for future use, and 48-63 are vendor specific.

Found this funny - There's a feature for addressing clock drift between CA and WCK clocks due to varying Voltage and Temperature. It's called the "Command Address Oscillator" I wonder why they picked that name... "There is a CAOSC associated with each channel and operates fully independent of any channel’s operating frequency or state" Someone had fun with that one!

Other Quick blurbs:

- No mention of Quad Data Rate (QDR)

- CTLE looks like it's supported from the DRAM side.

- They added an RCK (read clock) pair, which is generated on the DRAM side.

- They added a static data scrambler! You don't know how happy I am to see this!


> Now you might ask, why PAM3 and not PAM4? With 4 levels you can transfer 2 bits, and that seems much easier to work with. Well, it's because we hate ourselves, and we hate you. That's why.

> For context, a GDDR6 channel uses 16 DQ (data) pins + 2 EDC (Error Detection/Correction) pins and 16 transfers for a transaction of 256 data bits and 32 CRC bits. GDDR6X (from my understanding) does the same thing, but since it's PAM4, it sends 2 bits per transfer with (I think) half the number of transfers.

GDDR6X did not transfer 2 bits at a time despite being PAM4, because they disallowed transitions between the lowest voltage and the highest voltage.

Because of that each group of 4 symbols has 139 easily-stackable sequences, so they went with 7b4S.

https://research.nvidia.com/publication/2022-04_saving-pam4-...

I couldn't tell you how each transaction is laid out.


Oh, interesting! I remember hearing rumblings that PAM4 was kind of a pain in GDDR6X. This makes sense. I still stand by my claim about PAM3 though.


How much relevance does JEDEC still have?

I would think NVIDIA in particular, and other chip makers/integrators like Apple, make up their own standards now. It also seems less relevant because memory is rarely interchangeable anymore.


JEDEC isn't a memory technology monopoly or anything, they're just a standards organization. You have a situation where lots of companies need to make products that interoperate, but interoperation is complicated in electrical engineering. There are a *lot* of ways to get a memory interconnect wrong.

So the solution is to pick (or create) a separate, notionally independent body staffed and supported by representatives of all the relevant stakeholders, and have them write "standards" that everyone agrees to adhere to. The body doesn't invent the technology, that happens at the individual chip companies. They then present their proposals to JEDEC[1] and everyone argues and agrees on what will go into GDDR19 or whatnot. And JEDEC then publishes the standards for all to see.

[1] Or whoever, JEDEC does DRAM, but there's a USB Consortium, Bluetooth SIG, WiFi is under IEEE, etc...


Apple and NVIDIA are both members of JEDEC...


HBM is a JEDEC spec these days. Apple's on package memory is still LPDDR.


Standardized memory chips allow economies of scale to work.


The physical memory modules might not be interchangeable, but JEDEC is still extremely important to hardware and firmware engineers. It gives us a baseline - we know anything that says "JEDEC" will play nice with our SoC's QSPI/DDR peripherals.


Do you have any evidence that Nvidia or Apple are using non-standard memory chips? Link? From what I can tell they are both using standard chips, but very wide memory interfaces. Apple's lowest end is 128 bit, but offer 256, 512, and 1024 bit wide memory interfaces for more bandwidth, which is mostly a benefit for the iGPU in all of Apple's M series CPUs. This is part of why Apple are pretty good at LLMs, especially those needing more RAM than is in even the most expensive GPUs.

Sad that the vast majority of x86-64 laptops and desktops have the same bus width of decades ago, while the core counts are ever increasing.


JEDEC never "standardized" GDDR6x that Nvidia does use; Micron and Nvidia worked closely on both GDDR6x and GDDR5x


> Apple's lowest end is 128 bit, but offer 256, 512, and 1024 bit wide memory interfaces for more bandwidth

Apple doesn't have any 1024-bit offerings. The Ultras are 2x 512-bit which is different from a true 1024-bit since it's a NUMA configuration (two different memory controllers).

> Sad that the vast majority of x86-64 laptops and desktops have the same bus width of decades ago, while the core counts are ever increasing.

The consistent trend is that CPU performance on consumer workloads is still much more sensitive to memory latency than to bandwidth. This is probably also why those 512-bit M1/M2 Maxes don't even bother to let the CPU make use of it; they top out at "just" 200GB/s memory bandwidth even though there's 400GB/s available to the SoC as a whole.

Being able to pair a lot of DRAM with the GPU of an M* Max/Ultra is definitely a currently unique perk, but the bandwidth numbers available to those GPUs are not actually all that special. Desktop GPUs passed 400GB/s mark way back in 2016, and are currently pushing 1TB/s in modern flagship offerings.
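For rough context on those figures, peak DRAM bandwidth is just bus width times transfer rate. A quick sketch (Python; the LPDDR5-6400 and 21 GT/s GDDR6X rates are my assumptions, not from the comment above):

    def peak_bandwidth_gbs(bus_width_bits, transfer_rate_mts):
        """Peak DRAM bandwidth in GB/s: (bus width in bytes) x (MT/s) / 1000."""
        return bus_width_bits / 8 * transfer_rate_mts / 1000

    print(peak_bandwidth_gbs(128, 6400))    # ~102 GB/s, base M-series (assuming LPDDR5-6400)
    print(peak_bandwidth_gbs(512, 6400))    # ~410 GB/s, the "Max" tier mentioned above
    print(peak_bandwidth_gbs(384, 21000))   # ~1 TB/s, a 384-bit GDDR6X flagship desktop GPU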


Agreed. I did see a limit of 220GB/s or so for the M2 Max, but haven't seen an updated number for the M3 Max. Certainly GPUs have great bandwidth, but they get quite expensive if you need more than 24GB of RAM. I've been impressed how well the M3 Max runs 70B or larger models, plenty for single-user use anyway.


Nvidia and Micron came up with GDDR6X, which isn't a JEDEC standard. JEDEC did standardize GDDR5X before that, but only Micron ever made it and only Nvidia ever used it AFAIK.


NVIDIA has basically been doing that with GDDR5X and GDDR6X already - they are "official standards" but nobody besides NVIDIA actually uses them.


>How much relevance does JEDEC still have?

Ah, a FOSSy dev insistent that his bits are gibi- and mebi- and kibi-bytes.



