
EPYC has a split L3 cache. Any particular core only benefits from 32MB of L3; the 33rd MB is "on another chip". (EDIT: Zen2 was 16MB, Zen3 is 32MB. Fixed numbers for Zen3)

As such, AMD can make absolutely huge amounts of L3 cache (well, many parallel L3 clusters), while other CPU designers need to figure out how to combine the L3 so that a single core can benefit from it all.




What AMD does is not magic and is not beyond what others can do. My question is why they chose to have just 32MB of L3 for up to 80 cores, when AMD can have 32MB per 8-core chiplet.

As a comparison, an IBM z15 mainframe CPU has 10 cores and 256MB per socket.


> As a comparison, an IBM z15 mainframe CPU has 10 cores and 256MB per socket.

Well, that's eDRAM magic, isn't it? Most manufacturers are unable to make eDRAM on a CPU.

> My question is why they chose to have just 32MB for up to 80 cores when AMD can choose to have 32MB per 8-core chiplet.

From my understanding, those ARM chips are largely I/O devices: read from disk -> output to Ethernet.

In contrast, IBM's are known for database backends, which likely benefit from gross amounts of L3 cache. EPYC is general purpose: you might run a database on it, you might run I/O-constrained apps on it. So kind of a middle ground.


Didn't Intel push eDRAM magic into various laptop chips around Broadwell?

> those ARM chips are largely I/O devices

Anything Neoverse-N1 is pretty good at general compute and databases. There's definitely a lot of Postgres running on AWS Graviton2 instances already :)


IBM doesn't fab its own chips, right? I thought they used GF.


It's basically IBM fabs that were "sold" to GlobalFoundries. AFAIK IBM processors use a customized process that isn't used by any other GF customers.


That's not quite accurate. Every core has access to the entire L3, including the L3 on an entirely different socket. CPUs communicate through caches, so if a core just plain couldn't talk to another core's cache, cache coherency algorithms wouldn't work. A core can access the entire cache, but the latency is higher when going off-die, and really high when going to another socket.

The first generation of Epyc had a complicated hierarchy that made latency quite hard to predict, but the new architecture is simpler. A CPU can talk to a cache in the same package but on a different die with reasonably low latency.

(I don't have numbers. Still reading.)


In Zen1, the "remote L3" caches had longer read/write times than DDR4.

Think of the MESI messages that must happen before you can talk to a remote L3 cache:

1. Core#0 tries to talk to L3 cache associated with Core#17.

2. Core#17 has to evict data from L1 and L2, ensuring that its L3 cache is in fact up to date. During this time, Core#0 is stalled (or working on its hyperthread instead).

3. Once done, then Core#17's L3 cache can send the data to Core#0's L3 cache.

----------

In contrast, step#2 doesn't happen with raw DDR4 (no core owns the data).

This fact doesn't change with the new "star" architecture of Zen2 or Zen3; the I/O die just makes it a bit more efficient. I'd still expect remote L3 communications to be as slow as, or slower than, DDR4.



