EPYC isn't NUMA anymore.

dragontamer · on Oct 7, 2021

Both EPYC and Intel Skylake-X are NUMA.

Yes, Skylake-X. It turns out that the placement of those L3 caches matters, and some cores are closer to some memory controllers than others.

https://software.intel.com/content/www/us/en/develop/article...

------------

Some cores have lower-latency access to some memory channels than to others. Modern CPUs are so big that, even when everything is on a single chip, the difference in latency can be measured.

The only question that matters is: what are the bandwidth and latency from _EACH_ core to _EACH_ memory channel? The answer is "it varies". "It varies" a bit for Skylake, "it varies a bit more" for Rome (Zen 2), and "it varies a lot more" for Naples (Zen 1).

---------

For simplicity, both AMD and Intel offer memory layouts (usually round-robin interleaving) that "mix" the memory channels across all the cores, so every core sees roughly the average latency.
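A toy model of that "average latency" effect, as a sketch only: it assumes a simple round-robin interleave with a hypothetical 256-byte granule across 4 channels, and the per-channel latencies are made-up placeholder numbers, not measurements of any real chip.

```python
# Toy model only: granule size, channel count, and latencies are hypothetical.
GRANULE = 256          # bytes per interleave step (assumed, not a real spec)
NUM_CHANNELS = 4

# Hypothetical latency (ns) from one core to each channel; channel 0 is "near".
LATENCY_NS = [80.0, 90.0, 90.0, 100.0]


def channel_for(addr: int) -> int:
    """Round-robin interleave: consecutive granules go to consecutive channels."""
    return (addr // GRANULE) % NUM_CHANNELS


def mean_latency(buffer_bytes: int) -> float:
    """Average latency over one access per granule of a contiguous buffer."""
    granules = range(0, buffer_bytes, GRANULE)
    total = sum(LATENCY_NS[channel_for(a)] for a in granules)
    return total / len(granules)


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB buffer, spread evenly over all channels
    print(f"interleaved: {mean_latency(size):.1f} ns")
    print(f"local-only:  {min(LATENCY_NS):.1f} ns")
```

With interleaving, a large buffer lands on every channel equally, so the core pays the mean of its row rather than its best-case (local) latency.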

But for those willing to trade complexity for slightly better performance, both AMD and Intel also offer NUMA modes: 4-NUMA (four NUMA nodes per socket) on AMD's EPYCs, or SNC (Sub-NUMA Clustering) on Intel chips. There is always a set of programmers who care enough about latency/bandwidth to drop down to this level.
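Once a NUMA mode is enabled, the OS exposes each node and its CPUs. A minimal sketch, assuming Linux, which lists each node's CPUs in /sys/devices/system/node/nodeN/cpulist using ranges like 0-3,8-11. The helper names (parse_cpulist, numa_nodes, pin_to_node) are hypothetical; os.sched_setaffinity is the real Linux-only standard-library call.

```python
import glob
import os
import re


def parse_cpulist(text: str) -> list[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids, read from sysfs (empty dict if absent)."""
    nodes: dict[int, list[int]] = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes


def pin_to_node(node_id: int, topology: dict[int, list[int]]) -> None:
    """Restrict this process to one node's CPUs (Linux only).

    Under Linux's default local-allocation policy, pages this process
    touches first afterwards are normally allocated from that node.
    """
    os.sched_setaffinity(0, topology[node_id])


if __name__ == "__main__":
    print(numa_nodes())
```

Pinning before the first touch is the design choice here: with the default policy, where a page is first faulted in decides which node's memory backs it, so threads and their data end up on the same node.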

monocasa · on Oct 7, 2021

It looks like the parent edited the context out of their post.

They were specifically calling out EPYC's extreme NUMA-ness, in contrast to Intel's, as the cause of their problems. That distinction has more or less been eliminated since Zen 2, to the point that the NUMA considerations are basically the same for Intel and AMD (and really would be for any similar high-core-count design).

jeffbee · on Oct 7, 2021

Don't blame me for editing out something you imagined. I didn't touch it. If you're having problems with hallucinations and memory, see a neurologist.

MisterTea · on Oct 7, 2021

I cannot find anything to back up this claim. How else is AMD linking the multiple dies/sockets together?

monocasa · on Oct 7, 2021

The DDR phys are on the I/O die, so all of the core complexes have the same length path to DRAM.

Multi socket is still NUMA, but that's true of Intel as well.

dragontamer · on Oct 7, 2021

> The DDR phys are on the I/O die, so all of the core complexes have the same length path to DRAM.

The I/O die has 4 quadrants. The 2 chiplets in the 1st quadrant access that quadrant's 2 memory channels slightly faster than they access the 4th quadrant's channels.
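That quadrant layout can be written down as a small sketch. It assumes Rome-style counts (8 chiplets, 2 per quadrant, 2 memory channels per quadrant) with hypothetical 0-based numbering in order, which is an illustrative convention rather than AMD's actual channel lettering. It only classifies channels as local or remote and does not try to model the size of the latency difference.

```python
# Illustrative numbering only: 0-based, chiplets and channels counted in order.
CCDS_PER_QUADRANT = 2
CHANNELS_PER_QUADRANT = 2
NUM_QUADRANTS = 4


def quadrant_of_ccd(ccd: int) -> int:
    """Quadrant of the I/O die that a chiplet (CCD) attaches to."""
    return ccd // CCDS_PER_QUADRANT


def local_channels(ccd: int) -> list[int]:
    """Memory channels in the same quadrant as the given chiplet."""
    start = quadrant_of_ccd(ccd) * CHANNELS_PER_QUADRANT
    return list(range(start, start + CHANNELS_PER_QUADRANT))


def is_local(ccd: int, channel: int) -> bool:
    """True if the channel sits in the chiplet's own quadrant."""
    return channel // CHANNELS_PER_QUADRANT == quadrant_of_ccd(ccd)


if __name__ == "__main__":
    for ccd in range(NUM_QUADRANTS * CCDS_PER_QUADRANT):
        print(f"CCD {ccd}: quadrant {quadrant_of_ccd(ccd)}, "
              f"local channels {local_channels(ccd)}")
```

In these 0-based terms, the "1st quadrant" is quadrant 0 (CCDs 0 and 1, channels 0 and 1), and the "4th quadrant" is quadrant 3 (channels 6 and 7).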

> Multi socket is still NUMA, but that's true of Intel as well.

Intel has 6 memory channels split into two groups of 3, IIRC (I'm going off of my memory here). In an 18-core Intel Skylake-X chip, the "left 9 cores" access the "left" 3 memory channels a bit faster than the "right 9 cores" do.

--------

Both AMD and Intel have non-uniform latency/bandwidth even within the chips that they make.

monocasa · on Oct 7, 2021

There's a difference of a few cycles based on how the on-chip network works, but variability in the number of off-chip links between you and memory is what dominates the design. And that, in the context of what the parent said (but has since edited out), was what was being discussed.