AMD 3rd Gen EPYC Milan Review (anandtech.com)
208 points by pella on March 15, 2021 | 103 comments



This is the most interesting part of the interview posted at the same time.

https://www.anandtech.com/show/16548/interview-with-amd-forr...

> IC: While AMD increases the performance on its processor product line, the bandwidth out to DRAM remains constant. Is there an ideal intercept point where higher bandwidth memory makes sense for a customer?

> FN: I think you’re absolutely right, and really at the top of the stack, depending on the workload, that can be the performance limiter. If you’re comparing top of the stack parts in certain workloads, you’re not going to see as much of a performance gain from generation to generation, just because you are memory bandwidth limited at the end of the day.

> That’s going to continue as we keep increasing the performance of cores, and keep increasing the number of cores. But you should expect us to continue to increase the amount of bandwidth and memory support. DDR5 is coming, which has quite a bit of headroom over DDR4. We see more and more interest in using high bandwidth memory for an on-package solution. I think you will see SKUs in the future from a variety of companies incorporating HBM, especially for AI. That will initially be fairly specialized, to be candid, because HBM is extremely expensive. So for most, the standard DDR memory, even DDR5 memory, means that HBM is going to be confined initially to applications that are incredibly memory latency sensitive, and then you know, it’ll be interesting to see how it plays out over time.


I would love to see desktop systems move to 4 channels. 2 DIMMs/channel can die as far as I'm concerned; let's have 2 channels/DIMM for laptops. Memory channel density needs to increase along with core density (wider and/or more cores).


They won't. DDR5-6400 doubles memory bandwidth and that's what we'll get for the next decade or so.


They are still using GF 14nm on their IOD due to the Wafer Supply Agreement. We will have to wait and see how they fit their IOD with PCI-Express 5.0 and DDR5 in Zen 4.

And I can't wait to see Netflix play around with these in their FreeBSD boxes.

It is interesting that the biggest upgrade with Zen 4 won't actually be the core part but the IOD, as I am expecting Zen 4 to be some tweaks and a die shrink to 5nm. TSMC also somewhat unexpectedly announced a doubling of their 5nm capacity with new fab capacity being built. My guess would be that aggregate demand from Apple and AMD exceeded a certain threshold to be worth doing.


Core chiplets only send signals to the I/O die; the distance is measured in millimetres and the parameters of these wires are tightly controlled by AMD.

The I/O die sends signals to the rest of the hardware. Those wires are way longer, often with multiple connectors along them. The I/O chiplet needs to source and sink many more milliamps of electrical current.

One can produce larger transistors with finer processes, e.g. by adding more fins to FinFETs, but since they need larger transistors anyway to handle more current, I’m not sure there’s much value in upgrading the process of the I/O chiplet.


Hopefully they can fix their idle power consumption with a firmware tweak or a new stepping; that's a massive regression. It looks like it causes a significant performance degradation too -- more of the power budget going to IO means less for the compute cores.


> ...The core-to-core latency within a quadrant this generation has improved from a worst-case 112ns to 99ns, about a 10ns improvement. Access to remote quadrants has been reduced from up to 142ns to up to 114ns, which is actually a 24% improvement, which is considerable.

> What’s really interesting is that inter-socket latencies have also seen very notable reductions. Whereas the Rome part these went up to 272ns, the new Milan part reduces this down to 202ns, again a large 25% improvement between generations.

Good I/O is going to have a cost in terms of die area and power. Granted, what AnandTech observed is interesting for 2P systems. Maybe there's an NPS4 vs NPS1 difference at work here, or AMD just isn't doing as much power gating on the IOD as it could in one case or the other.


It looks like Zen 2 processors are about to become even more of a bargain than they already are.

I'll take a 7702P at $2-3K over a 7713P at $5K ten times out of ten.


Considering both are made on the same 7nm TSMC process, AMD probably aren't going to make any more Zen 2 processors at this point.

I think you're right that buying a generation or so old can offer gross cost savings. But that's only true for the time period when those chips are available.


AMD is going to keep Zen 2 EPYC sales going for a good while yet. Both families will co-exist in the market.


I suspect so. A lot of the commercial market wants stability. Once I've validated a server config for a particular use, I want to be able to continue building those servers for a long time (often long past obsolescence).

That may seem odd, but a lot of safety-critical applications (e.g. medical, military, aerospace, etc.) require spending tens of thousands, hundreds of thousands, or even millions of dollars (not to mention months of time) re-validating a system after any substantive change.

Even for less critical applications, spending $2000 extra on each CPU is a bargain compared to re-validating a system.

If AMD wants to be a credible presence in those markets, and I'm pretty sure it does, it needs to offer chips with many-year lifespans before EOL.

Some companies manage this by having a subset of devices or of software which is LTS.


Rather than buying new-old-stock CPUs, why not just buy all the CPUs the long-term program will ever need when they're still cheap, and stockpile them? It's not like they go bad.


There are exceptions of course, but computers are a depreciating asset. You should never buy more computers than you need at any given time, because next year offers better computers for cheaper. Similarly: don't "invest" into a car. They also depreciate.

In contrast: artwork, houses, and a few other goods (Magic: The Gathering cards?) seem to appreciate... or at least increase in price. Those you want to buy before they get more expensive.


Yes, but isn't this exactly one of those exceptions?

You need this exact model of X, Y, and Z, because those have been validated as a working combination through a huge one-time capital expenditure; and changing anything would entail going through that again.

Vendors tend not to mark down prices for components, but rather just sell them at MSRP for as long as they can, and then immediately stop producing the item. The margins on each component were usually thin enough that there's no way to sell the component any cheaper and "make it up in volume." It's either "sell it at MSRP" or "don't produce it at all."

As such, the only place to get the component after the original vendor ceases production, is the secondary market, e.g. new old stock. This is where you see governments buying old 5.25" floppy disks for huge markups.

In that case, like I said, why not just buy all the 5.25" floppies you'll ever need in advance? They know they'll be using them for 50 years, because that's the decommissioning schedule for the system they just built, and they'll be unlikely to ever get the budget to replace/upgrade it before then. They also know how long a floppy lasts in use. So, do the math, buy that many floppies early, and toss them all in an airless vault. Right?

And the same thing would be true here with particular-model CPUs, no?


Hmmm... I don't think so.

Old process nodes (such as the 130nm process, whose features are more than an order of magnitude larger than today's 12nm or 7nm processes) are still useful today. They're much cheaper to operate, because all the equipment is old and better understood (while 7nm and even 12nm processes are highly proprietary). Case in point: the Perseverance rover that just landed on Mars is literally built on the ancient 130nm process (POWER architecture).

Instead of ordering all the chips you need, you instead want to just ensure that the 130nm (or 45nm process, or 22nm process or whatever) will remain useful for 50 years into the future.

Let's say you base a modern government project (a spaceship, a jet fighter, a naval supercarrier, etc.) on a modern 14nm process node today.

1. You can order all the chips you'd "ever need" today, which is expensive today because tons of people are still using 14nm nodes.

2. Alternatively, you can order the company to continue to support the 14nm node for the next 50 years. This seems like the superior option. You order the chips you need later, when you need them. And 10 years from now, people will be on 3nm or maybe even 2nm process, so it'd be cheaper to order 2020-era 14nm chips at that point.


> Rather than buying new-old-stock CPUs, why not just buy all the CPUs the long-term program will ever need when they're still cheap, and stockpile them?

You might not know how many you need. Suppose one of your data centers burns down and has to be replaced or you have triple the expected customer demand.

Also, time value of money.


There must be a typo on the 74F3 price. US$2900 for it is a steal.


RTFA:

“Users will notice that the 16-core processor is more expensive ($3521) than the 24-core processor ($2900) here. This was the same in the previous generation, however in that case the 16-core had the higher TDP. For this launch, both the 16-core F and 24-core F have the same TDP, so the only reason I can think of for AMD to have a higher price on the 16-core processor is that it only has 2 cores per chiplet active, rather than three? Perhaps it is easier to bin a processor with an even number of cores active”


I really don't think the article's speculation there is helpful... it's really reaching.

As I said below the article in the comments:

> If I were to speculate, I would strongly guess that the actual reason is licensing. AMD knows that more people are going to want the 16 core CPUs in order to fit into certain brackets of software licensing, so AMD charges more for those to maximize profit and availability of the 16 core parts. For those customers, moving to a 24 core processor would probably mean paying significantly more for whatever software they're licensing.

This is the more compelling reason to me, and it matches with server processors that Intel and AMD have charged more for in the past.

"Even vs odd" affecting the difficulty of the binning process just sounds extremely arbitrary... definitely not likely to affect customer prices, given how many other products are in AMD's stack that don't show this same inverse pricing discrepancy.


That paragraph struck me as poorly thought out/written as well. I think finding chiplets that will run two cores at the higher 3.5GHz clock is the tricky part, not that it's harder to pick the 2 fastest cores than to pick 3.


It's definitely not a typo.


I found it in the original press release (price for 1K units, of course)

https://ir.amd.com/news-events/press-releases/detail/993/amd...


Still seems like a typo, it doesn't make any sense that the 24 / 48 would be priced between the 8 / 16 and the 16 / 32. Either the prices of the 73 and 74 were swapped or the tag is just plain wrong. "2900" is also very suspiciously round compared to every other price on the press release.


It makes perfect sense if you're an enterprise customer and your software dependencies charge you extremely different priced tiers for different maximum numbers of cores. AMD is selling a license-optimized part at a higher price because there will be plenty of demand for it.

People who don't save a boatload by getting the license-optimized CPU will invariably choose to buy the 24-core one, which helps AMD by making it easier for them to keep up with the demand for the 16-core variant, and the 16-core variant gets an unusually nice profit margin. Win win.
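
To make the licensing argument concrete, a toy example in C (the per-core license price is a made-up figure, purely illustrative of per-core-licensed enterprise software; the CPU prices are the 1K-unit prices from the press release):

    /* Toy illustration: with per-core-licensed software, the CPU price difference
     * is noise next to the license delta. The license price is a made-up figure. */
    #include <stdio.h>

    int main(void) {
        double license_per_core = 3500.0;          /* assumed, for illustration */
        double cpu16 = 3521.0, cpu24 = 2900.0;     /* 73F3 and 74F3 1K-unit prices */
        printf("16-core total: $%.0f\n", cpu16 + 16 * license_per_core);  /* ~$59,521 */
        printf("24-core total: $%.0f\n", cpu24 + 24 * license_per_core);  /* ~$86,900 */
        return 0;
    }

With those assumed numbers, the extra $621 on the 16-core part is trivial next to the roughly $27K saved on licenses.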

This is not the first time AMD or Intel have offered a weird inverse-pricing jump like this... I highly doubt it is a typo.

My other comment reiterates some of these points a different way: https://news.ycombinator.com/item?id=26469182


Yep - in the past I've done "special orders" for not-publicly-advertised CPU configs from our hardware vendor to get low core count, high-clock servers for products like Oracle DB.


It doesn't seem to be a typo. AMD offers many variations of each core configuration, with different base frequencies. It's just that there are pricing overlaps between some low-core, high-frequency versions and some higher-core, lower-frequency versions. For example, the 7513 (32 cores) is also cheaper than the 73F3 (16 cores).

  75F3 32-core 2.95GHz $4,860
  7543 32-core 2.80GHz $3,761
  7513 32-core 2.60GHz $2,840
  
  74F3 24-core 3.20GHz $2,900
  7443 24-core 2.85GHz $2,010
  7413 24-core 2.65GHz $1,825
  
  73F3 16-core 3.50GHz $3,521
  7343 16-core 3.20GHz $1,565
  7313 16-core 3.00GHz $1,083
Source: https://ir.amd.com/news-events/press-releases/detail/993/amd...


How is it suspicious?

256MB of L3 (or really, 8 x 32MB L3) and 24 cores suggests the bottom-of-the-barrel 3 cores active per 8-core CCX.

8x CCX with 3 cores each. The yields on those chips must be outstanding: it's like 62.5% of the cores could have critical errors and they can still sell it at that price.

EDIT: My numbers were wrong at first. Fixed. Zen 3 has a double-sized CCX (32MB per CCX instead of 16MB per CCX).

---------

In contrast, the 28-core 7453 is $1,570. I personally would probably go with the 28-core (with only 2x32MB of L3 cache, or 64MB) rather than the 24-core with 256MB of L3 cache.

For my applications, I bet that having 7 cores share an L3 cache (and therefore able to communicate quickly) is better than having 1 or 2 cores with 32MB of L3 to themselves.

There are also significant price savings, as well as significant power/wattage savings, with the 28-core / 64MB model.


> In contrast, the 28-core 7453 is $1,570.

Which is cheaper than the 24c 7443 and 7413 but not the 16c 7343 and 7313.

And it only has half the L3 compared to its siblings (a quarter compared to the 7543 top end), a lower turbo than every other processor in the range (whether lower or higher core counts), as well as an unimpressive base frequency, and a fairly high TDP by comparison (as high as the 7543).

The 74F3 has no such discrepancy: it has the same L3 as every other F-series part and slots neatly into the range frequency-wise: the same turbo as its siblings (with the 72F3 being 100MHz higher), a base clock 300MHz lower than the 73F3 and 250MHz higher than the 75F3.


> Which is cheaper than the 24c 7443 and 7413 but not the 16c 7343 and 7313.

28-cores for $1570 seems to be the "cheapest per core" in the entire lineup.

It all comes down to whether you want those cores actually communicating over the L3 cache or not. Do you want 7 cores per L3 cache, or do you prefer 4 cores per L3 cache?

4 cores per L3 cache benefit from having more overall cache per core, but more cores per L3 cache means that more of your threads can communicate tightly, cheaply, and effectively.

---------

More L3 cache per core probably benefits cloud deployments, virtual desktops, and similar (since those cores aren't communicating as much).

More cores per L3 cache benefits more tightly integrated multicore applications.

EDIT: Also note that "more cores" means more L1 and L2 cache, which is arguably more important in compute-heavy situations. L3 cache size is great of course, but many applications are L1/L2 constrained and will prefer more cores instead. The 24c 7443 with 2x32MB L3 is probably a better chess-engine host than the 16c 7343 with 4x32MB L3.


Right, I think $3,900 may be the correct price.


I am building a single-socket server right now; I can't really justify more than twice the price of a 7443P for a marginally higher base clock and twice the cache. Does the cache make that much of a difference? I thought these are already very large caches compared with lots of Intel CPUs.


Hmm, with AMD Threadripper you're already looking at TLB issues at these L3 sizes. So if you actually want to take advantage of lots of L3, you need either many cores or hugepages.

Case in point: AMD Zen 2 has 2048 L2 TLB entries, with a default (in Linux and Windows) of 4kB per TLB entry. That's 8MB of TLB coverage before your processor starts to page-walk.

Emphasis: your application will page-walk while the data still fits in L3 cache.

------------

I'm looking at some of this lineup with 3 cores per CCX (32MB L3 cache), which means that under default 4kB pages, those cores will always require page-walks just to read/write their 32MB of L3 cache effectively.

With that being said: 2048 TLB entries is the Zen 2 figure. Maybe AMD has increased the TLB entries for Zen 3. Either way, you probably should start looking at hugepage configuration settings...

These L3 cache sizes are absurd, to the point where they're kind of unwieldy. I mean, with enough configuration/programming you can really make these things fly. But it's not exactly plug-and-play.
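
In case it helps, here's a minimal sketch (assuming Linux with transparent hugepages available in "madvise" mode; the 32MB buffer size is purely illustrative) of hinting a large allocation onto 2MB pages so an L3-sized working set doesn't thrash the 4kB TLB:

    /* Minimal sketch: ask for 2MB (transparent) hugepages for a big buffer.
     * Assumes Linux with THP enabled in "madvise" mode. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 32UL << 20;                  /* ~32MB, on the order of one CCD's L3 */
        void *buf;
        if (posix_memalign(&buf, 2UL << 20, len)) /* 2MB alignment so THP can back it */
            return 1;
        madvise(buf, len, MADV_HUGEPAGE);         /* hint: back this range with 2MB pages */
        memset(buf, 0, len);                      /* fault the pages in */
        /* ... pointer-chase / hash / sort inside buf without constant page-walks ... */
        free(buf);
        return 0;
    }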


The 64kB page size available on Arm (and POWER) makes a lot more sense with these kinds of cache sizes. With 2MB amd64 hugepages it's only 16 different pages in that L3 cache, which for a cluster of up to 8 CPUs is not much at all when using huge pages.


TLB misses always slow down your code, even out of cache.

So having 2MB (or even 1GB) hugepages is a big advantage in memory-heavy applications, like databases. No, 1GB pages won't fit in L3 cache, but it still means you won't have to page-walk when looking for memory.

1GB pages might be too big for today's computers, but 2MB pages might be good enough as a default now. Historically, 4kB was needed for swap purposes (paging out 2MB at a time would incur too much latency), but with 32GB of RAM and SSDs in today's computers... fewer and fewer people seem to need swap.

There might be some kind of fragmentation benefit to using the smaller pages, but it really is a hassle for your CPU's TLB to try to keep track of all that virtual memory and put it back in order.

---------

While there are performance hits associated with page-walks, the page-walk process is fortunately pretty fast. So most applications probably won't notice a major speedup... still, the idea of tons of unnecessary page-walks slowing down untold amounts of code bothers me a bit for some reason.

Note: ARM also supports hugepages, so going up to 2MB (or bigger) pages on ARM is also possible.
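
If you want guaranteed 2MB pages rather than the best-effort THP hint in the sketch above, a rough alternative (assuming Linux, and that pages were reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages) is explicitly reserved hugetlb pages:

    /* Sketch: explicitly reserved 2MB pages. Unlike the THP hint, this either
     * hands back 2MB pages or fails outright. Assumes hugepages were reserved
     * beforehand via /proc/sys/vm/nr_hugepages. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;   /* 64MB = 32 x 2MB pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
        /* ... database buffer pool, index structures, etc. ... */
        munmap(buf, len);
        return 0;
    }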



The article linked at the top has pages of benchmarks. Did.... you miss them?


There are benchmarks, yes, but AnandTech's benchmarks are mostly synthetic benches of systems almost no one uses (I don't run SPEC benchmarks daily, and OK, I compile stuff, but not clang, nor using clang).

But Phoronix has benchmarks of systems quite a few people use (e.g. databases, rendering, simulation). SPEC is nice, but I wouldn't know where to look if I wanted to apply that to e.g. Postgres. With Phoronix's results, at least there is an idea of how I could apply the results of the benchmark to my usage.

So I find this link quite useful.

That's not to say that I think AnandTech's benchmarks are not valuable (I love reading the nitty-gritty details like cache timing specifics), but I can't apply them to anything other than a generic CPU workload, not even slightly in the direction of my specific workload.


SPEC benchmarks are hardly synthetic. They're Perl programs like SpamAssassin, or GCC-compile times, and other "server-like" tasks.

So one can maybe argue that those workloads don't match your use case. But... SPECINT is the "standard server benchmark" for a reason.

I think there have been some suggestions that SPEC benchmarks are a bit small these days and maybe bigger code would be more realistic. But the actual programs they use are decent.


> SPEC benchmarks are hardly synthetic. They're Perl programs like SpamAssassin, or GCC-compile times, and other "server-like" tasks.

That I didn't know.

SPEC doesn't really do much in regards to giving descriptive names to benchmarked items, so it's hard to determine what the numbers indicate (other than "this benchmark, which emulates some workload, now has X higher performance"). Some names are descriptive-ish (I think I recognise 7 out of 22 names) but it's all gibberish without a link to why those benchmarks are run and what they are based on (at least not in a way similar to how SPECjbb is explained). It might just be me failing to find it, but searching for SPEC on the page (or their website) doesn't seem to give me relevant links.

Does Anandtech have a page detailing the rationale behind their current (CPU) benchmark suite?


https://www.spec.org/cpu2017/Docs/benchmarks/500.perlbench_r...

Among other benchmarks. But Perl is the first one in their suite.

EDIT: SPECjbb is the Java benchmark suite: https://www.spec.org/jbb2015/docs/designdocument.pdf

SPECjbb is a ~2 hour benchmark simulating a high-throughput Supermarket backend.


I'm waiting for news on Genesis Peak; I'd love to get a 4th-gen Threadripper in my next box.


Computex 2021, June 1st.

* Computex 2017 is when Zen Threadripper was announced.

* Computex 2019 is when Zen 2 Threadripper was announced.

* Computex 2020 is when Zen 2 Threadripper PRO was going to be announced. After the show was delayed, and then officially canceled, AMD did their own announcement shortly after.

So odds are it'll be Computex 2021 on June 1st. This fits in with the usual release timing, as Threadripper tends to follow EPYC by a few months. I'd expect retail channel availability for Threadripper as early as late July, as there's usually a few months between announcement and channel availability.

Trust me, I've been waiting to build a 5970X as well. :)


Dude, thanks, I've been looking for any scrap of rumor or release date.


That logarithmic curve fit to the samples from a step function...


   INVLPGB New instruction to use instead of inter-core 
   unterrupts to broadcast page invalidates, requires 
   OS/hypervisor support
   
   VAES / VPCLMULQDQ AVX2 Instructions for 
   encryption/decryption acceleration
   
   SEV-ES Limits the interruptions a malicious hypervisor may 
   inject into a VM/instance

   Memory Protection Keys Application control for access- 
   disable and write-disable settings without TLB management
  
   Process Context ID (PCID) Process tags in TLB to reduce 
   flush requirements
Interruptions (Instructions) and Unterrupts (Interrupts) aside (the article obviously was pushed out as fast as AT could, lol) - these additions seem like they would help with performance when it comes to mitigations of all the speculation vulnerabilities in a hypervisor environment?


Nice bump in specs. Perhaps now they will announce the Zen 3 Threadripper :-). As others have mentioned, the TR can starve itself on memory accesses when doing a lot of cache invalidation (think pointer chasing through large datasets). If the EPYC improvement of having the chiplet CPUs all share L3 cache moves into the TR space (which one might assume it will[1]) then this could be a reason to upgrade.

[1] I may be wrong here, but the TR looks to me like an EPYC chip with the multi-CPU stuff all pulled off. It would be interesting to have a decap with the chiplets identified.


In fact, all Zen chips are equal in the core part (except for clock speed and other tuning) thanks to the reusability of chiplets. The only thing that differs between Ryzen, TR and EPYC is the IO die. And the pins (TR and EPYC use the same package, yep).


With the notable exception of Ryzen mobile, which doesn't use chiplets.


Yes, TR will have the new cache configuration, just like regular Ryzen and EPYC do.


Kinda wish there was a 3rd-gen 7282 equivalent, for those of us just building our home servers. Two sockets at <$750/CPU was kinda the sweet spot there.

(... really need to replace my 2×E5-2630v2, it's getting quite old... but I guess a 2×7282 will do.)

[Ed.: nevermind, hadn't read to the "idle consumption" graphs yet... nope, I'll stick with Rome ...]


Why would two weak sockets be better than a single-socket system?


Have you even looked at the offerings? The P-series/single-socket CPUs are, at the same specs, only slightly cheaper than the dual-socket ones. 2x7282 gives me 32 cores at $1,300 list price. The 7443P is $1,337 for 24 cores. The 7502P is even more expensive.

Two 7282s are pretty much the best perf/power/money option as far as I can see.


16 channels of DDR4-3200, versus 8 in single socket EPYC or 4 for Threadripper.
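
For scale, a rough back-of-the-envelope (assuming DDR4-3200 with a 64-bit, i.e. 8-byte, channel, so 25.6 GB/s peak per channel) of what those channel counts mean:

    /* Peak theoretical DDR4-3200 bandwidth by channel count (back of the envelope). */
    #include <stdio.h>

    int main(void) {
        double per_channel = 3200e6 * 8 / 1e9;   /* 3200 MT/s x 8 bytes = 25.6 GB/s */
        int channels[] = {4, 8, 16};             /* Threadripper, 1P EPYC, 2P EPYC */
        for (int i = 0; i < 3; i++)
            printf("%2d channels: %.1f GB/s peak\n", channels[i], channels[i] * per_channel);
        return 0;
    }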


I think someone concerned with paying "<$750/CPU" would be unlikely to want to fill 16 channels of RAM... that's a lot of money, relatively speaking.

But, maybe their home server's singular purpose is to have a lot of RAM?


The memory (64GB ea) for each socket in my old server cost about as much as the CPU for the respective socket, about 550€ each. (i.e. 1100€ per socket.)

For a 7282 box I'd spend about as much on the RAM as on the CPU again; 16GB DDR4-3200R sticks are about 80€ each, so filling one socket is 640€. The 7282 currently sells for ~700€, so that lines up all right.

(This would give me 2×128GB total; 32GB RAM sticks are unfortunately indeed still a bit too expensive, especially at 3200MHz rating. But maybe I should go with 32GB and half fill each socket & fill it out later when it's cheaper...)


Why so little L3 cache on the competition?


EPYC has a split L3 cache. Any particular core only benefits from 32MB of L3; the 33rd MB is "on another chip". (EDIT: Zen 2 was 16MB, Zen 3 is 32MB. Fixed the numbers for Zen 3.)

As such, AMD can make absolutely huge amounts of L3 cache (well, many parallel L3 clusters), while other CPU designers need to figure out how to combine the L3 so that a single core can benefit from it all.


What AMD does is not magic and is not beyond what others can do. My question is why they chose to have just 32MB for up to 80 cores when AMD can choose to have 32MB per 8-core chiplet.

As a comparison, an IBM z15 mainframe CPU has 10 cores and 256MB per socket.


> As a comparison, an IBM z15 mainframe CPU has 10 cores and 256MB per socket.

Well, that's eDRAM magic, isn't it? Most manufacturers are unable to make eDRAM on a CPU.

> My question is why they chose to have just 32MB for up to 80 cores when AMD can choose to have 32MB per 8-core chiplet.

From my understanding, those ARM chips are largely I/O devices: read from disk -> output to Ethernet.

In contrast, IBM's are known for database backends, which likely benefit from gross amounts of L3 cache. EPYC is general purpose: you might run a database on it, you might run I/O-constrained apps on it. So it's kind of a middle ground.


Didn't Intel push eDRAM magic into various laptop chips around Broadwell?

> those ARM chips are largely I/O devices

Anything Neoverse-N1 is pretty good at general compute and databases. There's definitely a lot of Postgres running on AWS Graviton2 instances already :)


IBM doesn't fab its own chips, right? I thought they used GF.


It's basically IBM fabs that were "sold" to GlobalFoundries. AFAIK IBM processors use a customized process that isn't used by any other GF customers.


That's not quite accurate. Every core has access to the entire L3, including the L3 on an entirely different socket. CPUs communicate through caches, so if a core just plain couldn't talk to another core's cache then cache coherency algorithms wouldn't work. Though a core can access the entire cache, the latency is higher when going off-die. It is really high when going to another socket.

The first generation of Epyc had a complicated hierarchy that made latency quite hard to predict, but the new architecture is simpler. A CPU can talk to a cache in the same package but on a different die with reasonably low latency.

(I don't have numbers. Still reading.)


In Zen1, the "remote L3" caches had longer read/write times than DDR4.

Think of the MESI messages that must happen before you can talk to a remote L3 cache:

1. Core#0 tries to talk to L3 cache associated with Core#17.

2. Core#17 has to evict data from L1 and L2, ensuring that its L3 cache is in fact up to date. During this time, Core#0 is stalled (or working on its hyperthread instead).

3. Once done, then Core#17's L3 cache can send the data to Core#0's L3 cache.

----------

In contrast, step#2 doesn't happen with raw DDR4 (no core owns the data).

This fact doesn't change with the new "star" architecture of Zen 2 or Zen 3. The I/O die just makes it a bit more efficient. I'd still expect remote L3 communications to be as slow as, or slower than, DDR4.
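
If you want to see the cross-CCD and cross-socket penalties on your own hardware, here's a rough sketch (assuming Linux, GCC or Clang with C11 atomics, built with -pthread; the core numbers are placeholders, check lstopo for your actual topology) of a cache-line ping-pong between two pinned threads:

    /* Rough core-to-core latency probe: two threads ping-pong a single cache
     * line. Pin them to the same CCX vs. different CCDs/sockets and compare. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000
    static _Atomic int flag = 0;            /* the cache line the two cores fight over */

    static void pin(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *ponger(void *arg) {
        pin(*(int *)arg);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void) {
        int cpu_b = 16;                      /* placeholder: pick a core on another CCD */
        pthread_t t;
        pthread_create(&t, NULL, ponger, &cpu_b);
        pin(0);                              /* placeholder: first core */

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
        }
        clock_gettime(CLOCK_MONOTONIC, &end);
        pthread_join(t, NULL);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
        printf("round trip: %.0f ns\n", ns / ITERS);
        return 0;
    }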


Limitations due to die size and power consumption since Intel Xeon is still on the ye olde 14nm++ process.

Also, since Xeon dies are monolithic, unlike AMD's chiplet design, increasing the size of certain components on the die, like cache for example, increases the risk of defects, which reduces the yields, making them unprofitable.


FYI, 'ye olde' is actually using the letter 'y' to represent an Old English 'thorn' character. The thorn character represents the 'th' sound. Printing presses gained prominence in non-English speaking regions earlier than in English speaking regions. The 'y' was substituted for the 'thorn' due to some amount of similarity in shape.

Thus, 'ye olde' would be pronounced 'the old'. 'Ye' is just a different form of 'the' from a time when 'y' represented 'th'.

So, 'the ye olde' would be read as 'the the olde'.


True, but the ARM ones have just 32MB for up to 80 threads.

I wonder if we could get numbers for L3 misses and cycles spent waiting for main memory under realistic workloads.


That information changes with every application. Literally every single program in the world has its own cache characteristics.

I suggest learning to read performance counters so that you can get information like this yourself! L3 cache is a bit difficult to profile on AMD processors (because many cores share the L3 cache), but L2 cache is pretty easy to work with and profile.

General memory reads / memory latency are pretty easy to read with various performance counters. Given the amount of latency, you can sort of guess whether it's in L3 or in DDR4.
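
As a starting point, here's a minimal sketch (assuming Linux with perf events permitted for the process; the event is the generic hardware cache-miss counter, not an AMD-specific one) of counting last-level-cache misses around a region of interest with perf_event_open:

    /* Count LLC misses for a region of code using the generic hardware
     * cache-miss event. Assumes perf_event_open is permitted for this process. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* "LLC misses" on most CPUs */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the workload you want to measure here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }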


First off, it's not a direct comparison. The Epyc has one L3 cache per chiplet. This means that latency is not uniform across the entire L3 cache. This was a serious concern on the first generation of Epyc, where accessing L3 could take anywhere from zero to three hops across an internal network. AMD has greatly improved the problem on the more recent generations by switching to a star topology with more predictable latency.

That said, there are two major reasons:

1. Epyc is on a chiplet architecture. Large chips are harder to make than small ones. Building two 200mm^2 chips is cheaper than building one 400mm^2 chip. Since Epyc has a chiplet architecture, this means they can put more silicon into a chip for the same price. This means that Epyc can be just plain bigger than the competition. This comes with some complexity and inefficiency but has, so far, paid off in spades for them.

2. Epyc is on a newer process. This means AMD can fit more transistors even with the same area. Intel has had serious problems with their newer processes, so this is not an advantage AMD expected to have when designing the part. The use of a cutting-edge process was, in part, enabled by the chiplet architecture. It is possible to fabricate several small chips on a 7nm process even though one large chip would be prohibitively expensive, and AMD has been able to use a 14nm process in parts of the CPU that wouldn't benefit from a 7nm process to cut costs.

The first point is serious cleverness on the part of AMD. The second point is mostly that Intel dropped the ball.
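
To put a rough number on point 1, here's a tiny sketch (the 0.1 defects/cm^2 density and the simple Poisson yield model, yield = exp(-D*A), are assumptions purely for illustration, not real foundry data):

    /* Illustrative Poisson yield model: yield = exp(-defect_density * area).
     * The defect density is an assumed figure, not foundry data. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double d = 0.1;                        /* assumed defects per cm^2 */
        double a_small = 2.0, a_big = 4.0;     /* 200mm^2 and 400mm^2, in cm^2 */
        printf("one 400mm^2 die : %.0f%% defect-free\n", 100 * exp(-d * a_big));
        printf("each 200mm^2 die: %.0f%% defect-free\n", 100 * exp(-d * a_small));
        return 0;
    }

With those assumed numbers, roughly two-thirds of the 400mm^2 dies come out clean versus about four in five of the 200mm^2 dies.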


What is the likelihood that mixed-process chiplets become the state of the art?


Aren't they already, for big (desktop/workstation/server) chips? I'd say Zen 3 is the state of the art in that market and it uses a mixed process: the IO dies are GlobalFoundries 12nm for AMD.

The mobile market cares more about efficiency than easily scaling up to much bigger chips, so the M1 and other ARM chips are probably going to ignore this without much consequence for smaller chips.

Intel still tops sales for non-perf-related reasons: refresh cycles, distrust of AMD from the last time they fell apart in the server space, producing chips in sufficient quantities unlike the entire rest of the industry fighting over TSMC's capacity, etc.


Intel already said they would use chiplets [1] and TSMC has been talking about the various packaging technologies being developed [2].

[1] https://www.anandtech.com/show/16021/intel-moving-to-chiplet...

[2] https://www.anandtech.com/show/16051/3dfabric-the-home-for-t...


How is a 300W CPU cooled in a server environment? Just high RPMs and good environmentals? I've stayed on Intel for my workstation so I can keep a virtually passive and quiet heatsink without having to go water.


When I worked in HPC in Tromsoe, in arctic Norway, one of the local supercomputers was liquid cooled and the heat was used to warm up some buildings on the university campus.


Servers are in 4U rack mounted enclosures with (relatively) low height heatsinks and huge amounts of airflow. Intake in front, exhaust in rear. Most clients will have beefy air conditioning to keep the ambient intake temp and humidity low.


I think 4U servers are quite rare these days, except for models designed to accommodate large numbers of either GPUs or 3.5" hard drives. Most 2-socket servers with up to a few dozen SSDs are 2U designs.


Agreed. At my last company we used a 1RU chassis with 2 blades (servers) in that space. Front-to-back cooling, quite loud. No need to worry about the noise in the DC. Power density in a rack vs. computer density is all that matters.


Y'all can try to pry my quad-CPU, 3-UPI, 4U HPE DL580 from my cold dead hands.


Nobody wants to throw their back out lifting that monster, so you win.


:-) thanks, gunning for the 16 T4s to put in there, then.


Yeah, noise is relatively unimportant so they tend to use a ton of extremely high RPM fans and big hunks of copper, from what I've seen.


And the DC staff wears hearing protection when they're working among the racks.


I wish that hearing protection had been required or at least offered when I used to visit data centers frequently. They made a big deal of the fire suppression training, but never even suggested ear plugs. 20 year-old me had no idea how bad that noise was for my ears. I hope the staff there were wearing plugs, but it was never apparent.


You only get one set of ears.

I suspect I still have the record for most expletives used in front of the headmaster at my old school because someone turned a few kW speaker on while I was wiring something under the stage - i.e. I wasn't pleased.


That still blows my mind. Coming from telecom where everything prior to the #5 ESS was convection cooled, a happy office is a quiet office.

Data got weird.


I'm assuming telecom had very different volume-power requirements. Where I grew up there were many mid-city phone switches that were large, concrete exterior, almost windowless, buildings.


To put it in perspective, the Intel Xeon 6258R is rated at 205W. Servers have oodles of (likely airconditioned) airflow so this isn't a problem.


Performance per watt is better on AMD right now, at least last I heard.


Performance per watt is better but unfortunately the idle power consumption is fairly high.


Epyc's idle power consumption is fairly high but Ryzen's isn't. The more workstation-focused Threadripper & Threadripper Pro are also still significantly better than Epyc here.


Zen2/Zen3-based, non-APU Ryzen still has pretty awful idle power consumption compared to Intel desktop CPUs.

My 5900X at idle draws around 22W. Modern Intel desktop CPUs (which are basically all Skylake refreshes right now) are 8-10W at idle.


Sure, but 22W is still a trivial amount of heat & power in the form factor of a desktop system. Fans don't need to spin for that (or if they do it's at the lowest RPM - still silent). Fans do need to spin, and with some moderate oomph, to handle 80W, though.


Is TR Pro really better? I suspect that it's just a difference between motherboards, because it's basically the same as EPYC.


TBD on TR Pro using Zen 3, but TR Pro on Zen 2 appears to be a bit better than Epyc currently. I'd guess because optimizing idle power draw on TR Pro is a more worthwhile use of time vs. servers, where your fans are running at a bajillion RPM always anyway and you're not idle as much.


Oh, I see that in the article now. It also seems like the performance gains you saw on other Zen 3 chips don't carry over to EPYC due to the sheer number of cores and other components on the chip.


By an array of fans specced at something like 8cm / 10W / 10k RPM, six fans wide and two fans deep, blowing into a passive heatsink with the help of air ducts inside the chassis.

Intel or AMD or SPARC or ARM, the setup is all the same for rackmount hardware: pure copper passive heatsinks and high-power axial fans.


Typical servers are 1U or 2U high boxes. 1U=44.5 mm and each box is "full depth", so around 700 mm long (sometimes more), and ~450 mm wide. The 1U boxes typically just have a block of copper with a dense bunch of fins on them as a CPU cooler, while 2U designs usually incorporate heatpipes.

Lots of airflow.

High-density systems are usually 2U, but with four dual-socket systems per chassis.


Servers are horrifyingly loud, and datacentres will destroy your hearing in no time flat. Airflow management in datacentres is quite the art form, as well: proper packing of the racks for airflow, height of raised floors to accommodate blown air, and so on; some people are starting to go back to the future with things like liquid-cooled racks as well.


Don't forget that you could feasibly use a 280W 64-core processor to replace a dual-socket 2x205W 2x28-core Intel box with 8 cores to spare. Obviously, there will be performance deltas based on specific use cases, but this should help frame the comparison.


A cooling unit provides a delta between the surface temperature and the ambient temperature. EPYC chips are significantly larger and more spread out because of the chiplets, so the heat density of these chips is relatively similar. So the cooler doesn't have to provide a larger temperature delta and only has to be slightly larger, if at all.

Overall heat density on chips has increased due to lithography changes, so the chiplet architecture is in a way only a stopgap.


A bit off-topic: all AMD, Intel and ARM processors come with binary blobs that can't be disabled or removed and which could potentially control your whole computer (Intel Management Engine, AMD Platform Security Processor).

Really powerful desktop and server-grade processors are available with fully open source firmware and hardware (even a fully open BMC). They're not that expensive, and come with PCIe 4.0 and other modern features [1].

They are POWER9 processors, not x86_64. But a lot of software works on them; check out 'talos-workstation' on IRC to chat with the users [you can access IRC through the Element Matrix app] - someone is running Android Studio on one (yeah, so much for privacy, but still).

[1] https://www.raptorcs.com/

These systems are currently being used by businesses and developers alike, including Google.


Honestly, Power9 is not really powerful compared to Milan; it's 1/3 to 1/2 the performance at best. https://openbenchmarking.org/vs/Processor/AMD%20EPYC%207763%...

The desktop-grade Power9 gets rekt by a Core i3: https://openbenchmarking.org/vs/Processor/POWER9%204-Core,In...


Edit: Commented something without reading, deleted.

The advantage of POWER9 is fully open firmware and hardware while still being quite powerful. They won't beat the latest AMD in performance, but AMD can never be open. The RaptorCS products are FSF RYF (Respects Your Freedom) certified.

https://ryf.fsf.org/vendors/raptor

Some benchmarks from 2019: https://www.phoronix.com/scan.php?page=article&item=rome-pow...



