Assessing Cavium’s ThunderX2: The Arm Server Dream Realized

thisisit · on May 24, 2018

An important side note here is that Cavium has been bought by Marvell Semiconductors:

https://in.reuters.com/article/cavium-m-a-marvell-technlgy/m...

So future of ThunderX2 remains to be seen.

wtallis · on May 24, 2018

I don't think Marvell has anything that overlaps with ThunderX2, but they definitely have a lot of complementary products. I don't think they have any reason to cancel ThunderX2 if it looks like it can be successful.

alberth · on May 24, 2018

I thought TDP was a selling point of ARM for servers.

Cavium has a higher TDP (180W) than either of the Intel Xeon's listed in the article (150W, 165W).

Is this more about "performance per watt" then?

EDIT: Also makes me wonder just how much more energy efficient and/or performant Cavium would be if they moved away from TSMC 16nm fab over to Samsung 10nm fab (like Qualcomm). There's a sizeable difference between 16nm vs 10nm (assuming the fabs are using the same metrics). Qualcomm using Samsung 10nm fab is only 120W.

notacoward · on May 24, 2018

Yes, it's about performance per watt, and also performance per dollar. The Intel chips lose badly by that measure. Also, TDP isn't necessarily the right figure of merit here. It's defined and measured differently by different vendors, so it's not always directly comparable. The more important and more directly comparable measure is actual system-wide consumption when running a defined workload.

ajross · on May 24, 2018

> The Intel chips lose badly by that measure.

That's true, but then the Intel chips shown in that chart are Xeon Gold/Platinum parts that heretofore have had essentially no competition in the market and thus command huge premiums. Intel is making crazy bank on its server parts, basically.

So in the presence of real competition (which this looks likely to be, though to be fair it's not actually in the market yet and is likely priced at a point intended to grab eyeballs), it's likely that Xeon parts will drop in price. You won't be looking at that delta for long if Intel starts losing market share.

So... if they were merely cost-equivalent, is this part worth it? Because the pure silicon metrics here (performance per watt and per die area), as mentioned upthread, are sort of a wash. And maturity/stability issues (c.f. the bit in the linked article where they point out that the Cavium board draws 300W at idle!) kill products in the datacenter space all the time. Consumers are OK with occasional glitches, but no one is willing to put anything on their rack that has any kind of failure risk at all.

notacoward · on May 24, 2018

> no one is willing to put anything on their rack that has any kind of failure risk at all.

Are you sure about that? The people who have hundreds of thousands to millions of machines also have systems designed to handle a certain failure rate, and have historically been quite happy to run on non-premium hardware. The costs do add up. Even if it's just a handful of companies, in terms of market percentage that's far from no one.

ajross · on May 24, 2018

It's a standard thing in the enterprise space, yes. No one buys anything that seems flaky. It's true that big datacenter operators (Amazon, Google, Netflix et. al.) have a lot more freedom to handle risk.

But regardless, things like the 300W idle draw are going to kill a sale even there. That's almost a dollar a day, so there goes the price delta with Intel over the life of the system already.

notacoward · on May 24, 2018

The idle draw is exactly the sort of thing one would expect to be fixed well before volume production. I've been at companies that made hardware (in fact at least one of my co-workers from one of those jobs is at Cavium right now) so I honestly think that's a bit of a red herring. Comparing pre-production hardware with production is always a bit unfair; let's not make it more so.

ajross · on May 24, 2018

Wait wait wait. If it's preproduction hardware then the prices in the article are meaningless because you can't buy it. If it's unfair to compare performance metrics on preproduction hardware than surely it's equally unfair to compare prices (I mean, they may not even know the yields yet). You don't get to claim "Intel loses badly" and then hide behind "well, they can't have expected to have fixed everything yet!".

For comparison, those Xeon parts are Skylake-SP dies that first shipped 11 months ago. You can buy them on Amazon and rent them on AWS. They're a mature product in a well-established channel and their prices are stable and matched to the market.

The bottom line is that Cavium pulled out all the stops, shipped a board that works and performs... very adequately, with a few glitches. And frankly I don't know that glitchy acceptability is going to sell a whole lot of servers. So they slapped a price tag on it that draws eyeballs that the performance alone won't. If they can do that it's probably good news for them, but it's still a long road from here to revenue.

notacoward · on May 25, 2018

Yes, the prices on both sides are kind of meaningless, but one can make reasonable guesses about how they're likely to change. There is no plausible scenario in which Intel drops their prices far enough for this one to be close. They've never done anything like that before, even when it might have made sense. One can also make reasonable guesses about what kinds of flaws are likely to be fixed before the hardware goes into full production. Power management not kicking in when it should is a pretty obvious one. Yes, there's some guesswork, but there's a big difference between educated guesses vs. "Cavium's problems are carved in stone and Intel's prices are infinitely flexible" wishful thinking. Don't put all of your money into Intel stock.

ajross · on May 25, 2018

> There is no plausible scenario in which Intel drops their prices far enough for this one to be close.

"No plausible scenario." Just put that one on a T shirt and come back in a year. The tech industry is littered with the graves of hardware products that got close to being great but ultimately didn't work out. And you're comparing it to a year-old mature product literally stocked at Amazon.

notacoward · on May 25, 2018

You do realize that Intel themselves were the upstarts once, right?

ajross · on May 25, 2018

And unlike the literally dozens of semiconductor failures that littered the valley of the 1970's, they succeeded. This doesn't help your point.

I'm not saying Cavium cannot succeed, I'm saying delivering a chip to a reviewer that doesn't actually beat a Xeon probably isn't enough, no matter what price tag you invent.

notacoward · on May 29, 2018

Let me tip you off about something people with actual business experience know: if you don't get your stuff in front of reviewers, flawed as it may be, you fail. Hardware has long lead times. To get design wins, you have to sell on promise. Even some of Intel's own foundational products were objectively worse than contemporaries (e.g. even the IBM PC designers knew that the 68K was a better chip than the 8088) but got better over time. They got better because they had the wins due to selling on promise, and therefore have the funds. Any Intel competitor who waits until their product is better than Intel's on every metric, beyond any doubt, before they show anything to anyone, will run out of funding and quietly die.

As I said, one has to make educated guesses about what kinds of problems are likely to be fixed before full launch. Dogmatic statements based on experience limited to irrelevant domains aren't useful to anyone, and plain old pro-Intel FUD even less so.

jandrewrogers · on May 24, 2018

I know many people that have evaluated ARM for a variety of server applications and ARM's advantages, in terms of tradeoffs, are not what many people assume. Where ARM shows well is applications that spend a significant amount of time idle in power-restricted environments (battery-powered analytics clusters being a common example) and running software that is weakly optimized for efficiency (some web app stacks look like this). In practice, this is a niche set of requirements for servers.

The challenge that ARM has long had in servers is that highly optimized and efficient Intel codes have competitive performance per watt and performance per dollar. Cavium is quite open that their cores are optimized for low ILP/IPC software, which is essentially arbitraging software that is poorly designed for x86 silicon. As data and analysis scales have increased, more standard server software is specifically designed to be hyper-efficient on Intel, not just HPC codes like the old days. This makes it increasingly difficult for ARM to compete with the operation throughput per watt/dollar that is possible with Intel and well optimized software.

There is the additional issue with platforms like ThunderX2 that even though it could be competitive in theory, it would require software to be specifically engineered around the peculiarities of the microarchitecture. Software that assumes Intel microarchitectures and optimizes toward that, either explicitly or implicitly, will inherently be suboptimal on ThunderX2. This isn't something that can be fixed in a compiler, it is intrinsic to the architecture of the software at a higher level e.g. C++ codes embed many assumptions about CPU cache topology and properties.

moconnor · on May 24, 2018

I worked in HPC for over a decade. Most of the codes are not hyper-efficient on Intel - particularly not current-generation Intels that need efficient use of very wide vectors to reach peak performance.

Many codes are memory-bound, and the TX2 has excellent bandwidth. This shows up in real-world simulation codes such as OpenFOAM.

gnufx · on May 24, 2018

Indeed -- memory- or communication-bound. Somewhere there's a Dell "roofline" plot indicating IPC in some codes running in an unspecified way. See also John McCalpin's writings.

Things doing 3-D FFTs (common in materials science) tend to spend a lot of time in an MPI collective at sufficient scale. I spent half an hour profiling and then changing an MPI parameter for 30% improvement in one case at not very large scale.

auvi · on May 24, 2018

Very interesting as I used to do OpenFOAM parallel runs years ago on x86. Do you have any links to OpenFOAM benchmarks on ARM64?

Ar-Curunir · on May 24, 2018

The conclusion of the linked article has a reference to a relevant paper

gnufx · on May 24, 2018

But, as usual, you don't know what the profile of the calculations were, in particular because you don't know the mode it's operating in/data it's operating on. ("OpenFOAM" is actually many different programs.)

That said, the indications are that the performance is decent for HPC. What I haven't seen is a comparison with Ryzen (or POWER9). There's also a lack of data even on the SIMD hardware in ThunderX2 and POWER9, at least that I've been able to find.

namibj · on May 24, 2018

For cache-tuning, there is polyhedral optimization, available in LLVM via Polly and in GCC via Graphite. Of course they don't fix everything, but they are great at automating loop nesting decisions with the goal of matching data locality to cache size(s), and, potentially, even SSD/HDD storage (think compiling the software so that it makes good use of the 1GB ram/core you can give it, and not be awfully inefficient with it's swapping.

gnufx · on May 24, 2018

Unfortunately, the polyhedral optimization was still broken in gcc last I tried. Perhaps it's fixed in gcc 8.

namibj · on May 25, 2018

My tests on a recent (6 weeks, I'm too lazy to re-build) git snapshot from master worked reasonably well, but I didn't quite reach the full testing stage, as it was doing something (I was trying to benchmark CockroachDB on Goldmont (server atom), and the -mnative flags are just a couple weeks old on the mailinglist, and a month younger in git master.) and I didn't have time to check back. But it appeared to work when I activated the command line flags.

Symmetry · on May 24, 2018

Small microservers with cellphone class cores in them were something a few companies pursued but they never really took off. And there was one company with Atom based servers in the same space, it wasn't just all ARM A9s.

This one is about trading off minimum latency for more throughput with lots of more medium sized cores in a way that's probably better suited to most server workloads.

floatboth · on May 24, 2018

> assuming the fabs are using the same metrics

They're not. Nanometers are almost meaningless now.

notacoward · on May 24, 2018

Between this and Zen, Intel is definitely starting to feel some heat. The next year or so is going to be interesting. Maybe Intel will recover; maybe we'll look back and see this as the point where they lost their grip.

silisili · on May 24, 2018

I used to think this, but really don't see it happening. Intel has a lot of weight, money, and is pretty vertically integrated. They've been rather happy to stick where they're at, where the money is good. And no company is really threatening them there, and may not in the foreseeable future. But make no mistake, once that market begins to erode, they'll branch out more. Their mobile efforts have been mostly half assed, but I have no doubt they could cream both Apple and Qualcomm if they threw their resources at it.

alberth · on May 24, 2018

Can anyone speak to the techical merits (or disadvantages) of arranging your Cores by Mesh or Ring?

I noticed all the ARM chips use a Ring bus whereas Intel uses a Mesh.

Symmetry · on May 24, 2018

Intel was using a ring too until their most recent architecture. Sometimes multiple intersecting rings when the number of cores grew too large.

dweekly · on May 24, 2018

Fewer hops: https://software.intel.com/en-us/articles/intel-xeon-process...

brandmeyer · on May 24, 2018

Cost versus variation in memory access latency.

auslander · on May 24, 2018

If it could Libreboot, that would be like first free of firmware blobs practical server in decades.

Good Marvell bought it. Their ARM boards are already fw blob free.

spacenick88 · on May 25, 2018

Have a look at Talos II, it's using POWER9 with completely open firmware

auslander · on May 25, 2018

Checks out, I stand corrected, thanks. libreboot.org/news/talos.html

newprint · on May 24, 2018

I would curious, where can I buy this system ? Anyone one know resellers ? At $$$ listed, I would buy cheapest system in a heartbeat.

ajross · on May 24, 2018

This is a press release detailing the what appears to be the system under test: https://www.cavium.com/news/gigabyte-announces-thunderxstati...

I can't find a mention of it on Gigabyte's site, but there's an email there you could try.

Honestly: this is a brand new device that clearly isn't actually in the channel yet. They shipped out test systems for press hits like this, it's not really a "product" quite yet.

rwmj · on May 24, 2018

Cavium are actually pretty good out of all the ARM server vendors of, you know, selling the hardware. (For some reason other vendors found that really hard.) The ThunderX2 has not been generally released yet, but I'm expecting it'll turn up on the same sites selling the old ThunderX, eg. https://www.avantek.co.uk/store/arm-servers.html

aseipp · on May 24, 2018

Out of curiosity I looked, and they actually already have pricing for the ThunderX2 Workstation from Gigabyte -- the very one mentioned in the article: about 10,000 GBP out of the box.

https://www.avantek.co.uk/store/avantek-thunderx2-arm-workst...

(They have ThunderX2 blades on the server section of their store with no prices yet, however.)

shaklee3 · on May 24, 2018

You can try the ThunderX on packet.net.

pmontra · on May 24, 2018

I know ARM processors shouldn't be subject to the recent vulnerabilities, or not as much as Intel and AMD, but I'm surprised that the article doesn't address that point. A decent CPU not vulnerable to Spectre, Meltdown and friends should be able to sell well only because of that.

floatboth · on May 24, 2018

The ISA doesn't matter very much.

Meltdown is specifically Intel's fault.

Spectre is, AFAIK, inherent to out-of-order execution. So the original ThunderX is not vulnerable, ThunderX 2 is.

monocasa · on May 24, 2018

> branch intensive code (databases, AI...)

AI is branch intensive? Like, video game style state machines are, but that's not what people mean these days when the say AI in connection with servers, right?

yazr · on May 24, 2018

Can i get the SPECjbb benchmark mentioned in the article ?

The web site claims that it is pay-only $1500 https://www.spec.org/order.html

jsgo · on May 24, 2018

sounds great, didn't I read that ARM was backing off on server though (hoping it was a bad article).

edit: thanks, you all nailed it. I remember now and it was Qualcomm. Whew, I'm glad as I was bummed that this meant there was a good step for a piece of tech that would be DoA. Really wish Qualcomm wasn't abandoning based on how it appeared they were making progress as well to be a good competitor (and competition has been great where applied).

mtgx · on May 24, 2018

Not Arm, but Qualcomm. It was also a strange and sudden decision. I don't think they really wanted to do it, but might have been forced to do it by the circumstances: anti-trust lawsuits, Apple dumping them, Broadcom attempting a hostile takeover, U.S. government pressuring them not to sell to Broadcom no matter what, etc.

One or all of these may have forced Qualcomm to cut its losses. Either way, I hope they get to sell the Centriq line to some other company that can provide competition to both Cavium and Intel/AMD/IBM in the server space. We need more than one Arm chip to compete in the space, otherwise Cavium's chip could also be seen as some kind of outlier and be ignored by the industry.

wyldfire · on May 24, 2018

> One or all of these may have forced Qualcomm to cut its losses.

Yeah, I think the LBO really scared Qualcomm. Their royalties fund a lot of the other work in the company and I think they've learned that it also makes them a target. A domestic LBO could happen and then there'd be no escape hatch. They pledged to cut costs when Broadcom announced their intentions, this may be a part of that.

I don't think of them selling the server chip business as "cutting its losses" -- instead I would call it narrowing its focus back to businesses closer to its core competency. The Centriq launch with Amberwing seemed to be relatively positive for a first design release. They didn't turn the industry upside-down but they probably achieved most of what they set out to do. It will take some time to convince server customers to switch, recompile their code and get their vendors to recompile their code.

> We need more than one Arm chip to compete in the space, otherwise Cavium's chip could also be seen as some kind of outlier and be ignored by the industry.

Agreed. There's momentum that's been built [1,2] and I'd hate to see it flub now.

[1] https://www.redhat.com/en/blog/red-hat-introduces-arm-server...

[2] https://www.suse.com/c/news/suse-steps-up-to-support-innovat...

jsgo · on May 24, 2018

I hope they do something with it though. Whether sell it to someone who can be a good caretaker to it, share lessons and designs with other companies that could then build upon it, spinning it off, whatever they choose. Just abandoning it outright seems like an unfortunate move. I won't say bad, because I'm sure they have their reasons, but unfortunate in that it seemed like it was on track to be viable.

Microsoft's Surface Pro line, to me, was kind of an "eh, that's nice I guess" up until the Pro 3 at which point it seemed like a platform that had hit its stride. The stuff I was reading on Amberwing seemed like it wasn't a silver bullet, but what it did work for, it worked very well when comparing price vs performance so one could hope that with time, they could make it even better.

auslander · on May 24, 2018

Or NSA was pushing for closed source system firmware a.k.a blobs

pertymcpert · on May 24, 2018

On the contrary. That was Qualcomm perhaps you read about.