Since POWER9 and OpenSPARC are open architectures with open ISAs, I don't see why companies like Facebook, Amazon, and Alibaba aren't using them and are instead trying to build CPUs based on ARM. Is there a much better performance/power ratio which can be achieved by ARM and not by POWER or SPARC?
An ISA is valuable only if it has many users, e.g., application code optimized for the ISA.
HW's job is to then run that code with good perf/cost.
OpenPOWER has little software.
ARM has a lot of software.
So from that POV, ARM is already many orders of magnitude more valuable than OpenPOWER.
But it doesn't end there. Do you need some software to be extremely optimized for ARM? ARM can do this for you at a reasonable price; no need to hire.
Also, for OpenPOWER, you need to hire 50-100 Facebook engineers at $400k/year, and it'll take them >3 years to produce a chip design, which then needs to be verified, etc., and then needs to be built, so you'll need a fab, specialized in OpenPOWER or not. A fab churns out ~40k chips/month, so how many chips per month do these companies actually need?
With ARM, you pick one of the many ARM design houses, and there is little for you to do. You get 5 engineers, and they just customize an ARM design to your needs. And they ship in 1 year instead of 3. And next year ARM gives you a way to update your chip to the next generation. And if next year you need some other feature, ARM gives it to you. And if you need software, like a C library, profilers, math libraries, all that is supplied by ARM.
And they take royalties on chips you built, and.... and....
So ARM is many orders of magnitude better in perf/$ than OpenPOWER. Not only is the hardware better, it is also cheaper, has more software and tools, and comes with teams of experts ready to help your team, etc.
Porting most software to ARM64, Power, or RISC-V involves typing some variation of "make." Only a small percentage of software written in C/C++ or ASM is problematic. Anything in a higher level language like Go or a newer language like Rust is generally 100% portable.
Switching from X86_64 to ARM64 (M1) for my desktop dev system was trivial.
Endian-ness used to bite, but today virtually everything is little-endian above embedded. Power and some ARM support both modes but almost always run in little-endian mode (e.g. ppc64le).
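(If you want to check which mode you're actually running in, here's a trivial C sketch of my own, not from the thread:)

    /* Prints "little" on ppc64le, 64-bit ARM Linux, and x86_64;
       "big" on e.g. AIX or sgimips. Inspects the first byte of a
       known 32-bit value. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t x = 1;
        puts(*(const uint8_t *)&x ? "little" : "big");
        return 0;
    }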
- Have you ever, e.g., computed the sine of a floating point number in C (sinf)?
- Have you ever multiplied a matrix with a vector, or a matrix with a matrix (GEMM), using BLAS?
- Have you ever done an FFT?
- Have you used C++ barriers? Or pthreads? Or mutexes?
An optimized implementation achieves ~100% of a CPU's theoretical peak performance on all of those, and these implementations are all tailored to each CPU model.
There is software on any running system doing those things all the time.
Generic versions perform at < 100%, often at ~0% (0.1%, 0.001%, etc.) of theoretical peak.
Running at ~0% of peak just means increased power consumption, latency, time to finish, etc.
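To make the gap concrete, here's a minimal sketch of my own (assuming an OpenBLAS- or ESSL-style cblas.h): the same C = A*B computed by a generic triple loop and by the tuned cblas_dgemm. The BLAS version is blocked and vectorized for each CPU model, which is where the orders of magnitude come from on large matrices.

    /* Generic GEMM: correct, portable, and nowhere near peak. */
    #include <cblas.h>

    void gemm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }

    /* Tuned GEMM: C = 1.0*A*B + 0.0*C, row-major, no transposes.
       The library picks a kernel hand-optimized for the host CPU. */
    void gemm_blas(int n, const double *A, const double *B, double *C) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }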
Somebody has to write software for doing these things for the actual hardware, so that you can then call them from Python.
IBM has dozens of "open source" bounties open for PowerPC, and they pay real $$$, but nobody implements them.
---
Porting software to PowerPC is only as simple as running make if the libraries your software uses (the C standard library, libm, BLAS, etc.) all have optimized implementations, which isn't the case.
So when considering PowerPC, you have to divide the paper numbers by 100 to get the actual numbers that normal code recompiled with make gets in practice. And then you have to invest extra $$$ into improving that software, cause nobody will do it for you.
Er, no. I do that stuff (well, I'm not clever enough for C++ generally, and it would be OpenMP rather than plain pthreads) on the sort of nodes that Sierra uses. However, they mostly use the GPUs, for which POWER9 has particular support. And I can tell there isn't currently any GEMV or FFT running on this system, and not "all the time" even on our HPC nodes.
While it isn't necessarily clear what peak performance means, MKL or OpenBLAS, for instance, is only ~100% of serial peak on large DGEMM for a value of 100 = 90; ESSL is similar. I haven't measured GEMV (ultimately memory-bound), but I got ~75% of hand-optimized DGEMM performance on Haswell with pure C, and I'd expect similar on POWER if I measured. Those orders of magnitude are orders off, even for, say, reference BLAS. I don't know why I need Python, but the software clearly exists -- all those things and more (like a vectorized libm). You can even compile assorted x86 intrinsics on POWER, though I don't know how well they perform relative to equivalent x86; I think you're typically better off with an optimizing compiler anyway.
I've packaged a lot of HPC/research software, which is almost all available for ppc64le; the only things missing are dmtcp, proot, and libxsmm (if libsmm isn't good enough).
You start with BLAS being a factor of 2 off, then go to PETSc and are another couple of factors off, and then the actual app the scientist wrote, which may use all of the above and the kitchen sink, where every piece and the pieces they use are all a couple of factors off, and then your scientist's app is at 0.01% of peak.
If you have used Sierra since the beginning, you'll have seen significant performance increases over the years, because the people using it have actually been discovering problems and then either getting IBM to fix them, or fixing them themselves, across most of the software.
Compared with Power 10, I'd say that Power 9 is "mainstream" (many clusters available), and of the Power 9 CPUs in existence, IBM's are the most mainstream of them all.
Take the Power 10 ISA, build your own CPU that significantly differs from IBM's, and good luck with optimizing all the software above. It can be done, and dumping it on a couple of HPC sites, where scientists and staff won't have any choice but to use it for 4-6 years, is a good way to get that done.
But for a private company that just wants to deliver value, ARM is just a much better deal, cause it saves them from having to do any of this.
Endianness is not the only problem. You can have issues with a different cache coherency model, different alignment requirements, different syscalls (which are partially arch-dependent, at least on Linux). The fact that the switch from x86 to ARM was trivial just proves the point that ARM has matured really well.
How? For example, a different memory model isn't something you can just flip a switch to fix: someone needs to review/test application code to see whether it has latent bugs which are masked (or simply harder to reproduce) on x86. Apple went to the trouble of implementing TSO support in their silicon to avoid that, but if you don't run on similar hardware, it's likely that you'll hit some issue in this regard for any multithreaded program, and those are exactly the kinds of bugs which people are going to miss in a simple cross-compile with limited testing.
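As a sketch of the kind of bug in question (my own example, not from the comment): the classic publish/consume pattern "works" under x86's total store order even with plain loads and stores, but on ARM or POWER the reader can observe the flag before the data unless C11 release/acquire atomics are used, as below.

    /* With a plain int flag and plain stores, the reader on a weakly
       ordered CPU (ARM, POWER) may see ready == 1 while data is still
       0, and the compiler may hoist the load out of the loop entirely.
       The release/acquire pair below makes it correct everywhere. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int data;          /* payload, published by writer */
    static atomic_int ready;  /* publication flag */

    static void *writer(void *arg) {
        (void)arg;
        data = 42;
        /* Release store: all prior writes become visible to any
           thread that acquire-loads ready == 1. */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                     /* acquire pairs with the release */
        printf("%d\n", data);     /* guaranteed to print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t2, NULL, reader, NULL);
        pthread_create(&t1, NULL, writer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }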
x86 is at step 10000, ARM at step 5000, power is at step 0.
Firefox "worked" on Power before this post. Now somebody has put in enough effort to actually make it usable.
The fact that you don't see people complaining about Firefox PowerPC performance on Linux is not because performance was good - it was unusably slow - but because nobody uses Firefox on Power.
Think about what that means. Think about how many bugs in Firefox are reported _every day_ for x86 and ARM, and how many are reported for PowerPC. Is that also because the PowerPC version has no bugs? (no, it is because nobody uses it, nobody reports them, and nobody fixes them).
> x86 is at step 10000, ARM at step 5000, power is at step 0.
I agree with your general point, but I do believe that Power is the most "practical" ISA after x86 and ARM. Albeit a distant third, it's definitely not at 0. It has the full support of a bunch of mainstream distros, public container registries have a decent amount of support for their images, and people actually run pretty serious workloads on Linux on Power.
Power does have a lot of niche backing, although it's continuously being hurt by IBM's total lack of interest in pushing it beyond the billion-dollar contracts they're milking with it. That's totally destroying any mindshare Power has. There's really no way to get a cloud shell on a modern Power machine, or physical access to a modern one, without forking over thousands of dollars for the privilege (and the latter is only really possible thanks to Talos' amazing efforts, bless 'em).
I think I agree with what you say; however, an important detail is that any given bug (known or not) has pretty low odds of being architecture-specific.
AFAIK this article is a big deal because it's a JIT, i.e., a big chunk of architecture-specific code needed to get good performance. But most code is not going to be like that.
That's not to say that architecture-specific bugs will not exist. But I think your outlook on this is a little pessimistic.
Well, ahem, somebody does try to fix them, and we do get reports which get triaged (I know this, because I've done a number of the fixes, some of which were not trivial). There are much fewer of them, which I think is your point, but there aren't none, and there isn't nobody who cares. I think you're overplaying your hand here.
If the implication is that POWER is somehow new, I first used it when it was RS/6000 and introduced FMA. There was subsequently a rather large installation at my lab. Firefox without the JIT is only a problem with the "modern" web, and I default to turning off Javascript anyway, and I guess someone uses it to make it worth porting.
Prior to this work, Firefox was also ported in the sense that it ran but it was much slower because it had not been optimized. How much of the software which has been compiled for Power has been well-tested, much less optimized?
The instruction set is not as critical as you might think, and ARM has the huge advantage of a lot of working implementations which you can already buy.
Semiconductor design teams don't exactly grow on trees, either. It was over a decade from Apple buying PA Semi whole (https://en.wikipedia.org/wiki/P.A._Semi) to announcing the M1.
And of course if you're going to do that you already need the rest of the vertically integrated pipeline to build motherboards to put your chips on, peripheral IP to do all the other things other than processing, etc.
I would think USB would be more of an OS issue, though. If a USB driver came as a blob, my Talos II obviously couldn't run it, but otherwise pretty much all USB stuff just works if I have source code for it (Fedora). Page size can ruin your day for some devices -- I have to hack FireWire, for example, because it assumes a 4K page size and Fedora uses 64K pages -- but so far no issues with USB or AMD GPUs.
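(For anyone writing that kind of code, the portable fix is to ask the kernel instead of assuming 4K; a trivial sketch:)

    /* Query the page size at run time rather than hardcoding 4096,
       which silently breaks on 64K-page kernels like Fedora ppc64le. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", page); /* 65536 here, 4096 on x86_64 */
        return 0;
    }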
Optimization is a bigger issue, though autovectorization in compilers is making it less of a problem than it used to be, as well as us nerd pioneers getting things upstreamed.
Although it's one of the "different" things, along with memory model, the page size is the same for Fedora aarch64 if I remember the discussion right.
Is POWER9 compilation generally poor compared with other targets? It didn't seem so to me, apart from a pathological case on one benchmark set which IBM addressed swiftly. They've been supporting GCC for rather a long time.
That used to be the case until https://www.spinics.net/linux/fedora/fedora-kernel/msg12805.... which moved aarch64 to 4K pages. I argue there should be a workstation ppc64le spin that does this, but I understand why Fedora doesn't want to put the releng resources towards a niche audience.
I agree IBM's autovectorization support in gcc is quite good, but there's no substitute for hand-rolled assembly sometimes.
Quite. I can't think of any reason other than retrocomputing that someone would want an OpenSPARC.
I think people forget that the economies are different with hardware vs software; because you cannot eradicate the per-unit cost of hardware, paying a small part of that in license fees is not a big deal, especially since it comes with integration support that saves you a lot of non-recurring R&D expense. Whereas in software, being free makes it zero-friction and this has a huge impact on adoption.
As others have already said, ARM has the economy of scale which POWER doesn't have. And SPARC seems to be dead.
Also, ARM has always (?) been about getting the most out of limited resources, whereas POWER is about performance at any cost. With modern ARM designs, the performance is getting close to, and even exceeding, that of traditional desktop and server CPUs, while still being frugal with resources. POWER is still, well, power-hungry.
Agreed. Power frugality is absolutely crucial, especially at larger, enterprise scale. Unless you're in HPC, you're looking for a good balance of peak performance-per-watt and the lowest idle consumption, with a very good scaling down story. ARM has long done a better job of this than x86. It's one of the main reasons the chipset has been absolutely dominant in the smartphone market.
The biggest single cost for AWS etc. is per-rack running costs, encompassing power, cooling etc. It's hard to overemphasise just how much this dwarfs all other costs. To optimise those costs you've got to cut down the power consumption and associated heat production.
One could also ask why people are using/pushing Rust and RISC-V over Ada and OpenPOWER. The latter are simply not cool and get you nothing for Resume Driven Development. And no company wants to bet on them, because without other companies sharing some of the cost of the ecosystem, no one can sustain it by themselves. (That is why we need marketing.)
And finally, what benefits do POWER and SPARC bring to the table? The licensing cost from ARM is tiny in the grand scheme of things. I like open ISAs like POWER and OpenSPARC, but from a business POV it just doesn't make any sense.
Radiation hardening is mostly done with mature old nodes with lower error rates, plus packaging. There is additional compute to check whether results are correct (I remember it was n+2?). So it has nothing to do with the ISA itself (at least as far as I know).
I would be sure you would for any off-the-shelf open designs. POWER9 and POWER10 are power-hungry (no pun intended) but are very powerful (I will neither confirm nor deny whether a pun was intended here), more than any ARM server option.
For "big POWER" like POWER8/9/10, they are clearly not positioned at that market. However, there are small Power ISA chips for embedded systems and companies like NXP still make them (the Amiga community even tries to shoehorn these into desktop systems, to their detriment, IMHO), and IBM has done "little POWER" versions of big POWER chips (the G5 being a scaled-down POWER4 with AltiVec, for example).
The long version is https://www.talospace.com/2020/01/another-amiga-you-dont-wan... but the tl;dr version is that the Amiga diehards who would buy this still want to use Amiga as their daily drivers, yet these are CPUs that a 15-year-old-plus Power Mac would mop the floor with. It's just handing their detractors another stick for a fresh beating. As a strictly retrocomputing solution that wouldn't be a problem, but that's not how these newer Amigas are positioned and by playing into the "Power is dead/underpowered" trope they're bad for the community as a whole.
With ARM you have a World of resources, with POWER or SPARC you have 2 abandoned platforms. There’s a reason they are “open” now and nobody wants to hold that bag.
I know they design their parts, but do they actually assemble them? If not, how many fabs can build ARM boards compared to POWER or SPARC boards? Is the performance/power ratio so important compared to the price of being tied to fewer companies?
This is very exciting. The missing JIT has made web browsing much less pleasant than it should be and is currently the only significant issue I have with my POWER9 workstation.
What's the context for this? Is there a new JIT compiler for Javascript in Firefox? Or is it a 3rd-party "add-on" that improves performance on some specific machine that this company makes?
It's JIT support for the OpenPOWER architecture, which is interesting, but as far as I can tell isn't exactly in wide use right now. At least not where you might need Firefox. https://openpowerfoundation.org/
EDITED: derp, confused: There are POWER Workstations from Raptor Computing Systems (https://www.raptorcs.com/). Talospace are POWER enthusiasts. Thanks amock for the correction.
They are targeted at datacenters. The latest iteration has hardware support for decimal floating point, flexible IO (DDR3, DDR4, DDR5, GDDR6, HBM, PCIe 5, NVLink), cores with SMT8, a Tbps intra-chip network, etc.
It can change the endianness at run time.
Their new cache architecture is very unusual: a CPU can use the cache from another chip, and data is stored encrypted.
If the system can change the endianness at any time, does that mean that we should only be using palindromic data? Or is it that we should aim to make everything polyglotic such that both directions have valid but distinct interpretations?
You joke, but I learned at a talk comparing genetics to TCP networking (at HOPE, maybe 2014 or 2016; can't find it on the website atm) that DNA is encoded such that it expresses different proteins depending on which direction it is read. Might be something to learn from.
Sounds like an architecture full of features you'd like to avoid when you want to run on battery, pack as many of them as possible into a datacenter with a given heat-dissipation capacity, or simply when you're big enough to tailor a CPU design to your needs because you can buy from a chip-manufacturing-as-a-service company. Truly an architecture for a different century.
It's more something you'd do at boot, if you had to select between an OS built for one or another.
The POWER ISA was used in PowerPC which was used for the successors of a few 68k machines (most famously the Macintosh) and in that case the OS was built for big-endian. So having big-endian support was key there.
IBM i and AIX still run big-endian, in fact. Important for IBM's institutional customers.
As for endian shifts, technically every OpenPOWER chip goes big for every OPAL call into the low-level HAL, even if the OS is little. The overhead is minimal. I can't think of much application use for that, though (per-page endianness which some PowerPCs supported is much more useful).
The CPU probably works in just one endianness and converts the data format when reading from memory. The overhead is in keeping track of when to do it. But I'm speculating; I haven't looked into this.
Isn't a better allocation of scarce engineering resources on the POWER platform to implement an RDP client to a commodity x86 desktop environment for the commercial consumer experience on the web, with the benefit of offsetting and isolating potential security breaches to that environment? Has anyone made the POWER CPU a raw node for crunching, styled on the Plan 9 idea of cpu%?
POWER9 is interesting (compared to RISC-V) because, right now, you can buy up to a 24 core, SMT4 system (96 threads), running dual CPU sockets if you like, and supporting up to 1 TB of memory per socket, with a maximum boosted clock speed of 3.8 GHz.
All this with fully open firmware, and an open ISA (as of the last couple years). The CPU implementation itself is not open, but all firmware and procedures for initializing the CPU are open. For people interested in that sort of thing, it's appealing as a practical computer with a full PCIe implementation with actually decent performance, compared to essentially every other open source platform.
In a sense they are, because with those core counts, you're comparing to Epyc and Xeon, which are similarly very expensive.
What they're really missing is a midrange product for a midrange price. I can't blame them for avoiding the low end, but can't I get anything for less than $2000?
While you're right, that's certainly not by choice but rather stems from the fact that right now, workstation-class OpenPOWER boards are a rather small market.
You have to design the board for this server-class chip and break even on the costs for that, plus manufacturing a board that can actually hold these kinds of chips.
So while it's unfortunate, it's not a case of ignoring the low end deliberately but mostly flows from the economic realities of not having anywhere near the addressable market of x86 or ARM.
The small community of ppc64(le) enthusiasts is very much hoping for a future where this changes, however small that chance might be...
Catch 22: Either the hardware could still be used in some system, so used stuff is expensive because some companies pay through the nose for spares, or the hardware is way too old for that, in which case it's an expensive collector's item. If it's very old and common it goes in the crusher.
This seems to be universally true for all kinds of UNIX workstations and servers.
While that may be true from a cost/performance point of view, the point of the Talos is to keep the system as transparent as possible (schematics, open source firmware etc.).
If that is not a concern for you it doesn't matter I guess.
But for some people it is, and the Talos is the most attractive board out there from a performance point of view if that kind of transparency is a thing for you.
Precisely. Since there's no equivalent of the Power ISA in "x86 land", it's hard to make a direct comparison (I don't believe that Intel and AMD formally consider themselves to share an architecture, and they both have slightly different instruction sets). The closest comparison would be if AMD or Intel released the source code for the PSP or ME respectively, along with all other ancillary firmware and documentation for the bring-up procedures, so that, without an NDA or business agreement, a third party could design a motherboard around a Threadripper or Xeon CPU, provide that to a customer, and allow the customer to make modifications to the firmware running on that motherboard.
From my point of view it is interesting because it provides an alternative to x86 and ARM. The more competition, the better; competition will push the hardware forward. Imagine if we had only one CPU architecture and one CPU maker.
We have far fewer CPU architectures than we used to have.
The first thing to know is there's a community of people using POWER9 workstations, whether for actual "work" or just as personal computers for, well, personal use. In both cases, web browsers are very important.
These devices support basically all the DRM GPU drivers in Linux, and when coupled with Mesa, have very fast and responsive GUIs. Both Firefox and Chromium run well, but up until now, Firefox has been using a pure interpreter to run javascript. This is fine for sites not using much javascript, but a big chunk of the modern web quite literally loads megabytes worth of JS on a page, and it can really chug under the interpreter.
So it's pretty exciting that we'll have a second browser with a proper JS engine, Chromium being the first (IBM ported V8 to ppc64le for Node.js, but the port works for running Chromium as well).
Also, because it may not be clear: the other big arches (x86, amd64, 32- and 64-bit ARM, and I think MIPS, on either Chromium or Firefox, I don't recall) all already have JIT compilers, so this is less about boosting performance and more about reaching "baseline expected performance".
MIPS has a JIT on both, though it's mipsle, not "classic" BE MIPS like sgimips. It's more targeted to CPUs like Loongson.
Anyway, I'm trying to go as fast as I can to get an actual browser mounted. But passing the test suites in totality, run two different ways, suggests a high probability of success at this point.
Greatly looking forward to it. Thanks for all your work on Firefox and PowerPC at-large. Being able to use Firefox for the JS heavy sites I've had to use Chromium for will be very, very nice. :-)
Given that (A) at the very least my every attempt to buy a POWER9 system has been thwarted, mainly by the manufacturer themselves constraining availability or otherwise being unable to supply, and
(B) POWER10 has IP issues that have made it unattractive to Raptor CS, and
(C) they have in the past made noises about it not being worth the effort to continue to sell POWER9 systems to the public because of the support overhead
...I have to ask if this is effort well-spent, or if the sweat would be better poured into something with a more certain future.
Not that RISC-V boards currently are anywhere near fast enough for daily driver use, but I'm leaning in that direction.
Author here. For me it's well spent, since this is my daily desktop driver. I can't comment on (C) or (A), though I agree (B) is a problem. But while RISC-V has a "future" (or at least a more distributed one) I see no system currently or in the near future that's anywhere in the same performance ballpark. The architecture has a lot of potential but it feels to me like it remains unrealized. OpenPOWER exists today in competitive specifications and notwithstanding supply chain issues, you can get one (I have three).
Between the various keywords in the title and the domain, it took me about a minute and a resigned click to figure out whether this was about browsers or rockets
POWER9 is a family of processors. This is an announcement that Firefox's JavaScript engine can now run JS code fast on those processors (via JIT compilation). It's somewhat interesting because OpenPOWER is a competitive alternative to Intel's and AMD's processor architectures, and also to ARM (used mostly in mobile).