So little about chip success comes down to just the core. Sure, the core sets some basic performance aspects: clock range, area, and thus cores per chip. But what about inter-core performance? The memory interface?
There's still no real synergy between CPUs and GPUs, even though they get less different with time. No one seems to have a decent plan for how to balance local memory resources with cache-coherency over CXL. Inter-system networking seems to be frozen in 1980 (just with faster serdes). Does in-chip and inter-chip optical change anything, or does it just mean littering the place with transceivers?
Yes, I was thinking of HSA, but they don't seem to be working on it anymore. I think it turned out to be hard, and therefore expensive, and they got distracted with staying alive by developing decent processors.
I think we could argue that tensorflow and/or pytorch has displaced any HSA interest, too. These are programming interfaces that do an OK job of abstracting from the hardware details, and are totally embraced by the most demanding field (AI).
They abandoned it at their lowest point to focus on Zen. Now they seem to be picking up the slack. The upcoming Instinct MI300 APU brings HBM as unified system memory, together with hardware cache coherency between CPU cores and GPUs within and across NUMA nodes.
In the wee hours of this morning, I was trying to figure out why my attempts to build LLVM 17.0.3 from source were failing because it couldn't find an HSA-related symbol.
My impression was that the build-system's support for using HSA was a little wonky, but I'm not sure if that's fair. IIRC I worked around the issue by not trying to build LLVM's openmp code.
It's a real shame that AMD didn't embrace HSA for the long term, it's the future of computing. Intel's QuickSync is very popular with certain folks for streaming, and HSA would have taken that a step further.
My understanding is that later pre-Zen APUs and Zen chips all essentially have the HSA hardware side... it's just that AMD never managed to prepare a long-term coherent software offering like nvidia did with CUDA. So outside of one-offs or very closed platforms like consoles, the HSA capabilities were just... lying there, used as super fast PCI-e alternative to access the GPU.
Now consider that with Phoenix Point APUs you have Zen 4 CPU, memory controllers, IO controller, RDNA3 GPU, and the XDNA "transputer-like" coprocessor, all hanging off common cache coherent transport (Infinity Fabric). All the hardware parts are there.
Seems this is the approach Apple has taken with the M chips. Everything is unified, more or less. Maybe one day in the future the CPU will just be one FPGA that recompiles itself for a given use case, so code paths become hardware paths, and then it goes back to general-purpose for the next task.
It seems like everyone settles for ugly, barely usable interfaces like CUDA and OpenCL, and then scabs them over with a higher-level interface like TensorFlow.
>No one seems to have a decent plan for how to balance local memory resources with cache-coherency over CXL
I have seen a patent application for a technology from {big hardware company} addressing exactly this. Dunno if the approach will work, but there are definitely plans.
We have a few programming languages today that can deal with NUMA, but we still design CPUs for C.
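Even in C, NUMA awareness has to be bolted on through a library rather than the language itself. Here is a hedged sketch of what "dealing with NUMA" looks like in practice today, using Linux's libnuma (build with -lnuma); the node numbers and sizes are just illustrative:

    /* Hedged sketch: explicit NUMA placement with libnuma on Linux.
     * The language itself knows nothing about nodes; the library does. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }
        int last = numa_max_node();
        size_t len = 64UL * 1024 * 1024;

        /* Ask for memory local to node 0 and to the last node. */
        char *near = numa_alloc_onnode(len, 0);
        char *far  = numa_alloc_onnode(len, last);
        if (!near || !far)
            return 1;

        memset(near, 1, len);   /* touch it so it is actually backed */
        memset(far, 1, len);
        printf("allocated %zu MiB on node 0 and node %d\n", len >> 20, last);

        numa_free(near, len);
        numa_free(far, len);
        return 0;
    }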
As caches have gotten bigger I’ve wondered when or if we will treat them as local memory instead of trying to maintain an illusion of cache coherency at a high cost.
On x86 the cache is implicit when you touch memory, and there are many ISA ops (push, call, ...) that implicitly update stack memory.
But if you never call those, it is possible to access the cache explicitly. Coreboot has a custom in-tree compiler, ROMCC [1], that compiles C to a subset of x86 that never emits memory operations.
They use this to treat cache as RAM [2] for a brief period during early boot before the RAM itself gets initialized.
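To give a flavor of that constraint, here is a hedged sketch of the style as I understand it (not actual coreboot code): with no RAM and no stack yet, every value has to stay in a register, so no arrays, no address-of, nothing that could spill to memory.

    /* Hedged sketch of ROMCC-friendly C, not actual coreboot code.
     * Everything below is pure register arithmetic: no arrays, no
     * pointers, no address-of, so nothing forces a memory operand. */
    static unsigned int ns_to_ticks(unsigned int ns, unsigned int mem_clk_mhz)
    {
        /* ticks = ceil(ns * MHz / 1000); e.g. 15 ns at 200 MHz -> 3 ticks */
        return (ns * mem_clk_mhz + 999) / 1000;
    }

    static unsigned int fold_checksum(unsigned int a, unsigned int b,
                                      unsigned int c, unsigned int d)
    {
        unsigned int sum = a + b + c + d;     /* lives in a register */
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carries back in */
        return sum & 0xffff;
    }

In real romstage code the interesting values come from port I/O and PCI config cycles rather than function parameters, but the register-only discipline is the same.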
Does anyone know a good resource for explaining what each of the components in the CPU diagrams does? For example, what are the main functions of, and differences between, rename/dispatch, the reorder buffer, and the scheduler?
chipandcheese does a very good job comparing architectures but sometimes I struggle to really understand the meaning of their diagrams.
Maybe it's more than you want, but "Computer Architecture - A Quantitative Approach" from Hennessy and Patterson is a very good reference.
If your main interest is CPU cores, even a relatively old edition will be enough. From memory, recent editions (5 and 6, 6 being the latest) have mostly added content on GPUs and data center computing, but the sections on CPU architecture are rather stable.
I strongly recommend picking up H&P's "Computer Organization and Design: The Hardware/Software Interface". I think you can only really appreciate their other great book, "A Quantitative Approach", after you properly learn a new architecture; then the big ideas behind it will really speak to you.
Organization and Design has different editions for RISC-V, aarch64 and MIPS. The MIPS one is what I learned computer architecture on and it's super solid.
You've got to pick up the basics of modern processor design (such as from H&P). Without understanding basic pipelining, all the prediction, OoO, and superscalar stuff won't make any sense. And I think you probably need to try some asm programming for any of this to really click.
Are there any viable ARM servers available in a homelab price range? Every time I look I only see the $2000 enterprise 2U servers. I have outgrown the Raspberry Pis but still love the idea of moving my homelab to ARM.
The fastest ARM-based computers that are not more expensive than better Intel or AMD based computers are those with CPUs having Cortex-A76 cores (the same cores as 2019 smartphones). "Not more expensive" here means that a complete, fully-equipped computer with DRAM and SSD should not exceed the $200 to $250 range, where computers with an Intel N100 or an older Zen 3 Ryzen 5 mobile CPU can be found.
In this class, there are many models with RK3588, having e.g. dual 2.5 Gb/s Ethernet ports (with the possibility of attaching more Ethernet NICs on USB 3 or on PCIe M.2 adapters) and supporting PCIe 3 x4 M.2 SSDs and/or eMMC (and the attachment of more SSDs on USB). (The model that I like most is NanoPC-T6, which exploits best all the interfaces of RK3588, but without adding things that should better be added externally, only when they are needed, like the additional USB hub present in many other models.)
A cheaper option, but with much slower peripheral interfaces, is the new Raspberry Pi 5 model.
Nevertheless, a homelab server with an ARM CPU makes sense only for developing AArch64 applications.
For just doing the job there are many small and cheap fanless computers with Intel N100 (4 E-cores) and 4 to 8 2.5 Gb/s Ethernet ports (typically sold on Amazon as firewall appliances). For only a few dollars extra it is possible to find much faster cheap small computers (from companies like Minisforum or Beelink) with older AMD Zen 3 CPUs, like the 6-core Ryzen 5 5600H.
For a higher price of $500 to $600, there are small computers with AMD Ryzen 9 7940HS, which can support e.g. dual M.2 PCIe 4 x4 SSDs and SATA SSDs, and dual 10 Gb/s Ethernet NICs (on Thunderbolt), dual 2.5 Gb/s ports on the MB + many other 2.5 Gb/s ports on USB, while being faster than big and expensive servers from some years ago.
NanoPC-T6 looks great. One of my concerns is about the software that the company is providing. For example, how can I trust that the OS images of this company don't contain any spyware?
The software cannot be trusted, but it is easy to replace everything but the Linux kernel with another Linux distribution, including one compiled from sources (like Gentoo). This should be doable just by following the generic AArch64 installation instructions of that distribution.
The recompilation of the Linux kernel may be more difficult, because the right configuration file and modules must be selected before doing it, but it should be possible, as most support for RK3588 is included in the mainline kernel. U-Boot (the boot loader that loads the Linux kernel) should also be recompilable from sources.
The hardware is more trustworthy than that of Intel or AMD computers, because it comes with the complete schematics and the technical reference for RK3588 is much more complete than for any Intel or AMD CPU.
A hardware backdoor could have been implemented only in the Ethernet interface of RK3588, but that is not used in the NanoPC-T6, which uses Realtek Ethernet interfaces on PCIe lanes. Any hardware backdoor in those would have required close and secret cooperation between a major Taiwanese company and a major mainland Chinese company, which is hard to believe.
I'm a noob at this, but at a first glance it looks difficult to get all the peripherals and hardware acceleration working when compiling from sources. Anyway, I am really keen on getting one.
The hardware acceleration that is problematic is that of the Arm Mali GPU and of the video codecs.
This is normally provided in all Arm SoCs by binary blobs. Nevertheless, at least for the Mali GPU there is a reverse-engineered driver in the Linux kernel, which might be usable with RK3588.
In any case, this hardware acceleration is the same in all RK3588 boards, regardless of the vendor, it is not specific to NanoPC-T6.
> how can I trust that the OS images of this company don't contain any spyware?
Software from board manufacturers shouldn't be used anyway, if only because in a few years it is often discontinued and no longer updated, as they push newer models. Thankfully we have Armbian and DietPi, which are the distros of choice for all boards that don't run major PC-oriented distros (and a nice alternative for those that do).
The number of boards supported by these two distros is astonishing.
Intel had in development a "rugged NUC" with the Atom-branded equivalent of the Intel N100 CPU, which supported ECC and which was expected to have a low price.
Unfortunately, I assume that this product was canceled a few months ago, when Intel sold their NUC business to ASUS.
There are a few small computers that support ECC and which use obsolete Intel Tiger Lake or Tremont-core-based Intel Elkhart Lake CPUs, but those CPUs are a dead end, being slow and supporting instruction sets that are different from the current mainline Intel CPUs, so I would not recommend any of them.
The best remaining choice depends on which is more important: size and power consumption, or the price of the server.
For very small size and low power consumption I am not aware of any good solution at a reasonable price, because even when some of the Arm CPU SoCs or Intel or AMD mobile CPUs support ECC, I have not seen any such computer board that includes the ECC support. There are some industrial computers with ECC, but those are expensive for what they offer.
If only cost is the problem, and second-hand servers are avoided because the server to be bought is intended to be used for many years, then a server with a desktop Intel or AMD CPU must be used. The MBs with the Intel W480 chipset are expensive, so the cheapest solution is to use one of the AM5 MBs that specify ECC memory support, e.g. from ASUS or ASRock Rack, together with one of the cheaper Ryzen 7000 parts.
Another option is an older AM4 MB, like the Mini-ITX ASRock Rack X570D4I-2T ($400 due to including dual 10 Gb/s Ethernet ports), which has the advantage of using cheaper older Ryzen 5000 CPUs, with cheaper DDR4 ECC memory, so the total system cost would be reasonable.
The only disadvantage of the desktop Ryzen CPUs when used as servers is that, even if they have excellent energy efficiency when they are actually running programs, they have a relatively high idle power consumption, because only the cores are shut down when doing nothing, while the I/O die has a permanent consumption around 20 W or more. Therefore one must choose between the low idle power consumption of a few watts of the laptop CPUs and the ECC memory support of the desktop CPUs.
Because in my home lab most servers alternate between times when they are used intensively and times when they stay idle for hours or days, all of them except one server that is permanently connected to the Internet are used with Wake-on-LAN, so they are shut down when idle, for negligible power consumption.
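For anyone doing the same: the Wake-on-LAN "magic packet" is simple enough to generate yourself. A hedged C sketch (the MAC address is a placeholder, and port 9 is just the usual convention) that broadcasts the standard 6 bytes of 0xFF followed by 16 copies of the target MAC over UDP:

    /* Hedged sketch: send a Wake-on-LAN magic packet via UDP broadcast. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder MAC of the machine to wake -- replace with your own. */
        const unsigned char mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

        /* Magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times. */
        unsigned char pkt[6 + 16 * 6];
        memset(pkt, 0xFF, 6);
        for (int i = 0; i < 16; i++)
            memcpy(pkt + 6 + i * 6, mac, 6);

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }
        int on = 1;
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof on);

        struct sockaddr_in dst = { 0 };
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9);                  /* discard port, by convention */
        dst.sin_addr.s_addr = htonl(INADDR_BROADCAST);

        if (sendto(s, pkt, sizeof pkt, 0,
                   (struct sockaddr *)&dst, sizeof dst) < 0)
            perror("sendto");
        close(s);
        return 0;
    }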
RK3588-based boards are gaining traction. Upstream support for U-Boot and Linux has been improving rapidly.
I bought an Orange Pi 5+ to replace my x86-based ODROID-H2+. I will start the installation in the next few weeks. I have seen enough anecdotes of people running NixOS on this chip, so I feel pretty optimistic about it.
Why? Cost, performance, and power consumption should be in the same ballpark as x86, but support for x86 is still significantly better. There's nothing magical about running a workload on ARM. The math obviously changes when you have thousands of machines.
How solidly does it run? Is it able to make use of all of what is currently making the Mac Studio so interesting for LLMs? And does it handle all the power management as it should? These are honest questions, as I'm in need of a new workstation and eyeing a Mac Studio even though I have never used any Apple products before (with the exception of iPods).
You can also just run macOS and use Tart, Docker, or UTM to run VMs/containers for your services. Asahi isn't bad either, but for some people it's a bit more hassle to set up and maintain versus macOS.
That is because performance isn't the only metric. Things like cost make a big difference, as they would need to make it decently priced for silicon board partners to buy. Just the amount of L2 cache in the M1 is 3x-6x that of the X3 core, which adds cost to the chip.
The M1/M2 are amazing chips; however, you need to consider that these chips are in different price brackets, hence the performance discrepancy.
Yes it is for performance. But it is performance at a reasonable cost.
The new Qualcomm Oryon chips are more powerful but also much more expensive. So the X lineup is actually quite reasonable if you want smartphones to be "affordable".
Oh, these are cellphone chips. I guess we’ll have to wait for the next Neoverse before we get interesting comparisons to Apple’s M chips and Intel/AMD’s mainline ones.
If you're referring to the Tensor SoCs, no, they're not using custom ARM cores. They're using the reference Cortex cores from ARM, because Tensor is pretty similar to the Exynos.
Yes, but that says you can’t phone it in. Google’s senior managers are not sweating Pixel performance unless it will lead to a decline in ad revenues - contrast with AI where they very quickly realized that people asking ChatGPT questions would mean zero ad sales.
Apple mainly does two things: Much larger caches ($$$ but they have the margins) and memory inside the CPU package (shorter and faster connections, but can't upgrade memory).
[Edit] + Buying up all state of the art production capacities so competition is one node behind.
There is no Apple secret sauce.
As long as the others don't want to go that route - and they don't seem to need to cut into their margins (AMD shows how X3D helps with performance).
I think what is especially interesting for Intel/AMD is that Xiaomi drops legacy 32-bit ARM and translates apps to 64-bit.
Dropping 16/32-bit support can reduce die size, which can be used for larger caches at the same price.
> Buying up all state of the art production capacities so competition is one node behind.
This is such a funny statement. Do people think Apple is dumping wafers into the ocean? Or buying the capacity and not using it?
The economic reality is that Apple can pay more for cutting edge process because they have higher prices and margins. So, people paying a premium for hardware get more advanced hardware.
How is this in any way surprising? Is the theory that if only Apple wasn’t willing to pay a premium, TSMC would sell the same wafers cheaper to other manufacturers? Wouldn’t that make TSMC 1) dumb, and 2) less profitable and therefore less able to invest in the next process?
The memory is not inside the package any more than on any other flip-chip or PoP SoC, i.e. every mobile AP SoC made in the past 5 years. Please stop propagating this myth.
One of Apple's actual secret sauces is they can make their big caches fast. Typically latency increases with cache size so it's a tradeoff. Apple trades off less here. And it's not some "only fast because tsmc" it's just really solid engineering at both the architectural and physical design level.
It's a board-space and cost-saving measure, but it does not change performance. The tooling is also expensive, and Intel has its own mature internal packaging processes.
The DRAMs on an Apple chip are still bog-standard LPDDR. Most benchmarks find the actual memory performance middle of the road at best.
Critically they aren't magically on the die or any more inside the package than most other high end mobile chips.
The RAM is ordinary PoP; you got lied to by Apple marketing. If you acted on this marketing and spent money, then re-programming yourself will be very difficult, with the brain actively fighting every step to prevent cognitive dissonance.
Anyway the point is, this is not a meaningful performance benefit as it's still just off the shelf LPDDR5. In fact the M SoCs tend to underperform in memory latency tests.
> + Buying up all state of the art production capacities so competition is one node behind.
From what I read, before that, there’s “paying billions to get state of the art production capabilities built”
Chances are that capacity wouldn’t be there without Apple’s money, so if Apple didn’t exist, it still wouldn’t be available to others as rapidly as it is now.
> There is no Apple secret sauce.
They didn’t always have loads of money, so, historically, there must have been something else than “they have loads of money and large margins, so can afford to buy the best”.
I think there still is something more than that. For example, it also is about having the courage to decide that milled aluminum is a better way to build laptop chassis, so spending billions on buying/creating the capacity to build millions of such chassis is a good idea, or to decide that, at their size, building your own CPUs is worth doing.
I think part of their secret sauce also is that they have higher standards for what they want to sell. Take for example foldable screens. They must have prototypes with them, but don’t have a product because they don’t deem them good enough.
Having fairly high standards is, I think, a consistent perk of theirs. In particular, they don’t seem to let anything slip below a sort of entry-level enthusiast quality; might not be the best at anything in particular but there’s nothing the Apple device will be truly awful at.
But the Apple that stayed alive in the 90's-early 00's is pretty different from modern Apple. Modern Apple makes some of the best chips out there. Old Apple stayed afloat by selling a Unix-derived OS, first on PowerPC and later on commodity x86.
Except for secret sauce like super wide instruction decode and enough registers to keep all their execution units filled[0], sure I guess there's no secret sauce.
Caches are only useful when they're serving execution units and Apple packed their chips with them. That's special sauce. If it wasn't special then every ARM chip would have the same levels of performance. It's not like the M1 was Apple's first chip. The A-series have been kicking the shit out of other ARM chips for almost a decade. If Apple didn't have any special sauce in their chip designs this wouldn't have been the case. It's not like Qualcomm doesn't have good chip designers and hasn't tried to compete with Apple's chips.
Super-wide instruction decode won't help you much unless you're able to feed and retire those instructions at a consistent pace. This means being able to keep your ALUs busy, and for that to happen there are plenty of problems to solve, but two major bottlenecks in CPU design are (1) branch prediction in the CPU frontend and (2) hiding memory latency in the CPU backend. Both of those are tightly coupled to the instruction- and data-cache design.
Coincidentally, both of those caches are unusually large in the Apple M design - 192KB of instruction cache and 128KB of L1 data cache, per core (!). The same goes for the L2 - 3MB per (performance) core.
When compared to bleeding-edge _server_ CPUs from AMD and Intel, it's striking that those figures are several times larger in the Apple M design. E.g. Zen 3 Epyc - 32KB of instruction cache, 32KB of L1 data cache and 512KB of L2. Intel Xeon Gold - 32KB of instruction cache, 64KB of L1 data cache and 1.25MB of L2.
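To see why those sizes matter, the classic experiment is a pointer-chasing loop: grow the working set and watch the time per load jump each time it spills out of a cache level. A hedged sketch (my own, nothing vendor-specific; pass the element count on the command line):

    /* Hedged sketch: measure average load latency for a given working set.
     * Sattolo's algorithm builds one random cycle so the prefetcher
     * cannot guess the next line. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        size_t n = argc > 1 ? strtoull(argv[1], NULL, 0) : (1u << 20);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        for (size_t i = 0; i < n; i++)
            next[i] = i;
        srand(1);
        for (size_t i = n - 1; i > 0; i--) {      /* Sattolo: one big cycle */
            size_t j = (size_t)rand() % i;        /* j strictly below i */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t p = 0, iters = 50u * 1000 * 1000;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = next[p];                          /* serially dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%zu KiB working set: %.2f ns per load (p=%zu)\n",
               n * sizeof *next / 1024, ns / iters, p);
        free(next);
        return 0;
    }

Run it with sizes that fit in L1, L2, L3 and DRAM, and the steps in the output line up with the cache hierarchy.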
I don't think that's quite true. It's clearly a combination of better microarchitecture (very wide decode, 128 byte cache lines, etc), and also massively bigger area budgets. Maybe more the latter, but it's pretty clear that Apple is right at the top of the "good microarchitecture" leader board.
A device shipping in 2024 being almost as fast as one which shipped several years earlier is … crazy? We don’t even know what the final product will be like - they have a history of not delivering due to yield or thermal issues - but I hope they pull it off because better competition benefits everyone.
Edit: this is the benchmark I'm basing my comment on, finding it a bit slower than the 2022 A16.
The Xiaomi 14 (using the Snapdragon 8 Gen 3) has been announced and reviewed. Pre-orders have started, and it ships in November.
>being almost as fast as one which shipped several years earlier
The A16 has been shipping for a little more than a year. I could have compared it to the A17 Pro and the answer would still be the same, since the A17 Pro is only slightly faster, mostly from a clock speed improvement. And comparing the Cortex X4 vs the A17 Pro clock for clock, the X4 would land within a 10% range.
>is … crazy?
Considering 99.99999999% of the internet said Apple will always be 3-5 years ahead and ARM (or Snapdragon, or any other non-Apple CPU design) will never be able to match Apple, despite the X3 being proof to the contrary and the X4 finally showing it: yes, I find it pretty crazy.
As noted, I’m relying on early leaked benchmarks showing near-A16 level performance - we’ll see how those devices perform when they ship and people can run tests on production hardware. I’m perfectly willing to change my “in 2024” to “in November 2023 for China, 2024 everywhere else” - I don’t have an emotional stake in this.
> Considering 99.99999999% of the internet said Apple will always be 3-5 years ahead and ARM ( or Snapdragon or any other non Apple CPU design ) will never be able to match Apple
Serious citation needed on that. Most people said Apple did a good job and that Qualcomm needed to step it up considerably to stay in the game. If they have, that’ll be great for many millions of buyers so competition will have worked exactly the way we wanted it to.
The switch from A to X is really just marketing. They call them X-series cores, but they're just larger and more powerful application processors. They're 100% intended to run full operating systems; they implement the ARMv9-A ISA :)
Ah yes already looking forward to everyone cutting support for armv7 package building on apt, just like they did for v6 when v8 was 'the thing'. This rolling cycle of incompatibility and obsolescence is so goddamn infuriating.
Do note that ARMv6 and ARMv8 are entirely incompatible ISAs; one is 32-bit and another is 64-bit. Generally one doesn’t cut ARMv6 because ARMv8 exists, they drop all 32-bit support all at once.
> Do note that ARMv6 and ARMv8 are entirely incompatible ISAs; one is 32-bit and another is 64-bit.
IIRC, ARMv8 has both 32-bit (AArch32) and 64-bit (AArch64); yes, ARM's naming is confusing. What's being dropped is 32-bit (AArch32), similar to what's happening in the x86 world (and AFAIK also the Linux on mainframe world), and for similar reasons.
It was dropping 32-bit builds this gen; the next one will maybe be some secure boot nonsense like Windows required for 11, or whatever-thing compatibility. There's always some random excuse.
You joke, but the MC68000 is actually still in wide use haha. At some point we'll have 32 bit microcontrollers that are fast enough and have enough memory to run what will then be modern linux distros and we'll be sorry that we deprecated everything related to it.
Or maybe everything will just be 64 bit from now onward, idk.
The Cortex X2 has already been shipping in phones for more than a year (as the big core in the Snapdragon 8 Gen 1). On Geekbench 6 it scores around 1500-1600 single core, which makes it comparable to an Intel 11th gen laptop core or a low-power (U series) AMD Zen 3 core.
Even the successor Cortex X3 is already available in phones (Snapdragon 8 Gen 2 or Dimensity 9200). It benches around 1800 on Geekbench, which is comparable to a Zen 3+ low-power U laptop core, or a lower-clocked Intel 12th gen U laptop core.
For comparison, the latest Raspberry Pi 5 features A76 cores, which bench around 900 in phone implementations, comparable to 8th gen Intel cores. Apple's A13 scores around 1600, the M1 around 2200.
Hashed perceptron branch prediction is pretty much industry standard and has been since the 2010s (it arrived in academia around 2000 and in AMD Piledriver in 2012); it's just up to marketing whether it's called a "neural network" or not.
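For anyone curious what that actually looks like, here is a hedged sketch of the basic (non-hashed) perceptron predictor from the Jimenez/Lin papers - an illustration of the idea, not any shipping design:

    /* Hedged sketch of a basic perceptron branch predictor.  A table of
     * weight vectors is indexed by the branch PC; the dot product of the
     * weights with the global history gives the prediction, and training
     * happens on a mispredict or when the output is below a threshold. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define HIST   32                        /* global history length */
    #define TABLE  1024                      /* number of perceptrons */
    #define THRESH ((int)(1.93 * HIST + 14)) /* training threshold from the paper */

    static int8_t weights[TABLE][HIST + 1];  /* w[0] is the bias weight */
    static int8_t history[HIST];             /* +1 = taken, -1 = not taken */

    static int8_t clamp(int v) { return v > 127 ? 127 : v < -128 ? -128 : v; }

    bool predict(uint64_t pc, int *y_out)
    {
        int8_t *w = weights[pc % TABLE];
        int y = w[0];
        for (int i = 0; i < HIST; i++)
            y += w[i + 1] * history[i];
        *y_out = y;
        return y >= 0;                       /* predict taken if non-negative */
    }

    void train(uint64_t pc, int y, bool taken)
    {
        int t = taken ? 1 : -1;
        int8_t *w = weights[pc % TABLE];
        if ((y >= 0) != taken || abs(y) <= THRESH) {
            w[0] = clamp(w[0] + t);
            for (int i = 0; i < HIST; i++)
                w[i + 1] = clamp(w[i + 1] + t * history[i]);
        }
        for (int i = HIST - 1; i > 0; i--)   /* shift in the new outcome */
            history[i] = history[i - 1];
        history[0] = (int8_t)t;
    }

The "hashed" variants mostly change how the table is indexed (folding history segments into the hash with the PC); the arithmetic stays this simple, which is part of why it's practical in hardware.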
Am I right in thinking that the larger load and store queues compared to a Zen core are enabled partially by more relaxed memory ordering rules for A64 compared to x86?
Upper management seeing a really straight line between performance and dollars.
Right now there are steps in between which means they care more about selling the design and its variations. If there is a PPA miss you could point the finger in a few directions
It is always a good time to re-read the book "Only the Paranoid Survive" [1], where Andy Grove, CEO of Intel at the time, has a specific chapter about RISC/CISC.
That is chapter 6: "Signal" or "Noise". The core wisdom here is that it is obviously difficult to separate signal from noise. I would say that now, in 2023, it is probably even more difficult, because the flow of data has increased while mental/group focus has declined.
In that chapter, Andy Grove recognized the merits of both technologies; they even developed RISC chips. The preference for CISC came down to maintaining compatibility with the extremely successful 386 line of microprocessors, recognizing that pursuing another technology was resource intensive and would go against the extremely popular and fruitful 386 line of business.
One more thing: the book indeed addressed the problem we see today with Intel, ARM, and mobile devices - he calls them 10X forces and inflection points. Obviously Intel hasn't recognized the concepts clearly defined in the book.
BTW, I think this is a book to recommend to any entrepreneur, since it addresses timeless concepts, and you can return to it any time you gain more experience, to clarify your experiences.
Do they still have it? He left around the turn of the century so it seems like a better argument would be that having an immensely profitable company puts you at risk of having caretaker CEOs who will be optimizing for Wall Street analysts rather than success.
A better target would be Itanium: that was in his era and showed the danger of picking the wrong gamble. They wanted a proprietary platform which couldn't be copied by AMD, Cyrix, etc., but they staked the whole exercise on a highly speculative CPU design and made a number of horrible tactical errors, like trying to screw a few million in compiler licensing out of developers at the very time their multi-billion bet-the-company investment crucially depended on developers porting to a chip that was critically dependent on having the best compiler. Being paranoid about preventing competition led them to be shockingly cavalier about whether their plans were robust.
Intel has embraced the complacent culture for 30 years. Perf/watt is meaningless, phones won’t be a big market, nothing can displace x86 in the datacenter, nobody else can fab like we do…