> most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory
Turns out memory access speed is more or less the entire game for everything except scientific computing or insanely optimized code. In the real world, CPU frequency seems to matter much less than DRAM timings, for example, in everything but extremely well engineered games. It'll be interesting to learn (if we ever do) how much of the "real-world" 25% performance gain is solely due to DDR5.
I remember getting my AMD K8 Opteron around 2003 or 2004 with the first on-die memory controller. Absolutely demolished Intel chips at the time in non-synthetic benchmarks.
> everything except scientific computing or insanely optimized code
For insanely unoptimized code, such as accidentally ending up writing something compute-intensive in pure Python, it's very plausible for it to be compute constrained -- but less because of the hardware and more because 99% or 99.9% of the operations you're asking the CPU to perform are effectively waste.
I'm skeptical of this. I would expect many of those wasted instructions to be expensive memory instructions. I'm pretty sure regular Python code allocates a lot and the interpreter spends lots of time chasing after memory references.
Certainly, but many "wasteful" operations are reading/writing main memory unnecessarily - and since a single trip out to memory and back can take many many CPU cycles, typically optimizing memory access time does more for "bad code" than optimizing the number of CPU cycles per second. But obviously, you're right too - faster is faster after all :)
The team that designed the original Arm CPU in 1985 came to the conclusion that bandwidth was the most important factor influencing performance - they even approached Intel for a 286 with more memory bandwidth!
In the 90’s there were people trying to solve this problem by putting a small CPU on chip with the memory and running some operations there. I routinely wonder why memory hasn’t gotten smarter over time.
Distributing work to leverage faster memory locality is hard. It's not quite what you're talking about, but consider the Cell processor used in the PS3 - its compute capability in HPC space was supposed to be prodigious, but even having faster (streaming) access to RAM had the tradeoff of dealing with code dispatch. (It's not a perfect example, the SPE model was also just kind of a pain, but you have to think about how to get your code local to the memory you want, how to keep allocation nonrivalrous, etc. - it's a lot!)
Heh, funny you mention this, because I was _so excited_ about the CELL architecture when it was announced - exactly because of memory read/write speed. Then I tried to actually write some code for it, got depressed, and moved back to x86 for the next decade :(
I suspect some future improvement on borrow checkers will facilitate doing this to a degree. But it's likely to be one of those things that only comes into being when someone needs it very badly.
For some problems, there is a choice of how to organize their data structures: one layout that requires random access, and another that is mostly sequential access.
The latter might take an order of magnitude more space, while still being faster.
An example of such a problem is the Cuckoo Cycle Proof-of-Work [1].
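To make the trade-off concrete, here's a rough C sketch (not taken from the Cuckoo Cycle code; the sizes, the bucket count, and the overflow handling are arbitrary assumptions of mine). Both functions bump counters for a stream of random keys. The first scatters single increments across a huge array; the second spends extra memory on a partitioned scratch copy of the keys so that each pass only touches a cache-sized slice of the counters.

```c
#include <stdint.h>
#include <stdlib.h>

#define NKEYS   (1u << 26)     /* 64M one-byte counters, far larger than any cache */
#define M       (1u << 26)     /* 64M increments to apply */
#define NBUCKET 256            /* each bucket covers NKEYS/NBUCKET counters (~256 KB) */

/* Random-access version: every increment is a scattered write, mostly DRAM misses. */
void count_naive(uint8_t *counts, const uint32_t *keys) {
    for (size_t i = 0; i < M; i++)
        counts[keys[i]]++;
}

/* Mostly-sequential version: pass 1 streams the keys and appends them to per-range
 * buckets (extra memory roughly 2x the key stream); pass 2 processes one bucket at
 * a time, so its slice of the counter array stays cache-resident. */
void count_bucketed(uint8_t *counts, const uint32_t *keys, uint32_t *scratch) {
    size_t fill[NBUCKET] = {0};
    const size_t cap = (size_t)M / NBUCKET * 2;     /* slack; this sketch drops overflow */
    for (size_t i = 0; i < M; i++) {
        uint32_t b = keys[i] / (NKEYS / NBUCKET);
        if (fill[b] < cap)
            scratch[(size_t)b * cap + fill[b]++] = keys[i];
    }
    for (size_t b = 0; b < NBUCKET; b++)
        for (size_t j = 0; j < fill[b]; j++)
            counts[scratch[b * cap + j]]++;
}

int main(void) {
    uint8_t  *counts  = calloc(NKEYS, 1);
    uint32_t *keys    = malloc((size_t)M * sizeof *keys);
    uint32_t *scratch = malloc((size_t)M * 2 * sizeof *scratch);   /* ~512 MB scratch */
    for (size_t i = 0; i < M; i++)                                  /* spread keys over the full range */
        keys[i] = (((uint32_t)rand() << 16) ^ (uint32_t)rand()) % NKEYS;
    count_bucketed(counts, keys, scratch);   /* or count_naive(counts, keys); time them separately */
    return counts[0] & 1;                    /* keep the work from being optimized away */
}
```

Whether (and by how much) the second version wins depends on the key distribution and the machine, but it's the same shape of trade-off: burn memory to turn random access into mostly-sequential access.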
Certain instruction sequences prevent you from fully utilizing memory bandwidth and can decrease it by as much as 40-60%, so this statement is not true. For example, not using the store and load buffers in the ways they were designed to be used will lead to subpar performance for no apparent reason.
Can you elaborate? I have a side project(1) where all profilers I've used give a very muddled picture, so I'm very interested in the question of what slows down code on a modern "big" CPU with wide dispatch, a few kB of decoded ops buffer and a lot of OoO hardware.
(1) It encodes and decodes a protocol from a potentially untrusted source, so there is obviously a lot of waiting for previous results. That much is clear; however, I expected profilers to show me some causal link between the serial nature and the slow execution, and they don't. I have tried perf, Valgrind-Callgrind and AMD μProf (because I have a Ryzen CPU in both of my main private computers). I'm not sure if the tools suck, my test cases suck, or I just don't know how to interpret the tools' results - the seemingly unreliable assignment of cost to lines of code is my main problem. Maybe the stupid things these profilers are designed to catch (most of optimization is about not doing stupid things; after that it gets properly hard) aren't the stupid or unavoidable things my code is doing.
KDAB's hotspot is quite nice for analyzing perf recordings, and I suggest looking at stall cycle and "cycles with less than X uops dispatched" events to sample on.
Yes, attributing to lines in code is hard for optimized compiler output, but it can (in the continuous release/`hotspot-git` AUR package) attribute to the disassembly of a function.
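For what it's worth, here's a tiny C sketch (my own toy, not your project) of why a serial dependency chain tends to show up as stall-cycle / low-IPC samples rather than as one obviously expensive source line: the dependent walk below issues the same loads as the independent sum, but each address is only known once the previous load has returned.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 24)   /* 16M entries (~64 MB), well beyond the last-level cache */

int main(void) {
    uint32_t *next = malloc((size_t)N * sizeof *next);

    /* Sattolo's shuffle: produces a single random cycle, so the dependent
     * walk below really visits all N entries in a cache-hostile order. */
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Dependent chain: the next load can't start until this one finishes.
     * Sampling on stall-cycle-type events (exact names vary by CPU) lands
     * almost entirely here, even though no single line looks "expensive". */
    uint64_t a = 0;
    for (uint32_t i = 0, p = 0; i < N; i++) { p = next[p]; a += p; }

    /* Independent accesses to the same data: the out-of-order core keeps many
     * loads in flight, so this is many times faster despite "doing the same work". */
    uint64_t b = 0;
    for (uint32_t i = 0; i < N; i++) b += next[i];

    printf("%llu %llu\n", (unsigned long long)a, (unsigned long long)b);
    free(next);
    return 0;
}
```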
Also, using Intel tools gives you much clearer answers. There are a bunch of scripts on top of perf that try to automate this; you can find them in the pmu-tools git repo.
The basic workflow is always the same: find out which part of the pipeline is the bottleneck. Is it stuck on PCIe I/O, memory, instruction decoding, etc.? In my experience it is memory most of the time, due to heavy pointer dereferencing, and that is just the sad state of programming languages these days.
There are a bunch of memory benchmark tests that demonstrate these effects very well in the NUMA tools package source repository.
The problem with memory being the slow part is that it affects instruction fetch/decode cycle as well.
Poor instruction selection and sequencing can impact throughput because of port saturation (e.g. in the case of AVX2), leaving other dispatch ports idle; premature store buffer flushes, where you send 1-entry updates instead of sending stuff in chunks, can leave 70% of store buffer bandwidth idle; etc.
There are a lot of foot guns in a modern processor, and usually it's a mixture of these problems with one being dominant. A lot of the time it is not possible to completely address the problem without rebuilding the software from scratch and properly leveraging hardware knowledge from the beginning when architecting your code.
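As one concrete (if simplified) illustration of the "send stuff in chunks" point, here's a sketch of the software write-combining trick used in radix partitioning. Both functions scatter elements into 256 output partitions; the second stages them in small per-partition buffers and writes each destination in cache-line-sized bursts instead of one element at a time. This is my own hedged sketch, not anyone's production code, and how much it helps depends heavily on the microarchitecture.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NPART 256
#define CHUNK 16   /* 16 x 4-byte elements = one 64-byte cache line per flush */

/* One scattered store per element: writes land on NPART different cache lines
 * in an unpredictable order, which is hard on the store buffer and the RFO
 * machinery. out[p] must be pre-sized to hold all of partition p. */
void partition_direct(uint32_t *out[NPART], size_t tail[NPART],
                      const uint32_t *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t p = in[i] & (NPART - 1);      /* toy partition function */
        out[p][tail[p]++] = in[i];
    }
}

/* Buffered version: stage CHUNK elements per partition in a small local buffer
 * (a few KB, cache-resident), then copy whole chunks out, so stores to the big
 * destination arrays arrive as full-line bursts. */
void partition_buffered(uint32_t *out[NPART], size_t tail[NPART],
                        const uint32_t *in, size_t n) {
    uint32_t buf[NPART][CHUNK];
    size_t fill[NPART] = {0};
    for (size_t i = 0; i < n; i++) {
        uint32_t p = in[i] & (NPART - 1);
        buf[p][fill[p]++] = in[i];
        if (fill[p] == CHUNK) {
            memcpy(&out[p][tail[p]], buf[p], sizeof buf[p]);
            tail[p] += CHUNK;
            fill[p] = 0;
        }
    }
    for (size_t p = 0; p < NPART; p++) {       /* flush whatever is left over */
        memcpy(&out[p][tail[p]], buf[p], fill[p] * sizeof(uint32_t));
        tail[p] += fill[p];
    }
}
```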
I find this claim hard to believe, honestly. Could you point to examples where performance is limited by DRAM speed and not by the CPU / caches? They must be applications with extremely bad design causing super low cache hit rates.
> They must be applications with extremely bad design causing super low cache hits
Yep, this is exactly the case - also, systems that are busy and context-switching often flush their CPU caches more frequently. Combine the two - busy systems running loads of un-optimized code - and boom, you have described how most computers run in the real world. This is why "synthetic" benchmarks - well designed code running on quiet machines - more or less track CPU frequency exclusively.
The AMD Opteron 240 at 1.4GHz keeps up with chips at close to 2x its frequency - and its memory access times are close to half as costly (i.e. almost all the performance gain from 2x frequency is made up by the halved memory access time). This makes sense, but remember these are well optimized applications (POV-Ray and Lightwave were extremely synthetic). In the real world, opening 10 misc Windows applications from 2003, the K8 (particularly when overclocked) was a _beast_.
Well, the Opteron is an ancient processor; I don't think we can draw any conclusions based on that. Today's server processors have enormous caches compared to the Opteron.
Honestly, in the cases you mention - badly designed processes killing the CPU - I fail to see how faster RAM makes a huge difference.
I mean - the people who designed Graviton3 seem to agree with my premise, so at least that's some validation. Alternatively, do some CPU profiling on your workstation - a massive amount of time is simply waiting for memory returns.
This is why you dedicate entire machines to the same kind of load. All application code or all database. There was a moment where people tried to integrate—which was indeed faster but only for very limited use cases.
In data compression, inverting a BWT with large blocks or using Context Mixing to compress large blocks (which requires huge context maps). These 2 cases require a lot of random memory accesses.
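To show what I mean for the BWT case, here's a minimal sketch of the standard LF-mapping inversion (the textbook algorithm, not any particular library's code; conventions for the primary index vary between implementations). Once the mapping is built, reconstruction is a single chain of data-dependent lookups into an array the size of the whole block, so with large blocks nearly every step is a cache (and TLB) miss.

```c
#include <stdint.h>
#include <stdlib.h>

/* bwt:     the transformed block (last column), length n
 * primary: index of the row holding the original string
 * out:     receives the reconstructed block, length n        */
void inverse_bwt(const uint8_t *bwt, size_t n, size_t primary, uint8_t *out) {
    size_t count[256] = {0}, start[256], seen[256] = {0};
    size_t *lf = malloc(n * sizeof *lf);

    for (size_t i = 0; i < n; i++) count[bwt[i]]++;
    for (size_t c = 0, sum = 0; c < 256; c++) { start[c] = sum; sum += count[c]; }
    for (size_t i = 0; i < n; i++)                 /* LF mapping: row of the preceding character */
        lf[i] = start[bwt[i]] + seen[bwt[i]]++;

    size_t p = primary;
    for (size_t i = n; i-- > 0; ) {
        out[i] = bwt[p];
        p = lf[p];      /* the hot part: one random access into an n-sized array per output byte */
    }
    free(lf);
}
```

The context-mixing case is the same story, just with even larger hash-indexed context maps being probed per symbol.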
I thought I’d heard that Java VMs go to great lengths to maintain cache locality? I’d be curious to hear from the Lisp folks, because I always hear that Lisps can be surprisingly performant.
Even then, today's CPUs have enormous caches, and not every part of a program is chasing pointers. You can't make a crappy application much faster just because you have faster RAM.
Well, I disagree with pretty much everything in the claims.
First, most real unoptimised code faces many issues before memory bandwidth. During my PhD, the optimisation guys doing spiral.net sat next door, and they produced beautiful plots of what limits performance for a bunch of tasks and how each optimisation they did removed an upper-bound line, until at last they got to some bandwidth limitation. Real code will likely have false IPC dependencies, memory latency problems due to pointer chasing, or branch mispredictions well before memory bandwidth.
Then the database workload is something I would consider insanely optimized. Most engines are in fierce performance competition. And normally they hit the memory bandwidth in the end. This probably answers why the author is not comparing to EPYC instances that have the memory bandwidth to compete with Graviton.
Then, the decisions they tout - choosing not to implement SMT, and using DDR5 - both come from their upstream providers.
Wouldn't SMT be a feature that you are free to use when designing your own cores? I'm assuming Amazon has an architectural license (Annapurna acquisition probably had them, this team is likely the Graviton design team at AWS). So who is the upstream provider? ARM?
And if they designed the CPU wouldn't they decide which memory controller is appropriate? Seems like AWS should get as much credit for their CPUs as Apple gets for theirs.
Bottom line for Graviton is that a lot of AWS customers rely on open source software that already works well on ARM. And the AWS customers themselves often write their code in a language that will work just as well on ARM. So AWS can offer its customers tremendous value with minimal transition pain. But sure, if you have a CPU-bound workload, it'll do better on EPYC or Xeon than Graviton.
> I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).
ARM is just a design; AWS brought it to market. ARM-based server processors are still thin on the ground. IIRC Equinix Metal and Oracle Cloud offer them (Ampere chips), but not GCP or Azure.
We've tested Graviton2 for data warehouse workloads, and it came out about 25% cheaper and about 25% faster than comparable Intel-based VMs. Still crunching the numbers, but that's the approximate shape of the results.
Yeah, the tone of these talks is kind of weird. They talk about how "we decided to do foo" when the reality is "we updated to the latest tech from our upstream providers which got us foo".
There is an important distinction between designing an SoC with ARM-provided cores and designing the cores from scratch. People in this thread are comparing AWS’s achievement to the M1, but that’s in a totally different ballpark. Obviously it’s still hard to design custom silicon and custom servers around it, but it’s fair to say that’s a far cry from “optimizing the cores for the workloads that run on EC2”, as has been suggested in this thread.
The rest of the server has to be designed too, since they can’t just buy from Dell or some other OEM and put Graviton into it. At their scale this means management software and hardware too, which is a right old pain in the butt to design and deploy.
They do not "just buy from x". They have their own motherboard designs, and they also got Amazon-flavored Intel CPUs. It was just a question of time before they started producing their own CPUs. A vertically integrated stack pays off in the long run.
A recurring theme is "build a processor that performs well on real workloads".
It occurs to me that AWS might have far more insight into "real workloads" than any CPU designer out there. Do they track things like L1 cache misses across all of EC2?
Reality varies. It's a truism in optimization that the only valid benchmark is the task you are trying to accomplish. These chips have been optimized for an average of the tasks run on AWS (which is entirely sensible for them), but that doesn't mean they'll be the best for your specific job.
They'll definitely have information that traditional CPU designers won't. Check out this talk from Brendan Gregg (he's probably lurking), where he specifically calls this out:
There is a strong internal mandate for internal services to switch over to Graviton. So they likely either have this data, or are just trying to free up more x64 cores for external customers.
They may not run those particular services on the same hosts but they heavily use Lambda (and docker) which can share hosts and be tossed around the data centers to saturate cores.
I think the flaw in Linus’ argument is that this happened in the 90s-2010s for x86. A foundational time, especially for his worldview, but I don’t know that the pattern repeats (some of his viewpoint is colored by his time at Transmeta).
The development world today looks very different. Back then, language support for other architectures was more bespoke, and CPU vendors had to add support for their chips. Today, there are plenty of very rich, platform-agnostic (both CPU and OS) libraries. Additionally, mobile development has sufficiently matured ARM development that I don’t think that argument holds. If it did, then developers wouldn’t be able to develop on their x86 MacBooks and deploy to their mobile Apple devices (yes, the Mac is ARM now, but it hasn’t been for the majority of that time). I think the plain x86-box -> x86-server story was pretty solid, but the cloud has changed that. Everyone is now starting out in the cloud with CPU-agnostic languages, where switching architectures is usually as simple as changing one line in a config. In some cases it matters, but the vast majority of SW dev shops don’t feel this like you used to in the 90s and 00s. Plus, M1s now provide developers with local ARM development.
Linus's reasoning is sound, but the issue is that ARM development platforms are becoming a thing and to be honest I see x86 as being in the early stages of a death spiral and so does Intel the way they're focusing on the fabrication side of their business.
If anything programmers are adopting ARM based computers faster than the rest of the market. As pretty much every developer tool gets ported for Apple silicon every company is going to shrug and go "May as well release an ARM Windows/ARM Linux build as well".
I totally agree with everything you said except that devs are switching faster. I think the first to switch were low-end Chromebooks and Surface Go-type devices. The M1 is pulling devs and professionals in, and gaming will be the last holdout (due to optimized IP that may be abandonware and never updated).
The good thing I see at work is that we all make everything work for x86 and ARM, so we can deploy on any kind of cloud platform CPU and not worry about that anymore.
We've been migrating our production to Graviton2 (now Graviton3). Our developers run x86 Macs. Everything runs on the JVM, Python, Node, Go, so nobody feels like there's a difference. The ARM transition has been transparent for us.
Linus' reasoning makes sense, but the real world disagrees with him (at least in our case).
Linus's argument is that devs will use the same processors in production that they develop on. But everyone already has to develop for ARM because mobile runs ARM. And now the M1 Macs do too (and these AWS servers). So if you're forced to use ARM because of mobile and now there are good options for desktop and servers to use ARM as well, I don't see why people wouldn't switch to them. Basically Linus's own logic seems to contradict his claim.
That is, as far as the reasoning applies, why I consider the M1 Macs so pivotal. The MacBook Pro was already a very popular machine for developers. Now it is not only much faster and better, it also gives developers a great ARM development machine - be it for the largest market, smartphones, or for cloud offerings built on ARM machines such as Graviton.
1. Fewer and fewer people run their stack on the laptop. There is tooling today, like Bazel and Docker, to run even unit tests remotely pretty painlessly.
2. With languages like Java, Go, Python, and Node it doesn’t even matter.
Agree with 1. I'm part of 3. Regarding 2, it does matter for anything that has bits of optimized C code that was only built for x86. I have a lot of Node and Python things that don't run natively on my M1 (they even crash in qemu x86 VMs, whatever CPU features I emulate).
Right, which highlights why C is an outdated language that didn’t fully live up to its original promise. Linus’s entire argument seems to hinge on the premise that most userland devs still use it as their primary language. Eventually those Python and Node bugs will get fixed, and people will largely not care.
Don't forget Ampere's A1. I found them really, really impressive for SAT solving, and the fact that you can get them at 1 cent/core/hour at Oracle makes them really financially attractive.
5 or 6 years ago Marc Andreessen was saying this would happen eventually. I was skeptical when I first heard the claim, but it's seeming more and more likely.