> most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory
Turns out memory access speed is more or less the entire game for everything except scientific computing or insanely optimized code. In the real world, CPU frequency seems to matter much less than DRAM timings, for example, in everything but extremely well engineered games. It'll be interesting to learn (if we ever do) how much of the "real-world" 25% performance gain is solely due to DDR5.
I remember getting my AMD K8 Opteron around 2003 or 2004 with the first on-die memory controller. Absolutely demolished Intel chips at the time in non-synthetic benchmarks.
> everything except scientific computing or insanely optimized code
For insanely unoptimized code, such as accidentally ending up writing something compute-intensive in pure Python, it's very plausible for it to be compute constrained -- but less because of the hardware and more because 99% or 99.9% of the operations you're asking the CPU to perform are effectively waste.
I'm skeptical of this. I would expect many of those wasted instructions to be expensive memory instructions. I'm pretty sure regular Python code allocates a lot and the interpreter spends lots of time chasing after memory references.
Certainly, but many "wasteful" operations are reading/writing main memory unnecessarily - and since a single trip out to memory and back can take many many CPU cycles, typically optimizing memory access time does more for "bad code" than optimizing the number of CPU cycles per second. But obviously, you're right too - faster is faster after all :)
The team that designed the original Arm CPU in 1985 came to the conclusion that bandwidth was the most important factor influencing performance - they even approached Intel for a 286 with more memory bandwidth!
In the 90’s there were people trying to solve this problem by putting a small CPU on chip with the memory and running some operations there. I routinely wonder why memory hasn’t gotten smarter over time.
Distributing work to leverage faster memory locality is hard. It's not quite what you're talking about, but consider the Cell processor used in the PS3 - its compute capability in HPC space was supposed to be prodigious, but even having faster (streaming) access to RAM had the tradeoff of dealing with code dispatch. (It's not a perfect example, the SPE model was also just kind of a pain, but you have to think about how to get your code local to the memory you want, how to keep allocation nonrivalrous, etc. - it's a lot!)
Heh, funny you mention this, because I was _so excited_ about the CELL architecture when it was announced - exactly because of memory read/write speed. Then I tried to actually write some code for it, got depressed, and moved back to x86 for the next decade :(
I suspect some future improvement on borrow checkers will facilitate doing this to a degree. But it's likely to be one of those things that only comes into being when someone needs it very badly.
For some problems, there is a choice of how to organize their data structures: one layout that requires random access, and another that is mostly sequential access.
The latter might take an order of magnitude more space, while still being faster.
An example of such a problem is the Cuckoo Cycle Proof-of-Work [1].
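To make the trade-off concrete, here's a rough C sketch (not taken from the Cuckoo Cycle code; the sizes, the bucket count, and the overflow handling are arbitrary assumptions of mine). Both functions bump counters for a stream of random keys. The first scatters single increments across a huge array; the second spends extra memory on a partitioned scratch copy of the keys so that each pass only touches a cache-sized slice of the counters.

```c
#include <stdint.h>
#include <stdlib.h>

#define NKEYS   (1u << 26)     /* 64M one-byte counters, far larger than any cache */
#define M       (1u << 26)     /* 64M increments to apply */
#define NBUCKET 256            /* each bucket covers NKEYS/NBUCKET counters (~256 KB) */

/* Random-access version: every increment is a scattered write, mostly DRAM misses. */
void count_naive(uint8_t *counts, const uint32_t *keys) {
    for (size_t i = 0; i < M; i++)
        counts[keys[i]]++;
}

/* Mostly-sequential version: pass 1 streams the keys and appends them to per-range
 * buckets (extra memory roughly 2x the key stream); pass 2 processes one bucket at
 * a time, so its slice of the counter array stays cache-resident. */
void count_bucketed(uint8_t *counts, const uint32_t *keys, uint32_t *scratch) {
    size_t fill[NBUCKET] = {0};
    const size_t cap = (size_t)M / NBUCKET * 2;     /* slack; this sketch drops overflow */
    for (size_t i = 0; i < M; i++) {
        uint32_t b = keys[i] / (NKEYS / NBUCKET);
        if (fill[b] < cap)
            scratch[(size_t)b * cap + fill[b]++] = keys[i];
    }
    for (size_t b = 0; b < NBUCKET; b++)
        for (size_t j = 0; j < fill[b]; j++)
            counts[scratch[b * cap + j]]++;
}

int main(void) {
    uint8_t  *counts  = calloc(NKEYS, 1);
    uint32_t *keys    = malloc((size_t)M * sizeof *keys);
    uint32_t *scratch = malloc((size_t)M * 2 * sizeof *scratch);   /* ~512 MB scratch */
    for (size_t i = 0; i < M; i++)                                  /* spread keys over the full range */
        keys[i] = (((uint32_t)rand() << 16) ^ (uint32_t)rand()) % NKEYS;
    count_bucketed(counts, keys, scratch);   /* or count_naive(counts, keys); time them separately */
    return counts[0] & 1;                    /* keep the work from being optimized away */
}
```

Whether (and by how much) the second version wins depends on the key distribution and the machine, but it's the same shape of trade-off: burn memory to turn random access into mostly-sequential access.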
Certain instruction sequences prevent you from fully utilizing memory bandwidth and can decrease it by as much as 40-60%, so this statement is not true. For example, not using the store and load buffers in the ways they were designed to be used will lead to subpar performance for no apparent reason.
Can you elaborate? I have a side project(1) where all profilers I've used give a very muddled picture, so I'm very interested in the question of what slows down code on a modern "big" CPU with wide dispatch, a few kB of decoded ops buffer and a lot of OoO hardware.
(1) It encodes and decodes a protocol from a potentially untrusted source, so there is obviously a lot of waiting for previous results. That much is clear; however, I expected profilers to show me some causal link between the serial nature and the slow execution, and they don't. I have tried perf, Valgrind-Callgrind and AMD μProf (because I have a Ryzen CPU in both of my main private computers). I'm not sure if the tools suck, my test cases suck, or I just don't know how to interpret the tools' results - the seemingly unreliable assignment of cost to lines of code is my main problem. Maybe the stupid things these profilers are designed to catch (most of optimization is about not doing stupid things; after that it gets properly hard) aren't the stupid or unavoidable things my code is doing.
KDAB's hotspot is quite nice for analyzing perf recordings, and I suggest looking at stall cycle and "cycles with less than X uops dispatched" events to sample on.
Yes, attributing to lines in code is hard for optimized compiler output, but it can (in the continuous release/`hotspot-git` AUR package) attribute to the disassembly of a function.
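For what it's worth, here's a tiny C sketch (my own toy, not your project) of why a serial dependency chain tends to show up as stall-cycle / low-IPC samples rather than as one obviously expensive source line: the dependent walk below issues the same loads as the independent sum, but each address is only known once the previous load has returned.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 24)   /* 16M entries (~64 MB), well beyond the last-level cache */

int main(void) {
    uint32_t *next = malloc((size_t)N * sizeof *next);

    /* Sattolo's shuffle: produces a single random cycle, so the dependent
     * walk below really visits all N entries in a cache-hostile order. */
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Dependent chain: the next load can't start until this one finishes.
     * Sampling on stall-cycle-type events (exact names vary by CPU) lands
     * almost entirely here, even though no single line looks "expensive". */
    uint64_t a = 0;
    for (uint32_t i = 0, p = 0; i < N; i++) { p = next[p]; a += p; }

    /* Independent accesses to the same data: the out-of-order core keeps many
     * loads in flight, so this is many times faster despite "doing the same work". */
    uint64_t b = 0;
    for (uint32_t i = 0; i < N; i++) b += next[i];

    printf("%llu %llu\n", (unsigned long long)a, (unsigned long long)b);
    free(next);
    return 0;
}
```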
Also, using Intel tools gives you much clearer answers. There are a bunch of scripts on top of perf that try to automate this; you can find them in the pmu-tools git repo.
The basic workflow is always the same: find out which part of the pipeline is the bottleneck. Is it stuck on PCIe I/O, memory, instruction decoding, etc.? In my experience it is memory most of the time, due to heavy pointer dereferencing, and that is just the sad state of programming languages these days.
There are a bunch of memory benchmark tests that demonstrate these effects very well in the NUMA tools package source repository.
The problem with memory being the slow part is that it affects instruction fetch/decode cycle as well.
Poor instruction selection and sequencing can impact throughput because of port saturation (e.g. in the case of AVX2), leaving other dispatch ports idle; premature store buffer flushes, where you send 1-entry updates instead of sending stuff in chunks, can leave 70% of store buffer bandwidth idle; etc.
There are a lot of foot guns in a modern processor, and usually it's a mixture of these problems with one being dominant. A lot of the time it is not possible to completely address the problem without rebuilding the software from scratch and properly leveraging hardware knowledge from the beginning when architecting your code.
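As one concrete (if simplified) illustration of the "send stuff in chunks" point, here's a sketch of the software write-combining trick used in radix partitioning. Both functions scatter elements into 256 output partitions; the second stages them in small per-partition buffers and writes each destination in cache-line-sized bursts instead of one element at a time. This is my own hedged sketch, not anyone's production code, and how much it helps depends heavily on the microarchitecture.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NPART 256
#define CHUNK 16   /* 16 x 4-byte elements = one 64-byte cache line per flush */

/* One scattered store per element: writes land on NPART different cache lines
 * in an unpredictable order, which is hard on the store buffer and the RFO
 * machinery. out[p] must be pre-sized to hold all of partition p. */
void partition_direct(uint32_t *out[NPART], size_t tail[NPART],
                      const uint32_t *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t p = in[i] & (NPART - 1);      /* toy partition function */
        out[p][tail[p]++] = in[i];
    }
}

/* Buffered version: stage CHUNK elements per partition in a small local buffer
 * (a few KB, cache-resident), then copy whole chunks out, so stores to the big
 * destination arrays arrive as full-line bursts. */
void partition_buffered(uint32_t *out[NPART], size_t tail[NPART],
                        const uint32_t *in, size_t n) {
    uint32_t buf[NPART][CHUNK];
    size_t fill[NPART] = {0};
    for (size_t i = 0; i < n; i++) {
        uint32_t p = in[i] & (NPART - 1);
        buf[p][fill[p]++] = in[i];
        if (fill[p] == CHUNK) {
            memcpy(&out[p][tail[p]], buf[p], sizeof buf[p]);
            tail[p] += CHUNK;
            fill[p] = 0;
        }
    }
    for (size_t p = 0; p < NPART; p++) {       /* flush whatever is left over */
        memcpy(&out[p][tail[p]], buf[p], fill[p] * sizeof(uint32_t));
        tail[p] += fill[p];
    }
}
```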
I find this claim hard to believe, honestly. Could you point to examples where performance is limited by DRAM speed and not by the CPU / caches? They must be applications with extremely bad design causing super low cache hit rates.
> They must be applications with extremely bad design causing super low cache hits
Yep, this is exactly the case - also, systems that are busy and context-switching often flush their CPU caches more frequently. Combine the two - busy systems running loads of un-optimized code - and boom, you have described how most computers run in the real world. This is why "synthetic" benchmarks - well designed code running on quiet machines - more or less track CPU frequency exclusively.
The AMD Opteron 240 at 1.4GHz keeps up with chips at close to 2x its frequency - and its memory access times are close to half as costly (i.e. almost all the performance gain from 2x frequency is made up by the halved memory access time). This makes sense, but remember these are well optimized applications (POV-Ray and Lightwave were extremely synthetic). In the real world, opening 10 misc Windows applications from 2003, the K8 (particularly when overclocked) was a _beast_.
Well, the Opteron is an ancient processor; I don't think we can draw any conclusions based on that. Today's server processors have enormous caches compared to the Opteron.
Honestly, in the cases you mention - badly designed processes killing the CPU - I fail to see how faster RAM makes a huge difference.
I mean - the people who designed Graviton3 seem to agree with my premise, so at least that's some validation. Alternatively, do some CPU profiling on your workstation - a massive amount of time is simply waiting for memory returns.
This is why you dedicate entire machines to the same kind of load. All application code or all database. There was a moment where people tried to integrate—which was indeed faster but only for very limited use cases.
In data compression, inverting a BWT with large blocks or using Context Mixing to compress large blocks (which requires huge context maps). These 2 cases require a lot of random memory accesses.
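To show what I mean for the BWT case, here's a minimal sketch of the standard LF-mapping inversion (the textbook algorithm, not any particular library's code; conventions for the primary index vary between implementations). Once the mapping is built, reconstruction is a single chain of data-dependent lookups into an array the size of the whole block, so with large blocks nearly every step is a cache (and TLB) miss.

```c
#include <stdint.h>
#include <stdlib.h>

/* bwt:     the transformed block (last column), length n
 * primary: index of the row holding the original string
 * out:     receives the reconstructed block, length n        */
void inverse_bwt(const uint8_t *bwt, size_t n, size_t primary, uint8_t *out) {
    size_t count[256] = {0}, start[256], seen[256] = {0};
    size_t *lf = malloc(n * sizeof *lf);

    for (size_t i = 0; i < n; i++) count[bwt[i]]++;
    for (size_t c = 0, sum = 0; c < 256; c++) { start[c] = sum; sum += count[c]; }
    for (size_t i = 0; i < n; i++)                 /* LF mapping: row of the preceding character */
        lf[i] = start[bwt[i]] + seen[bwt[i]]++;

    size_t p = primary;
    for (size_t i = n; i-- > 0; ) {
        out[i] = bwt[p];
        p = lf[p];      /* the hot part: one random access into an n-sized array per output byte */
    }
    free(lf);
}
```

The context-mixing case is the same story, just with even larger hash-indexed context maps being probed per symbol.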
I thought I’d heard that Java VMs go to great lengths to maintain cache locality? I’d be curious to hear from the Lisp folks, because I always hear that Lisps can be surprisingly performant.
Even then, today's CPUs have enormous caches, and not every part of a program is chasing pointers. You can't make a crappy application much faster just because you have faster RAM.
Well, I disagree with pretty much everything in the claims.
First, most real unoptimised code faces many issues before memory bandwidth. During my PhD, the optimisation guys doing spiral.net sat next door, and they produced beautiful plots of what limits performance for a bunch of tasks and how each optimisation they did removed an upper-bound line, until at last they got to some bandwidth limitation. Real code will likely have false IPC dependencies, memory latency problems due to pointer chasing, or branch mispredictions well before memory bandwidth.
Then the database workload is something I would consider insanely optimized. Most engines are in fierce performance competition. And normally they hit the memory bandwidth in the end. This probably answers why the author is not comparing to EPYC instances that have the memory bandwidth to compete with Graviton.
Then, the decisions they tout - choosing not to implement SMT, and using DDR5 - both come from their upstream providers.
Wouldn't SMT be a feature that you are free to use when designing your own cores? I'm assuming Amazon has an architectural license (Annapurna acquisition probably had them, this team is likely the Graviton design team at AWS). So who is the upstream provider? ARM?
And if they designed the CPU wouldn't they decide which memory controller is appropriate? Seems like AWS should get as much credit for their CPUs as Apple gets for theirs.
Bottom line for Graviton is that a lot of AWS customers rely on open source software that already works well on ARM. And the AWS customers themselves often write their code in a language that will work just as well on ARM. So AWS can offer its customers tremendous value with minimal transition pain. But sure, if you have a CPU-bound workload, it'll do better on EPYC or Xeon than Graviton.
> I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).
ARM is just a design; AWS brought it to market. ARM-based server processors are still thin on the ground. IIRC Equinix Metal and Oracle Cloud offer them (Ampere chips), but not GCP or Azure.
We've tested Graviton2 for data warehouse workloads, and it came out about 25% cheaper and about 25% faster than comparable Intel-based VMs. Still crunching the numbers, but that's the approximate shape of the results.
Yeah, the tone of these talks is kind of weird. They talk about how "we decided to do foo" when the reality is "we updated to the latest tech from our upstream providers which got us foo".
There is an important distinction between designing an SoC with ARM-provided cores and designing the cores from scratch. People in this thread are comparing AWS’s achievement to the M1, but that’s in a totally different ballpark. Obviously it’s still hard to design custom silicon and custom servers around it, but it’s fair to say that’s a far cry from “optimizing the cores for the workloads that run on EC2”, as has been suggested in this thread.
The rest of the server has to be designed too, since they can’t just buy from Dell or some other OEM and put Graviton into it. At their scale this means management software and hardware too, which is a right old pain in the butt to design and deploy.
They do not "just buy from x". They have their own motherboard designs, and they also got Amazon-flavored Intel CPUs. It was just a question of time before they started producing their own CPUs. A vertically integrated stack pays off in the long run.
A recurring theme is "build a processor that performs well on real workloads".
It occurs to me that AWS might have far more insight into "real workloads" than any CPU designer out there. Do they track things like L1 cache misses across all of EC2?
Reality varies. It's a truism in optimization that the only valid benchmark is the task you are trying to accomplish. These chips have been optimized for an average of the tasks run on AWS (which is entirely sensible for them), but that doesn't mean they'll be the best for your specific job.
They'll definitely have information that traditional CPU designers won't. Check out this talk from Brendan Gregg (he's probably lurking), where he specifically calls this out:
There is a strong internal mandate for internal services to switch over to Graviton. So they likely either have this data, or are just trying to free up more x64 cores for external customers.
They may not run those particular services on the same hosts but they heavily use Lambda (and docker) which can share hosts and be tossed around the data centers to saturate cores.
I think the flaw in Linus’ argument is that this happened in the 90s-2010s for x86. A foundational time, especially for his worldview, but I don’t know that the pattern repeats (some of his viewpoint is colored by his time at Transmeta).
The development world today looks very different. Back then, language support for other architectures was more bespoke, and CPU vendors had to add support for their chips. Today, there are plenty of very rich, platform-agnostic (both CPU and OS) libraries. Additionally, mobile development has sufficiently matured ARM development that I don’t think that argument holds. If it did, then developers wouldn’t be able to develop on their x86 MacBooks and deploy to their mobile Apple devices (yes, the Mac is ARM now, but it hasn’t been for the majority of that time). I think the plain x86-box -> x86-server story was pretty solid, but the cloud has changed that. Everyone is now starting out in the cloud with CPU-agnostic languages, where switching architectures is usually as simple as changing one line in a config. In some cases it matters, but the vast majority of SW dev shops don’t feel this like you used to in the 90s and 00s. Plus, M1s now provide developers with local ARM development.
Linus's reasoning is sound, but the issue is that ARM development platforms are becoming a thing and to be honest I see x86 as being in the early stages of a death spiral and so does Intel the way they're focusing on the fabrication side of their business.
If anything programmers are adopting ARM based computers faster than the rest of the market. As pretty much every developer tool gets ported for Apple silicon every company is going to shrug and go "May as well release an ARM Windows/ARM Linux build as well".
I totally agree with everything you said except that devs are switching faster. I think the first to switch were low-end Chromebooks and Surface Go-type devices. The M1 is pulling devs and professionals in, and gaming will be the last holdout (due to optimized IP that may be abandonware and never updated).
The good thing I see at work is that we all make everything work for x86 and ARM, so we can deploy on any kind of cloud platform CPU and not worry about that anymore.
We've been migrating our production to Graviton2 (now Graviton3). Our developers run x86 Macs. Everything runs on the JVM, Python, Node, Go, so nobody feels like there's a difference. The ARM transition has been transparent for us.
Linus' reasoning makes sense, but the real world disagrees with him (at least in our case).
Linus's argument is that devs will use the same processors in production that they develop on. But everyone already has to develop for ARM because mobile runs ARM. And now the M1 Macs do too (and these AWS servers). So if you're forced to use ARM because of mobile and now there are good options for desktop and servers to use ARM as well, I don't see why people wouldn't switch to them. Basically Linus's own logic seems to contradict his claim.
That is, as far as the reasoning applies, why I consider the M1 Macs so pivotal. The MacBook Pro was already a very popular machine for developers. Now it is not only much faster and better, it also gives developers a great ARM development machine - be it for the largest market, smartphones, or for cloud offerings built on ARM machines such as Graviton.
1. Fewer and fewer people run their stack on the laptop. There is tooling today, like Bazel and Docker, to run even unit tests remotely pretty painlessly.
2. With languages like Java, Go, Python, and Node it doesn’t even matter.
Agree with 1. I'm part of 3. Regarding 2, it does matter for anything that has bits of optimized C code that was only built for x86. I have a lot of Node and Python things that don't run natively on my M1 (they even crash in qemu x86 VMs, whatever CPU features I emulate).
Right, which highlights why C is an outdated language that didn’t fully live up to its original promise. Linus’s entire argument seems to hinge on the premise that most userland devs still use it as their primary language. Eventually those Python and Node bugs will get fixed, and people will largely not care.
Don't forget Ampere's A1. I found them really, really impressive for SAT solving, and the fact that you can get them at 1 cent/core/hour at Oracle makes them really financially attractive.
5 or 6 years ago Marc Andreessen was saying this would happen eventually. I was skeptical when I first heard the claim, but it's seeming more and more likely.