Great practical information. Nice to see people who know what they are talking about putting data out there. I hope eventually these persistent HN memes about M1 memory will die: that it's "on-die" (it's not), that it's the only CPU using LPDDR4X-4267 (it's not), or that it's faster because the memory is 2mm closer to the CPU (not that either).
It's faster because it has more microarchitectural resources. It can load and store more, and it can do with a single core what an Intel part needs all cores to accomplish.
This seems to be a recurring theme with the M1, and one that, in a sense, actually baffles me even more than the alternative. There is no "magic" at play here, it's just lots and lots of raw muscle. They just seem to have a freakishly successful strategy for choosing what aspects of the processor to throw that muscle at.
Why is that strategy simultaneously remarkably efficient and remarkably high-performance? What enabled/led them to make those choices where others haven't?
I don't have any inside-Apple perspective, but my guess is having a tight feedback cycle between the profiles of their own software and the abilities of their own hardware has helped them greatly.
The reason I think so is that when I was at Google it was 7 years between when we told Intel what could be helpful and when they shipped hardware with the feature. Also, when AMD first shipped the EPYC "Naples" it was crippled by some key uarch weaknesses that anyone could have pointed out if they had been able to simulate realistic large programs, instead of tiny and irrelevant SPEC benchmarks. If Apple is able to simulate or measure their own key workloads and get the improvements in silicon in a year or two, they have a gigantic advantage over anyone else.
That's bizarre. As if CPU vendors were unable to run "realistic" workloads. If they truly aren't, that's because they are unwilling and then they are designing for failure and Apple can just eat their lunch.
As a data scientist, I feel this. Intel and AMD don't own an OS or an app store, and you might be surprised how hard it is to get good data. Data is the new gold. If a company can corner a piece of the market, it can collect data no one else can, and other companies are then often forced to partner with it or they can't properly provide services that will keep them competitive.
This makes me think that any sort of data advantage Apple may have has nothing to do with them owning an OS. Intel has a massive computer network, managed by their own IT team, just like any other large corporation. Intel could collect whatever performance data they want from actual users of actual programs just as easily as Apple could.
Apple doesn’t just control an operating system or an App Store. They also control a development toolchain and the two primary languages compiled for their platforms, as well as most of the frameworks used in commonly used apps (excepting the dreaded electron). They have a platform that’s been tailored to be profiled and optimized.
One early benchmark showed allocating and destroying an NSObject performing drastically better on the M1 vs recent Intel Macs. This wasn’t an accident. It’s probably not representative of performance overall. They have enough vertical integration to make their own first party solutions clear optimization targets.
This sort of "data," that optimizing contention free locks could have big rewards, isn't something that you need to control the OS or compiler/profiler/debugger toolchain to understand and learn. And for that matter, Intel has excellent compilers and profilers too.
All it takes is looking at what's going on in commonly used code, deciding to optimize for X, Y, and Z, and commit to it. If Intel isn't doing this already, that's all the fault of current management for not making it a priority.
The only way that Apple's vertical integration helped them make that management decision is that they were able to say "our customer is a typical laptop user." Intel tries to cater to much larger markets, so perhaps when management goes to plan a laptop chip, they are less aggressive with deciding to optimize. But I have a feeling that Apple's optimizations are generally good for nearly all code, not just for specific use cases.
There's really no explanation for why EPYC "Naples" was so bad other than AMD did not internally understand the performance of realistic large-scale programs. I mean even if they had taken anything off the shelf, for free, like MySQL, they could have determined at some point before mass production that their CPU, in fact, sucked. But they shipped it and prospective customers rejected it.
Don't discount how a weak organization can make poor decisions even when all necessary information seems to be readily available.
Apple is the only large company with a functional organization. Could that be it? Coupled with their unparalleled ownership of a family of platforms (Intel’s OS is comparatively nonexistent).
I think you're probably right, but it's funny because I think Intel was well known for having really exceptional organizational function in the past. They used to be paranoid about everything!
Intel do own an OS, Clear Linux, but they probably lack profiles of typical usage of that OS, and probably there are not many users of it apart from Phoronix when they do benchmarking.
It’s a big world out there. Workloads in data science vs gaming vs hft vs packet processing vs web servers are all extremely different.
Even if you know about them, you need an expert in each to truly push the hardware to the real limits that get hit in the respective industry. The small differences between real implementations and simulated loads can drastically alter the performance characteristics and cause proc manufacturers to miss the mark.
There’s another factor here, that I learned when I did a six month contract at Apple Retail Software Engineering.
In short, Apple has a number of 10x engineers that they move around to whatever project needs the most help, whether that is hardware, or application software, or operating system software, or services infrastructure, or whatever.
If some project starts getting enough negative “above the fold” coverage, then they will be temporarily gifted one or more of these special engineers.
Do that enough times, and those 10x engineers will gain enough experience in enough different areas that they will be able to reason well about the other end of whatever pipeline they’re on, and will know other 10x engineers that they can work with on that other end of the pipeline, because they had previously worked with them on some other project in the past months or years.
And those 10x engineers really will make a huge difference in what that project is capable of delivering.
The key failure of this operating mode is that most of the 10x engineers never get enough time to transfer much knowledge or skills to the others on the temporary team they are currently working with, and so things will start slowly deteriorating when they are necessarily moved on to the next project.
> I don't have any inside-Apple perspective, but my guess is having a tight feedback cycle between the profiles of their own software and the abilities of their own hardware has helped them greatly.
Also we see not the first chip but the first one that met their needs (demonstrably better performance on their workloads).
By which I mean: presumably macOS has been running on many generations of A processors, so they have had a lot of time to figure out which tweaks would be good and which turn out to be pessimization and overkill. It doesn't hurt that there is significant internal overlap between modern macOS and iOS.
Microsoft doesn't need to acquire Intel, they need to do what Apple did and acquire a stellar ARM design house that will build a chip with x86 translation, tailored to accelerate the typical workloads on Windows machines and sell those chips to the likes of Dell and Lenovo and tell developers "ARM Windows is the future, x86 Windows will be sunset in 5 years and no longer supported by us, start porting your apps ASAP and in the mean time, try our X86 emulator on our ARM silicon, it works great."
Microsoft proved with the XBOX and Surface series they can make good hardware if they want, now they need to move to chip design.
Microsoft has a pretty good relationship with AMD from the Xbox. AMD already made an Arm Opteron. Windows has been multiplatform since NT 3.1 (Alpha, MIPS), with PowerPC added in 3.51. You can download Windows for Arm for free and run it on a Raspberry Pi.
Microsoft has at least one homegrown processor that it has ported Windows and Linux to with the confusingly named 'Edge'.
Microsoft doesn't even _need_ to target Arm, they could easily team up with AMD or go the whole thing solo and target anything from RISC-V, Arm to an in-house ISA.
PowerPC was the last architecture standing. It was supported by NT 4.0 SP2 (technically this means Microsoft supported it on paid support contracts as recently as 2006).
Apple has at most 10% of the computer market and is just one player among many. I am skeptical Microsoft with their 90% dominance would or should be allowed this much power over the industry.
The traditional personal computer market isn't nearly as important as it used to be. However you slice the pie, there's no way you can define the pieces to be 10% Apple and 90% Microsoft with a straight face.
90% dominance of what is increasingly a small niche market. Apple controls a large fraction of the mobile device market, and everything else runs linux.
The fact is that outside of the tech scene, most businesses and consumers run Windows. To say this is a "small niche market" is laughable. Microsoft is everywhere.
From what I understand, Microsoft is excellent at hardware design. They’re just focused on a different market (services->business rather than consumers->services)
Apple has been iterating on their proprietary mobile ARM-based processors since 2010, and has gotten really good at it. I would imagine that producing billions of consumer devices with these chips has helped give them a lot of experience in a shortened time frame.
I also wonder if having the hardware and software both worked on in-house is an advantage. I mean, if you're developing power management software for a mobile OS, and you're using a 3rd-party vendor, then you read the documentation, and work with the vendor if you have questions. If it's all internal, you call them, and could make suggestions on future processor design too based on OS usage statistics and metrics.
In fact there is clear evidence of this with M1. It has optimised instructions to speed up retain/release of NSObject subclasses, which is a frequent operation on almost all Objective-C and Swift classes. They also designed the M1 to support the memory-ordering model used by x86 (and not normally by ARM) to accelerate Rosetta-translated binaries. I'm sure there are more.
>Why is that strategy simultaneously remarkably efficient and remarkably high-performance? What enabled/led them to make those choices where others haven't?
The things people complain about:
(a) keeping a walled garden,
(b) moving fast and taking the platform to new directions all at once
(c) controlling the whole stack
Which means they're not beholden to compatibility with third party frameworks and big players, or with their own past, and thus can rely on their APIs, third party devs etc, to cater to their changes to the architecture.
And they're not chained to the whims of the CPU vendor (as the OS vendor) or the OS vendor (as the CPU vendor) either, as they serve the role of both.
And of course they benchmarked and profiled the hell out of actual systems.
Neither A nor C makes any sense or is supported by evidence. There is no aspect of the Mac or macOS that can be realistically described as a "walled garden". It comes with a compiler toolchain and ... well, some docs. It natively runs software compiled for a foreign architecture. You can do whatever you want with it. It's pretty open.
A "walled garden" is when there is a single source of software.
"A" does matter a bit. Builds uploaded to the App Store include bitcode, which Apple strips on distribution.
According to docs, enabling bitcode: "Includes bitcode which allows the App Store to compile your app optimized for the target devices and operating system versions, and may recompile it later to take advantage of specific hardware, software, or compiler changes."
It seems quite likely they have (and probably used) the capability to recompile any app on their platform to benchmark real workloads against prototype silicon changes.
Walled garden has many meanings, depending on context.
macOS promotes the App Store as the source of software (even if it's not the sole), and has walls like notarization requirements and the Gatekeeper to prevent weeds from intruding.
With the App Store Apple knew that there's a pool of N apps that follows its guidelines, has passed internal checks for API use, and can be converted quite easily to a different architecture, that it could count on.
Their control over the platform allowed them to enforce Metal and deprecate OpenGL pronto, to add new combined iOS/macOS UI libs, and to introduce Marzipan.
They have also added stuff like Universal Binary support, and most importantly Bitcode, which abstracts away parts of the underlying architecture.
All of those were steps towards the ARM/M1 (and future developments), and all were enabled via Apple's control of the hardware, software, and - sure, partial - control of third party apps.
Running apps downloaded outside the store requires either users jumping through an increasing number of hoops or vendors paying to get every build signed off by a single party.
> They could still do all this shit without the walled garden.
They do. MacOS isn't a walled garden.
> They're anti-competitive
Have you heard of this little company from Washington called Microsoft? They have something like 85% of the PC market. There is another OS called Linux. About 85-90% of the internet runs on it.
I can understand a little where people get the idea the iPhone is anti-competitive, but we're talking about MacOS here.
I do wish people would keep the quasi-religious aspects out of things.
Amphetamine is a perfect example of how the Mac isn't a walled garden. They always had the option to sell outside the App Store. That is fundamentally the difference that makes a platform a walled garden or not. They might have lost some sales because they couldn't participate in the Mac App Store, but they could still sell their product. Some companies choose to avoid the Mac App Store because they don't like Apple's policies.
(a) Amphetamine could still be sold outside the Mac App Store.
(b) An app name could be problematic even in FOSS land. It's just that instead of Amphetamine being the name that causes it, it will be something else. E.g. with the trend of banning/changing terms like "master" (as in replication primary master, not as in the owner of slaves), unfortunately named apps could be thrown out of something, or asked to be renamed, to be included in a distro's package manager or a project.
With the walled garden, Apple can set enforceable timelines for the software ecosystem to adapt to architectural changes.
Remember the transition to arm64? Apple forced everything on the App Store to ship universal binaries.
Without the App Store walled garden, software isn’t required to keep up to date with architectural changes. Instead, keeping current is only a requirement to being featured on the App Store (which would just be a single way to install software, not the only method).
Well, and on the Mac, it's not the only method. The walled garden here has big open gates.
That said, all software on the Mac, post-Catalina, has to be 64-bit, whether it's distributed through the Mac App Store or not, because the 32-bit system libraries are no longer included at all.
>Well, and on the Mac, it's not the only method. The walled garden here has big open gates.
Gates are not incompatible with walled gardens. Most walled gardens have those.
Plus, I mentioned the walled garden as a good thing. It's part of the Apple proposition (even if not all get it), and part of what enables it to move at the speed it does (whether in the right or wrong direction).
But one can substitute "walled garden" with "tight control of the OS and hardware, imposed requirements on most third-party software, and willingness to enforce hard schedules (e.g. regarding removing 32-bit, OpenGL, etc.) on all (or tons) of its developers at once".
They are little tiny 6" tall walls that you can step over. Like micro-walls. Except for the bits where you have no walls at all. Like if you install literally any programming language, HomeBrew, or MacPorts.
The walls in the walled garden only exist in the heads of people who never use a Mac.
>They could still do all this shit without the walled garden
With much slower adoption, pushback, and bike-shedding, like in the Microsoft and Linux world.
>To me, it suggests they aren't willing to compete.
Compete with what? With themselves? They compete with Windows (and to a degree Linux, though few care for that), and with Android. They'd compete with Windows Phone too if MS wasn't incompetent.
But they didn't do anything to preclude others from making their OS/hardware and selling it to customers. In fact, they have nowhere near a monopoly in either the desktop (10% or less) or the mobile space (40% or less).
Whereas MS for example, had 98% of the desktop (home and enterprise), and abused its power to threaten OEMs to do its bidding against Linux etc.
What's with the downvotes? I know some people don't mind, but it's a deal breaker to me, and it is something I don't want to support.
I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts. Even if you wanted.
A walled garden does not make their hardware any better. If anything it makes it worse. I hope, for Mac users' sake, that Apple does not clamp down further on Macs.
Didn't downvote, but I think it's the same as when people read a "letter to the editor" of yore, declaring that some person "cancelled their subscription" because of something in the magazine.
A natural response is "Don't let the door hit you on your way out", which on HN might be expressed through a downvote by some.
>I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts. Even if you wanted.
Well, they repair all kinds of parts, and have guarantees and guarantee extension programs. But in any case, their allure was never "can find parts to build my own / repair damages forever" or in their stuff being cheap to own or fix/replace.
>A walled garden does not make their hardware any better.
Well, it does in a few ways. Mandating how the software is made, and what software is sold, when it should adapt new libs to continue being sold, etc, means that they can move the platform in different ways faster.
I never bothered to enter Apple's tyrannical ecosystem. So there is no door to hit me on the way out.
> Well, they repair all kinds of parts, and have guarantees and guarantee extension programs.
You cannot get a lot of surface-mount chips to repair a MacBook or iPhone without having to look on the gray market. This is even before possible firmware issues if you manage to find parts. Heck, even getting full replacement boards is basically impossible, unless they are from donor machines that have other problems.
> "Well, it does in a few ways. Mandating how the software is made, and what software is sold, when it should adapt new libs to continue being sold, etc, means that they can move the platform in different ways faster."
I disagree; allowing people to side-load does not stop Apple from having policies in place for its app stores. That's honestly the only problem. It's the owner's hardware; they should not need Apple's permission to run code on it. Unless the owner can sign software themselves and/or run it without Apple's consent, this will always be a problem. You can't even install gcc without jailbreaking an iPhone.
At the very least I should be able to install another OS on the device, like GNU/Linux. If Apple does not want to open iOS, the user should at least have that option for the hardware.
This is even before you get into how Apple treats developers. Have you read the entire App Store guidelines? Some of it is ridiculous. Some of the insanity prevents Firefox from even porting their own browser engine.
I think it's worth saying that because AMD have only just really hit their stride, Intel were under almost zero pressure to improve which has really hurt them especially with the process.
X86 is definitely a coefficient overhead, but if Intel put their designs on 5nm they'd look pretty good too - Jim Keller (when he was still there) hinted their offerings for a year or so in the future are significantly bigger to the point of him judging it to be worth mentioning so I wouldn't write them off.
It seems like Apple listened when people talked about how all modern processors bottleneck on memory access and decided to focus heavily on getting those numbers better.
Of course this leads to the question that if everyone in the industry knew this was the issue why weren't Intel and AMD pushing harder on it? They already both moved the memory controller onboard so they had the opportunity to aggressively optimize it like Apple has done, but instead we have year after year where the memory lags behind the processor in speed improvements, to the point where it is ridiculous how many clock cycles a main memory access takes on a modern x86 chip.
The Apple, Intel, and AMD memory controllers all look pretty similar in performance to me. Memory latency is the same at ~100 ns; Firestorm is clocked lower so latency is lower in terms of cycles. One Firestorm core can saturate the memory controller while Intel/AMD can't so that should be an advantage for single-threaded scenarios. Intel/AMD are behind, but I wouldn't say embarrassingly so and they haven't been lazy.
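To make the "same nanoseconds, fewer cycles" point concrete, here is a rough sketch in Python. The ~100 ns figure is from the comment above; the two clock speeds are purely illustrative assumptions, not measurements of any particular chip.

```python
# Convert a fixed DRAM latency in nanoseconds into core clock cycles.
# The ~100 ns latency comes from the comment above; the clock speeds
# below are illustrative assumptions, not measured values.

def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Cycles a core waits = latency (ns) * cycles per ns (GHz)."""
    return latency_ns * clock_ghz

LATENCY_NS = 100.0

ASSUMED_CLOCKS_GHZ = {
    "lower-clocked core (assumed 3.2 GHz)": 3.2,
    "higher-clocked core (assumed 5.0 GHz)": 5.0,
}

for name, ghz in ASSUMED_CLOCKS_GHZ.items():
    print(f"{name}: about {latency_in_cycles(LATENCY_NS, ghz):.0f} cycles")
```

Under those assumed clocks the same memory access costs roughly 320 cycles on the slower-clocked core and 500 on the faster-clocked one, even though the wall-clock latency is identical.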
My guess is it had to do with limitations tied to the x86_64 instruction set. It doesn't matter how much modifications you do, if you don't start with a good foundation, you're going to be limited to that foundation.
I think the current consensus among experts is that the instruction set is not the limiting factor. Modern x64 microprocessors have a separate front-end that handles instruction decoding. These instructions are decoded to internal proprietary "micro-ops". The internal buffers and actual execution units see only these µops. One can measure where the bottlenecks are, and it's rare to find that the front-end is the bottleneck. While it's arguably true that x64 is a poorly designed "foundation", it's unlikely to be causing any performance difference here.
> One can measure where the bottlenecks are, and it's rare to find that the front-end is the bottleneck.
Part of this is due to the fact that x86 processor designers won't include more execution units than they can feed from their instruction decoders. Apple's processors are much wider than x86 on both the decode and execution resources, and it's pretty clear that the M1 would not perform as well if its decoders were as narrow as current x86 cores.
> it's pretty clear that the M1 would not perform as well if its decoders were as narrow as current x86 cores
This would imply that it's able to sustain ILP greater than 4 (or maybe 5 with macro-fusion). Does it actually manage to do this often? If so, that's really impressive. I was guessing that most of the advantage was coming from the improved memory handling, and possibly a much bigger reorder buffer to better take advantage of this, but I'm happy to be shown otherwise.
There are real differences in processors caused by their ISAs - it's not true that decoders mean it's all the same RISC in the backend.
For instance, it's hard to combine instructions together, which is actually an advantage for x86 (the complex memory operands come for free). But it also guarantees memory ordering that ARM doesn't which is a drawback.
> For instance, it's hard to combine instructions together, which is actually an advantage for x86 (the complex memory operands come for free).
True, although I just looked at the ARM assembly for Daniel's example, and it's making good use of "ldpsw" to load two registers from consecutive memory with a single instruction. So in this particular case, it may be a wash.
> But it also guarantees memory ordering that ARM doesn't which is a drawback.
Yes, I wasn't considering the memory model to be part of the instruction set. I agree that in general this could be a big difference in performance, although I don't think it comes up in Daniel's example.
I added a comment to Daniel's blog with my guess as to what's happening to cause the observed timings in his example. Feedback from anyone with better knowledge of M1 would be appreciated.
> I think the current consensus among experts is that the instruction set is not the limiting factor.
Yes and no. Yes, because modern super scalar CPU's don't execute the instructions directly, but rather use a different instruction set entirely (the "micro-ops") and effectively compile the native instructions into that. This makes them free to choose whatever micro-ops they want. Ergo the original instructions don't matter.
But .... that means there is a compile step now. For a while that was no biggie - it can be pipelined if the encoding is complex. But now the M1 has 12 (iirc) execution units. In the worst case that means they can execute 12 instructions simultaneously, so they must decode 12 instructions simultaneously. This is a wee exaggeration as it isn't that bad. In reality the M1 appears to compile 8 instructions in parallel.
This is where the rot sets in for x86. Every ARM64 instruction is 32 bits wide. So the M1 grabs 8 32-bit words and compiles them in parallel to micro-ops. Next cycle, it grabs another 8 32-bit words, compiles them to micro-ops, and so on. But x86 instructions can start on any byte boundary, and can be 1 to 15 bytes in length. You literally have to parse the instruction stream a byte at a time before you can start decoding it. In practice they cheat a bit, making speculative guesses about where instructions might start and end, but when you're being compared to someone who effortlessly processes 32 bytes at a time that's like pissing in the wind.
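A toy sketch of the boundary-finding problem described above, in Python. The "length is derived from the first byte" encoding is a made-up stand-in so the example stays short; real x86 length decoding (prefixes, ModRM, immediates) is far more involved, and real decoders are hardware, not loops.

```python
# Toy model: finding instruction start offsets in a fixed-width versus a
# variable-length encoding. The variable-length scheme here is hypothetical.

FIXED_WIDTH = 4  # bytes per instruction, as with 32-bit ARM64 instructions

def fixed_width_starts(code: bytes, count: int) -> list[int]:
    """Each start offset depends only on its index, so `count` decoders
    could each compute their own offset independently."""
    return [i * FIXED_WIDTH for i in range(count) if i * FIXED_WIDTH < len(code)]

def variable_length_starts(code: bytes, count: int) -> list[int]:
    """Each start offset depends on the length of the previous
    instruction, so this scan is inherently sequential."""
    starts = []
    offset = 0
    while offset < len(code) and len(starts) < count:
        starts.append(offset)
        # Hypothetical rule: first byte encodes a total length of 1..15.
        offset += code[offset] % 15 + 1
    return starts

print(fixed_width_starts(bytes(32), 8))
print(variable_length_starts(bytes([2, 9, 9, 0, 4, 9, 9, 9, 9, 1, 9]), 8))
```

The first function is a pure function of the index, which is what lets eight decoders work side by side. The second cannot know where instruction N+1 starts until it has looked at instruction N, which is why wide x86 decoders have to guess or cache boundaries.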
So the instruction set may not matter, but how you encode that instruction set does matter, a lot. Back in the day, when there were few caches and every instruction fetch cost memory accesses, you were better off using tricks like using one byte for the most common instructions to squeeze the size of the instruction stream down. That is the era x86 and amd64 hail from. (Notably, the ill-fated Intel iAPX 432 took it to an extreme, having instructions start and end on a bit boundary.) But now with execution units operating in parallel, and on-chip caches putting instruction stores on chip right beside the CPUs, you are better off making storage size worse in order to gain parallelism in decoding. That's where ARM64 hails from.
It's interesting to watch RISC-V grapple with this. It's a very clever instruction set encoding that scales naturally between different word sizes. This also naturally leads to a very tight, compressed instruction set. But in order to achieve that they've got more coupling between instructions than ARM64 (but far, far less than x86), and any coupling makes parallelism harder. Currently RISC-V designs are all at the small, non-parallel end, so it doesn't affect them at all. In fact at the low power end it's almost certainly a win for them. But I get the distinct impression the consensus of opinion here on HN is it will prevent them from hitting the heights ARM64 and the M1 achieve.
Great comment, and great accurate explanations of complex stuff! I'm still going to disagree with the conclusion, though. Yes, x64 has to jump over horrible hurdles to prevent instruction decode from being a bottleneck. Yes, this makes for more complex processors, possibly with poorer performance per Watt. But I'm asserting (in my reasonably expert opinion) that the ridiculous contortions currently in use are (in almost all cases) adequate to prevent the instruction decoder from being the bottleneck.
What's missing from your description is the extra level of decoded µop cache between the decoder and instruction queue on modern Intel chips. In a tight loop, this pre-decoder kicks in and replays the previously decoded µops at up to 6 per cycle. It's a mess (and complicated enough that Intel needed to disable part of it with a microcode update on Skylake) but it provides enough instruction throughput that the real bottleneck is almost always elsewhere. Specifically, the 4-per-cycle instruction retirement limit almost always maxes out my attempts at extremely tight loop code earlier than instruction decoding.
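The argument above boils down to "sustained throughput is set by the narrowest stage". A minimal sketch of that reasoning in Python, using only the two figures quoted in the comment; the stage names are just labels for this example, and it glosses over the difference between µops and instructions.

```python
# Sustained throughput of a pipeline is capped by its narrowest stage.
# The 6/cycle and 4/cycle figures are the ones quoted above; treating
# µops and instructions as the same unit is a simplification.

def bottleneck(stage_widths: dict[str, int]) -> tuple[str, int]:
    """Return the (stage, ops-per-cycle) pair that limits throughput."""
    stage = min(stage_widths, key=stage_widths.get)
    return stage, stage_widths[stage]

tight_loop = {
    "uop_cache_replay": 6,  # up to 6 per cycle replayed from the µop cache
    "retirement": 4,        # 4-per-cycle retirement limit
}

print(bottleneck(tight_loop))  # ('retirement', 4)
```

With these numbers the front end has headroom to spare in a tight loop, which is the commenter's point: widening decode alone would not raise the ceiling.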
Which is to say, you are right about how much easier it is to decode ARM64 instructions, but I think you are wrong that decoding x64 is in-practice a limiting factor for performance. If you have a non-contrived example to the contrary, I'd love to see it.
Certainly Apple's processors are far ahead, but they're a full process generation (5nm) ahead of their competitors. They paid their way to that exclusive right through TSMC.
I'm sure they'll still come out ahead in benchmarks, but the numbers will be much closer once AMD moves to 5nm. You absolutely cannot fairly compare chips from different fab generations.
I don't see many comments hammering this point home enough... it's not like the performance gap is through engineering efforts that are leagues ahead. Certainly some can be attributed to that, and Apple has the resources to poach any talent necessary.
A node shrink gives you a choice of cutting power, improving performance, or some mix of the two.
Apple appears to have taken the power reduction when they moved to TSMC 5nm.
>The one explanation and theory I have is that Apple might have finally pulled back on their excessive peak power draw at the maximum performance states of the CPUs and GPUs, and thus peak performance wouldn’t have seen such a large jump this generation, but favour more sustainable thermal figures.
Apple’s A12 and A13 chips were large performance upgrades both on the side of the CPU and GPU, however one criticism I had made of the company’s designs is that they both increased the power draw beyond what was usually sustainable in a mobile thermal envelope. This meant that while the designs had amazing peak performance figures, the chips were unable to sustain them for prolonged periods beyond 2-3 minutes. Keeping that in mind, the devices throttled to performance levels that were still ahead of the competition, leaving Apple in a leadership position in terms of efficiency.
That seems like a pretty good trade-off for mobile devices. They usually don’t have sustained performance needs; you aren’t going to render a movie or do other long-running computational tasks. But mobile has a lot of bursty power demands: launching dozens of apps many times a day, for example. You want your short-ish interactions with your phone to be snappy.
Yeah, totally agreed. But if you read these comments, they seem to be in total amazement about the performance gap and not acknowledging how much of an advantage being a fab generation ahead is.
Customers don't care, but discussion of the merits of the chip should be more nuanced about this.
It also implies that the gap won't exist for very long, as AMD will move onto 5nm soon.
People keep pointing this out, but has Intel had such significant performance improvements since Sandy Bridge? With x86 it seems that lately you would be foolish to upgrade more often than once every 3-4 years, because the difference is just not that significant.
The i7-2600K (Sandy Bridge) benchmarks at ~5000 on Passmark, and the i7-10700K at about 20,000. So it seems they've had quite a bit of improvement. Note this is going from 32nm to 14nm.
Intel is in a really bad place now (in a forward-looking sense), primarily due to their fab process falling behind TSMC and others. You can't design your way ahead while using old manufacturing technology.
Over the last decade or so Apple has gone from 10x slower than Intel to parity, mostly by implementing techniques that were already known. Surpassing the state of the art may be harder to do consistently.
A node shrink is going to help AMD by 15% at best, they are much farther behind than that on performance per watt.
AMD has done mobile CPUs that look as if they are close to or even ahead of the M1 in performance, but they all use 2x to 4x as much power. When higher core count versions of Apple Silicon are available, they will be able to have double the core counts of AMD chips at the same power levels.
And each of those cores is significantly faster than an individual AMD core.
And Ryzen chips are offered with more cores, but that’s an extremely temporary advantage (reminds me of the friend who told me not to buy Apple stock because they didn’t have big screen phones).
When Apple fits 32 Firestorm cores in a 135 watt TDP package, AMD isn’t going to have an answer.
Power consumption is not linear as it relates to performance. CPUs designed for the desktop are going to use excessive power by design. They'll often use many times more power than their mobile equivalents, but only have slightly better single-core performance.
It might not be linear in terms of single core performance, where Apple already dominates, but it sure has a nearly linear relationship to multicore performance.
The M1 only has 4 Firestorm cores and 15 watts TDP. Its other four cores are Icestorm cores offering roughly 1/10th the performance and 1/3rd the power draw.
Now, as you have shown, it’s way ahead in the Cinebench 23 multicore benchmark. Let’s assume that’s more representative of real-world use, and that catching up on process will add 15% performance when AMD makes a 5 nm successor. That would increase its Cinebench multicore score to roughly 12,700, roughly 60% higher than the M1.
But all Apple has to do is come out with an M1X with eight Firestorm cores. That’s a multicore performance in the same range as the best possible 5 nm AMD CPU, and a TDP barely half of the AMD CPU. And far higher Cinebench single core, and GeekBench single/multicore ratings.
Obviously, to swap in Firestorm cores for the Icestorm cores they need more transistors, or something else has to go (on-chip GPU?).
And Apple won’t be making 20 hour MacBook Airs out of this M1X, but they will be able to make 14 hour MacBook Pros with faster discrete GPUs.
And they are about 6 months from doing exactly that. Which is likely going to be before AMD 5 nm, so top-of-the-line 4900H laptops get smoked on all Cinebench and GeekBench scores while using nearly double the power.
Apple M1 is fully optimized for efficiency because it's derived from a mobile SoC and still aimed at portable laptops, whereas the Zen3 (or Willow Cove) arch is aimed at the whole laptop/desktop/server range, so it is optimized for both efficiency and max performance.
Even though M1 is designed for efficiency, it sometimes outperforms AMD/Intel for performance. That confuses the story.
The 5nm vs 7nm vs 10nm vs whatever nm narrative is highly reminiscent of the clock rate flame wars that raged in the tech bubble over a decade ago. Back in the early aughts, when the great engineering was still a thing, alternative CPU designs (DEC Alpha 21264, MIPS 10k/12k, PA RISC, POWER and later UltraSPARC designs) were consistently outperforming any x86 design by at least an order of magnitude whilst being clocked 30% to 50% lower than the x86 designs, especially in FP operations. The alternative designs explored and utilised wider and deeper pipelines, bigger L1 caches and various optimisations across the entire CPU arch. Every alternative CPU design had something unique to offer, and that was a great thing to read about and study.
The commoditisation of the PC hardware has driven great CPU designs into extinction. Heck, even Oracle, which is now in the business of litigation for fun and a massive profit, with its prodigious cash war chest has discontinued the UltraSPARC architecture due to it requiring extraordinary investments on multiple fronts. PC users have long been forced to be content with whatever bone the CPU architecture coloniser would throw at them. There appears to be a resurgence of the great engineering with M1, and, hopefully, that will lead to more of the thoughtful engineering in the medium to long term.
M1 is fast due to: a solid, single vision of what a modern CPU should be like, continuous investment into R&D over an extended period of time, a well-concerted engineering effort, design ideas reused across multiple product lines, supply chain management, and, of course, the manufacturing process. Nanometers do not make for a great CPU design but rather play a supporting role. If the nanometers were so important, the 2017 POWER9 design manufactured at a 14 nm process with a smaller L1 cache would not have been able to outperform any existing x86 design in 2020 in both single-core and multi-core (with 25% to 50% fewer physical cores) setups. Ryzen 3 has narrowed the gap, but POWER9 still takes the lead and POWER10 is around the corner.
There is a great quote by Michael Mahon, a principal HP architect, in the foreword to the PA RISC 2.0 CPU architecture handbook from 1995:
The purpose of a processor architecture is to define a stable interface which can efficiently couple multiple generations of software investment to successive generations of hardware technology. Stability and efficiency are the goals, and the range of software and hardware technologies expected during the architecture’s life determine the scope for which the goals must be achieved
...
Efficiency also has evident value to users, but there is no simple recipe for achieving it. Optimizing architectural efficiency is a complex search in a multidimensional space, involving disciplines ranging from device physics and circuit design at the lower levels of abstraction, to compiler optimizations and application structure at the upper levels.
Because of the inherent complexity of the problem, the design of processor architecture is an iterative, heuristic process which depends upon methodical comparison of alternatives («hill climbing») and upon creative flashes of insight («peak jumping»), guided by engineering judgement and good taste.
To design an efficient processor architecture, then, one needs excellent tools and measurements for accurate comparisons when «hill climbing,» and the most creative and experienced designers for superior «peak jumping.» At HP, this need is met within a cross-functional team of about twenty designers, each with depth in one or more technologies, all guided by a broad vision of the system as a whole.
A well-executed holistic approach is the reason why the entry level M1 is fast. We need more of «holistic-ism» in engineering everywhere.
Sorry, but alternate CPU designs were never outperforming x86 by an "order of magnitude", especially not at a lower clock speed. That is a complete exaggeration. I was around during that time period and can find nothing that supports this. The Alpha was fast, yes, but you're talking 2 to 3x best case with floating point compared to a 2 to 3x cheaper Intel system. I did find some old benchmarks: http://macspeedzone.com/archive/4.0/WinvsMacSPECint.html
UltraSPARC was not very competitive. Those machines were very, very expensive and you didn't get much bang for the buck. They weren't even that fast. The later chips had tons of threads but single threaded performance was pretty bad...
> There is no "magic" at play here, it's just lots and lots of raw muscle. They just seem to have a freakishly successful strategy for choosing what aspects of the processor to throw that muscle at.
There is no freakishly successful strategy at play there either. It's just that all previous attempts at a "fast ARM" chip were rather half-hearted: "add a pipeline step there, add an extra register there, increase datapath width there," without trying to squeeze it to the limit.
The answer is that they have raw hard numbers from the hundreds of millions of iPads/iPhones sold each year, and can use the metrics from those devices to optimize the next generation of devices.
These improvements didn't come from nowhere. They came from iterations of iOS hardware.
> What enabled/led them to make those choices where others haven't?
Others have to some extent — AMD is certainly not out of the game — so I'd treat this more as the question of how they've been able to go more aggressively down that path. One of the really obvious answers is that they control the whole stack — not just the hardware and OS but also the compilers and high-level frameworks used in many demanding contexts.
If you're Intel or Qualcomm, you have a wider range of things to support _and_ less revenue per device to support it, and you are likely to have to coordinate improvements with other companies who may have different priorities. Apple can profile things which their users do and direct attention to the right team. A company like Intel might profile something and see that they can make some changes to the CPU but the biggest gains would require work by a system vendor, a compiler improvement, Windows/Linux kernel change, etc. — they contribute a large amount of code to many open source projects but even that takes time to ship and be used.
Intel does lots of contributions across the OS (Linux and glibc) to compilers including their own (gcc, icc, ispc, etc). Their problems aren't their ability, it's that Intel is poorly managed and internal groups are constantly fighting with each other.
Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.
> Intel does lots of contributions across the OS (Linux and glibc) to compilers including their own (gcc, icc, ispc, etc). Their problems aren't their ability, it's that Intel is poorly managed and internal groups are constantly fighting with each other.
I wasn't just talking about Intel but the concept of separate CPU and compiler vendors in general. Intel contributes a ton of open source but even if they were perfectly organized it takes time for everything to happen on different schedules before it's generally available: get patches into something like Linux or gcc, wait possibly years for Red Hat to ship a release using the new version, etc. Certain users — e.g. game or scientific developers — might jump on a new compiler or feature faster, of course, but that's far from a given and it means they're not going to get the across-the-board excellent scores that Apple is showing.
> Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.
This isn't entirely wrong but it's definitely not complete. Itanium failed because brilliant compilers didn't exist and it was barely faster even with hand-tuned code, especially when you adjusted for cost, but that doesn't mean that it doesn't matter at all. I've definitely seen significant improvements caused by CPU family-specific tuning and, more importantly, when new features are added (e.g. SIMD, dedicated crypto instructions, etc.) a compiler or library which knows how to use those can see huge improvements on specific benchmarks. That was more what I had in mind since those are a great example of where Apple's integration shines: when they have a task like “Make H.265 video cheap on a phone” or “Use ML to analyze a video stream” they can profile the whole stack, decide where it makes sense to add hardware acceleration, and then update their choice of the compiler toolchain and higher-level libraries (e.g. Accelerate.framework) and ship the entire thing at the time of their choosing whereas AMD/Intel/Qualcomm and maybe nVidia have to get Microsoft/Linux and maybe someone like Adobe on board to get the same thing done.
That isn't a certain win — Apple can't work on everything at once and they certainly make mistakes — but it's hard to match unless they do screw up.
> Itanium failed because brilliant compilers didn't exist and it was barely faster even with hand-tuned code, especially when you adjusted for cost, but that doesn't mean that it doesn't matter at all.
What you said is true for libraries, I just don't think it's true for compiler optimizations. Even Apple's clang just doesn't have any new optimizations that work on their own; there are certainly new features but they're usually intrinsics and other things that need to be adopted by hand. They thought this would happen (it's what bitcode was sold as doing) but in practice it has not happened.
The big "enabler" was their mass-purchase of 5nm lithography across the board. Even still though, 4ghz*8c isn't anything new, and isn't really that remarkable besides the low TDP (which is incidentally dwarfed by the display, which draws up to 5x more power than the CPU does). I think the big issue is that Apple has painted themselves into a corner here: ARM won't play nice with the larger CPUs they want to make, and the pressure for them to provide a competent graphics solution on custom silicon is mounting. They spent a lot of time this generation marketing their "energy efficiency" and battery life, but many consumers/professionals (myself included) don't really care about either of these things.
"the display, which draws up to 5x more power than the CPU does" - wat? Apple-supplied monitoring tools report that M1 under full load (all CPU and GPU cores) can draw over 30 watts. Laptop displays don't consume 150 watts. If anything, the displays in either of the M1 portables likely consume about 5 times less than 30W, even at full brightness.
"ARM won't play nice with the larger CPUs they want to make" - wat? Apple holds an architectural license. This means they paid a lot upfront a long time ago and therefore have a more or less perpetual right to design their own Arm cores without input from Arm.
"a competent graphics solution" - Also wat? M1 has an excellent GPU. It doesn't compete with discrete GPUs that use 300 watts, but that's fine: M1 is the chip for entry level Macs, designed for the smallest and lightest segments of their notebook line. And in that product segment, it has been every bit as much a revelation as the CPU. It's very fast, and uses little power given the performance.
What exactly do you think is going to happen when they scale that basic GPU design up? Despite your dismissiveness, in modern silicon architecture energy efficiency is incredibly important: for any given power budget, the more efficient you are the more performance you can deliver. The performance Apple gets out of about 10W on M1 suggests they'll have few problems building a larger GPU to compete with Nvidia and AMD discrete GPUs.
Didn't they also make some interesting hires a few years ago, like Anand from Anandtech and some other silicon vets, who likely helped them design the M1 approach?
1. they are the only ones who have 5nm chips because they paid a lot to TSMC for that right
2. they gave up on expandable memory, which lets them solder it right next to the cpu, which likely makes it easier to ship with really high clocks. and/or they just spent the money it takes to get binned lpddr4 at that speed.
So a good cpu design, just like AMD and Intel have, but one generation ahead on node size, and fast ram. It's not special low latency ram or anything, just clocked higher than maybe any other production machine, though enthusiasts sometimes clock theirs higher on desktops!
Right, latency isn't (much) affected by a higher clock rate. Getting ram to run fast requires both good ram chips and good controller/motherboard.
And yes, obviously Apple's bespoke ARM cpu is quite a bit different from Zen3 Ryzen's x86 cpu, but I'm not sure it is net-better. When Zen4 hits at 5nm I expect it will perform on par with or better than the M1, but we won't know till it happens!
Frankly, I find Lemire does oversimplified, poorly controlled, back-of-the-envelope microbenchmarking all the time that provides little to no insight other than establishing a general trend. It's sophomoric and a poor demonstration of how to do well-controlled benchmarking that might yield useful, repeatable, and transferable results.
Can you give an example? I've seen Lemire correct his posts on many occasions and the source code is published. I don't know many blogs doing anything remotely like that.
Sure. He often benchmarks some small C++ code on his "laptop" CPU (which one exactly? microarchs matter!) while committing classic microbenchmarking pitfalls such as:
- benchmarking something small enough to inspect machine code, but not inspecting machine code
- not plotting distribution, average, variance etc
- no attention paid to CPU frequency governor settings
- measuring too short a run
- measuring too small a dataset that it fits entirely in L1
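Two of the pitfalls in that list (no distribution, too short a run) are cheap to avoid mechanically. A minimal sketch in Python, not taken from any of the benchmarks discussed here: `summarize_timings` is a hypothetical helper name, and the workload and repeat counts are placeholders.

```python
import statistics
import timeit

def summarize_timings(stmt, repeat=15, number=1000):
    """Time `stmt` several times and report the spread, not a single number."""
    # Each raw sample is the total time for `number` calls; normalize to per-call.
    samples = [t / number for t in timeit.repeat(stmt, repeat=repeat, number=number)]
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),
        "max": max(samples),
    }

if __name__ == "__main__":
    data = list(range(10_000))  # placeholder workload
    stats = summarize_timings(lambda: sum(data))
    for name, seconds in stats.items():
        print(f"{name:>6}: {seconds * 1e6:8.2f} us per call")
```

A large gap between min and median, or a stdev comparable to the mean, is usually a hint that frequency scaling or background load is in the picture. This does nothing about the other items (inspecting machine code, dataset size, governor settings).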
The most impressive thing I've seen is that when accessed in a TLB friendly fashion that the latency is around 30ns.
Anandtech has a graph showing this, specifically the R per RV prange graph. I've verified this personally with a small microbenchmark I wrote. I've not seen anything else close to this memory latency.
Could this be coming from the page size being 4x as large for Apple Silicon versus x86? I don't fully understand the benchmark, but it appears to be accessing a variety of pages from the same first level TLB lookup?
It's been a long time since I dealt with this stuff (wanted to get 1GB huge pages in Linux for some huge huge hash tables), so maybe I'm misunderstanding.
Cachelines, page sizes, and size of the TLB all play a role. But with tinkering you can see those effects yourself and I played with 1, 2, 4, 8, 16, and 32 "pages" which I assumed were 4KB each and didn't see much difference. Measured latencies do increase slowly, but you expect that as the TLB becomes progressively more of a bottleneck.
If you use a 1GB array and see full random with much higher latency than a sliding window then you can be pretty sure that the page size is much less than 1GB.
Getting the cacheline off by a factor of 2 does make a small difference since you get occasional cache hits instead of zero, but as long as the array tested is several times larger than cache the impact is small.
But all in all the M1 has excellent memory bandwidth, excellent latency, and shows significantly better throughput on random workloads as you use more cores. Normal PC desktops have 2 memory channels (even the higher end i7/i9/ryzen7/ryzen9), only the $$$$ workstation chips like threadripper and some of the $$$$ Intel's have more. The little ole M1 in a mac mini, starting at $700 has at least 8 memory channels. So basically the M1 delivers on all fronts, larger and lower latency caches, wide issue, large reorder buffers, excellent IPC, and excellent power efficiency.
Just a clarification as to why the P per RV prange numbers are good: This pattern is simply aggressively prefetched by the region prefetcher in the M1, while Zen3 doesn't pull things in as aggressively.
It's designed to graph latency/bandwidth for 1 to N threads. My 1 thread numbers match Anandtech's. Use -p 0 for full random, which thrashes the TLB or -p 1 to be cache friendly (visit each cacheline once, but within a sliding window of 1 page).
To see the apple results (if you have gnuplot installed):
./lview results/apple-m1
I don't understand this competition to attribute the M1's speed to one specific change, while downplaying all of the others.
M1 is fast because they optimized everything across the board. The speed is the cumulative result of many optimizations, from the on-die memory to the memory clock speed to the architecture.
What other laptop ships with LPDDR4X clocked at 4267?
I agree though that being closer to the cpu isn't having any appreciable effect on latency, but being soldered close to the cpu probably does make it easier for them to hit that high clock rate.
As WMF mentions, Tiger Lake laptops like my Razer Book have the same memory. It is not appreciably closer to the CPU in the Apple design. In Intel's Tiger Lake reference designs the memory is also in two chips that are mounted right next to the CPU.
I just ran it again, and got more or less the same results:
N = 1000000000, 953.7 MB
starting experiments.
two : 29.7 ns
two+ : 36.5 ns
three: 43.8 ns
This surprises me. Normally, it does very well in most benchmarks I run.
Looking a little closer at the script, it loads numbers from "random", a vector of 3 million `Int` (this is hard coded, separate from `N`).
This vector is about 11.4 MiB.
The Tiger Lake CPU has 12 MiB of L3 cache (same as your i7-9750H), so it barely fits. Meanwhile, the L1 cache is 48 KiB and the L2 cache is 1.5 MiB -- huge compared to most recent CPUs, and a lot of benefit in most benchmarks, but at the cost of higher latency.
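The sizes quoted here can be sanity-checked with a few lines of arithmetic. This is only a sketch: the element widths (1 byte for the big array, 4 bytes for the index vector) are inferred from the quoted "953.7 MB" and "11.4 MiB" figures, not read from the script itself, and `size_mib` is a made-up helper name.

```python
MIB = 1024 ** 2

def size_mib(n_elements, bytes_per_element):
    """Footprint of a dense array in MiB."""
    return n_elements * bytes_per_element / MIB

if __name__ == "__main__":
    big = size_mib(1_000_000_000, 1)  # assumed 1-byte elements
    idx = size_mib(3_000_000, 4)      # assumed 4-byte indices
    l3_mib = 12.0                     # L3 size quoted in the thread
    print(f"big array:    {big:.1f} MiB")
    print(f"index vector: {idx:.1f} MiB (fits in L3: {idx < l3_mib})")
    # With 8-byte indices the vector would no longer fit:
    print(f"8-byte case:  {size_mib(3_000_000, 8):.1f} MiB")
```

If the indices were actually 8 bytes wide, the vector would be roughly twice the L3 size, which would change the picture, so the element width is worth confirming against the real code.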
https://www.anandtech.com/show/16084/intel-tiger-lake-review...
Skylake's L3 latency was 26-37 cycles, and in Willow Cove (Tiger Lake) it is 39-45 cycles.
That difference by itself isn't big enough to account for the difference we're seeing, so something else must be going on.
The caching of the random[] array (almost) shouldn't matter, as the access is sequential.
I'm wondering if the difference is the number of active memory channels. How many channels do your respective computers support? Do you have enough RAM installed so all channels are in use? Are you able to do a RAM bandwidth test by some other means to verify?
Another possibility is that for some reason the base latency is just different between your machines. A commenter added a pointer-chasing variation of Daniel's test on his blog. Maybe run this to find the full latency and see how the times differ?
Finally, there was one more commenter on the blog who reported anomalously fast times on a Windows laptop. It's possible there is a bug with Daniel's time measurements on Windows.
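For anyone unfamiliar with the pointer-chasing idea mentioned above: every load depends on the result of the previous one, so the CPU cannot overlap them and you measure full latency rather than throughput. A minimal sketch of the structure in Python, not the commenter's actual code; it uses Sattolo's algorithm so the chain is one single cycle covering the whole array. Interpreter overhead dominates in Python, so the printed number is not a real memory latency; a C or C++ version is needed for that.

```python
import random
import time

def sattolo_cycle(n, rng):
    """Return a permutation of range(n) that forms one single cycle."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # strictly j < i: this is what forces a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def chase(perm, steps, start=0):
    """Follow the chain; each lookup depends on the previous result."""
    pos = start
    for _ in range(steps):
        pos = perm[pos]
    return pos

if __name__ == "__main__":
    rng = random.Random(42)
    perm = sattolo_cycle(200_000, rng)
    t0 = time.perf_counter()
    end = chase(perm, len(perm))
    elapsed = time.perf_counter() - t0
    print(f"returned to start: {end == 0}")
    print(f"{elapsed / len(perm) * 1e9:.1f} ns per step (mostly interpreter overhead)")
```

The single-cycle property matters: a plain random shuffle can produce short cycles, and then the chase silently stays inside a small, cache-resident subset.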
Yeah, that is strange. Why was it so slow?
Trying on a desktop with a 7900X, I get
N = 1000000000, 953.7 MB
starting experiments.
two : 17.7 ns
two+ : 19.1 ns
three: 26.4 ns
bogus 1422321000
This is again close to 50% slower than your time, but nearly twice as fast as on my laptop.
I'll try again on the laptop and make sure I don't have other processes running.
The outcome seems to depend greatly on the physical design of the laptops. The elsewhere-mentioned Dell XPS 13 has a particularly poor cooling design, which is why I chose the Razer Book instead. Despite being marketed in a very silly way to gamers only, it seems to have competent mechanical design.
Gamers are likely to run their systems with demanding workloads, for hours, with a color-coded performance counter (FPS stat). They'll notice if it throttles. They're particularly demanding customers, and there's quite a bit of competition for their money.
I'm not sure. I know they've been getting more popular with the increased power of laptops and the ability to use external GPUs (via Thunderbolt 3). I'd guess desktops are more common, but some people will have both.
How are you defining memory performance and where are your supporting comparisons? This article only discusses the M1's behavior, and makes no comparisons to any other CPU.
All recent Intel Core-i microarchitectures require using full vector width loads to max out L1d bandwidth, because the load/store units don't actually care about the width of a load, as long as it doesn't cross a cache line (in which case the typical penalty is an additional cycle).
Only using 128 bit wide instructions on a core that has 512 bit hardware results in 4x less L1d bandwidth.
Sure, and it has a very large out-of-order execution engine, but it is not fundamentally different from what other super scalar processors do. So I am curious what the OP meant by that offhand comment.
One core of the M1 can drive the memory subsystem to the rails. A single core can copy (load+store) at 60GB/s. This is close to the theoretical design limit for LPDDR4X. A single core on Tiger Lake can only hit about 34GB/s, and Skylake-SP only gets about 15GB/s. So yes, it is close to 4x faster.
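To put "close to the theoretical design limit" in numbers, here is a back-of-the-envelope sketch. It assumes LPDDR4X-4267 on a 128-bit total bus (the 8 x 16-bit layout suggested elsewhere in this thread, not something confirmed here); `peak_bandwidth_gb_s` is just an illustrative helper.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits):
    """Theoretical peak: transfers/s times bytes moved per transfer."""
    return transfers_per_second * (bus_width_bits / 8) / 1e9

if __name__ == "__main__":
    peak = peak_bandwidth_gb_s(4267e6, 128)  # assumed: 4267 MT/s, 128-bit bus
    print(f"peak: {peak:.1f} GB/s")
    for name, measured in [("M1 single core", 60.0), ("Tiger Lake single core", 34.0)]:
        print(f"{name}: {measured:.0f} GB/s = {measured / peak:.0%} of peak")
```

Under those assumptions the peak is about 68 GB/s, so 60 GB/s from one core is roughly 88% of it, while 34 GB/s on the same memory type is about half.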
Thanks for clarifying. But this isn't any fundamental difference IMO. There isn't any functional limitation in an Intel core that means it cannot saturate the memory bandwidth from a single core, unless I am missing something.
One could argue that it's not "fundamental", but it's definitely a functional limitation of the current Intel cores. The memory bandwidth of a single core is hardware limited by the number of "Line Fill Buffers". Each buffer keeps track of one outstanding L1 cacheline miss, thus the number of LFB's limits the memory level parallelism (MLP). "Little's Law" gives the relationship between the latency, outstanding requests, and throughput. With 10 LFB's and the current latency of memory, it's physically impossible for a single core to use all available memory bandwidth, especially on machines with more than 2 memory channels.
The M1 chip allows higher MLP, presumably because it has more LFB's per core (or maybe they are using a different approach where the LFB's are not per-core?). I apologize for using so many abbreviations. I searched to try to find a better intro, but didn't find anything perfect. I did come across this thread that (apparently) I started several years ago at the point where I was trying to understand what was happening: https://community.intel.com/t5/Software-Tuning-Performance/S....
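The Little's Law argument above is easy to make concrete. A minimal sketch: the 10 line fill buffers and 64-byte lines come from the discussion, but the 80 ns memory latency is an assumed round number, and this ignores hardware prefetchers, which can keep additional requests in flight beyond the demand misses.

```python
def max_bandwidth_gb_s(outstanding, line_bytes, latency_ns):
    """Little's Law: throughput = requests in flight * size / latency.
    Bytes per nanosecond is numerically the same as GB/s."""
    return outstanding * line_bytes / latency_ns

def required_outstanding(target_gb_s, line_bytes, latency_ns):
    """How many cache-line requests must be in flight to sustain a target rate."""
    return target_gb_s * latency_ns / line_bytes

if __name__ == "__main__":
    latency_ns = 80.0  # assumption, not a measured value
    print(f"10 LFBs, 64 B lines: {max_bandwidth_gb_s(10, 64, latency_ns):.1f} GB/s")
    print(f"lines in flight for 60 GB/s (64 B):  {required_outstanding(60, 64, latency_ns):.1f}")
    print(f"lines in flight for 60 GB/s (128 B): {required_outstanding(60, 128, latency_ns):.1f}")
```

With those inputs, ten demand misses in flight cap out around 8 GB/s, and sustaining 60 GB/s needs on the order of 75 outstanding 64-byte lines (or about half that with 128-byte lines). The exact figures move with the latency you plug in, but the shape of the constraint does not.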
I agree, it's not fundamental. It is, in particular, not that other popular myth, that it's "because ARM". It's only that 1 core on an Intel chip can have N-many outstanding loads and 1 core of an M1 can have M>N outstanding loads.
Your analogy was so close! It's Apple comes along and makes an 8 cylinder engine. Since, you know, the other CPUs are 4-wide decode and Apple's M1 is 8-wide decode :)
>or that it's faster because the memory is 2mm closer to the CPU (not that either)
Not to disagree with your overall point, but 2mm is a long way when dealing with high frequency signals. You can't just eyeball this and infer that it makes no difference to performance or power consumption.
If it works, it works. There will be no observable performance difference for DDR4 SDRAM implementations with the same timing parameters, regardless of the trace length. There are systems out there with 15cm of traces between the memory controller pins and the DRAM chips. The only thing you can say against them is they might consume more power driving that trace. But you wouldn't say they are meaningfully slower.
You can't just eyeball the PCB layout for a GHz frequency circuit and say "yeah that would definitely work just the same if you moved that component 2mm in this direction". It's certainly possible to use longer trace lengths, but that may come with tradeoffs.
>The only thing you can say against them is they might consume more power driving that trace
Power consumption is really important in a laptop, and Apple clearly care deeply about minimising it.
For all we know for sure, moving the memory closer to the CPU may have been part of what's enabled Apple to run higher frequency memory with acceptable (to them) power draw.
It makes it easier to get to a particular clock speed. The geometry, interconnect lengths etc are all tightly controlled, the noise is less because you're not on the main PCB and you have interconnect options that aren't whatever your PCB process is (e.g. commonly gold wires).
It is not only HN. It is practically the whole Internet. Go around the top 20 hardware and Apple forums and you see the same thing, also vastly amplified by a few KOLs on Twitter.
I don't remember ever seeing anything quite like it in tech circles. People were happily running around spreading misinformation.
Yeah, I know. There was some kid on Twitter who was trying to tell me that it was the solder in an x86 machine (he actually said "a Microsoft computer") that made them slower. Apple, without the solder was much faster.
According to this person's bio they had an undergraduate education in computer science ¯\_(ツ)_/¯
I am pretty sure KOL predates Influencer in modern internet usage. Before that they were simply known as Internet Celebrities. Maybe it is rarely used now, so apologies for not explaining the acronym.
I can't find any info about the memory bus of apple m1. Is it 8 channels 16 bit each? That's drastically different from AMDs 2 channels 64 bit each.
It looks like apple m1 is much less eager when caching memory rows. Maybe because it doesn't have l3 cache.
Edit: This test utilizes the 8x16bit memory bus of apple m1 fully. It's mostly just fetching random locations from memory, which can all be parallelized by the cpu pipeline. It explains why the results are exactly 4x slower on my ryzen 3 with 2 memory channels.
So the summary is that m1 is optimized for dynamic languages that tend to do a DDOS attack on RAM with a lot of random memory access, but it might take a performance hit with compiled languages and traditional HPC techniques that tend to process data in sequence like ECS.
That's an interesting observation - as someone who's built a few ECS implementations, one of the things I've always taken for granted is that things like cache line size are more or less set in stone, given the ubiquity of x86, so it's interesting to consider that the rise of ARM might create additional complexities there.
I'm a bit of two minds about this: on the one hand, for a long time I've wanted a language for writing allocators which is more explicit about memory, and offers good abstractions for low-level memory operations (maybe Zig is going in this direction). In some sense, it feels like the move towards programmers thinking less about memory management has been a bit of a dead-end, and what we really want is better tools for memory management. Fragmentation in terms of how processors handle memory goes against this goal in some ways.
On the other hand, it's a bit of a "holy grail" to imagine a hardware stack which obviates the need for memory optimization, and really does treat loading from and storing to memory anywhere on the heap as the same. But I imagine that the interesting things which the M1 is doing with memory are helping a lot with the worst case performance, and maybe even average case performance, but they're probably not doing much for the best case.
That would make sense for LPDDR4 but it apparently claims to have a 128 byte cache line size and I'm not sure how to square that with 16 bit channel width.
Is the article saying that the M1 is slower than we would have expected in this case?
My understanding, based on the article, is that on a normal processor we would have expected
arr[idx] + arr[idx+1]
and
arr[idx]
to take the same amount of time.
But the M1 is so parallelized that it goes to grab both arr[idx] and arr[idx+1] separately. So we have to wait for both of those to return. Meanwhile, on a less parallelized processor, we would have done arr[idx] first and waited for it to return, and the processor would realize that it already had arr[idx+1] without having to do the second fetch.
>> My understanding, based on the article, is that a normal processor, we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.
That depends. If the two accesses are on the same cache line, then yes. But since idx is random that will not happen sometimes. He never says how big array[] is in elements or what size each element is.
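How often "will not happen sometimes" actually occurs is straightforward to estimate. A small sketch, assuming the array starts on a cache-line boundary and that idx is uniformly random; the element sizes are guesses, since the thread never pins them down, and `same_line_fraction` is an illustrative name.

```python
def same_line_fraction(element_bytes, line_bytes, n_elements):
    """Fraction of idx values for which arr[idx] and arr[idx+1]
    fall in the same cache line (array assumed line-aligned)."""
    same = 0
    for idx in range(n_elements - 1):
        line_a = (idx * element_bytes) // line_bytes
        line_b = ((idx + 1) * element_bytes) // line_bytes
        same += line_a == line_b
    return same / (n_elements - 1)

if __name__ == "__main__":
    for elem, line in [(1, 64), (4, 64), (4, 128)]:
        frac = same_line_fraction(elem, line, 64_001)
        print(f"{elem}-byte elements, {line}-byte lines: {frac:.2%} same line")
```

When the element size divides the line size this comes out to about 1 - element_bytes / line_bytes, so for small elements the pair shares a line well over 90% of the time, and the second access should mostly be free once the first has returned.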
I thought DRAM also had the ability to stream out consecutive addresses. If so then it looks like Apple could be missing out here.
Then again, if his array fits in cache he's just measuring instruction counts. His random indexes need to cover that whole range too. There's not enough info to figure out what's going on.
Ratios aside, that's an interesting speed leap when the article gets 9 ms for 2-wise. Mind, the laptop had lots of applications running, I didn't clear it up to do a proper benchmark, but still.
He's only got 3 million random[] numbers. Whether that's enough depends on the cache size. It also bothers me to read code like this where functions take parameters (like N) and never use them.
That array is indexed by an array of random numbers and there are only 3M of them. That should be enough: assuming even 4 bytes per index, it will just fit in the 12MB cache, but then there are accesses to the big array as well.
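To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes 4-byte indexes and reads "12MB" as decimal megabytes; both are assumptions taken from the comment above, not from the benchmark source.

```python
# Rough working-set estimate for the index array discussed above.
# Assumptions: 4-byte indexes, cache size of 12 decimal megabytes.
NUM_INDEXES = 3_000_000
BYTES_PER_INDEX = 4
CACHE_BYTES = 12 * 1000 * 1000

# Bytes occupied by the index array on its own.
index_bytes = NUM_INDEXES * BYTES_PER_INDEX
fits_alone = index_bytes <= CACHE_BYTES


def fits_with_big_array(big_array_bytes):
    """True if the indexes plus the touched part of the big array fit in cache."""
    return index_bytes + big_array_bytes <= CACHE_BYTES


print(index_bytes, fits_alone, fits_with_big_array(1))
```

Under those assumptions the indexes alone fill the cache exactly, so any access to the big array has to compete with them for space.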
It's a little confusing because they're conflating the idea that you almost certainly read at least the entire word (and not a single byte) at a time with the other idea that you could fetch multiple words concurrently.
Any cached memory access is going to read in the entire cache line -- 64 bytes on x86, apparently 128 on M1. This is true across most architectures which use caches; it isn't specific to M1 or ARM.
(As I learned from recent Rust concurrency changes) on newer Intel, it usually fetches two cache lines so effectively 128 bytes while AMD usually 64 bytes. That's the sizes they use for "cache line padded" values (i.e. making sure to separate two atomics by the fetch size to avoid threads invalidating the cache back and forth too much).
To be clear here, it fetches two cache lines but it doesn’t put the second in exclusive state until it’s written to; the unit of granularity is still 64 bytes. In a scanning read mode you will see the benefit but you won’t see the contention on writes. (The contention will come from subsequent reads on that cache line, though.)
Yes almost certainly more than the word will be read but it varies by architecture. I would think almost by definition no less than a word can be read so I went with that in my explanation.
It’s a good introduction, but it’s a bit disappointing that it ends that way. I’d love to read more about what’s behind the figure and more technical info about how it might work.
This isn’t specific to the M1 but I talk about cache lines in my last QCon presentation (where I also suggested that a 128-byte cache line wasn’t far away):
However, most of the speed benefit comes from a much larger L1 cache and from the RAM being in the same chip, which reduces latency.
The program (instruction) cache is also a lot bigger, and as a fixed-size ISA it has the advantage that it can be much wider in execution than x86, but that’s unlikely to be of benefit here, other than perhaps slightly in terms of queuing up multiple outstanding loads.
Yup, my presentation was back in March 2020 and the M1 came out later in the year — sooner than I was expecting, TBH; I thought that it was a couple of years out when I said it :-)
This article lays out three scenarios:
1) accessing two random elements
2) accessing three random elements
3) accessing two pairs of adjacent elements (same as (1) but also the elements after each random element)
It then does some trivial math to use the loaded data.
A naive model might only consider memory accesses and might assume accessing an adjacent element is free.
On the Mac m1 core, this is not the case. While the naive model might expect cases 1 & 3 to cost the same and case 2 to cost 50% more, instead cases 2 & 3 are nearly the same (3 slightly faster) and case 2 is about 50% more expensive than 1.
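For what it's worth, the naive model's prediction can be written down in a few lines of Python. This is only a sketch: it assumes cost is proportional to the number of distinct cache-line fetches per iteration and that adjacent elements share a line. The scenario names are mine.

```python
def naive_cost(line_fetches, cost_per_fetch=1.0):
    """Naive model: only cache-line fetches cost anything, the math is free."""
    return line_fetches * cost_per_fetch


# Distinct cache-line fetches per iteration, assuming adjacent elements
# always land in the same line (the naive assumption).
scenarios = {
    "two random": 2,
    "three random": 3,
    "two adjacent pairs": 2,
}

baseline = naive_cost(scenarios["two random"])
relative = {name: naive_cost(n) / baseline for name, n in scenarios.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x")
```

The model puts case 3 level with case 1, which is exactly the prediction the M1 measurement contradicts.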
> A naive model might only consider memory accesses and might assume accessing an adjacent element is free.
Really depends on the level of naivety and the definition of "free". It would be less insane to write that: accessing an adjacent element has a negligible overhead if the data must be loaded from RAM and there are some OOO bubbles to execute the adjacent loads. If some data are in cache the free adjacent load claim immediately is less probable. If the latency of a single load is already filled by OOO, adding another one will obviously have an impact. If the workload is highly regular you can get quite chaotic results when making even some trivial changes (even sometimes when aligning the .text differently!)
And the proposed microbenchmark is way too simplistic: it is possible that it saturates some units in some processors and completely different units in others...
Is the impact of an extra adjacent load from RAM likely to be negligible in real-world workloads? Absolutely. With precise characteristics depending on your exact model / current freq / other memory pressure at this time, etc.
I don't really understand the comparison because it seems like scenario 3 (2+) is doing more XORs and twice the accesses to array over the same amount of iterations.
We have to assume these are byte arrays, yes? Or at least some size that's smaller than the cache line. You would still pay for the extra unaligned fetches. I don't think this is a valid scenario at all, M1 or not.
Anyone want to run these tests on an Intel machine and let us know if the author's "naive model" test holds there?
The point of the naive model is that you assume memory accesses dominate
That is, the math part is so trivial compared to the memory access that you could do a bunch of math and you would still only notice a change in the number of memory accesses.
Also it looks like the response to yours links their test and the naive model predicts correctly
I think 5% is a non-trivial difference but alright, it's a much bigger difference on the M1.
I guess I still don't understand what's going on here.
Scenario 1 has two spatially close reads followed by two dependent random access reads.
Scenario 3 (2+) has two spatially close reads, and two pairs of dependent random access reads of two spatially close locations.
Why does it follow that this is caused by a change in memory access concurrency? The two required round trips should dominate both on the M1 and an Intel but for some reason the M1 performs worse than that. Why?
I can't help but feel the first snippet triggers some SIMD path while the 3rd snippet fails to.
I think the 5% can maybe be accounted for by the cache line (you mentioned this above, and I don't think the experiment does anything to prevent the issue)? If it's 1/16th chance of crossing the cache line, that maybe is about 5% of the time? I say that with pretty low confidence though
I think you raise a good question, though -- what really is going on here? Is this just a missed optimization compiling for the m1?
Or is it actually something fundamental about how reads happen with an m1? I'm definitely not knowledgeable enough to know how to answer this
Shouldn’t you choose the random numbers such that array[idx1] ^ array[idx1 + 1] are guaranteed to fall in the same cache line? Assuming that it has that. Right now some accesses cross the end of the cache line.
Technically you are correct but it’s expected to cross a cache line 1/16 times (or however many ints there are in a cache line). There is an implicit assumption that that is relatively infrequent enough that it shouldn’t increase the average time too much, but that assumption should be tested.
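The 1/16 figure is easy to check. A small Python sketch, assuming the array starts on a cache-line boundary, idx is uniformly random, and elements are 4-byte ints (none of which the article states explicitly):

```python
from fractions import Fraction


def crossing_probability(element_bytes, line_bytes):
    """Chance that arr[idx] and arr[idx + 1] are in different cache lines."""
    elements_per_line = line_bytes // element_bytes
    return Fraction(1, elements_per_line)


def same_line(idx, element_bytes, line_bytes):
    """True if elements idx and idx + 1 start in the same cache line."""
    first = (idx * element_bytes) // line_bytes
    second = ((idx + 1) * element_bytes) // line_bytes
    return first == second


# Brute-force check over 1024 consecutive indexes with 64-byte lines.
crossings = sum(not same_line(i, 4, 64) for i in range(1024))

# Clearing the low bit of idx keeps every pair inside one line.
aligned_ok = all(same_line(i & ~1, 4, 64) for i in range(1024))

print(crossing_probability(4, 64), crossing_probability(4, 128), crossings, aligned_ok)
```

With 128-byte lines the same reasoning gives 1/32, so on the M1 the straddling case should be rarer still. Rounding idx down to an even value is one way to rule it out altogether.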
The M1 doubles the line size, doubles the L1 data cache (i.e. same number of lines), quadruples the L1 instruction cache (i.e. double the lines), and has a 16x larger L2 cache, but no L3 cache.
The answer to that is usually very context dependent, and on what you're measuring. As long as you use a histogram first and don't blindly calculate (say) the mean it should be obvious.
Two examples (that are slightly bigger than this but the same principles apply):
If you benchmark a std::vector at insertion, you'll see a flat graph with n tall spikes at ratios of its reallocation amount apart, and it scales very very well. The measurements are clean.
If, however, you do the same for a linked list you get a linearly increasing graph but it's absolutely all over the place because it doesn't play nice with the memory hierarchy. The std_dev of a given value of n might be a hundred times worse than the vector.
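The vector case can be simulated to see why the mean misleads. A minimal Python sketch, assuming a made-up cost model (1 unit per append, plus 1 unit per element copied on reallocation) and a growth factor of 2. Real std::vector implementations and real timings will differ.

```python
import statistics


def append_costs(n, growth=2):
    """Simulated cost of each of n appends to a growable array."""
    costs = []
    capacity, size = 1, 0
    for _ in range(n):
        cost = 1
        if size == capacity:
            cost += size  # reallocation: copy every existing element
            capacity *= growth
        size += 1
        costs.append(cost)
    return costs


costs = append_costs(1024)
spikes = [i for i, c in enumerate(costs) if c > 1]

print("spike positions:", spikes)
print("median:", statistics.median(costs))
print("mean:", statistics.mean(costs))
print("max:", max(costs))
```

Almost every append costs 1 unit, yet the mean comes out close to 2 because of a handful of spikes at powers of two. A histogram makes that obvious, a single average hides it.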
It was added some years ago, and I believe mach_absolute_time is actually now implemented in terms of (the implementation of) clock_gettime. The documentation on mach_absolute_time now even says you should use clock_gettime_nsec_np(CLOCK_UPTIME_RAW) instead.
macOS also has clock constants for a monotonic clock that increases while sleeping (unlike CLOCK_UPTIME_RAW and mach_absolute_time).
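For timing a snippet against one of these clocks without writing C, Python's `time` module exposes the same clock ids where the platform defines them. A sketch, assuming a Unix-like system (`time.clock_gettime_ns` is not available on Windows). The `pick_clock` helper is my own, not a library function.

```python
import time


def pick_clock(*names):
    """Return the first clock id in `names` that this platform's time module defines."""
    for name in names:
        clock_id = getattr(time, name, None)
        if clock_id is not None:
            return clock_id
    raise RuntimeError("none of the requested clocks are available")


# CLOCK_UPTIME_RAW exists on macOS; fall back to CLOCK_MONOTONIC elsewhere.
clock = pick_clock("CLOCK_UPTIME_RAW", "CLOCK_MONOTONIC")

start = time.clock_gettime_ns(clock)
total = sum(range(100_000))  # stand-in for the code being measured
elapsed_ns = time.clock_gettime_ns(clock) - start

print(f"sum={total}, elapsed={elapsed_ns} ns")
```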
I was part of the team that really pushed the kernel team to add support for a monotonic clock that counts while sleeping (this had been a persistent ask before just not prioritized). We got it in for iOS 8 or 9. The dance you otherwise have to do is not only complicated in userspace on macOS, it's expensive & full of footguns due to race conditions (& requires changing the clock basis for your entire app if I recall correctly).
Do you have any insight as to why libdispatch added support for this new clock internally (in the last year or two), but did not expose it in any public API? In C I can manually construct a dispatch_time_t that will make libdispatch use CLOCK_MONOTONIC_RAW¹, if I'm willing to make assumptions about the format of dispatch_time_t² (despite libdispatch warning that the internal format is subject to change³). And I can't even do this in Swift. It would be really useful to have this functionality, so I'm mystified as to why it's hidden.
¹Technically it uses mach_continuous_time() first if available (which appears to be equivalent to CLOCK_MONOTONIC_RAW), then clock_gettime(CLOCK_BOOTTIME, &ts) on Linux, then clock_gettime(CLOCK_MONOTONIC, &ts), then some other API for Windows.
²Conveniently enough the value that is equivalent to DISPATCH_TIME_NOW using the monotonic clock is just INT64_MIN, at least in the current encoding.
³Swift makes assumptions about the internal format of dispatch_time_t so I don't know if it actually can meaningfully change. Newer versions of Swift now use stdlib on the system, but any app built with a sufficiently old version of Swift still embeds its own copy of the stdlib. Granted, additions (like the monotonic clock) should be fine, as the Swift API does not actually bridge from dispatch_time_t so it only ends up representing times it has APIs to construct. Since it doesn't have APIs to construct monotonic times, they won't break it.
I haven't worked at Apple in almost 5 years so I can't speak to more recent developments unfortunately. libdispatch integration may be trickier & have implementation details not suitable for a public API.
Specifically if I recall correctly the internal representation of time only allowed for 2 formats (wallclock vs monotonic). I suspect adding other types of clocks could pose back/forward compat challenges. However, this is a wild shot in the dark & I didn't really look into the internals of libdispatch. Maybe ask on their github page?
Generally upgrading a private API to public is a lot of work & goes through a lot of review (having monitored that mailing list & added 1 API during my time there). So if there's a private API probably some team at Apple needed it to deliver a feature but the maintainers were not confident the specific solution they chose generalized well & either requires some work or something else.
The way dispatch_time_t currently works, having the high bit (bit 63) 0 means based on uptime (or mach_absolute_time()), having bits 63 and 62 both set to 1 means based on the wall clock, and the addition here is having 63 set to 1 and 62 set to 0 means based on continuous time. I don’t think they have the room to add a 4th clock though.
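That layout can be spelled out as a small decoder. This Python sketch follows only the description in this thread and has not been checked against the libdispatch source. The function name and the clock labels are made up for illustration.

```python
def dispatch_clock(value):
    """Classify a 64-bit dispatch_time_t-style value by its top two bits.

    Layout as described in the comment above (an assumption, not verified):
      bit 63 = 0            -> uptime (mach_absolute_time)
      bit 63 = 1, bit 62 = 1 -> wall clock
      bit 63 = 1, bit 62 = 0 -> continuous time
    """
    bits = value & 0xFFFFFFFFFFFFFFFF  # view the value as unsigned 64-bit
    bit63 = (bits >> 63) & 1
    bit62 = (bits >> 62) & 1
    if bit63 == 0:
        return "uptime"
    return "wall" if bit62 == 1 else "continuous"


INT64_MIN = -(2 ** 63)

print(dispatch_clock(0))                   # uptime
print(dispatch_clock(INT64_MIN))           # continuous
print(dispatch_clock(0xC000000000000000))  # wall
```

It also shows why INT64_MIN works as "now" on the continuous clock in the current encoding: only bit 63 is set. With two bits and three clocks, the one remaining combination (bit 63 clear, bit 62 set) already belongs to the uptime range, which is presumably why there is no room for a fourth.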
That may be the result of inlining clock_gettime, though that would imply a pretty different implementation from the one I am familiar with.
AFAIR on x86 a locked rdtsc is ~20 cycles. So to answer the gp question, its precision is in the range of a few nanoseconds. Accuracy is a different question, i.e. compare numbers from the same die, but be a little more suspicious across dies.
No clue how this is implemented on the M1, or if the M1 has the same modern tsc guarantees that x86 has grown over the last few generations of chips.
Sufficiently old versions of mach_absolute_time used a function called clock_get_time() on i386 (if the COMM_PAGE_VERSION was not 1). This changed in macOS 10.5 to a tiny bit of assembly that just read from _COMM_PAGE_NANOTIME on i386/x86_64/ppc (the arm implementation(!!) triggers a software interrupt). The i386/x86_64/ppc definitions were also copied into xnu.
For the next few years it kind of bounced back and forth between libc and xnu, and the routine was complicated by adding timebase conversion as needed. And at some point arm support was added back (it vanished when it first went to xnu), but this time using the commpage if possible.
As for clock_gettime_nsec_np(), at least as of Big Sur, it's in libc¹ instead of xnu and defers to mach_continuous_time()/mach_continuous_approximate_time()/mach_absolute_time()/mach_approximate_time() for the CLOCK_*_RAW[_APPROX] clocks. And clock_gettime() for those clocks is implemented in terms of clock_gettime_nsec_np().
For people who know more about this stuff than me: are these sorts of optimizations only possible because Apple controls the whole stack and can make the hardware & OS/software perfectly match up with one another or is this something that Intel can do but doesn't for some reasons (tradeoffs)?
> are these sorts optimizations only possible because Apple controls the whole stack and can make the hardware & OS/software perfectly match up with one another or is this something that Intel can do but doesn't for some reasons (tradeoffs)?
Interestingly it's the other way around. Apple is using TSMC's 5nm process (they don't have their own fabs), which is better than Intel's in-house fabs, so it's Intel's vertical integration which is hurting them compared to the non-vertically integrated Apple.
Also, the answer to "is this only possible because of vertical integration" is always no. Intel and Microsoft regularly coordinate to make hardware and software work together. Intel is one of the largest contributors to the Linux kernel, even though they don't "own" it. Two companies coordinating with one another can do anything they could do as an individual company.
Sometimes the efficiency of this is lower because there are communication barriers and there isn't a single chain of command. But sometimes it's higher because you don't have internal politics screwing everything up when the designers would be happy with outsourcing to TSMC because they have a competitive advantage, but the common CEO knows that would enrich a competitor and trash their internal investment in their own fabs, and forces the decision that leads to less competitive products.
Not quite vertical integration, but TSMC's 5nm fabs are Apple's fabs. (exclusively for a period of time)
During the iPod era, Toshiba's 1.8in HD production was exclusively Apple's only for music players, but Apple gets all the 5nm output from TSMC for a period of time.
I think this gets lost in the fray between the "omg this is magic" and then the Apple haters. The M1 is a very good chip. Apple has hired an amazing team and resourced them well. But from a pure hardware perspective, the M1 is quite evolutionary. However the whole Apple Silicon experience is revolutionary and magical due to the tight software pairing.
Both teams deserve huge praise for the tight coordination and unreal execution.
I think this is part of the reason why there are so many people trying to find reasons to downplay it: humans love the idea of “one weird trick” which makes a huge difference and we sometimes find those in tech but rarely for mature fields like CPU design. For many people, this is unsatisfying like asking an athlete their secret, and getting a response like “eat well, train a lot, don't give up” with nary a shortcut in sight.
Sort of; Intel and AMD are stuck with the variable width instruction isa that exists due to historical evolution. To do something different you need a new isa.
Intel tried this with Itanium a while back and failed because it is difficult to get software developers to target a new isa and provide compilers and compiled code for everything unless you use a translation layer.
Apple is one step ahead here because their compilers already supported ARM isa (because iPhones use them) and had both the OS and apps ready to go from day one of availability.
They also had translation technology that would allow mutating x86_64 code to ARM64 code so that old apps would (on the whole) run acceptably fast on the new chip.
To do the latter properly, Apple had to create a special mode to run the arm chip with total store order for memory writes, which is not standard on arm. (It would be a lot slower if they didn’t have that when running Rosetta translated code.)
So it took both the OS being available and the OS influencing the ARM tweaks (e.g. TSO) for them to pull it off.
They also have the position that they build hardware that uses those chips so can mass produce - and in fact, replace - existing hardware.
Each of these things could be done in isolation by Intel/Windows/Apps but it would be difficult to do all three.
Even getting JavaScript maths in a special instruction was difficult enough on Intel, and that was something of benefit to any browser.
My guess is you’ll see Intel and AMD offering Arm chips in the near future, as both AWS (Graviton) and Apple have shown the way to a new ARM future.
There are at least two M1 optimisations targeting Apple's software stack:
1. Fast uncontended atomics. Speeds up reference counting, which is used heavily by the Objective-C code base (and Swift). The increase is massive compared to Intel.
2. Guaranteed instruction ordering mode. Allows for faster Arm code to be produced by Rosetta when emulating x86. Without it emulation overhead would be much bigger (similar to what Microsoft is experiencing).
How do latency and bandwidth relate to the cost model for the code in the benchmark?
When creating the model discussed in the post, we're using it to try to make a static prediction about how the code will execute.
Note that the goal of the post is not to merely measure the memory access performance, it's to understand the specific microarchitecture and how it might deliver the benefits that we see in benchmarks.
For example, what is the bandwidth and latency when you ask for the value at the same memory address in an infinite loop? And how does that compare to the latency and bandwidth of a memory module you buy on NewEgg?
When people use BW in their performance models, they don’t use only 1 bandwidth, but whatever combination of bandwidth makes sense for the _memory access pattern_.
So if you are always accessing the same word, the first access runs at DRAM BW, and subsequent ones at L1 BW, and any meaningful performance model will take that into account.
The concepts are still broadly valid; the naivety being referred to is the assumption that two non-adjacent memory reads will be twice as slow as one memory read or two adjacent reads.
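A toy version of that kind of model, sketched in Python. The bandwidth figures in the example call are placeholders chosen to make the arithmetic easy, not measurements of any real part, and the model ignores latency entirely.

```python
def same_word_time_ns(accesses, bytes_per_access, dram_bytes_per_ns, l1_bytes_per_ns):
    """Time to read the same word `accesses` times.

    The first read is charged at DRAM bandwidth, every later read at L1
    bandwidth, matching the access pattern described above.
    """
    if accesses == 0:
        return 0.0
    first = bytes_per_access / dram_bytes_per_ns
    rest = (accesses - 1) * bytes_per_access / l1_bytes_per_ns
    return first + rest


# Placeholder numbers: 8-byte word, 8 B/ns from DRAM, 64 B/ns from L1.
modelled = same_word_time_ns(9, 8, dram_bytes_per_ns=8, l1_bytes_per_ns=64)
dram_only = 9 * 8 / 8

print(modelled, dram_only)
```

With those numbers the mixed model gives 2 ns against 9 ns for a model that charges every access at DRAM bandwidth. That gap is why a single bandwidth figure is not enough.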
FWIW, I have a 2010 MBA which was _heavily_ used for years as a primary development system. The SSD only started to show signs of degraded performance last year and that wasn't massive. I would be quite surprised if the technology has become worse.
there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.
It's faster because it has more microarchitectural resources. It can load and store more, and it can do with a single core what an Intel part needs all cores to accomplish.