> There is no Apple secret sauce. Except for secret sauce like super wide instru...

menaerus · on Oct 29, 2023

Super wide instruction decode won't help you much unless you're able to feed and retire those instructions at a consistent pace. This means being able to keep your ALUs busy. There are plenty of problems one has to solve for that to happen, but the two major bottlenecks in CPU design are (1) branch prediction in the CPU frontend and (2) hiding memory latency in the CPU backend. Both of those are tightly coupled to the instruction- and data-cache design.

Coincidentally, both of those caches in the Apple M design are unusually large - 192KB of instruction cache and 128KB of L1 data cache - per core (!). The same goes for the L2 cache - 3MB per (performance) core.

When compared to bleeding edge _server_ CPUs from AMD and Intel, it's crazy to see that those figures are _several times_ larger in the Apple M design. E.g. Zen3 Epyc - 32KB of instruction cache, 32KB of L1 data cache and 512KB of L2. Intel Xeon Gold - 32KB of instruction cache, 48KB of L1 data cache and 1.25MB of L2.
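
To put rough numbers on the gap, here's a quick sketch that divides the Apple M figures by the Zen3 Epyc ones quoted above. The dictionary names and the `size_ratios` helper are made up for illustration, and the sizes are simply the ones from this comment, not measured values.

```python
# Per-core cache sizes in KB, taken from the figures quoted in this comment.
apple_m = {"L1I": 192, "L1D": 128, "L2": 3 * 1024}
zen3_epyc = {"L1I": 32, "L1D": 32, "L2": 512}


def size_ratios(a, b):
    """Return how many times larger each cache level in `a` is than in `b`."""
    return {level: a[level] / b[level] for level in a}


ratios = size_ratios(apple_m, zen3_epyc)
for level, ratio in ratios.items():
    print(f"{level}: {ratio:.1f}x")
```

So against Zen3 that works out to 6x for the instruction cache, 4x for L1 data and 6x for L2.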