One could argue that it's not "fundamental", but it's definitely a functional limitation of the current Intel cores. The memory bandwidth of a single core is limited in hardware by the number of "Line Fill Buffers" (LFBs). Each buffer tracks one outstanding L1 cacheline miss, so the number of LFBs caps the memory-level parallelism (MLP). "Little's Law" gives the relationship between latency, outstanding requests, and throughput. With 10 LFBs and current memory latency, it's physically impossible for a single core to use all available memory bandwidth, especially on machines with more than 2 memory channels.
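To make the Little's Law argument concrete, here's a back-of-the-envelope sketch. The latency and channel numbers are illustrative assumptions, not measurements from any specific part:

```python
# Little's Law for memory: throughput = outstanding_requests / latency.
# Assumed (illustrative) numbers: 10 LFBs, 64-byte cache lines,
# ~90 ns load-to-use DRAM latency.
lfbs = 10
line_size_bytes = 64
latency_ns = 90.0

# Max single-core bandwidth with every LFB occupied.
# bytes per nanosecond is numerically equal to GB/s.
core_bw_gbs = lfbs * line_size_bytes / latency_ns
print(f"single-core ceiling: {core_bw_gbs:.1f} GB/s")

# Peak bandwidth of a hypothetical 2-channel DDR4-3200 system:
# 2 channels * 3200 MT/s * 8 bytes per transfer.
system_bw_gbs = 2 * 3.2 * 8
print(f"2-channel DDR4-3200 peak: {system_bw_gbs:.1f} GB/s")
```

With these numbers the core tops out around 7 GB/s against a ~51 GB/s system peak, which is why one thread can't saturate the memory controllers; adding channels widens the gap without helping the single core at all.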
The M1 chip allows higher MLP, presumably because it has more LFBs per core (or maybe it uses a different approach where the LFBs are not per-core?). I apologize for using so many abbreviations. I searched to try to find a better intro, but didn't find anything perfect. I did come across this thread that (apparently) I started several years ago, at the point where I was trying to understand what was happening: https://community.intel.com/t5/Software-Tuning-Performance/S....