
One could argue that it's not "fundamental", but it's definitely a functional limitation of the current Intel cores. The memory bandwidth of a single core is hardware-limited by the number of "Line Fill Buffers" (LFBs). Each buffer tracks one outstanding L1 cacheline miss, so the number of LFBs limits the memory-level parallelism (MLP). Little's Law gives the relationship between latency, outstanding requests, and throughput. With 10 LFBs and current memory latencies, it's physically impossible for a single core to use all available memory bandwidth, especially on machines with more than 2 memory channels.
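The arithmetic behind that claim can be sketched with Little's Law (concurrency = throughput × latency, so peak throughput = outstanding requests / latency). The cacheline size, LFB count, and ~90 ns latency below are illustrative assumptions, not measured values:

```python
# Little's Law applied to demand misses: each Line Fill Buffer (LFB)
# holds one in-flight cacheline, so the ceiling on single-core
# bandwidth is (LFBs * cacheline size) / memory latency.
CACHELINE_BYTES = 64     # typical x86 cacheline (assumed)
NUM_LFBS = 10            # per-core LFB count from the comment above
MEM_LATENCY_S = 90e-9    # ~90 ns DRAM load latency (illustrative)

max_bw = NUM_LFBS * CACHELINE_BYTES / MEM_LATENCY_S
print(f"Demand-miss bandwidth ceiling: {max_bw / 1e9:.1f} GB/s")
```

At ~7 GB/s per core under these assumptions, a socket with several memory channels (dual-channel DDR4-3200 alone offers ~51 GB/s) clearly cannot be saturated by one core's demand misses; hardware prefetch raises the practical number somewhat but the LFB-imposed ceiling remains.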

The M1 chip allows higher MLP, presumably because it has more LFBs per core (or maybe it uses a different approach where the LFBs are not per-core?). I apologize for using so many abbreviations. I searched for a better introduction but didn't find anything perfect. I did come across this thread, which (apparently) I started several years ago while trying to understand what was happening: https://community.intel.com/t5/Software-Tuning-Performance/S....




I came across someone else mentioning a similar per-core bandwidth constraint w.r.t. LFBs a month back:

https://news.ycombinator.com/item?id=25221968



