How are you defining memory performance and where are your supporting comparisons? This article only discusses the M1's behavior, and makes no comparisons to any other CPU.
All recent Intel Core i microarchitectures require full-vector-width loads to max out L1d bandwidth, because the load/store units don't actually care about the width of a load: a load takes the same issue slot whether it is 128 or 512 bits wide, as long as it doesn't cross a cache line (in which case the typical penalty is an additional cycle).
Using only 128-bit instructions on a core that has 512-bit hardware gets you a quarter of the available L1d bandwidth.
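
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a hypothetical core that can issue two loads per cycle regardless of width, with no load crossing a cache line. The port count and the function name are my own illustration, not figures for any specific CPU.

```python
def peak_l1d_load_bytes_per_cycle(load_ports: int, load_width_bits: int) -> int:
    """Peak L1d load bandwidth in bytes/cycle, assuming every load port
    issues one load per cycle no matter how wide the load is."""
    return load_ports * (load_width_bits // 8)


# Assumption for illustration only: two load ports.
LOAD_PORTS = 2

if __name__ == "__main__":
    full_width = peak_l1d_load_bytes_per_cycle(LOAD_PORTS, 512)
    for width_bits in (128, 256, 512):
        bw = peak_l1d_load_bytes_per_cycle(LOAD_PORTS, width_bits)
        # Fraction of the bandwidth reachable with 512-bit loads.
        print(f"{width_bits}-bit loads: {bw} bytes/cycle "
              f"({bw / full_width:.2f} of full-width peak)")
```

The number of loads per cycle is fixed by the ports, so the only way to move more bytes per cycle is to make each load wider. That is where the 4x gap between 128-bit and 512-bit code comes from.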