I hardly think the purpose of this post was to say "hey, here's what you should do if you want to multiply matrices in production: write this from scratch". It was more like "learn what it takes to make a seemingly simple algorithm, for a crucial problem, fast on modern processors".