Sorry, but this is not at all how to write efficient matrix multiplication.
The correct answer is: don't. Use BLAS (yes, written in Fortran) + OpenMP (if needed), and let the smart people handle this. (Unless you are one of those smart people - in which case, you should improve BLAS, which the rest of us mortals will use).
The longer answer is: matrix multiplication is a highly non-trivial algorithmic problem[1], which is, in fact, still open.
The asymptotically fast algorithms (such as Strassen's[2]) outperform the naive algorithm (which, by definition, is the one in the article) when matrices get large.
After skimming the article it is still unclear to me how the optimized naive algorithm fares against Strassen's, for example.
The final bit of advice this article offers is reading the paper on... the implementation of BLAS (that's what Goto describes there).
And so that's basically the case where avoiding Goto is considered harmful.
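For concreteness, here is roughly what "just use BLAS" looks like in practice; a minimal sketch assuming NumPy/SciPy built against an optimized BLAS (OpenBLAS, MKL, ...):

    import numpy as np
    from scipy.linalg.blas import dgemm  # thin wrapper over Fortran DGEMM

    rng = np.random.default_rng(0)
    a = rng.standard_normal((2048, 2048))
    b = rng.standard_normal((2048, 2048))

    # C = alpha * A @ B; the blocking, vectorization and threading all
    # happen inside the underlying BLAS.
    c = dgemm(alpha=1.0, a=a, b=b)

    # The high-level spelling dispatches to the same BLAS routine.
    assert np.allclose(c, a @ b)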
1. None of the BLAS implementations I know of use Strassen's algorithm.
2. This is such a Stack Overflow answer, please make it stop. Q: "How do I do a thing?" A: "I have unilaterally decided your question is invalid and you should not do a thing." That's really useful!
To the author: hack away. Goto can probably write better ASM but tutorials like this are very helpful for people who'd just like to read some interesting case studies.
But the Stack Overflow answer is right, and accounts for the fact that in the majority of cases the person asking a confused question is indeed confused.
If what you are trying to do is to get best performance, make use of work that other people already did, rather than wasting your time on a solved problem.
If you want to learn about how efficient matrix multiplication can be implemented, that is a different problem.
The problem with Stack Overflow answers of the form "You asked about doing X, but it looks like you are trying to do X to achieve Y, and X is not a good way to achieve Y. You want to do Z. Here's how to do Z" is that later, when people who actually do need to do X are looking for help, their searches keep turning up these damned answers that don't tell how to do X.
SO needs a flag on questions that is set if there is an answer to the actual question as asked and clear otherwise, and that can be used as a search filter.
By the way "make use of work that other people already did" might be nice for getting something built fast but it may not be the best thing.
Any issue that involves writing fast code involves tradeoffs. If you sit down to write something new you may have different views on the tradeoffs than whoever wrote "the fastest" one.
Life ain't so black and white. In fact even these 'best of field' products tend to be ugly inside and unoptimized in places. (source: I develop linear algebra libraries)
Bonus: for low level ASM math, every "solved" problem (which by the way it wasn't) becomes unsolved the second Intel or AMD or whoever releases a new chip or coprocessor.
This is something I'm interested in contributing to. Can you name a few libraries (especially ones implementing new and interesting work) that would welcome open source contributors? Alternatively you can just contact me (via my profile info) if you're working on something in particular but would rather not be identified publicly.
"On my machine the code runs at around 100 Gflops, which is not far from the theoretical peak, and similar to Intel's optimized implementation."
So he knows that, for example, Intel provides/sells already optimized implementation. And he knows that there is GotoBLAS (1), because as you noted, he cites the paper about it.
But he explains the implementation present in Glow, "a machine learning compiler and execution engine for various hardware targets." (2)
Now if you want to prove to the authors of Glow why they should have "just used BLAS", I would like to read that discussion (or, even better, see proof-of-concept code with benchmarks, etc.). But I can imagine that in their case there were some reasons not to use it, as they were aware that it exists.
However, it is true that readers should be aware of the background assumptions and of the existence of these libraries.
I hardly think the purpose of this post was to show "hey, here's what you should do if you want to multiply matrices in production: write this from scratch". More like "learn how it is done: how a seemingly simple algorithm for a crucial problem is made fast on modern processors".
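(As an aside, the "around 100 Gflops ... not far from the theoretical peak" figure quoted above is easy to sanity-check. A back-of-the-envelope sketch, assuming a hypothetical single core with two 256-bit FMA ports at about 3.2 GHz in single precision; these are not the article author's actual machine parameters:)

    ghz = 3.2            # clock frequency (assumed)
    fma_ports = 2        # FMA units per core (typical Haswell/Skylake client)
    simd_lanes = 8       # float32 lanes in a 256-bit AVX2 register
    flops_per_fma = 2    # one fused multiply-add = 2 floating-point ops

    peak_gflops = ghz * fma_ports * simd_lanes * flops_per_fma
    print(peak_gflops)   # ~102 Gflop/s per core, so ~100 Gflops is near peak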
Could it, perhaps, be a tutorial that imparts some useful knowledge instead of being a suggestion that everyone should roll their own dense linear algebra? I mean, most tutorials shouldn't really exist - why would you code anything, when there is a library for it?
The problem with this page (for me) is that it's hard to evaluate as a tutorial. It reads like describing an idea the author came up with, not how state of the art implementations work.
It ends with citing and recommending a paper by another person; that doesn’t sound like “an idea the author came up with”! It’s the author trying to describe the state of the art, as they see it.
A hand-rolled routine where you benchmark different indexing orders, strides, and interleaving for a particular matrix size and fixed cache sizes has the potential to outperform BLAS. In general, use BLAS.
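To make "indexing orders, strides, interleaving" concrete, a rough sketch of the kind of loop-order and cache-blocking experiment meant here (Python for brevity; a real kernel would be C or intrinsics, square matrices are assumed, and the block size is a placeholder you would tune per cache):

    import numpy as np

    def matmul_blocked(A, B, bs=64):
        """Naive O(n^3) multiply, i-k-j loop order, bs x bs cache blocking."""
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for ii in range(0, n, bs):
            for kk in range(0, n, bs):
                for jj in range(0, n, bs):
                    j_end = min(jj + bs, n)
                    for i in range(ii, min(ii + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            # innermost access walks rows of B and C
                            # contiguously (row-major friendly)
                            C[i, jj:j_end] += A[i, k] * B[k, jj:j_end]
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(matmul_blocked(A, B), A @ B)

Both the i-k-j order and the block size are exactly the knobs you would benchmark per machine.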
Those sub-cubic algorithms require the matrix to be very-very-very large to recoup the extra constant costs associated with them.
> Those sub-cubic algorithms require the matrix to be very-very-very large to recoup the extra constant costs associated with them.
Wikipedia disagrees:
"In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for small matrices, for matrices that are not at all square, and without requiring workspace beyond buffers that are already needed for a high-performance conventional multiplication."
I used to use ATLAS quite a bit, which effectively benchmarks your CPU and memory configuration at installation time to empirically determine where to switch from 'small n' algorithms to asymptotically optimal algorithms, like Strassen.
Of course, if you only put resources into optimizing the cubic-time version, Strassen will look slow by comparison. Better to take a serious look at both and empirically figure out where the trade-off point is...
For small, dense matrices you're not going to beat the cubic algorithm. This is the most common case in a lot of applications.
For sparse matrices, use a sparse version of the cubic algorithm. For large dense matrices, those sub-cubic algorithms might be better, but n would have to be so large that you'd have to factor in the cost of doing it out of core.
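For reference, a minimal sketch of Strassen with an empirical crossover threshold of the sort ATLAS-style tuning would pick, assuming square power-of-two matrices; the cutoff of 128 below is an arbitrary placeholder, not a measured value:

    import numpy as np

    def strassen(A, B, cutoff=128):
        """Strassen's algorithm; falls back to the conventional product
        below `cutoff` (in a tuned library this threshold is benchmarked)."""
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

        # 7 recursive half-size products instead of 8
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)

        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(512, 512); B = np.random.rand(512, 512)
    assert np.allclose(strassen(A, B), A @ B)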
> The asymptotically fast algorithms (such as Strassen's[2]) outperform the naive algorithm (which, by definition, is the one in the article) when matrices get large.
This is true only if approaching the matrix multiplication problem purely from a theoretical perspective. But the reality is that computers have a specific architecture designed to be as fast as possible, while still respecting the laws of physics.
Libraries such as BLAS deal with those architectures, and it happens that there, an O(n³) triple loop with the optimizations described in the Goto paper is what gives you the fastest performance.
Maybe if we were dealing with 1b-by-1b matrices, Strassen-like algorithms would start to shine, but at those sizes you already exceed the memory of a single computer, at which point you start looking at parallel algorithms on clusters/supercomputers, where the algorithms become very different (see things such as SUMMA: Scalable Universal Matrix Multiplication Algorithm). Even in those cases, the computation happening on each processor would still be the fast O(n³) BLAS code.
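A quick, hedged way to see that gap on your own machine (absolute numbers vary wildly, part of the gap below is interpreter overhead, and the point is only the orders of magnitude between an untuned triple loop and the tuned O(n³) BLAS path):

    import time
    import numpy as np

    def naive_matmul(A, B):
        # textbook i-j-k triple loop: no blocking, no vectorization
        n, m, p = A.shape[0], A.shape[1], B.shape[1]
        C = np.zeros((n, p))
        for i in range(n):
            for j in range(p):
                s = 0.0
                for k in range(m):
                    s += A[i, k] * B[k, j]
                C[i, j] = s
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)

    t0 = time.perf_counter(); C1 = naive_matmul(A, B); t_naive = time.perf_counter() - t0
    t0 = time.perf_counter(); C2 = A @ B;              t_blas = time.perf_counter() - t0

    assert np.allclose(C1, C2)
    print(f"naive: {t_naive:.2f}s   BLAS-backed: {t_blas:.5f}s")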
Oh, interesting point! Yes let's just use OpenBLAS, no need for anyone but one or two people to understand these things. Sure.
Okay so let's have a look at OpenBLAS's Skylake kernel. Maybe we can learn some deep wisdom this author hasn't considered! Go and have a look for it, I'll wait here.
Ah, what's that? THERE IS STILL NO SKYLAKE KERNEL IN OPENBLAS? Interesting.
In fact, if you do "sudo apt-get install libopenblas" on Ubuntu 16.04 (the one every deep learning tutorial will be recommending), you'll find you get version 0.2.18 of OpenBLAS. You'll find that this version of the software is quite old now, and it has a bug that causes it to fail to detect your CPU correctly. You'll find that instead of using any AVX instructions, it'll fall back to the Prescott kernel. You'll also find this is really fucking slow.
In summary:
a) You have no idea what you're talking about;
b) Even if you were right on these points of detail, your point of view is still terrible;
c) Please avoid being so belligerently wrong in future.
Your comment here broke the site guidelines in several ways, including personal attack, which is the worst way to react to someone who you know or believe is wrong. In doing that you not only poison the well of this community, you discredit the truth you're trying to express.
Such indignant comments routinely get upvoted, but that's an unfortunate weakness of homo internetus, not evidence of a good comment. That's why we have guidelines that people need to follow here. Please (re-)read https://news.ycombinator.com/newsguidelines.html and abide by these rules when posting. They are based on many years of experience and every one is there for good reason.
Let me add that there exists a similar echo chamber when it comes to questions regarding cryptography. Way too often, questions about how to do this or that in cryptography are put down with a "you don't" or the generic "don't roll your own crypto", which is hostile to curiosity, learning, and knowledge, and on top of that unfriendly towards the person asking. Often when I read it, I feel the hacker ethos is being insulted.
While I think there are reasonable arguments for the security of (certain) self-made crypto, taking that security aspect completely aside, we need new people in every field, and cryptography is no exception. We should embrace newcomers, not tell them not to do what would teach them so much. Admittedly, one can easily mess up, but then explain how and what to take care of instead of just telling them not to try anything at all.
I feel like the admonition, "don't roll your own crypto" refers to production code. If you want to develop your own crypto for "curiosity, learning and knowledge" I don't think many people in the security community will really bother you about it. Just don't use it for securing anything in production unless it's been thoroughly audited and you have a reason not to use something that already exists.
It seems pretty obvious why you shouldn't design or implement novel cryptography in production without good reason and significant expertise. It also doesn't seem like the people saying this need to spell out that this doesn't preclude learning exercises or legitimate research which won't endanger any real world information. So what's the controversy?
If you want to indulge curiosity, learn, and obtain knowledge, concentrate on breaking cryptography. Even relatively basic crypto attacks will teach you far more about how cryptography works than implementing well-known constructions for the nth time will.
The other thing is that people think "I'm not rolling my own crypto" because they're building their own construction out of crypto primitives someone else wrote and having tons of side-channel attacks as a result.
... or, worse yet, they're using crypto implemented by all the people who didn't listen to the advice not to roll their own crypto.
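A classic instance of the "composing primitives" failure mode mentioned above is verifying a MAC with an ordinary equality check, which can leak timing; a small sketch of the wrong and right way (key and message are placeholders):

    import hashlib
    import hmac

    key = b"\x00" * 32          # placeholder key, for illustration only
    msg = b"some message"
    tag = hmac.new(key, msg, hashlib.sha256).digest()

    def verify_bad(msg, claimed_tag):
        # == can short-circuit on the first differing byte, so the time it
        # takes can leak how much of a forged tag was correct
        return hmac.new(key, msg, hashlib.sha256).digest() == claimed_tag

    def verify_good(msg, claimed_tag):
        # constant-time comparison from the standard library
        expected = hmac.new(key, msg, hashlib.sha256).digest()
        return hmac.compare_digest(expected, claimed_tag)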
Better not let the kids in the workshop. They could think the tools lying around there are made by professionals so they'd be safe, resulting in them getting hurt. And the metal and wooden dust from the saw someone used there could get into the wound and that could get nasty. Really, better just tell children to never go into workshops, or tinker around at all, for that matter.
Your comment serves as a poster child for what I'm talking about. Arrogant, looking down on others, deriding their efforts.
Crypto is really, really easy to screw up. If you screw it up in production software, the consequences can be anything from "fifteen million people get their credit cards stolen" to "a teenager in Saudi Arabia gets beheaded for treason".
People have every right to learn about and experiment with crypto, but it's just too important to treat as a plaything. If you're marketing your software as "cryptographically secure", you're asking users to place their trust in you to keep their secrets safe. That's a tremendous responsibility and we should treat it with the utmost seriousness. People who should know better keep shipping software with utterly broken DIY crypto, so we need to keep hammering home the message that you should never, ever roll your own crypto in production software.
If someone asked questions on StackOverflow about how to perform brain surgery with kitchen implements, we might want to indulge their curiosity, but we have a moral obligation to say definitely don't do this, because performing brain surgery in your kitchen is an outstandingly terrible idea and someone will probably die. Right now, the software industry is suffering from an epidemic of kitchen brain surgery and it urgently needs to stop.
You're reading some weird things into what I wrote. There are some real concerns with implementing your own crypto and getting it wrong, but the people listening to "Don't roll your own crypto" should be ignoring it, and the people who shouldn't be rolling their own crypto already are ignoring it, so it's useless advice. If you want to roll your own crypto, be conscientious about it, that's all.
And I have no need to deride the people who shouldn't be writing crypto. The exploits are factual and speak for themselves.
Having respect for certain things is healthy, and learning things under (expert) supervision is generally effective and safe. Think about cars or electricity. Most societies have some kind of mandatory training for both.
Do most hackers have the patience for learning the basics, i.e. somewhat complicated maths that could take years to learn? No, most of the time we're programmers, and we tend to just start writing code. That's what "don't roll your own crypto" means. Nobody's saying "don't learn about Galois fields" or "don't run through cryptopals".
The Dunning-Kruger effect is real. I've seen broken crypto getting shipped. Or people releasing dangerous crypto under "hey, I made this perfectly functional library that even has documentation how to use it, but it's just a learning project". So while that's still happening, I'm fine with hearing "don't roll your own crypto" ad nauseum.
The ad hominem and the profanity aside, that comment isn't even correct about OpenBLAS.
There's nothing interesting in the lack of Skylake support, as is obvious from the open issue for KNL support; that issue might even be interesting for the pointers it contains to KNL and SKX GEMM.
OpenBLAS 0.2.18 certainly does use AVX(2) on appropriate CPUs. If it doesn't detect micro-architectures that didn't exist at the time, that's hardly a bug, and the correct solution would be to persuade the Ubuntu maintainers to provide the "hardware enablement", or to use BLIS. If tutorials tell you to install "libopenblas", they're wrong, because that package doesn't exist.
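For anyone who would rather check than argue, a hedged sketch of how to see which BLAS NumPy actually uses and what OpenBLAS detected; openblas_get_config() is an OpenBLAS C entry point, and the soname below is just an example that varies by distro and package:

    import ctypes
    import numpy as np

    # Which BLAS/LAPACK was NumPy built against (OpenBLAS, MKL, reference, ...)?
    np.show_config()

    # If OpenBLAS is the backend, ask it what it detected / was built with.
    lib = ctypes.CDLL("libopenblas.so.0")   # adjust name/path for your system
    lib.openblas_get_config.restype = ctypes.c_char_p
    print(lib.openblas_get_config())        # version, target/core, max threads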
[1]https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
[2]https://en.wikipedia.org/wiki/Strassen_algorithm
[3]https://en.wikiquote.org/wiki/Donald_Knuth