Writing the fastest matrix multiplication you can is pretty fun. It probably won't be as fast as BLAS or MKL, but if you get close to them, it's a rewarding experience.
> see what else can be done I recommend reading this paper by Kazushige Goto
After making the easiest change, loop reordering, which already gives significant perf improvements, the linked paper ("Anatomy of High-Performance Matrix Multiplication") is really excellent.
It walks you through all the optimizations, like tiling to reduce L1, L2, L3, and TLB cache misses, and leveraging vectorization.
Then, for squeezing out even more perf, I remember there are things like software prefetching and loop unrolling, which you would expect the compiler, with the best optimization flags targeted at the native architecture, to take care of automatically. But you realize that's not necessarily true.
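To make the loop-reordering step concrete, here's a minimal C++ sketch (my own illustration, not from the paper): the naive i-j-k order walks B down a column, which is strided and cache-hostile in row-major storage, while i-k-j makes the inner loop read B and write C at unit stride.

#include <cstddef>
#include <vector>

// C += A * B for square n x n row-major matrices; C must be zero-initialized.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = A[i * n + k];        // reused across the whole j loop
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += aik * B[k * n + j]; // unit-stride reads and writes
        }
}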
If your particular matrix problem has special properties that aren't covered by BLAS, you can wring out an apparently impossible improvement. I once found a matrix problem (periodic cubic splines) which turned out to have a solution in O(N): the matrix columns were related in such a way that the whole matrix could be replaced by a vector.
That paper is light on details, but, for example, in R you can define a function for Toeplitz multiplication as below. This has given the matrix multiplication part of my code a >100x speedup before.
# Multiply a (symmetric) Toeplitz matrix A by a vector v in O(n log n):
# embed A in a circulant matrix and use the FFT (circular convolution).
# Requires the pracma package for ifft().
'%t*%' <- function(A, v) {
  n <- nrow(A)
  # First row of the length-2n circulant embedding: the first row of A,
  # a zero, then the first row reversed (dropping its first element).
  x <- as.matrix(c(A[1,], 0, A[1,][n:2]))
  p <- c(v, rep(0, n))              # zero-pad v to length 2n
  h <- as.vector(fft(p) * fft(x))   # pointwise product = circular convolution
  out <- Re(pracma::ifft(h)[1:n])   # the first n entries give A %*% v
  return(matrix(out, n))
}
all.equal(A %t*% v, A %*% v) #TRUE
I had a problem of inverting an N x N upper bidiagonal matrix a while ago. I spent ten minutes fruitlessly googling for papers about efficient algorithms for it, then thought about it for thirty seconds, and realised you can do it straightforwardly in N divisions and N subtractions.
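Concretely, a minimal sketch, taking "inverting" to mean applying the inverse to a vector, i.e. solving U x = b by back substitution (with a general superdiagonal you also pick up N-1 multiplications):

#include <cstddef>
#include <vector>

// Solve U x = b where U is upper bidiagonal with main diagonal d[0..n-1]
// and superdiagonal e[0..n-2]. One upward pass: n divisions and n-1
// multiply-subtracts. The solution overwrites b.
std::vector<double> solve_upper_bidiagonal(const std::vector<double>& d,
                                           const std::vector<double>& e,
                                           std::vector<double> b) {
    const std::size_t n = d.size();
    b[n - 1] /= d[n - 1];                        // last unknown comes directly
    for (std::size_t i = n - 1; i-- > 0; )       // i = n-2, ..., 0
        b[i] = (b[i] - e[i] * b[i + 1]) / d[i];  // eliminate the superdiagonal term
    return b;
}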
I can attest to this too. I was recently working with a massive, sparse binary matrix, with only 0.1% of elements having values; it was orders of magnitude faster to index and sum the elements than to do a dot product. Not even a Tesla GPU was close to the performance of the custom matrix operation we wrote.
A lot of matrix routines will have sparse matrix support. Though often the line between wanting a sparse matrix and wanting some kind of adjacency-list graph structure can be pretty fine...
Also, only tangentially related, but if you can change your problem formulation to make the matrix of interest sparse(r), that can be a huge win. Two things that have helped me in the past:
- Rows or columns that have mostly the same (nonzero) coefficient throughout. You can normally do some simple substitution to turn them into zeros.
- Rows or columns that are naturally very similar to each other. You can often replace `y` with `y-x`, or replace `x[i]` with `x[i]-x[i+1]`, to make one (or `n-1`) of the columns mostly empty (see the worked example below).
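To spell out why the second trick works, write the product column-wise as `A x = a1*x1 + a2*x2 + ...`. If columns `a1` and `a2` are nearly identical, then `a1*x1 + a2*x2 = a1*(x1 + x2) + (a2 - a1)*x2`, so after the change of variables `y1 = x1 + x2`, `y2 = x2`, the second column becomes `a2 - a1`, which is mostly zeros.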
We tested it against TensorFlow and SciPy sparse matrices and beat those as well, mostly because those general-purpose sparse matrices do not optimize for the binary case (which eliminates the need for multiplication).
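The idea, as a simplified sketch (not our actual code): a CSR-like layout with no value array, since every stored entry is 1.

#include <cstddef>
#include <vector>

// A binary sparse matrix stored as per-row lists of column indices.
// y = A * x reduces to gathering and summing entries of x: no multiplies,
// and no value array to load.
struct BinarySparseMatrix {
    std::vector<std::vector<std::size_t>> rows;   // rows[i] = columns j with A[i][j] == 1

    std::vector<double> multiply(const std::vector<double>& x) const {
        std::vector<double> y(rows.size(), 0.0);
        for (std::size_t i = 0; i < rows.size(); ++i)
            for (std::size_t j : rows[i])
                y[i] += x[j];                     // index-and-sum instead of dot product
        return y;
    }
};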
Somewhere in the conversation, Leslie Valiant's paper "General context-free recognition in less than cubic time" is worth mentioning. It has an informative and clear discussion of the complexity of matrix multiplication in areas such as grammars and parsing.
For anyone working with linear algebra in Node.js, I wrote C++ bindings to BLAS a couple of years ago [0]. I also started working on bindings to LAPACK, but never had the time to finish them [1].
This looks like it's only for 3x3 and 4x4 matrices, e.g. what you usually use in 3d graphics. Or is it possible to do arbitrary matrix multiplication with this lib?
Thanks! I find it weird that performance-focused JavaScript projects don't have live benchmarks up so you can judge for yourself. Personally, I've found that a semi-clever matrix multiplier in C++ compiled with Emscripten can give about 1.5 gflops (single-threaded on a 2.5 GHz i7 MacBook Pro), and existing JS libs like Sushi are a lot slower.
How would this change if you were dealing with more sparse matrices? I'm used to using Spark MLlib distributed to do matrix multiplication on large sparse coordinate matrices, and it works fine for me, but I don't think I need the level of optimality this approach would demand.
Dense matrix algebra has the benefit that besides the matrix dimensions, the data itself has no impact on performance: the access pattern is entirely predictable and known as soon as you know the matrix sizes (side note: which is why just-in-time libraries such as libxsmm exist). For most cases, you can rely on MKL or Eigen to produce near-optimal performance for your dense linear algebra operations.
Sparse matrices do not have that luxury. Sparse matrix ops have access patterns that entirely depend on your data. You simply cannot make a library that covers every possible distribution of non-zeroes in an optimal way. At best you have a toolbox of a large number of representations and algorithms that you choose from for each different use case. Engineering simulations and solvers can have block or cluster patterns of non-zeroes, which can be exploited by doing dense operations for the dense blocks. Large real-world graphs often have an exponential non-zero distribution and are often extremely sparse, which might need a hash-based join algorithm for an axpy kernel. Are you running a tight CG-solver loop? Then you might want to bundle multiple spmv kernels to operate on multiple vectors over the same matrix (for similar reasons as stated in the original article). You are distributing with Spark? Unless your non-zero distribution is uniform, your matrix is really hard to partition evenly. If your matrix never changes, you may consider preprocessing using a partitioner (METIS, Mondriaan).
In short, it is highly unlikely that the library you are using is anywhere near optimal for your use case. Do explore alternatives based on your domain insights; it is not uncommon to net double-digit performance improvements.
> Dense matrix algebra has the benefit that besides the matrix dimensions, the data itself has no impact on performance
I was very surprised when I first learned that's not strictly true, because of denormal numbers[1]. Here's a session in ipython --pylab (with MKL), demonstrating a 200x slow-down for matrix multiplication with tiny numbers. Crazy!
In [1]: A = randn(1000, 1000)
In [2]: %timeit A @ A
25.9 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: A *= 1e-160
In [4]: %timeit A @ A
5.52 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You hit denormal numbers more quickly with single-precision floats. I have been bitten by this issue in the wild a couple of times now, and seen a couple of other people with it too. Sometimes denormals are created internally in algorithms, when you didn't think your input matrices were that small.
Yes! For some applications, including machine learning, you can set the option to flush denormals to zero so you don't hit this case.
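On x86 this is the FTZ/DAZ bits in the MXCSR register; here's a minimal C++ sketch using the standard SSE intrinsics (per-thread state, and it changes floating-point semantics, so apply it deliberately):

#include <immintrin.h>

// Flush-to-zero: denormal *results* are replaced with zero.
// Denormals-are-zero: denormal *inputs* are treated as zero.
void enable_denormal_flushing() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}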
Infinity and NaN can also be a performance problem but it depends on your processor. Though of course if you get those during matrix multiplication something has probably gone wrong.
The point of NaNs and infinities is that sometimes it is easier/faster/cleaner to just let those things propagate through the calculation than to do conditional logic that, in the end, tends to just simulate the same effect.
True, that consideration applies more to element-by-element operations than to things like matrix multiplications. But depending on the operation, it might make sense to rely on NaN propagation rather than trying to prescan your matrix for NaNs.
It's just incredible that there's still no way to force denormals to be rounded silently to zero on x86. You have to go out of your way to check for them, or face horrific performance penalties for a feature you didn't even want or need.
Sparse matrix multiplication is generally considered to be a different problem. Sparse matrices need to be represented differently (See https://en.wikipedia.org/wiki/Sparse_matrix for an introduction). There has been recently a lot of research on how to do sparse matrix multiplication efficiently on manycore architectures. See this 2018 paper for example: https://arxiv.org/pdf/1804.01698.pdf. Note that sparse matrix-vector multiplication is also a different problem.
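For reference, a minimal sketch of CSR (compressed sparse row), one of the standard representations from that Wikipedia article; sparse matrix-vector multiplication then costs O(nnz) rather than O(n^2):

#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<double> vals;          // non-zero values, stored row by row
    std::vector<std::size_t> cols;     // column index of each stored value
    std::vector<std::size_t> row_ptr;  // row i occupies [row_ptr[i], row_ptr[i+1])
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i)
        for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
            y[i] += A.vals[p] * x[A.cols[p]];   // only touch stored non-zeros
    return y;
}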
That's fascinating. I wonder how much coordinate matrix multiplication in Spark takes into account factors like Cholesky decomposition. I noticed most of those libraries are written in C++ or Fortran; are JVM-based approaches generally discouraged in this realm? It fits well with our current stack, based mostly around Spark.
I don't know much about Java. But, for example, if you want to implement the technique from the link above in Java, you'd need LoadFloat8, BroadcastFloat8, and AdduFloat8. I'm not sure how they are implemented in Glow, but they can be implemented using inline assembly. You'd need something like a SIMD intrinsics library for Java to access these crucial x86 SIMD instructions.
Matrix multiplication involves moving across one matrix in row order and the other in column order, so it turns out that choosing row-major or column-major storage makes no difference: one of the two accesses is always against the grain. I think matrix multiplication is one of the best examples of a deceptively simple problem. It shows how far ALL code is from peak performance. We can only strive to be within an order of magnitude or two of peak performance.
This trick does work. If the matrices are in row-major order, you transpose B in memory and then compute A * (B^T)^T. This multiplication reads both matrices in row order.
However, while this does improve performance over the naive algorithm, it's still not as good as a tiling algorithm.
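For illustration, a minimal sketch of the trick for row-major square matrices: pay O(n^2) once to build B^T, and every inner product then runs along two rows at unit stride.

#include <cstddef>
#include <vector>

// C = A * B, all n x n row-major.
void matmul_transposed(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, std::size_t n) {
    std::vector<double> Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)           // build B^T once: O(n^2)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)   // both operands read in row order
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
}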
I've found that this causes problems when one of the matrix dimensions is not a multiple of the cache line size. It's common on GPUs to allocate more elements than there are in the dimension; Nvidia calls this the leading dimension, and it must be greater than or equal to the matrix dimension. When that is the case, the transpose trick doesn't quite work anymore.
I took a different lesson from learning to implement matrix multiply: you can get a small amount of code within a few percent of peak performance if you try. So, the trick is to keep the 99% of the code that does 10% of the work within an order of magnitude of peak performance, and then get the rest within a factor of two.
That gets the system within a factor of three, which is still embarrassing, but that’s what profiling tools are for. :-)
But it is a reason why the matrix is not always a great abstraction. That is, the internal layout of a matrix interacts with the use case enough that (sometimes) users will want explicit control over that layout.
BLIS actually tells you how to write a fast production large-matrix GEMM, and the papers linked from https://github.com/flame/blis would be a better reference than the Goto and van de Geijn paper.
For small matrices see the references from https://github.com/hfp/libxsmm but you're not likely to be re-implementing that unless you're serious about a version on non-x86_64.
I'm surprised to see no discussion of faster-than-naive algorithms like Strassen[1] or Coppersmith-Winograd[2], which very counterintuitively run faster than N^3.
Note: Only helpful for specific or very large matrices.
The problem with those algorithms is that they define "work" as "how many multiplications do I have to do?" Addition and memory accesses are defined by the analysis to be free. And you end up doing 4-5 times as many floating point additions as the naive algorithm, and the memory accesses are decidedly non-cache-friendly. And modern CPUs have two multiplication units but only one addition unit, so multiplications are actually faster than additions. Plus, there are ergonomic issues with requiring square power-of-two matrices.
Matrices have to be truly spectacularly large to benefit from Strassen's, and applications of the asymptotically faster algorithms don't really exist. The matrices have to be so large that they're computationally too expensive regardless of the asymptotic speedup. You're talking about dispatching work units to a cluster at that point.
Most importantly though, the discussion is... boring. It's just "yup, we do Strassen's, then we quickly reach the point where we go back to the naive algorithm, and things are interesting again."
> Matrices have to be truly spectacularly large to benefit from Strassen's
I don't have personal experience with other mmult methods, but Wikipedia says Strassen's algorithm is beneficial even at matrix sizes that fit in cache. I have had occasion to deal with matrices of large dimension (though usually sparse, and usually not square), and those numbers don't seem crazy at all. Are they that far off base? (Wikipedia does say that Coppersmith-Winograd tends to never be a useful improvement.)
In any case, perhaps a friendlier response might be to point out that Strassen's algorithm tends to "bail out" into regular old matrix multiplication for small `n` anyway, so this still helps provide a speedup. A toy sketch of that structure follows.
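For concreteness, a toy sketch of Strassen with a cutoff (my own code, assuming square matrices with n a power of two; real code would pad or peel edge cases instead):

#include <cstddef>
#include <vector>

using Mat = std::vector<double>;   // dense row-major storage

static Mat add(const Mat& a, const Mat& b) {
    Mat c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

static Mat sub(const Mat& a, const Mat& b) {
    Mat c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] - b[i];
    return c;
}

Mat strassen(const Mat& A, const Mat& B, std::size_t n) {
    if (n <= 64) {                               // cutoff: classical O(n^3) multiply
        Mat C(n * n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        return C;
    }
    const std::size_t h = n / 2;
    auto blk = [&](const Mat& M, std::size_t r, std::size_t c) {  // copy h x h block
        Mat out(h * h);
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                out[i * h + j] = M[(r + i) * n + (c + j)];
        return out;
    };
    Mat A11 = blk(A, 0, 0), A12 = blk(A, 0, h), A21 = blk(A, h, 0), A22 = blk(A, h, h);
    Mat B11 = blk(B, 0, 0), B12 = blk(B, 0, h), B21 = blk(B, h, 0), B22 = blk(B, h, h);
    // Strassen's seven half-size products instead of the classical eight.
    Mat M1 = strassen(add(A11, A22), add(B11, B22), h);
    Mat M2 = strassen(add(A21, A22), B11, h);
    Mat M3 = strassen(A11, sub(B12, B22), h);
    Mat M4 = strassen(A22, sub(B21, B11), h);
    Mat M5 = strassen(add(A11, A12), B22, h);
    Mat M6 = strassen(sub(A21, A11), add(B11, B12), h);
    Mat M7 = strassen(sub(A12, A22), add(B21, B22), h);
    Mat C(n * n);
    auto put = [&](const Mat& M, std::size_t r, std::size_t c) {  // write h x h block
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                C[(r + i) * n + (c + j)] = M[i * h + j];
    };
    put(add(sub(add(M1, M4), M5), M7), 0, 0);    // C11 = M1 + M4 - M5 + M7
    put(add(M3, M5), 0, h);                      // C12 = M3 + M5
    put(add(M2, M4), h, 0);                      // C21 = M2 + M4
    put(add(sub(add(M1, M3), M2), M6), h, h);    // C22 = M1 - M2 + M3 + M6
    return C;
}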
Rather than these, the original Winograd algorithm ("A New Algorithm for Inner Product", 1968) is more useful in practice, if not for matrix multiplication in software on the typical CPU architecture. This decomposition is directly related to the Winograd minimal filtering algorithms for fast convolution used in convnets today. Also, systolic array structures for matrix multiplication in hardware exist that take advantage of this reformulation to reduce the number of multiplies, as the cost in hardware is substantially lower for addition versus multiplication.
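For reference, a small sketch of the 1968 reformulation (my own code): the two subtracted sums each depend on only one vector, so in a matrix product they can be precomputed once per row and once per column, roughly halving the multiplications.

#include <cstddef>
#include <vector>

// Winograd (1968): x . y = sum_j (x[2j] + y[2j+1]) * (x[2j+1] + y[2j])
//                          - sum_j x[2j] * x[2j+1] - sum_j y[2j] * y[2j+1]
double winograd_dot(const std::vector<double>& x, const std::vector<double>& y) {
    double cross = 0.0, xs = 0.0, ys = 0.0;
    for (std::size_t j = 0; j + 1 < x.size(); j += 2) {
        cross += (x[j] + y[j + 1]) * (x[j + 1] + y[j]);
        xs += x[j] * x[j + 1];     // depends on x only: precomputable per row
        ys += y[j] * y[j + 1];     // depends on y only: precomputable per column
    }
    double tail = (x.size() % 2) ? x.back() * y.back() : 0.0;  // odd-length leftover
    return cross - xs - ys + tail;
}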
I wrote a JavaScript implementation with the naive method, but it also checks that the matrices are compatible and generalizes the underlying ring operations (mul and add), so you can use it to multiply complex matrices, or bigint matrices, or matrices over any other ring you define.
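The same idea sketched in C++ (my version is JavaScript; this is just an illustration): parameterize the multiply over the ring's zero, add, and mul, and one routine covers doubles, complex numbers, bigints, and so on.

#include <cstddef>
#include <functional>
#include <vector>

// A is n x m, B is m x p, both row-major; returns the n x p product.
template <typename T, typename Add, typename Mul>
std::vector<T> ring_matmul(const std::vector<T>& A, const std::vector<T>& B,
                           std::size_t n, std::size_t m, std::size_t p,
                           T zero, Add add, Mul mul) {
    std::vector<T> C(n * p, zero);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < m; ++k)
            for (std::size_t j = 0; j < p; ++j)
                C[i * p + j] = add(C[i * p + j], mul(A[i * m + k], B[k * p + j]));
    return C;
}

// Usage over the usual ring of doubles:
// auto C = ring_matmul(A, B, n, m, p, 0.0, std::plus<>{}, std::multiplies<>{});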
I'm a big fan of writing simple, portable code that is structured enough to let the compiler optimize it to perfection. If you do it right, a naive A' * B multiplier can be within 20% of BLAS performance, at least for non-huge matrices. Now, when you need that code on some embedded platform or for emscripten or whatever, it'll just work and be really tiny too.
In my experience, the people who leave comments like yours are the ones who don't often have to solve real performance problems. The compiler can't do it all (and, specifically, won't help you in the case the article discusses), and those who write code used in performance sensitive applications and/or libs used by literally millions of people have to know this stuff. Where do those people learn when they're starting out if everyone tells them to "stop thinking stupid, your compiler is smarter than you"?
I for one am glad they spend a lot of time thinking about it so that you don't have to and can instead spend your time opining on forums about premature optimization.
Speculative ad hominem aside, the point in this case was to learn to write code so that good compilers can output performant SIMD-heavy code. It's not trivial and can include a lot of trial and error and looking at assembly. But once you learn these tricks, code can be clean, portable and performant, I would say that's a win.
I can only judge you by what you say here. Obviously I don't know you, but what other choice do I have? In my experience, those who rely on their tools to do everything for them are useless when things go sideways.
> the point in this case was to learn to write code so that good compilers can output performant SIMD-heavy code
No, the point is that comments like yours serve little purpose. No one is suggesting that you should roll your own lib for matrix math in every situation. However, if you _understand_ what happens behind the scenes you write better code and can more easily diagnose performance issues.
The problem with "don't worry about it" type responses is that sometimes you _do_ end up having to worry about it, and now you're not equipped to solve the problem. Hell, you may not even know that there is a problem to begin with because you have no idea what the runtime characteristics of your algorithm _should_ be.
Maybe you'll have to write some code like this someday. How would you know if your algorithm is performing well? Perhaps you think your 10 hour run is totally acceptable... until someone who read this article and has some experience comes by. They laugh at the absurd performance and spend a bit of time making it 60% faster _because they know how this stuff works_.
There's a lot of value in learning how things work, and "premature optimization" responses like you see all over SO are worthless.
Except that, when you are writing scientific code to solve a large linear problem, like a weather model or a supernova simulation, the difference between some naïve code and a heavily optimized one can be ten hours against ten days...
The figure "20%" alone means nothing. You are probably meaning, "20% in this specific benchmark with "non-huge matrices" (as said by the OP). Algorithms can show wildly different performance when run on small-scale and large-scale inputs.
Well then you have to still test it if it's still 20% or not. I haven't written the code, I just think you shouldn't have to resort to hyperbole (ten hours, ten days) if the comment above doesn't warrant it.
> It's the huge matrices where performance matters
I wouldn't say that; just look at CGI, where a lot of effort is spent making fast 4x4 matrix multiplies. Yes, a general matrix mult needs to scale, but a lot of the time you know what size you will run, and you can optimize for that.
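As a sketch of what optimizing for a known size buys you (my own illustration): with fixed 4x4 bounds there's no loop overhead to speak of, and the compiler can keep everything in registers and fully unroll/vectorize.

struct Mat4 { double m[4][4]; };

Mat4 mul4x4(const Mat4& a, const Mat4& b) {
    Mat4 c{};                                    // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k) {
            const double aik = a.m[i][k];        // hoisted; trivially unrolled
            for (int j = 0; j < 4; ++j)
                c.m[i][j] += aik * b.m[k][j];
        }
    return c;
}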
And once we get the ball rolling and more public knowledge is available, the locked-down techniques of the major supercomputers will probably start to emerge (I just assume people from the supercomputing area have been dealing with this stuff for the last couple of decades).
Can I spit some more, dear Hacker News?
Focus, you gotta focus: forget about the compiler generating good code, it's all about the ability for us humans to explore the domain.
Even in Halide they mention the point that you still need to sit down and code different approaches, but their tool makes it a breeze compared to exploring those different approaches using hand-written assembly.
Sorry, but this is not at all how to write efficient matrix multiplication.
The correct answer is: don't. Use BLAS (yes, written in Fortran) + OpenMP (if needed), and let the smart people handle this. (Unless you are one of those smart people - in which case, you should improve BLAS, which the rest of us mortals will use).
The longer answer is: matrix multiplication is a highly non-trivial algorithmic problem[1], which is, in fact, still open.
The asymptotically fast algorithms (such as Strassen's[2]) outperform the naive algorithm (i.e. by definition, such is in the article) when matrices get large.
After skimming the article it is still unclear to me how the optimized naive algorithm fares against Strassen's, for example.
The final bit of advice this article offers is reading the paper on... the implementation of BLAS (that's what Goto describes there).
And so that's basically the case where avoiding Goto is considered harmful.
1. All BLAS implementations I know of don't use Strassen's algorithm.
2. This is such a Stack Overflow answer, please make it stop. Q: "How do I do a thing?" A: "I have unilaterally decided your question is invalid and you should not do a thing." That's really useful!
To the author: hack away. Goto can probably write better ASM but tutorials like this are very helpful for people who'd just like to read some interesting case studies.
But the Stack Overflow answer is right, and accounts for the fact that in the majority of cases the person asking a confused question is indeed confused.
If what you are trying to do is to get best performance, make use of work that other people already did, rather than wasting your time on a solved problem.
If you want to learn about how efficient matrix multiplication can be implemented, that is a different problem.
The problem with Stackoverflow answers of the form "You asked about doing X, but it looks like you are trying to do X to achieve Y, and X is not a good way to achieve Y. You want to do Z. Here's how to do Z" is that later when people who actually do need to do X are looking for help, their searches keep turning up these damned answers that don't tell how to do X.
SO needs a flag on questions that is set if there is an answer to the actual question as asked and clear otherwise, and that can be used as a search filter.
By the way "make use of work that other people already did" might be nice for getting something built fast but it may not be the best thing.
Any issue that involves writing fast code involves tradeoffs. If you sit down to write something new you may have different views on the tradeoffs than whoever wrote "the fastest" one.
Life ain't so black and white. In fact even these 'best of field' products tend to be ugly inside and unoptimized in places. (source: I develop linear algebra libraries)
Bonus: for low-level ASM math, every "solved" problem (which, by the way, it wasn't) becomes unsolved the second Intel or AMD or whoever releases a new chip or coprocessor.
This is something I'm interested in contributing to. Can you name a few libraries (especially ones implementing new and interesting work) that would welcome open source contributors? Alternatively you can just contact me (via my profile info) if you're working on something in particular but would rather not be identified publicly.
"On my machine the code runs at around 100 Gflops, which is not far from the theoretical peak, and similar to Intel's optimized implementation."
So he knows that, for example, Intel provides/sells an already-optimized implementation. And he knows that there is GotoBLAS (1), because, as you noted, he cites the paper about it.
But he explains the implementation present in Glow, "a machine learning compiler and execution engine for various hardware targets." (2)
Now if you want to prove to the authors of Glow why they should have "just used BLAS", I would like to read that discussion (or, even better, see proof-of-concept code with benchmarks, etc.). But I can imagine that in their case there were reasons not to use it, as they were aware that it exists.
However, it is true that readers should be aware of the background assumptions and of the existence of these libraries.
I hardly think the purpose of this post was to show "hey, here's what you should do if you want to multiply matrices in production: write this from scratch". More like "learn how it is done, to make a seemingly simple algorithm, for a crucial problem, fast on modern processors".
Could it, perhaps, be a tutorial that imparts some useful knowledge instead of being a suggestion that everyone should roll their own dense linear algebra? I mean, most tutorials shouldn't really exist - why would you code anything, when there is a library for it?
The problem with this page (for me) is that it's hard to evaluate as a tutorial. It reads like describing an idea the author came up with, not how state of the art implementations work.
It ends with citing and recommending a paper by another person; that doesn’t sound like “an idea the author came up with”! It’s the author trying to describe the state of the art, as they see it.
A hand-rolled routine where you benchmark different indexing orders, strides, and interleavings for a particular matrix size and fixed cache sizes has the potential to outperform BLAS. In general, though, use BLAS.
Those sub-cubic algorithms require the matrix to be very-very-very large to recoup the extra constant costs associated with them.
> Those sub-cubic algorithms require the matrix to be very-very-very large to recoup the extra constant costs associated with them.
Wikipedia disagrees:
"In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for small matrices, for matrices that are not at all square, and without requiring workspace beyond buffers that are already needed for a high-performance conventional multiplication."
I used to use ATLAS quite a bit, which effectively benchmarks your CPU and memory configuration at installation time to empirically determine where to switch from small-n algorithms to asymptotically optimal algorithms, like Strassen.
Of course, if you only put resources into optimizing the cubic-time version, Strassen will look slow by comparison. Better to take a serious look at both and empirically figure out where the trade-off point is...
For small, dense n you're not going to beat the cubic algorithm. This is the most common case in a lot of applications.
For sparse n, use a sparse version of the cubic algorithm. For large dense n, those sub-cubic algorithms might be better. Here n would have to be so large that you'd have to factor in the cost of doing it out of core.
> The asymptotically fast algorithms (such as Strassen's[2]) outperform the naive algorithm (i.e. by definition, such is in the article) when matrices get large.
This is true only if you approach the matrix multiplication problem purely from a theoretical perspective. But the reality is that computers have a specific architecture designed to be as fast as possible, while still respecting the laws of physics.
The libraries such as BLAS deal with those architectures, and it happens that there, an O(n³) triple loop with the optimizations described in the Goto paper is what gives you the fastest performance.
Maybe if we were dealing with 1b x 1b matrices, Strassen-like algorithms would start to shine, but at those sizes you already exceed the memory available on a single computer, in which case you start looking at parallel algorithms on clusters/supercomputers, where the algorithms become very different (see things such as SUMMA: the Scalable Universal Matrix Multiplication Algorithm). In those cases, the computation happening on each processor would still be the fast O(n³) BLAS code.
Oh, interesting point! Yes let's just use OpenBLAS, no need for anyone but one or two people to understand these things. Sure.
Okay so let's have a look at OpenBLAS's Skylake kernel. Maybe we can learn some deep wisdom this author hasn't considered! Go and have a look for it, I'll wait here.
Ah, what's that? THERE IS STILL NO SKYLAKE KERNEL IN OPENBLAS? Interesting.
In fact if you do "sudo apt-get install libopenblas" on Ubuntu 16.04 (the one every deep learning tutorial will be recommending), you'll find you get version 2.18 of OpenBLAS. You'll find that this version of the software is quite old now, and it has a bug that causes it to fail to detect your CPU correctly. You'll find that instead of using any AVX instructions, it'll fall-back to the Prescott kernel. You'll also find this is really fucking slow.
In summary:
a) You have no idea what you're talking about;
b) Even if you were right on these points of detail, your point of view is still terrible
c) Please avoid being so belligerently wrong in future.
Your comment here broke the site guidelines in several ways, including personal attack, which is the worst way to react to someone who you know or believe is wrong. In doing that you not only poison the well of this community, you discredit the truth you're trying to express.
Such indignant comments routinely get upvoted, but that's an unfortunate weakness of homo internetus, not evidence of a good comment. That's why we have guidelines that people need to follow here. Please (re-)read https://news.ycombinator.com/newsguidelines.html and abide by these rules when posting. They are based on many years of experience and every one is there for good reason.
Let me add that there exists a similar echo chamber when it comes to questions regarding cryptography. Way too often, questions about how to do this or that in cryptography are put down with a "you don't" or the generic "don't roll your own crypto", which is hostile to curiosity, learning, and knowledge, and on top of that unfriendly toward the one asking. Often when I read it, I feel the hacker ethos is being insulted.
While I think there are reasonable arguments for the security of (certain) self-made crypto, taking that security aspect completely aside: we need new people in every field, and cryptography is no exception. We should embrace newcomers, not tell them not to do what would teach them so much. Admittedly, one can easily mess up, but then tell them how and what to take care of instead of just telling them not to try anything at all.
I feel like the admonition, "don't roll your own crypto" refers to production code. If you want to develop your own crypto for "curiosity, learning and knowledge" I don't think many people in the security community will really bother you about it. Just don't use it for securing anything in production unless it's been thoroughly audited and you have a reason not to use something that already exists.
It seems pretty obvious why you shouldn't design or implement novel cryptography in production without good reason and significant expertise. It also doesn't seem like the people saying this need to spell out that this doesn't preclude learning exercises or legitimate research which won't endanger any real world information. So what's the controversy?
If you want to indulge curiosity, learn, and obtain knowledge, concentrate on breaking cryptography. Even relatively basic crypto attacks will teach you far more about how cryptography works than implementing well-known constructions for the nth time will.
The other thing is that people think "I'm not rolling my own crypto" when they're building their own construction out of crypto primitives someone else wrote, and end up with tons of side-channel attacks as a result.
... or, worse yet, they're using crypto implemented by all the people who didn't listen to the advice not to roll their own crypto.
Better not let the kids in the workshop. They could think the tools lying around there are made by professionals so they'd be safe, resulting in them getting hurt. And the metal and wooden dust from the saw someone used there could get into the wound and that could get nasty. Really, better just tell children to never go into workshops, or tinker around at all, for that matter.
Your comment serves as a poster child for what I'm talking about. Arrogant, looking down on others, deriding their efforts.
Crypto is really, really easy to screw up. If you screw it up in production software, the consequences can be anything from "fifteen million people get their credit cards stolen" to "a teenager in Saudi Arabia gets beheaded for treason".
People have every right to learn about and experiment with crypto, but it's just too important to treat as a plaything. If you're marketing your software as "cryptographically secure", you're asking users to place their trust in you to keep their secrets safe. That's a tremendous responsibility and we should treat it with the utmost seriousness. People who should know better keep shipping software with utterly broken DIY crypto, so we need to keep hammering home the message that you should never, ever roll your own crypto in production software.
If someone asked questions on StackOverflow about how to perform brain surgery with kitchen implements, we might want to indulge their curiosity, but we have a moral obligation to say definitely don't do this, because performing brain surgery in your kitchen is an outstandingly terrible idea and someone will probably die. Right now, the software industry is suffering from an epidemic of kitchen brain surgery and it urgently needs to stop.
You're reading some weird things into what I wrote. There are some real concerns with implementing your own crypto and getting it wrong, but the people listening to "Don't roll your own crypto" should be ignoring it, and the people who shouldn't be rolling their own crypto already are ignoring it, so it's useless advice. If you want to roll your own crypto, be conscientious about it, that's all.
And I have no need to deride the people who shouldn't be writing crypto. The exploits are factual and speak for themselves.
Having respect of certain things is healthy, and learning things under (expert) supervision is generally effective and safe. Think about cars or electricity. Most societies have some kind of mandatory training for both.
Do most hackers have the patience for learning the basics, i.e. some somewhat complicated maths that could take years to learn? No, most of the time it's programmers, and we tend to just start writing code. That's what "don't roll your own crypto" means. Nobody's saying "don't learn about Galois field" or "don't run through cryptopals".
The Dunning-Kruger effect is real. I've seen broken crypto getting shipped. Or people releasing dangerous crypto under "hey, I made this perfectly functional library that even has documentation how to use it, but it's just a learning project". So while that's still happening, I'm fine with hearing "don't roll your own crypto" ad nauseum.
The ad hominem and profanity isn't even correct about OpenBLAS.
There's nothing interesting in the lack of Skylake support, as is obvious from the open issue for KNL support. That might even be interesting for the pointers it contains to KNL and SKX GEMM.
OpenBLAS 2.18 certainly does use AVX(2) on appropriate CPUs. If it doesn't detect micro-architectures that didn't exist at the time, that's hardly a bug, and the correct solution would be to persuade the Ubuntu maintainer to provide the "hardware enablement", or to use BLIS. If tutorials tell you to install "libopenblas", they're wrong, because that package doesn't exist.