I'm surprised to see no discussion of faster-than-naive algorithms like Strassen [1] or Coppersmith-Winograd [2], which, very counterintuitively, run asymptotically faster than the naive O(n^3) algorithm.
Note: Only helpful for specific or very large matrices.
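(For anyone unfamiliar: the trick is that Strassen multiplies two 2x2 block matrices with 7 multiplications instead of 8, at the cost of extra additions, and recursing on the blocks gives the O(n^2.807) bound. A rough sketch in Python, shown on scalars here, though in the real algorithm each entry is a matrix block:

    def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
        # Seven products instead of eight; in the recursive algorithm
        # each of these is a half-size matrix multiplication.
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        # Recombine the seven products into the four quadrants of C = A*B.
        return (m1 + m4 - m5 + m7,  # c11
                m3 + m5,            # c12
                m2 + m4,            # c21
                m1 - m2 + m3 + m6)  # c22

    # strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8) == (19, 22, 43, 50), same as naive.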
The problem with those algorithms is that they define "work" as "how many multiplications do I have to do?" Additions and memory accesses are treated by the analysis as free. In practice you end up doing 4-5 times as many floating-point additions as the naive algorithm, and the memory accesses are decidedly non-cache-friendly. On top of that, modern CPUs have two multiplication units but only one addition unit, so multiplication is actually faster than addition. Plus, there are ergonomic issues with requiring square power-of-two matrices.
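You can see where the extra additions come from by counting: one Strassen step trades 8 block multiplies for 7, but spends 18 block additions where the naive blocked split spends 4. A quick recurrence in Python (a sketch; 18 is the textbook count, and real implementations vary):

    def op_counts(n, cutoff=1):
        # Multiplications and additions for Strassen on an n x n matrix
        # (n a power of two), bailing out to naive below the cutoff.
        if n <= cutoff:
            return n**3, n**2 * (n - 1)  # naive: n^3 mults, n^2(n-1) adds
        m, a = op_counts(n // 2, cutoff)
        # 7 recursive products, plus 18 additions of (n/2 x n/2) blocks.
        return 7 * m, 7 * a + 18 * (n // 2) ** 2

    # op_counts(2) -> (7, 18) vs naive (8, 4): ~4.5x the additions, which
    # is where the 4-5x figure comes from; the gap narrows as n grows.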
Matrices have to be truly spectacularly large to benefit from Strassen's, and applications of the asymptotically faster algorithms don't really exist: the matrices would have to be so large that the computation is prohibitively expensive regardless of the asymptotic speedup. You're talking about dispatching work units to a cluster at that point.
Most importantly, though, the discussion is... boring. It's just "yup, we do Strassen's, then we quickly reach the point where we fall back to the naive algorithm, and things are interesting again."
> Matrices have to be truly spectacularly large to benefit from Strassen's
I don't have personal experience with other mmult methods, but Wikipedia says Strassen's algorithm is already beneficial at matrix sizes that fit in cache. I've had occasion to deal with matrices of large dimension (though usually sparse, and usually not square), and those numbers don't seem crazy at all. Are they that far off base? (Wikipedia does say that Coppersmith-Winograd tends to never be a useful improvement.)
In any case, perhaps a friendlier response would be to point out that Strassen's algorithm tends to "bail out" into regular old matrix multiplication for small `n` anyway, so it still provides a speedup.
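Concretely, the hybrid looks something like this (a sketch assuming square power-of-two matrices; the cutoff of 128 is a placeholder, and real implementations pad or peel odd dimensions and tune the cutoff empirically):

    import numpy as np

    CUTOFF = 128  # hypothetical value; tuned per machine in practice

    def strassen(A, B):
        # Strassen multiply for square power-of-two matrices, bailing
        # out to the ordinary product below the cutoff.
        n = A.shape[0]
        if n <= CUTOFF:
            return A @ B  # regular old matrix multiplication
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4,           M1 - M2 + M3 + M6]])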
Rather than these, the original Winograd algorithm ("A New Algorithm for Inner Product", 1968) is more useful in practice, if not for matrix multiplication in software on typical CPU architectures. That decomposition is directly related to the Winograd minimal filtering algorithms for fast convolution used in convnets today. There are also systolic-array structures for matrix multiplication in hardware that take advantage of this reformulation to reduce the number of multiplies, since addition costs substantially less than multiplication in hardware.
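A sketch of that 1968 reformulation in Python, assuming an even inner dimension: the pair-product terms xi and eta are computed once per row of A and once per column of B, so after that precomputation each of the m*p inner products costs only n/2 multiplications instead of n.

    import numpy as np

    def winograd_matmul(A, B):
        # Winograd's inner-product trick: x.y = sum over pairs of
        # (x_even + y_odd)(x_odd + y_even) - xi - eta, where xi and eta
        # depend only on x and only on y, respectively.
        m, n = A.shape
        assert B.shape[0] == n and n % 2 == 0
        p = B.shape[1]
        # xi[i] = sum_j A[i,2j]*A[i,2j+1], one pass per row of A.
        xi = np.einsum('ij,ij->i', A[:, 0::2], A[:, 1::2])
        # eta[k] = sum_j B[2j,k]*B[2j+1,k], one pass per column of B.
        eta = np.einsum('jk,jk->k', B[0::2, :], B[1::2, :])
        C = np.empty((m, p))
        for i in range(m):
            for k in range(p):
                # n/2 multiplications per entry after the precomputation.
                s = np.sum((A[i, 0::2] + B[1::2, k]) * (A[i, 1::2] + B[0::2, k]))
                C[i, k] = s - xi[i] - eta[k]
        return C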
[1]: https://en.wikipedia.org/wiki/Strassen_algorithm
[2]: https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm