The big claim turns out to be a little overstated. The claim:
> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.
reduces to:
> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices [31,32]. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA [33] grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.
So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.
This is not a correct summary of the results of the paper.
First, you cut out the initial part of the sentence about improving on Strassen's two-level algorithm. Here is the complete sentence:
> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.
That is, note that the improvement cited in the latter part of the sentence is for the 4 x 4 matrix case.
But then your next quote is not in reference to 4 x 4 matrices. That is, for larger matrix sizes, the best algorithm AlphaTensor discovered had the same theoretical complexity as Strassen-square, but better performance.
EDIT: my next comments are confused and best ignored. See reply from pxx below.
For the 4 x 4 matrix case, the theoretical complexity of AlphaTensor's algorithm was O(N^2.778) compared with the Strassen algorithm's complexity of O(N^2.8074), which is an improvement.
what, no, you have this all wrong. the complexity of multiplying two 4x4 matrices is clearly constant. the important thing about the constant, however, is that you can use this primitive on the block matrices recursively to possibly improve the asymptotic complexity.
the complexity improvement in this paper is for arithmetic in Z_2 (modular arithmetic over single bits). in standard arithmetic, there is no asymptotic complexity improvement but the developed algorithms involve more efficiently fusible[0] operations, which is useful for the target hardware.
[0] ed: this used to read 'with fewer multiplications,' which is clearly wrong: fewer multiplications would lower complexity. Interestingly, the discovered algorithm seems to have a larger number of additions, but still runs faster.
I thought that savings in the number of multiplications lead to savings in complexity. For example, multiplying 2x2 matrices using 7 multiplications instead of 8 gives an improvement in complexity.
It's possible that my understanding is completely wrong, but your comment is at odds with my other reading, so it would be useful to get some clarification.
Most complexity measures are defined in terms of asymptotic behavior, so any specific finite algorithm has 'constant' complexity. This is an obviously unhelpful bit of pedantry in the theory side.
In fact, it's as you say; we can compute a 'raw' complexity for a finite operation (eg, raw count of arithmetic ops), and then use that operation as a primitive to create an algorithm with an asymptotic complexity which depends directly on the finite operation's 'raw' complexity.
(complexity theory, and specifically for matrix multiplication, is a great example of metrics becoming targets becoming bad incentives, a la Goodhart's Law. The so-called best algorithms are 'galactic' and therefore of no use to anyone. There's some fuzzy-headed hope that one day the people chasing the big-O exponent metric will come up with something practically useful, buuuuuut... you actually gotta work the practical problem to solve the practical problem, and that requires different metrics. This seems to be what's motivating the work under discussion here.)
sorry; whoops, I don't know why I said multiplications there. I somehow thought the "more additions" somehow meant "fewer multiplications," but no, it's just "more additions but faster performance on target." edited.
your other comment notes that for some shapes, even in standard arithmetic, they have found algorithms with fewer multiplications than best-known. but those don't seem to extend to an algorithm with better asymptotic complexity for multiplying general matrices. otherwise I'd assume they'd claim them :)
To be pedantic about this complexity estimate for the 4 x 4 case, both of them reduce to O(1). If N == 4, it's a constant, so we have O(4^2.778) == O(4^2.8074) == O(1).
Talking about scaling for a problem that has no scaling factor is a bit odd.
Yeah the paper uses O notation for complexity, but never when N is constant. For constant N and M, the paper uses the number of scalar multiplication operations performed as complexity measure. According to Figure 3, their algorithm has decreased the best known value from 49 to 47 for 4x4 matrices, and from 98 to 96 for 5x5 matrices (as well as some further state of the art improvements).
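A quick sanity check of where those exponents come from, assuming the standard recursive construction (an n x n scheme with k multiplications gives O(N^(log_n k)) when applied recursively to blocks); a minimal Python sketch:

```python
import math

# Exponent implied by recursively applying an n x n base scheme
# that uses k multiplications: O(N^(log_n k)).
def exponent(n, k):
    return math.log(k, n)

print(exponent(2, 7))    # Strassen's 2x2 scheme, 7 mults      -> ~2.807
print(exponent(4, 49))   # Strassen squared, 4x4, 49 mults     -> ~2.807 (same)
print(exponent(4, 47))   # AlphaTensor's 4x4 scheme, 47 mults  -> ~2.777
print(exponent(5, 98))   # previous best 5x5 scheme, 98 mults  -> ~2.849
print(exponent(5, 96))   # AlphaTensor's 5x5 scheme, 96 mults  -> ~2.836
```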
This is for matrix multiplication where elements are themselves 4x4 matrices. So yes, indeed this is about multiplying many many 4x4 matrices where N is the size of the outer matrix.
any fixed size multiplication algorithm has a corresponding general algorithm where you recursively divide a matrix into blocks and treat each block as an element. for any nxn algorithm that takes k multiplications, this yields a general algorithm with runtime O(N^(log_n k)) on NxN matrices
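A minimal sketch of that recursion, using Strassen's well-known 2x2 / 7-multiplication base case (any fixed n x n scheme with k block products slots into the same pattern); padding and a cut-over to ordinary multiplication for small blocks are omitted for brevity:

```python
import numpy as np

def strassen(A, B):
    """Recursive block multiplication built from the 7-multiplication 2x2
    base scheme; this simple sketch assumes N is a power of two."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's 7 block products (each recurses instead of costing 8).
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(strassen(A, B), A @ B)
```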
My understanding of this situation is that they found a way to multiply 4x4 matrices that requires fewer multiplications than Strassen's. This implies that it can generalize to a way to multiply n x n matrices with a better theoretical complexity than Strassen's algorithm.
In Z_2 - modular matrix multiplication, not conventional matrix multiplication. And in this case, modular over what I assume is GF(2), i.e., single bits.
Maybe I'm mis-reading the paper, but my interpretation was that the algorithm discovered using a reward which scales with HW performance matches the theoretical complexity of Strassen, but is more performant on the HW.
They also identified algorithms which do fewer matrix multiplications than Strassen, improving the lower bound of matrix multiplies required (They highlight this in Fig 3).
In that light, I thought their claim was fair. They've discovered (different) algorithms which are both theoretically and practically better than Strassen's.
That's the DeepMind modus operandi, and why they publish in Nature. It's in their best interest to claim as much covered ground as possible so they can keep working towards their end goal. But most people who read Nature already know that results in that journal are overhyped. You just have to apply a specific prior/translator when you read their papers.
This really is a depressingly true sentiment. Nature is the most important publication, but so often the big shots publish overhyped stuff there that doesn't deserve it and get away with it because they are big shots. Underscore this as another piece of evidence of the dysfunction in science.
I don't doubt the sentiment is true, but hasn't this kind of science always been this way? By that I mean authors inflating the importance of their work; everyone wants to be seen as having the biggest breakthroughs.
When I think of dysfunction in 'science' I usually think of unfalsifiable hypotheses, the repeatability crisis in Psychology, p-hacking in Medicine, misuse of statistical methods in Economics and other epistemic issues, but I don't think of exaggerations like this.
I would imagine that it's the metrics universities and funding agencies apply in promotion decisions. For example, my (well-known) university decided to measure impact, and takes "Twitter engagement" as one proxy metric for impact -- against my explicit recommendations. I'll leave the consequences to everybody's imagination.
Yes. In fact many scientists post to twitter but don't look at the replies (Emily Bender is an example) or even block you if you disagree with them. That's not engagement and I wonder what the dean would do about a promotion where the scientist just sort of blathered on twitter and had lots of followers, but wasn't actually providing any real scientific value (again, Emily Bender is an example).
Deans typically lack detailed technical knowledge to evaluate candidates (not to mention time), they just go with the flow, and want to run the department smoothly.
I think the real mechanism is this: those who make the decision (those with the most power, i.e. those who bring in the most funding and can threaten to leave if they don't get their way) already know whom they want, and they cherry-pick data, e.g. "impact" figures, to bolster their case. That enables the dean to justify the decision in public with those cherry-picked figures ... (the dean can hardly say we are hiring X because otherwise top funding getter Y will leave).
I think Twitter is less important in STEM subjects than in social sciences or humanities, as STEM has more clearcut results.
On the other hand, real world performance is really the only useful metric for matrix multiplication. I don’t really care about the theoretical performance, or the number of operations. Not disagreeing with your take that the claim is grandiose, just pointing out that finding a generalizable way to automate the improvement of what is almost certainly the most important mathematical operation a computer can do is worth some attention. It also suggests other mathematical operations could be improved this way as well - potentially algorithms that haven’t gotten as much theoretical or practical attention as matrix multiplication.
With all that said, they even point out that other people have worked on this before using other optimization strategies. I’d guess they got into Nature either by using the Deepmind name, because Nature really loves reinforcement learning because it’s a cool topic that draws views, or because they’re the first group to find their algorithm leads to real improvements. (Probably a mixture of all three, I’d guess)
This is a very good proxy for actual real world speed. It's pretty much "as good as it gets" for most straight computational tasks, though sometimes memory movement is your real bottleneck.
The 'best' algorithms for matrix multiplication are galactic algorithms that provide no actual benefit. Raw operation counts are a good proxy for speed, but the big-O complexity that people actually chase hasn't been especially helpful for this problem in the last twenty+ years.
Probably the most common matrix multiplication is (nx9) x (9xm) (9 = 3x3 cells from a convnet) . If you can optimize the shit out of those you might be in business for something interesting.
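For readers wondering where that (n x 9) x (9 x m) shape comes from: a rough im2col-style sketch (single input channel, hypothetical sizes, no padding or stride) showing how a 3x3 convolution turns into exactly such a matmul:

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
H, W = 32, 32            # input feature map (one channel for simplicity)
M = 64                   # number of 3x3 filters
x = np.random.rand(H, W)
filters = np.random.rand(M, 3, 3)

# im2col: every 3x3 patch becomes a row of length 9.
patches = np.stack([
    x[i:i + 3, j:j + 3].ravel()
    for i in range(H - 2)
    for j in range(W - 2)
])                                         # shape (n, 9), n = (H-2)*(W-2)

# The convolution is now an (n x 9) @ (9 x M) matrix multiplication.
out = patches @ filters.reshape(M, 9).T    # shape (n, M)
print(patches.shape, out.shape)            # (900, 9) (900, 64)
```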
Though to be honest the real slow step in machine learning is training and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.
For non-ml/non-GF purposes, you might also worry about numerical stability of these matrix multiplications, which is not addressed in this paper.
> and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.
Well, there are several, but the obvious ones tend to require strange or unrealistic assumptions about the hardware. The most obvious such assumption, IMO, being that the hardware is arranged in 3D space in a manner roughly analogous to a human brain, which tends to be at odds with the common practice of mostly-planar photolithography, and with the preference to be able to change the network topology experimentally without building new hardware.
It's interesting how SIMD and big-O have ended up with such different reasons for being of limited but still significant relevance.
SIMD, or vector processors as they were called in the 70s, delivered practical speedups in simple benchmarks right away, but most applications don't take advantage of it for software engineering reasons. Big-O improvements, on the other hand, ignore important components of performance per unit of time (memory access and constant factors) and are purely theoretical in an essential sense.
Sometimes. In the case of matrix multiplication, there is a pretty large backlog of "galactic algorithms" going down to ~O(n^2.373) that haven't yet led to real-world improvements. Strassen's ~O(n^2.807) algorithm is only considered over the basic O(n^3) strategy for n>1000.
My understanding is that the Winograd minimal filtering algorithms used in cuDNN are different from the O(n^2.37) Coppersmith-Winograd-descended matrix multiplication algorithms. But I acknowledge that these can be considered cousins, produced by the same line of research.
I'm pretty sure there's no difference. It does seem to be pretty hard to turn the theoretical win into a practical one - the GPU kernel needs to be coded extremely efficiently to match the underlying hardware. AFAIK it's only a win for 3x3 - maybe for one other size too. Originally Winograd wasn't supported by cuDNN on NVidia's Tensor Cores (matmul-specific hardware on more recentish GPUs), vs CUDA cores, but a Google search seems to indicate it can be done - not sure if that's in cuDNN though.
No, they count multiplications, 47 vs 49 in the 4x4 case. This assumes that multiplications are the heavy lift relative to additions, memory moves, etc. That is not a bad assumption, and it is hardware independent, although there are some proposed 'exotic' hardware solutions where addition and multiplication are equally difficult (via lookup table, limited to about 8-bit x 8-bit) or where addition is harder than multiplication (log-space number systems), but I don't think these have been deployed.
That assumption didn't age very well. In many if not most modern architectures multiplication still does take more cycles than addition but can have a higher throughput if scheduled well, and fused multiply-add can be as fast as a single multiplication, essentially giving a free addition.
With sufficiently parallel hardware, you can do an entire mat4 * mat4 multiplication in 3 cycles. First do all 64 multiplications in parallel, then do 32 adds, then 16 adds to get the final answer.
For operations in GF(2) where they claim a result, a multiply is just an AND gate and an add is an XOR gate. So the fully parallel hardware version is 64 AND gates and 48 XOR gates, with a total gate delay of 3. This is a trivial amount of hardware and could easily be an instruction in some alternate universe where it was useful.
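A software analogue of that circuit, just to make the GF(2) arithmetic concrete: schoolbook 4x4 multiplication where multiply is AND and accumulate is XOR (64 ANDs, 48 XORs). This is purely illustrative and is not the 47-multiplication scheme from the paper:

```python
import random

# Schoolbook 4x4 matrix multiplication over GF(2): multiply = AND, add = XOR.
def matmul_gf2(A, B):
    n = 4
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= A[i][k] & B[k][j]   # XOR-accumulate the AND products
            C[i][j] = acc
    return C

A = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
B = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
print(matmul_gf2(A, B))
```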
You missed the part where the "4x4 matrix multiplication" refers to multiplication of NxN matrices logically partitioned into sixteen blocks of size (N/4)x(N/4), done recursively.
The paper is a great read -- an interesting approach, fun theoretical comparisons, plus benchmarks for optimized algorithms for specific hardware (CPU and GPU, allowing for differing instruction sets).
But it doesn't stop there; from the "Discussion" section:
> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.
I have a gut feeling that there is a faster way to compute logarithms going from least to most significant bit. How would I go about using ML to find it?
[Edit] I think Feynman's algorithm might do it:
"Consider the problem of finding the logarithm of a
fractional number between 1 and 2. (The algorithm can be
generalized without too much difficulty.) Feynman observed
that any such number can be uniquely represented
as a product of numbers of the form 1 + 2^(-k), where k is an
integer. Testing for the presence of each of these factors in
a binary representation is simply a matter of a shift and a
subtraction. Once the factors are determined, the logarithm
can be computed by adding together the precomputed
logarithms of the factors. The algorithm fit the
Connection Machine especially well because the small
table of the logarithms of 1 + 2^(-k) could be shared by all
the processors. The entire computation took less time
than doing a division."
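A rough Python sketch of the idea in that quote, assuming the greedy factorization into (1 + 2^-k) terms; floating point is used here purely for clarity, whereas the Connection Machine version would use fixed-point shift-and-add plus a shared table of precomputed logarithms:

```python
import math

def feynman_log(x, bits=50):
    """Approximate ln(x) for x in [1, 2) by greedily factoring x into
    (1 + 2**-k) terms; each accepted factor is a shift-and-add in fixed point."""
    assert 1.0 <= x < 2.0
    y, result = 1.0, 0.0
    for k in range(1, bits + 1):
        factor = 1.0 + 2.0 ** -k
        if y * factor <= x:              # in fixed point: compare y + (y >> k) with x
            y *= factor
            result += math.log(factor)   # table lookup in the real algorithm
    return result

print(feynman_log(1.7), math.log(1.7))
```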
note that this algorithm will no longer be good. for precisions up to roughly 1000 bits, small tables combined with minimax polynomials are optimal, and for higher precision you want to use more complicated methods. if you're interested, the ARB library implements the currently fastest known methods.
I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).
Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.
Now, were these algorithms discovered, or invented?
I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?
The dovetail over all possible programs solves every problem with at most a constant slowdown.
So if it was invented, it was invented 100 years ago along with every other algorithm since the inventor of the dovetail incorporated it by reference. And there are no more algorithms to invent.
And if it was discovered, you would want to compare the efficiency of your discovery process with the dovetail.
So I tend to say "discovered with X bits of optimization power", where 0 bits reduces to the dovetail over some enumeration process, infinity bits reduces to "invention"(i.e. you consider one(or zero) object from the stream, constructing it directly), and everything in-between grades the search process.
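For anyone who hasn't seen it, a toy sketch of the dovetail being referenced (plain interleaving over an enumeration of programs, not Levin search proper); `make_program` and `accept` are made-up stand-ins:

```python
from itertools import count

def dovetail(make_program, accept):
    """Interleave execution of an infinite enumeration of programs:
    in round r, start program r, then advance every started program
    by one step; return the first output that `accept` recognizes."""
    running = []                              # (index, generator) pairs
    for r in count():
        running.append((r, make_program(r)))  # start program number r
        for idx, gen in running:
            try:
                out = next(gen)               # one step of program idx
            except StopIteration:
                continue                      # this program has halted
            if accept(out):
                return idx, out

# Toy "program space": program i yields i, i + (i+1), i + 2*(i+1), ...
def make_program(i):
    x = i
    while True:
        yield x
        x += i + 1

print(dovetail(make_program, lambda v: v == 42))
```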
> Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.
10-20% performance improvement in matrix multiplications is pretty amazing[0]!
That doesn't say exactly what was run and how. From measurements with nvblas, that difference could be dwarfed by the effect of tile size and pinning buffers or not.
this is cool. i suppose it's only a matter of time until we see optimizing compilers that use transformer-rl style searches for subsets of their optimization and codegen.
Whenever I see stuff about automated algorithms it reminds me of a movie; I can't remember the name or plot exactly, but they create a being or upload someone's consciousness, and the "AI" solves all the problems the world faces until it reaches the last one: humans. It then begins to synthesize some airborne virus or nukes that would scour us from the earth. We basically created tech that would eventually optimize us out of existence as a cancer on the earth.
Edit: found it, Transcendence (2014). Also it seems my brain injected some random thoughts into that plot; it's not about that at all, but the point stands!
Formal verification is very good at proving that compiler transformations preserve semantics. Programming language semantics are pretty well specified (at least for some programming languages...), so the chance of bugs in the specification are low.
It can catch all kinds of bugs, but you have to ask the correct questions. So it comes down to defining what a bug is and the assumptions you use for testing. And therein lies the problem: what constitutes a bug? A function that adds two numbers but never terminates might be considered bug-free if you forget to specify that not giving an answer before the end of the universe is faulty behaviour. We humans are terrible at writing these kinds of specifications, so formal verification as a method just pushes the correctness problem from the code to the specification.
If you navigate within algebraic structures with well known properties (which are also verified for example), formal verification is all you need, you can be certain of being bug free.
> ... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).
> Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.
I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game? Why not just use, for example, an evolutionary algorithm to find incrementally better decompositions?
There's commentary here which seems to be talking about matrix multiplication in general and ignoring real implementations, which are dominated by considerations of the memory hierarchy. They actually introduce operations, like packing for sufficiently large dimensions but not for smaller, and prefetch. Some of that seems only to be tunable empirically for the micro-architecture too. (I don't know how the considerations vary between CPU and GPU.) Fairly recent work on Strassen implementation, talks about that and the normal Goto-type algorithm: https://jianyuhuang.com/papers/sc16.pdf
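To make the memory-hierarchy point concrete, here is a toy NumPy sketch of cache blocking (tiling); real Goto/BLIS-style kernels layer packing, micro-kernels, and prefetch on top of this, none of which is shown here:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Tiled multiplication: work on bs x bs blocks so each panel stays
    resident in cache while it is reused."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```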
A word of caution for anyone who wants to follow up on this research:
When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) must necessarily use n^3 scalar multiplications.
The authors will therefore waste their time if they carry out their plans and try to optimize for stability.
The standard algorithm has this nice property, and no faster algorithm can have it.
The question of fast matrix multiplication was raised recently on mathoverflow.net, see
https://mathoverflow.net/q/421304/110176
and the answers given there.
In general to speed up matrix operations how much is some sort of result caching applied?
For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?
Can someone knowledgeable tell me if it discovered an algorithm we can understand and re-implement ourselves, i.e. is the pseudo-code for it known? Or is it kind of stuck in the inferred function of the ML model?
Yes, but if those cats discovered a faster way to solve integrals than you initially taught them, their findings would have higher credibility if published in an actual mathematics journal.
Otherwise, you end up with things like the "rediscovery" [0] of the trapezoid rule. That wasn't actually published in Nature, but with the way Nature has gone downhill even in its own sphere, it could probably happen today. There's no reason to believe that someone publishing in a bio/chem journal is remotely capable of making math/CS judgements like "faster matrix multiplication"; nor that the referees (does Nature even bother with those any more? O.o) will recognize any issues.
Reinforcement learning is usually introduced as a technique to train an agent in simulation that would then be let loose in a real environment. Interesting that they chose to treat it as a search strategy.
To be fair, their work is based on AlphaZero, which is itself deep learning on top of Monte Carlo tree search, a search strategy.
Having played with those algorithms, I suspect that one could obtain similar results with a solid Monte Carlo tree search implementation (a lot of them use UCB, which is really suboptimal).
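For reference, here is the UCB1 rule being criticized, in minimal form; AlphaZero-style MCTS replaces this with a PUCT rule weighted by the learned policy prior. The names and numbers are just illustrative:

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score used to pick which child to descend into during
    MCTS selection; untried children get +inf so they are expanded first."""
    if child_visits == 0:
        return float("inf")
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Example: compare two children of a node visited 100 times.
print(ucb1(30.0, 50, 100), ucb1(4.0, 5, 100))
```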
Never heard of the "matrix multiplication tensor" before. Is there some digestible blog post or article somewhere explaining what it is and why it is useful?
This is completely beside the point, but reading "provably" in the abstract is a frank reminder of how terrible English spelling/pronunciation is. I can't imagine I'm the only well-read native English speaker who read this as "prov-ably" on first take. I don't know about most languages, but you just don't get nonsense like this in French, at least.
Edit: Sorry, missed your point on first reading, which perhaps reinforces your point. I read papers like this for a living so the language is unambiguous but I see how it might not be for others, even if English is their first language.