The big claim turns out to be a little overstated. The claim:
> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.
reduces to:
> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices [31,32]. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA [33] grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.
So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.
This is not a correct summary of the results of the paper.
First, you cut out the initial part of the sentence about improving on Strassen's two-level algorithm. Here is the complete sentence:
> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.
That is, note that the improvement cited in the latter part of the sentence is for the 4 x 4 matrix case.
But then your next quote is not in reference to 4 x 4 matrices. That is, for larger matrix sizes, the best algorithm AlphaTensor discovered had the same theoretical complexity as Strassen-square, but better performance.
EDIT: my next comments are confused and best ignored. See reply from pxx below.
For the 4 x 4 matrix case, the theoretical complexity of AlphaTensor's algorithm was O(N^2.778) compared with the Strassen algorithm's complexity of O(N^2.8074), which is an improvement.
what, no, you have this all wrong. the complexity of multiplying two 4x4 matrices is clearly constant. the important thing about the constant, however, is that you can use this primitive on the block matrices recursively to possibly improve the asymptotic complexity.
the complexity improvement in this paper is for arithmetic in Z_2 (modular arithmetic over single bits). in standard arithmetic, there is no asymptotic complexity improvement but the developed algorithms involve more efficiently fusible[0] operations, which is useful for the target hardware.
[0] ed: this used to read 'with fewer multiplications,' which is clearly wrong: fewer multiplications would lower complexity. Interestingly, the discovered algorithm seems to have a larger number of additions, but still runs faster.
I thought that savings in the number of multiplications lead to savings in complexity. For example, multiplying 2x2 matrices using 7 multiplications instead of 8 gives an improvement in complexity.
It's possible that my understanding is completely wrong, but your comment is at odds with my other reading, so it would be useful to get some clarification.
Most complexity measures are defined in terms of asymptotic behavior, so any specific finite algorithm has 'constant' complexity. This is an obviously unhelpful bit of pedantry in the theory side.
In fact, it's as you say; we can compute a 'raw' complexity for a finite operation (eg, raw count of arithmetic ops), and then use that operation as a primitive to create an algorithm with an asymptotic complexity which depends directly on the finite operation's 'raw' complexity.
(complexity theory, and specifically for matrix multiplication, is a great example of metrics becoming targets becoming bad incentives, a la Goodhart's Law. The so-called best algorithms are 'galactic' and therefore of no use to anyone. There's some fuzzy-headed hope that one day the people chasing the big-O exponent metric will come up with something practically useful, buuuuuut... you actually gotta work the practical problem to solve the practical problem, and that requires different metrics. This seems to be what's motivating the work under discussion here.)
sorry; whoops, I don't know why I said multiplications there. I somehow thought the "more additions" somehow meant "fewer multiplications," but no, it's just "more additions but faster performance on target." edited.
your other comment notes that for some shapes, even in standard arithmetic, they have found algorithms with fewer multiplications than best-known. but those don't seem to extend to an algorithm with better asymptotic complexity for multiplying general matrices. otherwise I'd assume they'd claim them :)
To be pedantic about this complexity estimate for the 4 x 4 case, both of them reduce to O(1). If N == 4, it's a constant, so we have O(4^2.778) == O(4^2.8074) == O(1).
Talking about scaling for a problem that has no scaling factor is a bit odd.
Yeah the paper uses O notation for complexity, but never when N is constant. For constant N and M, the paper uses the number of scalar multiplication operations performed as complexity measure. According to Figure 3, their algorithm has decreased the best known value from 49 to 47 for 4x4 matrices, and from 98 to 96 for 5x5 matrices (as well as some further state of the art improvements).
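A quick sanity check of where those exponents come from, assuming the standard recursive construction (an n x n scheme with k multiplications gives O(N^(log_n k)) when applied recursively to blocks); a minimal Python sketch:

```python
import math

# Exponent implied by recursively applying an n x n base scheme
# that uses k multiplications: O(N^(log_n k)).
def exponent(n, k):
    return math.log(k, n)

print(exponent(2, 7))    # Strassen's 2x2 scheme, 7 mults      -> ~2.807
print(exponent(4, 49))   # Strassen squared, 4x4, 49 mults     -> ~2.807 (same)
print(exponent(4, 47))   # AlphaTensor's 4x4 scheme, 47 mults  -> ~2.777
print(exponent(5, 98))   # previous best 5x5 scheme, 98 mults  -> ~2.849
print(exponent(5, 96))   # AlphaTensor's 5x5 scheme, 96 mults  -> ~2.836
```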
This is for matrix multiplication where elements are themselves 4x4 matrices. So yes, indeed this is about multiplying many many 4x4 matrices where N is the size of the outer matrix.
any fixed size multiplication algorithm has a corresponding general algorithm where you recursively divide a matrix into blocks and treat each block as an element. for any nxn algorithm that takes k multiplications, this yields a general algorithm with runtime O(N^(log_n k)) on NxN matrices
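A minimal sketch of that recursion, using Strassen's well-known 2x2 / 7-multiplication base case (any fixed n x n scheme with k block products slots into the same pattern); padding and a cut-over to ordinary multiplication for small blocks are omitted for brevity:

```python
import numpy as np

def strassen(A, B):
    """Recursive block multiplication built from the 7-multiplication 2x2
    base scheme; this simple sketch assumes N is a power of two."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's 7 block products (each recurses instead of costing 8).
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(strassen(A, B), A @ B)
```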
My understanding of this situation is that they found a way to multiply 4x4 matrices that requires fewer multiplications than Strassen's. This implies that it can generalize to a way to multiply n x n matrices with a better theoretical complexity than Strassen's algorithm.
In Z_2 - modular matrix multiplication, not conventional matrix multiplication. And in this case, modular over what I assume is GF(2), i.e., single bits.
Maybe I'm mis-reading the paper, but my interpretation was that the algorithm discovered using a reward which scales with HW performance matches the theoretical complexity of Strassen, but is more performant on the HW.
They also identified algorithms which do fewer matrix multiplications than Strassen, improving the lower bound of matrix multiplies required (They highlight this in Fig 3).
In that light, I thought their claim was fair. They've discovered (different) algorithms which are both theoretically and practically better than Strassen's.
That's the DeepMind modus operandi, and why they publish in Nature. It's in their best interest to claim as much covered ground as possible so they can keep working towards their end goal. But most people who read Nature already know that results in that journal are overhyped. You just have to apply a specific prior/translator when you read their papers.
This really is a depressingly true sentiment. Nature is the most important publication, but so often the big shots publish overhyped stuff there that doesn't deserve it and get away with it because they are big shots. Underscore this as another piece of evidence of the dysfunction in science.
I don't doubt the sentiment is true, but hasn't this kind of science always been this way? By that I mean authors inflating the importance of their work; everyone wants to be seen as having the biggest breakthroughs.
When I think of dysfunction in 'science' I usually think of unfalsifiable hypotheses, the repeatability crisis in Psychology, p-hacking in Medicine, misuse of statistical methods in Economics and other epistemic issues, but I don't think of exaggerations like this.
I would imagine that it's the metrics universities and funding agencies apply in promotion decisions. For example, my (well-known) university decided to measure impact, and takes "Twitter engagement" as one proxy metric for impact -- against my explicit recommendations. I'll leave the consequences to everybody's imagination.
Yes. In fact many scientists post to twitter but don't look at the replies (Emily Bender is an example) or even block you if you disagree with them. That's not engagement and I wonder what the dean would do about a promotion where the scientist just sort of blathered on twitter and had lots of followers, but wasn't actually providing any real scientific value (again, Emily Bender is an example).
Deans typically lack detailed technical knowledge to evaluate candidates (not to mention time), they just go with the flow, and want to run the department smoothly.
I think the real mechanism is this: those who make the decision (those with the most power, i.e. those who bring in the most funding and can threaten to leave if they don't get their way) already know whom they want, and they cherry-pick data, e.g. "impact" figures, to bolster their case. That enables the dean to justify the decision in public with those cherry-picked figures ... (the dean can hardly say we are hiring X because otherwise top funding getter Y will leave).
I think Twitter is less important in STEM subjects than in social sciences or humanities, as STEM has more clearcut results.
On the other hand, real world performance is really the only useful metric for matrix multiplication. I don’t really care about the theoretical performance, or the number of operations. Not disagreeing with your take that the claim is grandiose, just pointing out that finding a generalizable way to automate the improvement of what is almost certainly the most important mathematical operation a computer can do is worth some attention. It also suggests other mathematical operations could be improved this way as well - potentially algorithms that haven’t gotten as much theoretical or practical attention as matrix multiplication.
With all that said, they even point out that other people have worked on this before using other optimization strategies. I’d guess they got into Nature either by using the Deepmind name, because Nature really loves reinforcement learning because it’s a cool topic that draws views, or because they’re the first group to find their algorithm leads to real improvements. (Probably a mixture of all three, I’d guess)
This is a very good proxy for actual real world speed. It's pretty much "as good as it gets" for most straight computational tasks, though sometimes memory movement is your real bottleneck.
The 'best' algorithms for matrix multiplication are galactic algorithms that provide no actual benefit. Raw operation counts are a good proxy for speed, but the big-O complexity that people actually chase hasn't been especially helpful for this problem in the last twenty+ years.
Probably the most common matrix multiplication is (nx9) x (9xm) (9 = 3x3 cells from a convnet) . If you can optimize the shit out of those you might be in business for something interesting.
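For readers wondering where that (n x 9) x (9 x m) shape comes from: a rough im2col-style sketch (single input channel, hypothetical sizes, no padding or stride) showing how a 3x3 convolution turns into exactly such a matmul:

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
H, W = 32, 32            # input feature map (one channel for simplicity)
M = 64                   # number of 3x3 filters
x = np.random.rand(H, W)
filters = np.random.rand(M, 3, 3)

# im2col: every 3x3 patch becomes a row of length 9.
patches = np.stack([
    x[i:i + 3, j:j + 3].ravel()
    for i in range(H - 2)
    for j in range(W - 2)
])                                         # shape (n, 9), n = (H-2)*(W-2)

# The convolution is now an (n x 9) @ (9 x M) matrix multiplication.
out = patches @ filters.reshape(M, 9).T    # shape (n, M)
print(patches.shape, out.shape)            # (900, 9) (900, 64)
```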
Though to be honest the real slow step in machine learning is training and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.
For non-ml/non-GF purposes, you might also worry about numerical stability of these matrix multiplications, which is not addressed in this paper.
> and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.
Well, there are several, but the obvious ones tend to require strange or unrealistic assumptions about the hardware. The most obvious such assumption, IMO, being that the hardware is arranged in 3D space in a manner roughly analogous to a human brain, which tends to be at odds with the common practice of mostly-planar photolithography, and with the preference to be able to change the network topology experimentally without building new hardware.
It's interesting how SIMD and big-O have ended up with such different reasons for being of limited but still significant relevance.
SIMD, or vector processors as they were called in the 70s, delivered practical speedups in simple benchmarks right away, but most applications don't take advantage of it for software engineering reasons. Big-O improvements, on the other hand, ignore important components of performance per unit of time (memory access and constant factors) and are purely theoretical in an essential sense.
Sometimes. In the case of matrix multiplication, there is a pretty large backlog of "galactic algorithms" going down to ~O(n^2.373) that haven't yet led to real-world improvements. Strassen's ~O(n^2.807) algorithm is only considered over the basic O(n^3) strategy for n>1000.
My understanding is that the Winograd minimal filtering algorithms used in cuDNN are different from the O(n^2.37) Coppersmith-Winograd-descended matrix multiplication algorithms. But I acknowledge that these can be considered cousins, produced by the same line of research.
I'm pretty sure there's no difference. It does seem to be pretty hard to turn the theoretical win into a practical one - the GPU kernel needs to be coded extremely efficiently to match the underlying hardware. AFAIK it's only a win for 3x3 - maybe for one other size too. Originally Winograd wasn't supported by cuDNN on NVidia's Tensor Cores (matmul-specific hardware on more recentish GPUs), vs CUDA cores, but a Google search seems to indicate it can be done - not sure if that's in cuDNN though.
No, they count multiplications, 47 vs 49 in the 4x4 case. This assumes that multiplications are the heavy lift relative to additions, memory moves, etc. That is not a bad assumption, and it is hardware independent, although there are some proposed 'exotic' hardware solutions where addition and multiplication are equally difficult (via lookup table, limited to about 8-bit x 8-bit) or where addition is harder than multiplication (log-space number systems), but I don't think these have been deployed.
That assumption didn't age very well. In many if not most modern architectures multiplication still does take more cycles than addition but can have a higher throughput if scheduled well, and fused multiply-add can be as fast as a single multiplication, essentially giving a free addition.
With sufficiently parallel hardware, you can do an entire mat4 * mat4 multiplication in 3 cycles. First do all 64 multiplications in parallel, then do 32 adds, then 16 adds to get the final answer.
For operations in GF(2) where they claim a result, a multiply is just an AND gate and an add is an XOR gate. So the fully parallel hardware version is 64 AND gates and 48 XOR gates, with a total gate delay of 3. This is a trivial amount of hardware and could easily be an instruction in some alternate universe where it was useful.
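A software analogue of that circuit, just to make the GF(2) arithmetic concrete: schoolbook 4x4 multiplication where multiply is AND and accumulate is XOR (64 ANDs, 48 XORs). This is purely illustrative and is not the 47-multiplication scheme from the paper:

```python
import random

# Schoolbook 4x4 matrix multiplication over GF(2): multiply = AND, add = XOR.
def matmul_gf2(A, B):
    n = 4
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= A[i][k] & B[k][j]   # XOR-accumulate the AND products
            C[i][j] = acc
    return C

A = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
B = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
print(matmul_gf2(A, B))
```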
You missed the part where the "4x4 matrix multiplication" refers to multiplication of NxN matrices logically partitioned into sixteen blocks of size (N/4)x(N/4), done recursively.
The paper is a great read -- an interesting approach, fun theoretical comparisons, plus benchmarks for optimized algorithms for specific hardware (CPU and GPU, allowing for differing instruction sets).
But it doesn't stop there; from the "Discussion" section:
> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.
I have a gut feeling that there is a faster way to compute logarithms going from least to most significant bit. How would I go about using ML to find it?
[Edit] I think Feynman's algorithm might do it:
"Consider the problem of finding the logarithm of a
fractional number between 1 and 2. (The algorithm can be
generalized without too much difficulty.) Feynman observed
that any such number can be uniquely represented
as a product of numbers of the form 1 + 2^(-k), where k is an
integer. Testing for the presence of each of these factors in
a binary representation is simply a matter of a shift and a
subtraction. Once the factors are determined, the logarithm
can be computed by adding together the precomputed
logarithms of the factors. The algorithm fit the
Connection Machine especially well because the small
table of the logarithms of 1 + 2^(-k) could be shared by all
the processors. The entire computation took less time
than doing a division."
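A rough Python sketch of the idea in that quote, assuming the greedy factorization into (1 + 2^-k) terms; floating point is used here purely for clarity, whereas the Connection Machine version would use fixed-point shift-and-add plus a shared table of precomputed logarithms:

```python
import math

def feynman_log(x, bits=50):
    """Approximate ln(x) for x in [1, 2) by greedily factoring x into
    (1 + 2**-k) terms; each accepted factor is a shift-and-add in fixed point."""
    assert 1.0 <= x < 2.0
    y, result = 1.0, 0.0
    for k in range(1, bits + 1):
        factor = 1.0 + 2.0 ** -k
        if y * factor <= x:              # in fixed point: compare y + (y >> k) with x
            y *= factor
            result += math.log(factor)   # table lookup in the real algorithm
    return result

print(feynman_log(1.7), math.log(1.7))
```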
note that this algorithm will no longer be good. for precisions up to roughly 1000 bits, small tables combined with minimax polynomials are optimal, and for higher precision you want to use more complicated methods. if you're interested, the ARB library implements the currently fastest known methods.
I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).
Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.
Now, were these algorithms discovered, or invented?
I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?
The dovetail over all possible programs solves every problem with at most a constant slowdown.
So if it was invented, it was invented 100 years ago along with every other algorithm since the inventor of the dovetail incorporated it by reference. And there are no more algorithms to invent.
And if it was discovered, you would want to compare the efficiency of your discovery process with the dovetail.
So I tend to say "discovered with X bits of optimization power", where 0 bits reduces to the dovetail over some enumeration process, infinity bits reduces to "invention"(i.e. you consider one(or zero) object from the stream, constructing it directly), and everything in-between grades the search process.
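For anyone who hasn't seen it, a toy sketch of the dovetail being referenced (plain interleaving over an enumeration of programs, not Levin search proper); `make_program` and `accept` are made-up stand-ins:

```python
from itertools import count

def dovetail(make_program, accept):
    """Interleave execution of an infinite enumeration of programs:
    in round r, start program r, then advance every started program
    by one step; return the first output that `accept` recognizes."""
    running = []                              # (index, generator) pairs
    for r in count():
        running.append((r, make_program(r)))  # start program number r
        for idx, gen in running:
            try:
                out = next(gen)               # one step of program idx
            except StopIteration:
                continue                      # this program has halted
            if accept(out):
                return idx, out

# Toy "program space": program i yields i, i + (i+1), i + 2*(i+1), ...
def make_program(i):
    x = i
    while True:
        yield x
        x += i + 1

print(dovetail(make_program, lambda v: v == 42))
```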
> Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.
10-20% performance improvement in matrix multiplications is pretty amazing[0]!
That doesn't say exactly what was run and how. From measurements with nvblas, that difference could be dwarfed by the effect of tile size and pinning buffers or not.
this is cool. i suppose it's only a matter of time until we see optimizing compilers that use transformer-rl style searches for subsets of their optimization and codegen.
Whenever I see stuff about automated algorithms it reminds me of a movie; I can't remember the name or plot exactly, but they create a being or upload someone's consciousness, and the "AI" solves all the problems the world faces until it reaches the last one: humans. It then begins to synthesize some airborne virus or nukes that would scour us from the earth. We basically created tech that would eventually optimize us out of existence as a cancer on the earth.
Edit: found it, Transcendence (2014). Also it seems my brain injected some random thoughts into that plot; it's not about that at all, but the point stands!
Formal verification is very good at proving that compiler transformations preserve semantics. Programming language semantics are pretty well specified (at least for some programming languages...), so the chance of bugs in the specification are low.
It can catch all kinds of bugs, but you have to ask the correct questions. So it comes down to defining what a bug is and the assumptions you use for testing. And therein lies the problem: what constitutes a bug? A function that adds two numbers but never terminates might be considered bug-free if you forget to specify that not giving an answer before the end of the universe is faulty behaviour. We humans are terrible at writing these kinds of specifications, so formal verification as a method just pushes the correctness problem from the code to the specification.
If you navigate within algebraic structures with well known properties (which are also verified for example), formal verification is all you need, you can be certain of being bug free.
> ... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).
> Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.
I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game? Why not just use, for example, an evolutionary algorithm to find incrementally better decompositions?
There's commentary here which seems to be talking about matrix multiplication in general and ignoring real implementations, which are dominated by considerations of the memory hierarchy. They actually introduce operations, like packing for sufficiently large dimensions but not for smaller, and prefetch. Some of that seems only to be tunable empirically for the micro-architecture too. (I don't know how the considerations vary between CPU and GPU.) Fairly recent work on Strassen implementation, talks about that and the normal Goto-type algorithm: https://jianyuhuang.com/papers/sc16.pdf
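To make the memory-hierarchy point concrete, here is a toy NumPy sketch of cache blocking (tiling); real Goto/BLIS-style kernels layer packing, micro-kernels, and prefetch on top of this, none of which is shown here:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Tiled multiplication: work on bs x bs blocks so each panel stays
    resident in cache while it is reused."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```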
A word of caution for anyone who wants to follow up on this research:
When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) must necessarily use n^3 scalar multiplications.
The authors will therefore waste their time if they carry out their plans and try to optimize for stability.
The standard algorithm has this nice property, and no faster algorithm can have it.
The question of fast matrix multiplication was raised recently on mathoverflow.net, see
https://mathoverflow.net/q/421304/110176
and the answers given there.
In general to speed up matrix operations how much is some sort of result caching applied?
For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?
Can someone knowledgeable tell me if it discovered an algorithm we can understand and re-implement ourselves, i.e. is the pseudo-code for it known? Or is it kind of stuck in the inferred function of the ML model?
Yes, but if those cats discovered a faster way to solve integrals than you initially taught them, their findings would have higher credibility if published in an actual mathematics journal.
Otherwise, you end up with things like the "rediscovery" [0] of the trapezoid rule. That wasn't actually published in Nature, but with the way Nature has gone downhill even in its own sphere, it could probably happen today. There's no reason to believe that someone publishing in a bio/chem journal is remotely capable of making math/CS judgements like "faster matrix multiplication"; nor that the referees (does Nature even bother with those any more? O.o) will recognize any issues.
Reinforcement learning is usually introduced as a technique to train an agent in simulation that would then be let loose in a real environment. Interesting that they chose to treat it as a search strategy.
To be fair, their work is based on AlphaZero, which is itself deep learning on top of Monte Carlo tree search, a search strategy.
Having played with those algorithms, I suspect that one could obtain similar results with a solid Monte Carlo tree search implementation (a lot of them use UCB, which is really suboptimal).
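For reference, here is the UCB1 rule being criticized, in minimal form; AlphaZero-style MCTS replaces this with a PUCT rule weighted by the learned policy prior. The names and numbers are just illustrative:

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score used to pick which child to descend into during
    MCTS selection; untried children get +inf so they are expanded first."""
    if child_visits == 0:
        return float("inf")
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Example: compare two children of a node visited 100 times.
print(ucb1(30.0, 50, 100), ucb1(4.0, 5, 100))
```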
Never heard of the "matrix multiplication tensor" before. Is there some digestible blog post or article somewhere explaining what it is and why it is useful?
This is completely beside the point, but reading "provably" in the abstract is a frank reminder of how terrible English spelling/pronunciation is. I can't imagine I'm the only well-read native English speaker who read this as "prov-ably" on first take. I don't know about most languages, but you just don't get nonsense like this in French, at least.
Edit: Sorry, missed your point on first reading, which perhaps reinforces your point. I read papers like this for a living so the language is unambiguous but I see how it might not be for others, even if English is their first language.