
> AlphaEvolve achieved up to a 32.5% speedup for the FlashAttention kernel implementation in Transformer-based AI models

> In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.

> And in 20% of cases, AlphaEvolve improved the previously best known solutions

These sound like incredible results. I'd be curious what kinds of improvements were made and what they actually consisted of.

Like, was that "up to a 32.5% speedup" on some weird edge case, with a negligible speedup otherwise? Would love to see the benchmarks.


Remember that GPUs have cache hierarchies, and matching block sizes to hit those caches optimally is a big win that you often don't get by default, simply because (number of important kernels) x (number of important GPUs) x (effort to properly tune one of them) is more than people are willing to do for others for free in open source (a toy sketch of such a tuning sweep follows below). Not to mention kernel fusion, and API boundaries that socially force suboptimal choices for the sake of clarity and simplicity.

It's a very impressive result: not magic, but also not cheating!
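To make the block-size point concrete, here is that toy sweep in NumPy (the matrix size and tile sizes are made up; real kernels are tuned in CUDA/Triton/Pallas, but the sweep-and-verify structure is the same):

    import time
    import numpy as np

    def blocked_matmul(a, b, block):
        """Cache-blocked matmul: accumulate C in (block x block) tiles."""
        n = a.shape[0]
        c = np.zeros((n, n))
        for i in range(0, n, block):
            for j in range(0, n, block):
                for k in range(0, n, block):
                    c[i:i+block, j:j+block] += a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
        return c

    n = 1024
    a, b = np.random.rand(n, n), np.random.rand(n, n)

    # Sweep tile sizes and keep the fastest: the per-(kernel, GPU) tuning
    # that rarely gets done for free in open source.
    for block in (64, 128, 256, 512):
        t0 = time.perf_counter()
        c = blocked_matmul(a, b, block)
        dt = time.perf_counter() - t0
        assert np.allclose(c, a @ b)   # check correctness before trusting the timing
        print(f"block={block:4d}  {dt:.3f}s")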


100%. LLMs are extremely useful for doing obvious but repetitive optimizations that a human might miss.

What it essentially does is a debugging/optimization loop: change one thing, evaluate, compare the results, and repeat.

Previously we needed a human in the loop to make the change. Of course we have automated hyperparameter tuning (and similar things), but that only works in a rigidly defined search space; a rough sketch of the LLM-driven loop follows below.

Will we see LLMs generating new improved LLM architectures, now fully incomprehensible to humans?
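A minimal sketch of that change-eval-repeat loop, with hypothetical stand-ins (propose_edit for the LLM call, evaluate for the benchmark; neither is from the paper):

    import random

    def evaluate(program: str) -> float:
        """Stand-in benchmark; lower is better. Real use: compile, run, and time the kernel."""
        return (sum(ord(c) for c in program) % 997) / 997

    def propose_edit(program: str) -> str:
        """Stand-in for 'ask the LLM to change one thing'."""
        i = random.randrange(len(program))
        return program[:i] + random.choice("abcdefxyz") + program[i + 1:]

    best = "def kernel(x): return x @ x"
    best_score = evaluate(best)

    for step in range(500):
        candidate = propose_edit(best)
        score = evaluate(candidate)
        if score < best_score:      # greedy: keep only strict improvements
            best, best_score = candidate, score

    print(best_score)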


If I understood correctly, isn't this software only as useful as the LLM powering it? It sounds like something very useful, but either I'm missing something or it's essentially "please optimize this code" put into a loop with a validator. Useful, but maybe not as revolutionary as the underlying LLM tech itself.

Edit: the white paper says this: AlphaEvolve employs an ensemble of large language models. Specifically, we utilize a combination of Gemini 2.0 Flash and Gemini 2.0 Pro. This ensemble approach allows us to balance computational throughput with the quality of generated solutions. Gemini 2.0 Flash, with its lower latency, enables a higher rate of candidate generation, increasing the number of ideas explored per unit of time. Concurrently, Gemini 2.0 Pro, possessing greater capabilities, provides occasional, higher-quality suggestions that can significantly advance the evolutionary search and potentially lead to breakthroughs. This strategic mix optimizes the overall discovery process by maximizing the volume of evaluated ideas while retaining the potential for substantial improvements driven by the more powerful model.

So I stand by my earlier opinion. Furthermore, the paper doesn't present this as something extraordinary, as some people here say it is, but as an evolution of existing software, FunSearch.
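For intuition on the Flash/Pro ensemble described in the excerpt above: it is essentially a sampling policy over two generators, trading throughput against quality. A toy sketch (the 9:1 ratio and the propose stub are assumptions, not from the paper):

    import random

    def propose(model: str, prompt: str) -> str:
        """Hypothetical stand-in for a call to the named model."""
        return f"<candidate program from {model}>"

    def pick_model() -> str:
        # Mostly the cheap, low-latency model for sheer volume of candidates;
        # occasionally the stronger, slower model for higher-quality suggestions.
        return "gemini-2.0-flash" if random.random() < 0.9 else "gemini-2.0-pro"

    candidates = [propose(pick_model(), "optimize this kernel: ...") for _ in range(32)]
    print(len(candidates))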


"Make this better in a loop" is less powerful than using evolution on a population. While it may seem like evolution is just single steps in a loop, something qualitatively different occurs due to the population dynamics - since you get the opportunity for multiple restarts / interpolation (according to an LLM) between examples / and 'novelty' not being instantly rejected.

The “fully incomprehensible to humans” aspect of this potential future state interests me as a software person.

The last 50 years of software evolution have been driven by a need to scale human comprehension for larger and more integrated codebases. If we decreasingly need/rely on humans to understand our code, source code’s forward-progress flywheel is going to slow down and will bring us closer to (as you suggest) incomprehensibility.

Not only did we scale the breadth of codebases; the flywheel also built layers and layers of abstraction over time (have you seen the code sample in this article??), fostering a growing market of professional developers and their career progressions. If most code becomes incomprehensible, it'll be the code closer to “the bottom”: a thin wrapper of API on top of an expanding mass of throwaway whatever-language AlphaAlgo creates.

If we don’t wrangle this, it will destroy a profession and leave us with trillions of LoC that only people with GPUs can understand. Which may be another profession I suppose.


Very few people understand highly optimized numerical kernels as it is, and many such kernels are already machine optimized. This just takes it a bit further. Most programmers do not do high-performance algorithm development.

One can get obvious but repetitive optimizations with symbolic programming [1].

[1] https://arxiv.org/abs/1012.1802

It's strange that the AlphaEvolve authors don't compare their work to what's achievable with equality saturation. An implementation of equality saturation can do interesting integrals with very simple rules [2]; a toy illustration of rule-based rewriting follows below.

[2] https://github.com/alt-romes/hegg/blob/master/test/Sym.hs#L3...
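For readers who haven't seen it: equality saturation keeps every rewrite of a term in an e-graph and only picks the best form at the end, which avoids the phase-ordering problem. The sketch below is only the greedy fixpoint-rewriting cousin of that (no e-graph), just to show what "very simple rules" look like; the hegg test in [2] does the real thing.

    # Expressions as nested tuples: ('+', ('*', 'x', 1), 0) means x*1 + 0.

    def rewrite_once(e):
        """Apply simple algebraic rules bottom-up (greedily, unlike equality saturation)."""
        if not isinstance(e, tuple):
            return e
        op, a, b = e[0], rewrite_once(e[1]), rewrite_once(e[2])
        if op == '+' and b == 0:
            return a          # x + 0 -> x
        if op == '*' and b == 1:
            return a          # x * 1 -> x
        if op == '*' and b == 0:
            return 0          # x * 0 -> 0
        return (op, a, b)

    def simplify(e):
        """Rewrite to a fixpoint."""
        while True:
            new = rewrite_once(e)
            if new == e:
                return e
            e = new

    print(simplify(('+', ('*', ('+', 'x', 0), 1), 0)))   # -> x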


Absolutely. I'm not arguing that the results are unreasonable to the point of illegitimacy; I'm just curious when they perform as well as reported, how well the presented solutions generalize to different test cases, and whether it's routing to different solutions based on certain criteria, etc.

Hey, do you have any suggestions for resources to learn more about this kind of custom optimisation? It sounds interesting, but I'm not sure where to start.

https://ppc.cs.aalto.fi/ covers some of this (it overlaps with the topics the person you responded to mentioned, though not all of them, and includes some others).

> AlphaEvolve is accelerating AI performance and research velocity. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.

From the paper, it was a speedup on the XLA GPU kernel they wrote using JAX, which is probably not SOTA. I don't think JAX even has an official flash attention implementation.
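For context on what is being sped up: a plain, non-fused attention reference looks like the NumPy sketch below (shapes are made up). It materializes the full seq_len x seq_len score matrix, which is exactly what FlashAttention-style kernels avoid by tiling and fusing the softmax with the matmuls.

    import numpy as np

    def reference_attention(q, k, v):
        """Naive scaled dot-product attention; materializes the full score matrix."""
        d = q.shape[-1]
        scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)      # (seq, seq): O(seq^2) memory
        scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    seq_len, d_head = 2048, 128
    q, k, v = (np.random.randn(seq_len, d_head) for _ in range(3))
    print(reference_attention(q, k, v).shape)   # (2048, 128)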

Not sure what “official” means, but I'd direct you to the GCP MaxText [0] framework. It isn't what this GDM paper is referring to, but the repo contains various attention implementations in MaxText/layers/attentions.py.

[0] https://github.com/AI-Hypercomputer/maxtext


I'm starting to think that numbers like this are really just slop lately.

FA achieving a 32.5% speed up? Cool.

Why not submit it as a PR to the Flash Attention repo, then? And can I read about it in more detail somewhere?


I haven't read the linked article, but your comment reminded me of a discussion about a CUDA kernel speedup presented by Sakana AI Labs. The researcher Ravid Shwartz Ziv at NYU posted about it on LinkedIn [1], and here is the Twitter post of interest [2]:

""" Yesterday's news about Sakana AI Labs provided an important lesson for all of us working with AI agents. Their announcement of an AI system that could supposedly optimize CUDA kernels to run 100x faster initially seemed like exactly the kind of use cases we've been hoping for in AI-assisted development.

Like many others, I was excited about it. After all, isn't this exactly what we want AI to do - help us optimize and improve our technical systems?

However, careful investigation by the community (on Twitter) revealed a different story. What really happened? The AI-generated CUDA kernel appeared to achieve incredible speedups, but the code was inadvertently reusing memory buffers containing previous results, essentially bypassing the actual computation. When properly evaluated, the kernel actually runs about 3x slower than the baseline. """

[1] https://www.linkedin.com/posts/ravid-shwartz-ziv-8bb18761_ye...

[2] https://x.com/main_horse/status/1892473238036631908
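That failure mode (a "fast" kernel that silently reuses stale output buffers) is exactly why an evaluator should check the candidate against a reference on fresh random inputs before timing anything. A minimal sketch of such a harness (names and shapes are hypothetical):

    import time
    import numpy as np

    def check_and_time(candidate_fn, reference_fn, n_trials=10):
        """Reject candidates that don't reproduce the reference output on fresh inputs."""
        for _ in range(n_trials):
            x = np.random.randn(1024, 1024)      # new input every trial: nothing stale to reuse
            if not np.allclose(candidate_fn(x), reference_fn(x), rtol=1e-4):
                raise ValueError("output mismatch; any speedup is meaningless")
        t0 = time.perf_counter()
        for _ in range(n_trials):
            candidate_fn(np.random.randn(1024, 1024))
        return (time.perf_counter() - t0) / n_trials

    # A candidate that cached its first result would fail the check above.
    print(f"{check_and_time(lambda x: x @ x, lambda x: np.matmul(x, x)):.4f}s per call")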


lmao this is exactly the kind of stuff I always see from Claude. It’s like adding a Skip() to a test and declaring it works now. “Well it’s a lot faster, I met the criteria of my TODOs cya”

I’ve seen it so much that I kinda doubt it was “inadvertent”; the models seem almost intentional about their laziness, and they’ll gaslight you about it too.


So annoying. Also, when it hardcodes the expected response in a mock, bypassing the purpose entirely. “Test passes now!”
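The anti-pattern described above, as a hypothetical pytest-style example (class and names invented for illustration): the code under test is itself replaced by the expected answer, so the assertion verifies the mock rather than the logic.

    from unittest.mock import patch

    class Pricing:
        def compute_total(self, items):
            # Imagine real (possibly buggy) logic here.
            return sum(items) * 1.2

    def test_total_price_bypassed():
        pricing = Pricing()
        # Anti-pattern: the method under test is replaced with the expected value,
        # so this test can never fail no matter what compute_total actually does.
        with patch.object(Pricing, "compute_total", return_value=42.0):
            assert pricing.compute_total([10, 20]) == 42.0

    test_total_price_bypassed()
    print("test passed (and proved nothing)")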

Funny, 5 years ago we had these same complaints, but about (some) people.


Same thing for TypeScript type errors… “the AI added ‘as any’ and the problem is fixed!”

Well you forgot to fully qualify your linguistic basis and semantic interpretation of the text of your wish to the great genie bottle.

“I am a vibe coder, it is your job to check the results”

Exactly, as a great dev once said: "talk is cheap, show me the code"

I assume the Gemini results are JAX/PAX-ML/Pallas improvements for TPUs, so I would look there for recent PRs.


