JIT optimizers operate at runtime; there are no test suites to verify before/after. It happens live as the code is running, so if you use AI you won't know whether the optimization is actually valid. This is why the article uses Z3 instead of neural networks: Z3 can validate semantic equivalence, neural networks can't.
Yes, but this Z3 analysis is not done at runtime. It's done offline, based on JIT traces. A neural network could, in principle, suggest optimizations in the same way, which an expert would then review for possible inclusion into the PyPy JIT.
You'd still have to write a proof of semantic equivalence before implementing the optimization, so I don't see what the neural network gains you here unless it is actually supplying the proof of correctness along with the optimization.
The idea is that the LLM would provide "intuition" to guide the optimizer to find better optimizations, but a formal proof would be necessary to ensure that those optimizations are actually valid.
PyPy doesn't do this in general. The same Z3 model that is used to find these missing optimizations is also used to verify some integer optimizations.
But the point is that as long as optimization rules are hand-written, a human has thought about them and convinced themselves (maybe incorrectly) that the rules are correct. If a machine generates them without a human in the loop, some other sort of correctness argument is needed. Hence the reasonable suggestion that they should be formally verified.
PyPy has formally verified the integer abstract domain using Z3, a quite important part of our jit optimizer (will write about that in the coming weeks).
We also run a fuzzer regularly to find optimization bugs, using Z3 as a correctness check:
The peephole optimizations aren't themselves formally verified completely yet. We've verified the very simplest rules, and some of the newer complicated ones, but not systematically all of them. I plan to work on fully and automatically verifying all integer optimizations in the next year or so. But we'll see, I'll need to find students and/or money.
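For a concrete picture of what such a Z3 check looks like, here's a minimal sketch (illustrative only, not PyPy's actual code) that verifies one integer peephole rule over 64-bit machine integers with the z3-solver Python bindings:

    # Illustrative sketch, not PyPy's actual code: verify the peephole
    # rule int_sub(int_add(x, c), c) -> x for 64-bit machine integers.
    import z3

    x = z3.BitVec('x', 64)
    c = z3.BitVec('c', 64)

    before = (x + c) - c   # BitVec arithmetic wraps around, like machine ints
    after = x

    solver = z3.Solver()
    solver.add(before != after)   # ask Z3 for a counterexample
    print(solver.check())         # unsat: no counterexample, the rule is valid

An unsat answer means no input distinguishes the two sides; sat would come with a concrete counterexample model.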
Ah, yes, I meant that the LLM could output suggestions, which a human would then think about and convince themselves of, and only then implement in PyPy.
Presumably the LLM would generate a lot of proposed rules for humans to wade through. Reviewing lots of proposed rewrites while catching all possible errors would be tedious and error-prone. We have computers to take care of this kind of work.
Perhaps not, but they're based on heuristics and checks that are known, checked, and understood by humans, and aren't prone to hallucination the way LLMs are. An LLM suggests something that looks plausible, but there's no guarantee that its suggestions actually work as intended, hence the need for a proof.
I added a few somewhat similar optimizations to Racket. The problem is the corner cases.
For example (fixnums are small integers), is it valid to replace
    (if (fixnum? x)
        (fixnum? (abs x))
        true)
with just the constant `true`?
Try running a few tests: common unit tests and even random tests. Did you spot the corner case?
It fails only when x is the most negative fixnum, which is also a very rare case in a real program. (IIRC, the random test suite tries to use more of these kinds of problematic values.)
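To make the corner case concrete, here is a small Python sketch simulating fixnums of an assumed width (Racket's actual fixnum width is platform-dependent; 62 bits is only an illustrative choice):

    # Simulate W-bit fixnums; the width W is an assumption for illustration.
    W = 62
    FIX_MIN, FIX_MAX = -(2 ** (W - 1)), 2 ** (W - 1) - 1

    def is_fixnum(n):
        return FIX_MIN <= n <= FIX_MAX

    assert is_fixnum(abs(FIX_MIN + 1))   # fine for every fixnum but one...
    assert is_fixnum(abs(FIX_MAX))
    assert not is_fixnum(abs(FIX_MIN))   # ...abs(FIX_MIN) = 2**(W-1) overflows

In a two's-complement-style range the negative side has one more value than the positive side, which is exactly why abs falls out of range at FIX_MIN.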
Generate the Z3 too, since the need is to verify, not test. It can be a direct translation: for all inputs, is the optimized output equivalent? (Bootstrapping a compiler prototype via LLMs is nice, though.)
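For the fixnum example above, such a direct translation might look like this sketch (Z3's mathematical integers, with the same assumed 62-bit fixnum range as before):

    import z3

    W = 62
    FIX_MIN, FIX_MAX = -(2 ** (W - 1)), 2 ** (W - 1) - 1

    x = z3.Int('x')

    def is_fix(n):
        return z3.And(n >= FIX_MIN, n <= FIX_MAX)

    abs_x = z3.If(x >= 0, x, -x)

    # (if (fixnum? x) (fixnum? (abs x)) true)  vs.  the constant true
    before = z3.If(is_fix(x), is_fix(abs_x), z3.BoolVal(True))
    after = z3.BoolVal(True)

    s = z3.Solver()
    s.add(before != after)   # any input where the rewrite changes behavior?
    print(s.check())         # sat: the rewrite is invalid
    print(s.model())         # x = -2**61, the most negative fixnum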
One place LLMs get fun here is where the direct translation to z3 times out, such as bigger or more complicated programs, and so the LLM can provide intuition for pushing the solver ahead.
Sure, for booleans you can just test all combinations of input arguments. In some cases you can do the same for all possible 32-bit float or int values that you have as input. But for 64-bit integers (let alone several of them) that's not feasible.
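For the boolean case, exhaustive checking really is just a few lines; a sketch with a De Morgan rewrite as the example rule:

    from itertools import product

    before = lambda a, b: not (a and b)
    after = lambda a, b: (not a) or (not b)   # De Morgan rewrite

    # Two boolean inputs: all 4 combinations are trivial to enumerate.
    assert all(before(a, b) == after(a, b)
               for a, b in product([False, True], repeat=2))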
As long as we can agree that we are testing the application logic and not the compiler or hardware, then `if (a > 4) {...} else {...}` can be tested with just 3, 4, and 5; no need to test -430 or 5036.
This is known as boundary value testing: you partition all inputs into equivalence classes, then make sure your tests contain a sample from each class.
Making sure the test contains a sample from each class is the hard part. For example in your `if` example above it may happen that the code computing `a` is such that `a >= 5` is impossible, so that equivalence class is never going to happen. As such you can't have a test for it, and instead you'll have to prove that it can never happen, but this reduces to the halting problem and is not computable.
And even ignoring that problem, there may be an infinite number of equivalence classes once you introduce loops/recursion, since the loops can run a different number of times and thus lead to different executions.
Even just considering `if` statements, the number of equivalence classes can be exponential in the number of `if`s (for example, consider a series of `if`s where each checks a different bit of the input; ultimately you'll need every combination of bits to cover every combination of `if`s, and that number is 2^(number of ifs)).
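A tiny sketch of that blow-up: three independent bit-tests already induce 2^3 = 8 path classes, one per input bit pattern.

    def f(x):
        out = 0
        if x & 1: out += 1     # each `if` inspects a different bit,
        if x & 2: out += 10    # so covering every path combination
        if x & 4: out += 100   # needs all 2**3 = 8 bit patterns
        return out

    assert len({f(x) for x in range(8)}) == 8   # 8 distinct paths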
Most algorithms allow too many inputs anyway. Even a simple algorithm computing the addition of two 64-bit numbers allows 2^128 possible input combinations, which would take billions of years to check exhaustively, even in the best case.
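Back-of-the-envelope, assuming an optimistic billion checks per second:

    checks = 2 ** 128            # input combinations for two 64-bit values
    per_second = 10 ** 9         # optimistic: a billion checks per second
    years = checks / per_second / (60 * 60 * 24 * 365)
    print(f"{years:.1e} years")  # roughly 1.1e22 years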
You’d be extracting optimization candidates by running the test suite.
You re-run the test suite after changes to ensure they still pass.