JIT optimizers operate at runtime; there are no test suites to verify before/after. It happens live as the code is running, so if you use AI you won't know whether the optimization is actually valid. This is why the article uses Z3 instead of neural networks: Z3 can validate semantic equivalence, neural networks can't.
Yes, but this Z3 analysis is not done at runtime. It's done offline, based on JIT traces. A neural network could, in principle, suggest optimizations in the same way, which an expert would then review for possible inclusion into the PyPy JIT.
You'd still have to write a proof of semantic equivalence before implementing the optimization, so I don't see what the neural network gains you here unless it is actually supplying the proof of correctness along with the optimization.
The idea is that the LLM would provide "intuition" to guide the optimizer to find better optimizations, but a formal proof would be necessary to ensure that those optimizations are actually valid.
PyPy doesn't do this in general. The same Z3 model that is used to find these missing optimizations is also used to verify some integer optimizations.
But the point is that as long as optimization rules are hand-written, a human has thought about them and convinced themselves (maybe incorrectly) that the rules are correct. If a machine generates them without a human in the loop, some other sort of correctness argument is needed. Hence the reasonable suggestion that they should be formally verified.
PyPy has formally verified the integer abstract domain using Z3, a quite important part of our jit optimizer (will write about that in the coming weeks).
We also run a fuzzer regularly to find optimization bugs, using Z3 as a correctness check:
The peephole optimizations aren't themselves formally verified completely yet. We've verified the very simplest rules, and some of the newer complicated ones, but not systematically all of them. I plan to work on fully and automatically verifying all integer optimizations in the next year or so. But we'll see, I'll need to find students and/or money.
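For a concrete picture of what such a Z3 check looks like, here's a minimal sketch (illustrative only, not PyPy's actual code) that verifies one integer peephole rule over 64-bit machine integers with the z3-solver Python bindings:

    # Illustrative sketch, not PyPy's actual code: verify the peephole
    # rule int_sub(int_add(x, c), c) -> x for 64-bit machine integers.
    import z3

    x = z3.BitVec('x', 64)
    c = z3.BitVec('c', 64)

    before = (x + c) - c   # BitVec arithmetic wraps around, like machine ints
    after = x

    solver = z3.Solver()
    solver.add(before != after)   # ask Z3 for a counterexample
    print(solver.check())         # unsat: no counterexample, the rule is valid

An unsat answer means no input distinguishes the two sides; sat would come with a concrete counterexample model.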
Ah, yes, I meant that the LLM could output suggestions, which a human would then think about and convince themselves of, and only then implement in PyPy.
Presumably the LLM would generate a lot of proposed rules for humans to wade through. Reviewing lots of proposed rewrites while catching all possible errors would be tedious and error-prone. We have computers to take care of this kind of work.
Perhaps not, but they're based on heuristics and checks that are known, checked, and understood by humans, and aren't prone to hallucination the way LLMs are. An LLM suggests something that looks plausible, but there's no guarantee that its suggestions actually work as intended, hence the need for a proof.
I added a few somewhat similar optimizations to Racket. The problem is the corner cases.
For example (fixnums are small integers), is it valid to replace
    (if (fixnum? x)
        (fixnum? (abs x))
        true)
with just the constant `true`?
Try running a few tests: common unit tests and even random tests. Did you spot the corner case?
It fails only when x is the most negative fixnum, which is also a very rare case in a real program. (IIRC, the random test suite tries to use more of these kinds of problematic values.)
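To make the corner case concrete, here is a small Python sketch simulating fixnums of an assumed width (Racket's actual fixnum width is platform-dependent; 62 bits is only an illustrative choice):

    # Simulate W-bit fixnums; the width W is an assumption for illustration.
    W = 62
    FIX_MIN, FIX_MAX = -(2 ** (W - 1)), 2 ** (W - 1) - 1

    def is_fixnum(n):
        return FIX_MIN <= n <= FIX_MAX

    assert is_fixnum(abs(FIX_MIN + 1))   # fine for every fixnum but one...
    assert is_fixnum(abs(FIX_MAX))
    assert not is_fixnum(abs(FIX_MIN))   # ...abs(FIX_MIN) = 2**(W-1) overflows

In a two's-complement-style range the negative side has one more value than the positive side, which is exactly why abs falls out of range at FIX_MIN.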
Generate the Z3 too, since the need is to verify, not test. It can be a direct translation: for all inputs, is the optimized output equivalent? (Bootstrapping a compiler prototype via LLMs is nice, though.)
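For the fixnum example above, such a direct translation might look like this sketch (Z3's mathematical integers, with the same assumed 62-bit fixnum range as before):

    import z3

    W = 62
    FIX_MIN, FIX_MAX = -(2 ** (W - 1)), 2 ** (W - 1) - 1

    x = z3.Int('x')

    def is_fix(n):
        return z3.And(n >= FIX_MIN, n <= FIX_MAX)

    abs_x = z3.If(x >= 0, x, -x)

    # (if (fixnum? x) (fixnum? (abs x)) true)  vs.  the constant true
    before = z3.If(is_fix(x), is_fix(abs_x), z3.BoolVal(True))
    after = z3.BoolVal(True)

    s = z3.Solver()
    s.add(before != after)   # any input where the rewrite changes behavior?
    print(s.check())         # sat: the rewrite is invalid
    print(s.model())         # x = -2**61, the most negative fixnum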
One place LLMs get fun here is where the direct translation to z3 times out, such as bigger or more complicated programs, and so the LLM can provide intuition for pushing the solver ahead.
Sure, for booleans you can just test all combinations of input arguments. In some cases you can do the same for all possible 32-bit float or int values that you have as input. But for 64-bit integers (let alone several of them) that's not feasible.
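For the boolean case, exhaustive checking really is just a few lines; a sketch with a De Morgan rewrite as the example rule:

    from itertools import product

    before = lambda a, b: not (a and b)
    after = lambda a, b: (not a) or (not b)   # De Morgan rewrite

    # Two boolean inputs: all 4 combinations are trivial to enumerate.
    assert all(before(a, b) == after(a, b)
               for a, b in product([False, True], repeat=2))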
As long as we can agree that we are testing the application logic and not the compiler or hardware, then `if (a > 4) {...} else {...}` can be tested with just 3, 4, and 5; no need to test -430 or 5036.
This is known as boundary value testing: you partition all inputs into equivalence classes, then make sure your tests contain a sample from each class.
Making sure the test contains a sample from each class is the hard part. For example in your `if` example above it may happen that the code computing `a` is such that `a >= 5` is impossible, so that equivalence class is never going to happen. As such you can't have a test for it, and instead you'll have to prove that it can never happen, but this reduces to the halting problem and is not computable.
And even ignoring that problem, there may be an infinite number of equivalence classes once you introduce loops/recursion, since the loops can run a different number of times and thus lead to different executions.
Even just considering `if` statements, the number of equivalence classes can be exponential in the number of `if`s (for example, consider a series of `if`s where each checks a different bit of the input; ultimately you'll need every combination of bits to cover every combination of `if`s, and that number is 2^(number of ifs)).
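A tiny sketch of that blow-up: three independent bit-tests already induce 2^3 = 8 path classes, one per input bit pattern.

    def f(x):
        out = 0
        if x & 1: out += 1     # each `if` inspects a different bit,
        if x & 2: out += 10    # so covering every path combination
        if x & 4: out += 100   # needs all 2**3 = 8 bit patterns
        return out

    assert len({f(x) for x in range(8)}) == 8   # 8 distinct paths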
Most algorithms allow too many inputs anyway. Even a simple algorithm computing the addition of two 64-bit numbers allows 2^128 possible input combinations, which would take billions of years to check exhaustively, even in the best case.
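Back-of-the-envelope, assuming an optimistic billion checks per second:

    checks = 2 ** 128            # input combinations for two 64-bit values
    per_second = 10 ** 9         # optimistic: a billion checks per second
    years = checks / per_second / (60 * 60 * 24 * 365)
    print(f"{years:.1e} years")  # roughly 1.1e22 years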
You’d be extracting optimization candidates by running the test suite.
You re-run the test suite after changes to ensure they still pass.