One of the biggest things that seems to be holding back ML in compilers right now is dataset size. This model was only trained on a gigabyte of source code, 30+% of that synthetic. Even on much simpler models, there have been massive performance gains just from throwing more data at them. Some experimentation with the original MLGO inlining model on a much bigger data corpus doubled the code-size wins. LLMs have also been shown to perform better the more data they are fed [1].
It's possible, nay, mandatory to constrain the outputs of the model at each step of generation in order to guarantee that a given structure or grammar is adhered to. If you fine-tune the model with these constraints in place, you can offload much of the effort the LLM would otherwise spend on establishing correctness, leaving more of its capacity for generating good content. To be sure, quality and quantity of data are important, but it's all too easy to introduce subtle bugs that take years to tease out if you don't adhere to the right constraints.
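Here's a minimal sketch of what per-step constraining looks like: before picking the next token, mask out every token the grammar forbids. The vocabulary, the toy expression grammar, and the random-logit stand-in for the model are all invented for illustration, not anything from the paper.

```python
import math
import random

# Toy vocabulary; a real setup would use the model's tokenizer vocabulary.
VOCAB = ["(", ")", "x", "+", "<eos>"]

def fake_logits(prefix):
    """Stand-in for an LM forward pass: one (random) logit per vocab token."""
    return [random.uniform(-2.0, 2.0) for _ in VOCAB]

def allowed_tokens(prefix):
    """Tokens that keep the prefix inside a tiny expression grammar:
    'x' is the only operand, '+' the only operator, parentheses must balance."""
    depth = 0
    expect_operand = True
    for tok in prefix:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif tok == "x":
            expect_operand = False
        elif tok == "+":
            expect_operand = True
    if expect_operand:
        return {"(", "x"}
    legal = {"+"}
    legal.add(")" if depth > 0 else "<eos>")
    return legal

def constrained_decode(max_len=16):
    prefix = []
    for _ in range(max_len):
        logits = fake_logits(prefix)
        legal = allowed_tokens(prefix)
        # Mask disallowed tokens to -inf so they can never be selected.
        masked = [l if tok in legal else -math.inf
                  for tok, l in zip(VOCAB, logits)]
        tok = VOCAB[masked.index(max(masked))]
        if tok == "<eos>":
            break
        prefix.append(tok)
    return " ".join(prefix)

print(constrained_decode())
```

The same masking step is where fine-tuning under the constraint pays off: the model never gets gradient signal for token choices the grammar would reject anyway.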
Most of the work in this space is not focused on neural compilation (having an ML model perform the transformation/entire compilation), but on replacing heuristics or phase ordering, where the issue of correctness falls back onto the compiler. For pretty much exactly the reasons you mentioned, neural compilation isn't really tractable.
This specific paper focuses on phase ordering, which should guarantee correctness, assuming the underlying transformations are correct. They do train the model to perform compilation, but as an auxiliary task.
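To make that division of labor concrete, here's a rough sketch of the phase-ordering setup: the learned component only proposes an ordering of existing LLVM passes, and `opt` does all the transformation, so correctness stays with the compiler. The candidate pipelines and the line-count size proxy below are stand-ins; a real system would query a trained policy per module and measure actual object-code size.

```python
import subprocess

# Hypothetical candidate pass orderings; a trained policy would propose these.
CANDIDATE_PIPELINES = [
    "instcombine,simplifycfg",
    "gvn,instcombine,simplifycfg",
    "simplifycfg,instcombine,gvn,instcombine",
]

def ir_size(ll_text):
    """Crude code-size proxy: non-empty, non-comment lines of textual IR."""
    return sum(1 for line in ll_text.splitlines()
               if line.strip() and not line.strip().startswith(";"))

def run_pipeline(input_ll, passes):
    """Apply one pass ordering with `opt` (new pass manager) and return the IR."""
    result = subprocess.run(
        ["opt", "-S", f"-passes={passes}", input_ll, "-o", "-"],
        capture_output=True, text=True, check=True)
    return result.stdout

def best_ordering(input_ll):
    # Whatever ordering is picked, the output is produced by opt's own
    # (presumed-correct) transformations, never generated by the model.
    scored = [(ir_size(run_pipeline(input_ll, p)), p)
              for p in CANDIDATE_PIPELINES]
    return min(scored)

if __name__ == "__main__":
    print(best_ordering("input.ll"))
```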
[1] https://arxiv.org/abs/2203.15556