You are correct in that I could be merely thinking it's easy. So let me answer your questions as best I can.
* UB is not exposed to the compiler/IR on purpose. I don't want compilers using it as an excuse to be adversarial like today's compilers. There is still UB, but as little as I can get away with (mostly in data races and race conditions).
* I'm designing a new provenance model right now. It's incomplete. If anything turns out to be genuinely hard, it's this part, and it might well be.
* It is possible to attach information to any item or group of items. In fact, my IR will be able to attach so much information that it should be possible for high-level passes to reconstruct high-level code. Think MLIR, but with pure data. For example, you could group basic blocks together and label them as coming from a while loop. You will also be able to generate and use e-graphs. My model will also be different from LLVM's: analysis will only happen on the original IR, with information generated only on the original. Thus, no information is lost before analysis.
* Canonicalization is another one I'm working on, and it may be hard, though less hard than provenance, because my IR uses basic block arguments instead of phi nodes. The basic idea is that optimization happens differently, so canonicalization should be easier than in LLVM because I'm going to design it to be.
* User-specified instructions with a way of defining their semantics in code.
* I'm starting with e-graphs and going from there. Still working on this.
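To make the phi-node point above concrete, here's a hypothetical textual sketch (not my actual IR syntax) of the same loop counter in LLVM's phi style versus the block-argument style used by MLIR and Cranelift. With block arguments, each predecessor passes its value explicitly at the branch, so the merge point has no special instruction that must sit first in the block:

```
; LLVM-style: the merge block starts with a phi node
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %i.next = add i32 %i, 1
  br label %loop

; Block-argument style: predecessors pass values at the branch
loop(%i: i32):
  %i.next = add %i, 1
  br loop(%i.next)
```

The second form keeps the dataflow at the edges rather than in the block body, which is part of why canonicalization rules get simpler.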
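For readers unfamiliar with e-graphs: the core idea is that equivalent expressions are merged into e-classes via union-find, so a rewrite records an equivalence instead of destroying the old form. This is a minimal Python sketch of that mechanism (my own illustration, not code from the project):

```python
# Minimal e-graph sketch: e-classes of equivalent nodes, merged with union-find.
class EGraph:
    def __init__(self):
        self.parent = {}   # union-find parent for each e-class id
        self.classes = {}  # root e-class id -> set of (op, child-class-ids) nodes

    def find(self, x):
        # Follow parents to the root, with path halving for efficiency.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def add(self, op, *children):
        # Canonicalize children, then hash-cons: reuse an existing class
        # that already contains this exact node.
        children = tuple(self.find(c) for c in children)
        node = (op, children)
        for cid, nodes in self.classes.items():
            if node in nodes and self.find(cid) == cid:
                return cid
        cid = len(self.parent)
        self.parent[cid] = cid
        self.classes[cid] = {node}
        return cid

    def union(self, a, b):
        # Merge two e-classes; both sets of nodes survive in the merged class.
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.classes[a] |= self.classes.pop(b)
        return a

eg = EGraph()
x, one, two = eg.add("x"), eg.add("1"), eg.add("2")
mul = eg.add("*", x, two)    # x * 2
shl = eg.add("<<", x, one)   # x << 1
eg.union(mul, shl)           # record x*2 == x<<1 without losing either form
```

After the `union`, both `x*2` and `x<<1` live in one e-class, and an extraction pass can later pick whichever form is cheapest for the target.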
I agree with you that an instruction list is not interesting at all.
Anyway, yeah, you are right, but I think I have mostly satisfactory answers.