Everyone always does this - plays down how much work thing X is if they themselves have done thing X. From this perspective you are ineluctably led to a binary classification of hardness:
1. Things that have not been done are hard.
2. Things that have been done are easy.
Ok maybe but given the choice between implementing my own IR and using LLVM IR and going to the beach, I choose the latter.
Depending on your definition of “from scratch” I’d agree it isn’t all that hard.
I mean, if I can do it…
Just a couple of days ago I was poking at one of my yak-shaving projects, which involves a generator for IR nodes as part of a backend for… something; I haven’t quite decided where it’s going yet. It doesn’t really matter, because it’s just the next logical step after the already-written AST node generator.
This is just something I’m doing because I find it interesting, and I haven’t spent all that much time on it. All I know is that if I want to lower an AST to IR I need IR nodes, and in order to know what functionality an IR generator needs, I must have an IR to target.
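As a rough illustration (the names and layout here are placeholders I'm inventing, not the actual generated code from that project), the output of an IR node generator might look something like this:

```cpp
// Purely hypothetical sketch of a generated IR node set; every name here is an assumption.
#include <cstdint>
#include <vector>

struct Value { uint32_t id; };            // SSA-style reference to a defined value

enum class Opcode { Add, Mul, Load, Store, Branch, Return };

struct Instruction {
  Opcode op;
  Value result;                           // the value this instruction defines
  std::vector<Value> operands;            // the values it consumes
};

struct BasicBlock {
  std::vector<Instruction> insts;
  std::vector<uint32_t> successors;       // indices of successor blocks
};

struct Function {
  std::vector<BasicBlock> blocks;         // block 0 is the entry block
};
```

An AST-to-IR lowering pass would then, roughly, walk the AST and append Instruction records to the current BasicBlock - the "next logical step" after the AST node generator.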
As they say, you eat an elephant one byte at a time.
"Designing an IR" may or may not be hard. Even designing a close-to-optimal generic IR that can be sourced from many different languages and target many different backends, each with their own idiosyncrasies on both ends, may not be "hard" by some definition of hard (apparently one where a lot of work can still be easy). But designing optimizations and transforms that work well in practice across all those hacks and edge cases, well enough that major companies now base their toolchains on your optimizations? I'm pretty sure that's gotta be hard.
The previous comment was about optimizations, though I can see how it was confusing. Someone asked: "Question to pros: Things like constant propagation, loop unrolling, TCO, memory reuse etc., seem simple in theory. Yet why do languages end up depending on LLVM and not implement these algorithms on their own and avoid a monstrous dependency?" The response said optimizations are hard and recommended designing an IR specifically in order to optimize it yourself: "pick any language that has a parser impl and design an IR for it and then optimize it (designing an IR from scratch can't be that hard can it?)"
I have a commercial product, a development tool, that gets a huge chunk of value from targeting x64, arm/arm64, and wasm. I actually do like C++, but I don't like the massive bloat of LLVM. Still, getting all these backends for free, and how easy it was to get a 90%-there thing going by writing parser -> IR and then having everything else just taken care of, made it worth it for me. Compilation speeds definitely hurt from the LLVM dependency, but it's a tradeoff.
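For anyone wondering what the "parser -> IR and everything else taken care of" step looks like in practice, here is a minimal sketch using LLVM's C++ IRBuilder API. It only emits a toy add function and dumps textual IR; the actual target codegen for x64/arm64/wasm (and everything else a real frontend does) is omitted:

```cpp
// Minimal sketch: build LLVM IR for "i32 add(i32 a, i32 b) { return a + b; }" and print it.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module mod("demo", ctx);
  llvm::IRBuilder<> builder(ctx);

  // Declare: i32 add(i32, i32)
  llvm::Type *i32 = builder.getInt32Ty();
  auto *fnTy = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
  auto *fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage, "add", mod);

  // One basic block: return a + b.
  auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
  builder.SetInsertPoint(entry);
  llvm::Value *sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
  builder.CreateRet(sum);

  // From here, LLVM's existing passes and backends handle optimization and codegen per target.
  mod.print(llvm::outs(), nullptr);
}
```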
From this kind of response, it's always so hard to tell whether it's easy for you because you have the relevant experience, or you just think it's easy because you don't.
A (very non-exhaustive) list of interesting questions about IR design:
* What's your model for delayed UB and UB materialization? (see the sketch after this list)
* What's your provenance model?
* What's your strategy for information retention and handling of flow-sensitive facts?
* What inherent canonicality is there?
* What's your approach to target or domain specific extensions?
* What is your strategy for formal verification of refinements?
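To give the first question (delayed UB) a concrete flavor, here is the classic motivating example; the code and commentary are my own illustration of the general issue, not a description of any particular IR:

```cpp
// A loop-invariant signed add that an optimizer wants to hoist above the loop:
int sum_shifted(const int *v, int n, int a, int b) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc += v[i] + (a + b);   // (a + b) doesn't change across iterations
  return acc;
}
// If signed-overflow UB were *immediate* in the IR, hoisting (a + b) out of the loop would be
// illegal: with n == 0 the source program never evaluates it, but the hoisted code would.
// LLVM instead lets `add nsw` produce a deferred "poison" value on overflow, which only turns
// into UB if it is actually used in a way that matters -- exactly the "when does UB
// materialize" question above.
```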
Questions like "What instructions does your IR support?" are fairly uninteresting, and are not what IR design is (mostly) about.
It's worth noting that LLVM's own IR design doesn't have a very good answer to some of those questions either, in part because making changes to an IR that is as widely used as LLVM IR is hard (been there, done that). It's easier to design a new IR than to change an existing one -- however, unless you just want to reinvent past mistakes, it is certainly helpful to have deep familiarity with an existing IR design and its problems.
You are correct in that I could be merely thinking it's easy. So let me answer your questions as best I can.
* UB is not exposed to the compiler/IR on purpose. I don't want compilers using it as an excuse to be adversarial like today's compilers. There is still UB, but as little as I can get away with (mostly in data races and race conditions).
* I'm designing a new provenance model right now. Incomplete. If anything is hard, it's this. And it might be.
* It is possible to attach information to any item or group of items. In fact, my IR will be able to attach so much information that it should be possible for high-level passes to reconstruct high-level code. Think MLIR with pure data. For example, you could group basic blocks together and label them as coming from a while loop (see the sketch below). You will also be able to generate and use e-graphs. My model will also be different from LLVM's: analysis will only happen on the original IR, with information generated only on the original, so no information is lost before analysis.
* Canonicality is another one I'm working on and may be hard, though less hard than provenance, because my IR uses basic block arguments instead of phi nodes (sketched below). But the basic idea is that optimization happens differently, so canonicalization should be easier than in LLVM because I'm going to design it to be.
* User-specified instructions with a way of defining their semantics in code.
* I'm starting with e-graphs and going from there. Still working on this.
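As a structural aside (my own illustration with made-up names, not the author's actual design), block arguments in place of phi nodes, plus the "attach information to any item or group" idea, can be sketched like this:

```cpp
// Illustrative skeleton only; every name here is a placeholder.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Value { uint32_t id; };

struct Branch {
  uint32_t target;              // index of the destination block
  std::vector<Value> args;      // values bound to the target block's parameters
};

struct Block {
  std::vector<Value> params;    // block arguments: these stand in for phi nodes at merge points
  // ... instructions ...
  std::vector<Branch> exits;    // e.g. two entries for a conditional branch
};

struct Function {
  std::vector<Block> blocks;
  // Arbitrary facts attached to groups of items, e.g. "blocks 3..5 came from a while loop",
  // so that higher-level structure can be reconstructed by later passes.
  std::unordered_map<std::string, std::vector<uint32_t>> annotations;
};

// Where a phi-based IR writes   x3 = phi [x1, %then], [x2, %else]
// here %then and %else branch to the merge block with args {x1} and {x2},
// and the merge block simply declares one parameter, x3.
```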
I agree with you that an instruction list is not interesting at all.
Anyway, yeah, you are right, but I think I have mostly satisfactory answers.
I'm not GP, but I am someone designing an IR. No, it's not that hard. A lot of work, yes, but not that hard.