This is (in my humble opinion) a fantastic introduction, and relevant to a much wider audience than the title implies. If you want to keep up to date with LLVM developments, you might also be interested in my LLVM Weekly newsletter (http://llvmweekly.org/). I try to highlight interesting commits, mailing list discussions, and blog posts (tips and submissions always welcome!).
Adrian might have mentioned an important drawback of using LLVM: active effort is required to keep up with it. Many an LLVM-based research project has gotten stuck on 2.9 or 3.2 at which point it starts becoming less and less relevant.
It's not that big of a deal, but active effort is required. The amount of effort depends on how many and which APIs your project uses; for a small/medium project perhaps a couple of hours every couple of weeks.
Glad you said this - it's a pain point, for sure. My PhD student Charlie Curtsinger (who will be joining Grinnell College this Fall) developed Stabilizer using LLVM (see http://emeryberger.com/research/stabilizer/, http://www.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf), and it is "stuck" for now in exactly the way you describe (he plans to fix it soon, but it will take a solid week or two). Of course, YMMV: Stabilizer is by its nature pretty invasive -- it randomizes code and stack frames dynamically during execution (in addition to doing fine-grained heap randomization), and this touches a lot of stuff.
I often hear about llvm breaking backwards compatibility but I don't know much about the specifics. What are the parts that are more prone to change? Are there any parts that are more stable?
This seems like a strange observation. Active effort is required to keep up with any compiler framework, except the ones that aren't going anywhere. LLVM and Clang at least attempt to provide fixed interfaces for interfacing. In contrast The GCC maintainers are actively hostile towards that idea.
Honestly, keeping up seems like less of a concern for research projects. Most projects will be abandoned after providing grist for a few papers. The most useful and practical will get integrated back into LLVM. The ones that are not yet practical but of ongoing interest will have multiple groups working on them. A good example of this is Polly.
For the past year, i had two projects which required me to work with LLVM. Because of the scarcity of articles/documentation, i found it really hard to get into it. The API Docs and a few articles by Eli Bendersky(thank you so much!!! ) was all i found useful.
But once i got past the basic hurdles, the llvm project code was so well written, i felt a certain pleasure working with it.
I've used LLVM a few times to generate code, and it's pretty easy to use (and moderately stable between revisions). e.g. http://cowlark.com/calculon is a function evaluation language I wrote as a sort of cheap shader language for a ray tracer; the whole thing is 4000 lines of C++ header.
I have also attempted to port LLVM to new CPU architectures. This is a totally different deal --- it's poorly documented, very hard to make progress with, and peculiarly unfinished. (e.g. the pattern matcher language, which is excellent, is weirdly unable to match certain types of pattern, which requires you to write big chunks of C++ which manually look for particular DAG patterns and convert them into other patterns.) Debugging is a pain; get a pattern wrong and it'll just hang as the pattern matcher state machine goes insane, or else produce weird, contorted and incomprehensible error messages.
I would say that it's probably about as painful as gcc, although the pain points are in a very different place. And LLVM actually has people and momentum behind it, while the gcc mailing lists are dead quiet.
I'm currently investigating libfirm, which is the open source C compiler nobody's ever heard of. It looks really rather nice, and builds in less time than LLVM takes to run its configure script...
A nice introduction, I'm currently working on a compiler which uses LLVM for code generation... it was very difficult to get into at first. Especially since I was using C (now Go), so I would have to work through 2/3 layers of language and documentation. (Was just C to C++, now it's Go, to C, to C++).
If you're interested in learning more about LLVM, there are some good open source projects that use it. If you aren't using C++, people have also ported the kaleidoscope tutorial project to Haskell, Rust, C, etc... Additionally, a lot of bigger compilers like Rust, and Clang use it - Swift also uses LLVM, and should be open source soon?
Working with the OCaml bindings feels similar. A lot of the functions are explained as wrapping some particular function from the C++ code, which is itself only explained in the docs with a link to its source implementation.
We use LLVM to at MapD (http://mapd.com) to compile SQL queries to CPU and GPU machine code - it has given us a major boost over an interpreter based approach. For more see here - http://devblogs.nvidia.com/parallelforall/mapd-massive-throu.... If you have a background in LLVM or compilers in general and are interested in tackling problems like this please reach out at jobs@mapd.com.
Daytona(http://www.research.att.com/projects/Daytona/index.html ), mainly developed by Rick Greer in 1990's, does the same thing. They have implemented superset of SQL; and this SQL is compiled into machine code. Daytona in those days used to parse call records for AT&T. They were processing Terabytes of data when the average laptop had a 20GB hard drive. Rick wrote a paper titled "Daytona
And The Fourth-Generation Language Cymbal".
You should contact Rick, if you guys need to pick his mind. He is very down to earth man. And he is retired, lives in Morristown, NJ area.
> LLVM is nicely written: its architecture is way more modular than other compilers. Part of the reason for this niceness comes from its original implementor, who is one of us.
For those interested in seeing examples of LLVM hacking in action, I would recommend reading the source for Halide – https://github.com/halide/Halide – which is an image-processing DSL implemented in C++ and piggybacking on LLVM. I myself learned a lot about LLVM this way.
This isn't germane to the main topic but this paper they cite about using a compiler pass to verify OS security ("Protecting Applications from Hostile Operating Systems") is pretty darn interesting: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf
One of my work projects involved building a .NET runtime with LLVM. It evolved from conservative garbage collection through to an advanced precise collector supporting a multithreaded runtime.
The conservative collector required no support from the runtime as they operate by treating every word-sized value on the stack and heap as a potential reference. All you need for a conservative collector is to be able to know where your stack starts and ends, where your heap begins and ends, and where your globals are.
We went through a couple designs for precise collectors and they all used the same meta-information. A precise collector needs to know what is a reference and what is not so your compiler needs to emit stack maps (tables of offsets where, given your PC in a function, all the live objects are), global root information, as well as object layout information.
Sometimes this becomes more complex. Object layout information included both location of references within type instances as well as whether or not an object is an array. Arrays of value types require both.
.NET also has a type of reference, a managed pointer, which can be used to refer to the middle of an object, such as a pointer to a value type in the middle of an array, which contributes to the parent object's liveliness information.
To sum up, conservative garbage collection can be easy to add. Precise garbage collection much less so. Language specifics can throw a big wrench into things.
If you want an imprecise (conservative) garbage collector, the GC and the compiler are almost disjoint. The GC only needs to know where the heap and stack are, and be able to capture register values at GC time. In other words, it's a runtime-library issue, not a compiler issue. (See the Boehm GC for an example that works with a stock C compiler.)
If you want a precise garbage collector, you need type information to know which heap and stack slots and registers are pointers. This is much harder because your compiler needs to emit metadata for each struct and then keep stack frame info or somesuch at the GC safe points so the runtime knows what the thread's roots are.
Most of the precise GCs I know of are in JIT'd or interpreted languages where you have this type info anyway... AOT-compiled Lisp is the one counterexample I can think of, but usually Lisps solve this in a different way by using tagged pointers (so you know if any heap slot is a pointer by e.g. its LSB).
The issue is not having the type info available. AOT compilers certainly have that (or at least, enough of it to get the job done).
The issue is that most compilers with precise GC---AOT or otherwise---are built around that assumption from the very beginning. Therefore, they choose representations, algorithms, and optimizations that are amenable to GC, and in particular don't destroy the type info necessary to make GC work.
It is possible in theory to do the same with LLVM, but difficult in practice. Most existing LLVM-based GC'd languages compromise on performance by requiring LLVM to "root" pointers, effectively forcing them to stay on the stack (instead of registers) so that the GC can see them when it needs to. LLVM's optimizations (particularly at the machine-specific level) were previously allowed to do arbitrary things with register-based pointers, which is obviously bad for a GC if no copies of those pointers exist on the stack. It just takes a long time to take a code base that large and untangle the code from such basic assumptions.
This appears to be changing, as the release nodes for LLVM 3.6 indicated. But it will take time. In the mean time languages on LLVM are either going with conservative GCs, or with precise GCs with the performance degradation noted above (e.g. OCaml), or with no GC (e.g. Rust for the time being).
Most of the precise GCs I know of are in JIT'd or interpreted languages
GC is orthogonal to JITing. OCaml and Haskell are two examples of languages with non-JIT AOT compilers (ocamlopt and ghc, respectively) that do precise GC. I'm sure there are plenty of others.
Yep, you're right, it is. And I totally forgot about those compiled functional languages. Thanks! I mentioned JITs and interpreters because those are the obvious cases where type information still exists at runtime, but the compiler can emit that metadata too (as I mentioned).
GCing has nothing to do with a language being functional.
All you need is access to the roots and the ability always to determine which data is another pointer. The main 'enemy' of this is the ability to cast integers to pointers (as you may in C/C++).
Interpreters don't necessarily do GC: if the language that the interpreter is written in has GC, then the interpreted language is automatically GCed. If the interpreter is written in a non-GC language you have to manage the interpreted program's memory manually too (which you may do using a GC for the interpreted language).
As an aside: this is comparable to evaluation order: if the interpreter is written in a call-by-value language, then the interpreted language is CBV, if the interpreter is written in a lazy language, the interpreted language has lazy evaluation. (Assuming you are not doing something fancy like translation to continuation-passing-style.)
> GCing has nothing to do with a language being functional
Yep, I didn't say it did! You mentioned Haskell and OCaml as counterexamples. I referred to them as "the compiled functional languages". This isn't a claim that only compiled functional languages do GC.
> Interpreters don't do GC
Sure they do! One example: Ruby (the original interpreter in C). It has a full mark/sweep collector that traverses user data structures. Custom object types implement their own 'mark' routines, so it's fully precise. (I've implemented a C extension that uses this API.)
Another example: often, Java runtimes interpret Java bytecode on the first pass, before JIT'ing, to give fast startup on cold code. The JRE would be written in C or C++ (non-GC'd), and Java is GC'd.
In general an interpreter can add any additional or different semantics it needs to on top of the host language's semantics. I don't think your call-by-value claim is correct either, because function calls in the interpreted program do not map one-to-one to function calls in the interpreter. (The interpreter would generally manage a data structure representing the interpreted program's stack.)
My goodness, this is going off the rails. Sorry to OP for derailing. Just trying to fix things that are Wrong On The Internet...
One example: Ruby (the original interpreter in C).
OK, I should have said, and have now modified my original statement to "Interpreters don't necessarily do GC".
In general an interpreter can add any additional or different semantics
Yes, and that falls under "something fancy"! The naive way of interpreting object-language application by meta-language application transfers the latter's calling discipline to the former.
Of course, but no one claimed that interpreters necessarily do GC either, so this statement is not very useful. You keep arguing against claims that aren't there. The only claim ever made in response to OP was that precise GC requires runtime information, and that this can sometimes be easier when it exists anyway (as in JITs and real-world interpreters), but can also be implemented when the compiler produces it AOT. I hope that's clear enough :-)
What about all the ML-family functional languages? JIT compilation is a fairly recent innovation and I don't think its accurate to say that garbage collection is restricted to JIT compilers.
I never said precise GC (or even GC in general -- see Boehm GC which I cited) is restricted to JIT! Only that this is perhaps the most common case, because type info still exists at runtime. As I mentioned, the compiler can emit the necessary metadata for AOT cases. And I forgot to mention ML and Haskell and friends, sorry.
Even e.g. SBCL which is an AOT lisp runs into the issue if you share the C stack (needed for FFI) with the lisp control stack. On targets with few registers (i.e. not SPARC or Power) it uses a conservative GC that pins any memory pointed to on the control stack, and a precise GC for all heap objects.
Long time ago I read a post by Mathew Flatt about LLVM and gcc and how mini optimization and selling points were important not able to find the post, but it was a interesting read.
By market cap (or enterprise value), Apple's the largest public company now, worth about $700b, which is about 2x as much as the next biggest (Exxon, Microsoft, etc.). Although Saudi Aramco is probably the world's most valuable company if non-publicly-traded companies are included.