The Illustrated GHC [pdf]

amelius · on Jan 1, 2015

As someone interested in functional programming and compilers, I recently tried to use GHC's intermediate output. However, I got a little disappointed about the documentation that's available for compiler writers. From what is essentially a research compiler, I'd expected a little higher standards, also in the department of available tools, and examples. In other words, the experience came too close to what I would call "hacking".

Furthermore, the "Core" language is, I believe, for many purposes, more complicated than necessary (e.g., it is typed). It would be nice if there were a simpler alternative, e.g., for small projects.

Of course, I could be wrong about this (perhaps I looked in the wrong places), but this is just what I noticed.

Besides this, of course, Haskell is a cool language, and GHC an awesome compiler! :)

agumonkey · on Jan 1, 2015

GHC STG has many articles about it:

The 1992 SPJ paper http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=70D... (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.3...)

https://www.google.fr/search?q=spj+ghc+stg -> http://lambda.jstolarek.com/2013/06/getting-friendly-with-st...

Also: https://ghc.haskell.org/trac/ghc/wiki/ReadingList

dons · on Jan 1, 2015

There are several papers on why GHC uses a typed intermediate representation, System FC, with safety being the main reason.

http://research.microsoft.com/en-us/um/people/simonpj/papers...

An untyped representation is "just code", and that way lies madness. Bugs at least.

tome · on Jan 1, 2015

Core is just a desugared Haskell. Do you know there's another much simpler intermediate language called STG?
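For anyone who wants to see what "desugared Haskell" looks like in practice, here is a minimal sketch. It assumes a reasonably recent GHC; `-ddump-simpl` asks the compiler to print the optimised Core for the module, and `-dsuppress-all` trims most of the annotations. The function `sumTo` is just a made-up example. An STG dump is available through the `-ddump-stg` family of flags, whose exact names have varied between GHC releases, so check the user's guide for your version.

```haskell
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all #-}
-- Compiling or loading this file makes GHC print the Core it produced
-- for the definitions below.

-- A small accumulating loop (hypothetical example). In the Core dump the
-- local helper `go` typically shows up as an explicit recursive binding.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)
```

Loading the file in GHCi and evaluating `sumTo 10` gives 55; the interesting part is the dump printed at compile time, not the result.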

thoughtpolice · on Jan 1, 2015

> As someone interested in functional programming and compilers, I recently tried to use GHC's intermediate output. However, I got a little disappointed about the documentation that's available for compiler writers. From what is essentially a research compiler, I'd expected a little higher standards, also in the department of available tools, and examples. In other words, the experience came too close to what I would call "hacking".

We don't really emphasize or support people using the intermediate language in an external way very well. They're parts of the compiler that are intimately and deeply tied to other internals elsewhere, and it's not really an explicit design goal that those components be reusable in a general way.

Now, people do use GHC for this (years ago I even helped write a compiler that transformed GHC's core language into a whole-program IR that was compiled to C), and we do have an API you can use to leverage the compiler, but overall the number of people using it for things like this, vs using it for things like dynamic loaders or typechecking utilities (such as ghc-mod), is very small.

In fact, just last year we removed 'External Core' from GHC, which was a way of serializing the Core representation to disk. Why? Because it was actually broken for close to two years I think, and nobody ever really complained or fixed or wanted to support it! And after a discussion, we didn't want to support it either. It has been used, but when it bitrots that badly, I think it's clear this isn't one of the largest driving use cases for the compiler.

That said, we could improve the documentation a lot, and add a lot more examples. But I don't think there ever has been (or probably will be) a huge push to make the core IRs reusable in an easy way. It's simply not a high priority design goal.

> Furthermore, the "Core" language is, I believe, for many purposes, more complicated than necessary (e.g., it is typed). It would be nice if there were a simpler alternative, e.g., for small projects.

The Core language being typed is a good thing for GHC - it makes it easy to turn on an internal typechecker and determine if the compiler has produced invalid IR, your optimization pass did, etc. It adds a lot of safety to your produced programs when you can ensure they type-check in a sane way.

This component, along with the core linting passes, has likely caught an _innumerable_ number of optimization and desugaring bugs over the years, while being extremely low cost to support and use. Overall, typed Core has been a huge win for GHC, and I don't ever see us adopting a different route. In fact, it's planned for our Core language to get an even fancier (dependent) type system not far off in the future. :)
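To make the "internal typechecker" idea concrete, here is a toy sketch. It is not GHC's Core and not how GHC's lint pass is implemented; every name in it (`Ty`, `Expr`, `lint`, `buggyPass`, `checkedPass`) is hypothetical. It only shows the general technique: because the IR carries types, you can recompute the type of a term after each pass and flag a pass that produced something inconsistent.

```haskell
-- Toy typed IR; NOT GHC's Core. All names here are hypothetical.
data Ty = TInt | TBool | TFun Ty Ty
  deriving (Eq, Show)

data Expr
  = Var String
  | IntLit Int
  | BoolLit Bool
  | Lam String Ty Expr   -- the binder carries its type, as in a typed IR
  | App Expr Expr
  | If Expr Expr Expr
  deriving (Eq, Show)

type Env = [(String, Ty)]

-- "Lint": recompute the type of an expression, or report the first error.
lint :: Env -> Expr -> Either String Ty
lint env (Var x)     = maybe (Left ("unbound variable " ++ x)) Right (lookup x env)
lint _   (IntLit _)  = Right TInt
lint _   (BoolLit _) = Right TBool
lint env (Lam x t b) = TFun t <$> lint ((x, t) : env) b
lint env (App f a)   = do
  tf <- lint env f
  ta <- lint env a
  case tf of
    TFun p r | p == ta -> Right r
    _ -> Left ("bad application: " ++ show tf ++ " applied to " ++ show ta)
lint env (If c t e)  = do
  tc <- lint env c
  tt <- lint env t
  te <- lint env e
  case () of
    _ | tc /= TBool -> Left "condition is not Bool"
      | tt /= te    -> Left "branches differ in type"
      | otherwise   -> Right tt

-- A deliberately buggy "optimisation": rewrites `if c then t else e` to `c`.
buggyPass :: Expr -> Expr
buggyPass (If c _ _) = c
buggyPass e          = e

-- Run a pass on a closed term and lint the result; reject the pass output
-- if it is ill-typed or if its type differs from the input's type.
checkedPass :: (Expr -> Expr) -> Expr -> Either String Expr
checkedPass pass e = do
  before <- lint [] e
  let e' = pass e
  after <- lint [] e'
  if before == after
    then Right e'
    else Left ("pass changed type from " ++ show before ++ " to " ++ show after)
```

With `If (BoolLit True) (IntLit 1) (IntLit 2)` as input, `checkedPass id` returns the term unchanged, while `checkedPass buggyPass` returns a `Left` saying the type changed from `TInt` to `TBool`. An untyped IR would have accepted the broken output silently.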

amelius · on Jan 1, 2015

I agree that the number of potential users for this particular use of GHC is small. But as with everything, it is the small groups of people that produce the most interesting things ;)

I don't think an external format (serialized Core) is even necessary. Just a well-documented API would be nice. I'm hoping that somebody could write a set of examples, and put those in a test-suite so that it keeps getting updated whenever something breaks it.

I also agree that typing is extremely useful. But for small research projects, it can get in the way of the actual goal.

thoughtpolice · on Jan 1, 2015

Yes, I suppose my larger point was more to illustrate that nothing is free. GHC is software like any other piece of software and it has a lot of competing constraints we must balance.

If the cost of supporting reusable IRs like you want is high in terms of LOC and needed changes, but the number of people who will leverage that work is incredibly small, it's honestly not clear to me if making that change and maintaining it is sensible or worthwhile. It might be worth it purely from a code cleanliness standpoint if it did a lot of cleanup, but that's not immediately clear either. We have to maintain everything people submit to us, presumably forever - the barrier to entry for large changes should be high, and well-motivated.

It is similar with External Core; it was several thousand extra lines of code throughout the compiler that were rarely - if ever - used by anyone. Did a few people use it to do cool work - for example, by writing the Intel Research Haskell compiler? Absolutely, and Intel's results were incredible - but that does not mean it's worth keeping that ball of code around for that one use case.

I'd actually argue keeping it for that one case would be a terrible decision for everyone, and when I deleted that part of the compiler, Intel's compiler lingered in my mind, as it was a user. But the fact is nobody was maintaining, submitting patches or using it in a modern GHC, including them - so I rm -rf'd it.

GHC is a research compiler, but it is also a production compiler for many users, and a project several dozen of us work on.

So while I'm sure your work is incredibly interesting (most of our large features come from people doing interesting research, so I understand your plight!), this really doesn't mean large, sweeping changes on our part - for your one case - are very worthwhile at all for us. But maybe you don't need sweeping changes, either!

(Of course, it's impossible to say how many people did not use GHC because these constraints made it unsuitable for their projects - but we have to draw a line somewhere and balance our own needs vs those of others.)

> Just a well-documented API would be nice.

To be clear - what API exactly are you envisioning? An API that makes it easy to construct Core or STG programs using GHC's ASTs, and then feed them through the rest of the compiler? Or an API to construct and manipulate the same Core representation GHC uses, so you can later do whatever you want with it, perhaps with a separate tool?

It sounds like this is something you could tackle with a library that wrapped the GHC API. The problem is the Core representation in GHC is very prone to change, so I'm not sure there's any sane way to really maintain things like API stability between versions. But you also might not need that - just a simple API around "GHC Core version X.Y.Z" to build programs may be enough.

We could distribute such a library with GHC, but this sounds like something easily doable out-of-band first, at least.

Either way, something like this I can surely see the need for. Where it should live is a different question.

> I also agree that typing is extremely useful. But for small research projects, it can get in the way of the actual goal.

Yes, but again, GHC's internal needs and developer needs outweigh those of random small research projects that might use it - the fact that it makes some kinds of work slightly harder is unfortunate, but in the grand scheme of things, this isn't really a concern at all for any of us, and likely never will be. There's no free lunch, I'm afraid.

amelius · on Jan 2, 2015

I would be mostly interested in reading the Core or STG representation, so I can apply transformations to it. Of course, it would also be nice to generate intermediate code from scratch, but that is not an obstacle at this point (as it would be easy to just generate Haskell code instead).
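As a rough illustration of the kind of transformation meant here, the sketch below runs a bottom-up rewrite over a toy expression tree. It is not GHC's Core or STG and does not use the GHC API; the `Expr` type and the functions `bottomUp`, `simplify` and `optimise` are all made up for the example. The real Core data type has more constructors (types, coercions, case alternatives and so on), so an actual pass would need more cases.

```haskell
-- Toy untyped expression tree; NOT GHC's Core or STG. Names are hypothetical.
data Expr
  = Var String
  | Lit Int
  | Add Expr Expr
  | Mul Expr Expr
  | Let String Expr Expr
  deriving (Eq, Show)

-- Apply a rewrite to every node, children first (bottom-up).
bottomUp :: (Expr -> Expr) -> Expr -> Expr
bottomUp f expr = f (descend expr)
  where
    descend (Add a b)   = Add (bottomUp f a) (bottomUp f b)
    descend (Mul a b)   = Mul (bottomUp f a) (bottomUp f b)
    descend (Let x r b) = Let x (bottomUp f r) (bottomUp f b)
    descend leaf        = leaf

-- One local rewrite step: fold constants and drop trivial arithmetic.
simplify :: Expr -> Expr
simplify (Add (Lit a) (Lit b)) = Lit (a + b)
simplify (Mul (Lit a) (Lit b)) = Lit (a * b)
simplify (Add e (Lit 0))       = e
simplify (Mul e (Lit 1))       = e
simplify e                     = e

-- The whole pass: run the local step at every node of the tree.
optimise :: Expr -> Expr
optimise = bottomUp simplify
```

For example, `optimise (Let "x" (Add (Lit 1) (Lit 2)) (Mul (Var "x") (Add (Lit 1) (Lit 0))))` evaluates to `Let "x" (Lit 3) (Var "x")`. Keeping the traversal (`bottomUp`) separate from the local rule (`simplify`) means a new transformation only needs a new rule, which matters once the tree has many constructors.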

I'm not looking for high-performance solutions, just some tools to make researching new ideas a little simpler.

To give you some idea about the things I'd like to do:

* Transforming programs into "incremental programs", so they run faster on changing inputs.

* Automatically translating functional data-structures to other languages (C++)

* Sending code over the network

I tried using the GHC parser directly, but that route turned out to be too much work (too many cases to handle). Then I looked at Core, but the documentation was lacking or not up to date.

So, from where I stand, the internal documentation is the biggest problem, and, according to good software engineering practices, there can be no excuse for having no good internal documentation :)

Besides just plain documentation, a tutorial with examples would be nice, but I can understand if that is not a priority.

Anyway, keep up the good work :) I'm happy with GHC as it is. I just wish it were a little more "open".

hardwaresofton · on Jan 2, 2015

Fantastic in-depth PDF. While lacking the talk (as others have mentioned), I was able to determine in broad strokes what's going on at the lower layers of Haskell. I particularly found the FFI section (e.g., the breakdown of a call to getLine) extremely interesting/useful.

While I think real understanding will come if/when I have to actually wrestle with the lower layers of GHC (or am contributing to the source code or something), this served as a great overview, and a quick explanation of how Haskell "really" works.

jamesfisher · on Jan 1, 2015

This looks great, but needs some interpretation, like an accompanying talk. I got lost in box-and-line diagrams which I did not know how to interpret.

An introduction to GHC which I found helpful is its AOSA chapter: http://www.aosabook.org/en/ghc.html

mcguire · on Jan 2, 2015

I haven't looked at the source; is GHC using both C-- (http://www.cminusminus.org/) and LLVM?