Learning to write a compiler

Hoff · on Aug 29, 2010

Here is an earlier incarnation of the "writing a compiler" thread:

http://news.ycombinator.com/item?id=1608129

silentbicycle · on Aug 29, 2010

Oops, I downvoted you when I meant to upvote. There is probably another incarnation or three before the one you linked, too.

I'd recommend Appel's "Modern Compiler Implementation in ML" over the dragon book.

mitjak · on Aug 29, 2010

I noticed there are a few versions, including for Java and C. Any reason why the ML version is better (not against ML; just don't know it well enough (at all actually)).

silentbicycle · on Aug 29, 2010

A very large part of compiler operations involve analyzing and transforming tree/graph data structures (once you've parsed to an AST), so you want a language with good native support for those structures. Pattern matching and garbage collection help immensely, as do ML's inferred static types. Working with complex data structures is really where ML shines.

It's not just ML, though - using a language where you can work directly with native tree structures (ML and Haskell's type constructors, Lisp sexps, Prolog functors, Lua tables, or even JSON) means that you can postpone learning syntax/parsing and start with the language's semantics. You can save the syntax details for when you have the language design worked out. Compiler books based on C (such as the dragon book) spend such a long time on parsing upfront because they don't have much choice.

Also, Andrew Appel works on SML/NJ, and the C and Java books are translated from the ML version. I've looked at the C version (which uses lex and yacc instead of ml-lex and ml-yacc, mallocs but doesn't free until later in the book, etc.), and while it's still a good compiler overview book, I'd strongly suggest getting the ML version instead. The ML version uses Standard ML (SML), but I've only used OCaml and didn't have trouble following it and making the minor adaptations along the way. (I haven't looked at the Java version. Whatever.)

rlm · on Aug 29, 2010

I think that ML has some features (such as pattern matching) that makes it a better fit for writing a compiler than Java or C.

mapleoin · on Aug 28, 2010

I'm upvoting this just to hopefully start a discussion and hear more about this topic from fellow HNers.

I hear the Dragon Book mentioned almost daily around here; I'm intrigued. Do you think this book could be read by someone who's not actually interested in writing a compiler? Does the book stand by itself? I'm interested in reading it, but I don't really think I have the motivation to start writing my own compiler (even a mock one).

pbiggar · on Aug 29, 2010

DO NOT READ THE DRAGON BOOK.

It's not a very good introduction to compilers. Read Engineering a Compiler by Cooper and Torczon. The Appel book is also very good, and contains some stuff about functional and logic languages that are generally missing from most compiler texts. If you're an enthusiast, but not in it to build a compiler, I really enjoy Programming Language Pragmatics. For more advanced material, use the Muchnick book, or the Compiler Design Handbook (both editions have different materials). They also provide excellent pointers to literature, but aren't great for beginners.

I get the impression that most people who recommend the Dragon book haven't read it. When I was doing my PhD I had five or six books on my shelf, and was constantly unimpressed with the Dragon Book, but always impressed with Cooper/Torczon, Muchnick and Appel.

haberman · on Aug 29, 2010

> Read Engineering a Compiler by Cooper and Torczon.

That book has an error in it which I have reported to the authors twice but never heard anything back. This leaves a bad taste in my mouth.

Here's what I wrote to them:

Subject: Errata for "Engineering a Compiler" Date: Sat, May 26, 2007 at 8:12 PM

Hello,

I noticed that your section on DFA minimization is labeled "Hopcroft's Algorithm," but doesn't seem to describe the algorithm explained by Hopcroft in his 1971 paper [0]. Your algorithm appears to be n^2, where Hopcroft's is n log n. David Gries gave a somewhat more digestible presentation of the same algorithm in a follow-up paper in 1972 [1].

Hope this information is useful!

Sincerely, <me>

[0] John E. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. Technical Report: CS-TR-71-190, 1971. Available online at ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/71/190/CS-TR-71-190.pdf

[1] David Gries. Describing an algorithm by Hopcroft. Acta Informatica, 2(2):97-109. Online reference: http://www.springerlink.com/content/r5631549671g6251/

ahn · on Aug 29, 2010

>I get the impression that most people who recommend the Dragon book haven't read it.

These days you can usually tell that they haven't read it if they refer to it as "the Dragon book" and don't say anything about the edition. The same thing goes for "Knuth", "the Appel book", etc.

kingkilr · on Aug 29, 2010

Couldn't agree more, I hack on compilers a fair bit (one of my own, plus the JITs in PyPy and Unladen Swallow) and the Dragon Book was impossible for me to get anything out of.

dlo · on Aug 29, 2010

IMO, the format of this book makes sense for the context it was intended for: Stanford's course on optimizations and program analysis, where the problem sets and reading provide theoretical knowledge and the projects -- hacking on the joeq VM/compiler-infrastructure -- provide implementation experience. These distinct experiences make for a well-rounded education in advanced compiling techniques. (Disclaimer: I have not taken this course and am basing these comments on a quick look at their course Web page.)

So reading the Dragon Book is more akin to reading mathematics than reading a handbook on compiler implementation. Its treatment of data-flow analysis is another example of this: While the Dragon Book gives top-notch treatment of the mathematical underpinnings of data-flow analysis (semi-lattices, partial orderings, monotonicity, greatest lower-bounds, etc.), it does not go into how to implement an efficient worklist algorithm; it does not even mention du-chains AFAICT. At most, it suggests using a bit-set to represent the reaching definitions that enter and exit a basic block.

So I too am in agreement. But maybe this is what you want, mapleoin. If you just want to know what a reaching def is or how compiler writers know that their data-flow analysis is going to terminate, this is a book that will not bog you down with too many implementation details.

carterschonwald · on Aug 29, 2010

I think there more obvious problems with the dragon book:

in particular, ~300+ pages are spent focusing on techniques for doing parsing and lexing.

Now if I want to write a baby yacc [sic], this might be useful. But in this modern era, parsing is a very well understood problem, with lots of easy to use tool. Yes in a production compiler helpful syntax and type error messages are key, but when you're learning the compiling part, you want a book that doesn't spend half of its volume on that topic. Also, the 2nd edition doesn't seem to have a single level reader in mind.

The intro book people should look at is (as mentioned elsewhere), should be the intro to compilers in ml book by appel, and for advanced stuff folks should look at stuff like appel's compiling with continuations, the munchnick (spelling?) book, and one or two others.

I think the point is that 1) most exposure to the dragon book for most folks predates the 2nd edition, and in your experience, most of the learning sounds like it was from the lecture notes and problems sets rather than the text (presumably used as a reference supplement in practice?)

barrkel · on Aug 29, 2010

Even if you are writing a production compiler, you will likely not be using a tool, nor writing a tool, to generate the parser. Most production commercial compilers in practice use hand-written recursive descent with a hand-written lexer, which itself potentially uses feedback from the parser (very handy for parsing C or C++). Recursive descent usually gives context for error recovery that matches the way most programmers think about language syntax; and with it being written by hand, the recovery can be tweaked to be as smart as necessary.

Also, IDE / editor support like intellisense can greatly benefit from integration with a recursive descent parser. If you encode the cursor's position as a special token, the parser can handle that token and do a deep return (throw an exception, longjmp, whatever) with relevant context, active scopes, etc.

carterschonwald · on Aug 29, 2010

this is also very very very true. And thus why parsing combinator libraries are one of the sanest defaults one can adopt in dealing with parsing

haberman · on Aug 29, 2010

> Now if I want to write a baby yacc [sic], this might be useful. But in this modern era, parsing is a very well understood problem, with lots of easy to use tool.

I certainly don't begrudge you using the existing tools, but speaking as someone writing a "baby yacc", I don't think parsing is quite the solved problem you make it out to be.

Yes there is TONS of literature on the subject, but new techniques and algorithms are being discovered all the time. ANTLR's LL(*) parsing hasn't been published yet (though I believe he's working on it) and only three years ago Frost, Hafiz and Callaghan published an algorithm for generalized top-down parsing in polynomial time. There's also the idea of PEG, published by Bryan Ford in 2004, a guy at UPenn who is carving out a set of languages between regular and push-down languages (http://www.cis.upenn.edu/~alur/nw.html).

All of this is to say; we're still discovering things about parsing. It's not a settled subject.

silentbicycle · on Aug 29, 2010

Ierusalimschy's LPEG (a PEG parser for Lua[1]) also has some new developments. IIRC, he found a way to greatly improve the space performance of PEGs. I've used it a lot (it's a nice middle ground between REs and a full parsing framework, and Lua is one of my favorite languages), but I'm not familiar enough with Ford's PEG implementation to be more specific.

Also, here's a good blog post[2] in which the author discovered how using parsing tools to syntax-highlight text as it's being modified quickly led him to the frontiers of parsing research.

[1]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html

[2]: http://www.codekana.com/blog/2009/04/02/on-the-speed-of-ligh...

dlo · on Aug 29, 2010

You make a good point; it does spend a lot of time on the front-end. As pointed out elsewhere, Muchnick is more thorough on back-end topics, which it appears we both agree are far more interesting. I've definitely heard of those other books before, and in particular, Appel's _Compiling with Continuations_ will be especially interesting when I look into compiling functional languages.

Actually, I don't have any personal experience with the Stanford course. I am reading the book for fun really. (For all its faults, it is quite engrossing!) I just thought it would be relevant to investigate under what context this book is generally used.

b3b0p · on Aug 29, 2010

I agree with this. I was told this was the so called bible of compiler development and must have. I cannot recommend it.

So, I bought it (cheapest copies used were over $100), tried to read it, and got nothing out of it. I gave up about half way through. It's been a few years, but it read like no other technical CS book I can remember. If I recall, it like a theoretical explanation, a long lecture on paper.

Edit: Grammar

dlo · on Aug 29, 2010

I started picking at Muchnick. From the look of it, it is almost all back-end stuff.

mouth waters

silentOpen · on Aug 28, 2010

Do you want to build a compiler or design a programming language? The Dragon Book is pretty heavy on implementation details if you just want a taste without building one.

If you're just getting started in mechanical program transformation, I strongly recommend using something like Structure and Interpretation of Computer Programs and its meta-circular evaluator to get the basics without the grunge of parsing and generation.

dlo · on Aug 29, 2010

It is possible to skip the first half of the Dragon Book if you are not interested in front-end stuff. (I am not.) Chapters 8 and onward are far more interesting IMO. You will encounters topics such as SSA, data-flow analysis, points-to analysis, binary-decision diagrams, call graphs, Datalog, etc.

pbiggar · on Aug 29, 2010

Meh. It's SSA treatment is poor. The dataflow chapter is OK. The points-to chapter isn't great (especially for someone so into it). BDDs and datalog are very poorly addressed, especially when trying to seek more information of Lam/Whaley's work on them.

Skip the whole Dragon Book, read Cooper/Torczon or Appel.

dlo · on Aug 29, 2010

I agree that its treatment of SSA is scanty. Muchnick, which I have started poking at, is much more thorough: The Dragon Book does go into what SSA is, but it does not tell you how to compute it while Muchnick does. However, I imagine it must have been a deliberate decision since Ullman et al. go into dominance frontiers quite readily shortly thereafter.

I admit that I skipped over the book's treatment of points-to analysis since I read the original papers by Lam and Whaley on it. Obviously, I can't speak for its treatment in the book, but I found the papers to be excellent. And indeed, one of them won a Best Paper award.

petercooper · on Aug 28, 2010

Yes. I bought and read it out of interest rather than a desire to actually build a compiler. As long as you're OK with set theory and general math notation, it's not a particularly tricky read, and you get to learn a lot of cool stuff about automata/state machines in an applied, practical sense.

mfukar · on Aug 29, 2010

I think the Dragon book is pretty approachable for someone with basic CS background. It's also quite extensive in its coverage of compiler implementation techniques. However, as others have already pointed out, certain aspects are not covered in full and implementation details are not always provided.

My suggestion? Read it. True to its title, "Principles, Techniques and Tools", it delivers just that. I don't think you'll find its contents impenetrable, but perhaps it will seem too abstract in case you're looking for implementation guidelines.

Some things that might bug you about the Dragon Book: 1) extensive coverage of lexical analysis and related techniques, 2) it's not suited for a more pragmatic course - this is better left to books like Muchnick's "Advanced Compiler Design and Implementation" and friends 3) it omits certain important advances in the field, such as the work of Pager, A Practical General Method for Constructing LR(k) Parsers. Acta Informatica 7, 249-268., 1977 (in all fairness to the book authors, all work after yacc was out in 1973 has apparently gone unnoticed by all).

Disclaimer: I've only read the Pearson International Edition, which according to what I hear, is identical to the first edition. I don't know to what extent this is true. Maybe someone can confirm/refute this?

tonfa · on Aug 29, 2010

> It's also quite extensive in its coverage of compiler implementation techniques.

With only two pages about SSA ? (which is the internal representation used by most compilers this days)

mfukar · on Aug 29, 2010

Yes.

The omission of SSA (which, remember, gained in popularity after the publication of R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph., ACM Transactions on Programming Languages and Systems, 13(4):451-490, in October 1991) is in fact one of the many advances that fall in the (3) category I mentioned. I assumed I wouldn't have to enumerate every single one of them. Also, remember the edition I'm referring to. I have no idea if the latest edition (2007, iirc) covers SSA, and to what extent.

Nevertheless, the book is still extensive and thorough in the concepts it covers.

tonfa · on Aug 29, 2010

I was referring to the 2007 edition (the first edition obviously doesn't mention it). It's a bit sad to see they didn't even think it deserved a chapter...

You can look at the new stuff from the 2nd edition here: http://www.pearsonhighered.com/educator/academic/product/0,1...

jhpriestley · on Aug 28, 2010

I had trouble getting through much of it, even though I tend to read technical books for fun. Lisp in Small Pieces or the Lambda Papers are more educational and easier to read. They will change the way you think about programming far more than the Dragon book.

The dragon book is okay as a comprehensive reference, but is bad at communicating the central idea of anything. You might want to skim it, looking for topics that interest you, then seek out several sources for those topics and see which clicks for you. If you start reading the Dragon book from page 1, you risk getting bogged down.

dlo · on Aug 29, 2010

I definitely would not recommend starting from page 1. You can pick at it gradually. Personally, I've gravitated more toward the second half of the book where it gets far more interesting. (I have the 2nd edition, where coverage of the back-end is vastly improved.)

Someone · on Aug 29, 2010

"could be read?" Of course. whether you would enjoy it is another issue. Compare;

  "Do you think I could take piano lessons if I am not actually interested in playing a piano?"

    "Do you think I could study architecture if I am not actually interested in designing buildings?"

The former is not popular, the latter is, but AFAIK few people study the grunt work of architecture for fun. It all depends on where your interests are. What part of compiler writing intrigues you?

domgblackwell · on Aug 29, 2010

You might study architecture if you want to be able to understand the buildings you see in places you travel to. Just as not everyone who studies the history of art wants to be a painter...

dekz · on Aug 28, 2010

If you're interested in language design also may I suggest, http://www.amazon.com/Language-Implementation-Patterns-Domai... and http://www.amazon.com/Language-Implementation-Patterns-Domai...

There are also some programmers around who are going through the process of designing and implementing their own languages, you can watch their progress and ideas. Here is 1(2?) examples, Magpie and Finch by munificent. http://bitbucket.org/munificent

It seems Magpie has started out using a C# compiler/interpreter but now uses Java?

ww520 · on Aug 29, 2010

The Dragon Book is quite readable. The best approach is to read each chapter by itself rather than going from beginning to end. If there's a topic interested you, jump to that chapter. You can jump to the regular expression chapter, which can be read by itself, jump to the code generation and program layout, or jump to the optimization chapter, which can be read by itself. The lex and parser topics sort of go together.

pvg · on Aug 29, 2010

When did reading books become a purely utilitarian thing? Interested in reading it? Read it. It's a good book. You don't have to do the exercises.

jerf · on Aug 29, 2010

"When did reading books become a purely utilitarian thing?"

When the answer to the question "How do I learn to do X?" is "Read this book."

Nice sentiment, wrong place. "Read this book even though it won't help you learn X" translates directly to "Don't read this book." in this context.

Groxx · on Aug 29, 2010

When we started having to prioritize our time.

i.e., around high school, with a significant increase after college / marriage.

okmjuhb · on Aug 28, 2010

The interesting thing about compiler writing is the drastic reduction in complexity that's happened. There are essentially off-the-shelf tools that can be used to handle everything before semantic analysis and everything after conversion to a reasonably high-level IR. This has dramatically reduced the cost of language experimentation; it's now feasible to bang out a new language in a day or two (assuming you're familiar with yacc, LLVM, etc.), and generate efficient code on a number of target architectures.

These tools mean that "learning to write a compiler" can take several different paths; on the one hand, we could learn about how each of these tools work (e.g. "how to write a parser generator", "machine-independent optimizations", "code generation", etc.). This is the approach taken by the dragon book. This gives the background theory and mathematics behind compiler-writing (and will likely be something people need to know if they want to write for instance an LLVM optimization pass). The problem with this is that, because of the trends mentioned above, it has little to do with the day-to-day of actual compiler writing and language experimentation (this is only somewhat true, since most "real" compilers will have a custom-written frontend and at least a little bit of knowledge about the target architecture).

The other approach is to use pre-written tools, to focus on language design decisions and applications. To be honest I'm not familiar with any resources that do this particularly well. This is the approach taken by most of the "we're going to build a compiler!" websites mentioned on the Stack Overflow post. This approach is probably more useful for most of the people who want to "write a compiler", but it leaves people with a very shallow knowledge of what's happening beneath the surface of the APIs they use.

I don't mean to imply that one of these approaches is "better" than the other (and indeed, the second requires a little bit of the first - it's difficult to debug an automatically generated parser without knowing at least a little bit of the theory), but it's important to know which approach you're aiming for, and trying to optimize based on that goal. Going to the Dragon book as a "how-to" is a disaster waiting to happen.

pjscott · on Aug 29, 2010

I think that combining this short compiler tutorial with LLVM would be a very effective way for someone to get comfortable with basic compiler writing:

http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf

Afterward, learning how to do lexing and parsing would be a good addition. The tutorial covers a subset of scheme, which is trivially simple to parse, especially if you're writing the compiler in scheme as well.

wingo · on Aug 29, 2010

The top comment has a remarkably good list of references. I like the nanopass and incremental papers a lot, RTT is a classic, and there were several good-looking references I had never seen. Almost makes me feel some jealousy at the pleasant vertigo the original poster must feel :)