I last played around with ANTLR in 2012 (when it was version 3) and I discovered that there's a "bigger picture" to the parser generator universe that most tutorials don't talk about:
1) ANTLR is a good tool for generating "happy path" parsers. With a grammar specification, it easily generates a parser that accepts or rejects a piece of source code. However, it's not easy to use its hooks to generate high-quality diagnostic error messages.
2) ANTLR was not good for speculative parsing or probabilistic parsing, which is the basis of today's generation of tools such as "IntelliSense" that don't give up on parsing when there's an unclosed brace or a missing variable declaration.
The common theme to the two bullet points above is that a high-quality compiler written by hand holds multiple "states" of information, and an ANTLR grammar file doesn't really have an obvious way to express that knowledge. A pathological example is the numerous "broken HTML" pages that browsers parse successfully. It would be very hard to replicate how Chrome/Firefox/Safari/IE don't choke on broken HTML by using ANTLR to generate an HTML parser.
In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
Lastly, lexing (creating the tokens) and parsing (creating the AST) are a very tiny percentage of the total development of a quality compiler. Therefore, ANTLR doesn't save as much time as one might think.
I welcome any comments about v4 that make those findings obsolete.
The vast majority of the time it is totally okay to quit on a single syntax error. Not everybody needs "speculative parsing" -- most of us do not -- we are not all designing HTML parsers and IDE assists. ANTLR / Bison are fantastic for parsing; grammars and parsers written in them are a million times more maintainable than a 1000-line hand-rolled parser in C that's ten years old and has been touched by 20 hands. You really can't outperform a parser generator unless you live in an ivory tower where no one will ever modify your code and you have unlimited time to reinvent every wheel that Bison / ANTLR have built in.
> In short, ANTLR is great for prototyping a parser but any industrial-grade parser released into the wild with programmers' expectations of helpful error messages would require a hand-written parser.
What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
> What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
Which ones? Awk is one; in fact Awk was almost co-developed with yacc. But it gives generally bad error messages. (There are multiple implementations, but most of them use a yacc-style LR(1) grammar and give bad error messages.)
I don't know of any others. Certainly ANTLR generates bad C/C++ code, so I would be very surprised if it's used in anything in a typical Unix/Linux terminal.
I have been blogging about Unix and parsing here, after writing a very complete bash parser by hand: http://www.oilshell.org/blog/
My conclusion is also that generic parser generators are not good enough for production quality parsers. This is borne out by evidence in the wild.
In other words, "real" languages don't use parser generators. Look at the top 10 languages, as well as emerging languages. Which one of them use parser generators?
Clang, GCC, v8, PHP, C# / Roslyn, Go, Perl, TypeScript, Dart, etc. all use hand-written parsers. Not sure about Rust and Swift, but I think they are hand-written. Java is an exception, but interestingly it has a "real" grammar, and then an LALR(1) grammar optimized for parser generators:
Python uses ITS OWN parser generator, not a generic one, which is a very important distinction.
I'm not sure about Ruby; I think it might be a yacc core with A LOT of ad-hoc parsing, so it may not count. Just like bash uses yacc for about 1/4 of the language, and ad-hoc hand-written parsers for the other 3/4.
Anyway, I don't think parser generators are as widely used as you think, but I would be happy to be corrected.
Howdy! It is definitely the case that most production languages use hand-built parsers. In talking with these compiler developers, I've found they are obsessed with control (speed, error reporting, ...) and want very specific data structures built during the parse. They also tend to use existing lexer infrastructure that is baked into their development environments. I commented more on this topic here: http://stackoverflow.com/questions/16503217/antlr-for-commer...
The thing to remember is that the vast majority of parsers out there are not for these production language compilers. Compare the number of people you know that have built parsers for a DSL, data, documents, or whatever to the number of people you know that have built compilers. ANTLR's niche is your everyday parsing needs. It generates fast ALL(*) parsers in a multitude of languages and accepts all grammars without complaint, with the minor constraint that it cannot handle indirect left recursion. (Direct left recursion, as in expression rules, is totally okay.) For a speed shootout with other tools, see the OOPSLA paper: http://www.antlr.org/papers/allstar-techreport.pdf
I've heard many compilers started with yacc/bison, but later went hand-written for better error messages.
Currently, jq (pretty much awk for JSON: https://stedolan.github.io/jq/) uses bison, but sometimes has unhelpful syntax error messages. I'm not the maintainer or anything, but trying to trigger error messages more precisely (i.e. for diagnosis) was a nightmare.
While a generator can't match a custom parser for control, is it plausible to have arbitrarily precise hooks in ANTLR for diagnostic messages?
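For what it's worth, the hook ANTLR 4 does give you is the error listener: you can replace the default console listener with your own, which at least controls the wording and positions even if it's nowhere near Clang-level diagnostics. A rough sketch with the Python runtime, assuming a hypothetical Expr.g4 grammar with a start rule `expr`, generated with the Python3 target (so `ExprLexer`/`ExprParser` are made-up names):

```python
from antlr4 import InputStream, CommonTokenStream
from antlr4.error.ErrorListener import ErrorListener

# Hypothetical classes generated by: antlr4 -Dlanguage=Python3 Expr.g4
from ExprLexer import ExprLexer
from ExprParser import ExprParser

class DiagnosticListener(ErrorListener):
    """Collect syntax errors instead of letting ANTLR print them to stderr."""
    def __init__(self):
        self.errors = []

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        self.errors.append(f"{line}:{column}: syntax error: {msg}")

def parse(source):
    parser = ExprParser(CommonTokenStream(ExprLexer(InputStream(source))))
    listener = DiagnosticListener()
    parser.removeErrorListeners()        # drop the default ConsoleErrorListener
    parser.addErrorListener(listener)
    tree = parser.expr()                 # assumes the grammar's start rule is `expr`
    return tree, listener.errors
```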
> Compare the number of people you know that have built parsers for a DSL, data, documents, or whatever
The vast majority of those people that I know of have not even been aware of tools like ANTLR, and usually resort to regexp-based abominations instead unless they happen to know about bison/yacc (slightly more likely than ANTLR).
I agree those are cases where a tool like ANTLR might very well be preferable to someone hacking together a regexp-driven mess, but my personal experience is that the people most likely to know about these tools are also the people most likely to know how to handwrite reasonably clean parsers.
Some subset will certainly opt for the tools some of the time (I have done that myself, even though I much prefer handwritten parsers; I've also written parser generation tools myself, out of frustration with the existing ones... I'm still frustrated enough by my own tools to end up handwriting parsers... (EDIT: clarified; I could use a good parser to check my English sometimes)), but the parser-generation field is a peculiar one.
I have no doubt that sooner or later someone will crack this nut (maybe ANTLR will be that tool one day) and handwriting parsers will go the way of handwriting assembler and be something you do only in rare cases, but we're far from there still.
Note that I'm not saying ANTLR is a bad tool. I'm saying this is a space where a lot of the potential audience are a bunch of very difficult to convince and picky people (myself included...), and where a large other part of the audience doesn't know the field at all and are unaware of the tools.
Thanks for the response. Yes I think we are agreeing -- I have 2 of your books, and have read many of your ANTLR-related papers, and they were definitely the thing that taught me the most about top-down parsing. (I also used to share an office with Guido van Rossum and I remember he had your books too.)
I ported the POSIX shell grammar to both ANTLR v3 and v4 (which was basically changing yacc-style BNF to EBNF). But as mentioned, I discovered that the grammar only covers about 1/4 of the language. bash generates code with the same grammar using yacc, but fills in the rest with hand-written code. Every other shell I've encountered uses a hand-written parser. bash says they regret using yacc here:
I agree with you that there is a Pareto or long tail distribution in parser use cases. Most languages CAN use something like ANTLR or bison. But the parent was making a different claim:
> What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
I would say that is FALSE, because most parsers that your fingers pass through are HAND-WRITTEN, because of the Pareto distribution. 99% of anyone's usage is of probably a dozen or so parsers, and they are either hand-written or generated by custom code generators, not general-purpose code generators like ANTLR or yacc.
-----
As feedback from a user of parsing tools, you might also be interested in my article here:
Someone is asking if there are any parsing tools that generate a "lossless syntax tree".
Also, based on my experience with ANTLR v3 vs. v4, I asked the question: why use a concrete syntax tree at all? Nobody answered that question in the comments. I don't understand why it is a good representation, other than the fact that you might not want to clutter your grammar with semantic actions ("pure declarative syntax").
To me the parse tree / CST seems resource-heavy while containing unnecessary information, and it also lacks some crucial information, like where whitespace and comments are.
To summarize my article, I'm researching code representations in the wild for both style-preserving source translation (like go fix, lib2to3 in Python) and auto-formatting (like go fmt).
It's definitely possible I misunderstood something since my experience was relatively limited, but I have read a lot of the docs and bought the books.
My claim is that neither the parse tree/CST nor the AST is good for applications like translators. Instead I defined another term, the Lossless Syntax Tree, which is the data structure I want:
Examples: Clang, Microsoft's Roslyn platforms, RedBaron/lib2to3 for Python, scalameta, and Go. The defining property of the LST is that it can be round-tripped back to the source. This is called out in this C# design doc, along with some conventions for associating whitespace with syntax tree nodes:
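To make that concrete, here's a minimal sketch of what I mean by a lossless node (my own toy illustration, not the API of any of the projects above): every token carries the trivia around it, so concatenating the leaves reproduces the original source byte for byte.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Token:
    kind: str            # e.g. "IDENT", "LBRACE"
    text: str            # the exact source text of the token
    leading: str = ""    # whitespace/comments before the token
    trailing: str = ""   # whitespace/comments after it

    def to_source(self) -> str:
        return self.leading + self.text + self.trailing

@dataclass
class Node:
    kind: str                                    # e.g. "if_stmt"
    children: List[Union["Node", Token]] = field(default_factory=list)

    def to_source(self) -> str:
        # Round-trip property: the tree regenerates the input exactly.
        return "".join(child.to_source() for child in self.children)
```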
The "canonical" documentation of the Ruby parser is basically a 6000+ line Bison file where most of the linecount is taken up with handling exceptions... It's awful.
That sounds very much like bash, where parse.y is 6227 lines. Almost all of it is C code; it's not actually using very much of yacc...
That'd be me... Though with the caveat that my parser (part of this project [1]) still has fairly substantial gaps, so it will grow, but certainly not to the size of MRI's parse.y.
It's actually worse these days: it turns out parse.y for Ruby is 11,348 lines [2], though in fairness it's worth noting that it's written in a quite "linefeed heavy" style, and this includes providing a Ruby API to the parser.
Yikes! 11K lines is a lot. I guess my conclusion is the same as last time: human brains are good at parsing very elaborate languages. People like all that subtle "expressiveness".
I think my bash parser is pretty complete, and it's about 4K lines in Python now, including the lexer. Changing it to the lossless syntax tree vs AST blew it up by a few hundred lines.
However, it does more than the 6K lines in bash's parse.y, because it parses in a single pass. Bash does more parsing in the 9700 line file subst.c, so it's hard to count.
But yeah I'm about to write my Oil parser now, as opposed to the OSH parser... I really want to avoid writing another 4K lines of code!!! That might be the most practical way though. :-(
> What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).
Having worked in the industry for a while, I can say there are a lot of handwritten parsers out there, especially when you need to build an IDE with code completion. You are right that for small tool-based languages that don't need those things, a grammar-generated parser is completely sufficient, but if you go all in on a more general programming language, you are likely to wind up with a hand-written parser eventually.
Your hostile tone means you misunderstood the intent of my message. I wasn't dismissing the usefulness of ANTLR. I was adding extra information on where ANTLR fits in the bigger universe of parsing. This is extra information that ANTLR tutorials never seem to mention.
For example, here's a blurb from ANTLR creator Terence Parr[1]: "Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, _a PostScript/PDF file_, or a simple spreadsheet macro language"
No... ANTLR is the wrong tool for writing a parser for PDF files. Parsing real-world PDF files is very similar to parsing wild HTML. PDF files are often broken and noncompliant. Is that the fault of ANTLR?!? Not really, but consumers don't care.
With all the dozens of PDF helper libraries out there, this is probably the #1 bug report transcript:
- programmer/customer: "Your PDF library can't open RealWorldBroken.pdf -- please fix it"
- vendor customer support: "We looked at the pdf file you're trying to open and it's invalid"
- programmer/customer: "Yes but Adobe Acrobat Reader opens it just fine."
- <Vendor of library then adds code workaround to open that invalid file and make customer happy and thereby reinvents Adobe Acrobat's permissive parser -- one bug report at a time>
That last step inside the <angle brackets> is the kind of "industrial grade" parsing that ANTLR grammar files are not good at expressing. Mentioning PDF files misleads people about what ANTLR is really good for. ANTLR is a better lex/yacc. It's not better than a hand-written parser for things like PDF files!
If you have a well-defined parsing problem in the classic "Dragon" book[2] sense and want to skip the tedium of hand writing a parser... yes ANTLR is wonderful for that.
> You really can't outperform a parser generator unless you live in an ivory tower
You can easily outperform ANTLR if your parsing domain is "messy" and programmers (the users of compilers) are looking for more than a binary accept/reject of a source file with "invalid token" on line 5 of a 1000-line program. They often want insightful error messages from a compiler that "guesses" the programmer's intentions. For example, the C++ parsing error messages in Clang that are superior to GCC's cannot be expressed in an ANTLR grammar file.
I guess ANTLR does have some support for parsing outside of the happy path: as far as I know it's also used as the parsing component of the Xtext framework, which allows you to create custom languages and will also create Eclipse plugins for editing them. These plugins also need to be error tolerant to provide IntelliSense-style support during editing, and at least from a user's point of view that seems to work.
> 2) ANTLR was not good for speculative parsing or probabilistic parsing, which is the basis of today's generation of tools such as "IntelliSense" that don't give up on parsing when there's an unclosed brace or a missing variable declaration.
ANTLR does have some generic error recovery capabilities such as single token insertion/deletion or re-sync recovery.
These kinds of generic heuristics are not something that would be simple to implement in most hand-built parsers, as they require full knowledge of the entire grammar, and that information is often "hard coded" as source code rather than kept as a separate data structure.
ANTLR has one of the best error handling strategies of the tools I've used; what do you use that is better in this regard (maybe I need to expand my horizon a bit)?
Most people use hand-written parsers for this, for lack of tools that are good enough. Part of the reason for the proliferation of parser generation tools is that people tend to get disillusioned by the available tools in this space very quickly.
I tried some of the earlier versions of the JS target (v4). Like you say: the happy path works well. What I disliked is that, in unpredictable cases, there were nulls in the generated output. This made me feel doubtful about using it, so I wrote my own parser by hand. It turned out to be a good decision: it was fun, and I learned a lot.
I switched over to ANTLR 4. It is strictly superior to ANTLR 3. The listener approach, rather than embedding code in the grammar, is very natural. Separating the two leads to clean grammars and clean action code. The odd thing is that I was stuck on 3 because 4 didn't support C yet, and then I just switched to the Java target in an anti-C pique. Shoulda done that a while ago.
TParr's The Definitive ANTLR 4 Reference is quite good. And so's this mega tutorial.
I think everyone should manually implement a simple recursive descent parser at least once in their careers. It's surprisingly easy, and really (in my experience) helps to break through the mental barrier of parsers being magical black boxes.
Plus, once you have an understanding of recursive descent parsing, it's a relatively small leap to recursive descent code generation. And once you're there, you have a pretty good high-level understanding of the entire compilation pipeline (minus optimization).
Then all of a sudden, compilers are a whole lot less impenetrable.
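To give a feel for how small the core idea is, here's a minimal recursive descent parser for arithmetic expressions (one function per grammar rule; the tokenizer and the tuple-based AST are just placeholders for illustration):

```python
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    """expr   -> term (('+'|'-') term)*
       term   -> factor (('*'|'/') factor)*
       factor -> NUMBER | '(' expr ')'"""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 - 1)")).expr())
# ('+', 2, ('*', 3, ('-', 4, 1)))
```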
On that note, Parr also has a nice little book called "Language Implementation Patterns" [0] which introduces things on a level of "this kind of language motif maps to this kind of parsing code".
Classic regular expressions are level 3 in the Chomsky classification, but most regex implementations have features that aren't a part of "regular languages".
The upside is that they are more expressive, but the downside is that it's no longer O(n) to parse a string.
The worst case for some regexes can be exponential; never run untrusted regexes on your server unless you use a safe implementation like RE2, which was developed for Google Code Search.
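A small self-contained illustration of the blow-up (the classic nested-quantifier example, not from any particular tutorial); exact timings depend on your machine, but each extra 'a' roughly doubles the work for a backtracking engine, while an automaton-based engine like RE2 stays linear:

```python
import re
import time

pattern = re.compile(r"^(a+)+$")   # nested quantifiers: the classic pathological case

for n in (20, 22, 24):
    text = "a" * n + "b"           # almost matches, forcing exhaustive backtracking
    start = time.perf_counter()
    pattern.match(text)            # returns None, but only after exponential work
    print(n, f"{time.perf_counter() - start:.2f}s")
```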
What is your answer to the first question? In Kleene's original paper (which I've only skimmed), he is interested in "events" for nerve computation models, and defines "regular events" to be some minimal set of events which is closed under concatenation, alternation, and star. He admits "regular" should be replaced by a better word --- this never happened. http://cs.stackexchange.com/questions/1771/why-is-a-regular-...
If I were to make up history, I would say "oh, regular refers to finite state machines, so a regular language is a language recognizable by a finite state machine," but it appears my made-up history does not exactly match reality. This is not too far off, though, since in the paper he shows that regular events are exactly the kinds of events his nerve networks recognize, and there appears to be something about finite automata in the last section.
They could have been called "rational languages" instead, because of a connection with rational functions. In fact, there are manipulations of rational series in Kleene's paper to prove some things about the languages.
(I don't get the joke myself --- unless the joke is that you can't take the "regular" out of "regular language.")
I share your understanding, and yeah, the joke is probably that the word "regular" sounds "easy" or "normal" (shit, another loaded term). I was just curious if I was missing more.
You can use a generated parser for C/C++, I believe. You just need to insert a little shim between the lexer and parser to look up identifiers in the symbol table and mark which name tokens refer to types versus functions or variables.
You are (essentially) correct. The language itself is context-free, and therefore parse-able as a regular language. Looking up identifiers in symbol tables is part of the type-checking phase of compilation. This is a pretty common misconception, that type constraints affect the parsing complexity of a language. Language type only relates to the process of building an abstract syntax tree from an input, not validating it.
You mean "context-free" not "regular" here, but either way, I believe that C is actually context-sensitive:
(a)(b);
Is this:
1. Evaluating "b" and casting it to type "a"?
2. Evaluating the parenthesized expression "a" which yields a function, and then calling it, passing in "b"?
The only way to distinguish the two is by knowing whether "a" is a type or not.
You could argue that as long as the parser produces a suitably vague AST, it doesn't need to know the distinction and can pass it on to later phases. I think that might work because I don't know of cases where the syntax actually diverges based on the type of a symbol.
But in practice, I think most parsers want to produce a more precise AST that distinguishes cast expressions from function calls.
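For illustration, here's roughly what that shim (the old "lexer hack") looks like; the token names and typedef table are made up, but the idea is just that the parser registers typedef names and the shim reclassifies identifier tokens on the way through, so `(a)(b);` arrives at the grammar as either a cast or a call:

```python
# Hypothetical token kinds; a real C frontend would have many more.
TYPEDEF_NAMES = {"a"}   # filled in by the parser when it sees `typedef ... a;`

def classify(tokens, typedef_names):
    """Shim between lexer and parser: turn IDENT into TYPE_NAME when the
    identifier is a known typedef, so the grammar itself can stay context-free."""
    for kind, text in tokens:
        if kind == "IDENT" and text in typedef_names:
            yield ("TYPE_NAME", text)
        else:
            yield (kind, text)

raw = [("LPAREN", "("), ("IDENT", "a"), ("RPAREN", ")"),
       ("LPAREN", "("), ("IDENT", "b"), ("RPAREN", ")"), ("SEMI", ";")]

print(list(classify(raw, TYPEDEF_NAMES)))
# With 'a' registered as a typedef the parser sees TYPE_NAME and parses a cast;
# without it, 'a' stays IDENT and the same tokens parse as a function call.
```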
You are right, the regular/context-free comment was a typo.
You are also correct about C being non-context-free [0]. I have just seen so many explanations of the language hierarchy online mistaking type-checking for language complexity that it has become a pet peeve of mine.
I used ANTLR to write a fuzz testing tool which parses an ABNF grammar (like in an IETF RFC) and then generates random output which matches the grammar. Worked great!
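The generation half of that is surprisingly little code. A toy sketch (not the actual tool; a hand-written dict stands in for the parsed ABNF rules): pick a random alternative for each rule and recurse until you hit literals.

```python
import random

# Toy grammar standing in for a parsed ABNF: each rule maps to a list of
# alternatives, and each alternative is a sequence of rule names or literals.
GRAMMAR = {
    "greeting": [["hello", "subject"], ["hi", "subject", "!"]],
    "subject":  [["world"], ["there"], ["friend"]],
}

def generate(symbol):
    if symbol not in GRAMMAR:              # terminal: emit the literal itself
        return symbol
    alternative = random.choice(GRAMMAR[symbol])
    return " ".join(generate(part) for part in alternative)

print(generate("greeting"))   # e.g. "hi friend !"
```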
For .Net projects, I've used Irony. From the CodePlex site:
"Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. "
I used Irony in a very successful project too. It worked beautifully and got its job done. The downside is that there is less documentation available than for ANTLR.
I used ANTLR v3 to create an NLP parser for calendar events for iOS and Android. It took longer than expected. The iOS + C runtime was opaque, so I had to write a tool for debugging. The Android + Java runtime overran memory, so I had to break the grammar into separate grammars. Of course, NLP is not a natural fit. I don't know what problems are fixed by v4.
> The most obvious is the lack of recursion: you can’t find a (regular) expression inside another one ...
PCRE has some recursion. Here is an example for parsing anything between { }, with counting of inner brackets:
'(?>\{(?:[^{}]|(?R))*\})|\w+'
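If you want to play with that from Python: the stdlib `re` module doesn't support `(?R)`, but as far as I know the third-party `regex` package does. A quick sketch under that assumption:

```python
import regex  # third-party package (`pip install regex`); stdlib `re` lacks (?R)

# Matches either a balanced {...} group (recursively) or a word.
pattern = regex.compile(r'(?>\{(?:[^{}]|(?R))*\})|\w+')

print(pattern.findall("{outer {nested} text} plus words"))
# ['{outer {nested} text}', 'plus', 'words']
```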
A C++11 constexpr can make hand-coded parsers a lot more readable by allowing token names in case statements. For example, search for "str2int" in the following island parser: https://github.com/musesum/par
These types of tutorials always start out explaining the problems with regular expressions and why not to use them... then immediately proceed to lexing via regular expressions.
Perhaps the tutorials should start with the strengths of regular expressions, and how we can harness that for getting started with a lexer.
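For instance, the token level is exactly where regular expressions shine: individual tokens are regular even when the language as a whole isn't. A sketch of the usual named-group trick for a tiny lexer:

```python
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),          # anything else is a lexing error
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r} at {m.start()}")
        yield kind, m.group()

print(list(lex("x = 40 + 2")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```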
This tutorial looks great. I picked up Antlr4 a few months ago, and hadn't done any parsing before then. The first week was basically me, The Definitive Antlr4 Reference, and extreme confusion with how different targets worked. Compounding the problem was the fact that a lot of the antlr4 example grammars only work for a specific target. The use of different language implementations as part of this tutorial seems really useful!
I used ANTLR to write a Python parser for the SNOMED expression language last year, and testing it was one of the weirder parts of the experience. I was up and running in a few days, which was largely thanks to the ANTLR book. I love this project. It made doing what I did a lot more fun than I thought it would be. Hand-rolling an ABNF parser from scratch would be a nice hobby project, but not when one has a deadline.