Parser generators vs. handwritten parsers: surveying major languages in 2021 (eatonphil.com)
190 points by eatonphil on Aug 21, 2021 | 145 comments



For serious projects, I find myself typically resorting to making a hand-written parser (usually recursive descent) because it’s the only way I can really get excellent and contextual error reporting and recovery. Parser generators are computer science marvels, but for the most part they’re too hard to customize well enough to produce a sensible error for the user, with line numbers, token stream provenance, and suggested recovery information. I can’t imagine the difficulties of trying to get good messages for something as large and pervasive as, say, C.

I think that with enough work, we can adapt and evolve parser generators to mesh well with better error reporting, and give the programmer more options and control, but it’ll take some elbow grease and probably breaking backwards compatibility with the production syntax of these parser generator systems. There’s also a risk, of course, that if there’s too much customization, you sort of lose the value of having a parser generator DSL in the first place.


One of my favorite PLT-related books is Programming Language Pragmatics, which spends a non-trivial section near the beginning on just this issue. Among other things, it's very useful to continue a best-effort parse on the remainder after a parse error so you can show more syntax errors rather than merely the first. But this is something of a nebulous art to accomplish.

I agree that I think generators could address this issue, they simply haven't.

Also I'm particularly fond of PEGs, because they match the intuitiveness of recursive descent parsers with a generator-suitable formalism (though they have their rough edges as well).


In my programming language MethodScript (which has a handwritten parser), I fail immediately during lexing (because an error there is a really bad sign), but during compilation I only fail immediately on things that are likely to cause a cascade of errors, such as a missing bracket or something.

But yeah, it makes the compiler output way easier to work through when you have everything at once.


I'm always shocked to find major projects that use parser generators. A hand-written recursive descent parser is easier to write, read, understand, and modify. It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.

So looking at the list I'm dumbfounded that CPython and Ruby use generators.


You won't find me defending the use of yacc, but in general hand-written parsers are far more error-prone.


My strategy when faced with doing a DSL of some kind is to write an accepter in bison or similar and knock the bugs out of the grammar and basic readability issues. Then go recursive descent with a table-driven precedence-climbing parser for expressions when ready for the real deal.


This is, I think, a very good strategy for many reasons. Not only is it a cheap way to work out the design of the language and a way to verify the correctness of your implementation. It's also a good fallback solution: if your (more expensive, expressive) implementation is not ready in time, you can fall back to integrating the generated implementation to meet an important deadline or whatnot.


It's harder to ensure that the grammar doesn't have unintended ambiguities when not using a parser generator. That doesn't matter as much when "just" developing another implementation of an existing, fully specced language, but for cases like the examples you cite such a spec doesn't exist.

That issue is the primary reason why postgres continues to use bison. The SQL standard frequently introduces potential parsing ambiguities and leaves a large portion of the things necessary for a functioning RDBMS unspecced.


> It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.

It's also potentially quite slow.


In my experience (with both hand-written parsers and generators, e.g. ANTLR), that's not the case.

Hand-written recursive descent parsers are simple to write, debug, maintain, and extend, while with generators I always had to fight the tool at some point, and that always ended in frustration.


We've just had different experiences then. It's funny you mention ANTLR because it was in particular very slow last time I tried it (just a couple years ago).

As for fighting the tool and simplicity, sure, that's beside my point.


Some grammars can't be parsed with recursive descent. At least, if I remember my parsing class from decades ago, they can't.


There are techniques that help parsers parse a far larger class of grammars than their underlying model permits. Some of these are:

- turning it into a state machine (required to deal with nested comments and C++ template syntax)

- building the symbol table while parsing and querying for disambiguation

- treating expressions as a stream of atoms and using a specialised precedence parser (probably required to deal with operator overloading)

Parser generators usually employ these as well.
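
To make the second technique concrete, here is a toy Python sketch of the classic "lexer hack" used by C parsers: the parser records typedef names in a shared symbol table as it sees them, and the lexer consults that table to decide whether an identifier should be emitted as a type name or a plain identifier. The token set and all names below are made up for illustration, not taken from any real compiler.

    import re

    # One alternation per token class; just enough of C to show the idea.
    TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<punct>[*;()]))")

    class Lexer:
        def __init__(self, src, typedef_names):
            self.src = src
            self.pos = 0
            self.typedef_names = typedef_names  # shared, mutable symbol table

        def next(self):
            m = TOKEN_RE.match(self.src, self.pos)
            if not m:
                return ("EOF", "")
            self.pos = m.end()
            if m.group("ident"):
                name = m.group("ident")
                # The context-sensitive part: the token kind depends on what
                # the parser has declared so far.
                kind = "TYPE_NAME" if name in self.typedef_names else "IDENT"
                return (kind, name)
            return ("PUNCT", m.group("punct"))

    typedefs = {"size_t"}   # the parser adds names here when it sees `typedef ...;`
    lex = Lexer("size_t * n ; a * b ;", typedefs)
    tok = lex.next()
    while tok[0] != "EOF":
        print(tok)          # ('TYPE_NAME', 'size_t'), ('PUNCT', '*'), ('IDENT', 'n'), ...
        tok = lex.next()

With that distinction in the token stream, the parser can tell `size_t * n ;` (a declaration) from `a * b ;` (an expression) with ordinary lookahead.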


Not true in practice. C++ is one of the hardest practical languages to parse. And it can be parsed using a recursive descent scannerless parser.


I'd really like to know why people choose one parsing algorithm over another.

I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced.

"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."

However, I seldom see languages use Earley parsers. There must be a reason, but I've never seen anybody explain why they choose one algorithm over another.


Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want to have your grammar be ambiguous. When Earley gives you a bunch of alternative parse trees, how do you disambiguate?


If you are parsing incorrect programs, you want resilience in the parser, including considering alternate parsings and generation of parses of sections of the file.

Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems.


It is quite possible to redefine much of the syntax with the cpp. Parsing without the cpp is hopeless.

My favourite is Bourne's longing for Algol: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd...


Of course parsing without cpp is, in general, hopeless. But for something like a structure-aware editor you don't need to parse it perfectly, just "good enough". Ditto for a source-to-source transformation system.


If your grammar is ambiguous, then an ambiguous program isn't incorrect.


It's not that the grammar is ambiguous, it's that the thing you're parsing may not even be in the language generated by the grammar. But you want to parse it as well as possible, and there may be more than one way to do that.


wow, absolutely not. if i'm parsing a programming language I want to basically stop and bail from the current context as soon as possible. there is basically nothing to gain by continuing.

if the shit i'm parsing isn't in the language there is no reason to continue. it's invalid.


If you are writing a code editor you might want to continue parsing it. Or maybe you detected an error, but you want to report more than one error at a time for a better programmer experience. There are lots of use cases for partially parsing a computer program.


> there is basically nothing to gain by continuing.

You gain error messages for subsequent errors, so you can fix them all before recompiling.


I had already explained why one might want to do this:

"Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems."

One cannot in general parse un-cpped C/C++. But that doesn't mean it's useless to do as well as one can, even if it cannot be parsed exactly. Parsing is for more than just compilers.


in my implementation, I give each rule a priority. I always choose the one with the highest priority to expand. I basically followed this https://loup-vaillant.fr/tutorials/earley-parsing/parser


I've written parsers for context-sensitive grammars, and as it turns out, this is not a desirable feature of a language. The reason you typically see simple parsers is because simple grammars are usually unambiguous and context-free, which is more convenient for humans to understand and internalize.

After all, when you're writing software, you pretend to be the compiler to some extent.

Natural languages, like English, are a good example of something which humans struggle with because they are complex, ambiguous, and often require context. Sure, it's extremely expressive, but that is the sharpest double-edged sword in programming languages.


Neat, though this just seems like PEG with extra steps and more pain. Still this answers my question. Thanks!


Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

As the above commenter mentioned, most language designers make hand-rolled parsers. Making a hand-rolled LL or LR parser is easier too.

In general most people think advanced = bad, they want the easiest solution which gets the job done well.


This is probably known somewhere, but something I noticed is that a way to think about Earley parsing is that, when you create an LR(1) shift-reduce parser, rather than having it be an error in the grammar for there to be shift-reduce or reduce-reduce conflicts, instead you nondeterministically reduce (i.e., your state machine has a set of states rather than requiring there to be a single state). This also ends up solving the left recursion problem in an efficient way.

This is certainly slower than a deterministic LR(1) parser, but maybe if you're clever about the implementation you can efficiently handle local ambiguities in a grammar. Large-scale ambiguity is bad since you don't want small changes to have action-at-a-distance, but you could in principle have a strict LR(1) grammar that contains Earley sub-grammars for expressions.


> LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

Earley's algorithm should be trivially parallelizable, which might make it a contender for languages which are ambiguous (like most real-world languages) where people are hand-rolling parsers anyway. I haven't tried since I have no real need, but looking at the steps I can't see a reason it couldn't do its work across multiple threads.

Honestly, other than JavaScript 1.1 I can't think of a popular language which has an unambiguous syntax and I really like playing with language grammars for some odd reason -- probably wrong though...


From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, and wasn't a good fit for 1960s hardware, and had poor performance for some types of grammars.

By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."

It takes Aycock and Horspool's work in 2002 and Kegler's work in 2010 in Marpa to have a "practical implementation" (quoting that link).

(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)


> I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.

That one is really the only Earley parser I've found used in the wild (don't know what Marpa is used for), and unfortunately it is mostly unhackable because they did some serious optimization voodoo on it, so it was replaced by a hand-written recursive descent parser a while back because nobody in the world could figure out how it works[0] -- which is kind of strange since ASDL is super simple to parse and the generator which used SPARK was meant to check files into source control, but whatever.

It's easy to play around with but not a great source if you want to see how an Earley parser is put together. There are also some bugs with parser actions on duplicate rules not working properly that were pretty easy to fix, but Python pulled it out of the source tree, so there's no upstream to send patches to?

[0] might be making that part up, dunno?


You are one of the "couple of people" I was referring to. :)

I know SPARK's docstring use influenced PLY.

PLY doesn't use Earley, but "Earley" does come up in the show notes of an interview with Beazley, PLY's author, at https://www.pythonpodcast.com/episode-95-parsing-and-parsers... . No transcript, and I'm not going to listen to it just to figure out the context.

https://github.com/lark-parser/lark "implements both Earley(SPPF) and LALR(1)".

Kegler, the author of that timeline I linked to, is the author of Marpa. Home page is http://savage.net.au/Marpa.html . The most recent HN comments about it are from a year ago, at https://news.ycombinator.com/item?id=24321395 .


Lark [1] is a battle-tested pure-Python Earley parser. My company uses it in production. It is by far the easiest parser generator I've ever used.

1. https://github.com/lark-parser/lark


Lark is amazing... But it's also one of the best LR parsers out there and I would guess that mode is used a lot more than the Earley mode.

Either way, I have never used a better parser generator. It has the best usability and incredible performance when you consider it is written in pure Python.


nearley.js is an Earley parser that sees at least some use. https://nearley.js.org/


Without certain optimizations it will be too slow. You really need to optimize right recursion to not be quadratic, for example.


jq uses bison, and it takes a startling amount of work to diagnose parsing problems so as to give helpful error messages. Last time I checked, the issues were full of new users stopped by something an appropriate error message would resolve.


I took the compilers class at Stanford and never really understood the algorithms of bottom up parsing, or even really how grammars worked. I just made the tool work.

I then went to work at a private company, and an older guy who had gone to a state school that taught recursive descent (his was the last class to teach it) taught me how to do it. In a month or so I had learned more about how grammars actually work, what ambiguity is, and so forth, than in my whole class at Stanford.

I now teach compilers at a university, and I teach recursive descent.


I recommend you also teach "precedence climbing", which is basically a slight refactoring of RD to parameterise the recursive functions by precedence level, so that grammars with many levels don't require proportionally deep call stacks; it also massively reduces the duplication of code at each level:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing
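
For reference, here's a toy Python sketch of precedence climbing over an already-lexed token list (the token format and precedence table are illustrative, not from any real parser). A single parse_expr(min_prec) replaces the one-function-per-precedence-level cascade of a textbook recursive descent expression grammar:

    # Precedence climbing over a flat token list like ["1", "+", "2", "*", "3"].
    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_expr(tokens, pos=0, min_prec=1):
        lhs, pos = parse_atom(tokens, pos)
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            # Left associative: the right operand may only use tighter operators.
            rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    def parse_atom(tokens, pos):
        if tokens[pos] == "(":
            expr, pos = parse_expr(tokens, pos + 1)
            assert tokens[pos] == ")", "expected closing paren"
            return expr, pos + 1
        return ("num", int(tokens[pos])), pos + 1

    tree, _ = parse_expr(["1", "+", "2", "*", "3"])
    print(tree)   # ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))

Adding an operator or changing its precedence is then a table edit rather than a new function and a new grammar level.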


I still recommend spending the time to (re-)learn bottom-up parsing though, even if you never use it again. Specifically, LR and then GLR (no need to bother with LALR/SLR/all those variants). Maybe even recursive ascent too, or recursive ascent-descent if you find those interesting. It's painful and confusing and takes a long time to pick them up, but it's quite gratifying once you do learn it. The algorithms are super clever and cool, and help you gain a lot of appreciation and intuition for what is/isn't possible to do (whether efficiently or inefficiently).

Using recursive descent for everything is like using Python (or Go or pick your favorite easy language to learn) for everything. Yeah you can totally do it, and a heck of a lot of people do precisely because of how nice and convenient it is, but it's good to have a sense what else is out there and how you can think about a problem differently IMO, even if you rarely use other languages in practice.


i agree, but for an undergraduate understanding of parsing and grammars, recursive descent is perfect

i think the more complex parsing algorithms were good for producing papers, which explains their success in academia but relative lack of success in industry


Recursive parsing is a completely valid concept, but it's unnatural at first to hand-write parsers with it. I have tried it - even wrote a prototype DSL in Haskell (the CPS monad is very useful for that), but what I end up doing is using the DSL to write down the LR state machine, bottom up. This is eerily similar to embedded and GUI programming, where internal state matters and the program reacts to input. It might even be a better way to incorporate ad-hoc code that has side effects. However, most people think about programming languages top-down and use grammars to describe them.


as an aside, the parser for https://hyperscript.org is recursive descent, and our motto is: "remember what our parser professors taught us, then do the opposite of that."

There's some ugly stuff in there, but it's fun, all because we wrote it in recursive descent and can do whatever we darn well please.


I think the best way to understand it is that you're creating an AST (a tree describing the program), for a bottom up parser you are recognising and making little treelets and stitching them together into larger treelets until you reach the top/root, while for recursive descent you're building them from the top down from the root.

The main difference is that LR (bottom-up) family grammars can describe more languages than LL (top-down) ones can, largely because bottom-up has slightly more contextual information available when it puts stuff together (often small treelets rather than just the next token).


I took a compilers class at an ex-Polytechnic in the UK that taught recursive descent - I now wonder how old the textbook they were using was!


Yep that’s the way to do it. Teaching LL/LR and its variations is a complete waste of time.


I hope you are not making the same mistakes as your teacher at Stanford did (and leave students clueless).


i hope so too


Any resources you recommend for someone who is interested in learning about recursive descent?


Yes, the second part of Crafting Interpreters has a great introduction to the topic and is available online for free:

https://craftinginterpreters.com/contents.html


Thanks!


Get a lexer (like, just steal one if you don't think you'll be able to write one without getting bored, they're usually fairly self-contained) and just start writin' code. They're basically that simple.

The lovely thing about a nicely written recursive descent parser is that the code pretty much is the grammar (topologically), so any grammar you can actually imagine that can be parsed by recursive descent doesn't take much thought to make work.


IMVLE, a lexer is just a big regular expression, with a branch at the top level for each type of token. Are there cases where this doesn't work?
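
Concretely, something like this is all I usually mean by it (a Python sketch; the token set is illustrative, not any particular language's):

    import re

    # One big regular expression: a named branch per token type.
    TOKEN_RE = re.compile(r"""
        (?P<NUMBER> \d+)
      | (?P<IDENT>  [A-Za-z_]\w*)
      | (?P<OP>     [+\-*/()=])
      | (?P<SKIP>   [ \t]+)
      | (?P<ERROR>  .)
    """, re.VERBOSE)

    def tokenize(src):
        for m in TOKEN_RE.finditer(src):
            kind = m.lastgroup
            if kind == "SKIP":
                continue
            if kind == "ERROR":
                raise SyntaxError(f"unexpected character {m.group()!r}")
            yield kind, m.group()

    print(list(tokenize("x = 40 + 2")))
    # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]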


Python and Haskell are interesting because the indentation is crucial to figure out the correct nesting. Haskell allows you to use curly braces and semicolons to completely remove indentation, so the lexer only has to add virtual tokens into the token stream. For Python, I don't know how it deals with it.

C/C++ have plenty of difficult syntax edge cases. For example, >> can represent a single operator in an expression or two successive > in a template declaration.


Python has some preprocessing to inject INDENT and DEDENT pseudo-tokens into the lexing stream:

https://docs.python.org/3/reference/lexical_analysis.html#in...
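
Roughly, the tokenizer keeps a stack of indentation levels and emits INDENT/DEDENT pseudo-tokens as the level changes. A simplified Python sketch of that scheme (the real tokenizer also handles tabs, continuation lines, and various error cases):

    def with_indent_tokens(lines):
        stack = [0]                        # indentation levels currently open
        for line in lines:
            if not line.strip():
                continue                   # blank lines produce no INDENT/DEDENT
            indent = len(line) - len(line.lstrip(" "))
            if indent > stack[-1]:
                stack.append(indent)
                yield ("INDENT", "")
            while indent < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            yield ("LINE", line.strip())
        while len(stack) > 1:              # close any still-open blocks at EOF
            stack.pop()
            yield ("DEDENT", "")

    src = ["if x:", "    y = 1", "    if y:", "        z = 2", "done()"]
    for tok in with_indent_tokens(src):
        print(tok)

After this pass the parser never has to think about whitespace at all; a block is just whatever sits between an INDENT and its matching DEDENT.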


Some things require (or are more easily implemented) using state-based lexing, such as identifying escaped characters in JSON strings or XML attributes, or documentation comments. These can be thought of as a Finite State Machine (for the state) driving the different sub-lexer regex token matching.
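
For example, a string literal with escape sequences is natural to lex with a tiny two-state machine rather than a single regex. A toy Python sketch (illustrative only; a real lexer would track source positions and support more escapes):

    def lex_string(src, pos):
        """Assumes src[pos] is the opening quote; returns (value, next_pos)."""
        assert src[pos] == '"'
        pos += 1
        out = []
        state = "NORMAL"
        while pos < len(src):
            ch = src[pos]
            pos += 1
            if state == "NORMAL":
                if ch == "\\":
                    state = "ESCAPE"           # next character is special
                elif ch == '"':
                    return "".join(out), pos   # closing quote ends the token
                else:
                    out.append(ch)
            else:                              # state == "ESCAPE"
                out.append({"n": "\n", "t": "\t"}.get(ch, ch))
                state = "NORMAL"
        raise SyntaxError("unterminated string literal")

    print(lex_string('"a\\nb" rest', 0))       # ('a\nb', 6)

The same pattern scales up: keep an explicit state (IN_STRING, IN_COMMENT, ...) and pick which sub-lexer or regex to run based on it.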


I know it's not Reddit but username checks out :)


Ultimately, the reason many languages use manually written parsers is that the code generated by parser generators just isn't worth the time saved by not writing a parser manually. The underlying assumption is that writing one by hand is somehow super hard, error prone, and time consuming. I think that assumption is simply wrong.

I had my compiler course at the University of Utrecht in the mid nineties. The local faculty there had some functional programming pioneers and did a lot of work on parser generators. So, I learned how to write a parser generator in Gofer (a Haskell predecessor). Very nice course and definitely a bit of a mind fuck since I also had to wrap my head around monads at the same time. And of course not really that useful for real world stuff, and I've since forgotten most of what I learned as I have not used it in over 25 years.

A few years ago, I had to sit down and write a parser for a little query language that we came up with. I spent some time looking at all sorts of convoluted tools for this and refreshing some decades-old knowledge on grammars to actually try to understand what the tools were trying to do. I then found a useful example of a lexer that simply used regular expressions and suddenly realized I was wasting time and that this stuff was stupidly simple. I then knocked this little project out in a few hours using that. Once you have a lexer, things get easy and they are not that hard to write. I think this is why people end up doing these things manually. It's just not that hard or worth automating and it's rare enough that it's not worth investing time in learning new tools.

Also, modern languages are a lot better at this stuff too. String handling in C was always a bit tedious. But modern languages come with so much built-in stuff for that that it's not really a burden. So, in short, parser generators solve a problem that's not really that big of a problem anymore.


> Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants.

They're far from the only way; handwritten parsers are often mentioned as having way better error recovery. But if you're making a new configuration language, programming language, or something like that, ensuring that it has an LR-compliant grammar and that this grammar is the source of truth can avoid a lot of pain later down the road.


The C++ grammar is not LALR(n) for any n. It requires unbounded lookahead. This means that it does not fit the yacc/bison model, and back when bison was used, this was dealt with by adding a third pass, called spew, between the lexer and the parser to look far enough ahead to tell types from variables from functions. It was a nasty hack which made it very difficult to generate good error messages. But the language is a much better fit for a good hand-written recursive descent parser.


I built an Algol68 compiler way back when - it was a mix - it used recursive descent to parse the bracketing structure and type definitions, and an LALR parser (homegrown generator) for the second pass. It fit well, provided you allowed the parser generator to resolve one of the shift/reduce conflicts on the fly at runtime (to handle the language's ability to change the priority of operators).


I think more programmers should learn how to write parsers by hand. You need them everywhere; for protocols, file formats, user input, little languages. They often use regular expressions or other crutches when a parser would be the better choice.


Sometimes when the problem space is simple enough you can get the best of both worlds by using regexes to perform your tokenization when lexing and then passing the tokens to a traditional stack-based parser. It’s possible for this approach to be bad performance-wise, but it depends on the use case.

The lexer generator that accompanies yacc, lex, is indeed regex-based, and as I also recently discovered, it was originally written by a little-known developer by the name of Eric Schmidt [1]

1: https://en.m.wikipedia.org/wiki/Lex_(software)


What would be good resources for this?


Start with a simple language like parsing JSON or S-expressions and do it in a language you already know. There are a ton of tutorials you can find online along these lines!
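
For instance, a complete S-expression reader fits in a handful of lines. A toy Python version (atoms are just ints and symbols):

    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            lst = []
            while tokens[0] != ")":
                lst.append(parse(tokens))
            tokens.pop(0)          # consume ")"
            return lst
        if tok == ")":
            raise SyntaxError("unexpected )")
        try:
            return int(tok)        # numbers become ints...
        except ValueError:
            return tok             # ...everything else stays a symbol

    print(parse(tokenize("(+ 1 (* 2 3))")))   # ['+', 1, ['*', 2, 3]]

Once that clicks, JSON is the same exercise with a slightly bigger tokenizer and a few more cases in the parse function.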


Course notes for University of Waterloo CS241 "Foundations of sequential programs"

Part of the course is understanding exactly when Regexps are good enough and when you need something more powerful.

http://anthony-zhang.me/University-Notes/CS241/CS241.html


As @rofrol already said, definitely Crafting Interpreters.

Just would like to add a link: https://craftinginterpreters.com/contents.html

It explains parsing and other topics in a really clear and accessible way.


Crafting interpreters?


Do you know Haskell? If not, I suggest you get accustomed to the language, and then read about monadic parsing [1] through Graham Hutton's work. Graham is a famous CS professor at the University of Nottingham, appears often on Computerphile [3,4], and wrote a book on Haskell [2].

I had to write an interpreter, optimizer and engine for a declarative language plus bottom up knowledge base in Haskell as part of an assignment, and an exam in a graduate course on advanced programming. Haskell made the problem significantly easier compared to languages I am much more comfortable with, like Python or C.

[1] www.cs.nott.ac.uk/~pszgmh/pearl.pdf

[2] https://www.amazon.com/Programming-Haskell-Graham-Hutton/dp/...

[3] https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA

[4] https://www.youtube.com/watch?v=eis11j_iGMs


Okay, this is really perplexing, this reply was at +5 and is now at -1. I don't understand why and I'd appreciate any explanation.


The Ruby yacc file is scary to look at. 13+ thousand lines in a single file.

Would it be better hand-rolled, where they could have abstracted and organized some things, or does it all make sense in its current format if you are familiar with it?

https://github.com/ruby/ruby/blob/v3_0_2/parse.y


I have a PhD in compiling Ruby and I struggle to read and modify that file.


Ruby is very flexible. Is that the issue?

I suspect not, because Scala is super flexible and probably more difficult to parse, but has a better organized parser.


The language is just quite large and quite complex to parse - context-sensitive and requires interaction back and forth with the lexer as it parses. TreeSitter seems to have a much simpler Ruby parser - but I believe it ignores the context-sensitive parts of Ruby.


Ruby's syntactic flexibility is definitely a big factor in it.


Parser generators generally suffer from problems with at least speed, error recovery, error handling, and debuggability. They also require you to learn a new language: the parser generator's language.

Last but not least, what you actually end up wanting to build is an AST, as at some point you want to do something with the input; with most parser generators you then have to implement even more code to build up the AST.

It is much easier to hand-write it. In the end it's faster to write and usually faster to run.

Every few years I evaluate the new options for the languages I use (C#, Pascal), and every time so far I have been disappointed with the tooling. Maybe one year.


I think you have to see them a bit like regexes. It may be a bit annoying to learn the grammar at first, but it's cross-language learning in a way, as you can use parser generator libraries in any language.


But it's still one extra language to learn, for in practice no benefit.


Copy/pasting the grammar of a language and using a library will be faster to do than any hand-written implementation, so I don't agree that there are no benefits.


But you can't do that because in practice these grammars have lots of imperative action code inserted all over the place because the parser generator's model fundamentally doesn't match the language class you're parsing and it needs to be worked around.

For example I work with a parser definition that's supposedly shared between C and Java, but it's a massive nightmare because it's 50% imperative actions.


That depends on the language you're trying to parse. If it has a LL or LR grammar, it'll work well. If it's something else that can't be described with something like that or something less powerful, you're going to have a bad time.


> That depends on the language you're trying to parse.

Yes... but most practical languages do not have simple LL or LR grammars. Hence this blog post and discussion and these problems.


That's not true. You can have an LL or LR grammar, and use a hand-written parser to get better speed, better error reporting, any number of reasons really. Java is LL(1), but javac seems to use a hand-written recursive descent parser. Now that I think about it, Java being omitted is weird.


> Java is LL(1)...

Java hasn't ever been able to be parsed with one token of lookahead; the earlier versions of the spec had a section on how to modify the grammar to be LALR(1)[0], but it was dropped in later versions of the specification -- probably due to added features which made it infeasible, like generic classes.

It's actually quite a good resource since they explain the reasons why the parser needs to have more than one token to figure out what's going on.

[0] http://titanium.cs.berkeley.edu/doc/java-langspec-1.0/19.doc...


What’s not true?

I said ‘most’ not ‘all’, and nobody here said the only reason to hand-write a parser was context sensitivity.


This will probably get a lot of downvotes but, in the way that “paint” is a solved problem and “art” is unsolved, I think parsing is a solved problem. Good languages and DSLs are an unsolved problem.


> Clang. Not only handwritten but the same file handles parsing C, Objective-C and C++.

And it's under 3000 lines!

https://github.com/llvm/llvm-project/blob/llvmorg-12.0.1/cla...


It seems to me that the parsing code in clang is distributed over multiple files which together are way more than 3000 lines: https://github.com/llvm/llvm-project/tree/llvmorg-12.0.1/cla...


There's also parser combinators, which (for example) are used in GHC's parser.


When I lookup "ghc parser" I get this page [0] that says the GHC parser is built from Happy [1], a Yacc-like parser generator. Grammar is here [2].

Could you link me to where you see GHC's parser using parser combinators?

[0] https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compil...

[1] https://www.haskell.org/happy/

[2] https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GH...


I was scanning the GHC source by eye, and misunderstood a comment about the "parsing monad". (Alex, the equivalent of lex, uses monads to describe actions, but I thought that monads were the main interface to the lexer.) Thanks for the correction.


Agreed! I would say that parser combinators are the sweet spot and the right choice in most cases.

Scala has them as well, e.g.: https://com-lihaoyi.github.io/fastparse/

And the good thing is, you don't have to learn a completely new language/syntax, you can use the host language's syntax and you have full IDE support as well.


I find using parser combinators a joy. It makes parsing e.g. LaTeX somewhat straightforward.


I've not come across parser combinators. Is there good introductory material?


It's Haskell-based, but here you go: http://dev.stephendiehl.com/fun/002_parsers.html


Agreed. I personally think of parser generator/parser combinator as similar to framework/library, where in the framework you have much less control and often have to rely on "escape hatches" to make custom stuff work. Whereas parser combinators are simply a library making implementing your parser much more convenient and expressive. You call the code, not the framework
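
To make the "just a library" point concrete, here's a toy combinator sketch in Python (not based on any real combinator library): a parser is a plain function from (tokens, position) to (value, new position), or None on failure, and the combinators merely compose such functions.

    def token(expected):
        def parse(toks, pos):
            if pos < len(toks) and toks[pos] == expected:
                return expected, pos + 1
            return None
        return parse

    def seq(*parsers):                   # all in order
        def parse(toks, pos):
            values = []
            for p in parsers:
                result = p(toks, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
            return values, pos
        return parse

    def alt(*parsers):                   # first one that succeeds
        def parse(toks, pos):
            for p in parsers:
                result = p(toks, pos)
                if result is not None:
                    return result
            return None
        return parse

    def many(parser):                    # zero or more
        def parse(toks, pos):
            values = []
            while True:
                result = parser(toks, pos)
                if result is None:
                    return values, pos
                value, pos = result
                values.append(value)
        return parse

    # The grammar "a (b | c)* d", written as ordinary code:
    grammar = seq(token("a"), many(alt(token("b"), token("c"))), token("d"))
    print(grammar(["a", "b", "c", "b", "d"], 0))
    # (['a', ['b', 'c', 'b'], 'd'], 5)

Everything stays ordinary code: you can set breakpoints in it, unit test individual rules, and mix in arbitrary host-language logic where the grammar alone isn't enough.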


I wasted so much time, first on using other people's frameworks and later on writing my own.

Recursive descent might be a tiny bit more code, but it will never stand in your way.

And you don't have to write absolutely everything by hand; it's perfectly possible to simplify the process somewhat using the full power of the host language to get incremental parsing, look ahead etc:

https://github.com/codr7/swifties/blob/main/Sources/Swifties...


I'd love to learn about parsers but I need an ELI5 version.

I've written a parser for CSV which I think works very well but I'm self taught and have no idea if it's any good.

Any 'parsers for dummies' guides?


Crafting Interpreters, then graduate onto "engineering a compiler".

The Dragon book is the classic reference on the subject, but I don't think it's very good pedagogically, and most printings I have seen recently have been awful.


Thanks. I'm looking at crafting interpreters.


To continue the discussion, I recommend [0]. [0] advocates the use of both parser generators and handwritten parsers. Parser generators make it easy to experiment and so are more suited to early development. Once a language is more stable, it can be more relevant to switch to a handwritten parser.

[0] https://drewdevault.com/2021/04/22/Our-self-hosted-parser-de...


When I switched from ANTLR to hand written for Adama ( http://www.adama-lang.org/ ), I felt way better about things. I was able to get sane error messages, and I could better annotate my syntax tree with comments and line/char numbers.

A killer feature for a parser generator would be the ability to auto-generate a pretty printer which requires stuffing comments into the tree as a "meta token".


I implemented an unparse function in IParse, which is not a parser generator, but a parser that interprets a grammar. See for example https://github.com/FransFaase/IParse/blob/master/software/c_... where symbols starting with a backslash are a kind of whitespace terminal used during the unparse. For example, \inc stands for incrementing the indentation, while \dec decrements it. The \s is used to indicate that at a given location a space should be included.


Have you seen BNFC (http://bnfc.digitalgrammars.com/)? I've used it with the Haskell bindings and really appreciated it. Parser, pretty-printer, and AST skeleton all auto-generated. I've only used it with Haskell though, can't speak to the other languages.


Somebody unfortunately chose to use ANTLR on a project I am now maintaining. ANTLR is really really bad. Run don’t walk away.


I have a fundamental question: What happened to the idea of specifying the syntax of a language using a "formal" grammar, however terribly flawed and imperfect that has been--for decades--and still is? It seems that if you need a grammar, e.g., to write a programming language tool, it's considered "old school". If it's published, it's of secondary consideration because it's usually out-of-sync with the implementation (e.g., C#, soon to be version 10, the latest "spec"--draft at that--version 6). Some don't even bother publishing a grammar, e.g., Julia. Why would the authors of Julia not publish a "formal" grammar of any kind, instead just offer a link to the 2500+ lines of Scheme-like code? Are we to now write tools to machine-scrape the grammar from the hand-written code or derive the grammar from a ML model of the compiler?


Who is going to put the time in to maintain it if the language is given by the implementation of the language? In the past, people wrote huge specs because you wanted multiple companies to implement compilers for their various computers. Now there's enough centralization that a single implementation of a language can exist. Why spend the time supporting 6 compilers when 99.99% of users will use the big 1 that the core language developers are working on? Labor is a finite resource, and because this requirement is gone the extra work that came with that requirement is discarded as well. If someone was willing to write up a full spec and maintain it for 10 years then I am sure it would be accepted to any open source language, but nobody decides to do that with their free time for some reason.


One big reason is that it is highly likely they not only don't have one, but that one couldn't even exist in any meaningful way, as their primary implementation is some Turing-complete monstrosity whose only "formal" spec is the code itself, and there is nothing "formal" about that. The cost of not using parser generators is essentially the same as the cost of using dynamic typing... it not only means you probably have inconsistency bugs all over the place, even if it made it feel easier to code without having to actually prove your program was correct all the time, but it further tends to lead you to rely on the dynamic typing features in places that make the resulting program not something that fits any known formal type model anymore, even if it works perfectly.


My textbook

https://www.cambridge.org/core/books/specifying-software/FC5...

has a tutorial introduction to recursive descent.


There's something to be said for a tool that a novice can use (my first hand written recursive descent parser was, no joke, because I wasn't up to the task of including a json library in my Java project) but still useful for professionals.


I think the reason parser generators are enticing and highly regarded, especially in academic settings, is that they tend to attract the same people who love complexity and abstraction for the sake of abstraction. Recursive descent, and even more so handwritten RD, is seen as "lowly" and "unsophisticated", despite all the practical advantages it has.

If you ask developers who have no compiler experience to come up with a solution for e.g. evaluating maths expressions, chances are that they will effectively derive recursive descent in writing a solution, because of its simplicity and ease of understanding.


Does anyone recommend any resources for learning how to write handwritten parsers nowadays? I’ve written a couple simple ones for various tasks before, but I’d really love to learn more about it.


Learn what lexing is. That's step one. Once you have a token stream, look into how you'd transform them into structures.

PUB FN ID(main) OPAREN CPAREN COLON ID(i32) OBRACE NL RETURN NUMBER(0) CBRACE

Look familiar? How might you parse that into something like

{ kind: "function", name: "main", args: [], public: true, return_type: { kind: "integer_type", width: 32 }, body: [ { kind: "return", expression: { kind: "number_literal", value: 0 } } ] }

A recursive descent parser is just a parser that starts at a root node type and recursively traverses a token stream and calls handler functions based on what's next. Those functions might also call other handlers (e.g. for nested expression parsing, etc.) which is why it's called recursive.
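
To make that concrete, here's a toy Python sketch of the descent over that token stream. The token names and AST shape follow the example above; everything else (the class layout, the "i32" handling, the single-statement body) is made up:

    TOKENS = [("PUB", None), ("FN", None), ("ID", "main"), ("OPAREN", None),
              ("CPAREN", None), ("COLON", None), ("ID", "i32"), ("OBRACE", None),
              ("NL", None), ("RETURN", None), ("NUMBER", "0"), ("CBRACE", None)]

    class Parser:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

        def expect(self, kind):
            if self.peek() != kind:
                raise SyntaxError(f"expected {kind}, got {self.peek()}")
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok[1]

        def parse_function(self):
            public = self.peek() == "PUB"
            if public:
                self.expect("PUB")
            self.expect("FN")
            name = self.expect("ID")
            self.expect("OPAREN"); self.expect("CPAREN")
            self.expect("COLON")
            ret = {"kind": "integer_type", "width": int(self.expect("ID")[1:])}
            self.expect("OBRACE"); self.expect("NL")
            body = [self.parse_statement()]      # a real parser loops until CBRACE
            self.expect("CBRACE")
            return {"kind": "function", "name": name, "args": [],
                    "public": public, "return_type": ret, "body": body}

        def parse_statement(self):
            self.expect("RETURN")
            value = int(self.expect("NUMBER"))
            return {"kind": "return",
                    "expression": {"kind": "number_literal", "value": value}}

    print(Parser(TOKENS).parse_function())

Each nonterminal in your head ("function", "statement") becomes one method, and expect() is where syntax errors get reported with full context.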


Actually don’t. Knowing how to write parsers without a lexer (scannerless parsers) is the true superpower.


Fully disagree. Care to expand?


Scannerless parsers are strictly more powerful than parsers using fixed lexers. You can literally change the language you are parsing based on what you have parsed so far. I have code running in production doing exactly that.


Generally such languages are so abstract that they lose usefulness in practice. Because something can be done, doesn't mean it should. Writing a simple recursive descent parser with a Lexer's token stream is much easier than doing bytestream ingestion on the fly, especially for a beginner.


You can even parse a definition of a parser and then continue parsing the rest of the code using that parser. Not something I have needed in production but it shows the power of non-lexing parsers.


Downvoting without giving a reason why is not useful.


This book is great: https://interpreterbook.com/

It is sort of a walk-through of a small programming language written in Go, using a recursive descent parser.

Crafting Interpreters is good as well, though I only read part of it, because it wasn't done at the time I read it. https://craftinginterpreters.com/

I've also interviewed both of them, if you like an audio intro: https://corecursive.com/032-bob-nystrom-on-building-an-inter...


Just read the code for an existing one like:

https://github.com/dlang/dmd/blob/master/src/dmd/cparse.d

which is a C parser. It's not hard to follow.


It seems strange that, in an effort to solve a common problem, 80% of the success stories opt not to use the automation. Perhaps there is ground for another toolkit.


Work on a compiler for a bit and I guarantee you'll get it. Parser generators pretty much always get in your way


Building an AST is an easy task once you've figured out your grammar and error recovery mechanism, but those are essentially a kind of language-specific design work that cannot be automated.


I love the lack of magic with parser combinators but they seem to struggle with left recursion and backtracking. Has anyone created a parser combinator library that is practical for real programming languages?


Are there any significant languages that do parsing with derivatives?


What are derivatives in this context?


Probably Brzozowski’s derivative of regular expressions, e.g https://matt.might.net/papers/might2011derivatives.pdf


From what I can find, in the case of regexes, the derivative of a regex R with respect to a character c, is the regex \partial_c R , which matches a string S iff R matches c S. So, the derivative of (ab)* with respect to the character 'a' , would be b(ab)* .

I imagine that the same idea applies for more general languages, so, I expect the derivative of a language with respect to a character, is the language of strings such that prepending one by the character results in a string of the original language.

So, if R_1 and R_2 are two different regexes, then the derivative with respect to c of the regex R_1 | R_2 , i.e. \partial_c (R_1 | R_2) , would be (\partial_c R_1) | (\partial_c R_2),

and, if + is used for concatenation of two regexes, the derivative of R_1 + R_2 would be, ((\partial_c R_1) + R_2) | ((if R_1 accepts the empty string, then the language which accepts only the empty string, otherwise, the language that does not accept any string) + (\partial_c R_2))

etc.

... I think.
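
Those rules translate almost directly into code. A toy Python sketch of derivative-based matching, for regexes built from literals, alternation, concatenation, and star (the tuple representation is ad hoc, and there's no simplification step, so real implementations do more):

    # Regexes as tuples: ("empty",) never matches, ("eps",) matches "",
    # ("chr", c), ("alt", r, s), ("cat", r, s), ("star", r).
    EMPTY, EPS = ("empty",), ("eps",)

    def nullable(r):
        """Does r accept the empty string?"""
        tag = r[0]
        if tag in ("eps", "star"):
            return True
        if tag == "alt":
            return nullable(r[1]) or nullable(r[2])
        if tag == "cat":
            return nullable(r[1]) and nullable(r[2])
        return False                    # "empty" and "chr"

    def deriv(r, c):
        """The derivative of r with respect to the character c."""
        tag = r[0]
        if tag == "chr":
            return EPS if r[1] == c else EMPTY
        if tag == "alt":
            return ("alt", deriv(r[1], c), deriv(r[2], c))
        if tag == "cat":
            left = ("cat", deriv(r[1], c), r[2])
            return ("alt", left, deriv(r[2], c)) if nullable(r[1]) else left
        if tag == "star":
            return ("cat", deriv(r[1], c), r)
        return EMPTY                    # "empty" and "eps"

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    ab_star = ("star", ("cat", ("chr", "a"), ("chr", "b")))   # (ab)*
    print(matches(ab_star, "abab"), matches(ab_star, "aba"))  # True False

Matching is then just: take the derivative once per input character, and accept if the final regex is nullable.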


I enjoyed learning that Julia's parser is written in Scheme.


This comes from the history of how the Julia language was bootstrapped. The fact that Jeff happened to have written his own Scheme (femtolisp) as a previous project probably helped :-)

Actually it's not just the parser but most of the compiler frontend which is written in femtolisp. It would be nice to make the frontend more accessible by replacing it with Julia code at some stage. Bootstrapping is tricky though until someone gets separate compilation working.


Femtolisp is, by the way, a brilliant and performant implementation of Scheme. I wish it were used outside of Julia.


If you’re interested in this stuff, CodeMirror is a fascinating exploration into these grammars. Particularly CodeMirror 6 and the Lezer parser.


Perl uses yacc, but with extreme lexer tricks to be dynamic.

All lisps use handwritten parsers, in under 100 lines.


Now I see why postgresql syntax errors are so unhelpful


Compilers have parsers, not languages. Many languages have more than one compiler (e.g., C), and the various compilers may use different techniques for parsing.


That's right! But HN has a title length limit. See the (actual) blog post title and the blog post contents. :)


Indeed. But consider the 2nd sentence, the conclusion, and the twitter post.

On the other hand, I certainly agree with your idea about teaching/using handwritten recursive-descent parsers. Here's an old book that presents it pretty clearly, along with a nice approach for error handling. Maybe you can find it in a library. https://www.springer.com/gp/book/9783540082408


To be even more pedantic: implementations have parsers. Some programming languages are not compiled.


Nobody was mistaken about this fact.


PromQL uses yacc too


Sorry for a shameless plug! I read up a lot on this subject while working on [my own toy compiler](https://news.ycombinator.com/item?id=28262149).

Other well-known languages that have hand-written parsers are Rust and D.

And here are a couple quotes from compiler developers explaining their reasons for going with hand-crafted parsers.

Someone from the C# compiler’s team gave the following reasons [1]:

>Hello, I work on the C# compiler and we use a handwritten recursive-descent parser. Here are a few of the more important reasons for doing so:

>Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.

>Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, by using recursive-descent, you can get information from further "up" the tree to make it more relevant to the context in which the error occurred.

>Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code ` var x = velocity.` is invalid C#. However, in order to give autocomplete on `velocity. `, that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.

GCC actually used Bison for a long time but eventually switched to a hand-written parser, the team gives some reasons in the changelog [2]

>A hand-written recursive-descent C++ parser has replaced the YACC-derived C++ parser from previous GCC releases. The new parser contains much improved infrastructure needed for better parsing of C++ source codes, handling of extensions, and clean separation (where possible) between proper semantics analysis and parsing.

Some people argue coding parsers by hand is prone to errors. This line of reasoning certainly makes a lot of sense to me, but it's not that simple in practice for non-trivial grammars. For instance, ANTLR 4 emits a so-called parse tree which you would like to convert to an AST if you have a somewhat complex grammar. This conversion takes quite a bit of code written manually, see AstBuilder.java [3] to get an idea (I'm not affiliated with the project). So there are still plenty of opportunities for errors in that amount of code. To be fair, creating ASTs is an explicit non-goal for ANTLR 4 [4].

[1]: https://news.ycombinator.com/item?id=13915150

[2]: https://gcc.gnu.org/gcc-3.4/changes.html

[3]: https://github.com/crate/crate/blob/5173b655a9fbf72028876ae7...

[4]: https://theantlrguy.atlassian.net/wiki/spaces/~admin/blog/20...


Run don’t walk away from ANTLR. Unless you are an intellectual masochist. I maintain a high performance Haskell like compiler for a living. Somebody added ANTLR to another tool I am now maintaining as well. It’s a maintenance nightmare.



