Parser generators vs. handwritten parsers: surveying major languages in 2021 (eatonphil.com)
190 points by eatonphil on Aug 21, 2021 | 145 comments



For serious projects, I find myself typically resorting to making a hand-written parser (usually recursive descent) because it’s the only way I can really get excellent and contextual error reporting and recovery. Parser generators are computer science marvels, but for the most part they’re too hard to customize well enough to produce a sensible error for the user, with line numbers, token stream provenance, and suggested recovery information. I can’t imagine the difficulties of trying to get good messages for something as large and pervasive as, say, C.

I think that with enough work, we can adapt and evolve parser generators to mesh well with better error reporting, and give the programmer more options and control, but it’ll take some elbow grease and probably breaking backwards compatibility with the production syntax of these parser generator systems. There’s also a risk, of course, that if there’s too much customization, you sort of lose the value of having a parser generator DSL in the first place.


One of my favorite PLT-related books is Programming Language Pragmatics, which spends a non-trivial section near the beginning on just this issue. Among other things, it's very useful to continue a best-effort parse on the remainder after a parse error so you can show more syntax errors rather than merely the first. But this is something of a nebulous art to accomplish.

I agree that I think generators could address this issue, they simply haven't.

Also I'm particularly fond of PEGs, because they match the intuitiveness of recursive descent parsers with a generator-suitable formalism (though they have their rough edges as well).


In my programming language MethodScript (which has a handwritten parser), I fail immediately during lexing (because an error there is a really bad sign), but during compilation I only fail immediately on things that are likely to cause a cascade of errors, such as a missing bracket or something.

But yeah, it makes the compiler output way easier to work through when you have everything at once.


I'm always shocked to find major projects that use parser generators. A hand-written recursive descent parser is easier to write, read, understand, and modify. It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.

So looking at the list I'm dumbfounded that CPython and Ruby use generators.


You won't find me defending the use of yacc, but in general hand-written parsers are far more error-prone.


My strategy when faced with doing a DSL of some kind is to write an accepter in bison or similar and knock the bugs out of the grammar and basic readability issues. Then go recursive descent with a table-driven precedence-climbing parser for expressions when ready for the real deal.


This is, I think, a very good strategy for many reasons. Not only is it a cheap way to work out the design of the language and a way to verify the correctness of your implementation. It's also a good fallback solution: if your (more expensive, expressive) implementation is not ready in time, you can fall back to integrating the generated implementation to meet an important deadline or whatnot.


It's harder to ensure that the grammar doesn't have unintended ambiguities when not using a parser generator. That doesn't matter as much when "just" developing another implementation of an existing, fully specced language, but for cases like the examples you cite such a spec doesn't exist.

That issue is the primary reason why postgres continues to use bison. The SQL standard frequently introduces potential parsing ambiguities and leaves a large portion of the things necessary for a functioning RDBMS unspecced.


> It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.

It's also potentially quite slow.


In my experience (with both hand-written parsers and generators, e.g. ANTLR), that's not the case.

Hand-written recursive descent parsers are simple to write, debug, maintain, and extend, while with generators I always had to fight the tool at some point, and that always ended in frustration.


We've just had different experiences then. It's funny you mention ANTLR because it was in particular very slow last time I tried it (just a couple years ago).

As for fighting the tool and simplicity, sure, that's beside my point.


Some grammars can't be parsed with recursive descent. At least, if I remember my parsing class from decades ago, they can't.


There are techniques that help parsers parse a far larger class of grammars than their underlying model permits. Some of these are:

- turning it into a state machine (required to deal with nested comments and C++ template syntax)

- building the symbol table while parsing and querying for disambiguation

- treating expressions as a stream of atoms and using a specialised precedence parser (probably required to deal with operator overloading)

Parser generators usually employ these as well.
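
To make the second technique concrete, here is a toy Python sketch of the classic "lexer hack" used by C parsers: the parser records typedef names in a shared symbol table as it sees them, and the lexer consults that table to decide whether an identifier should be emitted as a type name or a plain identifier. The token set and all names below are made up for illustration, not taken from any real compiler.

    import re

    # One alternation per token class; just enough of C to show the idea.
    TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<punct>[*;()]))")

    class Lexer:
        def __init__(self, src, typedef_names):
            self.src = src
            self.pos = 0
            self.typedef_names = typedef_names  # shared, mutable symbol table

        def next(self):
            m = TOKEN_RE.match(self.src, self.pos)
            if not m:
                return ("EOF", "")
            self.pos = m.end()
            if m.group("ident"):
                name = m.group("ident")
                # The context-sensitive part: the token kind depends on what
                # the parser has declared so far.
                kind = "TYPE_NAME" if name in self.typedef_names else "IDENT"
                return (kind, name)
            return ("PUNCT", m.group("punct"))

    typedefs = {"size_t"}   # the parser adds names here when it sees `typedef ...;`
    lex = Lexer("size_t * n ; a * b ;", typedefs)
    tok = lex.next()
    while tok[0] != "EOF":
        print(tok)          # ('TYPE_NAME', 'size_t'), ('PUNCT', '*'), ('IDENT', 'n'), ...
        tok = lex.next()

With that distinction in the token stream, the parser can tell `size_t * n ;` (a declaration) from `a * b ;` (an expression) with ordinary lookahead.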


Not true in practice. C++ is one of the hardest practical languages to parse. And it can be parsed using a recursive descent scannerless parser.


I'd really like to know why people choose one parsing algorithm over another.

I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced.

"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."

However, I seldom see languages use Earley parsers. There must be a reason, but I've never seen anybody explain why they choose one algorithm over another.


Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want to have your grammar be ambiguous. When Earley gives you a bunch of alternative parse trees, how do you disambiguate?


If you are parsing incorrect programs, you want resilience in the parser, including considering alternate parsings and generation of parses of sections of the file.

Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems.


It is quite possible to redefine much of the syntax with the cpp. Parsing without the cpp is hopeless.

My favourite is Bourne's longing for Algol: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd...


Of course parsing without cpp is, in general, hopeless. But for something like a structure-aware editor you don't need to parse it perfectly, just "good enough". Ditto for a source-to-source transformation system.


If your grammar is ambiguous, then an ambiguous program isn't incorrect.


It's not that the grammar is ambiguous, it's that the thing you're parsing may not even be in the language generated by the grammar. But you want to parse it as well as possible, and there may be more than one way to do that.


wow, absolutely not. if i'm parsing a programming language I want to basically stop and bail from the current context as soon as possible. there is basically nothing to gain by continuing.

if the shit i'm parsing isn't in the language there is no reason to continue. it's invalid.


If you are writing a code editor you might want to continue parsing it. Or maybe you detected an error, but you want to report more than one error at a time for a better programmer experience. There are lots of use cases for partially parsing a computer program.


> there is basically nothing to gain by continuing.

You gain error messages for subsequent errors, so you can fix them all before recompiling.


I had already explained why one might want to do this:

"Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems."

One cannot in general parse un-cpped C/C++. But that doesn't mean it's useless to do as well as one can, even if it cannot be parsed exactly. Parsing is for more than just compilers.


in my implementation, I give each rule a priority. I always choose the one with the highest priority to expand. I basically followed this https://loup-vaillant.fr/tutorials/earley-parsing/parser


I've written parsers for context-sensitive grammars, and as it turns out, this is not a desirable feature of a language. The reason you typically see simple parsers is because simple grammars are usually unambiguous and context-free, which is more convenient for humans to understand and internalize.

After all, when you're writing software, you pretend to be the compiler to some extent.

Natural languages, like English, are a good example of something which humans struggle with because they are complex, ambiguous, and often require context. Sure, it's extremely expressive, but that is the sharpest double-edged sword in programming languages.


Neat, though this just seems like PEG with extra steps and more pain. Still this answers my question. Thanks!


Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

As the above commenter mentioned, most language designers make hand-rolled parsers. Making a hand-rolled LL or LR parser is easier too.

In general most people think advanced = bad, they want the easiest solution which gets the job done well.


This is probably known somewhere, but something I noticed is that a way to think about Earley parsing is that, when you create an LR(1) shift-reduce parser, rather than having it be an error in the grammar for there to be shift-reduce or reduce-reduce conflicts, instead you nondeterministically reduce (i.e., your state machine has a set of states rather than requiring there to be a single state). This also ends up solving the left recursion problem in an efficient way.

This is certainly slower than a deterministic LR(1) parser, but maybe if you're clever about the implementation you can efficiently handle local ambiguities in a grammar. Large-scale ambiguity is bad since you don't want small changes to have action-at-a-distance, but you could in principle have a strict LR(1) grammar that contains Earley sub-grammars for expressions.


> LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

Earley's algorithm should be trivially parallelizable, which might make it a contender for languages which are ambiguous (like most real-world languages) where people are hand-rolling parsers anyway. I haven't tried since I have no real need, but looking at the steps I can't see a reason it couldn't do its work across multiple threads.

Honestly, other than JavaScript 1.1 I can't think of a popular language which has an unambiguous syntax and I really like playing with language grammars for some odd reason -- probably wrong though...


From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, and wasn't a good fit for 1960s hardware, and had poor performance for some types of grammars.

By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."

It takes Aycock and Horspool's work in 2002 and Kegler's work in 2010 in Marpa to have a "practical implementation" (quoting that link).

(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)


> I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.

That one is really the only Earley parser I've found used in the wild (don't know what Marpa is used for), and unfortunately it is mostly unhackable because they did some serious optimization voodoo on it, so it was replaced by a hand-written recursive descent parser a while back because nobody in the world could figure out how it works[0] -- which is kind of strange since ASDL is super simple to parse and the generator which used SPARK was meant to check files into source control, but whatever.

It's easy to play around with but not a great source if you want to see how an Earley parser is put together. There are also some bugs with parser actions on duplicate rules not working properly that were pretty easy to fix, but Python pulled it out of the source tree, so there's no upstream to send patches to?

[0] might be making that part up, dunno?


You are one of the "couple of people" I was referring to. :)

I know SPARK's docstring use influenced PLY.

PLY doesn't use Earley, but "Earley" does come up in the show notes of an interview with Beazley, PLY's author, at https://www.pythonpodcast.com/episode-95-parsing-and-parsers... . No transcript, and I'm not going to listen to it just to figure out the context.

https://github.com/lark-parser/lark "implements both Earley(SPPF) and LALR(1)".

Kegler, the author of that timeline I linked to, is the author of Marpa. Home page is http://savage.net.au/Marpa.html . The most recent HN comments about it are from a year ago, at https://news.ycombinator.com/item?id=24321395 .


Lark [1] is a battle-tested pure-Python Earley parser. My company uses it in production. It is by far the easiest parser generator I've ever used.

1. https://github.com/lark-parser/lark


Lark is amazing... But it's also one of the best LR parsers out there and I would guess that mode is used a lot more than the Earley mode.

Either way, I have never used a better parser generator. It has the best usability and incredible performance when you consider it is written in pure Python.


nearley.js is an Earley parser that sees at least some use. https://nearley.js.org/


Without certain optimizations it will be too slow. You really need to optimize right recursion to not be quadratic, for example.


jq uses bison, and it takes a startling amount of work to diagnose parsing problems so as to give helpful error messages. Last time I checked, the issues were full of new users stopped by something an appropriate error message would resolve.


I took the compilers class at Stanford and never really understood the algorithms of bottom up parsing, or even really how grammars worked. I just made the tool work.

I then went to work at a private company, and an older guy who had gone to a state school that taught recursive descent (his was the last class to teach it) taught me how to do it. In a month or so I had learned more about how grammars actually work, what ambiguity is, and so forth, than in my whole class at Stanford.

I now teach compilers at a university, and I teach recursive descent.


I recommend you also teach "precedence climbing", which is basically a slight refactoring of RD to parameterise the recursive functions by precedence level, so that grammars with many levels don't require proportionally deep call stacks; it also massively reduces the duplication of code at each level:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing
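
For reference, here's a toy Python sketch of precedence climbing over an already-lexed token list (the token format and precedence table are illustrative, not from any real parser). A single parse_expr(min_prec) replaces the one-function-per-precedence-level cascade of a textbook recursive descent expression grammar:

    # Precedence climbing over a flat token list like ["1", "+", "2", "*", "3"].
    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_expr(tokens, pos=0, min_prec=1):
        lhs, pos = parse_atom(tokens, pos)
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            # Left associative: the right operand may only use tighter operators.
            rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    def parse_atom(tokens, pos):
        if tokens[pos] == "(":
            expr, pos = parse_expr(tokens, pos + 1)
            assert tokens[pos] == ")", "expected closing paren"
            return expr, pos + 1
        return ("num", int(tokens[pos])), pos + 1

    tree, _ = parse_expr(["1", "+", "2", "*", "3"])
    print(tree)   # ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))

Adding an operator or changing its precedence is then a table edit rather than a new function and a new grammar level.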


I still recommend spending the time to (re-)learn bottom-up parsing though, even if you never use it again. Specifically, LR and then GLR (no need to bother with LALR/SLR/all those variants). Maybe even recursive ascent too, or recursive ascent-descent if you find those interesting. It's painful and confusing and takes a long time to pick them up, but it's quite gratifying once you do learn it. The algorithms are super clever and cool, and help you gain a lot of appreciation and intuition for what is/isn't possible to do (whether efficiently or inefficiently).

Using recursive descent for everything is like using Python (or Go or pick your favorite easy language to learn) for everything. Yeah you can totally do it, and a heck of a lot of people do precisely because of how nice and convenient it is, but it's good to have a sense what else is out there and how you can think about a problem differently IMO, even if you rarely use other languages in practice.


i agree, but for an undergraduate understanding of parsing and grammars, recursive descent is perfect

i think the more complex parsing algorithms were good for producing papers, which explains their success in academia but relative lack of success in industry


Recursive parsing is a completely valid concept, but it's unnatural at first to hand-write parsers with it. I have tried it - even wrote a prototype DSL in Haskell (the CPS monad is very useful for that), but what I end up doing is using the DSL to write down the LR state machine, bottom up. This is eerily similar to embedded and GUI programming, where internal state matters and the program reacts to input. It might even be a better way to incorporate ad-hoc code that has side effects. However, most people think about programming languages top-down and use grammars to describe them.


as an aside, the parser for https://hyperscript.org is recursive descent, and our motto is: "remember what our parser professors taught us, then do the opposite of that."

There's some ugly stuff in there, but it's fun, all because we wrote it in recursive descent and can do whatever we darn well please.


I think the best way to understand it is that you're creating an AST (a tree describing the program), for a bottom up parser you are recognising and making little treelets and stitching them together into larger treelets until you reach the top/root, while for recursive descent you're building them from the top down from the root.

The main difference is that LR (bottom-up) family grammars can describe more languages than LL (top-down) ones can, largely because bottom-up has slightly more contextual information available when it puts stuff together (often small treelets rather than just the next token).


I took a compilers class at an ex-Polytechnic in the UK that taught recursive descent - I now wonder how old the textbook they were using was!


Yep that’s the way to do it. Teaching LL/LR and its variations is a complete waste of time.


I hope you are not making the same mistakes as your teacher at Stanford did (and leave students clueless).


i hope so too


Any resources you recommend for someone who is interested in learning about recursive descent?


Yes, the second part of Crafting Interpreters has a great introduction to the topic and is available online for free:

https://craftinginterpreters.com/contents.html


Thanks!


Get a lexer (like, just steal one if you don't think you'll be able to write one without getting bored, they're usually fairly self-contained) and just start writin' code. They're basically that simple.

The lovely thing about a nicely written recursive descent parser is that the code pretty much is the grammar (topologically), so any grammar you can actually imagine that can be parsed by recursive descent doesn't take much thought to make work.


IMVLE, a lexer is just a big regular expression, with a branch at the top level for each type of token. Are there cases where this doesn't work?
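
Concretely, something like this is all I usually mean by it (a Python sketch; the token set is illustrative, not any particular language's):

    import re

    # One big regular expression: a named branch per token type.
    TOKEN_RE = re.compile(r"""
        (?P<NUMBER> \d+)
      | (?P<IDENT>  [A-Za-z_]\w*)
      | (?P<OP>     [+\-*/()=])
      | (?P<SKIP>   [ \t]+)
      | (?P<ERROR>  .)
    """, re.VERBOSE)

    def tokenize(src):
        for m in TOKEN_RE.finditer(src):
            kind = m.lastgroup
            if kind == "SKIP":
                continue
            if kind == "ERROR":
                raise SyntaxError(f"unexpected character {m.group()!r}")
            yield kind, m.group()

    print(list(tokenize("x = 40 + 2")))
    # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]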


Python and Haskell are interesting because the indentation is crucial to figure out the correct nesting. Haskell allows you to use curly braces and semicolons to completely remove indentation, so the lexer only has to add virtual tokens into the token stream. For Python, I don't know how it deals with it.

C/C++ have plenty of difficult syntax edge cases. For example, >> can represent a single operator in an expression or two successive > in a template declaration.


Python has some preprocessing to inject INDENT and DEDENT pseudo-tokens into the lexing stream:

https://docs.python.org/3/reference/lexical_analysis.html#in...
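
Roughly, the tokenizer keeps a stack of indentation levels and emits INDENT/DEDENT pseudo-tokens as the level changes. A simplified Python sketch of that scheme (the real tokenizer also handles tabs, continuation lines, and various error cases):

    def with_indent_tokens(lines):
        stack = [0]                        # indentation levels currently open
        for line in lines:
            if not line.strip():
                continue                   # blank lines produce no INDENT/DEDENT
            indent = len(line) - len(line.lstrip(" "))
            if indent > stack[-1]:
                stack.append(indent)
                yield ("INDENT", "")
            while indent < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            yield ("LINE", line.strip())
        while len(stack) > 1:              # close any still-open blocks at EOF
            stack.pop()
            yield ("DEDENT", "")

    src = ["if x:", "    y = 1", "    if y:", "        z = 2", "done()"]
    for tok in with_indent_tokens(src):
        print(tok)

After this pass the parser never has to think about whitespace at all; a block is just whatever sits between an INDENT and its matching DEDENT.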


Some things require (or are more easily implemented) using state-based lexing, such as identifying escaped characters in JSON strings or XML attributes, or documentation comments. These can be thought of as a Finite State Machine (for the state) driving the different sub-lexer regex token matching.
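
For example, a string literal with escape sequences is natural to lex with a tiny two-state machine rather than a single regex. A toy Python sketch (illustrative only; a real lexer would track source positions and support more escapes):

    def lex_string(src, pos):
        """Assumes src[pos] is the opening quote; returns (value, next_pos)."""
        assert src[pos] == '"'
        pos += 1
        out = []
        state = "NORMAL"
        while pos < len(src):
            ch = src[pos]
            pos += 1
            if state == "NORMAL":
                if ch == "\\":
                    state = "ESCAPE"           # next character is special
                elif ch == '"':
                    return "".join(out), pos   # closing quote ends the token
                else:
                    out.append(ch)
            else:                              # state == "ESCAPE"
                out.append({"n": "\n", "t": "\t"}.get(ch, ch))
                state = "NORMAL"
        raise SyntaxError("unterminated string literal")

    print(lex_string('"a\\nb" rest', 0))       # ('a\nb', 6)

The same pattern scales up: keep an explicit state (IN_STRING, IN_COMMENT, ...) and pick which sub-lexer or regex to run based on it.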


I know it's not Reddit but username checks out :)


Ultimately, the reason many languages use manually written parsers is that the code generated by parser generators just isn't worth the time saved by not writing a parser manually. The underlying assumption is that writing one by hand is somehow super hard, error prone, and time consuming. I think that assumption is simply wrong.

I had my compiler course at the University of Utrecht in the mid nineties. The local faculty there had some functional programming pioneers and did a lot of work on parser generators. So, I learned how to write a parser generator in Gofer (a Haskell predecessor). Very nice course and definitely a bit of a mind fuck since I also had to wrap my head around monads at the same time. And of course not really that useful for real world stuff, and I've since forgotten most of what I learned as I have not used it in over 25 years.

A few years ago, I had to sit down and write a parser for a little query language that we came up with. I spent some time looking at all sorts of convoluted tools for this and refreshing some decades-old knowledge on grammars to actually try to understand what the tools were trying to do. I then found a useful example of a lexer that simply used regular expressions and suddenly realized I was wasting time and that this stuff was stupidly simple. I then knocked this little project out in a few hours using that. Once you have a lexer, things get easy and they are not that hard to write. I think this is why people end up doing these things manually. It's just not that hard or worth automating and it's rare enough that it's not worth investing time in learning new tools.

Also, modern languages are a lot better at this stuff too. String handling in C was always a bit tedious. But modern languages come with so much built-in stuff for that that it's not really a burden. So, in short, parser generators solve a problem that's not really that big of a problem anymore.


> Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants.

They're far from the only way; handwritten parsers are often mentioned as having way better error recovery. But if you're making a new configuration language, programming language, or something like that, ensuring that it has an LR-compliant grammar and that this grammar is the source of truth can avoid a lot of pain later down the road.


The C++ grammar is not LALR(n) for any n. It requires unbounded lookahead. This means that it does not fit the yacc/bison model, and back when bison was used, this was dealt with by adding a third pass, called spew, between the lexer and the parser to look far enough ahead to tell types from variables from functions. It was a nasty hack which made it very difficult to generate good error messages. But the language is a much better fit for a good hand-written recursive descent parser.


I built an Algol68 compiler way back when - it was a mix - it used recursive descent to parse the bracketing structure and type definitions, and an LALR parser (homegrown generator) for the second pass. It fit well, provided you allowed the parser generator to resolve one of the shift/reduce conflicts on the fly at runtime (to handle the language's ability to change the priority of operators).


I think more programmers should learn how to write parsers by hand. You need them everywhere; for protocols, file formats, user input, little languages. They often use regular expressions or other crutches when a parser would be the better choice.


Sometimes when the problem space is simple enough you can get the best of both worlds by using regexes to perform your tokenization when lexing and then passing the tokens to a traditional stack-based parser. It’s possible for this approach to be bad performance-wise, but it depends on the use case.

The lexer generator that accompanies yacc, lex, is indeed regex-based, and as I also recently discovered, it was originally written by a little-known developer by the name of Eric Schmidt [1]

1: https://en.m.wikipedia.org/wiki/Lex_(software)


What would be good resources for this?


Start with a simple language like parsing JSON or S-expressions and do it in a language you already know. There are a ton of tutorials you can find online along these lines!
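
For instance, a complete S-expression reader fits in a handful of lines. A toy Python version (atoms are just ints and symbols):

    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            lst = []
            while tokens[0] != ")":
                lst.append(parse(tokens))
            tokens.pop(0)          # consume ")"
            return lst
        if tok == ")":
            raise SyntaxError("unexpected )")
        try:
            return int(tok)        # numbers become ints...
        except ValueError:
            return tok             # ...everything else stays a symbol

    print(parse(tokenize("(+ 1 (* 2 3))")))   # ['+', 1, ['*', 2, 3]]

Once that clicks, JSON is the same exercise with a slightly bigger tokenizer and a few more cases in the parse function.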


Course notes for University of Waterloo CS241 "Foundations of sequential programs"

Part of the course is understanding exactly when Regexps are good enough and when you need something more powerful.

http://anthony-zhang.me/University-Notes/CS241/CS241.html


As @rofrol already said, definitely Crafting Interpreters.

Just would like to add a link: https://craftinginterpreters.com/contents.html

It explains parsing and other topics in a really clear and accessible way.


Crafting interpreters?


Do you know Haskell? If not, I suggest you get accustomed to the language, and then read about monadic parsing [1] through Graham Hutton's work. Graham is a famous CS professor at the University of Nottingham, appears often on Computerphile [3,4], and wrote a book on Haskell [2].

I had to write an interpreter, optimizer and engine for a declarative language plus bottom up knowledge base in Haskell as part of an assignment, and an exam in a graduate course on advanced programming. Haskell made the problem significantly easier compared to languages I am much more comfortable with, like Python or C.

[1] www.cs.nott.ac.uk/~pszgmh/pearl.pdf

[2] https://www.amazon.com/Programming-Haskell-Graham-Hutton/dp/...

[3] https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA

[4] https://www.youtube.com/watch?v=eis11j_iGMs


Okay, this is really perplexing, this reply was at +5 and is now at -1. I don't understand why and I'd appreciate any explanation.


The Ruby yacc file is scary to look at. 13+ thousand lines in a single file.

Would it be better hand-rolled, where they could have abstracted and organized some things, or does it all make sense in its current format if you are familiar with it?

https://github.com/ruby/ruby/blob/v3_0_2/parse.y


I have a PhD in compiling Ruby and I struggle to read and modify that file.


Ruby is very flexible. Is that the issue?

I suspect not, because Scala is super flexible and probably more difficult to parse, but has a better organized parser.


The language is just quite large and quite complex to parse - context-sensitive and requires interaction back and forth with the lexer as it parses. TreeSitter seems to have a much simpler Ruby parser - but I believe it ignores the context-sensitive parts of Ruby.


Ruby's syntactic flexibility is definitely a big factor in it.


Parser generators generally suffer from problems with at least speed, error recovery, error handling, and debuggability. They also require you to learn a new language: the parser generator's language.

Last but not least, what you actually end up wanting to build is an AST, as at some point you want to do something with the input; with most parser generators you then have to implement even more code to build up the AST.

It is much easier to hand-write it. In the end it's faster to write and usually faster to run.

Every few years I evaluate the new options for the languages I use (C#, Pascal), and every time so far I have been disappointed with the tooling. Maybe one year.


I think you have to see them a bit like regexes. It may be a bit annoying to learn the grammar at first, but it's cross-language learning in a way, as you can use parser generator libraries in any language.


But it's still one extra language to learn, for in practice no benefit.


Copy/pasting the grammar of a language and using a library will be faster to do than any hand-written implementation, so I don't agree that there are no benefits.


But you can't do that because in practice these grammars have lots of imperative action code inserted all over the place because the parser generator's model fundamentally doesn't match the language class you're parsing and it needs to be worked around.

For example I work with a parser definition that's supposedly shared between C and Java, but it's a massive nightmare because it's 50% imperative actions.


That depends on the language you're trying to parse. If it has a LL or LR grammar, it'll work well. If it's something else that can't be described with something like that or something less powerful, you're going to have a bad time.


> That depends on the language you're trying to parse.

Yes... but most practical languages do not have simple LL or LR grammars. Hence this blog post and discussion and these problems.


That's not true. You can have an LL or LR grammar, and use a hand-written parser to get better speed, better error reporting, any number of reasons really. Java is LL(1), but javac seems to use a hand-written recursive descent parser. Now that I think about it, Java being omitted is weird.


> Java is LL(1)...

Java hasn't ever been able to be parsed with one token of lookahead; the earlier versions of the spec had a section on how to modify the grammar to be LALR(1)[0], but it was dropped in later versions of the specification -- probably due to added features which made it infeasible, like generic classes.

It's actually quite a good resource since they explain the reasons why the parser needs to have more than one token to figure out what's going on.

[0] http://titanium.cs.berkeley.edu/doc/java-langspec-1.0/19.doc...


What’s not true?

I said ‘most’ not ‘all’, and nobody here said the only reason to hand-write a parser was context sensitivity.


This will probably get a lot of downvotes but, in the way that “paint” is a solved problem and “art” is unsolved, I think parsing is a solved problem. Good languages and DSLs are an unsolved problem.


> Clang. Not only handwritten but the same file handles parsing C, Objective-C and C++.

And it's under 3000 lines!

https://github.com/llvm/llvm-project/blob/llvmorg-12.0.1/cla...


It seems to me that the parsing code in clang is distributed over multiple files which together are way more than 3000 lines: https://github.com/llvm/llvm-project/tree/llvmorg-12.0.1/cla...


There's also parser combinators, which (for example) are used in GHC's parser.


When I lookup "ghc parser" I get this page [0] that says the GHC parser is built from Happy [1], a Yacc-like parser generator. Grammar is here [2].

Could you link me to where you see GHC's parser using parser combinators?

[0] https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compil...

[1] https://www.haskell.org/happy/

[2] https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GH...


I was scanning the GHC source by eye, and misunderstood a comment about the "parsing monad". (Alex, the equivalent of lex, uses monads to describe actions, but I thought that monads were the main interface to the lexer.) Thanks for the correction.


Agreed! I would say that parser combinators are the sweet spot and the right choice in most cases.

Scala has them as well, e.g.: https://com-lihaoyi.github.io/fastparse/

And the good thing is, you don't have to learn a completely new language/syntax, you can use the host language's syntax and you have full IDE support as well.


I find using parser combinators a joy. It makes parsing e.g. LaTeX somewhat straightforward.


I've not come across parser combinators. Is there good introductory material?


It's Haskell-based, but here you go: http://dev.stephendiehl.com/fun/002_parsers.html


Agreed. I personally think of parser generator/parser combinator as similar to framework/library, where in the framework you have much less control and often have to rely on "escape hatches" to make custom stuff work. Whereas parser combinators are simply a library making implementing your parser much more convenient and expressive. You call the code, not the framework
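
To make the "just a library" point concrete, here's a toy combinator sketch in Python (not based on any real combinator library): a parser is a plain function from (tokens, position) to (value, new position), or None on failure, and the combinators merely compose such functions.

    def token(expected):
        def parse(toks, pos):
            if pos < len(toks) and toks[pos] == expected:
                return expected, pos + 1
            return None
        return parse

    def seq(*parsers):                   # all in order
        def parse(toks, pos):
            values = []
            for p in parsers:
                result = p(toks, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
            return values, pos
        return parse

    def alt(*parsers):                   # first one that succeeds
        def parse(toks, pos):
            for p in parsers:
                result = p(toks, pos)
                if result is not None:
                    return result
            return None
        return parse

    def many(parser):                    # zero or more
        def parse(toks, pos):
            values = []
            while True:
                result = parser(toks, pos)
                if result is None:
                    return values, pos
                value, pos = result
                values.append(value)
        return parse

    # The grammar "a (b | c)* d", written as ordinary code:
    grammar = seq(token("a"), many(alt(token("b"), token("c"))), token("d"))
    print(grammar(["a", "b", "c", "b", "d"], 0))
    # (['a', ['b', 'c', 'b'], 'd'], 5)

Everything stays ordinary code: you can set breakpoints in it, unit test individual rules, and mix in arbitrary host-language logic where the grammar alone isn't enough.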


I wasted so much time, first on using other people's frameworks and later on writing my own.

Recursive descent might be a tiny bit more code, but it will never stand in your way.

And you don't have to write absolutely everything by hand; it's perfectly possible to simplify the process somewhat using the full power of the host language to get incremental parsing, look ahead etc:

https://github.com/codr7/swifties/blob/main/Sources/Swifties...


I'd love to learn about parsers but I need an ELI5 version.

I've written a parser for CSV which I think works very well but I'm self taught and have no idea if it's any good.

Any 'parsers for dummies' guides?


Crafting Interpreters, then graduate onto "engineering a compiler".

The Dragon book is the classic reference on the subject, but I don't think it's very good pedagogically, and most printings I have seen recently have been awful.


Thanks. I'm looking at crafting interpreters.


To continue the discussion, I recommend [0]. [0] advocates the use of both parser generators and handwritten parsers. Parser generators make it easy to experiment and so are more suited to early development. Once a language is more stable, it can be more relevant to switch to a handwritten parser.

[0] https://drewdevault.com/2021/04/22/Our-self-hosted-parser-de...


When I switched from ANTLR to hand written for Adama ( http://www.adama-lang.org/ ), I felt way better about things. I was able to get sane error messages, and I could better annotate my syntax tree with comments and line/char numbers.

A killer feature for a parser generator would be the ability to auto-generate a pretty printer which requires stuffing comments into the tree as a "meta token".


I implemented an unparse function in IParse, which is not a parser generator, but a parser that interprets a grammar. See for example https://github.com/FransFaase/IParse/blob/master/software/c_... where symbols starting with a backslash are a kind of whitespace terminal used during the unparse. For example, \inc stands for incrementing the indentation, while \dec decrements it. The \s is used to indicate that at a given location a space should be included.


Have you seen BNFC (http://bnfc.digitalgrammars.com/)? I've used it with the Haskell bindings and really appreciated it. Parser, pretty-printer, and AST skeleton all auto-generated. I've only used it with Haskell though, can't speak to the other languages.


Somebody unfortunately chose to use ANTLR on a project I am now maintaining. ANTLR is really really bad. Run don’t walk away.


I have a fundamental question: What happened to the idea of specifying the syntax of a language using a "formal" grammar, however terribly flawed and imperfect that has been--for decades--and still is? It seems that if you need a grammar, e.g., to write a programming language tool, it's considered "old school". If it's published, it's of secondary consideration because it's usually out-of-sync with the implementation (e.g., C#, soon to be version 10, the latest "spec"--draft at that--version 6). Some don't even bother publishing a grammar, e.g., Julia. Why would the authors of Julia not publish a "formal" grammar of any kind, instead just offer a link to the 2500+ lines of Scheme-like code? Are we to now write tools to machine-scrape the grammar from the hand-written code or derive the grammar from a ML model of the compiler?


Who is going to put the time in to maintain it if the language is given by the implementation of the language? In the past, people wrote huge specs because you wanted multiple companies to implement compilers for their various computers. Now there's enough centralization that a single implementation of a language can exist. Why spend the time supporting 6 compilers when 99.99% of users will use the big 1 that the core language developers are working on? Labor is a finite resource, and because this requirement is gone the extra work that came with that requirement is discarded as well. If someone was willing to write up a full spec and maintain it for 10 years then I am sure it would be accepted to any open source language, but nobody decides to do that with their free time for some reason.


One big reason is that it is highly likely they not only don't have one, but that one couldn't even exist in any meaningful way, as their primary implementation is some Turing-complete monstrosity whose only "formal" spec is the code itself, and there is nothing "formal" about that. The cost of not using parser generators is essentially the same as the cost of using dynamic typing... it not only means you probably have inconsistency bugs all over the place, even if it made it feel easier to code without having to actually prove your program was correct all the time, but it further tends to lead you to rely on the dynamic typing features in places that make the resulting program not something that fits any known formal type model anymore, even if it works perfectly.


My textbook

https://www.cambridge.org/core/books/specifying-software/FC5...

has a tutorial introduction to recursive descent.


There's something to be said for a tool that a novice can use (my first hand written recursive descent parser was, no joke, because I wasn't up to the task of including a json library in my Java project) but still useful for professionals.


I think the reason parser generators are enticing and highly regarded, especially in academic settings, is that they tend to attract the same people who love complexity and abstraction for the sake of abstraction. Recursive descent, and even more so handwritten RD, is seen as "lowly" and "unsophisticated", despite all the practical advantages it has.

If you ask developers who have no compiler experience to come up with a solution for e.g. evaluating maths expressions, chances are that they will effectively derive recursive descent in writing a solution, because of its simplicity and ease of understanding.


Does anyone recommend any resources for learning how to write handwritten parsers nowadays? I’ve written a couple simple ones for various tasks before, but I’d really love to learn more about it.


Learn what lexing is. That's step one. Once you have a token stream, look into how you'd transform them into structures.

PUB FN ID(main) OPAREN CPAREN COLON ID(i32) OBRACE NL RETURN NUMBER(0) CBRACE

Look familiar? How might you parse that into something like

{ kind: "function", name: "main", args: [], public: true, return_type: { kind: "integer_type", width: 32 }, body: [ { kind: "return", expression: { kind: "number_literal", value: 0 } } ] }

A recursive descent parser is just a parser that starts at a root node type and recursively traverses a token stream and calls handler functions based on what's next. Those functions might also call other handlers (e.g. for nested expression parsing, etc.) which is why it's called recursive.
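
To make that concrete, here's a toy Python sketch of the descent over that token stream. The token names and AST shape follow the example above; everything else (the class layout, the "i32" handling, the single-statement body) is made up:

    TOKENS = [("PUB", None), ("FN", None), ("ID", "main"), ("OPAREN", None),
              ("CPAREN", None), ("COLON", None), ("ID", "i32"), ("OBRACE", None),
              ("NL", None), ("RETURN", None), ("NUMBER", "0"), ("CBRACE", None)]

    class Parser:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

        def expect(self, kind):
            if self.peek() != kind:
                raise SyntaxError(f"expected {kind}, got {self.peek()}")
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok[1]

        def parse_function(self):
            public = self.peek() == "PUB"
            if public:
                self.expect("PUB")
            self.expect("FN")
            name = self.expect("ID")
            self.expect("OPAREN"); self.expect("CPAREN")
            self.expect("COLON")
            ret = {"kind": "integer_type", "width": int(self.expect("ID")[1:])}
            self.expect("OBRACE"); self.expect("NL")
            body = [self.parse_statement()]      # a real parser loops until CBRACE
            self.expect("CBRACE")
            return {"kind": "function", "name": name, "args": [],
                    "public": public, "return_type": ret, "body": body}

        def parse_statement(self):
            self.expect("RETURN")
            value = int(self.expect("NUMBER"))
            return {"kind": "return",
                    "expression": {"kind": "number_literal", "value": value}}

    print(Parser(TOKENS).parse_function())

Each nonterminal in your head ("function", "statement") becomes one method, and expect() is where syntax errors get reported with full context.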


Actually don’t. Knowing how to write parsers without a lexer (scannerless parsers) is the true superpower.


Fully disagree. Care to expand?


Scannerless parsers are strictly more powerful than parsers using fixed lexers. You can literally change the language you are parsing based on what you have parsed so far. I have code running in production doing exactly that.


Generally such languages are so abstract that they lose usefulness in practice. Because something can be done, doesn't mean it should. Writing a simple recursive descent parser with a Lexer's token stream is much easier than doing bytestream ingestion on the fly, especially for a beginner.


You can even parse a definition of a parser and then continue parsing the rest of the code using that parser. Not something I have needed in production but it shows the power of non-lexing parsers.


Downvoting without giving a reason why is not useful.


This book is great: https://interpreterbook.com/

It is sort of a walk-through of a small programming language written in Go, using a recursive descent parser.

Crafting Interpreters is good as well, though I only read part of it, because it wasn't done at the time I read it. https://craftinginterpreters.com/

I've also interviewed both of them, if you like an audio intro: https://corecursive.com/032-bob-nystrom-on-building-an-inter...


Just read the code for an existing one like:

https://github.com/dlang/dmd/blob/master/src/dmd/cparse.d

which is a C parser. It's not hard to follow.


It seems strange that, in an effort to solve a common problem, 80% of the success stories opt not to use the automation. Perhaps there is ground for another toolkit.


Work on a compiler for a bit and I guarantee you'll get it. Parser generators pretty much always get in your way


Building an AST is an easy task once you've figured out your grammar and error recovery mechanism, but those are essentially a kind of language-specific design work that cannot be automated.


I love the lack of magic with parser combinators but they seem to struggle with left recursion and backtracking. Has anyone created a parser combinator library that is practical for real programming languages?


Are there any significant languages that do parsing with derivatives?


What are derivatives in this context?


Probably Brzozowski’s derivative of regular expressions, e.g https://matt.might.net/papers/might2011derivatives.pdf


From what I can find, in the case of regexes, the derivative of a regex R with respect to a character c, is the regex \partial_c R , which matches a string S iff R matches c S. So, the derivative of (ab)* with respect to the character 'a' , would be b(ab)* .

I imagine that the same idea applies for more general languages, so, I expect the derivative of a language with respect to a character, is the language of strings such that prepending one by the character results in a string of the original language.

So, if R_1 and R_2 are two different regexes, then the derivative with respect to c of the regex R_1 | R_2 , i.e. \partial_c (R_1 | R_2) , would be (\partial_c R_1) | (\partial_c R_2),

and, if + is used for concatenation of two regexes, the derivative of R_1 + R_2 would be, ((\partial_c R_1) + R_2) | ((if R_1 accepts the empty string, then the language which accepts only the empty string, otherwise, the language that does not accept any string) + (\partial_c R_2))

etc.

... I think.
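
Those rules translate almost directly into code. A toy Python sketch of derivative-based matching, for regexes built from literals, alternation, concatenation, and star (the tuple representation is ad hoc, and there's no simplification step, so real implementations do more):

    # Regexes as tuples: ("empty",) never matches, ("eps",) matches "",
    # ("chr", c), ("alt", r, s), ("cat", r, s), ("star", r).
    EMPTY, EPS = ("empty",), ("eps",)

    def nullable(r):
        """Does r accept the empty string?"""
        tag = r[0]
        if tag in ("eps", "star"):
            return True
        if tag == "alt":
            return nullable(r[1]) or nullable(r[2])
        if tag == "cat":
            return nullable(r[1]) and nullable(r[2])
        return False                    # "empty" and "chr"

    def deriv(r, c):
        """The derivative of r with respect to the character c."""
        tag = r[0]
        if tag == "chr":
            return EPS if r[1] == c else EMPTY
        if tag == "alt":
            return ("alt", deriv(r[1], c), deriv(r[2], c))
        if tag == "cat":
            left = ("cat", deriv(r[1], c), r[2])
            return ("alt", left, deriv(r[2], c)) if nullable(r[1]) else left
        if tag == "star":
            return ("cat", deriv(r[1], c), r)
        return EMPTY                    # "empty" and "eps"

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    ab_star = ("star", ("cat", ("chr", "a"), ("chr", "b")))   # (ab)*
    print(matches(ab_star, "abab"), matches(ab_star, "aba"))  # True False

Matching is then just: take the derivative once per input character, and accept if the final regex is nullable.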


I enjoyed learning that Julia's parser is written in Scheme.


This comes from the history of how the Julia language was bootstrapped. The fact that Jeff happened to have written his own Scheme (femtolisp) as a previous project probably helped :-)

Actually it's not just the parser but most of the compiler frontend which is written in femtolisp. It would be nice to make the frontend more accessible by replacing it with Julia code at some stage. Bootstrapping is tricky though until someone gets separate compilation working.


Femtolisp is, by the way, a brilliant and performant implementation of Scheme. I wish it were used outside of Julia.


If you’re interested in this stuff, CodeMirror is a fascinating exploration into these grammars. Particularly CodeMirror 6 and the Lezer parser.


Perl uses yacc, but with extreme lexer tricks to be dynamic.

All lisps use handwritten parsers, in under 100 lines.


Now I see why postgresql syntax errors are so unhelpful


Compilers have parsers, not languages. Many languages have more than one compiler (e.g., C), and the various compilers may use different techniques for parsing.


That's right! But HN has a title length limit. See the (actual) blog post title and the blog post contents. :)


Indeed. But consider the 2nd sentence, the conclusion, and the twitter post.

On the other hand, I certainly agree with your idea about teaching/using handwritten recursive-descent parsers. Here's an old book that presents it pretty clearly, along with a nice approach for error handling. Maybe you can find it in a library. https://www.springer.com/gp/book/9783540082408


To be even more pedantic: implementations have parsers. Some programming languages are not compiled.


Nobody was mistaken about this fact.


PromQL uses yacc too


Sorry for a shameless plug! I read up a lot on this subject while working on [my own toy compiler](https://news.ycombinator.com/item?id=28262149).

Other well-known languages that have hand-written parsers are Rust and D.

And here are a couple quotes from compiler developers explaining their reasons for going with hand-crafted parsers.

Someone from the C# compiler’s team gave the following reasons [1]:

>Hello, I work on the C# compiler and we use a handwritten recursive-descent parser. Here are a few of the more important reasons for doing so:

>Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.

>Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, by using recursive-descent, you can get information from further "up" the tree to make it more relevant to the context in which the error occurred.

>Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code ` var x = velocity.` is invalid C#. However, in order to give autocomplete on `velocity. `, that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.

GCC actually used Bison for a long time but eventually switched to a hand-written parser, the team gives some reasons in the changelog [2]

>A hand-written recursive-descent C++ parser has replaced the YACC-derived C++ parser from previous GCC releases. The new parser contains much improved infrastructure needed for better parsing of C++ source codes, handling of extensions, and clean separation (where possible) between proper semantics analysis and parsing.

Some people argue coding parsers by hand is prone to errors. This line of reasoning certainly makes a lot of sense to me, but it's not that simple in practice for non-trivial grammars. For instance, ANTLR 4 emits a so-called parse tree which you would like to convert to an AST if you have a somewhat complex grammar. This conversion takes quite a bit of code written manually, see AstBuilder.java [3] to get an idea (I'm not affiliated with the project). So there are still plenty of opportunities for errors in that amount of code. To be fair, creating ASTs is an explicit non-goal for ANTLR 4 [4].

[1]: https://news.ycombinator.com/item?id=13915150

[2]: https://gcc.gnu.org/gcc-3.4/changes.html

[3]: https://github.com/crate/crate/blob/5173b655a9fbf72028876ae7...

[4]: https://theantlrguy.atlassian.net/wiki/spaces/~admin/blog/20...


Run don’t walk away from ANTLR. Unless you are an intellectual masochist. I maintain a high performance Haskell like compiler for a living. Somebody added ANTLR to another tool I am now maintaining as well. It’s a maintenance nightmare.



