> However, in practice it turns out that LALR(1) isn’t always the answer. For instance, both GCC, Rust and Go have chosen to use handwritten parsers that are not based on a declarative grammar file. I find this disappointing: We have decades of experience with specifying languages in a declarative format, and apparently it’s still easier to manually write parsers. Obviously there’s something lacking with the algorithms we have today.
Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Consider the example of missing parenthesis. I can specify a grammar with the following:
EXPR =
...
| '(' EXPR ')'
The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.
It is possible to modify the grammar to look for this case specifically:
But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.
So, many production compilers end-up with custom parsers to report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know what a mismatched parenthesis should be reported as.
I completely agree: the poor quality of error reporting in parsing tools is a significant reason not to use them. When I started looking into this a couple of years back, what surprised me is how much work there's been in this spread over decades.
Building upon that (and obvious bias alert!) led us to come up with a couple of new algorithms. The current paper draft is https://arxiv.org/abs/1804.07133 (a future version of that paper will probably chop out MF which, in retrospect, adds too much complexity over CPCT+ for the relatively small gains). The Rust library (https://crates.io/crates/lrpar) it's implemented in is documented at https://softdevteam.github.io/grmtools/master/book/errorreco... which is probably a friendlier starting point. Summary: one can do a surprisingly decent job of error recovery for very little work.
Error recovery and reporting are for sure the biggest challenge of a production parser. If you work out of band, you can play a lot of tricks in your parser to make it much more effective in handling errors. For example, braces can be matched in many languages without parsing any other constructs in the language, meaning these errors can all be reported and recovered from independent of the rest of the language. Then again, the overloading of < and > as operators and braces in many languages defeats that a bit :)
Hand coding a parser is usually worth it for a moderately popular production language. Parser generators really shine for smaller language efforts that can’t afford the overhead and don’t really care about the niceties of decent error reporting.
I don't think it's just error handling. A nice, human readable form has significant chance of being ambiguous - you can remove the ambiguity with a variety of transforms, putting the code into Greibach Normal Form, and this resolves the ambiguities. ... and translates pretty directly to a hand-written recursive descent parser (which you can generate also). But the thing is that here the code essentially serves as a lower level of abstraction, teasing out what you really mean. Which is to say there's nothing wrong with the code being the final spec, that way you don't have to keep the code in sync with the spec.
For a mainstream programming languages, complicated ambiguities are hard on users anyways, by writing your parser by hand you have an extra incentive to keep the grammar simple (eg Scala). A lot of the power a parser generator gives you shouldn’t be used.
Remember, complicated ambiguities are different from complicated structures.
The Ruby programming language is a poster-child for potential ambiguities that seem simple - allowing conditional before and after for example. Essentially, all the programming languages that are "human like" wind-up like this, with the sort of ambiguities that people are comfortable with in natural language. This has a cost in terms of exact expression but a lot of users think of this as being "easy on them".
Oppositely, languages without grammatical ambiguities often seem irritatingly verbose to use - new users dislike lisp's parenthesis proliferation and personally find Scala irritatingly verbose.
The complexity of compiler-writing approaches as limiting factor to language complexity probably depending what someone is familar with. Some people can spit out annotated YACC grammar pretty easily whereas my head swimming looking at the stuff. I can produce a recursive descent parser from a grammar pretty easily however.
> Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Yes! I totally agree. Error reporting was one of the things I meant when I said that there is somethings "lacking with the algorithms we have today".
> The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.
I'm not sure if this is the best example. Ruby is using Bison for parsing and it's certainly able to detect a missing parenthesis:
Most parser algorithms have a concept for "currently expected characters/tokens" and can easily report this on an error.
I also think that this will be more precise in Glush than in e.g. LR-based algorithms because Glush is following the grammar strictly top-down. It knows exactly which characters are allowed at any given moment.
I have however not made an effort to make nice error messages so I don't know how well automatic error reporting would work and I don't feel comfortable making any concrete claims or promises.
> It is possible to modify the grammar to look for this case specifically:
Not sure if I see much value in manually adding such a case. This seems like a case that would be easily handled automatically (again: see Ruby above).
> But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.
Well, if you want the equivalent error reporting in a hand-written parser you also need to add an equivalent amount of code. There are probably ways you can annotate a grammar to improve error reporting and I think it's a far more interesting solution to investigate. I haven't experimented with this yet, but I'd love to try to tackle it if I can find the time.
> But now I've just polluted my once-pristine grammar file
What's bollixed up your grammar file are the users who absolutely insist on not writing syntactically perfect programs every time.
Your 2nd example is still cleaner than writing a hand parser that does the same thing, surely? A slightly mucky DSL has got to be cleaner, clearer, shorter and so more maintainable than a handwritten equivalent, no?
What would you want the grammar language to look like that handled errors 'properly'?
There is also the friction added by throwing in another language and subsystem, potentially with some fancy build system requirements for integrating code generation....
Sounds promising! And a great exposition of the background and development.
About the Glushkov construction for regular expressions, here's some Python code that may perhaps help to explain it, since the Wikipedia page is hard to follow (at least for me) and the Mastodon thread linked in the post admitted to some confusion about the "Play on Regular Expressions" paper (which I haven't read): https://github.com/darius/sketchbook/blob/master/regex/nfa_p...
Sorry it's not pedagogically great either; I was rederiving the method for myself rather than implementing from textbooks. What I called 'follows' in the comment should correspond to 'pair set' in the post; the other fields have the same names.
Incidentally I think the post a bit overstates the problems with PEGs (you shouldn't need left recursion anywhere a hand-written recursive-descent parser would also work; PEGs were meant to streamline and formalize recursive descent, and this failure to model the practice can be fixed without big changes) -- but CFGs are great too and I'm looking forward to studying Glush. I wish I'd thought of exploring the generalization to context-free grammars.
That sounds really cool, and I hope it will find success. The field of parsing truly needs some refreshment, especially in usability.
I do have a few questions:
1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
Author of the article here. Thanks for your comments!
> 1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
Not quite yet. There is a lot more work needed and writing this article was one step in right direction. I was hoping I was able to get even closer to
> 2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
This is yet to be determined. A few comments:
- I'm not 100% concerned about performance because as mentioned in the article, I don't plan to use this in a place where performance is critical.
- I'm not too worried about the performance since I'm not doing too complicated data structures (it's all hash maps), and most intermediate data structures are small in size and are probably best implemented linearly (e.g. plain array) for practical grammars with few ambiguities.
- It's based on NFA-style "advance one state at a time" which actually is not good for performance. This means that parsing a keyword such as "function" actually passed through 8 different states instead of just looking ahead and jumping straight ahead. I think taking advantage of these look ahead will probably help a lot
> 3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
A single output from the parser is conceptually a flat list of _marks_ (think like a reverse RPN):
// this input:
1 + 2 * 3 + 4
// with this precedence:
(1 + (2 * 3)) + 4
// could generate this flat list of marks:
add, add, 1, mul, 2, 3, 4
The total output from the parser is a "tree of marks".
Thanks for sharing. Currently I'm looking for an elegant language for querying Json structures. In addition to JsonPath, JMESPath, Glush & GROQ is definitely worth a try
Worth noting jq is not quite like what GP has tried, in that it seems to be designed as a command-line utility first with only incidental concessions to allow itself to be linked as a library. Its C API is poorly documented and essentially defined by the single implementation. The language itself is immensely powerful and not really sandboxable.
It's a nice language to work in but definitely not one I'd want to expose as a user interface.
> By "not really sandboxable", what powerful features of jq do you mean?
On close inspection I can't find things as bad as I imagined, but there are still some unexpected barely-documented features like reading files from anywhere on disk[1] as long as they have the right extension, with much wider implications than one might expect at a glance. (Note the while the module path is relative and checked for path traversal, you can just set the search path. Undocumented is that the text needn't even parse as JSON because you can just set the raw flag.) In general, many language features are implemented by exposing a more powerful primitive than necessary to the language and documenting a less powerful wrapper but not the primitive itself; the security implications go more or less unexamined.
There seems to be no particular provision for executing any less than enough to generate the next output, so if you want to set limits on execution you'll have to do that at the process level. This is something you can probably do better in a separate implementation, though.
> I'm not sure what's meant by "expose as a user interface". It's a pretty abstract DSL, so I would certainly only expose it to programmers.
Only trusted users should have access to jq. I would assume that anyone who has access to executing jq code can execute arbitrary code in the interpreter's context. I'm less sure that it's true right now than I was when I made the post above, but it still seems pretty likely.
I can recommend jtc as something nice to use for extracting stuff from deep json structures.
It should be quite feasible to integrate as a library if one so desires.
Skipping to "Glush, the algorithm" can be helpful if you already know the history of parsing.
TL;DR: Glush grammars consist of "rules" that are regexes + recursion of rule calls. Compiles to NFAs.
I'm not sure I see the point in declaring precedence levels inside rules -- maybe it's just a preference thing, but I like having operator precedences in a separate section of the grammar. Yacc does this. Megaparsec for Haskell does this.
I am tempted to classify this as "regex extended with recursion" rather than "an alternative to CFGs", since I cannot decipher from the article the exact expressive power.
No. It cannot, since NFAs cannot parse languages that are not regular. Rather, the algorithm executes like a pushdown automaton. I don't think it's correct to call this compilation.
> I cannot decipher from the article the exact expressive power.
To me it seems to suggest being able to parse (exactly) all context-free languages.
Nice article, this looks like a useful parser generator. Though, as others have mentioned, error reporting is a very important point that should be considered even at this early stage.
Here's something that surprised me though:
> a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby)
> [some other approach] it looked very interesting, but it’s not completely smooth. First of all, all of the implementations have been in highly expressive, garbage collected languages (Haskell, Scala, Racket).
This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
> This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
Aah. I should probably have clarified that. At Sanity.io today we only need JavaScript and Go, and those are the most imminent languages to support (and what I can actually spend time on during work). However, as an overall parser algorithm I consider it crucial to be able to easily target C as this would unlock a whole new world of possibilities.
Does that make it more clear?
> but it's still a strange point to spend almost an entire paragraph on
Yeah, this article isn't exactly short on details. The purpose of this post was a mixture between (1) arguing for the value of declarative-based grammars, (2) showing all of the fascinating algorithms that have been discovered, and (3) showing how Glush works.
I've had a great time playing with Parsing with Derivatives so even if I didn't end up directly using it, I feel like sharing cool ideas to new people.
Thanks for your response. You made your point more explicit, but I still think that being afraid of GC due to planning to support C at some point could be premature pessimization. Your milage obviously varies!
Also, yes, thank you for mentioning Parsing with Derivatives and all the other stuff, I really liked the overview you gave of the field.
In my experience, shift/reduce conflicts typically reflect real ambiguities, at least in your initial draft of a grammar. Reporting them at that stage is a good thing; you should be addressing them instead of glossing over them.
There are legitimate reasons to need more power than LALR(1) provides, but careless thinking is more common.
I would have really preferred a table of contents, either at the top or in a sidebar. It would have made it easier to skip over all the background materials that essentially rehashes an undergrad complier course.
Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Consider the example of missing parenthesis. I can specify a grammar with the following:
The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.It is possible to modify the grammar to look for this case specifically:
But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.So, many production compilers end-up with custom parsers to report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know what a mismatched parenthesis should be reported as.