This paper, like some others that I've seen, treats parsing as an academic/algorithmic topic: given some token stream, we'd like to minimize some error criterion. However, these papers ignore the fact that people don't write token streams: they write code that is formatted using whitespace. I don't know any programmer who writes their source code on a single line. On the contrary, pretty much every programmer formats their source code to some reasonable (if not complete) degree. In other words: parsing error recovery is not primarily an algorithmic problem: it's a usability problem.
Newlines and indentation are a source of extra information that we can use to infer what the programmer meant. Why throw all that away during tokenization? That's crazy! We can totally use it for error recovery/error message generation.
I decided to play around with this idea in the design of my programming language Muon [1]. The language uses redundant significant whitespace for parser error recovery. This simple approach by itself has turned out to work surprisingly well! In my admittedly biased opinion, it frequently surpasses the error recovery quality of mature compilers such as the C# compiler, which has had tons of effort poured into error recovery.
Of course, a purely whitespace-based recovery scheme is not perfect: there are rough edges, like having to deal with tabs vs. spaces, and recovering inside a line is usually not possible. But the fact that such a simple approach has led to good results makes me think this would be a great area for future research, one that can perhaps combine the best of both worlds.
[1] https://github.com/nickmqb/muon
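To make the idea concrete, here's a toy sketch of line-based recovery (all names are invented; this is not Muon's actual implementation): on a syntax error, skip to the next non-empty line whose indentation is at or below the erroneous line's, and resume parsing there, treating the indented lines in between as part of the broken construct.

    fn indent_of(line: &str) -> usize {
        line.chars().take_while(|c| *c == ' ').count()
    }

    // Given the line index where a parse error occurred, return the index
    // of the line where parsing should resume: the next non-empty line
    // whose indentation is <= the erroneous line's, i.e. the start of a
    // sibling or parent construct.
    fn recovery_point(lines: &[&str], error_line: usize) -> usize {
        let level = indent_of(lines[error_line]);
        lines
            .iter()
            .enumerate()
            .skip(error_line + 1)
            .find(|(_, l)| !l.trim().is_empty() && indent_of(l) <= level)
            .map(|(i, _)| i)
            .unwrap_or(lines.len())
    }

    fn main() {
        let src = [
            "fn demo()",
            "    x := foo(1, 2", // error: unclosed call
            "        + 3",       // deeper indent: skipped as part of the error
            "    y := 4",        // same indent: resume parsing here
        ];
        assert_eq!(recovery_point(&src, 1), 3);
    }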
I think you hit the nail right on the head, but I would say that whitespace is only one source of signal. Lately I've become convinced that languages need redundancy in their grammar to help with error recovery. Having a more flexible syntax actually hurts usability, because valid but semantically incorrect code can't be caught early enough, which makes it much harder to give friendly errors. rustc uses whitespace for error recovery in several places, although the most prevalent is checking whether two different elements are on different lines, like when detecting a `:` where a statement-ending `;` was actually meant [1].
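A hedged sketch of that kind of line-based check (the names and shape here are invented, not rustc's actual code): the hint is only offered when the next token sits on a different line, since that pattern suggests the `:` was a typo for a statement-ending `;` rather than, say, a type ascription.

    #[derive(Debug, Clone, Copy)]
    struct Span { line: usize, col: usize }

    struct Token { kind: char, span: Span }

    // Suggest `;` only if the token after the `:` is on a later line.
    fn suggest_semicolon(colon: &Token, next: &Token) -> Option<String> {
        if colon.kind == ':' && next.span.line > colon.span.line {
            Some(format!(
                "{}:{}: expected `;`, found `:` -- did you mean `;`?",
                colon.span.line, colon.span.col
            ))
        } else {
            None
        }
    }

    fn main() {
        let colon = Token { kind: ':', span: Span { line: 3, col: 12 } };
        let next = Token { kind: 'l', span: Span { line: 4, col: 5 } };
        println!("{:?}", suggest_semicolon(&colon, &next));
    }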
IMO extra redundancy makes Greek much easier to read than Latin. The upfront learning is a bit more effort, but then reading it is much easier. Unlike Latin it has articles (which are declined and agree with the nouns/adjectives), and the endings on verbs & nouns have way less ambiguity. I can certainly see how the same would apply to programming languages.
Somewhat related is my belief that text-based languages have long since outlived their usefulness. With text-based languages, the compiler creates an abstract syntax tree and a bunch of metadata and then throws it all away afterwards.
Oh sure you can save an object file if and ONLY if it didn't change and it's cached locally.
In a better system, if you edit a file and save it, 90% of the compiler's work would already be done and saved as part of the file format. Relevant to this topic: the parser would usually have access to the previously unbroken file.
In my experience parsing is pretty robust and fast already, to the point where optimizing that stage of the compiler isn't worth it. Regarding the discrepancy between the AST and the source code, it is worth noting that although there's a relationship between the syntax and the semantics, there need not be a 1:1 mapping between them: different features are allowed to look similar (because they are pedagogically related, like `impl Trait` in Rust in argument position and return position) or dissimilar. I could see leveraging ASTs for more accurate diffing algorithms and fancy tools, but none of these would require that the AST be the source of truth.
I mean it sounds like you have an issue with batch compilers more than "text based languages." If you didn't toss out the AST you'd have to serialize it (now you're compiling to two targets at once), and when you want to recompile you need to parse the serialized AST along with the source code for 90% of the same information.
Yes, modern query-oriented compilers (such as TypeScript and C# Roslyn) keep the AST in memory and co-operate closely with the editor via the language server protocol.
I recently saw an example of a language that aims to be AST-first rather than notation-first: it serialises to XML! http://mbeddr.com/
In a query-oriented compiler, you can give it a source location and ask for type information, or for the definition of the symbol at that location, and the compiler will do just enough work to give you the answer, using laziness and memoization to make it reasonably efficient. As far as I know, LISP systems don't work that way: they parse and evaluate (and maybe compile) as soon as an expression is entered, without the kind of laziness you see in a query-oriented compiler.
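A toy sketch of the query style (an invented API, not TypeScript's or Roslyn's): each query is computed on demand and memoized, so asking for the type at one location only does the work that particular answer depends on.

    use std::collections::HashMap;

    struct Compiler {
        source: String,
        type_cache: HashMap<usize, String>, // byte offset -> inferred type
    }

    impl Compiler {
        fn new(source: &str) -> Self {
            Compiler { source: source.to_string(), type_cache: HashMap::new() }
        }

        // Query: "what is the type of the thing at this byte offset?"
        // Computed lazily and memoized; nothing else is analysed.
        fn type_at(&mut self, offset: usize) -> String {
            if let Some(t) = self.type_cache.get(&offset) {
                return t.clone();
            }
            // Stand-in for real inference: just classify the character.
            let t = match self.source.as_bytes().get(offset) {
                Some(b) if b.is_ascii_digit() => "int".to_string(),
                Some(_) => "unknown".to_string(),
                None => "<out of range>".to_string(),
            };
            self.type_cache.insert(offset, t.clone());
            t
        }
    }

    fn main() {
        let mut c = Compiler::new("let x = 42;");
        println!("{}", c.type_at(8)); // "int", computed once then cached
    }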
An even bigger flaw with most academic research into parser error recovery is that the vast majority of syntax errors occur from modifying a valid program to produce an invalid program, but the recovery algorithms are oblivious to this.
tree-sitter is excellent stuff! It's heavily inspired by Tim Wagner's PhD thesis (original site seems to be down, but https://web.archive.org/web/20150919164029/https://www.cs.be... works). IMHO more people should know about that work, and the sequence of work from Susan Graham's lab that led up to it. We have also been heavily inspired by Tim's work and Lukas's thesis extends and updates a number of aspects of that seminal work including, in Chapter 3, error recovery (https://diekmann.co.uk/diekmann_phd.pdf).
All that said, it's surprisingly difficult to compare error recovery in an online parser (i.e. one that's parsing as you type) to a batch parser. In the worst case (e.g. load a file with a syntax error in), online parsers have exactly the same problems as a batch parser; however, once they've built up sufficient context they have different, sometimes more powerful, options available to them (but they also need to be cautious about rewriting the tree too much as that baffles users).
Approaching this from the opposite side, language designers should also take into account how code with slight mistakes (typos and confused use of features) could be detected. Sometimes adding small things to the grammar can pay a lot of dividends when writing the production-ready compiler. Alternatively, when adding a new feature, they should be asking: "what common mistakes will this syntax invite, and will we produce worse errors if we introduce it this way?"
Which suggests that parsing should be done while editing, supporting many other refactoring tools as well. The key feature enabling this facility is incremental parsing.
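A tiny line-granularity sketch of that (a toy; real incremental parsers such as tree-sitter work on the tree rather than on lines): cache each line's parse result and reparse only the lines whose text actually changed.

    use std::collections::HashMap;

    fn parse_line(text: &str) -> String {
        format!("ast({})", text.trim()) // stand-in for real parsing
    }

    struct Incremental {
        cache: HashMap<usize, (String, String)>, // line -> (text, parse result)
    }

    impl Incremental {
        // Reparse the buffer, reusing cached results for unchanged lines.
        // Returns how many lines actually had to be reparsed.
        fn reparse(&mut self, lines: &[&str]) -> usize {
            let mut reparsed = 0;
            for (i, text) in lines.iter().enumerate() {
                match self.cache.get(&i) {
                    Some((old, _)) if old == *text => {} // unchanged: reuse
                    _ => {
                        self.cache.insert(i, (text.to_string(), parse_line(text)));
                        reparsed += 1;
                    }
                }
            }
            reparsed
        }
    }

    fn main() {
        let mut inc = Incremental { cache: HashMap::new() };
        assert_eq!(inc.reparse(&["a = 1", "b = 2"]), 2); // cold: parse both
        assert_eq!(inc.reparse(&["a = 1", "b = 3"]), 1); // edit: only line 2
    }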
Pseudo-regular parsing combined with hand-written Pratt parsers is a nice sweet spot: easy to read, very robust, and it produces great error messages. The technique was, I am told, moderately popular in the 80s and 90s. (My familiarity with it comes from a language called Spad.)
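For readers who haven't seen one, here's a minimal Pratt (precedence-climbing) expression parser, sketched from the standard technique rather than from Spad. It parses single digits, `+`, and `*` with the usual precedence.

    #[derive(Debug)]
    enum Expr {
        Num(i64),
        Bin(char, Box<Expr>, Box<Expr>),
    }

    struct Parser { tokens: Vec<char>, pos: usize }

    impl Parser {
        fn peek(&self) -> Option<char> { self.tokens.get(self.pos).copied() }
        fn next(&mut self) -> Option<char> { let t = self.peek(); self.pos += 1; t }

        // Binding power decides how tightly an operator grabs its operands.
        fn bp(op: char) -> Option<u8> {
            match op { '+' => Some(1), '*' => Some(2), _ => None }
        }

        fn parse_expr(&mut self, min_bp: u8) -> Expr {
            // Prefix position: a single digit is the only atom in this toy.
            let mut lhs = match self.next() {
                Some(d) if d.is_ascii_digit() => Expr::Num(d as i64 - '0' as i64),
                t => panic!("expected number, found {:?}", t),
            };
            // Infix loop: keep consuming operators that bind at least as
            // tightly as our caller allows.
            while let Some(op) = self.peek() {
                let Some(bp) = Self::bp(op) else { break };
                if bp < min_bp { break; }
                self.next();
                let rhs = self.parse_expr(bp + 1); // left-associative
                lhs = Expr::Bin(op, Box::new(lhs), Box::new(rhs));
            }
            lhs
        }
    }

    fn main() {
        let mut p = Parser { tokens: "1+2*3".chars().collect(), pos: 0 };
        println!("{:?}", p.parse_expr(0)); // Bin('+', Num(1), Bin('*', Num(2), Num(3)))
    }

The nice property for error messages is that at any point the parser knows exactly what it expects next (an atom or an operator), so the panic above can be replaced with a precise, contextual diagnostic.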
It works as long as people actually care about the formatting of their code. I've seen far too many cases where people just don't care or care about the wrong things when it comes to formatting. Like they copy/cut & paste some code and now the indentation is off by a space or a tab, but they don't care. There's also the space vs tabs rabbit hole.
I sort-of do write source code in one long line: I use an autoformatter so when I want to insert something, I usually just write it in one long line and hit save. If I have a syntax error the line won’t get formatted and then I have to stare at it.
That said, I think whitespace-based error recovery is a good idea and would probably still help in my case.
Project looks great! I see on your README you're working on an LLVM backend. I'm doing the same for an unrelated language, would love to contribute to Muon if you're looking for extra hands.
Tip: Don't ask visitors to your repo to follow you on Twitter. Some of us don't use Twitter and find this advice rather annoying. Start a blog or something.
I hope to study this paper well (which is by the author of the above post) and understand what their algorithm is. I feel that good error messages make or break a language.
In theory it's "the solved problem that isn't", in practice it is --- there's a reason the compilers for the most popular programming languages either moved to or have always been recursive-descent with perhaps a bit of operator-precedence/precedence-climbing to handle the repetitive cases of binary operators. GCC, Clang/LLVM, Java, and C# are the ones that immediately come to mind; and no doubt there are plenty of others.
There are way more parsers out there in production than the top-ten programming language implementations. Just think about serialization formats (JSON, BSON, protobuf, et al.): there are maybe 10 or so widely used. Then think about message formats (HTTP, XML, YAML, ...): maybe another 10 or so? Then think about configuration languages (.ini, .toml, ...): let's call that 5. On top of that we have a bunch of widely used scripting languages (Make, sh, various Lisps); I would say again 5 or so.
So for any new language, say Rust, you (ideally) want parsers for 30 or more of these formats for real-world use cases. And that does not include language prototyping, or the fact that recursive-descent parsers are usually not available as libraries (say, for use in an editor).
In conclusion: having good automatic parser generators readily available is very relevant. And "good" can obviously include usability, IMO.
I've thought about doing something like this when using matching/parsing with derivatives[1]. When the graph collapses to ∅ (indicating an error), you could back up one character, then ask the graph "what are the next valid token(s)?" This would give you a list of possible things to insert (`,` and `=` for the example given in the article, though naively implementing this would also suggest stuff like `;`).
But this article takes it a step further and suggests deletions, too. That's actually really cool. My biggest gripe with people making DSLs is the crappy error messages you get when you have a syntax error (GCL, I'm looking at you). Hopefully research like this will improve the lives of those of us that have to use DSLs.
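A toy illustration of the "ask what's valid next" idea on the article's `int x y;` example (a tiny hand-rolled state machine stands in for the derivative graph; nothing here is from a real derivatives library). Note how the naive answer includes `;`, exactly as mentioned above.

    use std::collections::HashMap;

    // A tiny recogniser for `int ID ((',' | '=') ID)* ';'` as a state machine.
    // States: 0 = expect "int", 1 = expect ID, 2 = expect ',', '=' or ';',
    // 3 = expect ID after ',' or '=', 4 = accepted.
    fn transitions(state: u8) -> HashMap<&'static str, u8> {
        let mut t = HashMap::new();
        match state {
            0 => { t.insert("int", 1); }
            1 => { t.insert("ID", 2); }
            2 => { t.insert(",", 3); t.insert("=", 3); t.insert(";", 4); }
            3 => { t.insert("ID", 2); }
            _ => {}
        }
        t
    }

    fn main() {
        // "int x y ;" fails at `y`; run the prefix, then ask the stuck
        // state which tokens it would have accepted.
        let mut state = 0;
        for tok in ["int", "x"] {
            let tok_class = if tok == "int" { "int" } else { "ID" };
            state = transitions(state)[tok_class];
        }
        // `y` (an ID) has no transition from state 2, so report the keys:
        let mut valid: Vec<&str> = transitions(state).keys().copied().collect();
        valid.sort();
        println!("expected one of {:?} before `y`", valid); // [",", ";", "="]
    }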
Most grammars define a very specific starting point for languages. This is fine of course, but what if we want to parse, say, only an expression, only a function body, etc?
I feel most parsing APIs tend to expose a single method, something like:
parser.parse(source)
... which is kind of a limited API to do the things I ask about above.
With a more flexible API, a structure-aware editor [1] could keep different areas of the buffer separated, and know if the user is typing a function body, a variable definition, etc., and only parse that part without breaking the rest of the parser's progress (is this a good idea? I think so, but I don't think most editors work like this, so maybe not :-)
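For example, something with this shape (all names invented): each interesting nonterminal gets its own entry point, so an editor can reparse just the region under the cursor.

    #[derive(Debug)]
    enum Node { Expr(String), FnBody(Vec<String>) }

    struct Parser;

    impl Parser {
        // Entry point for a single expression, usable on a sub-span.
        fn parse_expression(&self, src: &str) -> Node {
            Node::Expr(src.trim().to_string()) // stand-in for real parsing
        }

        // Entry point for a function body only.
        fn parse_function_body(&self, src: &str) -> Node {
            Node::FnBody(src.lines().map(|l| l.trim().to_string()).collect())
        }
    }

    fn main() {
        let p = Parser;
        println!("{:?}", p.parse_expression("1 + 2"));
        // An editor that knows the cursor is inside a function body can
        // reparse just that region without touching the rest of the file:
        println!("{:?}", p.parse_function_body("x = 1\ny = 2"));
    }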
--
An area of parser APIs that I've recently seen trouble with is lack of metadata on the productions. I'd like to be able to do something like:
production.meta() => { :line 23, :column 10 }
... or something like that. This is very useful while error reporting, for instance. I was trying to find a parser for RDF Turtle that would do this, but couldn't find any!
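Something like this is what I have in mind (an invented API; the line/column values just mirror the example above): every production carries the source position it was parsed from, so error reporting can point back at the input.

    #[derive(Debug, Clone, Copy)]
    struct Meta { line: usize, column: usize }

    #[derive(Debug)]
    struct Production { name: String, meta: Meta }

    impl Production {
        // Mirrors the hypothetical `production.meta()` call above.
        fn meta(&self) -> Meta { self.meta }
    }

    fn main() {
        let p = Production {
            name: "predicateObjectList".to_string(), // hypothetical rule name
            meta: Meta { line: 23, column: 10 },
        };
        let m = p.meta();
        println!("error in {} at line {}, column {}", p.name, m.line, m.column);
    }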
In the first figure, it gives "int x y;" and says that the new algorithm gives the complete set of minimal cost repair sequences, which is in this case "Delete y", "Insert ,", and "Insert =".
I haven't read the article, so maybe this is covered later, but why isn't "Delete x" included?
Trying to repair bad syntax by editing the token stream is decades obsolete. It was an important technique when compiling had long turnaround times, which included having to stand in line to submit a deck of punched cards to a clerk at a job submission window. Naturally, programmers wanted to catch as many errors as possible in a single round trip.
That said, it could be useful again, for an IDE to get the most amount of information (for completion and whatnot) even in the face of a bit of bad syntax.
The technique opens up the possibility of the compiler getting into a bit of a loop, generating a large quantity of error messages, most of which result from its repair attempts. If any of the recovery strategies insert tokens, there is a potential for an infinite loop if the code isn't careful.
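A minimal sketch of the kind of guard that avoids this (invented, not from the paper): cap how many tokens recovery may insert without the parser consuming any real input, and fall back to skipping input once the budget is exhausted.

    struct Recovery { inserted_since_progress: u32 }

    const MAX_INSERTIONS: u32 = 3;

    impl Recovery {
        // Ask permission to insert a virtual token. Denied once too many
        // insertions happen without consuming real input.
        fn try_insert(&mut self) -> bool {
            if self.inserted_since_progress >= MAX_INSERTIONS {
                return false; // give up and skip input instead
            }
            self.inserted_since_progress += 1;
            true
        }

        // Called whenever a real input token is consumed.
        fn consumed_real_token(&mut self) {
            self.inserted_since_progress = 0;
        }
    }

    fn main() {
        let mut r = Recovery { inserted_since_progress: 0 };
        assert!(r.try_insert());
        assert!(r.try_insert());
        assert!(r.try_insert());
        assert!(!r.try_insert()); // fourth insertion without progress: refused
        r.consumed_real_token();
        assert!(r.try_insert()); // real progress resets the budget
    }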
In the 1970's, Creative Computing magazine held a contest for producing the longest ream of error messages from a single error in a program, or something like that. That would have been fueled by error recovery.
> A more grammar-specific variation of this idea is to skip input until a pre-determined synchronisation token (e.g. ‘;’ in Java) is reached [8 , p. 3], or to try inserting a single synchronisation token. Such strategies are often unsuccessful, leading to a cascade of spurious syntax errors (see Figure 1 for an example). Programmers quickly learn that only the location of the first error in a file – not the reported repair, nor the location of subsequent errors – can be relied upon to be accurate.
C.java:2: error: ';' expected
int x y;
     ^
C.java:2: error: <identifier> expected
int x y;
       ^
This is a common problem, but the solution here, for the example shown, is for the parser not to attempt to recover that statement and to ignore everything until it reaches the "synchronization point" (I call them "landmarks"), which sounds like what they were arguing for at first. Doing that would mark the AST with a node along the lines of "some int named x with errors", and would completely skip everything between the `x` and the `;`. Because an error has been reported there will be no output binary, but the second error will not be emitted.
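Roughly like this (a toy sketch with invented types, not rustc's actual recovery): on error, skip tokens until the sync token `;` and record an error node so later passes know the declaration is incomplete.

    #[derive(Debug)]
    enum Decl {
        Var { name: String },
        // "some int named x with errors": name recovered, rest skipped.
        Recovered { name: String, skipped: Vec<String> },
    }

    fn parse_decl(tokens: &[&str]) -> Decl {
        // Expect: "int" name ";"
        let name = tokens[1].to_string();
        if tokens.get(2) == Some(&";") {
            return Decl::Var { name };
        }
        // Error: skip everything up to the synchronisation token `;`.
        let skipped = tokens[2..]
            .iter()
            .take_while(|t| **t != ";")
            .map(|t| t.to_string())
            .collect();
        Decl::Recovered { name, skipped }
    }

    fn main() {
        println!("{:?}", parse_decl(&["int", "x", ";"]));      // ok: Var
        // `int x y;` from the quoted example: one error node is produced,
        // one message is reported, and no spurious second error follows.
        println!("{:?}", parse_decl(&["int", "x", "y", ";"]));
    }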
Beyond the parser, the AST needs to keep around a lot of extra information for error recovery that would otherwise not be needed. An example from rustc [1] (what I'm familiar with) is structs and struct variants that have had some parse error. We recover the parse and keep a node for the variant in the AST signaling the existence of a parse error, which means that later passes of the compiler can type check and borrow check; but if such a pass finds a use that is either missing or adding fields, we do not emit an error for it [2][3], under the assumption that the parse error caused the field being used to not be captured in the AST node for the variant. This can have quite a dramatic impact on the quantity of errors being emitted.
I think we are in agreement that skipping over large chunks of input is rarely a good idea. In Section 7 ("Using error recovery in practice") we show how you can make fine-grained decisions about what to do when recovery has happened in our Rust parsing system. The more fine-grained you go, the more code you have to write, but it allows you to do exactly the sorts of things that rustc does with structs if you want. https://softdevteam.github.io/grmtools/master/book/errorreco... is a more approachable version of the same stuff.