This was a really great read! I wrote the tree-sitter grammar for the Unison programming language, and discovered I really like the pattern-matching work that writing tokenizers and parsers comes down to. It also gives you an in-depth understanding of how the language you're writing a parser for works, and how tooling works.
Like if you have an AST with the ability to map onto code that is displayed in your IDE, the algorithm for an IDE to refactor a variable name is to traverse up the AST until you get to the variable's declaration and then traverse all sibling trees, changing each matching name, but stopping a traversal whenever you encounter a new binding with the same name. Code folding is to identify the categories of node that are "foldable" and then you hide every child of that node. Etc. It's all tree traversal algorithms.
It gives you a deep appreciation for how powerful the tooling can be thanks to proper parsing.
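To make the rename traversal above concrete, here is a minimal sketch on a toy AST. The node types (`Var`, `Let`, `Call`, `Block`) and the `rename` helper are made up for illustration, not taken from tree-sitter or any real IDE, and it assumes you already walked up to the declaration and that a `let` is visible only to the statements after it in the same block.

```python
from dataclasses import dataclass, field


@dataclass
class Var:                 # a use of a name
    name: str


@dataclass
class Let:                 # a declaration, visible to later statements in its block
    name: str
    value: object          # initializer expression


@dataclass
class Call:                # a function call with argument expressions
    func: str
    args: list = field(default_factory=list)


@dataclass
class Block:               # a scope holding a list of statements
    body: list = field(default_factory=list)


def _rename_uses(node, old, new):
    """Rename uses of `old` inside one statement, expression, or nested block."""
    if isinstance(node, Var):
        if node.name == old:
            node.name = new
    elif isinstance(node, Call):
        for arg in node.args:
            _rename_uses(arg, old, new)
    elif isinstance(node, Let):
        # The initializer is evaluated before the new binding exists.
        _rename_uses(node.value, old, new)
    elif isinstance(node, Block):
        _rename_statements(node.body, old, new)


def _rename_statements(statements, old, new):
    """Walk sibling statements, stopping once `old` is rebound (shadowed)."""
    for stmt in statements:
        _rename_uses(stmt, old, new)
        if isinstance(stmt, Let) and stmt.name == old:
            break          # later siblings refer to the new binding


def rename(block, index, new):
    """Rename the declaration at block.body[index] and every use it governs."""
    decl = block.body[index]
    old = decl.name
    decl.name = new
    _rename_statements(block.body[index + 1:], old, new)


tree = Block([
    Let("x", Call("read")),
    Call("print", [Var("x")]),
    Block([
        Call("print", [Var("x")]),    # still the outer x
        Let("x", Var("x")),           # shadows x; initializer is the outer x
        Call("print", [Var("x")]),    # the inner x, must not be renamed
    ]),
    Call("print", [Var("x")]),
])
rename(tree, 0, "y")
```

After the call, every use of the outer `x` is `y`, while the shadowing `let x` in the nested block and the use after it are left alone. Code folding would be the same kind of walk with a different action at each node.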
> quadruple G=(N,Σ,P,S), where T is a finite set of nonterminal symbols, Σ a finite set of terminal symbols, P is a set of production rules and S is a start symbol.
I think "T" is supposed to be "N" in that sentence[1], based solely upon the further use of "N" nomenclature in subsequent paragraphs
1: he said, 5 years too late into a forum just discussing the article
I somewhat regularly stop to marvel that one of the greatest anarchist thinkers of our time is also responsible for foundational theories in linguistics that also correspond intimately with the foundational theories of computing. God bless Noam.
It's nice to review some of this theory after a week of coding my own interpreter. I have been studying compilers at pikuma.com the whole week, and reading this article after coding a parser is a great way of reviewing what I've implemented.
I have experience with compilers, creating languages from scratch, and handwritten parsers,
but language theory is always difficult to understand in its theoretical form.
I started with practice, so I think in terms of strings and operations on them (indices, substrings, loops, lookahead, etc.), and when I read this theory I struggle hard to understand it and to see why I'd want to use it.
So, going with code examples makes things easier for me.
I think this makes it sound a lot more difficult than it has to be, with the formal theory.
When it's really one of the simplest things if you divide it into parts: a tokenizer (string to list of tokens) and a parser on top. The tokenizer can usually be very simple: a loop with a large switch on the current character, where a choice is made on "what can this be", turning it into a formal token or an error. Then a simple recursive parser that can almost be a 1 to 1 copy of the (E)BNF.
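A minimal sketch of that split, for a toy arithmetic grammar I made up for illustration (the token names and the tuple-based tree are assumptions, not anything from the article). The EBNF is in the comments, and each rule becomes one method:

```python
DIGITS = "0123456789"


def tokenize(text):
    """A loop plus a 'switch' on the current character."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in DIGITS:
            start = i
            while i < len(text) and text[i] in DIGITS:
                i += 1
            tokens.append(("NUM", int(text[start:i])))
        elif ch in "+-*/()":
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at {i}")
    tokens.append(("EOF", None))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()}")
        return self.advance()

    # expr = term { ("+" | "-") term } ;
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()[0]
            node = (op, node, self.term())
        return node

    # term = factor { ("*" | "/") factor } ;
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()[0]
            node = (op, node, self.factor())
        return node

    # factor = NUM | "(" expr ")" ;
    def factor(self):
        if self.peek() == "NUM":
            return self.advance()[1]
        self.expect("(")
        node = self.expr()
        self.expect(")")
        return node


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    parser.expect("EOF")   # reject trailing garbage
    return tree
```

For example, `parse("1 + 2 * 3")` gives `("+", 1, ("*", 2, 3))`, and `parse("1 +")` raises a `SyntaxError` from `expect`, which is also the natural place to hang better error messages.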
I had exactly the same feeling as you after reading the article. And interestingly, many of the production parsers for major languages are hand-written recursive descent parsers.
On the other hand, if you inspect the actual code for these production parsers (even for newer languages like Swift, Scala, Kotlin, or Rust), the complexity and amount of code is still quite staggering.
Yes, that explains a lot of the complexity. Another reason is type checking/inferring and making the AST detailed enough for code analysis tools, such as code hinting in IDEs.
I believe the proper term for what I am describing is a recursive descent parser. With it, proper error handling and even recovery are quite doable. Some form of this is used in almost every production language I think.
It has been years since I've written a proper parser, but before that, every time I had to write one I tried the latest and greatest first: ANTLR, Coco/R, combinators. All the generated ones seemed to have a fatal flaw that hand writing didn't have. For example, good error handling seemed almost impossible, they were very slow due to infinite lookahead, or they were almost impossible to debug to find an error in the input schema.
In the end hand crafting seems to be faster and simpler. Ymmv.
My point about the article was mostly that all the formal theory is nice but all it does is scare away people, while parsing is probably the simplest thing about writing a compiler.
IMHO it gets even better when you can use regular expressions and write a 'modal' parser where each mode is responsible for a certain sub-grammar, like string literals. JavaScript added the sticky flag (y) to make this even simpler.
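A rough sketch of what I mean by 'modal', written in Python rather than JavaScript: a compiled pattern's `match(text, pos)` only matches starting exactly at `pos`, which plays the same role as the sticky flag here. The mode names, token kinds and the tiny rule table are invented for the example.

```python
import re

# Each mode maps to a list of (pattern, token kind, mode to switch to).
# A kind of None means "match and skip" (whitespace).
MODES = {
    "code": [
        (re.compile(r"\s+"), None, "code"),
        (re.compile(r"[A-Za-z_]\w*"), "IDENT", "code"),
        (re.compile(r"[0-9]+"), "NUM", "code"),
        (re.compile(r"[=+(){}]"), "PUNCT", "code"),
        (re.compile(r'"'), "STR_START", "string"),
    ],
    "string": [
        (re.compile(r'[^"\\]+'), "STR_TEXT", "string"),
        (re.compile(r"\\."), "STR_ESCAPE", "string"),
        (re.compile(r'"'), "STR_END", "code"),
    ],
}


def lex(text):
    tokens = []
    mode = "code"
    pos = 0
    while pos < len(text):
        for pattern, kind, next_mode in MODES[mode]:
            m = pattern.match(text, pos)   # anchored at pos, like sticky matching
            if m:
                if kind is not None:
                    tokens.append((kind, m.group()))
                pos = m.end()
                mode = next_mode
                break
        else:
            raise SyntaxError(
                f"unexpected character {text[pos]!r} at {pos} in mode {mode!r}"
            )
    if mode != "code":
        raise SyntaxError("unterminated string literal")
    return tokens
```

The nice part is that the "string" mode only has to know about string contents, so a `{` inside a literal comes out as `STR_TEXT` and never as `PUNCT`.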
I couldn't locate the part where Pike addresses regexes in his 50-minute talk.
The second piece seems to be about someone complaining about a dysfunctional and untidy software situation where incompetence led to the incorrect application of greedy regexes, producing wrong results.
The third one is the most famous rant against attempts to parse a language with symmetric bracing (start tags that must match end tags) using a single regex in a language whose regexes don't support symmetric bracing, which is of course doomed to fail.
None of these provide any argument against lexing with sticky regexes. For one thing, the rant against regexes being unable to match bracing elements is only valid for regex engines that don't provide extensions for brace matching, but many languages and extensions do (e.g. https://stackoverflow.com/a/15303160/7568091).
However, this point is typically irrelevant because this is not about parsing, it's about lexing, but I realize this is my fault because in the above I wrote "write a 'modal' parser" where I should've written "write a 'modal' lexer".
In lexing you typically do not match braces, you just realize you've found a brace and emit an appropriate token. It's up to the downstream processing to see whether brace tokens are matching.
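As a toy illustration of that division of labour (both function names are hypothetical, and the "lexer" here deliberately ignores everything except braces, including string literals): the lexer just emits brace tokens with positions, and a separate downstream pass checks them with a stack.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def brace_tokens(text):
    """Lexer side: emit one (brace, position) token per brace, no matching."""
    return [(ch, i) for i, ch in enumerate(text) if ch in "()[]{}"]


def check_balanced(tokens):
    """Downstream side: return None if braces match, else an error message."""
    stack = []
    for ch, pos in tokens:
        if ch in OPENERS:
            stack.append((ch, pos))
        elif not stack or stack.pop()[0] != PAIRS[ch]:
            return f"unmatched {ch!r} at {pos}"
    if stack:
        ch, pos = stack[-1]
        return f"unclosed {ch!r} at {pos}"
    return None
```

No regex ever has to "count" anything: `check_balanced(brace_tokens("f(a[1]) { }"))` returns `None`, and the mismatched input `"f(a[1)]"` is reported by the stack pass, not by the lexer.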