I agree, but lexers and parsers are especially stateful pieces of code.
I'm not a static typing person (I like Python, bash, and C), but I have resolved to write all future parsers in OCaml or a similar language.
For other types of code, if you don't have a code path explosion, you can do OK in C. For example, numerical simulation code.
Unfortunately async code in C like Nginx tends to also produce this code path explosion. And kernel code is also littered with state machines.
Another interesting thing is that SQLite's parser is generated with Lemon. It's not even a hand-written parser in C (like Lua's), which would be much worse. I guess all the crashes were in the semantic actions of the parser?
I've written parsers with ANTLR, lex/bison, and other tools, and I've hand written parsers.
With the generators, it's very hard to get several features of a high-quality parser right: in particular error messages and error recovery, but also handling some of the more esoteric grammars (ones that are ambiguous, not LALR(1), etc.). Using a parser generator also complicates your build chain and your refactorings, usually means you get poor source tooling, and often slows down the edit/test cycle quite a bit.
At the same time, if you're doing it right, writing a simple LL(1) parser by hand really isn't that hard, and the resulting code doesn't have to be much longer than the spec in your generator's language. Even in languages that are not that DSL-friendly (e.g. Java), you can end up with very compact, readable code. Plus you get all your usual tooling to do so (e.g. IDE support).
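To make that concrete, here's a minimal sketch of a hand-written LL(1) recursive-descent parser in OCaml for a toy expression grammar (expr -> term (('+'|'-') term)*, and so on). The grammar and all the names are illustrative, nothing from a real codebase:

    (* Tokens and AST for the toy grammar. *)
    type token = Num of int | Plus | Minus | Star | Slash | LParen | RParen
    type ast = Lit of int | BinOp of char * ast * ast

    exception Parse_error of string

    (* Each function consumes a prefix of the token list and returns
       the parsed subtree plus the remaining tokens. One token of
       lookahead (the head of the list) decides every branch. *)
    let rec parse_expr toks =
      let lhs, rest = parse_term toks in
      let rec loop lhs = function
        | Plus :: rest -> let rhs, rest = parse_term rest in loop (BinOp ('+', lhs, rhs)) rest
        | Minus :: rest -> let rhs, rest = parse_term rest in loop (BinOp ('-', lhs, rhs)) rest
        | rest -> (lhs, rest)
      in
      loop lhs rest

    and parse_term toks =
      let lhs, rest = parse_factor toks in
      let rec loop lhs = function
        | Star :: rest -> let rhs, rest = parse_factor rest in loop (BinOp ('*', lhs, rhs)) rest
        | Slash :: rest -> let rhs, rest = parse_factor rest in loop (BinOp ('/', lhs, rhs)) rest
        | rest -> (lhs, rest)
      in
      loop lhs rest

    and parse_factor = function
      | Num n :: rest -> (Lit n, rest)
      | LParen :: rest ->
          (match parse_expr rest with
           | inner, RParen :: rest -> (inner, rest)
           | _ -> raise (Parse_error "expected ')'"))
      | _ -> raise (Parse_error "expected a number or '('")

For example, parse_expr [Num 1; Plus; Num 2; Star; Num 3] yields (BinOp ('+', Lit 1, BinOp ('*', Lit 2, Lit 3)), []). The whole thing is about as long as the equivalent generator spec plus its semantic actions.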
Error messages and error recovery are to some extent red herrings. You really do not want a production system to accept wrong inputs (and especially not try to 'recover' them into a usable parse tree). This will definitely make your parser ambiguous and can have very real security implications (say, Microsoft's anti-XSS transformations introducing XSS).
As to LL(1) - many real-world protocols can't be parsed efficiently as LL(1).
Error messages really are a core feature of parsers. Would you want SQLite to just "return false" whenever there was a syntax error somewhere in your query?
By error recovery in the context of a parser, I mean the ability to continue parsing with predictable results after encountering an error. This is not about automagically correcting errors; it's about being able to report more than one error to the user. Returning just the first error encountered sucks, as does returning a slew of spurious errors caused by your parser getting confused.
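A minimal sketch of what I mean, assuming a panic-mode strategy; the parse_stmt function and the ';' synchronization point below are placeholders, not anything from a real codebase. On an error we record the message, skip to the next synchronizing token, and keep parsing, so the user gets every error in one pass:

    (* parse_stmt : token list -> (stmt * token list, string) result
       and is_sync (e.g. "is this token a ';'?") are supplied by the
       caller; this function only implements the recovery loop. *)
    let parse_program ~parse_stmt ~is_sync tokens =
      let rec skip_to_sync = function
        | t :: rest when is_sync t -> rest   (* drop the ';' itself *)
        | _ :: rest -> skip_to_sync rest
        | [] -> []
      in
      let rec loop stmts errors = function
        | [] -> (List.rev stmts, List.rev errors)
        | toks ->
            (match parse_stmt toks with
             | Ok (stmt, rest) -> loop (stmt :: stmts) errors rest
             | Error msg -> loop stmts (msg :: errors) (skip_to_sync toks))
      in
      loop [] [] tokens

Because skip_to_sync always consumes at least one token on failure, the loop is guaranteed to terminate, and the errors after the first one stay meaningful as long as statement boundaries are easy to find.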
> As to LL(1) - many real-world protocols can't be parsed efficiently as LL(1).
Not sure what you mean by efficiently - a language either can or cannot be parsed as LL(1), depending on whether it's in that language class. But in any case, it's still very straightforward to make LL decisions with a longer lookahead in hand-written code, and the decision code is often more efficient than what a generator would produce.
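For example (a hypothetical grammar, purely to illustrate the hand-rolled decision): if two statement forms both start with an identifier, the grammar isn't LL(1), but peeking one extra token settles it:

    type tok = Ident of string | Equals | LParen
    type stmt_kind = Assignment | Call

    (* An LL(2) decision written directly as a pattern match on the
       first two tokens of the remaining input. *)
    let classify_stmt = function
      | Ident _ :: Equals :: _ -> Ok Assignment   (* x = ...  *)
      | Ident _ :: LParen :: _ -> Ok Call         (* x (...)  *)
      | _ -> Error "expected a statement"

By hand, the longer lookahead is just one extra element in the pattern.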
The way I write parsers nowadays is to write a pseudo-code description of the grammar, then treat that pseudo-code as data that must be transformed into a valid parser.
The end result is a fast, memory-safe parser in C or C++, and a whole lot of ugly python code :)
I used to exclusively use hand-written recursive-descent parsers, until I discovered packrat-style parsers. They are crazy fast and so easy to write. Also, the parser generator I use (esrap) lets you fall back on recursive descent for parsing rules.
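For anyone unfamiliar, the core packrat trick is just memoizing (rule, input position) pairs so each rule is attempted at most once per position. A tiny sketch of the idea in OCaml (not esrap's actual API, which is Common Lisp):

    (* Wrap a rule (here: a function from an input offset to an
       optional value plus new offset) so repeated attempts at the
       same position hit a cache instead of reparsing. Memoizing
       every rule this way is what gives PEG/packrat parsing its
       linear-time bound. *)
    let memoize_rule (rule : int -> ('a * int) option) : int -> ('a * int) option =
      let table = Hashtbl.create 64 in
      fun pos ->
        match Hashtbl.find_opt table pos with
        | Some cached -> cached
        | None ->
            let result = rule pos in
            Hashtbl.add table pos result;
            result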
Do you have a link/citation for that? I always thought they were asymptotically linear due to the memoization, so comparable to LL, LALR etc, but in practice much slower due to the overhead compared to those more traditional approaches.
That's interesting -- do you have examples somewhere?
Are you saying you write a purpose-built parser generator in Python for every problem, with the goal of producing readable C or C++? (As opposed to the sometimes unreadable output of general-purpose parser generators.)
I think it'd actually be better to use a hand-written recursive-descent parser. With RD, the state is implicit in the current position in the code and the call stack, so there's no need to manage it explicitly. AFAIK the SQL grammar isn't unusually difficult to parse.
I've used Flex/Bison before, and it makes it easy to get started, but with more complex parsers, threading the state around quickly gets irritating, and I can see how bugs would arise from it.
It's basically advocating the ML subset of OCaml, Haskell, and now Rust. Those languages all feature algebraic data types and pattern matching, which were pioneered in ML. Rust is basically a cross between ML and C++, with C-ish syntax.
(None of C, C++, Java, JavaScript, Python, etc. have those features, and thus they are all suboptimal for writing parsers.)
A good clue is the name -- ML stands for "meta-language". It's literally a language for describing languages.
ADTs and pattern matching are basically two sides of the same coin -- data vs code. ADTs are how you describe variants in the data. Specifically, "sum types" are essentially type-safe tagged unions (if you are a C person).
And pattern matching is how you describe variants in the code. The compiler checks that you have considered all cases.
This helps you tame code with nested conditionals. Nested conditionals give rise to the state space and code path explosion that I mentioned.
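A tiny OCaml example of both halves (a made-up JSON-ish type, just for illustration):

    (* The sum type lists every variant the data can take... *)
    type json =
      | Null
      | Bool of bool
      | Number of float
      | String of string
      | List of json list

    (* ...and the match lists every variant the code handles. Add a
       new constructor to json and the compiler points at every match
       that forgot about it, instead of a missing else-branch silently
       doing the wrong thing. *)
    let rec to_string = function
      | Null -> "null"
      | Bool b -> string_of_bool b
      | Number n -> string_of_float n
      | String s -> "\"" ^ s ^ "\""
      | List xs -> "[" ^ String.concat ", " (List.map to_string xs) ^ "]"

In C you'd write this as a tagged union plus a switch, but nothing stops you from reading the wrong union member or forgetting a case.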
Put me in the "most boring interests ever" category, but I'd be interested in a discussion of the actuarial modeling DSL mentioned in the link, if such a write-up exists.
I imagine actuaries are about the most technical "nontechnical" users there are so a DSL seems a particularly good fit.