
I agree, but lexers and parsers are especially stateful pieces of code.

I'm not a static typing person (I like Python, bash, and C), but I have resolved to write all future parsers in OCaml or a similar language.

For other types of code, if you don't have a code path explosion, you can do OK in C. For example, numerical simulation code.

Unfortunately, async C code like Nginx's also tends to produce this code path explosion. And kernel code is littered with state machines.

Another interesting thing is that SQLite's parser is generated with Lemon. It's not even a hand-written parser in C (like Lua's), which would be much worse. I guess all the crashes were in the semantic actions of the parser?




I think you shouldn't write parsers by hand at all and instead generate them automatically -- without semantic actions. I wrote a parser generator for tricky binary formats a few months ago -- the code is at https://github.com/jbangert/nail and there's a paper explaining it here: https://www.usenix.org/system/files/conference/osdi14/osdi14...


I've written parsers with ANTLR, lex/bison, and other tools, and I've hand written parsers.

With the generators, it's very hard to get several features of a high-quality parser right -- in particular error messages and error recovery, but also handling some more esoteric grammars (ones that are not entirely unambiguous, not LALR(1), etc.). Using a parser generator also complicates your build chain, complicates refactorings, usually means you get bad source tooling, and often slows down the edit/test cycle quite a bit.

At the same time, if you're doing it right, writing a simple LL(1) parser by hand really isn't that hard, and the resulting code doesn't have to be much longer than the spec in your generator's language. Even in languages that are not that DSL-friendly (e.g. Java), you can end up with very compact, readable code. Plus you get all your usual tooling to do so (e.g. IDE support).
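To illustrate how compact this can be (a minimal sketch of my own -- grammar, names, and error format are all invented for the example, not taken from any parser discussed here), here's a hand-written LL(1) recursive-descent parser in Rust that evaluates `+`/`*` expressions with parentheses. Each grammar rule is one function, and the state lives implicitly in the position and the call stack:

```rust
// Grammar (LL(1)):  expr   := term ('+' term)*
//                   term   := factor ('*' factor)*
//                   factor := digit+ | '(' expr ')'
struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.input.get(self.pos).copied()
    }
    fn bump(&mut self) -> Option<u8> {
        let c = self.peek();
        self.pos += 1;
        c
    }
    // expr := term ('+' term)*
    fn expr(&mut self) -> Result<i64, String> {
        let mut v = self.term()?;
        while self.peek() == Some(b'+') {
            self.bump();
            v += self.term()?;
        }
        Ok(v)
    }
    // term := factor ('*' factor)*
    fn term(&mut self) -> Result<i64, String> {
        let mut v = self.factor()?;
        while self.peek() == Some(b'*') {
            self.bump();
            v *= self.factor()?;
        }
        Ok(v)
    }
    // factor := digit+ | '(' expr ')'
    fn factor(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some(b'(') => {
                self.bump();
                let v = self.expr()?;
                if self.bump() != Some(b')') {
                    return Err(format!("expected ')' at byte {}", self.pos));
                }
                Ok(v)
            }
            Some(c) if c.is_ascii_digit() => {
                let mut v = 0i64;
                while let Some(d) = self.peek().filter(|d| d.is_ascii_digit()) {
                    v = v * 10 + (d - b'0') as i64;
                    self.bump();
                }
                Ok(v)
            }
            other => Err(format!("unexpected {:?} at byte {}", other, self.pos)),
        }
    }
}

fn parse(s: &str) -> Result<i64, String> {
    let mut p = Parser { input: s.as_bytes(), pos: 0 };
    let v = p.expr()?;
    if p.pos != s.len() {
        return Err(format!("trailing input at byte {}", p.pos));
    }
    Ok(v)
}

fn main() {
    assert_eq!(parse("2+3*4"), Ok(14));
    assert_eq!(parse("(2+3)*4"), Ok(20));
    assert!(parse("2+").is_err());
    println!("ok");
}
```

Note how the error messages come for free with position information -- each decision point knows exactly what it expected and where.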


Error messages and error recovery are to some extent red herrings. You really do not want a production system to accept wrong inputs (and especially not try to 'recover' them into a usable parse tree). This will definitely make your parser ambiguous and can have very real security implications (say, Microsoft's anti-XSS transformations introducing XSS). As to LL(1) - many real-world protocols can't be parsed efficiently as LL(1).


Error messages really are a core feature of parsers. Would you want SQLite to just "return false" whenever there was some syntax error somewhere in your query?

With error recovery in the context of a parser I mean the ability to continue parsing with predictable results after encountering an error. This is not about automagically correcting errors; it's about being able to report more than one error to the user. Returning just the first error encountered sucks, as does returning a slew of non-errors caused by your parser getting messed up.
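The classic way to get that behavior is panic-mode recovery: record the error, skip ahead to a synchronization token, and resume. A minimal sketch of my own (the toy "letter followed by `;`" grammar is invented for illustration):

```rust
// Panic-mode recovery: each "statement" must be one ASCII letter
// followed by ';'. On a bad statement we record an error, skip to the
// next ';' (the synchronization token), and keep parsing -- so a
// single pass reports every error, not just the first.
fn parse_stmts(input: &str) -> (usize, Vec<String>) {
    let bytes = input.as_bytes();
    let mut pos = 0;
    let mut ok = 0;
    let mut errors = Vec::new();
    while pos < bytes.len() {
        if bytes[pos].is_ascii_alphabetic() && bytes.get(pos + 1) == Some(&b';') {
            ok += 1;
            pos += 2;
        } else {
            errors.push(format!("bad statement at byte {}", pos));
            // recover: advance to the synchronization token...
            while pos < bytes.len() && bytes[pos] != b';' {
                pos += 1;
            }
            pos += 1; // ...and step over it, resuming in a known state
        }
    }
    (ok, errors)
}

fn main() {
    // two bad statements ("1;" and "??;") among three good ones
    let (ok, errors) = parse_stmts("a;1;b;??;c;");
    assert_eq!(ok, 3);
    assert_eq!(errors.len(), 2);
    println!("ok");
}
```

Because the parser re-synchronizes at a known token, the errors after the first one are real errors, not cascade noise.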

    As to LL1 - many real world protocols can't be parsed efficiently as LL1.
Not sure what you mean by efficiently - a language either can or cannot be parsed as LL(1), depending on whether it's in that language class. But in any case, it's still very straightforward to make LL decisions with a longer lookahead in hand-written code, and the decision code is often more efficient than what a generator would create.


Cool, I printed out your paper a few months ago and will give it another look!


The way I write parsers nowadays is to write a pseudo-code description of the grammar, then treat that pseudo-code as data that must be transformed into a valid parser.

The end result is a fast, memory-safe parser in C or C++, and a whole lot of ugly python code :)


I used to entirely use hand-written recursive-descent parsers, until I discovered packrat-style parsers. They are crazy fast and so easy to write. Also, the parser generator I use (esrap) lets you fall back on recursive descent for parsing rules.
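The core packrat trick (a minimal sketch of my own, unrelated to esrap's actual implementation -- rule names and the toy grammar are invented) is to memoize each (rule, position) result, so that backtracking between PEG alternatives never re-parses the same span:

```rust
use std::collections::HashMap;

// Minimal packrat flavor: memoize (rule, pos) -> result so that
// backtracking alternatives never re-run a rule at the same position.
struct Packrat<'a> {
    input: &'a [u8],
    memo: HashMap<(&'static str, usize), Option<usize>>,
    calls: usize, // counts how many times a rule body actually ran
}

impl<'a> Packrat<'a> {
    // "digits" rule: one or more ASCII digits; returns the end position.
    fn digits(&mut self, pos: usize) -> Option<usize> {
        if let Some(&r) = self.memo.get(&("digits", pos)) {
            return r; // memo hit: no re-parse
        }
        self.calls += 1;
        let mut end = pos;
        while end < self.input.len() && self.input[end].is_ascii_digit() {
            end += 1;
        }
        let r = if end > pos { Some(end) } else { None };
        self.memo.insert(("digits", pos), r);
        r
    }
    // PEG ordered choice:  digits '+' digits  /  digits '-' digits
    // Both alternatives start with `digits` at the same position; the
    // memo table makes the second attempt free.
    fn sum_or_diff(&mut self, pos: usize) -> Option<usize> {
        for op in [b'+', b'-'] {
            if let Some(mid) = self.digits(pos) {
                if self.input.get(mid) == Some(&op) {
                    if let Some(end) = self.digits(mid + 1) {
                        return Some(end);
                    }
                }
            }
        }
        None
    }
}

fn main() {
    let mut p = Packrat { input: b"12-34", memo: HashMap::new(), calls: 0 };
    assert_eq!(p.sum_or_diff(0), Some(5));
    // `digits` only ran twice (at positions 0 and 3), even though the
    // failed '+' alternative also needed digits at position 0.
    assert_eq!(p.calls, 2);
    println!("ok");
}
```

That memoization is what gives packrat parsing its linear-time guarantee despite unbounded backtracking.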


    They are crazy fast
Do you have a link/citation for that? I always thought they were asymptotically linear due to the memoization, so comparable to LL, LALR etc, but in practice much slower due to the overhead compared to those more traditional approaches.


For me, I expect Esrap to end up within a factor of 2-4 of a hand-tuned recursive-descent parser on SBCL.


That's interesting -- do you have examples somewhere?

Are you saying you write a problem-specific parser generator in Python for every project, with the goal of producing readable C or C++? (As opposed to the sometimes unreadable output of parser generators.)


I think it'd actually be better to use a hand-written recursive-descent parser. With RD the state is implicitly in the position of the code and the stack, so there's no need to explicitly manage it. AFAIK the SQL grammar isn't unusually difficult to parse.

I've used Flex/Bison before, and it makes it easy to get started, but with more complex parsers, threading the state around quickly gets irritating, and I could see how bugs could arise from it.


Can you please elaborate why OCaml is a good choice for writing parsers?


I usually post the VandenBerghe link too.

It's basically advocating the ML subset of OCaml, Haskell, and now Rust. Those languages all feature algebraic data types and pattern matching, which were pioneered in ML. Rust is basically a cross between ML and C++, with C-ish syntax.

http://en.wikipedia.org/wiki/ML_(programming_language)

(None of C, C++, Java, JavaScript, Python, etc. have those features, and thus they are all suboptimal for writing parsers.)

A good clue is the name -- ML stands for "meta-language". It's literally a language for describing languages.

ADTs and pattern matching are basically two sides to the same coin -- data vs code. ADTs are how you describe variants in the data. Specifically, "sum types" are essentially type-safe tagged unions (if you are a C person). And pattern matching is how you describe variants in the code. The compiler checks that you have considered all cases.

This helps you tame code with nested conditionals. Nested conditionals give rise to the state space and code path explosion that I mentioned.
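Here is the same idea in Rust's ML-derived syntax (a minimal example of my own, not taken from the linked article): an `enum` is the sum type describing the variants in the data, and `match` is its mirror in the code, with the compiler rejecting any match that forgets a case:

```rust
// Sum type: an expression is exactly one of these variants
// (a type-safe tagged union, in C terms).
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Pattern matching is the code-side mirror of the data: delete the
// Mul arm and this no longer compiles, so no forgotten branch can
// hide in nested conditionals.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
    }
}

fn main() {
    // (1 + 2) * 3
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
        Box::new(Expr::Num(3)),
    );
    assert_eq!(eval(&e), 9);
    println!("ok");
}
```

In C you'd write this as a struct with a tag field and a union, and nothing would stop you from reading the wrong union member or skipping a tag in a switch.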

Here is a more recent explanation:

https://queue.acm.org/detail.cfm?id=2038036

Take a look at the OCaml vs Java code snippets.



Put me in the "most boring interests ever" category, but I'd be interested in a discussion of the actuarial modeling DSL mentioned in the link, if such a write-up exists.

I imagine actuaries are about the most technical "nontechnical" users there are so a DSL seems a particularly good fit.





