This is a good essay. However, the author didn't mention Pratt Parsing [1], which cleanly addresses [2] the shortcomings of recursive descent parsing. Parser generators look promising until you actually start building anything of moderate complexity. Then it's a constant cycle of kludges and hair pulling.
[1] https://en.wikipedia.org/wiki/Pratt_parser

[2] Mostly. The Pratt model does introduce potential context around operators (depending on the grammar). It's easy enough to extend the basic Pratt model to support this, but it isn't spelled out in many examples or literature.
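(Not from the thread: for readers new to the technique, here is a minimal Ruby sketch of the binding-power loop at the heart of a basic Pratt parser. The token format, binding powers, and class name are all invented for illustration.)

  # Minimal Pratt ("top-down operator precedence") parser for arithmetic.
  # Tokens are assumed to be Integers or the symbols :+ :- :* :/ :^ :lparen :rparen.
  class PrattParser
    # Left binding powers: higher binds tighter.
    LBP = { :+ => 10, :- => 10, :* => 20, :/ => 20, :^ => 30 }.freeze

    def initialize(tokens)
      @tokens = tokens
      @pos = 0
    end

    def parse
      expression(0)
    end

    private

    def peek
      @tokens[@pos]
    end

    def advance
      tok = @tokens[@pos]
      @pos += 1
      tok
    end

    # The core loop: parse a prefix ("nud") expression, then keep absorbing
    # infix operators whose binding power is higher than min_bp.
    def expression(min_bp)
      left = nud(advance)
      while (op = peek) && LBP[op] && LBP[op] > min_bp
        advance
        # ^ is right-associative: recurse with a slightly lower power so that
        # 2 ^ 3 ^ 2 groups as 2 ^ (3 ^ 2).
        right_bp = op == :^ ? LBP[op] - 1 : LBP[op]
        left = [op, left, expression(right_bp)]
      end
      left
    end

    # "nud" (null denotation): literals, prefix minus, parenthesized expressions.
    def nud(tok)
      case tok
      when Integer then tok
      when :- then [:neg, expression(25)]              # binds between * and ^
      when :lparen then expression(0).tap { advance }  # the advance eats :rparen
      else raise "unexpected token #{tok.inspect}"
      end
    end
  end

  # 1 + 2 * 3  =>  [:+, 1, [:*, 2, 3]]
  p PrattParser.new([1, :+, 2, :*, 3]).parse

The appeal over plain recursive descent is that precedence lives in the LBP table rather than in one function per level, so adding an operator is usually a one-line change.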
> Parser generators look promising until you actually start building anything of moderate complexity
I've done both, by hand and with parser generators (flex/bison and antlr), and getting the machine to do the boring work is a total fuckload[0] faster and more productive.
Edit: and unless you know what you're doing, you will screw up when hand-writing a parser. I know of a commercial reporting tool that couldn't reliably parse valid input in its own embedded language.
What do you think is special about recursive descent parsing that makes you more likely to screw up unless you know what you're doing?
My experience has been the exact opposite - particularly as the language gets complicated and/or weird, in which case the generated parser becomes horribly brittle. Adding an innocent-looking new rule to your working Antlr or Yacc grammar feels like fiddling with a ticking bomb - it might work straight away, or it could explode in your face, leaving you with hours or days of whack-a-moling grammar ambiguities.
I didn't say recursive descent parsing wrt screwing up, I just said "hand-writing a parser". Nothing about the flavour.
I guess our experiences differ, but I don't know why. I have written a seriously major SQL parser in antlr and had no problem. And it was huge and complex - well, that's TSQL for you.
It may be that you have been parsing non-LR(1) grammars in bison, which could prove a headache, but... well, IDK. Maybe I've been lucky.
That's an interesting coincidence - the biggest parser I wrote was also an SQL parser using Antlr. In fact, SQL was only part of it - it was a programming language that supported multiple flavours of embedded SQL (DB2 and Oracle). It worked but I always dreaded when it would have to be changed to support a new feature in a new release of Oracle (or DB/2).
I don't think it's an LR vs LL thing either. I feel that there is no sense of locality with parser generators; it's a bit like quantum mechanics - every rule has a "connection" to every other one. Change one seemingly small part of the Antlr grammar and some "far away" seemingly unrelated parts of the grammar can suddenly blow up as being ambiguous.
Coincidence indeed - I'm modifying my giant SQL grammar right now and building an AST off it. And struggling a bit with that, but that's mainly down to me, not antlr.
It is strange that we're having such different experiences of it. I don't recognise your quantum view of it either: antlr rules, and bison's, are very context-specific, as they can only be triggered in the context of larger rules, and only when a given piece of the target language (SQL here) matches. They get triggered only in those very specific cases. I've never had your struggles with it. I don't understand.
I completely agree, yacc/bison are my go-to tools - the big difference is that you are building trees from the bottom up rather than top down. If you're building an AST it probably doesn't matter; however, a bunch of things (constant folding, type propagation) tend to go in the same direction, so sometimes you can combine stuff.
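(An aside, not from the thread: since the only code in this thread is Ruby, here is a Ruby sketch of that point rather than a bison one. The Num/BinOp structs and make_binop are invented names; make_binop plays the role of a yacc/bison semantic action, which sees its children already built bottom-up and can therefore fold constants on the spot.)

  # A semantic-action-style constructor: by the time it runs, both children
  # are already built (bottom-up), so constants can be folded immediately,
  # much like a yacc/bison action for "expr: expr '+' expr" could do.
  Num = Struct.new(:value)
  BinOp = Struct.new(:op, :left, :right)

  def make_binop(op, left, right)
    if left.is_a?(Num) && right.is_a?(Num)
      Num.new(left.value.public_send(op, right.value))  # fold instead of building a node
    else
      BinOp.new(op, left, right)
    end
  end

  # Reductions happen innermost-first, as in a shift-reduce parse of (1 + 2) * 4:
  tree = make_binop(:*, make_binop(:+, Num.new(1), Num.new(2)), Num.new(4))
  p tree  # => #<struct Num value=12>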
Exercise for the reader: Write a simple arithmetic expression language parser, with numbers, variables, parentheses and working precedence levels. Extend it with "EXP0 EXP1 ... EXPn" syntax for function calls. Now extend it with "VAR1 ... VARn '.' EXP" syntax for lambdas. Yes, with no "fun", or "\", or "λ" to introduce it—you realise you have a function definition only when you arrive at the dot. The function definition spans as long as it makes sense.
Pretty fun, although it requires some small hackery (I solved it by testing, on seeing the dot, that the "apply" list I've built to this point contains only variable names, and then re-using it as the "arglist").
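(A rough Ruby sketch of that hack, not the commenter's actual code; Var/App/Lam and finish_juxtaposition are invented names: collect the juxtaposed expressions as a flat list, and when a dot shows up, require that the list contains only variables and reinterpret it as the lambda's parameter list.)

  Var = Struct.new(:name)
  App = Struct.new(:fn, :arg)
  Lam = Struct.new(:params, :body)

  # atoms: the expressions collected so far while parsing juxtaposition.
  # If a dot was seen, they must all be bare variables and become the
  # parameter list; otherwise they fold into left-nested applications.
  def finish_juxtaposition(atoms, saw_dot:, body: nil)
    if saw_dot
      raise "only variable names may appear before '.'" unless atoms.all? { |a| a.is_a?(Var) }
      Lam.new(atoms.map(&:name), body)
    else
      atoms.reduce { |fn, arg| App.new(fn, arg) }  # f x y => ((f x) y)
    end
  end

  # "f x y"   => App(App(Var f, Var x), Var y)
  p finish_juxtaposition([Var.new("f"), Var.new("x"), Var.new("y")], saw_dot: false)
  # "x y . x" => Lam(["x", "y"], Var x)
  p finish_juxtaposition([Var.new("x"), Var.new("y")], saw_dot: true, body: Var.new("x"))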
I don't know why you'd want to write lambda like that.
Long ago, I made a prototype mixfix syntax for Scheme intended for a shell-like REPL with ISWIM overtones. The paren-less function call syntax (terminated by newline or semicolon, like shell) was done by modifying the Pratt engine to treat juxtaposition as application. Sexp syntax was also valid as parenthesized expressions.
Well, when I played with a minimal lambda-calculus evaluator, I quickly got tired of having to print "\" or whatever to denote the start of a lambda. But look, for example, at this:
  (f. (x. f (v. x x v)) (x. f (v. x x v)))

  fct. n.
    (church0? n)
      (_. church1)
      _. church* n (fct (church- n 1))
Are "\" really needed there? You have to put lots parens to denote where lambdas end anyway, and they also happen to show where they start too, and a dot is a clear enough signal for a human (IMHO) to see that it's a lambda. So that's how I've came up with the idea of writing lambdas like that. Then I extended my lambda-calculus evaluator with some arithmetic primitives, and yes, the syntax got somewhat peculiar.
I haven't made the Show HN post yet, but using the parser combinator library that I've been building[1], here's an answer to your exercise:
  module Joker_vDChallengeParser
    include Parsby::Combinators
    extend self

    def parse(s)
      (expr < eof).parse(s)
    end

    define_combinator(:exp_op) {|left, right| group(left, spaced(lit("^")), right) }
    define_combinator(:pos_op) {|subexpr| group(lit("+") < ws, subexpr) }
    define_combinator(:neg_op) {|subexpr| group(lit("-") < ws, subexpr) }
    define_combinator(:div_op) {|left, right| group(left, spaced(lit("/")), right) }
    define_combinator(:mul_op) {|left, right| group(left, spaced(lit("*")), right) }
    define_combinator(:add_op) {|left, right| group(left, spaced(lit("+")), right) }
    define_combinator(:sub_op) {|left, right| group(left, spaced(lit("-")), right) }
    define_combinator(:call_op) {|left, right| group(left, ws_1, right) }

    define_combinator :identifier do
      first_char = char_in([*('a'..'z'), *('A'..'Z'), '_'].join)
      rest_char = first_char | char_in([*('0'..'9')].join)
      first_char + join(many(rest_char))
    end

    define_combinator :lambda_expr do
      group(
        sep_by_1(ws_1, identifier) < spaced(lit(".")),
        expr,
      )
    end

    def scope(x, &b)
      b.call x
    end

    define_combinator :expr do
      lazy do
        e = choice(
          decimal,
          lambda_expr,
          identifier,
          between(lit("("), lit(")"), expr),
        )

        # Each e = scope ... block is a precedence level. You can switch
        # them around to play with the precedence of the operators.
        #
        # hpe -- higher precedence level expression
        # spe -- same precedence level expression

        # Basing myself on Haskell and making function calls the highest
        # precedence operators.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              call_op(pure(left_result), hpe),
            )
          end
        end

        # Our only right-associative precedence level.
        e = scope e do |hpe|
          recursive do |spe|
            choice(
              exp_op(hpe, spe),
              hpe,
            )
          end
        end

        e = scope e do |hpe|
          recursive do |spe|
            choice(
              neg_op(spe),
              pos_op(spe),
              hpe,
            )
          end
        end

        # Left-associativity done by parsing left operand bottom-up and
        # right operands top-down.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              mul_op(pure(left_result), hpe),
              div_op(pure(left_result), hpe),
            )
          end
        end

        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              add_op(pure(left_result), hpe),
              sub_op(pure(left_result), hpe),
            )
          end
        end
      end
    end
  end
That was a nice exercise. Here's some example parses:
You can try it by cloning my repo, running `./bin/console`, and pasting the module, or putting it in a file in `lib/`. If you put it in a file, you can reload changes with `reload!` in the console.