
This is a good essay. However, the author didn't mention Pratt Parsing [1], which cleanly addresses [2] the shortcomings of recursive descent parsing. Parser generators look promising until you actually start building anything of moderate complexity. Then it's a constant cycle of kludges and hair pulling.

[1] https://en.wikipedia.org/wiki/Pratt_parser

[2] Mostly. The Pratt model does introduce potential context around operators (depending on the grammar). It's easy enough to extend the basic Pratt model to support this, but it isn't spelled out in many examples or much of the literature.
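
For anyone who hasn't seen the technique, here's a minimal sketch of a Pratt parser in Ruby (token handling is simplified and the binding powers are illustrative, not taken from any particular language):

  # Minimal Pratt parser: numbers, variables, + - * / ^, parens, unary minus.
  # Tokens are pre-split strings; real lexing is out of scope here.
  class PrattParser
    # Left binding powers. ^ is right-associative; see below.
    LBP = { "+" => 10, "-" => 10, "*" => 20, "/" => 20, "^" => 30 }

    def initialize(tokens)
      @tokens = tokens
    end

    def parse
      result = expression(0)
      raise "trailing input: #{@tokens.first}" unless @tokens.empty?
      result
    end

    private

    # The core Pratt loop: parse a prefix part, then keep consuming infix
    # operators while their binding power exceeds min_bp.
    def expression(min_bp)
      left = prefix(@tokens.shift)
      while (op = @tokens.first) && (bp = LBP[op]) && bp > min_bp
        @tokens.shift
        # Right-associative ^ recurses with bp - 1; the others with bp.
        left = [left, op, expression(op == "^" ? bp - 1 : bp)]
      end
      left
    end

    def prefix(tok)
      case tok
      when nil then raise "unexpected end of input"
      when "(" then expression(0).tap { raise "expected )" unless @tokens.shift == ")" }
      when "-" then ["-", expression(25)] # tighter than * /, looser than ^
      when /\A\d+\z/ then tok.to_i
      else tok # variable name
      end
    end
  end

  PrattParser.new(%w[1 + 2 * 3 ^ 2 ^ 2]).parse
  # => [1, "+", [2, "*", [3, "^", [2, "^", 2]]]]

The point is that precedence and associativity live in one table and one loop, rather than in a tower of grammar rules, one per precedence level.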




> Parser generators look promising until you actually start building anything of moderate complexity

I've done both, by hand and with parser generators (flex/bison and antlr), and getting the machine to do the boring work is a total fuckload[0] faster and more productive.

Edit: and unless you know what you're doing, you will screw up when hand-writing a parser. I know of a commercial reporting tool that couldn't reliably parse valid input in its own embedded language.

[0] 3.14159 shedloads in metric


What do you think is special about recursive descent parsing that makes you more likely to screw up unless you know what you're doing?

My experience has been the exact opposite, particularly as the language gets complicated and/or weird; in that case the generated parser becomes horribly brittle. Adding an innocent-looking new rule to your working Antlr or Yacc grammar feels like fiddling with a ticking bomb: it might work straight away, or it could explode in your face, leaving you with hours or days of whack-a-moling grammar ambiguities.


I didn't say anything about recursive descent parsing wrt screwing up, I just said "hand-writing a parser". Nothing about the flavour.

I guess our experiences differ, but I don't know why. I have written a seriously major SQL parser in antlr and had no problem. And it was huge and complex; well, that's TSQL for you.

It may be that you have been parsing non-LR(1) grammars in bison, which could prove a headache, but... well, IDK. Maybe I've been lucky.


That's an interesting coincidence - the biggest parser I wrote was also an SQL parser using Antlr. In fact, SQL was only part of it - it was a programming language that supported multiple flavours of embedded SQL (DB2 and Oracle). It worked but I always dreaded when it would have to be changed to support a new feature in a new release of Oracle (or DB/2).

I don't think it's an LR vs LL thing either. I feel that there is no sense of locality with parser generators; it's a bit like quantum mechanics - every rule has a "connection" to every other one. Change one seemingly small part of the Antlr grammar and some "far away" seemingly unrelated parts of the grammar can suddenly blow up as being ambiguous.


Coincidence indeed - I'm modifying my giant SQL grammar right now and building an AST off it. And struggling a bit with that, but that's mainly down to me, not antlr.

It is strange that we're having such different experiences of it. I don't recognise your quantum view of it either: antlr rules, and bison's, are very context-specific, as they can only be triggered in the context of larger rules, and only by a given piece of the target language (SQL here). They get triggered only in those very specific cases. I've never had your struggles with it. I don't understand.


I completely agree, yacc/bison are my go-to tools. The big difference is that you are building trees from the bottom up rather than top down. If you're building an AST it probably doesn't matter; however, a bunch of things (constant folding, type propagation) tend to go in the same direction, so sometimes you can combine stuff.
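
Roughly like this, in Ruby rather than in a bison action body (make_binop stands in for a reduction's semantic action; the names are made up). Because the children are fully built by the time the action runs, you can fold constants right there instead of allocating a node:

  # Fold a binary operation at reduction time when both operands are
  # already constants; otherwise build an AST node as usual.
  FOLD = {
    "+" => ->(a, b) { a + b },
    "-" => ->(a, b) { a - b },
    "*" => ->(a, b) { a * b },
    "/" => ->(a, b) { a / b },
  }

  def make_binop(left, op, right)
    if left.is_a?(Integer) && right.is_a?(Integer)
      FOLD[op].call(left, right) # children are done: fold now
    else
      [left, op, right]          # otherwise keep the node
    end
  end

  make_binop(2, "*", 3)                       # => 6
  make_binop("x", "+", make_binop(1, "+", 2)) # => ["x", "+", 3]

Type propagation can ride along the same way, since it also flows bottom-up.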


Exercise for the reader: Write a simple arithmetic expression language parser, with numbers, variables, parentheses and working precedence levels. Extend it with "EXP0 EXP1 ... EXPn" syntax for function calls. Now extend it with "VAR1 ... VARn '.' EXP" syntax for lambdas. Yes, with no "fun", or "\", or "λ" to introduce it—you realise you have a function definition only when you arrive at the dot. The function definition spans as long as it makes sense.

Pretty fun, although it requires some small hackery (I solved it by testing, on seeing the dot, that the "apply" list I've built to this point contains only variable names, and then re-using it as the "arglist"; see the sketch below).
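
The hack, roughly, as a much-simplified Ruby sketch (parse_juxt and the token handling here are illustrative, not my actual code):

  # Accumulate juxtaposed atoms as an application list; on ".", check
  # that everything collected so far is a plain variable name and reuse
  # the list as the lambda's arglist.
  def parse_juxt(tokens)
    apply_list = []
    until tokens.empty? || tokens.first == ")"
      tok = tokens.shift
      case tok
      when "."
        unless apply_list.all? { |a| a.is_a?(String) }
          raise "lambda args must be plain variables: #{apply_list.inspect}"
        end
        # The body spans as long as it makes sense: to ")" or end of input.
        return { lambda: apply_list, body: parse_juxt(tokens) }
      when "("
        sub = parse_juxt(tokens)
        raise "expected )" unless tokens.shift == ")"
        apply_list << sub
      else
        apply_list << (tok =~ /\A\d+\z/ ? tok.to_i : tok) # number or variable
      end
    end
    apply_list.size == 1 ? apply_list.first : apply_list
  end

  parse_juxt(%w[f ( x y . x y ) z])
  # => ["f", {lambda: ["x", "y"], body: ["x", "y"]}, "z"]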


I don't know why you'd want to write lambda like that.

Long ago, I made a prototype mixfix syntax for Scheme intended for a shell-like REPL with ISWIM overtones. The paren-less function call syntax (terminated by newline or semicolon, like shell) was done by modifying the Pratt engine to treat juxtaposition as application. Sexp syntax was also valid as parenthesized expressions.
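
The change was small. A fresh Ruby sketch of the idea (not the original Scheme engine; names and binding powers are made up): in the main Pratt loop, when the next token could start an expression and no infix operator intervenes, treat the juxtaposition itself as an application "operator" that binds tighter than everything else.

  # Pratt loop with juxtaposition-as-application. Only + and * as infix
  # operators, to keep the sketch short.
  class JuxtaPratt
    LBP = { "+" => 10, "*" => 20 }
    APP_BP = 100 # application binds tightest; left-associative

    def initialize(tokens)
      @tokens = tokens
    end

    def parse
      expression(0)
    end

    private

    def expression(min_bp)
      left = atom(@tokens.shift)
      loop do
        tok = @tokens.first
        if tok && (bp = LBP[tok]) && bp > min_bp
          @tokens.shift
          left = [left, tok, expression(bp)]
        elsif atom_start?(tok) && APP_BP > min_bp
          # No operator between two expressions: juxtaposition = call.
          left = [:apply, left, expression(APP_BP)]
        else
          return left
        end
      end
    end

    def atom(tok)
      tok =~ /\A\d+\z/ ? tok.to_i : tok
    end

    def atom_start?(tok)
      !tok.nil? && !LBP.key?(tok)
    end
  end

  JuxtaPratt.new(%w[f x + g y 1]).parse
  # => [[:apply, "f", "x"], "+", [:apply, [:apply, "g", "y"], 1]]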


Well, when I played with a minimal lambda-calculus evaluator, I quickly got tired of having to type "\" or whatever to denote the start of a lambda. But look, for example, at this:

    (f. (x. f (v. x x v)) (x. f (v. x x v)))
        fct. n.
            (church0? n)
            (_. church1)
             _. church* n (fct (church- n 1))
Is "\" really needed there? You have to put lots of parens to denote where lambdas end anyway, they happen to show where lambdas start too, and a dot is a clear enough signal for a human (IMHO) that it's a lambda. That's how I came up with the idea of writing lambdas like that. Then I extended my lambda-calculus evaluator with some arithmetic primitives, and yes, the syntax got somewhat peculiar.


I haven't made the Show HN post yet, but using the parser combinator library that I've been building[1], here's an answer to your exercise:

  module Joker_vDChallengeParser
    include Parsby::Combinators
    extend self

    def parse(s)
      (expr < eof).parse(s)
    end

    define_combinator(:exp_op) {|left, right| group(left, spaced(lit("^")), right) }

    define_combinator(:pos_op) {|subexpr| group(lit("+") < ws, subexpr) }
    define_combinator(:neg_op) {|subexpr| group(lit("-") < ws, subexpr) }

    define_combinator(:div_op) {|left, right| group(left, spaced(lit("/")), right) }
    define_combinator(:mul_op) {|left, right| group(left, spaced(lit("*")), right) }

    define_combinator(:add_op) {|left, right| group(left, spaced(lit("+")), right) }
    define_combinator(:sub_op) {|left, right| group(left, spaced(lit("-")), right) }

    define_combinator(:call_op) {|left, right| group(left, ws_1, right) }

    define_combinator :identifier do
      first_char = char_in([*('a'..'z'), *('A'..'Z'), '_'].join)
      rest_char = first_char | char_in([*('0'..'9')].join)
      first_char + join(many(rest_char))
    end

    define_combinator :lambda_expr do
      group(
        sep_by_1(ws_1, identifier) < spaced(lit(".")),
        expr,
      )
    end

    # Just passes x to the block; used below to name the parser for one
    # precedence level at a time (e = scope e do |hpe| ... end).
    def scope(x, &b)
      b.call x
    end

    define_combinator :expr do
      lazy do
        e = choice(
          decimal,
          lambda_expr,
          identifier,
          between(lit("("), lit(")"), expr),
        )

        # Each e = scope ... block is a precedence level. You can switch
        # them around to play with the precedence of the operators.
        #
        # hpe -- higher precedence level expression
        # spe -- same precedence level expression

        # Basing myself on Haskell and making function calls the highest
        # precedence operators.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              call_op(pure(left_result), hpe),
            )
          end
        end

        # Our only right-associative precedence level.
        e = scope e do |hpe|
          recursive do |spe|
            choice(
              exp_op(hpe, spe),
              hpe,
            )
          end
        end

        e = scope e do |hpe|
          recursive do |spe|
            choice(
              neg_op(spe),
              pos_op(spe),
              hpe,
            )
          end
        end

        # Left-associativity done by parsing left operand bottom-up and
        # right operands top-down.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              mul_op(pure(left_result), hpe),
              div_op(pure(left_result), hpe),
            )
          end
        end

        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              add_op(pure(left_result), hpe),
              sub_op(pure(left_result), hpe),
            )
          end
        end
      end
    end
  end

That was a nice exercise. Here's some example parses:

  pry(main)> Joker_vDChallengeParser.parse "- 3 - foo bar . foo - bar - 5"                                     
  => [["-", 3], "-", [["foo", "bar"], [["foo", "-", "bar"], "-", 5]]]
  pry(main)> Joker_vDChallengeParser.parse "- 3 - (foo bar . foo - bar - 5) - 4 * -2 + 5 / 2"                 
  => [[[["-", 3], "-", [["foo", "bar"], [["foo", "-", "bar"], "-", 5]]], "-", [4, "*", ["-", 2]]], "+", [5, "/", 2]]
  pry(main)> Joker_vDChallengeParser.parse "- 3 - (foo bar . foo 5 6 - bar - 5) - 4 * -2 + 5 / 2"             
  => [[[["-", 3], "-", [["foo", "bar"], [[[["foo", " ", 5], " ", 6], "-", "bar"], "-", 5]]], "-", [4, "*", ["-", 2]]], "+", [5, "/", 2]]

  pry(main)> Joker_vDChallengeParser.parse "foo bar . foo bar baz qux . baz qux"                              
  => [["foo", "bar"], [["foo", "bar", "baz", "qux"], ["baz", " ", "qux"]]]
  pry(main)> Joker_vDChallengeParser.parse "foo bar . foo bar + baz qux . baz qux"                            
  => [["foo", "bar"], [["foo", " ", "bar"], "+", [["baz", "qux"], ["baz", " ", "qux"]]]]

You can try it by cloning my repo, running `./bin/console`, and pasting the module, or putting it in a file in `lib/`. If you put it in a file, you can reload changes with `reload!` in the console.

[1] https://github.com/jolmg/parsby

EDIT: Fixed the syntax for newer versions of Ruby.


Looking back at it, I forgot to allow spacing between the parentheses and the inner expression:

  between(lit("("), lit(")"), spaced(expr)),
and using space as an operator prevents `(foo)(bar)` from being parsed as a call.

This is better, just adjacent expressions with optional in-between whitespace:

  define_combinator(:call_op) {|left, right| group(left < ws, right) }
Too late to edit, though. Oh well.


Also no mention of combinator parsing.

https://en.wikipedia.org/wiki/Parser_combinator


They're mentioned in footnote 1 as a flavor of recursive descent parsing.



