Undershoot: Parsing theory in 1965 (jeffreykegler.github.io)
102 points by dedalus on July 9, 2018 | 30 comments



At the end, Kegler links to this comprehensive overview of the history of parsing:

https://jeffreykegler.github.io/personal/timeline_v3

which contains this easily overlooked but IMHO extremely significant statement:

"a recursive descent implementation can parse operator expressions as lists, and add associativity in post-processing"

Personally, it has always seemed like a no-brainer to me that this is clearly the Right Answer. It is a mystery to me that the computing world at large has spent so much effort on a problem whose solution is actually very straightforward if you just give in on one tiny little piece of theoretical purity.

See http://www.flownet.com/ron/lisp/parcil.lisp for my own implementation of such a parser. As you will see if you count LOCs, it's very, very simple by parser standards, and yet it handles all the "hard" problems: associativity, precedence, infix and prefix operators.
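
To make the quoted idea concrete, here is a rough Python sketch (not the Lisp in parcil.lisp; the precedence table and the flat-list format are made up for illustration). The recursive-descent layer hands back the operator expression as a flat list of operands and operators, and a post-processing pass builds the tree from precedence and associativity:

    # Hypothetical precedence table; "^" is treated as right-associative.
    PREC  = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
    RIGHT = {"^"}

    def build_tree(items):
        """items is a flat list: operand, op, operand, op, operand, ...
        Returns a nested tuple tree respecting precedence and associativity."""
        def helper(lo, hi):
            # Pick the lowest-precedence operator in the span; it binds last,
            # so it becomes the root of this subtree.
            best = None
            for i in range(lo + 1, hi, 2):
                op = items[i]
                if best is None or PREC[op] < PREC[items[best]]:
                    best = i
                elif PREC[op] == PREC[items[best]] and op not in RIGHT:
                    best = i   # left-associative: prefer the rightmost occurrence
            if best is None:
                return items[lo]
            return (items[best], helper(lo, best - 1), helper(best + 1, hi))
        return helper(0, len(items) - 1)

    print(build_tree(["a", "-", "b", "-", "c", "*", "d"]))
    # ('-', ('-', 'a', 'b'), ('*', 'c', 'd'))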


I have a toy problem that I throw at anyone who thinks they've "solved" parsing: Haskell list comprehensions. A list comprehension takes the form

    "[" EXPR "|" STMT["," STMT]* "]"
Where STMT is given by:

    STMT ::= EXPR | PAT "<-" EXPR
The problem is that, until you see that "<-", you don't know whether you're supposed to be parsing a pattern PAT or an expression EXPR. This is extra hard because patterns and expressions overlap significantly; for example, "(x, Left 2)" could be either a pattern or expression. But "(x, Left (2+3))" is definitely an expression. You can get arbitrarily deep into parsing a pattern before you realize it's actually an expression!

Recursive descent parsers, even with precedence climbing, can't really handle this nicely. Neither can LL or LR parsers. In fact, GHC's actual parser uses a hack: it parses patterns as expressions, and as a post-processing step checks that the expression it parsed actually was a valid pattern! This works IF your patterns are a subset of your expressions. But if there are patterns that aren't valid expressions (Racket's match-patterns have this property), then you need to get even cleverer.
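
As a toy illustration of that hack (nothing like GHC's real code; the AST shapes here are invented), the post-processing step is just a recursive predicate over the tree that came out of the expression parser:

    def is_valid_pattern(expr):
        """A parsed expression counts as a pattern only if it is built from
        variables, literals, and constructor applications -- no arithmetic etc."""
        kind = expr[0]
        if kind in ("var", "lit"):
            return True
        if kind == "con":                     # e.g. ("con", "Left", [args...])
            return all(is_valid_pattern(a) for a in expr[2])
        return False                          # ("binop", ...), ("app", ...), ...

    # Left 2       -> also a valid pattern
    print(is_valid_pattern(("con", "Left", [("lit", 2)])))
    # Left (2 + 3) -> expression only
    print(is_valid_pattern(("con", "Left", [("binop", "+", ("lit", 2), ("lit", 3))])))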

General-purpose parsing algorithms like GLR, GLL, parsing with derivatives, and Earley parsing (which Kegler advocates) can handle this, of course, although I'm not sure how efficiently they do it.


I argue the parsing "problem" is not really a "problem" (for most people, anyway).

When you view something as a problem, the problem should have an input and a desired output. Calling parsing a "problem" assumes that the input is a grammar and the output is a (hopefully efficient) function from a sequence of symbols to a parse tree, or to a boolean flag certifying that the grammar accepts the given string. Naturally, solving this "problem" requires a general algorithm that works well for most or all varieties of grammars.

However, this setting is far from current practice: people simply avoid grammars that are hard to parse in the first place! Of course many practical grammars have ambiguities that require hacks, but there seems to be a maximum number of hacks people will tolerate, and they ditch a grammar requiring too many of them. Limiting l-value syntax to a subset of r-value syntax works because there is generally only one such spot in the entire grammar. Well, even in the (LA)LR age people just resorted to semi-automatic shift-reduce conflict resolution without much thought. So the grammar is not an uncontrollable input; in many cases it is flexible enough to make the whole parsing "problem" irrelevant.

I still believe that having a solution to the general parsing problem is nice (granted, there are still a lot of applications where you have no control over the input grammar), but I think its importance is somewhat overvalued.


> You can get arbitrarily deep into parsing a pattern before you realize it's actually an expression!

C++ has a similar ambiguity with function definitions and object initialisers.

Thankfully, it's an easily solved problem --- recursive descent with branching handles it just fine. In your example, when it reaches STMT, it will start parsing both EXPR and PAT in parallel, and terminate the EXPR branch when "<-" is encountered. In general, the idea is to split whenever the current position is ambiguous and proceed until one of the branches can't continue (or, if both branches remain valid, choose one or report an error). Then whatever is left must be the correct parse.
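
Here is a rough, runnable Python sketch of that idea under a drastically simplified grammar (patterns are variables or tuples of variables; expressions are variables, number literals, or tuples of expressions). It uses backtracking rather than truly parallel branches, but the control flow is the point: try the PAT "<-" EXPR reading, and fall back to plain EXPR if it doesn't pan out:

    class ParseError(Exception):
        pass

    def parse_pattern(toks, i):
        if i < len(toks) and toks[i].isidentifier():
            return ("pvar", toks[i]), i + 1
        if i < len(toks) and toks[i] == "(":
            elems, i = [], i + 1
            while True:
                p, i = parse_pattern(toks, i)
                elems.append(p)
                if toks[i] == ")":
                    return ("ptuple", elems), i + 1
                if toks[i] != ",":
                    raise ParseError("bad pattern")
                i += 1
        raise ParseError("bad pattern")

    def parse_expression(toks, i):
        if i < len(toks) and (toks[i].isidentifier() or toks[i].isdigit()):
            return ("eatom", toks[i]), i + 1
        if i < len(toks) and toks[i] == "(":
            elems, i = [], i + 1
            while True:
                e, i = parse_expression(toks, i)
                elems.append(e)
                if toks[i] == ")":
                    return ("etuple", elems), i + 1
                if toks[i] != ",":
                    raise ParseError("bad expression")
                i += 1
        raise ParseError("bad expression")

    def parse_stmt(toks, i):
        try:                                    # branch 1: PAT "<-" EXPR
            pat, j = parse_pattern(toks, i)
            if j < len(toks) and toks[j] == "<-":
                expr, j = parse_expression(toks, j + 1)
                return ("bind", pat, expr), j
        except ParseError:
            pass
        expr, j = parse_expression(toks, i)     # branch 2: plain EXPR
        return ("guard", expr), j

    print(parse_stmt(["(", "x", ",", "y", ")", "<-", "pairs"], 0))
    print(parse_stmt(["x"], 0))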


This does work, and is available in e.g. the Parsec library for Haskell, but

(1) can lead to exponential worst-case backtracking behavior.

(2) can produce very unhelpful error messages without careful engineering.

In this case, for list comprehensions, I think you can show that you won't have exponential backtracking behavior. But it involves carefully thinking about whether patterns can nest inside of expressions nesting inside of patterns nesting... etc.

It's probably still the best solution if you're writing your own parser, though.


There is very little sport in inventing syntaxes that are hard to parse. My response, like the doctor responding to the patient who complains, "It hurts when I poke myself in the eye", is: don't do that.


On the one hand, you're right. Python takes exactly this approach: it disambiguates using "for" and "if":

    LISTCOMP ::= "[" EXPR "for" PAT "in" EXPR STMT* "]"
    STMT ::= "if" EXPR | "for" PAT "in" EXPR
This is LL(1) and trivial to parse recursively.
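
For contrast, here is what the STMT dispatch looks like for that grammar (a sketch reusing the toy ParseError/parse_pattern/parse_expression from the backtracking sketch a few comments up): a single token of lookahead picks the production, so there is never anything to back out of:

    def parse_stmt(toks, i):
        if i < len(toks) and toks[i] == "if":
            expr, i = parse_expression(toks, i + 1)
            return ("if", expr), i
        if i < len(toks) and toks[i] == "for":
            pat, i = parse_pattern(toks, i + 1)
            if i >= len(toks) or toks[i] != "in":
                raise ParseError("expected 'in'")
            expr, i = parse_expression(toks, i + 1)
            return ("for", pat, expr), i
        raise ParseError("expected 'if' or 'for'")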

On the other hand, Haskell list comprehensions really aren't hard to parse... for humans! Most people I've talked to about this have no idea that Haskell comprehensions are formally so much harder to parse than Python list comprehensions. The difficult cases just don't arise in practice.

And there are plenty of places where we accept problems or algorithms which are ridiculously inefficient in theory, but fine in practice. Damas-Milner type inference, for example, takes time O(2^2^PROGSIZE) in the worst case - doubly exponential! Far worse (in theory) than the O(n^3) of general CFG parsing algorithms (which can be quite simple to implement!).

And this problem doesn't just come up in Haskell list comprehensions, but also with monadic do-notation. I'd much rather write:

    do (x,y) <- foo
       z <- bar x
       return (x + y + z)
than:

    do for (x,y) <- foo
       for z <- bar x
       return (x + y + z)
The extra "for" (or whatever keyword you want to put there) is just syntactic noise to a human, however much easier it makes it for a computer to parse it.

So I'm of two minds, really.


I think humans have an easy time because it's easy for them to look ahead (literally!) to see if there's a <- token. So if I were going to tackle this problem I would do it in two passes. In the first pass I'd just tokenize everything, and set a flag if I saw "<-". Then I'd use the value of that flag to inform the second pass about what it was parsing.
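
Something like this, say (a sketch reusing the toy helpers from the backtracking example above; the depth counter is only there so a "," inside a tuple doesn't end the scan early):

    def parse_stmt_two_pass(toks, i):
        # Pass 1: scan ahead to the end of this statement, looking for "<-".
        j, depth, has_arrow = i, 0, False
        while j < len(toks) and not (depth == 0 and toks[j] in (",", "]")):
            if toks[j] in ("(", "["):
                depth += 1
            elif toks[j] in (")", "]"):
                depth -= 1
            elif toks[j] == "<-" and depth == 0:
                has_arrow = True
            j += 1
        # Pass 2: the flag tells us which grammar we're in; no backtracking needed.
        if has_arrow:
            pat, i = parse_pattern(toks, i)
            assert toks[i] == "<-"
            expr, i = parse_expression(toks, i + 1)
            return ("bind", pat, expr), i
        expr, i = parse_expression(toks, i)
        return ("guard", expr), i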


Not at all experienced in Lisp, but I believe this is the core algorithm in that parser...

      (cond
       ((and (binop? $next) (> new-priority priority))
        (scan)
        (loop (list op result (parse-expression new-priority))))
       (t result)))))
...and is what amounts to the widely-used "precedence climbing" algorithm described here:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing

I like to think of it more as a clever refactoring of recursive descent: it condenses all the very similar recursive functions, one per precedence level, into a single function parameterised by level, and it handles associativity by choosing whether to recurse (right-associative) or loop (left-associative) on the RHS after a binary operator.
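
In Python, that refactoring comes out to something like this (operator table and token format invented for the example; operands are single tokens to keep it short):

    PREC  = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
    RIGHT = {"^"}   # right-associative operators

    def parse_expr(toks, i, min_prec=1):
        lhs, i = toks[i], i + 1
        while i < len(toks) and toks[i] in PREC and PREC[toks[i]] >= min_prec:
            op, prec = toks[i], PREC[toks[i]]
            # Right-associative: allow the same level again on the right (recurse).
            # Left-associative: require strictly higher on the right, and keep
            # looping here to build the chain left to right.
            next_min = prec if op in RIGHT else prec + 1
            rhs, i = parse_expr(toks, i + 1, next_min)
            lhs = (op, lhs, rhs)
        return lhs, i

    print(parse_expr("a - b - c * d".split(), 0)[0])
    # ('-', ('-', 'a', 'b'), ('*', 'c', 'd'))
    print(parse_expr("a ^ b ^ c".split(), 0)[0])
    # ('^', 'a', ('^', 'b', 'c'))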


Thanks for that pointer. I had no idea that what I was doing was in any way noteworthy. It just seemed kinda obvious.


The problem with that is that math expressions are just another grammar from the point of view of the parser generator, so if that is handled incorrectly then the parser generator is broken. You see this problem in recursive descent parser generators that try to handle left recursion. In other words, incorrect handling of precedence is just one symptom of the bug. If you hand write the parser then you can work around that on a case by case basis, but a parser generator should be correct on all grammars.


Hey, do you know of any papers or other writing about this? (Or even keywords that would aid a search.) Reading code is one thing, but being able to read code alongside a paper or web article gives me more ways into the material and makes it much more digestible for me.


No paper, sorry. I kinda wrote it by the seat of my pants. The main point is that I could write it by the seat of my pants, and it works, and the code is pretty short and IMHO readable.


The thing is that writing a parser requires a person to understand what a formal language is. Overall, only a subset of programmers understand even this, so parsing has a certain inherent hardness to it (you can't just use a library or just use an object).

Of course, the problem of how to create a parser is solvable any number of ways, if you mean how to convert an unambiguously specified formal language into a parser. But that doesn't mean basic challenges don't remain. Especially because a formal language is hard to understand (and can be ambiguous), and because one wants the language to actually do something, there is further trickiness involved (you have to bridge the gap between syntax and semantics). So how to solve the problem of parsing becomes a complex decision. But it's not so much "we don't know how to efficiently do this yet" as "there is no one-size-fits-all approach."


You'd be surprised how many programmers don't understand (or at least don't think about) formal languages and still manage to write a parser. Heck, it explains why a lot of languages have bizarre, hacky syntax.

The problem of writing a high-performance parser and the problem of writing a parser at all are fairly separate. I've written many parsers during my career but don't consider myself a parsing person by any means (though the number of people who have written incremental parsers for IDEs is probably countable on one or two hands; most people don't think of that as a parsing problem).


Superb website with loads of content.

This https://jeffreykegler.github.io/personal/timeline_v3 is also well worth your time.


lol this dude hates PEG parsers


If parsing is "complicated", then there's another solution. Don't play the game. Change the rules. Play a different game.

My understanding (and, since this is the Interwebs, I will quickly be corrected if I'm wrong) is that Python is easy to parse; a lot of the battles about adding features to the language involve keeping the grammar simple.

And yet Python is eminently useful, despite being simple to parse.

I'm reminded of how we didn't understand how to specify a simple grammar in the "good old days". E.g. take ancient FORTRAN.

The for-loop in FORTRAN is actually called DO. And you specify the end of the loop with a numeric statement label (found in columns 1 thru 5). Thus:

      DO 10 I = 1, 7
      some stuff here, loop done for I = 1,2,3,4,5,6,7
   10 final line of loop
But spaces aren't significant. So if you write the following statement

      DO 10 I = (1, 7)
You get something totally different. You set the value of the complex variable DO10I to (1,7). Bheech. Who wants to parse that? (And yet, there were very capable FORTRAN compilers back in the 1960s!)


> Python is easy to parse

Lisp is even easier.


And Forth easier still.


Yes... and, far more accurately, no: you can't actually write a parser for Forth with a fixed grammar; you can only write a complete interpreter for it, as it is capable of modifying its own grammar on the fly. It is possible to define new words which, when executed, take over the input stream and do arbitrary things.


Those tricks will not necessarily compile right though. Forth is a compiled language, if implemented completely.


Those tricks certainly are necessary to compile Forth; it's common to define words that create new words for custom data structures, which extend the grammar of Forth. Any time you use 'create ... does>' you are in effect extending the grammar in an ad-hoc way.


Isn't Forth actually a regular language, and parsable by a finite state machine?


On the contrary, it requires a Turing machine to parse fully.


Which is typically implemented in Forth. (The same is true of Common Lisp. But that in and of itself does not make it hard to parse.)


Yes it does; in fact, it is pretty much the canonical definition of something being hard to parse. Given an arbitrary Forth program you can't even tell if the parser will terminate. Granted, the non-fixed grammar starts off simple, but it can be made arbitrarily complex.


IMHO "hard to parse" means that writing a parser that works requires a lot of effort. In the case of both Forth and Lisp, that is not the case. Writing a working parser for either language is an elementary exercise, notwithstanding that the syntax can be arbitrarily extended by the user.


I’m a little surprised that ANTLR and L* don’t make the list (ANTLR from the practitioner POV and L* from theory).


I'm sorry, but this is just rambling. He goes on about theorists and practitioners without actually saying anything about the problem at all. He doesn't explain why he thinks the current state of the art isn't the solution. What does he want to do that isn't handled well?

There are many ways in which parsing can be improved in practice and theory. Actual technical aspects would be more interesting.



