The difference between top-down parsing and bottom-up parsing (qntm.org)
81 points by zniperr on Jan 4, 2015 | 43 comments



I really didn't understand grammars until I started doing hand-rolled top-down recursive descent parsing for Gosu.

It's a shame that parsing in CS schools is taught using theory and lex/yacc type tools when a basic lexer and recursive descent parser can be written from first principles in a few months. It is more incremental, and it gives you a much deeper feel for the concepts, plus it lets you learn a bit about software engineering and organization as well.
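
For anyone curious what that looks like in practice, here's a minimal sketch of a hand-rolled lexer and recursive descent parser for a toy expression grammar (Python, purely illustrative; it has nothing to do with Gosu's actual parser):

    # Toy grammar:
    #   expr   -> term (('+'|'-') term)*
    #   term   -> factor (('*'|'/') factor)*
    #   factor -> NUMBER | '(' expr ')'
    import re

    def tokenize(src):
        # crude lexer: numbers and single-character operators, whitespace ignored
        return re.findall(r'\d+|[-+*/()]', src)

    class Parser:
        def __init__(self, tokens):
            self.tokens, self.i = tokens, 0

        def peek(self):
            return self.tokens[self.i] if self.i < len(self.tokens) else None

        def eat(self, expected=None):
            tok = self.peek()
            if expected is not None and tok != expected:
                raise SyntaxError('expected %r, got %r' % (expected, tok))
            self.i += 1
            return tok

        def expr(self):                      # expr -> term (('+'|'-') term)*
            node = self.term()
            while self.peek() in ('+', '-'):
                node = (self.eat(), node, self.term())
            return node

        def term(self):                      # term -> factor (('*'|'/') factor)*
            node = self.factor()
            while self.peek() in ('*', '/'):
                node = (self.eat(), node, self.factor())
            return node

        def factor(self):                    # factor -> NUMBER | '(' expr ')'
            if self.peek() == '(':
                self.eat('(')
                node = self.expr()
                self.eat(')')
                return node
            return int(self.eat())

    print(Parser(tokenize('1 + 2 * (3 + 4)')).expr())
    # -> ('+', 1, ('*', 2, ('+', 3, 4)))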


Parsing and compilers are an area where I think "the theory gets in the way"; I started by reading the Dragon Book, and while I (vaguely) understood what it was talking about, the feeling it gave was "parsers are complex, use a parser generator instead"... which sounds like a good idea until you actually try to debug one. I eventually realised that recursive descent was much simpler, and with things like the "Let's Build a Compiler" articles everything became so clear and it felt like all that theory I learned before was mostly useless. I think the fact that top-down parsers are theoretically less powerful doesn't make much difference in practice; and the trend of compilers (in C at least) now seems to be to move away from generated parsers and to recursive descent.

Big, "production quality" compilers like GCC and LLVM use RD, and so do small ones like (o)tcc and various others' "toy compiler" projects (e.g. https://news.ycombinator.com/item?id=8558822 and https://news.ycombinator.com/item?id=8576068) The "precedence climbing" method (http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm ), which simplifies and makes RD more efficient, is common too but I don't think I've seen it appear in any of the usual academic texts.


Funnily enough, I found the Dragon Book way too applied and not theoretical enough. Perhaps it was just a different feeling about the same problem: the Dragon Book is just not very good anymore.

Something like Parsec seems like a good introduction into writing your own parsers. You can figure out formal grammars later.


It is a shame that Gosu has not become more popular as a JVM language. You guys were doing things that took other languages years to catch up.


Yep, them's the breaks...

We are still working on it and it has a long life ahead of it, since it is crucial to many very large, un-sexy companies.


Indeed. Also Groovy++ (aka Groovy with static typing) is worth mentioning.


Or even Beanshell. In fact, the ideas behind Beanshell (i.e. closures on the JVM) and Groovy++ (i.e. static typing) were scooped up and copied into Groovy, that kitchen sink language, which was bundled as part of Grails, that knock-off of Rails for the JVM.

The software industry is divided into "Makers" and "Takers". The people behind Gosu are obviously Makers, whereas those presently behind Groovy are Takers.


> "...copied into Groovy, that kitchen sink language, which was bundled as part of Grails, that knock-off of Rails for the JVM. The software industry is divided into 'Makers' and 'Takers'. The people behind Gosu are obviously Makers, whereas those presently behind Groovy are Takers."

This is all wrong. Groovy predates Grails, as Grails was written in Groovy. Gosu and Groovy came out at roughly the same time, and IIRC, Groovy had closures and features like the Elvis operator before Gosu did.

You apparently don't like Groovy's current owners, but slamming Groovy for implementing static typing and qualifying them as takers for doing so makes no sense at all.


I've gotten the impression over the years that Grails is written in Java and only bundles Groovy for user scripting, but I haven't checked up recently. I certainly looked through the Gradle codebase when its version 2.0 came out and there was hardly a line of Groovy code in there, only Java. It, too, only uses Groovy for user scripting. If I'm right about the Grails codebase, static-typed Groovy isn't actually used to build any systems of note. Perhaps Cedric Champeau, who wrote the static-typed Groovy codebase, will use it to build the Groovy for Android he's presently working on. It'll be interesting to see whether he uses it or uses Java because it'll show whether he has enough faith in what he built to actually use it to build something with it. Until static-typed Groovy is used to build something, Groovy will remain a dynamically-typed scripting language for testing, Gradle builds, webpages, etc.


I'm a sophomore at Ohio State, and context-free grammars are actually a major topic in one of our early software courses. We also got to write a recursive descent parser for a very simple made-up language as one of our projects. Really exciting curriculum over here.


(facetious-comment "It's a tragedy how much brilliance is wasted on grammar parsing when it's a solved problem; just use a Lisp")


This was advocated by https://cs.brown.edu/courses/cs173/2012/lang/ too. Parsing is fun, but in the end I can't disagree; syntax is rarely worth that much.


JSON or XML would also work. Except that few people like languages based on XML, and I haven't seen anyone seriously try JSON.

Perhaps someone should try to build a JSON-like language that's close to how most programmers like to write their code?


Yeah, this is why no one likes languages based on XML: http://thedailywtf.com/articles/We-Use-BobX

If you were going to do it, I'm certain you'd want to use a more human-writable format such as YAML or TOML instead of JSON or XML.

But doing so means you're living out Greenspun's Tenth Rule. Again, just use a Lisp. http://www.defmacro.org/ramblings/lisp.html


I know BobX is supposed to be a troll language, but it bothers me to no end that the example code uses <xbobendif> instead of the more XML-spirited </xbobif>. I believe this confirms that I suffer from OCD.


Then again, this is just moving the burden from the language implementation to the developer.


Bottom-up parsing - "If not, it's necessary to backtrack and try combining tokens in different ways." I feel the way this is put alongside shift-reduce parsing is misleading. Backtracking is essentially an aspect avoided (more like solved) by shift-reduce parsing; the two don't go together in bottom-up parsing. Shift-reduce parsers have the potential to predict the handle to use by looking at the contents on top of the stack.
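
To make the handle-on-the-stack idea concrete, here's a deliberately naive sketch (Python) for the toy grammar E -> E '+' T | T, T -> 'int'. A real shift-reduce parser consults an LR table to decide between shifting and reducing; this toy version just prefers the longest handle sitting on top of the stack:

    RULES = [
        ('E', ('E', '+', 'T')),   # check the longest handle first
        ('E', ('T',)),
        ('T', ('int',)),
    ]

    def shift_reduce(tokens):
        stack, queue = [], list(tokens)
        while True:
            for lhs, rhs in RULES:
                if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                    stack[-len(rhs):] = [lhs]      # reduce: replace the handle on top
                    break
            else:                                  # no handle on top of the stack
                if queue:
                    stack.append(queue.pop(0))     # shift the next token
                else:
                    return stack == ['E']          # accept iff everything reduced to E

    print(shift_reduce(['int', '+', 'int', '+', 'int']))   # True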

Good job BTW, there are very few people who write about compiler/language theory :)


For unrestricted CFG parsing, shift-reduce parsing does require backtracking. A modern parsing algorithm like GLR uses LR tables but has to resort to backtracking when it encounters shift-reduce conflicts (it avoids the exponential blow-up of backtracking by using dynamic programming to share results for subproblems).


As in my other comment, I agree with this. I missed mentioning my assumption of a forced shift or reduce action in the case of an ambiguous grammar.


Not a fan of handling it that way. Shift-reduce conflicts don't correspond to people's intuitive understanding of the grammar because good shift-reduce parser tables are highly non-local as they're constructed from fixed-point dataflow iteration over the whole grammar. It's not like top-down parsing where it's easy to understand the compromise of making a non-backtrackable decision based on k-token lookahead. If you force them to manually resolve shift-reduce conflicts in the grammar, it's likely they will be surprised by what happens, and not in a good way.


Yes, it definitely isn't the right way to handle ambiguous grammars. My assumption came from extensive use of YACC for writing parsers in the past. To avoid its default action, ambiguous grammars are generally avoided (since it is used only to write programming language parsers anyway).


You need backtracking if you have an ambiguous grammar. This comes up in natural language parsing; I'd guess that it is avoided in programming languages.


Actually, many languages which started with hand written parsers do have ambiguous grammars, or have an unambiguous grammar so hideously unwieldy it is best ignored.

This sort of thing can be fixed up with predicates added to the grammar but it always feels like a bodge.


I totally agree with this. I was assuming the case of a default behavior like forcefully shifting or reducing for ambiguous grammars, which I should have mentioned. Thanks.


One more difference: top-down parsing is European and bottom-up - American :)


I'm not sure why this was downvoted - it's true(ish), and interesting. Historically, American computer scientists preferred LR parsers, and Europeans preferred LL parsers. That influenced the languages they designed.

I think I read about this in Sedgewick's 'Algorithms in C', although I could be wrong; I struggle to find any online citation. It was mentioned in the Wikipedia article at one point, but disappeared in an edit described as "Removed heresay":

http://en.wikipedia.org/w/index.php?title=LL_parser&directio...


That is quite interesting. At one point the main language for AI in Europe was Prolog and in America it was LISP, or so I read. I wonder in what other instances this kind of cultural difference occurred, and whether the internet has had any effect on this.


> AI for Europe was Prolog and in America it was LISP,

I wouldn't say 'main language', but Prolog was maybe a little more popular in Europe. The Japanese, though, based their 'fifth generation' project on logic programming. I once saw a Prolog Machine, a computer with an architecture optimized for Prolog and the main software written in Prolog:

http://museum.ipsj.or.jp/en/computer/other/0009.html

$119,000 apiece...


I was so happy when I finally grokked LR parsers: it's just a big state machine! _If_ that token is found, push it on the stack and go to _that_ other state. Consume the token, check the next state transition.
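
For what it's worth, here's roughly what that state machine looks like, with the table written out by hand for the toy grammar E -> E '+' 'n' | 'n' (a sketch in Python; a generator would normally build the table for you):

    # ACTION[state][token] is ('shift', next_state), ('reduce', lhs, rhs_len), or 'accept'.
    ACTION = {
        0: {'n': ('shift', 2)},
        1: {'+': ('shift', 3), '$': 'accept'},
        2: {'+': ('reduce', 'E', 1), '$': ('reduce', 'E', 1)},   # E -> n
        3: {'n': ('shift', 4)},
        4: {'+': ('reduce', 'E', 3), '$': ('reduce', 'E', 3)},   # E -> E + n
    }
    GOTO = {0: {'E': 1}}

    def lr_parse(tokens):
        stack = [0]                                # stack of states
        tokens = list(tokens) + ['$']
        i = 0
        while True:
            action = ACTION[stack[-1]].get(tokens[i])
            if action is None:
                return False                       # syntax error
            if action == 'accept':
                return True
            if action[0] == 'shift':
                stack.append(action[1])
                i += 1
            else:                                  # reduce: pop the handle, then goto
                _, lhs, rhs_len = action
                del stack[-rhs_len:]
                stack.append(GOTO[stack[-1]][lhs])

    print(lr_parse(['n', '+', 'n', '+', 'n']))     # True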

But while fun, it was pretty much useless, because recursive descent and combinators are so much easier.


That moment when you finally get it that CFG parsers can be implemented using push-down automata... :-)



I think I read somewhere that incremental parsers are better off being written as bottom-up parsers rather than top-down parsers. The reason was that when a small edit is made to the code being parsed, the artifact from a bottom-up parser often needs only a minor change that ripples only as far as it needs to, whereas the top-down parser needs to be completely rerun because it can't tell whether the effect of one small edit is large or localized. Anyone out there who can verify or refute this?


I've written parsers both top-down and bottom-up. Note that neither is naturally incremental, and some memoization is required: for bottom-up, you check whether the parent you want to create already exists; for top-down, the child. It is sort of a wash which one is better, but both are pretty trivial.

These days I use top-down recursive descent backed by memoization. When a change to the token stream comes in, you simply damage the innermost tree that contains the token, or, if it's on a boundary, damage multiple trees.


Thanks. I've been writing and using recursive descent parsers a little lately, both Parsec-style and Packrat-style. If it's possible to do decent incremental parsing without wrapping my head around LR-parsing, then I'll give it a try sometime.


It is actually quite easy if you have a memoized token stream to work on. Just store your trees based on their first token. When you do a parse, check to see if a tree of that type already exists for that token. If it does, just reuse it, and if there is any context to consider, dirty the tree if the context of its creation has changed.
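
Something like this, leaving out dirtying and context for brevity (the names are made up):

    # Trees are cached by (identity of their first token, rule name).
    tree_cache = {}

    def parse_with_reuse(rule_name, tokens, pos, parse_fn):
        key = (id(tokens[pos]), rule_name)
        if key not in tree_cache:
            tree_cache[key] = parse_fn(tokens, pos)   # parse as usual and remember the tree
        return tree_cache[key]                        # otherwise reuse the existing tree

    # Toy usage: "parsing" a single token into a one-node tree.
    tokens = ['a', 'b']
    make_leaf = lambda toks, i: {'rule': 'leaf', 'text': toks[i]}
    t1 = parse_with_reuse('leaf', tokens, 0, make_leaf)
    t2 = parse_with_reuse('leaf', tokens, 0, make_leaf)
    print(t1 is t2)   # True: the second parse reused the first tree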

There are more advanced incremental parsing algorithms that do not require memoization, but they really aren't worth it (the point of incremental parsing is not really performance, but keeping around state associated with the tree and doing better error recovery).


Could you expand on this? What's a memoized token stream?


A token stream that preserves token identities. Say you have a token stream of [t0, t1] and add a new token t2 between t0 and t1 to get [t0, t2, t1]. What you want is to be able to identify t0 and t1 in the new stream as being the same tokens as in the old stream. If you simply re-lex to get new tokens, that won't happen, and if you use character offsets as your token identities, t1 in the new and old stream will have different identities.

Incremental lexing is pretty easy: just convert the tokens around the damage back into text, re-lex that text, and then work from front and back of the new partial stream to replace new tokens with old existing tokens if their lexical categories match (e.g. an old identifier's text can be updated with a new identifier's contents, but that is ok because parsing only depends on the fact that the token is an identifier). You might not win any performance awards, but those reused old tokens make very good keys into various data structures for incremental parsing, incremental type checking, and even incremental re-execution (if you are into live programming).
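
A runnable sketch of that front/back matching step, with made-up names and a trivial lexer:

    import re

    class Token:
        def __init__(self, kind, text):
            self.kind, self.text = kind, text

    def lex(text):
        return [Token('num' if t.isdigit() else 'op', t)
                for t in re.findall(r'\d+|[-+*/]', text)]

    def relex(old_tokens, new_text):
        fresh = lex(new_text)
        # from the front: same lexical category => keep the old token's identity
        i = 0
        while i < min(len(old_tokens), len(fresh)) and old_tokens[i].kind == fresh[i].kind:
            old_tokens[i].text = fresh[i].text
            fresh[i] = old_tokens[i]
            i += 1
        # from the back, the same way, stopping before the front-matched region
        j = 1
        while j <= min(len(old_tokens), len(fresh)) - i and old_tokens[-j].kind == fresh[-j].kind:
            old_tokens[-j].text = fresh[-j].text
            fresh[-j] = old_tokens[-j]
            j += 1
        return fresh

    old = lex('1 + 2')
    new = relex(old, '1 + 42 * 3')
    print(new[0] is old[0], new[1] is old[1])   # True True: '1' and '+' keep their identity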


> convert the tokens around the damage back into text, re-lex that text, and then work from front and back of the new partial stream

I haven't tried it and I'm no parsing expert, but it seems that when a rule is specified "forwards" (i.e. beginning-to-end), it can be used for lexing (or parsing) forwards, but there'd be problems using it to lex (or parse) backwards. Hence the first "L" in "LL" and "LR" parsing.

Maybe an algorithm could use both directions running in parallel and create 2 structures/memoizations, one for the forward parse and one for the backward, hence a dual LL/RL or LR/RR parse, and access both structures to repair "damage". However, there could be another problem: if the lookahead for some lex or parse rule is 1, couldn't that translate to any lookbehind value when lexing or parsing backwards? An LL(1) grammar might require an equivalent RL(k) grammar for some very high k. Of course you'd also need to avoid right recursion as well as left recursion for topdown parsing.

I haven't tried any of this but these are the things that cross my mind whenever I think of incremental parsing!


Thanks. To figure out what's damaged, it seems like you have to do a diff somewhere? It sounds like this is done at the character level rather than the token level?


No diff, just do the re-lex and damage the tree on the boundaries of where the user typed. My techniques (also, see glitch) usually just assume a change has happened, since the reprocessing is incremental anyway and won't cascade if nothing has really changed (when reprocessing a tree, the parent is damaged if the end token changed; I guess you could call that a diff).


The LALR(1) subset is pretty simple: a table of states that can shift or reduce, and finally accept or error. It's not too hard to get the Dragon Book and write an LALR(1) parser for a calculator by hand. Plenty of lectures out there explain it. If I can write a lexer/parser from scratch after hearing the explanation of LALR(1), anyone can.

LR(k) would be more complicated. Start with the smaller CFG subset and work up to it.


What do you call it if you're just using a big list of regexes? I've seen that used for a simple dialog scripting language in a game.


If the language is simple enough to be parsed with regular expressions alone, then it doesn't need a context-free grammar, so the bottom-up/top-down distinction does not apply. It also means the language cannot have recursive productions, e.g. it cannot support nested expressions.

But it depends on how the list of regexes is used. If it is used as part of a recursive parsing routine, then it is a recursive-descent parser where lexing and parsing happen in the same pass.
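
The second case might look something like this (a made-up mini-language of nested word lists), where the regexes are just the lexing half of a recursive descent parser:

    import re

    WORD = re.compile(r'\s*(\w+)')
    OPEN = re.compile(r'\s*\(')
    CLOSE = re.compile(r'\s*\)')

    def parse_items(src, pos):
        items = []
        while True:
            m = OPEN.match(src, pos)
            if m:                                   # '(' ... ')' recurses
                inner, pos = parse_items(src, m.end())
                close = CLOSE.match(src, pos)
                if not close:
                    raise SyntaxError('expected ) at %d' % pos)
                items.append(inner)
                pos = close.end()
                continue
            m = WORD.match(src, pos)
            if m:                                   # a bare word
                items.append(m.group(1))
                pos = m.end()
                continue
            return items, pos

    print(parse_items('a (b (c d) e)', 0)[0])       # ['a', ['b', ['c', 'd'], 'e']]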



