
Yeah, that's accurate! I'm glad we finally know the running time for parsing with derivatives.

Let me add some more background, about stuff that I would expect to be covered in this paper's related work section, except that it doesn't have one...

Most well-known parsing algorithms that handle arbitrary CFGs run in cubic time, so this paper's O(n^3) running time is about the best you would expect. Other parsing algorithms that handle arbitrary CFGs in cubic time include CYK[1], Earley[2], and GLR[3]. I used to be excited about parsing with derivatives because of its simplicity, but CYK and Earley parsers are both actually simpler than parsing with derivatives, once you've thrown in this paper's optimizations.

[1] https://en.wikipedia.org/wiki/CYK_algorithm [2] https://en.wikipedia.org/wiki/Earley_parser [3] https://en.wikipedia.org/wiki/GLR_parser

[EDIT: Previously, I stated "CFGs, in general, can't be parsed in better than cubic time.", which is incorrect. Thanks, pacala.]
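To make the derivative idea concrete, here's a minimal sketch for plain regular expressions (the CFG version in the paper layers laziness, memoization, and fixed points on top of this; the encoding and names here are mine, not the paper's):

```python
# Brzozowski derivatives for regexes, as tagged tuples (illustrative only).
EMPTY, EPS = ("empty",), ("eps",)  # matches nothing / matches ""

def char(c): return ("char", c)
def cat(a, b): return ("cat", a, b)
def alt(a, b): return ("alt", a, b)
def star(a): return ("star", a)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == "eps" or tag == "star": return True
    if tag == "empty" or tag == "char": return False
    if tag == "cat": return nullable(r[1]) and nullable(r[2])
    if tag == "alt": return nullable(r[1]) or nullable(r[2])

def deriv(r, c):
    """The language of deriv(r, c) is {w : c+w matched by r}."""
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "char": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return cat(deriv(r[1], c), r)
    if tag == "cat":
        d = cat(deriv(r[1], c), r[2])
        return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:            # push each character into the parser
        r = deriv(r, c)
    return nullable(r)
```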




CFG parsing can be done in better than O(n^3) time by reducing it to Boolean matrix multiplication, which takes about O(n^2.37).

http://theory.stanford.edu/~virgi/cs367/papers/valiantcfg.pd...

https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

Agreed that CYK and Earley are rather simple to implement, though Earley has an edge: its best case is O(n), whereas CYK's best case is O(n^2).
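For reference, here is roughly what CYK looks like as a compact recognizer for a grammar in Chomsky normal form (my own toy rule encoding, not any particular library). The three nested loops over span length, span start, and split point are exactly where the O(n^3) comes from:

```python
def cyk(word, start, unit, binary):
    """unit: list of (A, terminal); binary: list of (A, B, C) for A -> B C."""
    n = len(word)
    if n == 0:
        return False
    # T[(i, l)] = set of nonterminals deriving the length-l substring at i
    T = {}
    for i, c in enumerate(word):
        T[(i, 1)] = {A for A, a in unit if a == c}
    for l in range(2, n + 1):           # span length
        for i in range(0, n - l + 1):   # span start
            cell = set()
            for k in range(1, l):       # split point -> cubic overall
                left, right = T[(i, k)], T[(i + k, l - k)]
                cell |= {A for A, B, C in binary
                         if B in left and C in right}
            T[(i, l)] = cell
    return start in T[(0, n)]
```

Usage with a toy CNF grammar for { a^n b^n : n >= 1 } (S -> AB | AX, X -> SB, A -> a, B -> b): `cyk("aabb", "S", [("A","a"),("B","b")], [("S","A","B"),("S","A","X"),("X","S","B")])` accepts, while `"aab"` is rejected.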


> CFG parsing can be done better than O(n^3), reducing to Boolean matrix multiplication, which is about O(n^2.37).

Is there a known lower bound on running time to parse CFGs, like O(n^2)?


It is known that any fast CFG parsing algorithm can be turned into a fast boolean matrix multiplication algorithm.

So if CFG parsing was n^2, so is boolean matrix multiplication :)

See http://arxiv.org/abs/cs/0112018


One important note is that this holds for general grammars, which can contain anything. If the grammar is restricted somehow, for example to LR(k), then the best complexity is linear (there are even parsers that manage both LR(k) grammars in linear time and general parsing in cubic time).


I'm confused by this argument. To show that

> if CFG parsing was n^2, so is boolean matrix multiplication

wouldn't you need to show that any boolean matrix multiplication can be turned into CFG parsing (and not the converse)?


Great question, I just learned something.

A boolean matrix multiplication problem can be turned into a CFG problem. Therefore, an algorithm that solves CFG problems can be turned into an algorithm that solves boolean matrix multiplication.
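For concreteness, boolean matrix multiplication is computing (XY)[i][j] = OR over k of (X[i][k] AND Y[k][j]). The naive version below is the same kind of triple loop as CYK's table fill, which is part of why the two problems end up linked:

```python
def bool_matmul(X, Y):
    """Naive O(n^3) boolean product of two n x n matrices of bools."""
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]
```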


Sorry, yes, I deleted a line accidentally. It turns out you can show both directions, and to prove a tight bound you need to show they reduce to each other without changing the complexity class.


Out of curiosity, do any of the other algorithms have the ability to be "paused" arbitrarily, i.e. with inverted control? I'm thinking specifically of parsing input from an asynchronous interface.

Derivatives do that well, since input is essentially pushed into the parser.


Pausing is very straightforward with bottom up algorithms like GLR. The entire state of the parse is stored in an explicit stack data structure, as opposed to implicitly in the program's call stack as with top-down algorithms. So resuming the parse is as simple as resuming the parse loop using a previously-stored stack.
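A toy illustration of the point (not real GLR, just a shift/reduce-style fold where all state is an explicit stack, with a made-up one-rule grammar "a b -> S"):

```python
def step(stack, c):
    """One parser step: shift c, then reduce 'a b' -> 'S' if possible."""
    stack = stack + [c]            # shift
    if stack[-2:] == ['a', 'b']:   # reduce
        stack = stack[:-2] + ['S']
    return stack

stack = []
for c in "ab":
    stack = step(stack, c)

saved = stack          # "pause": the explicit stack IS the whole state

for c in "ab":         # "resume": keep folding from the saved stack
    saved = step(saved, c)
```

Since no state hides in the call stack, `saved` can be stored anywhere (or in several versions, as GLR's graph-structured stack does) and picked up later.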


I think this is a bit of an orthogonal issue. Inversion of control is a problem that can be solved at the language level with coroutines / continuations / threads / etc without forcing you to choose a special type of algorithm.
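E.g. in Python a generator gives you push-style input for free, whatever the underlying algorithm. A toy paren balancer, just to show the control flow:

```python
def balancer():
    """Each send(c) pushes one character in and yields whether the
    input so far is a balanced paren string."""
    depth, ok = 0, True
    while True:
        c = yield (ok and depth == 0)
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
            if depth < 0:   # closed more than we opened
                ok = False
```

Usage: `g = balancer(); next(g)` primes it, then the asynchronous producer calls `g.send(c)` whenever a character arrives.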


Earley parsing can handle asynchronous input just fine.


> CYK and Earley parsers are both actually simpler than parsing with derivatives, once you've thrown in this paper's optimizations.

Yeah, GLR is very elegant as well; I'm a little puzzled by this paper's initial claim (and similar claims in the Yacc is Dead paper):

> Current algorithms for context-free parsing inflict a trade-off between ease of understanding, ease of implementation, theoretical complexity, and practical performance. No algorithm achieves all of these properties simultaneously.

GLR has good performance and isn't much harder to understand than LR, which is a very elegant theory. So I assume that 'ease of implementation' is the property the authors think GLR lacks. But ease is a little hard to measure, until you have an equally fast and production-ready implementation of both algorithms.



