LL and LR in Context: Why Parsing Tools Are Hard (2013) (reverberate.org)
222 points by ingve on Feb 23, 2016 | 95 comments



> While rolling your own parser will free you from ever getting error messages from your parser generator, it will also keep you from learning about ambiguities you may be inadvertently designing into your language.

I have tried several times to explain why I think people should use parser generators, and the conclusion of this article expresses my point of view perfectly. The fact that we have a very good parser generator in OCaml (and Coq!)[1] really helps here.

I still think that ambiguities of the third kind, "Type/variable ambiguity", are a design issue and your grammar should be changed, because it's going to be completely ambiguous for humans too.

[1]: http://gallium.inria.fr/~fpottier/menhir/


A huge reason anything accepting network (or other potentially hostile) input should use parser generators, and validate the entire input before using any of it, is security. Additionally, all of those grammars should, whenever possible, be no more complex than deterministic context-free.

As Meredith and Sergey explain in their talk[1] at 28c3, Turing-complete input languages and parsers that don't validate their input create a "weird machine" just waiting to be programmed in malicious and unexpected ways.

[1] https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurit...


A couple of questions follow. I'm sharing some experience and hoping to learn something.

Do you think the use of parser generators is also practical for performance-critical binary data, for example video streams? (I've never parsed one, but I imagine they can be so optimized that it could be impractical to do it with a generic parser generator.)

What formats are examples of complex structure where parser generators are practical?

For my personal needs, I've been getting along very well with plain text relational data. Like

    numPersons 2
    numAncestors 1
    person john "John Doe"
    person jane "Jane Dane"
    ancestor john jane
That's so trivial that I would never want to depend on a parser generator. Instead I hand-roll a parser for this format in 10 minutes and 20 lines of C, if I don't mind bad error messages. And what parser generator would make it easy to check the (relational) integrity of the above data in a secure way?

If I leave out the num* fields above, the parser would need one token of lookahead. What types of grammars are these two versions?


If you have to format data as text, it's best to use a regular grammar, which you can read into objects with regular expressions. You would then do the validation on the objects rather than on the string-formatted data.


Good point. But, are there practical tools for this? I still have to decide how I store the data in memory. For example,

    struct Model {
        int numPersons;
        int numAncestors;
        struct Person *person;
        struct Ancestor *ancestor;
    };
Or SoA (structure of arrays) instead of AoS (array of structures)?

    struct Model {
        int numPersons;
        int numAncestors;
        struct string *personname;
        struct string *persondesc;
        struct string *ancestorancestor;
        struct string *ancestordescendant;
    };
etc. Do you know of tools which are in practice a better choice than the super boring ad-hoc code I described? Unless I had to define many of these kinds of formats, the cost of going meta is just too high.


I use Ruby, so I can't help you re: tooling. But if you are parsing a regular grammar with regular expressions, how much tooling do you need? The regexp does all the work. All you have to do is ensure that your data format is sane, that your delimiters can never appear in the data.

With stuff like your problem, I can never be sure the data is clean, so I always default to a format like CSV or JSON and use standard library tools. I would shudder to have to work with that kind of data in C / C++ for the very reason you're asking these questions in this thread. It's low-level code for a high-level task. You're bound to fuck it up eventually and not foresee an edge case.

But it would depend on what you want to do with the data. If you're walking trees, then you're going to want to store them in memory as trees. Whichever memory structure would be the most efficient in the algorithm you've got planned.

Personally, since it's relational, I wouldn't bother with file storage at all and just dump it into a relational database, even if it's just SQLite. That would give you the most flexibility.


Extracting from a regex match with an unknown number of captures is still a kludge, at least judging from my experience with Python.

Come on, what can you mess up splitting lines into fields?

I'm not wedded to C/C++. It's just often, contrary to popular opinion, the easiest and most straightforward thing to do. C structs are a good tool for concisely describing how data inter-relates. (Ownership and mutability are totally orthogonal questions.)

Conversely, how do you check the integrity of JSON input (meaning the tree as it was returned from a JSON parser library)? The odds are that you don't. It's much more work, since you don't know at each step precisely what has to come next.

Yeah, sqlite3 is a popular option. But it's more an alternative answer to the question of in-memory storage. It's not particularly suited as a human-friendly data interchange format.


I think using flat file human-readable serialization for relational data is the real kludge, but I guess that's just me.

But an unknown number of captures indicates a non-regular grammar. You should simplify the grammar so you don't need them. I'd have each line adhere to /^(\w+) (.*)$/ and use the first capture to determine which regex to use to read the rest of the line. But this will break if there are newlines in the input. Hopefully it will error out if it gets a bad line; if it doesn't, it will pass bad data through.
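
In C terms, the dispatch might look roughly like this with POSIX regex.h. This is only a sketch: the per-tag pattern is a guess at your format, and since POSIX ERE has no \w, [[:alnum:]_] stands in for it.

    /* Sketch of "tag the line, then dispatch on the tag" with POSIX regex.h.
       Error handling for regcomp/regexec is omitted for brevity. */
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        regex_t line_re, person_re;
        regcomp(&line_re,   "^([[:alnum:]_]+) (.*)$", REG_EXTENDED);          /* tag + rest */
        regcomp(&person_re, "^([[:alnum:]_]+) \"([^\"]*)\"$", REG_EXTENDED);  /* id "name"  */

        const char *line = "person john \"John Doe\"";
        regmatch_t m[3];
        if (regexec(&line_re, line, 3, m, 0) == 0) {
            int taglen = (int)(m[1].rm_eo - m[1].rm_so);
            const char *rest = line + m[2].rm_so;
            if (taglen == 6 && strncmp(line + m[1].rm_so, "person", 6) == 0) {
                regmatch_t pm[3];
                if (regexec(&person_re, rest, 3, pm, 0) == 0)
                    printf("person id=%.*s name=%.*s\n",
                           (int)(pm[1].rm_eo - pm[1].rm_so), rest + pm[1].rm_so,
                           (int)(pm[2].rm_eo - pm[2].rm_so), rest + pm[2].rm_so);
            }
            /* ...dispatch on the other tags (ancestor, num*) the same way... */
        }
        regfree(&line_re);
        regfree(&person_re);
        return 0;
    }
The same pattern extends to the other tags, and anything that doesn't match a known tag gets rejected up front.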

> Come on, what can you mess up splitting lines into fields?

Really depends on how clean your inputs are. A short list of things that can mess you up: newlines in your data, encoding inconsistencies, delimiters in your data. You have to handle all these manually if you write your own parser or risk unclean data entering the boundaries of your system. If you're in full control of your input, sure, go nuts, but I'd prefer a more robust solution because you never know what you're going to want to feed into it in the future.

A standard format may have inconsistencies in implementation, but the format itself bakes in the solution to these problems, and also tends to have somewhat-useful error messages when they break.

> Yeah, sqlite3 is a popular option. But it's more an alternative answer to the question of in-memory storage. It's not particularly suited as a human-friendly data interchange format.

My suggestion would be to treat these as separate problems and not expect one solution to provide both of them. Why do you need human-friendliness? To me, the usual way to provide that is with an application and not a data format. I could knock one of those up pretty quickly with Rails and an admin backend, I'm sure you'd have your own way of doing that that you're comfortable with. I will use command-line applications if Rails is too heavyweight, I do this for a lot of my utility apps.


I appreciate the discussion, and I'm sorry I have to respectfully disagree :)

> I think using flat file human-readable serialization for relational data is the real kludge, but I guess that's just me.

Look into your /etc and tell me which of the files in there are a kludge, and which ones are not plain text relational (or key/value, for that matter).

> newlines in your data

Well, what I do is I split at newlines :). Very practical since it guarantees fields don't contain them. (In the rare cases where you want them, do percent-encoding, hex encoding, C string style, whatever).

> encoding inconsistencies

Parse at an abstract character level of your choice. In scripting languages you get basic Unicode sanity for free. In C I would not want to deal with Unicode. It's too big for my head. UTF-8 goes out of its way so I can ignore it (newlines etc. still work). C strings mean no longer having to deal with painful questions like "is c.isspace() only true for U+0020, or also for other characters in Unicode classes which I've never even heard of (or which don't even exist yet)?". ASCII/C-locale text fits in my head. I don't want to be concerned with more abstraction (~ additional semantic baggage) throughout most of a system.

> delimiters in your data

same as above.

> you never know what you're going to want to feed into it in the future.

You better have a rough idea. I don't see why you couldn't commit to non-whitespace user-ids now, for example. Conversely, if you don't want to enforce sensible constraints now, you end up with code which never quite knows what it can assume (and often it can not assume properties that are needed for a sensible implementation).

> To me, the usual way to provide that is with an application and not a data format.

Can I email that application to someone else? Can I grep it as easily as my text files? Can I version-control it? Can I quickly whip up a script, import the application into Excel or maybe sqlite and visualize or extract some relationships? Will that application server still serve my data in 10 years? Will that application still build next year? How do I access it when I don't have a stable connection?


> Do you think the use of parser generators is practical also for performance-critical binary data?

That's not a question I consider very important, because prioritizing performance over correctness (safety) is irresponsible.

That said, using a parser generator shouldn't have a significant impact on performance. When a network is involved, the parser isn't going to be the bottleneck anyway.

> video

I am not very familiar with the details of video formats, but I would suggest trying the parser combinator library Meredith mentions in the talk from my previous comment [1], known as "hammer"[2]. It's very easy to use, and it has specific support for bit-level parsing of binary formats.

> numPersons 2

I highly recommend not using formats like this, because it's more complex than deterministic context-free. Using a length field requires the parser to remember state. It's also redundant information, creating the possibility of a mismatch between the number specified in "numPersons" and the actual number of rows.

A format that uses s-expressions would be safer and easier to implement:

    (ancestry
      (ancestor
        (person john "John Doe")
        (person jane "Jane Doe")
      )
    )
The closing tag of the s-expression provides a similar benefit as the num* fields by simply marking the end of a complete record/field.
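
To give a feel for it, here is a minimal sketch (plain C, not hammer) of a recursive reader for this shape of s-expression; a real implementation would build a tree instead of printing, and would need proper atom and error handling:

    #include <ctype.h>
    #include <stdio.h>

    static int peek(FILE *f) { int c = getc(f); if (c != EOF) ungetc(c, f); return c; }
    static void skip_ws(FILE *f) { while (isspace(peek(f))) getc(f); }

    /* Reads one expression and prints it indented; returns 0 at EOF or at a
       closing paren, which the enclosing list reader then consumes. */
    static int read_expr(FILE *f, int depth) {
        skip_ws(f);
        int c = peek(f);
        if (c == EOF || c == ')') return 0;
        if (c == '(') {                                   /* list */
            getc(f);
            printf("%*slist:\n", depth * 2, "");
            while (read_expr(f, depth + 1)) { }
            skip_ws(f);
            if (peek(f) == ')') getc(f); else fprintf(stderr, "missing )\n");
            return 1;
        }
        char buf[128]; int n = 0;                         /* atom */
        if (c == '"') {                                   /* quoted string */
            getc(f);
            while ((c = getc(f)) != EOF && c != '"' && n < 127) buf[n++] = (char)c;
        } else {                                          /* bare token */
            while ((c = peek(f)) != EOF && !isspace(c) && c != '(' && c != ')' && n < 127)
                buf[n++] = (char)getc(f);
        }
        buf[n] = '\0';
        printf("%*satom %s\n", depth * 2, "", buf);
        return 1;
    }

    int main(void) { while (read_expr(stdin, 0)) { } return 0; }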

> That's so trivial that I would never want to depend on a parser generator

Is it really "trivial"? Or is the parser that you write in 10 minutes accepting ambiguous or invalid data? Is it actually checking for all of the error conditions? For most real-world grammars, it's unlikely you will catch all of the subtle ambiguities, interactions, and edge cases.

The benefit of using a parser generator is that those entire classes of bugs are impossible.

> check (relational) integrity

In the trivial example you gave, you shouldn't have a grammar that requires checking relational integrity. In a more complex grammar, such checks would be the job of whatever is walking the AST that the parser returns. Parsers interpret syntax, not semantics.

> If I leave away the num* fields above, the parser would need one token of lookahead.

You need at least one token of lookahead anyway. Including the num* fields means the parser needs to remember and manage state.

[2] https://github.com/UpstandingHackers/hammer


>> numPersons 2

> I highly recommend not using formats like this, because it's more complex than deterministic context-free.

Barely. It avoids the complexity of look-ahead and reallocating growing buffers.

> A format that uses s-expressions would be safer and more easier to implement:

Trees are unsuitable for almost anything computationally, at least as authoritative representation. Prefer flat. (Widget hierarchies are one of a few counterexamples where the tradeoffs are different).

The s-expression you gave is much more complex to parse. Granted, you get parsing into a tree for free. But then you have to go inside the tree and see what's there: that there are two persons per ancestor, etc. Navigating trees is hard.

> Is it really "trivial"? Is it actually checking for all of the error conditions?

Yes. Yes. Of course. It's the most boring code in the world. Parse numbers, allocate arrays correspondingly, parse each following table in a loop corresponding to the numbers parsed. You can do it all using scanf. Bail out as soon as a bad line comes in. 20 lines.
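
Roughly this kind of thing (a sketch, not my actual code; the 64-byte fields and the exact error handling are placeholders):

    #include <stdio.h>
    #include <stdlib.h>

    struct Person   { char id[64]; char name[64]; };
    struct Ancestor { char anc[64]; char desc[64]; };

    int main(void) {
        int numPersons, numAncestors;
        if (scanf("numPersons %d numAncestors %d", &numPersons, &numAncestors) != 2
            || numPersons < 0 || numAncestors < 0) {
            fprintf(stderr, "bad header\n");
            return 1;
        }
        struct Person   *p = calloc((size_t)numPersons,   sizeof *p);
        struct Ancestor *a = calloc((size_t)numAncestors, sizeof *a);
        if ((numPersons && !p) || (numAncestors && !a)) return 1;

        for (int i = 0; i < numPersons; i++)          /* person <id> "<name>" */
            if (scanf(" person %63s \"%63[^\"]\"", p[i].id, p[i].name) != 2) {
                fprintf(stderr, "bad person line\n");
                return 1;
            }
        for (int i = 0; i < numAncestors; i++)        /* ancestor <id> <id> */
            if (scanf(" ancestor %63s %63s", a[i].anc, a[i].desc) != 2) {
                fprintf(stderr, "bad ancestor line\n");
                return 1;
            }
        printf("ok: %d persons, %d ancestors\n", numPersons, numAncestors);
        return 0;
    }
The relational check (does every ancestor line refer to a declared person id?) would be a separate pass over the arrays afterwards.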

> The benefit of using a parser generator is that those entire classes of bugs are impossible.

Yeah. Well, they are not impossible. They are only possible in the parser generator. And interfacing a parser generator is itself not necessarily trivial. Plus, dependencies are a maintenance burden.

The other approach is to make the design so simple (like the relational example) that those classes of bugs become nonexistent.

> You need at least one token of lookahead anyway.

Not in the example I gave.

> Including the num* fields means the parser needs to remember and manage state.

I would hardly call that state. It's an immutable number, initialized in sequential code without real data dependencies (the data dependencies are resolved in the format -- you can not go wrong parsing it).


As far as I can tell, parser generators are harder to use and understand than recursive descent parsers because they involve feeding an abstract spec of your language into a tool designed for any language whereas a recursive descent parser involves creating a tool more or less specific to your language.

Now, it seems to me that being easier to use and understand is an inherent advantage: I can't see how being easier to understand doesn't also make bugs easier to understand, and so forth.

Moreover, constructing a recursive descent parser involves understanding and transforming your language, an exercise which can also discover ambiguities and give a "deeper" understanding as well. It's hard for me to see Bison commands + code fragments as the ultimate in illumination concerning one's language's properties.


I've looked at yacc + bison a couple of times and always turned away before even attempting to write parsers in them. It's ugly as hell, and I figure a grammar sprinkled with code fragments is hard to understand and thus insecure.

However, have you tried parser combinators? Parsec (a Haskell library) is really nice if it fits your grammar.

Here is one guy who simulated that in Python, also very nice:

http://jayconrod.com/posts/37/


Parsec is good but it's a rather old library. For more specialised needs there are even better choices in Haskell: (taken from Gabriel Gonzalez's recent post)

* attoparsec remains the king of speed, generating parsers competitive in speed with C

* trifecta remains the king of error messages, generating gorgeous clang-style errors

* earley is a newcomer but deals with context-free grammars, while the other parser combinators are basically recursive descent parsers

* and good old alex+happy for direct yacc+bison equivalent

* I should probably mention that in certain simple cases you don't even need a parser combinator library to parse things. Just a good old `StateT (Text, r) [] ()` or similar can do wonders.


Yeah. Reverse engineering a grammar you can give to a parser generator from a hand crafted parser is a fun and occasionally horrifying experience. There's nothing quite like discovering that

    loop @a
        ...
    endloop
means start the loop labeled a but

    loop
        @a
        ...
    endloop
means start a loop, the first statement of which refers to attribute a, but that this is the only use of labels in the language where these two cases are parsed differently...


Out of curiosity, is that a real example from some language?


Slightly altered to protect the guilty, but yes, this is a real example. There are few things more frustrating when converting hand-written recursive descent parsers into proper grammars than having to insert arbitrary rules about where newlines can or cannot be. You generally can't change the language if there's production code out there (and if there isn't, why are you reverse engineering a grammar?) because changing the semantics will lead to more confusion.

You just end up making a grammar with really weird rules about where newlines can or cannot be put, and comments that this should not be changed. Oh, and a BIG test suite.

Once you've dealt with the grammar then you'll only have bugs in the compiler and runtime semantics to emulate, and language implementations often have an exciting set of those.


Menhir is really amazing. Being able to use the high-level rules to parse lists, without needing to use recursion, was super nice. It also gives decent error messages when it detects an ambiguity.


Exactly. People like to pretend that e.g. PEG grammars don't have ambiguities because of "ordered choice"... But the truth is, they simply mask the ambiguity away! Even yacc will always produce a working parser, resolving conflicts in a well-defined manner, while also letting you know how to fix your grammar to make it even better!

Edit: this is actually explained much better in the article.


> But the truth is, they simply mask ambiguity away!

Incorrect. The choice is explicit. Ambiguity gone, period. If the order makes some of the branches unreachable, it can be easily statically proven. In practice this is more than enough and does not require nearly as much effort as all the other parsing approaches.


Something interesting that isn't covered in this blog entry but is relevant to many of us is the importance of deterministic parsers for DoS/DDoS prevention. APIs that make use of parsers, be they regular expression parsers or parsers for more complex grammars such as CFGs, can be abused if the underlying parser is nondeterministic. Reducing a parser to a DFA (deterministic finite automaton) provides it with measurable performance per character parsed. If an NFA (non-deterministic finite automaton) is used instead, then an attacker can abuse backtracking to increase parse time, often exponentially.

This has been used, quite famously, in standards conforming XML parsers which have some very ugly backtracking issues dealing with entities and entity references. More commonly, people who decide to add regular expression validation to their APIs are often surprised to discover that an attacker can take up all sorts of server time while the API is trying to validate a carefully crafted degenerate case message. Many regular expression libraries out there are NFA, including the ones built into some popular high-level language runtimes.
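
As a toy illustration (not any real engine's code), a naive backtracking matcher that supports only literal bytes plus an optional '?' suffix already shows the blowup: against the pattern a?a?...a?aa...a and the subject aa...a it explores an exponential number of branches, which is Cox's classic degenerate case:

    #include <stdio.h>
    #include <string.h>

    static long calls;

    static int bt_match(const char *re, const char *s) {
        calls++;
        if (!*re) return *s == '\0';
        if (re[1] == '?')   /* optional char: try taking it, then try skipping it */
            return (*s == re[0] && bt_match(re + 2, s + 1)) || bt_match(re + 2, s);
        return *s == re[0] && bt_match(re + 1, s + 1);
    }

    int main(void) {
        char re[128], s[64];
        for (int n = 5; n <= 20; n += 5) {
            for (int i = 0; i < n; i++) { re[2*i] = 'a'; re[2*i + 1] = '?'; }
            memset(re + 2*n, 'a', (size_t)n);   /* pattern: ("a?" x n)("a" x n) */
            re[3*n] = '\0';
            memset(s, 'a', (size_t)n);          /* subject: "a" x n */
            s[n] = '\0';
            calls = 0;
            printf("n=%2d matched=%d calls=%ld\n", n, bt_match(re, s), calls);
        }
        return 0;
    }
The call count roughly doubles with each added pair even though the subject only grows by one character, which is exactly the kind of input an attacker goes looking for.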


> Reducing a parser to a DFA (deterministic finite automaton) provides it with measurable performance per character parsed. If an NFA (non-deterministic finite automaton) is used instead, then an attacker can abuse backtracking to increase parse time, often exponentially.

An NFA does not use backtracking, at least not in a sensible implementation. The problem is that most implementors of RE engines prefer an "obvious, hands-on" implementation that uses backtracking instead of understanding the elegant theory behind NFAs. Here is how to implement an NFA without using backtracking:

> https://swtch.com/~rsc/regexp/regexp1.html
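
The core idea, as a toy sketch (not Cox's code; the NFA here is hand-built for an anchored .+aaa, and epsilon transitions are left out): keep the set of live states and advance the whole set one input byte at a time, so the work per byte is bounded by the number of states and there is never anything to backtrack to.

    #include <stdio.h>

    #define NSTATES 5
    #define ACCEPT  4
    /* trans[s][c] is a bitmask of the states reachable from state s on byte c */
    static unsigned trans[NSTATES][256];

    /* Anchored match: advance the whole set of live states one byte at a time. */
    static int nfa_match(const char *s) {
        unsigned cur = 1u << 0;                     /* start in state 0 only */
        for (; *s; s++) {
            unsigned next = 0;
            for (int st = 0; st < NSTATES; st++)    /* each live state handled   */
                if (cur & (1u << st))               /* exactly once per byte     */
                    next |= trans[st][(unsigned char)*s];
            cur = next;
            if (!cur) return 0;                     /* no live states left: fail */
        }
        return (cur >> ACCEPT) & 1u;
    }

    int main(void) {
        /* Hand-built NFA for an anchored .+aaa; genuinely nondeterministic:
           state 1 on 'a' can stay in 1 (more ".") or advance to 2 (first literal a). */
        for (int c = 0; c < 256; c++) {
            trans[0][c] |= 1u << 1;
            trans[1][c] |= 1u << 1;
        }
        trans[1]['a'] |= 1u << 2;
        trans[2]['a'] |= 1u << 3;
        trans[3]['a'] |= 1u << 4;
        printf("%d %d %d\n", nfa_match("xyzaaa"), nfa_match("aaaa"), nfa_match("aaa"));
        /* prints: 1 1 0 */
        return 0;
    }
An implementation in the style Cox describes compiles the pattern into such a machine (Thompson's construction) and also follows epsilon transitions, but the per-byte loop is the same idea.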

Addendum: And it is also a bad idea to convert an NFA to a DFA for parsing, because the number of necessary states can explode exponentially under this transformation.


There are still pathological cases for NFAs: they are just quadratic in runtime and linear in (additional) storage, instead of exponential in runtime. That's frequently good enough, but sometimes you really want the linear runtime and constant memory usage of a DFA.


It's a little better than that. A proper implementation of the NFA algorithm is always linear with respect to the search text, assuming the size of the regex is constant. More precisely, the time complexity is O(mn), where m ~ len(regex) and n ~ len(text).


While that's technically correct (the number of states in the NFA represents an upper bound on the number of states the simulation can be in at a time), this is one of those times when big-O can be misleading. In practice, there's a lot of variation between inputs. A typical case will leave you in only a handful of states at once, but a carefully constructed input can put you into many states. This will lead to quadratic runtime and linear additional memory usage up to the bound of the number of states in the NFA.

There is a key insight there: that it is in fact bounded, and you can plan around these bounds. It's not the tight constraint a DFA gives you (where the number of operations is constant, and runtime varies only by hardware details like L1 cache misses), but it's still good enough for most cases.


> This will lead to quadratic runtime and linear additional memory usage up to the bound of the number of states in the NFA.

If by quadratic runtime you mean O(len(regex) * len(input)), then fine. (I think it's a bit strange to call that quadratic, since the size of the regex is constant.) If by quadratic runtime you mean O(len(input)^2), then I contest that. My guess is that you probably do indeed mean the former.

And certainly, "linear additional memory usage" is not relative to the input either.

> There is a key insight there: that it is in fact bounded, and you can plan around these bounds. It's not the tight constraint a DFA gives you (where the number of operations is constant, and runtime varies only by hardware details like L1 cache misses), but it's still good enough for most cases.

Sure, but now you have to deal with exponential state blow up. :-)

You can skirt around that with a lazy DFA, and you're now technically back to O(mn) (same as the NFA algorithm in the worst case), but in the common case, you end up pretty close to the performance of a DFA constructed offline.

As "evidence" of my claim, you might peruse these unscientific benchmarks: https://github.com/jneem/regex-benchmarks (The first group uses an offline DFA and the latter two groups are both lazy DFAs. Of course, this is just an anecdote!)

My view on this is pretty biased because my goal is not to just be able to deal with untrusted input, but to also deal with untrusted regexes. It's hard to satisfy the latter case with an offline DFA. :-/


Actually, my mistake is that I was examining a particular pathological case for NFAs, that DFAs handle fine: something like /.+aaaaaaaaaa/. In this case, the number of states the automaton could be in grows with the length of the input, until you hit the upper bound of "all eleven states". For that regex, the worst case is O(min(len(regex),len(input))*len(input)). This is quadratic in the input length, until a certain point. Which formula is correct then depends on whether the regex or the input is longer.

In other cases (e.g. /(aa|ab|ac|ad|ae|af|ag|...)+/), you blow up to the maximum number of states immediately, and for a matching string stay in the same number of states the whole time. Therefore, it is technically correct to leave out the len(input) parameter if you are considering arbitrary regular expressions.

> You can skirt around that with a lazy DFA

This adds unwarranted complexity if you're doing ahead-of-time compilation. Perhaps the best approach there is to try to convert the regex to a DFA, but bound the number of states to some multiple of the number of NFA states. If you surpass this number, consult the requirements of the situation and either reject the expression or just use the NFA that was built in the process.


> something like /.+aaaaaaaaaa/. In this case, the number of states the automaton could be in grows with the length of the input, until you hit the upper bound of "all eleven states".

That's just false for the NFA algorithm. I'm not sure where your confusion lay, unfortunately. There are no pathological cases for the NFA algorithm. Zero. Zip. Nada. It's guaranteed O(mn) time. See Cox's articles. They explain.

It is true of course that certain regexes run slower than others using the NFA algorithm, but this is because of constant factors, not because of any degradation in runtime complexity. When implementing the NFA algorithm, you can allocate all of the space you need upfront and then execute a search guaranteeing no additional allocation.

Offline DFAs do, however, have pathological cases where the number of states required is exponential. For example, `0[01]{n}$` has an exponential number of states, proportional to 2^n. (I think this is isomorphic to your example, assuming `a{11}$`.)

> This adds unwarranted complexity if you're doing ahead-of-time compilation. Perhaps the best approach there is to try to convert the regex to a DFA, but bound the number of states to some multiple of the number of NFA states. If you surpass this number, consult the requirements of the situation and either reject the expression or just use the NFA that was built in the process.

Sounds pretty complex to me. Certainly more complex and fiddly IMO than implementing a lazy DFA. (I've done both.)


> It's guaranteed O(mn) time.

I'm apparently completely failing to communicate my point here, because you seem to think I'm contradicting this. I am not. I'll maybe try to work through the example. For the purposes of this example, we have a fixed regular expression of /.+aaaaaaaaaa/.

Processing the string 'a' takes one operation. Processing the string 'aa' takes three operations. Processing the string 'aaa' takes six operations. Processing a string of 'aaaa' takes ten operations. This is a quadratic progression. The DFA, however, has only eleven states.

Yes, there is a bound to the number of operations of 11*length. Yes, this means that big-O says it's linear. No, an attacker can't simply supply a 30-character string and lock up your system the way they might be able to for a backtracking approach. That's beside the point. An attacker still has significant control over system resource usage in a way that he would not with a constant-resource approach. An attacker may still be able to exploit this in a way that would not be an issue with a DFA, e.g. by attempting to exhaust memory by issuing many simultaneous expensive requests when your system is provisioned to handle average requests.

> Offline DFAs do, however, have pathological cases where the number of states required is exponential. For example, `0[01]{n}$`, has an exponential number of states proportional to 2^n.

Yes, there are pathological cases, which is why you don't use that approach with untrusted regexes. I've never disagreed with this.

This isn't even one of them, however. The DFA would have n+3 states: an initial state, a reject-loop state, a state for when we have seen the initial zero, and one state for when we have seen each of n additional zeroes or ones. I'm not sure how one would end up with more states than that: perhaps by not consolidating equivalent states in the NFA first?

Edit: Apparently you're using the convention that one searches for a match anywhere in the input string, which while common isn't what I'm used to. This doesn't change anything: it just makes most transitions that would otherwise go to the reject-loop state go to the initial state instead.

It's possible to get a pathological DFA, but one needs to work harder at it than this.


> Yes, there is a bound to the number of operations of 11*length. Yes, this means that big-O says it's linear.

... that's been exactly my position this entire time.

I guess I just don't understand why you are calling the NFA algorithm quadratic. Calling it quadratic because there exists a particular progression that moves the runtime complexity in a quadratic fashion up to a predetermined linear bound seems really confusing.

Overall, attackers can't make the entire algorithm run in quadratic time in the length of the input or the regex. That's what matters.

With all that said, I agree with your final conclusion of course, that the NFA algorithm is going to be slower than running a pre-built DFA. But I think attributing this to quadratic behavior is just confusing. I personally think of it more as the NFA having to maintain multiple simultaneous states and being forced to follow epsilon transitions over and over again. In contrast, a DFA will only ever be in one state and it will cache all of the epsilon transitions in its states.

> An attacker still has significant control over system resource usage in a way that he would not with a constant-resource approach.

Cox's formulation very explicitly provides an implementation of the NFA algorithm that uses constant resources (with respect to the input).

I mean, this isn't theoretical. You can go look at an implementation of the NFA algorithm by either Cox or myself, and it's pretty clear that some fixed allocation (based on the size of the compiled regex) happens up front, and no further allocation happens during the actual search. You can't make it blow up and take quadratic time.

> This isn't even one of them, however, because the NFA has no loop. The DFA would have n+3 states: an initial state, a reject-loop state, a state for when we have seen the initial zero, and one state for when we have seen each of n additional zeroes or ones.

It's exponential because it requires counting. You don't actually know when you've seen the initial zero, since you can see any number of 0s or 1s before it, and exactly N number of 0s or 1s after it, and then following that, the end of the string. (That final `$` is probably crucial to requiring an exponential number of states.)

It's a pretty standard example: https://en.wikipedia.org/wiki/Regular_expression#Expressive_...

See also: https://github.com/google/re2/blob/master/re2/testing/dfa_te...

This blurb from Wikipedia is also relevant:

> On the other hand, it is known that every deterministic finite automaton accepting the language L_k must have at least 2^k states. Luckily, there is a simple mapping from regular expressions to the more general nondeterministic finite automata (NFAs) that does not lead to such a blowup in size; for this reason NFAs are often used as alternative representations of regular languages.

This exactly matches my experience and what Cox says.


Re: the first part, that's fair.

> Cox's formulation very explicitly provides an implementation of the NFA algorithm that uses constant resources (with respect to the input).

Not constant, bounded. Theoretical CS doesn't care about the difference, but in practice it matters a lot. People provision hardware for the median case (really more like the 90th percentile), because provisioning for the worst case is really expensive. When something is bounded, but not constant, there exists the possibility of a resource exhaustion attack.

> It's exponential because it requires counting. You don't actually know when you've seen the initial zero, since you can see any number of 0s or 1s before it, and exactly N number of 0s or 1s after it.

I misinterpreted this one, because the convention I'm used to is that the expression is expected to match the entire input. I prefer this convention for its elegance, but it's much less common in practice than the Perl-like convention where you need to specify that you want that behavior.

I also didn't sufficiently reconsider the expression when I realized this: the terminal anchor is quite relevant, but I forgot it was there. In reality you're correct about this with the convention you are using.


> Not constant, bounded.

If I can do this:

    re = regexp(".+aaaaaaaaaaa")
    state = malloc(len(re))
    match1 = exec(re, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
    match2 = exec(re, "aaa")
    match3 = exec(re, "aaaaaaaaaaa")
    ...
And I also claim that `exec` is guaranteed to never call `malloc` in any of its executions (or otherwise allocate).

Do you consider that constant or bounded? It looks like it's constant to me.

If you're talking about CPU resources, then the absolute worst case in the NFA algorithm is requiring `len(re)` operations per position in the input. I guess this is what you're referring to as bounded? If so, then it is a linear bound. :-) An attacker can't make matching take more than linear time.

I feel like this entire conversation has been one roundabout exploration of "DFAs are faster than NFAs." :-/

> I also didn't sufficiently reconsider the expression when I realized this: the terminal anchor is quite relevant, but I forgot it was there. In reality you're correct about this with the convention you are using.

You had me doubting myself there for a second, so kudos. :-)


Huh, I didn't consider the possibility of pre-allocating space for the entire state list. That's a cute trick, making it less memory efficient in the average case (though more efficient CPU-wise by avoiding malloc) but also addressing my exact concern. Nice.

edit: I guess I should have just looked that up sooner. Sorry for any hassle or aggravations.


Interesting. I had never even considered not pre-allocating space for it... :-)

In a "real" implementation, you also need space to store sub-capture locations. You get to reuse this space on each execution of the regex, so it's a really nice win.

There are also certain crazy people in this world who have encoded the NFA algorithm into a code generation tool, which enables it to run completely malloc-free and entirely on the stack. :-)


> If by quadratic runtime you mean O(len(regex) * len(input)), then fine. (I think it's a bit strange to call that quadratic, since the size of the regex is constant.) If by quadratic runtime you mean O(len(input)^2), then I contest that. My guess is that you probably do indeed mean the former.

I think he rather means O(len(regex)^2 * len(input)). The reason: the number of states of an NFA is O(len(regex)) =: m. When parsing with an NFA (standard algorithm), you iterate through all the states you are currently in (up to m). On the current input character, each state you are in can go into up to m new states. That is up to m^2 transitions to check.


That's only true in a naive implementation. In fact, in a naive implementation, it's probably worse than that because there can be loops among epsilon transitions. Consider, for example, `(a{star}){star}`. (HN markup prevents use of a kleene star usefully. Sorry.)

In a "real" implementation, when computing the next set of states to visit for some position in the input, you explicitly avoid re-visiting states you've already seen for that position in the input. Cox talks about this quite a bit in his articles.

The guarantee of the NFA algorithm as presented by Cox is that each NFA state is visited at most once for each position in the input.


I think wolfgke's point (different from mine, by the way) was that you need to consider each state in each transition list just to check if it's in the set you've already seen. If you're in all states, and all of them transition to all states on an input, that's quadratic: you need to loop through n lists of n states just to check if you've already visited any of them.

Of course, I don't think there's an actual regular expression that would produce that automaton using Thompson's algorithm.


No, it can't happen at all. After processing the first state that visits all states on input i, all subsequent states explored in input i stop before traversing states that have already been visited. Since all states have already been visited, each state after the first on input i can be dismissed in constant time. Given that one can check whether a state has been visited in constant time, this is still just linear, not quadratic.

Let's just look at code already.

Here's the loop over states at input i: https://github.com/rust-lang-nursery/regex/blob/master/src/n...

If we match a character at input i, then we follow epsilon transitions: https://github.com/rust-lang-nursery/regex/blob/master/src/n...

If we visit a state that we've already seen, skip it: https://github.com/rust-lang-nursery/regex/blob/master/src/n...

If that first state visits all states, then `nlist` (which is maintained while following all states at any position in the input) will contain all states. In each subsequent iteration (line 165), even if the state wants to visit all states, it will be immediately stopped at the first check in line 300. No additional states for that state are followed.


The problem is that the "constant" memory usage of a DFA may well be exponential in the size of the corresponding NFA/pattern.


Sure. You want to do this when an attacker controls the input, and a benevolent but fallible actor controls the expression. It's not useful when the attacker controls the expression.


It could be larger, but it would be a known constraint with known runtime characteristics. There is value in this, especially with regards to understanding the worst-case performance.


Do you have more background on that? As far as I know, burntsushi is entirely correct, NFAs take a constant amount of space to run, and a linear amount of time (all in relation to the length of the input text).

Of course, backreferences and other stuff take you out of the realm of regular languages. Avoid these.


> Addendum: And it is also a bad idea to convert an NFA to a DFA for parsing, because the number of necessary states can explode exponentially under this transformation.

You can skirt this problem using a lazy DFA with a bounded cache. (i.e., See the third article in Cox's series.)


Read your link carefully. Real-world regular expression engines support backreferences, and efficiently determining whether an RE with backreferences matches a given string is an NP-complete problem.

If you're going to implement this important feature in a sane way then at SOME point you're going to need to backtrack.


The parent comments are using the term "NFA engine," not "real world engines." NFA engines indeed cannot support backreferences for exactly the reason you specified. They simply aren't powerful enough. (The top comment in this thread mischaracterized regex engines, which is a common confusion.)

And there are plenty of "real world" regex engines in wide use that don't support backreferences.


The commenter also talked about "RE engines".

Most RE engines that are described as being an NFA (examples include Perl's, PCRE and Java's) are implemented with backtracking exactly because you can easily add in a little state and suddenly you support backreferences. Of course the language isn't really now a proper regular expression, but everyone knows what you mean when you talk about them.

And yes, I'm well aware of how many real-world regex engines there are without backreferences. However the ones with the most widespread use, do.


> Most RE engines that are described as being an NFA

Yes, and it's a huge disservice to continue doing so. It doesn't make any sense to describe Perl's (or Java's, or Python's or PCRE) regex implementation as an NFA.

> but everyone knows what you mean when you talk about them

I disagree. The confusion in this space is extremely high.


You are at least 20 years too late to change terminology here. There are more people who are used to the wrong terminology than will ever hear or care about the official terminology. Complaining about it is like telling the media the difference between crackers and hackers, or like telling most programmers that what they mean by big-O really is big-Omega.

The problem starts with "regular expression". Regular expressions in Java, JavaScript, Python, Perl, Ruby, or many other languages or environments, aren't regular. They support backreferences. In many of those languages, it would be a backwards incompatible change to change either the name of the engine, or its feature set. There are more users of these languages who "know" how to use regular expressions than know the actual definition. Good luck convincing Java, JavaScript and Python to make backwards incompatible changes so we can restore language terminology!

Moving on, every description of regular expressions starts with the DFA and NFA terminology. All of these regular expression engines that I named, when presented with an actual regular expression, solve it like an NFA. Admittedly with a backtracking algorithm. Even though their internal states go beyond NFA, it isn't actually wrong to say that "This RE engine uses an NFA algorithm on regular expressions."

Given that, it is hopeless either to get people to use "regular expression" in its original meaning, or to get people to stop calling their so-called (but not really) RE engines, NFA implementations.


I get that terms can change in meaning over time. I totally concede the point that complaining about the term "regular expression" being misused is a lost cause at this point. The term has graduated from a purely technical term to something that is colloquially used quite frequently. I find it unfortunate, but I'm not going to nitpick someone to death over that.

With that said, I absolutely take issue with using the term NFA to describe anything that is NP-complete. It's just plain nonsense. The term "NFA" isn't nearly as widely used as "regular expression," and it is a very technical term with very specific technical implications.

> it isn't actually wrong to say that "This RE engine uses an NFA algorithm on regular expressions."

This is exactly the problem. It is wrong to say that Perl uses an NFA algorithm. It literally cannot be true because Perl implements features that are NP-complete.

This is exactly what happens when people get sloppy with their terminology. They start bending the meaning of words to say something that doesn't make any sense. As a result, those of us who aren't as informed end up getting extremely confused.

> or to get people to stop calling their so-called (but not really) RE engines, NFA implementations.

In a discussion like the one in this thread, where the distinction ACTUALLY MATTERS, it is absolutely worth pointing out.

> Good luck convincing Java, JavaScript and Python to make backwards incompatible changes so we can restore language terminology!

This is such a bad argument and such a huge straw man. Do you think I'm on some crusade to fix terminology? No. That's stupid and a waste of time. The only reason I brought this up is because your initial comment in this thread suggested (to me) that you were confused about the terminology being used! Why? Because wolfgke used the term NFA correctly: "An NFA does not use backtracking." But you chimed in with stuff about backreferences and "real" regex engines. What gives? I don't know, but it seemed like a clarification was in order to me.


My response was triggered by wolfgke saying, "...implementors of RE engines..." and attributing the use of backtracking to laziness.

However the most widely used RE engines in the wild use backtracking because that is the easiest way to handle the full feature set that is asked of them. That's not to say that they couldn't be implemented in a way that was able to handle strict regular expressions without backtracking, while backtracking only when they need to. (Perl makes a modest attempt in that direction, but doesn't do it completely.)

And then when programmers ask how the engine works, they start with the simple version for simple features. The naive explanation is of an actual NFA. And then someone goes off and implements one for a project, and they just follow that. Now you have an engine which is an honest NFA, and uses backtracking. It doesn't need to because it doesn't support extra features, but it is simple, it works, and if you ever needed to extend it, it is obvious how you'd go about it.

So there are two reasons other than laziness to implement an RE engine with backtracking. The first is that your RE engine supports (or might in the future) features that require it. The second is that it is easier to find and follow explanations of backtracking.

> This is exactly the problem. It is wrong to say that Perl uses an NFA algorithm. It literally cannot be true because Perl implements features that are NP-complete.

Not true!

If you hand Perl's RE engine a regular expression that is actually regular, then Perl will construct a set of states that are a recognizable NFA, and then will attempt to match the NFA to the string using the obvious backtracking algorithm. Admittedly the states are decorated with fields which COULD specify non-NFA stuff. The code it passes through will have checks for whether to trigger features that an NFA can't do. But what it actually will do is construct a recognizable NFA, and attempt to match it to the string.

Now throw a non-regular "regular expression" at the same thing and you'll construct a data structure that clearly isn't a proper NFA. What the heck do you call it? It is some type of extended state machine, but it is easiest to think of it as "NFA with some extra stuff". And that is exactly how the people who do work on that code think of it. With no alternate terminology and thinking of it as an NFA, is there any wonder that they describe it as an NFA?

(As well, that is how it was described in a number of popular and widely read books. For example pick up _Mastering Regular Expressions_.)


Perl's algorithm has more power than an NFA, therefore, it is not an NFA. It's as simple as that. That certain inputs happen to correspond to a configuration solvable by an NFA does not make it an NFA. The worst part about calling it an NFA is that it misleads everyone about the theory and suggests that the engine is less powerful than it really is.

I completely understand why it looks like an NFA.

> What the heck do you call it?

Anything but an NFA? My personal preference is unbounded backtracking (still not quite right), or just "Perl regexes."

> (As well, that is how it was described in a number of popular and widely read books. For example pick up _Mastering Regular Expressions_.)

I read that book when I was a young pup and remained hopelessly uninformed about the entire situation until my first complexity theory class during my undergraduate study almost ten years ago.

That book is fantastic if you care about optimizing Perl regexes. It otherwise does a great disservice to theory and has likely been the cause of uncountable confusion. When the inaccuracy was pointed out to him, Friedl essentially had the same response as you. It's tragic, frankly.

PCRE's documentation even cites Friedl's book when describing their implementation by calling it the "NFA algorithm." PCRE also has a supposed "DFA," but its description makes it sound more like the actual NFA algorithm (since it maintains multiple states simultaneously).


It is not that certain inputs happen to correspond to a configuration solvable by an NFA. It is that if you give it certain inputs it internally generates a data structure that actually IS a description of an NFA. And the same is true of most RE engines.

As for _Mastering Regular Expressions_, you've demonstrated both what it says and that it is popular. And further that it affected the terminology of a widely used project which has been embedded in many other projects. (Like PHP.)

And yes, I agree that it did a lot of people a disservice. But the damage is now long-done. See for example https://msdn.microsoft.com/en-us/library/e347654k%28v=vs.110... for yet another non-NFA that is described as one somewhere that programmers are likely to believe it.


> It is not that certain inputs happen to correspond to a configuration solvable by an NFA. It is that if you give it certain inputs it internally generates a data structure that actually IS a description of an NFA. And the same is true of most RE engines.

All this means is that Perl sometimes evaluates inputs using an NFA algorithm. To describe its matching engine generally as an NFA is extremely misleading.

> See for example https://msdn.microsoft.com/en-us/library/e347654k%28v=vs.110.... for yet another non-NFA that is described as one somewhere that programmers are likely to believe it.

Yes. It's sad.

I frankly can't tell you how much I see this confusion. It happens so much. I point it out when I see it in the hopes that others learn. I don't really buy the argument that we should stop educating others just because the "damage is done."

There's certainly a fine line between being an annoying pedant about this and genuinely trying to clarify it for someone who is confused. I thought I was doing the latter in this case initially. Of course, I am hyper sensitive to this topic because I maintain a regex engine! Can you guess which kind? :P


Great point, thanks! I'll add it to the article when I get the chance.


I take it sexp parsers are technically DFAs?


S-expressions are a classic example of something that can't be matched with a DFA or NFA. In order to know how many parens are currently open (and therefore needing to be matched with closing parens) it would need an unbounded number of states.


On the other hand, it's just a matter of maintaining a count of the open parentheses. A similar fact: strictly speaking, one can't count string lengths with finite state (because lengths can be arbitrarily large), but as a matter of implementation it's just O(n).
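
For example, a sketch of that counter in C (quoted strings and other lexical details ignored):

    #include <stdio.h>

    /* A single counter is all the extra state the nesting needs. */
    static int parens_balanced(const char *s) {
        long depth = 0;
        for (; *s; s++) {
            if (*s == '(') depth++;
            else if (*s == ')' && --depth < 0) return 0;   /* ')' with nothing open */
        }
        return depth == 0;
    }

    int main(void) {
        printf("%d %d\n", parens_balanced("(a (b c))"), parens_balanced("(a))"));
        /* prints: 1 0 */
        return 0;
    }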


Woohoo, real computer science!

I learned a lot from this. I did take compiler theory in college, but that was pretty academic (mostly Dragon Book), and more theory than practice. This is a good explanation of why lex/yacc is not enough.


> Terence Parr, author of ANTLR, often uses the metaphor of parsing as a maze

As someone who had a technical but "not real CS" curriculum, I enjoyed Parr's "Language Implementation Patterns" book [0] because it illustrates how certain grammar-forms become recognizable code-patterns and control-flow.

Granted, it's definitely got a bent towards Parr's ANTLR project and the Java language, but I still found it made things "click" a lot better than my daunting lack of progress with an old "Dragon Book" [1], which IIRC tended to float along with more mathy notation and theory than concrete examples.

[0]: http://www.amazon.com/Language-Implementation-Patterns-Domai...

[1]: https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...


In many cases trading in non-deterministic choice for deterministic choice (ie, parsing expression grammars) makes reasoning about and writing grammars much easier. For example, in the case of the arithmetic expressions grammar the rules should work as-is to get precedence.

Oddly, you would think that sacrificing non-determinism would really hurt the power of PEGs to express languages, but there are languages that are not context-free (e.g., `a^n b^n c^n`) that can be written as a PEG.


I don't get all these attacks on PEG being "ambiguous". This is a feature, and it is great once you get it.

And if you have a sufficiently smart compiler, it'll warn you about most of the practically important cases anyway. I wrote PEG parsers for dozens of different types of languages, and never ran into problems with an explicit choice order. This anti-PEG FUD should stop at once.


This is a great article (I haven't written a compiler in years but was still able to follow along) that can easily suck up an evening of descending his explanatory links. That's a good thing.

Favorite new trivia bit:

"C++ has an even more extreme version of this problem since type/variable disambiguation could require arbitrary amounts of template instantiation, and therefore just parsing C++ is technically undecidable (!!)"


I think the tools are challenging, but it is soooo refreshing to have tools that don't pass the complexity on to the user. The parser library that has done this for me is https://github.com/engelberg/instaparse. Using Instaparse is downright liberating. BNF in, parser out. Done.


I don't know of any light-weight markup languages that are intentionally designed to be context-free and unambiguous. I wish I did.

I'm not sure if any exist, although some people have tried to write grammars for existing markup languages.

http://roopc.net/posts/2014/markdown-cfg/

https://github.com/asciidoctor/asciidoc-grammar-prototype

https://stackoverflow.com/questions/6178546/antlr-grammar-fo...

https://www.mediawiki.org/wiki/Markup_spec


I am working on a compiler for my new language. I went from learning about Yacc at the university (reaction: disgust) to discovering PEG (reaction: I should build my own parser generator!) to wisdom (reaction: just write the damn thing by hand and be done with it!)

I have a few basic reasons why I now think parser generators are a mistake for my purpose: 1) It is hard to learn all the details of a particular parser generator that you need for a complex parsing project. 2) The mix of grammar and semantic actions one has to come up with is ugly and complex. 3) The output is both ugly and hard to debug.

The most complicated part of a compiler is not recognizing good input, but doing something with it. A hand-written recursive descent parser can be a thing of beauty and simplicity that increases the joy of working on the meat of the compiler, that is, semantics!
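
For a taste of what that looks like, here is a toy recursive descent evaluator for arithmetic expressions, one function per grammar rule (a sketch only: no error recovery, integers only):

    #include <ctype.h>
    #include <stdio.h>

    static const char *p;                  /* cursor into the input */

    static long expr(void);                /* expr   := term   (('+'|'-') term)*   */
    static long term(void);                /* term   := factor (('*'|'/') factor)* */
    static long factor(void);              /* factor := number | '(' expr ')'      */

    static void skip(void) { while (*p == ' ') p++; }

    static long factor(void) {
        skip();
        if (*p == '(') { p++; long v = expr(); skip(); if (*p == ')') p++; return v; }
        long v = 0;
        while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }

    static long term(void) {
        long v = factor();
        for (skip(); *p == '*' || *p == '/'; skip()) {
            char op = *p++;
            long r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    static long expr(void) {
        long v = term();
        for (skip(); *p == '+' || *p == '-'; skip()) {
            char op = *p++;
            long r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    int main(void) {
        p = "1 + 2 * (3 + 4)";
        printf("%ld\n", expr());           /* prints 15 */
        return 0;
    }
Each grammar rule maps onto one function, and operator precedence falls out of which function calls which.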


Parser combinators for lazy streams can be written as recursive descent and perform as, more or less, generalized LR parsing (exploring all tree branches "simultaneously").

http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf

What is more, these combinators can express context-sensitive grammars: https://www.cs.york.ac.uk/plasma/publications/pdf/partialpar... (page 5).

And, by tweaking combinators, you can arrive to the wonders of stream processing parsing without (mostly) space leaks: http://www.cse.chalmers.se/edu/year/2015/course/afp/Papers/p...

It is easy to add syntax error position reporting, also it is easy to add syntax error corrections. All this can be done in under a day.

For me it offers the good things from both worlds: the declarative specification of parser generators and the easy modification of "parse the damn thing by hand".

I actually encountered one case where that approach shines incommensurably: supporting old languages like VHDL. VHDL fits every parser formalism almost seamlessly, but it still has one place where you cannot do things either efficiently or elegantly: character literals and attributes. 'a' is a character literal, while a'a is an invocation of attribute a on the object a. The ambiguity is at the lexing level, and needs feedback from the parsing level. The usual parsing tools cannot handle that well enough; with parser combinators it is trivial.


I believe PEGs are a better, modern approach to parsing. PEGs are only mentioned in the article, with no links.

Here's the PEG paper: http://www.brynosaurus.com/pub/lang/peg.pdf


The article mentions PEG in good detail and notes the ambiguity problems. Can you guide me on how those can be resolved with PEG?


All such articles are always mumbling about "ambiguity problems", and never care to provide even a marginally real-world example of one.

In practice PEGs do not have an ambiguity problem, period. Stop this FUD!


"Error messages" is the usual argument I hear. The article does not address this. I would appreciate an article, why parser generators can or cannot provide error messages on par with handwritten parsers.


Good point. I have heard this too and I don't have a deep answer to this question. Maybe someday I'll get to the bottom of it and write a follow-up article. (Article author here).


Years ago I worked on an SML parser that used LR tables; on errors (i.e. with no valid transition) it would try to find a symbol to insert/delete/modify that would allow parsing to continue. It dealt reasonably well with simple errors, though ML syntax is a bit flexible: function application is juxtaposition and it has user-defined infix operators (to deal with them, the parser would resolve reduce-reduce and shift-reduce conflicts dynamically by calling a function at runtime).

Code is here, in fact:

https://github.com/Ravenbrook/mlworks/blob/master/src/parser...


Here's one that might interest you.

Generating Good Syntax Errors http://research.swtch.com/yyerror


Excellent article; it explains pretty much the same thing (but with much more mathematical detail) as the article I wrote some months ago about why parser generator tools are mostly useless [1].

[1] https://buguroo.com/why-parser-generator-tools-are-mostly-us...


> some notable language implementations do use Bison (like Ruby, PHP, and Go)

Note that Go is also moving away from yacc:

https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_t...


Has anyone done a comparison of memory footprints for parsers? Say, Antlr vs Bison vs ???.

I used Antlr3 to create an NLP parser for entering a calendar event. For iOS, the C runtime's parse-tree source was many megabytes and difficult to debug directly. For Android, the Java runtime's parse tree ran out of memory and had to be broken into smaller pieces. I haven't tried Antlr4.

The current workaround is to parse a document directly into a token tree: instead of a separate lexer + parser, we merged a tweaked BNF with regexes. Memory stays rather small, and we can download a revised grammar without recompiling a new binary.


Are you sure the issue was with the tool and not the language/grammar?


It was probably a bit of both. Using an LL parser for NLP is an odd fit. The problem was that the n-gram parsers had a fairly large file footprint, so paging from memory would be slow and consume battery.

As for the tool, the Antlr3 C runtime was very fiddly. On the forums, I saw one developer give up on porting from their working Antlr2 C++ runtime to Antlr's C runtime. This may have been due to C lacking exception handling, so backtracking had to be implemented another way. (Still, kudos to the volunteer who implemented the C runtime; in our case, it did make it into production.)

For really simple NLP parsing, one hack is to combine an island parser with an LL parser. This is essentially what Siri used several years ago (I have no idea what it uses now).


Another note: Go recently stopped using yacc (Bison).


I would appreciate a reference with an explanation about why that has changed. Is it to support code transformation?


I wonder if it might have something to do with error reporting. The usual way to handle error messages in a CFG is to make a superset of the grammar that contains all likely mistakes, along with custom error messages for each one. As you can imagine, the resulting grammars get pretty hairy. You also have more information available for reporting & recovery with a hand-written parser: think, for example, of how Clang can spell-correct mistyped keywords, something that's not possible without the ability to interpret an identifier as a keyword if it happens to be a certain edit-distance off.
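
To make the spell-correction example concrete, here is the edit-distance idea in isolation (toy Python, nothing to do with Clang's actual implementation): when a keyword was expected but an identifier was seen, suggest the closest keyword if it is similar enough.

    import difflib

    KEYWORDS = ["return", "while", "switch", "break", "continue", "struct"]

    def suggest_keyword(ident, cutoff=0.8):
        # return the most similar keyword above the similarity cutoff, or None
        close = difflib.get_close_matches(ident, KEYWORDS, n=1, cutoff=cutoff)
        return close[0] if close else None

    print(suggest_keyword("retrun"))   # 'return'
    print(suggest_keyword("swich"))    # 'switch'
    print(suggest_keyword("foobar"))   # None

A hand-written parser can do this at the exact point where it knows a keyword was expected; a table-driven parser has usually already committed to "this token is an identifier" by then.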


The actual commit:

https://groups.google.com/forum/#!searchin/golang-checkins/r...

From what I recall reading about the change, it was believed it would allow them more flexibility in the frontend, and they greatly preferred to write the parser in Go instead of yacc.

They also felt that, at this point, this would make it easier for others to contribute improvements to the project.


Handwritten recursive descent parsers offer much more control than the output of a parser generator. Most major C/C++ compiler front-ends do this.


In fact, very few production compilers use generated parsers - sooner or later they tend to end up with hand-written parsers for one reason or another, particularly error reporting.

Parser generators need to get much better before that will change.


They recently did a mostly-automated rewrite of the compiler from C to Go. I expect it was easier to hand-rewrite the parser instead of writing a Go backend for yacc or converting the C parser that yacc generated.


It looks like there has been a version of yacc in the Go source tree (one that works with Go, not C) for at least 11 months.

https://github.com/golang/go/tree/master/src/cmd/yacc



Thanks for the note, I will add this to the article when I get a chance!


What do people use parser generators for, besides writing new programming languages? I'm very curious to hear about the use cases.


The sibling comment alludes to parsing calendar events (possibly .ical files). In general you need one for any kind of (de)serialization -- and as the recent Java deserialization bug [1] (along with similar bugs in Python and Ruby) illustrates, even in a mature language "common practice" might not be "best practice". Everyone should have known that the various uses of blind deserialization were a bad idea, but that doesn't really matter after the fact: lots of high-profile libraries were completely broken. So "just use a library" might not be possible.

Any time you take native objects/data and persist them, or marshal them, you need a parser -- or a parser library. That might mean reading some CSV, parsing some XML, reading some YAML, JSON or what-not.

Then there's configuration that does more than set some variables, like for Nginx, Apache (web server), etc.

One might want a shorthand for defining templates -- either for general meta-programming, or just to generate a layout language (e.g. HTML, CSS). If you're doing Markdown (or Markdown-like) processing, you might want some assurance that generating HTML is all that code can do. Or you might want to do meta-programming at a higher level, like generating interfaces for some Java beans based on a CSV structure. For more on meta-programming in Java, I recommend the now slightly dated http://www.amazon.com/Program-Generators-Java-Craig-Cleavela... . It's a shame he didn't use the title he says he considered in the foreword: "Program Generators for Fun and Profit!".

And finally there's "real" programming languages - full blown DSLs of some kind. Perhaps in an effort to represent some kind of business logic (eg: authorization based on group membership, roles and time of day) in a language that is "safe" -- has no side-effects.

[1] http://fishbowl.pastiche.org/2015/11/09/java_serialization_b...

http://blog.codeclimate.com/blog/2013/01/10/rails-remote-cod...


They should be used anywhere you receive input for a defined protocol, rather than writing your own ad-hoc implementation. Security is much, much better that way.


I used one to parse XML/HTML, CSS selectors, and XPath expressions. The parser generator in question is https://github.com/YorickPeterse/ruby-ll which is an LL(1) parser generator in Ruby. Example grammars:

https://github.com/YorickPeterse/oga/blob/master/lib/oga/css...

https://github.com/YorickPeterse/oga/blob/master/lib/oga/xml...

https://github.com/YorickPeterse/oga/blob/master/lib/oga/xpa...


The article mentions that PHP uses Bison for its parser, which is true. After looking through the flex and bison grammar files in the PHP source some time back, I would say it's a perfect example of why you don't want to use parser generators for that.


Also, I'm interested in how well Top Down Operator Precedence (Pratt parsing) works with real-world languages: http://javascript.crockford.com/tdop/tdop.html
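
For reference, the core of the technique is tiny; this is a stripped-down sketch of the idea from the linked article (toy Python rather than Crockford's JavaScript, with made-up names):

    import re

    BINDING_POWER = {'+': 10, '-': 10, '*': 20, '/': 20}

    def tokenize(src):
        return re.findall(r"\d+|[+*/-]", src) + ['(end)']

    class Pratt:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def next(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok

        def peek_bp(self):
            return BINDING_POWER.get(self.tokens[self.pos], 0)

        def expression(self, rbp=0):
            left = int(self.next())           # "nud": only numbers in this toy
            while rbp < self.peek_bp():       # continue while the next operator binds tighter
                op = self.next()              # "led": binary infix operators
                left = (op, left, self.expression(BINDING_POWER[op]))
            return left

    print(Pratt(tokenize("1 + 2 * 3 - 4")).expression())
    # ('-', ('+', 1, ('*', 2, 3)), 4)

Precedence and associativity fall out of the binding powers and the loop condition rather than out of layered grammar rules, which is part of why the technique composes well with hand-written parsers for real languages.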


It seems like Bison does not support running user-defined code to handle things like C's declaration-vs-statement problem. Is that true? Otherwise, Bison seems to cover all the problems mentioned in the article.
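
For background, the traditional workaround for the declaration-vs-statement ambiguity in C front-ends is the "lexer hack": the parser records typedef names as it sees them, and the lexer consults that table to decide whether an identifier should be reported as a type name or a plain identifier. A language-neutral toy sketch of the idea (Python, not Bison-specific code):

    typedef_names = set()

    def lex(word):
        # the lexer's decision depends on state maintained by the parser
        if word in typedef_names:
            return ('TYPE_NAME', word)
        return ('IDENTIFIER', word)

    # the parser's action for "typedef int foo;" would record the new name:
    typedef_names.add('foo')

    print(lex('foo'))   # ('TYPE_NAME', 'foo')  -> "foo * bar;" is a declaration
    print(lex('baz'))   # ('IDENTIFIER', 'baz') -> "baz * bar;" is a multiplication

Whether a given tool makes it convenient to thread that state between lexer and parser is a separate question.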



