But in any normal design you never put the message length into the same parser; you use it to read the message, and the parser is fed the message once you have received that many bytes.
For the reader a message consists of <length>+<blob>+<length>+<blob>+ etc. It can be in two states: having enough data, in which case it passes the blob on to the parser (which doesn't have to validate that the blob is <length> bytes; we know that already); or awaiting more data (you can attack that by dribbling data slowly, but you can do that with any protocol).
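A minimal sketch of such a reader (in Python, assuming a 4-byte big-endian length prefix; the framing and names are made up for illustration, not any particular protocol):

    import struct

    class FrameReader:
        # <length>+<blob> reader: either it has a complete frame and hands
        # the blob to the parser, or it waits for more data.
        def __init__(self, parse_message):
            self.parse_message = parse_message   # the "real" parser, fed whole blobs
            self.buffer = b""

        def feed(self, data):
            self.buffer += data
            while True:
                if len(self.buffer) < 4:
                    return                       # awaiting more data
                length = struct.unpack(">I", self.buffer[:4])[0]
                if len(self.buffer) < 4 + length:
                    return                       # awaiting more data
                blob = self.buffer[4:4 + length]
                self.buffer = self.buffer[4 + length:]
                self.parse_message(blob)         # enough data: pass the blob on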
EDIT: Ok, I used a flat structure there; it would be different in a recursive structure with a length on every element. But a top-level reader that reads <length> bytes of the outer wrapper and doesn't muck about with the innards is still a sensible thing.
First off, this isn't about deeply nested structures; the flat example demonstrates the problem perfectly fine :)
> But in any normal design you never put the message length into the same parser; you use it to read the message, and the parser is fed the message once you have received that many bytes.
Won't work; this is a fundamental problem in computing science.
Splitting the parser into a pre-parser and a post-parser isn't going to help solve the fundamental problem, because the combination of two parsers is still a parser.
One of the problems is that you cannot distinguish <blob> bytes from <length> bytes. If the data stream gets out of sync with the parser state (hiccup, dropped packet), you have a very non-trivial problem on your hands. A context-free grammar, however, is free of context (ohh!) and can therefore resync in time proportional to how deeply it's nested.
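To make that concrete, here's what a single dropped byte does to the framing sketched above (same assumed 4-byte length prefix); nothing in the bytes themselves tells the reader it is now treating payload as a length:

    import struct

    def frame(blob):
        return struct.pack(">I", len(blob)) + blob   # assumed framing: length + payload

    stream = frame(b"hello") + frame(b"world")
    damaged = stream[:2] + stream[3:]                # one byte of the length dropped in transit

    # The reader now interprets payload bytes as part of a length and
    # waits for a frame that will never arrive.
    bogus_length = struct.unpack(">I", damaged[:4])[0]
    print(bogus_length)                              # 1384 instead of 5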
Speaking of nesting, that's another area where I expect CSGs to become incredibly hairy. Of course you can use a hybrid approach: length-prefixed messages for the "outer stream", and a context-free XML/JSON/Lisp-style format (delimiters on both sides[0]) for the recursive structures. But why would you do that? If you wanted to save bytes by avoiding delimiters on the outermost structures, there are far more bytes to be saved by applying the same "optimization" to the inner recursive structures. If you don't know what I'm talking about, think about how a tree-like recursive data structure is represented in the memory of a C program: yes, pointers. Alternatively you could length-prefix the elements like before; C programs don't do that because then you need to scan through everything, which is less efficient. Regardless, both approaches are context-sensitive, and good luck distinguishing malformed data from correct data.
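For a concrete feel of the difference, here is the same small tree serialized both ways (toy formats invented for illustration, not taken from the talk):

    tree = [b"add", b"2", [b"mul", b"3", b"4"]]

    def to_delimited(node):
        # Context-free style: "(" and ")" carry the structure in-band, so a
        # parser can always tell where a subtree ends (and resync on them).
        if isinstance(node, bytes):
            return node
        return b"(" + b" ".join(to_delimited(child) for child in node) + b")"

    def to_length_prefixed(node):
        # Context-sensitive style: a one-byte tag (L=leaf, N=node) plus a
        # 2-byte length on every element; byte N only means something if
        # every preceding length was read correctly -- that's the context.
        if isinstance(node, bytes):
            tag, payload = b"L", node
        else:
            tag, payload = b"N", b"".join(to_length_prefixed(c) for c in node)
        return tag + len(payload).to_bytes(2, "big") + payload

    print(to_delimited(tree))        # b'(add 2 (mul 3 4))'
    print(to_length_prefixed(tree))  # opaque bytes; lose one length and the rest is noise

In the delimited form a parser that lands in the middle of the stream can skip to the next balancing ")" and carry on; in the length-prefixed form there is nothing to skip to.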
Now, this whole "formal languages and automata" business is a very complex subject[1], so while it may seem that the whole argument hinges on dropping a packet and desyncing the parser[2], I got the feeling from that talk that there are other (similarly fundamental) problems; but this particular one I understood and can make a compelling argument for :)
[0] afaik you might actually get away with a delimiter on just one side, but that makes it harder to parse because you need strict precedence rules to resolve ambiguities (e.g. 2+34+182+1+1+74321+0)
[1] it was considered one of the hardest courses during my CS college years (the other one being on "formal proofs of program correctness"). For various reasons I retook this course 4 times (underestimating its difficulty at first being one of those reasons), but when I finally did pass, I did so with a score of 9 out of 10; I'm kinda proud of that :P But the *real* benefit of studying 4 times for the same difficult course is that you never really forget it (some parallels there with that post about "spaced repetition learning" last week).
[2] another thing they recommended that makes a lot of sense, but again is a parsing-complexity (security) vs bandwidth-efficiency trade-off: make the delimiters (say, parentheses) out-of-band characters, so they're not allowed in binary blobs. This saves you from all sorts of escaping exploits (think XSS), makes resyncing more efficient, and makes parsing a lot easier. Of course it's really hard to step out of the "we really need all 8 bits in a byte" paradigm; how else can you design data formats with out-of-band characters? I don't know, and the talk I watched didn't give a solution either, just that it would be a good idea (to which I agree).
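As a toy illustration of what "out-of-band" buys you (the reserved byte values below are an arbitrary choice of mine, not something from the talk):

    OPEN, CLOSE = b"\xfe", b"\xff"   # byte values reserved purely as delimiters

    def wrap(blob):
        # Nothing to escape: the delimiter bytes simply may not occur in a
        # blob, so we reject them instead of escaping them. The price is
        # that payloads can no longer use all 256 byte values.
        if OPEN in blob or CLOSE in blob:
            raise ValueError("blob contains reserved delimiter bytes")
        return OPEN + blob + CLOSE

The in-band alternative is escaping, and every escaping rule is another chance for the encoder and some decoder to disagree, which is exactly where the XSS-style exploits live.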