Um... Incoming Naive Questions. Why do we need yet another HTML5 Parser? What's ...

jgraham · on Aug 14, 2013

A standalone parser written in C is a great asset. Pretty much any language worth mentioning has C bindings, so they are now just a bindings implementation away from having a reasonably fast (the fact that performance was a non-goal notwithstanding), standards-compliant HTML parser. This is an improvement over the status quo, where most languages have bindings to lxml, which is fast but has made-up error handling and a tendency to deal poorly with quite a lot of content, and some languages have slow, native implementations of the HTML standard parsing algorithm (I wrote much of Python's html5lib, so I am aware both that it is slow and that it is non-trivial to speed up).

Compared to Gecko and WebKit, this gives you just the parser, which is significantly simpler than the whole engine and all you want for many applications.

venomsnake · on Aug 14, 2013

Because, for starters, WebKit and Gecko are not only parsers. They are much more complex layout engines, which is a whole other can of worms.

mjn · on Aug 14, 2013

The new Gecko parser is based on a Java->C++ translation from this standalone parser: http://about.validator.nu/htmlparser/

felixge · on Aug 14, 2013

License is Apache 2: https://github.com/google/gumbo-parser/blob/master/COPYING