For a couple of reasons, this code doesn't look very reliable...
Almost identical cases don't reuse code (not even a #define). There are also sections like "`if (usual[ch >> 5] & (1 << (ch & 0x1f))) break;`" without any comments. The code hardcodes the HTTP methods for some reason, so the detection code spans ~130 lines (454..580). Looks like it will accept "HXXX/1.1" if strict checking is off. This double check is... interesting:
```c
if (!parser->FOR##_mark) return 0; \
assert(parser->FOR##_mark); \
```
Sure - speed++, but at what cost? Otherwise... cool code - I like the MARK / CALLBACK macros.
I'm disappointed that the student who hardcoded the results was disqualified. You should get points for finding bugs in your professor's specification.
Not to mention the "winning" entry did exactly the same thing: it found an `n` (number of words) that works for this input only, then built a fast algorithm that works for this input only but is incorrect in general.
I'm stunned that this was not foreseen by the professor. Wasn't it obvious that the best way to do this was to give some sample documents for development, but keep the text that the program would be judged on a secret?
Well, the specs say that the program must "calculate the 10 most common words". The disqualified program doesn't really calculate the 10 most common words; it simply outputs the answer.
The student missed the most important part of the spec: the intent. 99 times out of 100, you aren't implementing a rigorous spec. The ability to understand what is requested is much more important than the ability to be a wise-ass.
> 99 times out of 100, you aren't implementing a rigorous spec.
I'd argue that most programmers do, or are supposed to. Either because they are not the designers of the system they build and what they are expected to do is to "fill in the blanks and make it work", or because they implement something that operates with another system and should strictly adhere to the interface/protocol.
The ones who have anything to say about the design are usually the system designers (duh), small project contributors (because there is no spec) and UI people (because they're designers). So yes - if your spec is "1. make it work on this input; 2. make it fast", then making it work in O(1) for this and only this input is basically the best thing you can do (unless you can make it faster for every input in less time of course). Spending time on generalising it is based on second-guessing the intent.
I'd even go as far as saying that specs have absolutely no intent. You have an intent and describe it while writing the spec and you'd better formalise your intent precisely or you will end up with something you didn't request. If you want something on every input, you'd better write it down (when you give a task to someone else).
I've seen that line referenced a couple of times here recently. I'm pretty sure it derives from Jerry Weinberg's Psychology of Computer Programming, which contains a story about some programmers defending an incorrect program on the grounds that it would process more cards per minute, to which Weinberg replied, "But your code doesn't work. If mine doesn't have to work I can do a million cards a minute!"
After reading that, I wonder if tries (e.g. Judy arrays) would beat hash tables in this exercise... (and I also think that hardcoding the results wasn't cheating)
For the record, and I am sure it's what you meant, a more precise description is that Judy arrays are a specific type of trie (a trie is a specialization of a tree, with a more rigorous definition).
Actually no. That's only a binding if you look at the source code.
The sole C source file is only about 2-3 lines long, with a "#include <Judy.h>" which isn't even included, while the Haskell source is littered with foreign function calls.
With some pointer arithmetic, I would imagine so (with tries; not sure what Judy arrays are)... although it might depend on how quickly memory can be allocated/accessed. Assuming the text is a set of lower-case alphabetic strings, you could define a 26-ary trie that does each character's lookup in very few instructions...
I would be willing to bet that that implementation is slower than a hash table.
(Yeah, I do have some data to back that up - I wrote & benchmarked an autocomplete implementation for a financial software firm. String tries were roughly 10x slower than binary-searching an array for a 20k word corpus, which is probably around the size you're dealing with here. They have terrible cache locality - each character makes you follow a pointer, which is probably a cache miss, and the total size of the trie is on the order of 26 * 8 * 20k = 4M. They become a bit better if the number of distinct words in the corpus is small (so that everything fits into cache) or if the number of words is huge (so that your hashtable or array blows the cache anyway).)
I've had the opposite experience with tries vs. hash tables, though that could be a result of the trie implementation. Dual-array tries have really impressive lookup times, especially compared to the naive implementation... Insertion is pretty bad, though. (That can be mitigated: http://www.gongcaichun.info/PPT/21.pdf)
The structure of your data matters a lot. If one will fit in cache but the other won't, the one that fits in cache will almost always win. Then they have different big-O characteristics: binary searching an array is O(log N) where N is the number of items, a trie is O(k) where k is the length of the key, a hashtable is O(1), but this is misleading because the hash function is usually O(k) and performance can degrade to O(n) if you have lots of hash collisions.
I thought that an autocompletion widget for stock tickers would be the perfect application for tries: short, dense key space, lots of elements, and a fair likelihood of hash collisions. But apparently not, because our data size just happened to be one where cache effects dominate. I talked to one of the GMail guys later and he was quite surprised, from which I inferred (but it was not stated) that GMail probably uses tries to good effect.
Right on. My point was simply that the type of Trie you use has an effect on the data's size in memory. A dual-array setup is pretty good at keeping the data small... There's very little pointer overhead and proximity is fairly good as well.
At my company we rolled our own too.
The hard part was tracking down all the brain-dead servers that seemingly worked fine with browsers, but were off spec.
I need to dig through my old computer and find my college lab assignments where we built a basic HTTP/1.1 server. For the request parser, we had to create a lex file for a grammar that included actions for all of the verbs, and also be robust to non-standard verbs, i.e., return a 400 code.
I doubt that my yacc'd program would be only 124 bytes in size, but it would be interesting to get that old code and compare the results.
I thought it was a pretty cool implementation (zero memory allocation is pretty hard most of the time), though I was disappointed by how the HTTP methods were hardcoded in: adding a new HTTP method (like PROPFIND?) would mean adding a ton of code. Then again, you could make a few macros for handling HTTP methods and keep the rest of it.
Most parsers nowadays are generated via yacc and similar tools. You write the grammar, yacc generates the parser (and flex the lexer). It's something completely different if you write the parser yourself.
I certainly am. It's a complete mystery to me. Is it like visual basic?
The only PITA with HTTP is chunked encoding. Whoever thought that gem up should be shot. The rest is fairly trivial. Certainly parsing headers is. This implementation looks pretty silly. Having individual states for each of the characters in "HTTP" etc? WTF?
edit: Instead of just downmodding me, why not explain exactly what part of parsing HTTP headers is non trivial?
Parser correctness is not trivial. RFC 2616 contains the complete grammar you need, so it's fairly simple to implement. OTOH, if you write a parser on your own, you're likely to miss stuff like section 4.2, which explains that header values can span multiple lines if the continuation lines start with LWS. The parser from this article will fail on (just looking at the source, I'm 99.5% sure of this):
```
abc:
    def
```
It also doesn't like tabs and will not support comma-separated header values. It's not rocket science to write a "good enough" http parser, but writing a fully compliant one is something completely different. There are also cool parts of the spec that you can read 10 times and come to different conclusions - for example what does the "\" CR LF section mean if it's inside a quoted string and does it finish the header value or not. Writing a "correct" parser is a LOT of fun...
Keeping separate states for characters in HTTP saves you a couple of cycles probably, because you match as you go and can reject the message early and with the exact place that didn't match. It's a bit useless for a 4-letter string though.