For a couple of reasons, this code doesn't look very reliable...
Almost identical cases don't reuse code (not even a #define). There are also sections like "`if (usual[ch >> 5] & (1 << (ch & 0x1f))) break;`" without any comments. The code hardcodes the HTTP methods for some reason, so the detection code spans ~130 lines (454..580). Looks like it will accept "HXXX/1.1" if strict checking is off. This double check is... interesting:
```c
if (!parser->FOR##_mark) return 0; \
assert(parser->FOR##_mark); \
```
Sure - speed++, but at what cost? Otherwise... cool code - I like the MARK / CALLBACK macros.
I'm disappointed that the student who hardcoded the results was disqualified. You should get points for finding bugs in your professor's specification.
Not to mention the "winning" entry did exactly the same thing: it found an `n` (number of words) that works for this input only, then built a fast algorithm that works for this input only but is incorrect in general.
I'm stunned that this was not foreseen by the professor. Wasn't it obvious that the best way to do this was to give some sample documents for development, but keep the text that the program would be judged on a secret?
Well, the specs say that the program must "calculate the 10 most common words". The disqualified program doesn't really calculate the 10 most common words; it simply outputs the answer.
The student missed the most important part of the spec: the intent. 99 times out of 100, you aren't implementing a rigorous spec. The ability to understand what is requested is much more important than the ability to be a wise-ass.
> 99 times out of 100, you aren't implementing a rigorous spec.
I'd argue that most programmers do, or are supposed to. Either because they are not the designers of the system they build and what they are expected to do is to "fill in the blanks and make it work", or because they implement something that operates with another system and should strictly adhere to the interface/protocol.
The ones who have anything to say about the design are usually the system designers (duh), small project contributors (because there is no spec) and UI people (because they're designers). So yes - if your spec is "1. make it work on this input; 2. make it fast", then making it work in O(1) for this and only this input is basically the best thing you can do (unless you can make it faster for every input in less time of course). Spending time on generalising it is based on second-guessing the intent.
I'd even go as far as saying that specs have absolutely no intent. You have an intent and describe it while writing the spec and you'd better formalise your intent precisely or you will end up with something you didn't request. If you want something on every input, you'd better write it down (when you give a task to someone else).
I've seen that line referenced a couple of times here recently. I'm pretty sure it derives from Jerry Weinberg's Psychology of Computer Programming, which contains a story about some programmers defending an incorrect program on the grounds that it would process more cards per minute, to which Weinberg replied, "But your code doesn't work. If mine doesn't have to work I can do a million cards a minute!"
After reading that, I wonder if tries (e.g. Judy arrays) would beat hash tables in this exercise... (and I also think that hardcoding the results wasn't cheating)
For the record, and I am sure it's what you meant, a more precise description is that Judy arrays are a specific type of trie (a trie is a specialization of a tree, with a more rigorous definition).
Actually no. That's only a binding if you look at the source code.
The sole C source file is only about 2-3 lines long, with a "#include <Judy.h>" which isn't even included, while the Haskell source is littered with foreign function calls.
With some pointer arithmetic, I would imagine so (with tries; not sure what Judy arrays are)... although it might depend on how quickly memory can be allocated/accessed. Assuming the text is a set of lower-case alphabetic strings, you could define a 26-ary trie that does each character's lookup in very few instructions...
I would be willing to bet that that implementation is slower than a hash table.
(Yeah, I do have some data to back that up - I wrote & benchmarked an autocomplete implementation for a financial software firm. String tries were roughly 10x slower than binary-searching an array for a 20k word corpus, which is probably around the size you're dealing with here. They have terrible cache locality - each character makes you follow a pointer, which is probably a cache miss, and the total size of the trie is on the order of 26 * 8 * 20k = 4M. They become a bit better if the number of distinct words in the corpus is small (so that everything fits into cache) or if the number of words is huge (so that your hashtable or array blows the cache anyway).)
I've had the opposite experience with tries vs. hash tables, though that could be a result of the trie implementation. Dual-array tries have really impressive lookup times, especially compared to the naive implementation... Insertion is pretty bad, though. (That can be mitigated: http://www.gongcaichun.info/PPT/21.pdf)
The structure of your data matters a lot. If one will fit in cache but the other won't, the one that fits in cache will almost always win. Then they have different big-O characteristics: binary searching an array is O(log N) where N is the number of items, a trie is O(k) where k is the length of the key, a hashtable is O(1), but this is misleading because the hash function is usually O(k) and performance can degrade to O(n) if you have lots of hash collisions.
I thought that an autocompletion widget for stock tickers would be the perfect application for tries: short, dense key space, lots of elements, and a fair likelihood of hash collisions. But apparently not, because our data size just happened to be one where cache effects dominate. I talked to one of the GMail guys later and he was quite surprised, from which I inferred (but it was not stated) that GMail probably uses tries to good effect.
Right on. My point was simply that the type of Trie you use has an effect on the data's size in memory. A dual-array setup is pretty good at keeping the data small... There's very little pointer overhead and proximity is fairly good as well.
At my company we rolled our own too.
The hard part was tracking down all the brain-dead servers that seemingly worked fine with browsers, but were off spec.
I need to dig through my old computer and find my college lab assignments where we built a basic HTTP/1.1 server. For the request parser, we had to create a lex file for a grammar that included actions for all of the verbs, and also be robust to non-standard verbs, i.e., return a 400 code.
I doubt that my yacc'd program would be only 124 bytes in size, but it would be interesting to get that old code and compare the results.
I thought it was a pretty cool implementation (zero memory allocation is pretty hard most of the time), though I was disappointed by how the HTTP methods were hardcoded in: adding a new HTTP method (like PROPFIND?) would mean adding a ton of code. Then again, you could make a few macros for handling HTTP methods and keep the rest of it.
Most parsers nowadays are generated via yacc and similar tools. You write the grammar, yacc generates the parser (and flex the lexer). It's something completely different if you write the parser yourself.
I certainly am. It's a complete mystery to me. Is it like visual basic?
The only PITA with HTTP is chunked encoding. Whoever thought that gem up should be shot. The rest is fairly trivial. Certainly parsing headers is. This implementation looks pretty silly. Having individual states for each of the characters in "HTTP" etc? WTF?
edit: Instead of just downmodding me, why not explain exactly what part of parsing HTTP headers is non trivial?
Parser correctness is not trivial. RFC 2616 contains the complete grammar you need, so it's fairly simple to implement. OTOH, if you write a parser on your own, you're likely to miss stuff like section 4.2, which explains that header values can span multiple lines if the continuation lines start with LWS. The parser from this article will fail on (just looking at the source, I'm 99.5% sure of this):
```
abc:
    def
```
It also doesn't like tabs and will not support comma-separated header values. It's not rocket science to write a "good enough" http parser, but writing a fully compliant one is something completely different. There are also cool parts of the spec that you can read 10 times and come to different conclusions - for example what does the "\" CR LF section mean if it's inside a quoted string and does it finish the header value or not. Writing a "correct" parser is a LOT of fun...
Keeping separate states for characters in HTTP saves you a couple of cycles probably, because you match as you go and can reject the message early and with the exact place that didn't match. It's a bit useless for a 4-letter string though.