Hacker News new | past | comments | ask | show | jobs | submit login
How gzip uses Huffman coding (jvns.ca)
163 points by StylifyYourBlog on Feb 23, 2015 | hide | past | favorite | 6 comments



When I was in college at Rice, one of the harder projects in a sophomore level software engineering class was implementing Gzip according to RFC 1951 and RFC 1952. We had about an hour lecture explaining what huffman coding was and then were left to our own devices for three weeks.

Half the challenge was just understanding what the RFCs were talking about, teaching ourselves huffman coding, and then figuring out how to build this low level stuff in Java in a reasonable way. This was definitely one of the hardest projects I had up until that point.

We had a class discussion afterwards about what was so difficult about implementing GZIP and a few of us mentioned the RFCs were tough to figure out. Professor laughed and said something along the lines of "Besides just learning computer science, we're hoping to teach you the value of good documentation!"


I've implemented gzip... I found that implementing the decompressor first (with a very verbose --debug flag) was critical to understanding the bitstream well enough to implement compression :)


Would you recommend it as a learning exercise?


Maybe. There are lots of finicky details, like inconsistent bit ordering and arbitrary lookup tables. I think it's a long week of work at least, probably more.

On the other hand, it is probably the simplest compression format in widespread use that uses variable-length encoding, and there's a few implementation around for you to look at. Things like bzip2 and LZMA2 really don't have independent specifications and multiple implementations, and are more complicated to boot.

Byte-oriented Lempel-Ziv formats like LZ4, LZO and LZJB would be a good way to dip your toes into a production compression format. The LZW encoding used by GIF files is also "simpler" in principle but I personally find it harder to wrap my head around.

If you want to experiment with binary formats, it might also be interesting to write a decoder for a simple image format, like PCX files, or even for old game file formats like Doom's [http://doom.wikia.com/wiki/WAD].


LZJB is fantastic. Here's my take on a Python implementation of both compression and decompression: https://github.com/unwind/python-lzjb. It's ~150 lines for both, and BSD 2-clause-licensed. It's not very high-performant (I wrote it for fairly small amounts of data) but hopefully clear enough to learn from.


Huffman coding is a gateway drug for data science. Once you see raw input as data, you see as much benefit from optimizing as you do from analyzing.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: