The Hitchhiker's Guide to Compression (go-compression.github.io)
159 points by ohjeez on Oct 5, 2020 | 24 comments



I have mixed feelings about this. It's always good to teach people things they don't know, and compression is a worthy subject.

But the author seems to lament that not much of interest is happening in lossless compression. That's true, but there's a good reason for it: lossless compression is just about as good as it's ever going to get, and it's easy to prove that. Seeking dramatic improvements in general lossless compression is like seeking faster-than-light travel: ain't gonna happen. Except for specialized niches, any improvements to lossless compression are going to be tiny and incremental. Tiny and incremental might well matter for petabyte-scale data stores like the ones Google manages, but it won't matter for most people.

Lossy compression is a different story. There's still lots of unexplored territory in finding an optimum bit rate that matches the information processing capabilities of human eyes and ears.


I had a similar feeling a decade ago, but I now think otherwise. We may have reached a good-enough compression ratio, but not good-enough performance, and modern algorithms now relentlessly improve on this. On the other hand, lossy compression always tends to exploit the currently available computing power; the ability to keep doing so for decades implies that improvements in lossy compression are more "incremental" than those in lossless compression.


What do you make of the Hutter prize [0]?

I think the greatest breakthrough in lossless compression would come when the development of compute outpaces storage density. Maximum possible compression (down to the entropy, I think?) isn't impossible per se, but it is computationally infeasible with today's computing resources.

Of course, this is largely hypothetical now, but I envision a time in the distant future when we produce so much data that we must compress it to its principled minimum.

I do agree that current lossless compression is pretty pressed for further innovation given current resource constraints.

[0] http://prize.hutter1.net/


What if compute capability became so powerful that the easiest way to store something was to create it, store it for a millisecond, record the exact place and time since the Big Bang at which it was stored (a comparatively short set of coordinates), and then delete it? Accessing the file would then be as simple as simulating the universe up to that millisecond and reading the file back.


Or we just store the hash and then brute-force it every time we need the original content.
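For the avoidance of doubt about why nobody does this: here's a toy sketch in Python (SHA-256 and the two-byte example are arbitrary choices of mine; the search is O(256^length), so anything beyond a few bytes is hopeless).

    import hashlib
    from itertools import product

    def hash_compress(data: bytes):
        # "Compression": keep only the digest and the original length.
        return hashlib.sha256(data).digest(), len(data)

    def hash_decompress(digest: bytes, length: int) -> bytes:
        # Brute-force every byte string of that length: O(256**length) work.
        for candidate in product(range(256), repeat=length):
            data = bytes(candidate)
            if hashlib.sha256(data).digest() == digest:
                return data  # first match wins; a collision would make this lossy
        raise ValueError("no preimage found")

    digest, length = hash_compress(b"hi")
    print(hash_decompress(digest, length))  # b'hi', after at most 65536 attempts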


Probably easier to just have every computer connected on a much faster network. Query the internet for a hash, have it sent to you after a 100ms lookup at 1TBps.

Seems impossible, considering that in my lifetime I've basically seen ISP speeds go from 5/1 on my first cable connection to 20/768 on DSL 20 years later. I always find it a bit weird knowing how important the internet is to commerce and the economy, and yet my only real way to access it is through the same phone line that has been around forever.


That'd be lossy with hash collisions, though.


You are right, although we might be able to make it lossless in practical terms with probabilistic approaches like combining multiple algorithms, selecting the hash length, and using heuristics to pick the most likely source.


I think this assumes a deterministic universe. I'm not sure if that's the case.


DRM will ruin it.


Isn't Heisenberg's uncertainty principle actually some kind of DRM on the data that we can measure?


I somewhat agree. I think Shannon's source coding theorem comes into play here, and AFAIK we are already very close to the limit, so improvements are bound to be small as we try to get closer and closer to the theoretical bound.

Still, I am reminded of Huffman compression: researchers thought they had already reached the optimum. They were proven phenomenally wrong when it was realized that sequences, rather than individual bytes, were the thing to compress (I believe LZ77 or LZW was the first to do so).
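For reference, the core of that sequence-based idea fits in a handful of lines. A minimal LZW-style sketch in Python (output codes left as plain integers, no bit packing):

    def lzw_compress(data: bytes) -> list:
        # Dictionary starts with every single byte; each newly seen sequence
        # gets the next free code, so repeated sequences shrink to one code.
        dictionary = {bytes([i]): i for i in range(256)}
        w, out = b"", []
        for b in data:
            wb = w + bytes([b])
            if wb in dictionary:
                w = wb
            else:
                out.append(dictionary[w])
                dictionary[wb] = len(dictionary)
                w = bytes([b])
        if w:
            out.append(dictionary[w])
        return out

    print(lzw_compress(b"abababababab"))  # 6 codes for 12 input bytes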

If we take that into account and follow the same line of thought, improvements may still be possible if we compress based on a larger body of data. For example, headers of files of the same type contain the same common information; in theory, that information could be stored only once on the hard drive. This is something deduplicated storage is already trying to achieve, so nothing new here, just pointing to one avenue that could still yield future improvements :)
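A toy sketch of that idea in Python, content-addressed at the block level (the 4 KB block size and SHA-256 are arbitrary choices of mine):

    import hashlib

    def dedup_store(files, block_size=4096):
        # files: {name: bytes}. Each file is split into fixed-size blocks and
        # every unique block is stored once, keyed by its hash; the per-file
        # manifest is just the ordered list of block hashes.
        blocks, manifests = {}, {}
        for name, data in files.items():
            hashes = []
            for i in range(0, len(data), block_size):
                block = data[i:i + block_size]
                h = hashlib.sha256(block).hexdigest()
                blocks.setdefault(h, block)
                hashes.append(h)
            manifests[name] = hashes
        return blocks, manifests

Two files sharing a common header land on the same leading block hash, so that block is only stored once.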


If you only care about compressing a megabyte a minute, there's not a lot left to do. But in practice speed is usually a major consideration, and better methods let you fit significantly better compression into your time budget.

Also, going one byte at a time with minimal context is pretty good for text but pretty bad for many other kinds of data. There's a lot of room to improve on those fronts.


How do we know that we've hit a wall with lossless compression? As a non-expert, I've been really impressed by the likes of zstd and lz4, which are relatively new.


I should clarify: There's still room for improvement in the speed of lossless compression/decompression operations, and that's where most improvements happen nowadays. But there's little room for improvement in compression ratio, because most modern algorithms produce outputs that are very close to random already.

More background: https://marknelson.us/posts/2006/06/20/million-digit-challen...
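A quick way to see that for yourself with Python's zlib (exact sizes will vary with the zlib version, but the shape of the result won't):

    import zlib

    # Highly repetitive input compresses dramatically, but the compressed
    # output is close enough to random that a second pass gains nothing.
    text = b"the quick brown fox jumps over the lazy dog " * 1000
    once = zlib.compress(text, 9)
    twice = zlib.compress(once, 9)
    print(len(text), len(once), len(twice))
    # expect roughly 44000, a few hundred, and slightly MORE than the second number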


> because most modern algorithms produce outputs that are very close to random already.

Compressing well-compressed or random data would be impossible, but how does that indicate there's little room for improvement?


For further reading, Data Compression Explained ( http://mattmahoney.net/dc/dce.html ) is a very good reference to data compression algorithms with a better explanation of the information theory behind it and the differences between Coding (Huffman/Arithmetic/ANS/...) and Transforms (LZ77/LZW/BWT/...).


On the topic of "simple" lossless compression, my favorite is probably LZJB from the ZFS file system [1].

I'm very far from an expert; I found it when searching a couple of years ago for something simple and usable in a semi-tight embedded setting.

When managing software installation/upgrade for an embedded solution, LZJB typically shaved 30-50% off the size of the executable, which was really awesome. LZJB requires quite a small amount of RAM when decompressing (around 1 KB if I remember correctly).

I have a repo with a BSD-licensed implementation in pure Python, which is very useful for putting in the build pipeline [2].

[1] https://en.wikipedia.org/wiki/LZJB

[2] https://github.com/unwind/python-lzjb
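Usage is roughly like the sketch below; the function name and file path are from memory, so double-check against the README before copying.

    # Hypothetical usage sketch; verify the real function names against
    # https://github.com/unwind/python-lzjb before relying on this.
    import lzjb

    with open("firmware.bin", "rb") as f:      # made-up example path
        raw = f.read()

    packed = lzjb.compress(bytearray(raw))     # assumed: compress(bytes-like) -> compressed bytes
    print("%d -> %d bytes (%.0f%% saved)"
          % (len(raw), len(packed), 100.0 * (1 - len(packed) / len(raw))))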


While I like the ideas behind this guide, it's very misleading and out of date compared to what today's lossless compression algorithms care about. Anybody speaking the words "Huffman tree" will be thrown out the window. Huffman hasn't been a "tree" for years, we all use canonical Huffman.

Similarly, the differences between LZ77, LZSS, etc. aren't really of note. The modern innovations in LZ77 are 1. getting the really optimal parse out of your encoder (hard!), 2. recognizing higher-order structure, like repcodes (first seen in LZX, but also a big part of LZMA), and 3. allowing the CPU to parallelize the decode procedure by making the stream less serial.
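For anyone unfamiliar with the "canonical" part: once you have a code length per symbol, the codes themselves are fully determined by sorting, so no tree needs to be stored or walked. A minimal sketch in Python (the example lengths are made up):

    def canonical_codes(lengths):
        # lengths: {symbol: code length in bits}, produced by any Huffman-style
        # length assignment. Codes are handed out in (length, symbol) order,
        # so only the lengths ever need to be transmitted.
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (length - prev_len)   # widen the code when the length grows
            codes[sym] = format(code, "0%db" % length)
            code += 1
            prev_len = length
        return codes

    print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}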


Deduplicating compressed data is difficult if you want to be able to reconstruct the original. Zip, for example, has many ways to compress the data, so it's difficult to decompress a zip while keeping the information necessary to reconstruct the exact same compressed file.

I'm aware of one tool for zip:

https://pypi.org/project/deterministic-zip/

Of course, that only helps if it was used to create the original zip.

Is there a list of "reversible" formats somewhere? Are there other ways to deal with this?


It would be great if this was built into zip and tar.

I've been doing an awful find, xargs, and touch combo to get deterministic zips, but the CLI args vary between Unix distributions.

It would be better if zip and tar could just take a single timestamp arg, and use that for all entries.
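Until that exists, Python's zipfile lets you pin everything per entry yourself. A rough sketch (the file list, permissions, and the 1980 epoch date are just placeholder choices):

    import zipfile

    # Build a zip whose entries all carry the same fixed timestamp, so the
    # archive bytes depend only on names and contents (plus the zlib version).
    FIXED_DATE = (1980, 1, 1, 0, 0, 0)   # earliest timestamp the zip format allows

    def deterministic_zip(out_path, files):
        # files: list of (archive_name, bytes) pairs; sorted for a stable order
        with zipfile.ZipFile(out_path, "w") as zf:
            for name, data in sorted(files):
                info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
                info.external_attr = 0o644 << 16          # fixed permissions too
                zf.writestr(info, data, compress_type=zipfile.ZIP_DEFLATED)

    deterministic_zip("out.zip", [("a.txt", b"hello"), ("b.txt", b"world")])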


Is there any good, systematic discussion of the trade-off between computational complexity and compression? I have this intuition that the lower the entropy of the signals, the more complex the encoders and decoders might need to be, while slightly reducing compression could, in some cases, allow making do with dumber codecs. Is this idea explored anywhere?


You may want to read about the Pareto frontier. The Squash Compression Benchmark [1] shows a plot of the Pareto frontier for a particular dataset.

[1] https://quixdb.github.io/squash-benchmark/#optimal-codecs
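For concreteness, the Pareto frontier here just means dropping every codec that some other codec beats on both speed and ratio. A toy sketch in Python with made-up numbers (not real benchmark results):

    # Each entry is (speed in MB/s, compression ratio); all values invented.
    codecs = {
        "fast-but-light": (500.0, 2.1),
        "balanced":       (150.0, 2.9),
        "slow-but-dense": (10.0,  3.8),
        "dominated":      (90.0,  2.5),   # worse than "balanced" on both axes
    }

    def pareto_frontier(points):
        # Keep a codec only if no other codec is at least as good on both
        # axes and strictly better on at least one.
        frontier = {}
        for name, (speed, ratio) in points.items():
            dominated = any(
                s >= speed and r >= ratio and (s > speed or r > ratio)
                for other, (s, r) in points.items() if other != name
            )
            if not dominated:
                frontier[name] = (speed, ratio)
        return frontier

    print(pareto_frontier(codecs))   # "dominated" is the only codec dropped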


Read up on Kolmogorov complexity. Mind-blowing stuff. Not exactly what you’re asking about but I think it’s relevant.



