Zstandard – Real-time data compression algorithm (facebook.github.io)
159 points by RafelMri on Sept 4, 2022 | 56 comments



Interesting fact: under the hood it's based on tANS, introduced by Jarosław Duda from the Jagiellonian University.

some cool references:

https://www.youtube.com/watch?v=uXtmN9fE01k

https://th.if.uj.edu.pl/~dudaj/

https://demonstrations.wolfram.com/DataCompressionUsingAsymm...

https://encode.su/threads/2078-List-of-Asymmetric-Numeral-Sy...


Well, the entropy coding step is, which is just one of several parts of the compression pipeline. But entropy coding typically is the bottleneck in encoding/decoding speed, yes, and Duda's work is impressive (also because he took on Google when the latter didn't appear to keep the no-software-patents promise they made when they first started collaborating. The man stands up for his principles)



>PhD in Theoretical Physics, PhD in Theoretical Computer Science, MSc in Theoretical Mathematics

well, impressive


Also look at the timeline! The dude started three 4-year master's courses (CS, theoretical math, theoretical physics) in a one-year stagger, so during 2001-2004 he was doing all three in parallel. The man seems to be quite the beast, even if he was probably able to reuse quite a bit of that math!


I'm not sure why this is an interesting fact, could you explain?


Zstandard really is incredible. Anywhere I would’ve previously compressed things with gzip I now universally prefer to use Zstandard instead. It’s a shame it’s not supported as an HTTP Content-Encoding in web browsers.


According to some Firefox developers, brotli is more suitable because of the standardised dictionaries for web content, like JS: https://bugzilla.mozilla.org/show_bug.cgi?id=1301878


can't zstandard use standardized dictionaries as well?


Yes, sort of: "Zstd offers a training mode, which can be used to tune the algorithm for a selected type of data. Training Zstandard is achieved by providing it with a few samples (one file per sample). The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression. Using this dictionary, the compression ratio achievable on small data improves dramatically."

https://github.com/facebook/zstd#the-case-for-small-data-com...
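
For a rough idea of how that dictionary workflow looks programmatically, here is a hedged sketch against zstd's public C API (ZDICT_trainFromBuffer from zdict.h and ZSTD_compress_usingDict from zstd.h); buffer sizing and error checks are largely omitted, and the decompressor must be given the same dictionary (e.g. via ZSTD_decompress_usingDict):

    #include <zstd.h>   /* ZSTD_compress_usingDict, ZSTD_createCCtx */
    #include <zdict.h>  /* ZDICT_trainFromBuffer */

    /* Train a dictionary from concatenated samples, then use it to
     * compress one small record.  Sketch only: real code must check
     * ZDICT_isError()/ZSTD_isError() and size dst with ZSTD_compressBound(). */
    size_t compress_with_trained_dict(
            const void *samples, const size_t *sample_sizes, unsigned nb_samples,
            const void *record, size_t record_size,
            void *dst, size_t dst_capacity)
    {
        static char dict[112640];  /* ~110 KiB dictionary buffer (a typical size) */
        size_t dict_size = ZDICT_trainFromBuffer(dict, sizeof dict,
                                                 samples, sample_sizes, nb_samples);

        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        size_t written = ZSTD_compress_usingDict(cctx, dst, dst_capacity,
                                                 record, record_size,
                                                 dict, dict_size,
                                                 3 /* compression level */);
        ZSTD_freeCCtx(cctx);
        return written;  /* compressed size, or an error code */
    }

In practice you'd train the dictionary once offline over many samples and reuse it for every record (e.g. via ZSTD_createCDict), rather than retraining per call as in this sketch.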


According to the Bugzilla thread, Facebook won't bother going through the process of publishing standardized dictionaries.


But surely anyone could publish dictionaries? Let's say Mozilla could publish a dictionary trained on HTTP headers, HTML and JavaScript, and then accept traffic compressed with it under an "Accept-Encoding: zstd/mozilla" header.


I wouldn't want to go to the trouble of determining the privacy implications of standardized dictionaries built from user data.


Brotli is super slow and thus suitable only for static content.

Zstd also supports pre-trained dictionaries; they are particularly good for small, domain-specific JSON, for example. They won't be that helpful for relatively large HTML files.


The problem is brotli compression is pretty expensive/slow to encrypt. Which isn't really a problem for static content that can be compressed ahead of time. But doesn't work very well for just in time compression.


This isn't true. Brotli compression speed is similar to Zstd at comparable compression ratios.

https://quixdb.github.io/squash-benchmark/unstable/?dataset=...

Perhaps this common misunderstanding is due to the fact that the slowest level (-11) is the default in the brotli command-line utility?

It's also faster than gzip, and gzip was used for "just in time" compression on the web for ages, so I don't see how brotli can be a problem for this use case. (I assume by "encrypt" you mean "compress".)


On a project I worked on, we tried switching from gzip to brotli for on the fly compression, and it significantly increased CPU usage. That was years ago though so it's possible things are better now, or maybe the library we used just didn't support a compression ratio as low as what we were using with gzip.


Fair enough, I think there were some issues with early releases, but they should be fixed now.


I wish Caddy would support brotli; it's either gzip or zstd currently.


Just did some quick research and it looks like you can serve pre-compressed static resources in Brotli. See the precompressed option of the file_server directive[1].

The thinking seems to be that, due to how CPU-intensive it is, Brotli is not favoured for on-the-fly compression[2].

If you're really desperate for it though, you could try [this extension][3].

[1] https://caddyserver.com/docs/caddyfile/directives/file_serve...

[2] https://caddy.community/t/caddy-v2-brotli/8805

[3] https://github.com/ueffel/caddy-brotli


The extension is not performance-focused, so it's slow.

Other web servers support it, mainly Nginx. But since Caddy is written in Go, they need a native Go port, which doesn't exist yet.


> The extension is not performance-focused, so it's slow.

Yeah, hence my "really desperate" qualifier. I probably should have been more explicit.


Caddy 2 supports brotli static compression out of the box, I ran a test on it just last week with this exact configuration. Dynamic, I'm not sure. That use case doesn't interest me as much.


Yeah, this is definitely a real advantage and not a broke-ass feedback loop of SEO and confirmation bias. It's great that we have to ship our compressors with massive English-biased static dictionaries now.


Technically it is available, it's just not widely supported. It's also not really necessary, though: Brotli is, in many ways, very comparable or a near equivalent to zstd.


Last time I looked, the Brotli format didn't clearly define its framing, such that there is a good chance a random stream of data would decode as valid brotli-compressed data.


When is this ever left to chance? Why would a system randomly decide to try decoding a stream as brotli if it isn't?


It's not so much about suddenly starting to decode something. It's more that when your stream for some reason gets misaligned or corrupted, whatever is still streamed is happily consumed with no possible way to detect the error.

A common reason is someone forgetting to check how many bytes read() actually returned and just assuming it filled the buffer (see the sketch below).

Robust data formats have double protection against this class of bugs with unambiguous SOF and EOF boundaries that the consumer can assert. I guess a sanity-byte once every X Megabytes wouldn’t hurt either.
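
To make that read() failure mode concrete, a hypothetical C sketch (not specific to any compressor) of the buggy pattern and the fix:

    #include <unistd.h>

    /* Buggy: assumes a single read() fills the whole buffer.  On a pipe
     * or socket it may return fewer bytes, silently truncating the data;
     * whatever gets decoded from the tail of `buf` afterwards is garbage
     * that only a format with strong framing/checksums can reject. */
    ssize_t read_record_buggy(int fd, void *buf, size_t len)
    {
        return read(fd, buf, len);            /* may return < len! */
    }

    /* Correct: loop until `len` bytes have arrived (or EOF/error),
     * checking the return value of every read() call. */
    ssize_t read_record(int fd, void *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = read(fd, (char *)buf + done, len - done);
            if (n < 0)  return -1;            /* I/O error */
            if (n == 0) break;                /* EOF before `len` bytes */
            done += (size_t)n;
        }
        return (ssize_t)done;                 /* caller must still check this */
    }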


I use brotli as a web distribution format and as an internal compression format on web stacks for various projects and for production systems without issue.

I still don't understand the use case limitation, or exactly where this is a potential issue. Why is the data being interrupted? Is this some kind of raw UDP stream use case?


There is RFC8478 for HTTP Content-Encoding, but I don’t think any web browsers currently implement it.

https://www.rfc-editor.org/rfc/rfc8478


Also try LZ4 and Snappy. Depending on your use case they can save you extra CPU too.


Recent versions of zstd mostly obsolete LZ4 and Snappy so you can use one compressor to rule them all.


The chart on the zstd site doesn't really back that up for lz4.

    Compressor           Ratio  Comp. speed  Decomp. speed
    zstd 1.4.5 -1        2.884  500 MB/s     1660 MB/s
    zstd 1.4.5 --fast=1  2.434  570 MB/s     2200 MB/s
    zstd 1.4.5 --fast=5  2.178  700 MB/s     2420 MB/s
    lz4 1.9.2            2.101  740 MB/s     4530 MB/s
A ~3% larger size for nearly 2x the decompression speed is a no-brainer for lots of uses.

Anecdotally I have also gotten much better decomp speed on snappy on previous generations of CPUs staying within ~10% compression ratio, though I had not benchmarked/tuned extensively nor reconfirmed in the past ~2 years. (Probably this can also vary a lot by implementation, which you're often stuck with based on some other library choice.)


The lz4 algorithm is also tiny code-wise. Pulling in a lz4 compressor / decompressor library into my wasm project only added 11kb to the bundled size. Zstd is many times larger.

For my use case, any extra benefits of zstd are far outweighed by the extra cost of downloading zstd itself in the browser.

LZ4 is smaller, faster and much simpler.


This can be a bit misleading, because you only get the 2x decompression speed if you aren't bandwidth-limited. Over a network the wire is usually the bottleneck for both decoders, so the better ratio wins and zstd will almost always be faster.


Even if you are bandwidth-limited you can rephrase this as "3% slower for half the CPU usage" which is still a good bargain in a lot of cases.


A common use of zstd is for ZFS, where decode speed matters quite a bit.


Recent versions of zstd definitely don't obsolete LZ4, or else I don't think the author would still be contributing to both...

zstd also seems to be optimizing much more, recently, for the large-stream-of-data use cases, versus the fixed-size or smaller records you might get in, say, databases or filesystems, so I think it's actually becoming less generally useful of late.

And if you're going to play with Snappy, you might find S2, which was linked on HN relatively recently, interesting. [1]

[1] - https://github.com/klauspost/compress/tree/master/s2


They are in the benchmarks on the page.


I don't really know anything about compression algos, but something that struck me as interesting with this one when it came up on KF before is the ability to create a custom dictionary - not sure if that's really common among other compression algos or not. But since I often work with compressed MySQL dumps with a lot of repetitive data from my client's site that come out to almost 50MB when compressed with xz, I wonder if I could make a dictionary with all the MySQL keywords you'd see in dumps, plus other repetitive strings unique to that site (for example, something like two-thirds of all user email addresses end with "@gmail.com") and get those dump sizes even smaller.

I imagine that xz might be good enough at making its own dictionary that what it ends up with wouldn't be much different from what I'd make manually, so I could see a decent chance that the improvements would be minimal at best. Has anyone done any experimentation along these lines?


The dictionaries mentioned here help with small files, where the dictionary itself would be bigger than the raw source. You don't need that for a MySQL dump; it's many times bigger than the dictionary. You could theoretically save space for large files by sharing a dictionary, but the saving is probably negligible.


Okay, interesting. I guess I'll stick with what I've got now for the time being, then. Thanks.


Sorry for the little hijack: an alternative to using "regular" dumps might be MySQL Shell's dump utility. Contrary to mysqldump etc., it doesn't save data as SQL but in a CSV-like format. This won't solve the "@gmail.com" case, but it avoids the repetitive "INSERT INTO tablename VALUES (" for each record, and what isn't there doesn't have to be compressed to begin with. (It's also parallelizable etc.; a downside might be that it uses a number of files rather than a single one.) https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-uti...

Depending on your needs you could also look at the MySQL clone plugin, which gives you a simple way to clone the data as a physical backup, avoiding the textual representation, but it has its restrictions. https://dev.mysql.com/doc/refman/8.0/en/clone-plugin.html

Disclaimer: I work on the MySQL development team at Oracle.


looks interesting, but given the stats below from the official site, what would be the motivation to use zstd rather than lz4?

    Compressor           Ratio  Comp. speed  Decomp. speed
    zstd 1.4.5 --fast=5  2.178  700 MB/s     2420 MB/s
    lz4 1.9.2            2.101  740 MB/s     4530 MB/s


Zstd achieves considerably higher compression on some kinds of data with little to no preprocessing. LZ4 hits a compression ratio ceiling at some point - algorithmically, it's "half of Zstd" (only LZ match/literal encoding, no entropy encoding at all).

Read the LZ4 format; it's very simple and educational, I'd say beautiful. It fits in half a screen of text.
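
For a taste of that simplicity, here is a rough sketch of an LZ4 block decoder in C. All bounds checking and input validation are omitted, so treat it purely as illustration (use LZ4_decompress_safe() from the real library for anything serious):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Each LZ4 sequence is: [token][extra literal len*][literals]
     * [2-byte LE offset][extra match len*].  The high nibble of the token
     * is the literal length, the low nibble is (match length - 4); a
     * nibble of 15 means extra length bytes follow (each 255 continues).
     * The final sequence carries literals only.  Returns decompressed size. */
    size_t lz4_block_decode_sketch(const uint8_t *src, size_t src_len, uint8_t *dst)
    {
        const uint8_t *ip = src, *iend = src + src_len;
        uint8_t *op = dst;

        while (ip < iend) {
            uint8_t token = *ip++;

            /* 1. copy literals */
            size_t lit = token >> 4;
            if (lit == 15) { uint8_t b; do { b = *ip++; lit += b; } while (b == 255); }
            memcpy(op, ip, lit);  op += lit;  ip += lit;
            if (ip >= iend) break;                         /* last sequence */

            /* 2. copy a match from earlier output */
            size_t offset = ip[0] | ((size_t)ip[1] << 8);  ip += 2;
            size_t mlen = (size_t)(token & 0x0F) + 4;      /* minimum match is 4 */
            if ((token & 0x0F) == 15) { uint8_t b; do { b = *ip++; mlen += b; } while (b == 255); }

            const uint8_t *match = op - offset;
            while (mlen--) *op++ = *match++;               /* byte copy handles overlap */
        }
        return (size_t)(op - dst);
    }

This also illustrates the "half of Zstd" point above: there is no entropy coding step at all, just raw literals and back-references, which is exactly why it decodes so fast.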


Probably for the settings that give higher compression than LZ4


Zstd really seems to be a recurring trope on HN

(see, e.g., https://news.ycombinator.com/item?id=32529412#32531734)


At my previous job I noticed that analytics events were passed to the DWH as plain JSON. I immediately said they should be compressed and proposed zstd as the codec. The actual implementation took less than a week, including testing and gradual rollout, and cut incoming traffic to our data center by 70% without affecting CPU on client or server at all.


Is this comparing the same number of cores, or is zlib single-threaded and zstd multi-threaded?

For some applications it matters how fast one operation completes, but for others, it's much more relevant how much CPU time it consumes, so if zstd needs 1 second on 8 cores for something gzip does in 8 seconds on 1 core, it would be no benefit at all.


The benchmark numbers on the linked page are from lzbench, which tests single-threaded performance.
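
For completeness, zstd's multithreading is opt-in through a compression parameter; a hedged sketch with the public C API (nbWorkers left at its default of 0 keeps compression single-threaded, matching the lzbench numbers):

    #include <zstd.h>

    /* Sketch: multithreaded zstd compression via the advanced API.
     * Requires a libzstd built with threading support; decompression
     * stays single-threaded either way. */
    size_t compress_mt(const void *src, size_t src_size,
                       void *dst, size_t dst_capacity,
                       int level, int workers)
    {
        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, workers);   /* 0 = single-threaded */
        size_t written = ZSTD_compress2(cctx, dst, dst_capacity, src, src_size);
        ZSTD_freeCCtx(cctx);
        return written;   /* compressed size, or an error (check ZSTD_isError) */
    }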


I wish Rust would add this, but they don't seem interested at all:

https://github.com/rust-lang/rustup/issues/1858


We're interested. I'd love to see this happen. But it does require several different things to change, and rustup wouldn't be the starting place (apart from needing to have support for it).


They seem very interested. There's lots of benchmarks and discussion. It just looks like zstd would take more space than xz - it wasn't a universally better solution. The concerns noted are totally reasonable and are based on real world issues that the teams that maintain this infra have. What's obvious, however, is that there was a lot of interest.


That thread doesn't appear to be uninterested people. They're asking about rationale, saying they may be asking the wrong people, brainstorming other things that would help, and asking for data.


Is the latency of the Rust installation process important?

I can see why you would want it to be efficient, economical in bandwidth, and fast enough in all scenarios. But why does it need to be faster in situations where it is already quite fast?


In the future, it would be cool if there was an open-source data compression algorithm that used SIMD in its decoder so it could compete with Oodle.



