No, lz4 consistently decompresses slower than zstd under these particular conditions. `lz4 -d` consistently finishes in ~1.95s wall clock time, while `zstd -d --single-thread` consistently finishes in ~1.50s.
For reference, I'm using the latest versions of both programs in nixpkgs/NixOS (lz4 1.9.4 and zstd 1.5.5), both compiled with `-march=znver1`. I used the `-1 --single-thread` options in zstd, all files were cached in memory, and the dataset was `enwik9`, as I mentioned.
Apart from picking a zstd compression level that is competitive with lz4, I didn't cherry-pick the above scenario; it's just what I had at hand. I chose `enwik9` as the dataset because I wanted something that could be referenced rather than some random files on my computer, and specifically because it is what the README of this LZAV project mentions.
There are a few things to note:
1. For some reason, zstd uses around 145-150% CPU during compression and consistently uses 113% CPU during decompression, so the `--single-thread` option doesn't seem to be doing what it advertises [1]. This may be giving zstd an unfair advantage.
2. However, even taking into account the unfair 13% extra CPU usage, zstd is still coming out ahead in terms of decompression speed vs lz4.
3. Notably, I've used one of the "normal" zstd compression levels (level 1), in which zstd seems to beat lz4 at every metric. If I use one of the "fast" zstd compression levels, zstd could be even faster. For example, with level `--fast=1`, `zstd -d --single-thread` finishes in 1.24s on my machine compared to lz4's 1.95s.
4. Not to mention that if I use multiple threads for I/O and compression, zstd can compress faster than lz4 by an order of magnitude. In one sense this is quite an unfair advantage, but on the other hand the lz4 CLI tool has no option to compress with multiple threads, so it is quite relevant in terms of usability.
5. Interestingly, when I run the exact same benchmark under the exact same conditions on my AMD Zen 3 server rather than my Zen 1 laptop, lz4 consistently decompresses almost 2x faster than zstd at this compression level. I'm not sure why there is such a large discrepancy.
My only guesses for the large discrepancy are that my Zen 1 laptop has the `RETBleed: Mitigation: untrained return thunk` CPU mitigation enabled in the Linux kernel, which can cause a very large performance degradation [2], as well as slightly different `Spectre v2` mitigations and 2x-16x smaller CPU caches, and that perhaps this somehow affects lz4 more than zstd (apparently the RETBleed mitigation is not necessary for Zen 3 CPUs).
[1] Adding `-T1` before `--single-thread` makes no difference in either compression or decompression. Adding it after `--single-thread` makes no difference in decompression speed or CPU usage, but it does make compression even faster and use more CPU. My guess is that `-T1` is overriding `--single-thread` and using only one compression/decompression thread but using more threads for I/O.
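For anyone who wants to reproduce the wall-clock and CPU% figures, here is a minimal Python sketch of how they could be measured. It is not the exact procedure I used: the helper names (`time_command`, `compare`) and the file names in the example are hypothetical, the decompressed output is sent to `/dev/null` rather than to a file, and the child CPU times reported by `os.times()` are only meaningful on Unix-like systems.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv once and return (wall_seconds, cpu_seconds, cpu_percent)."""
    before = os.times()
    start = time.perf_counter()
    # Decompressed output goes to /dev/null, so no output file is written.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user/children_system accumulate the CPU time of finished children.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu, 100.0 * cpu / wall


def compare(commands, runs=5):
    """Print the fastest of several runs (by wall time) for each command."""
    for name, argv in commands.items():
        wall, cpu, pct = min(time_command(argv) for _ in range(runs))
        print(f"{name}: {wall:.2f}s wall, {cpu:.2f}s CPU, {pct:.0f}% CPU")


# Example with hypothetical file names (-c writes to stdout in both tools):
# compare({
#     "lz4": ["lz4", "-d", "-c", "enwik9.lz4"],
#     "zstd": ["zstd", "-d", "--single-thread", "-c", "enwik9.zst"],
# })
```

A CPU percentage noticeably above 100 for a single run is the kind of signal that made me doubt `--single-thread`.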
> No, No and No. It's algorithmically impossible that zstd is decompressing faster than lz4 by using the same environment.
Obviously it's not impossible, since it's happening on my laptop.
> You are comparing zstd with asynchronous I/O + multithreading against a simple single-threaded lz4 CLI.
As I mentioned, I used `--single-thread` both when compressing and decompressing, which according to the zstd man page, does the following:
> --single-thread: Use a single thread for both I/O and compression. As compression is serialized with I/O, this can be slightly slower. (...)
> (...) this mode is different from -T1, which spawns 1 compression thread in parallel with I/O.
As far as I understand, this shouldn't be doing asynchronous I/O nor multithreading.
Additionally, even if `zstd -d --single-thread` in fact does have multithreading due to some bug (which appears to be true to some extent, due to the >100% CPU consumption), it did not use more than 113% CPU so it could only have a 13% advantage vs lz4 in terms of CPU usage during decompression, which cannot fully explain why zstd is faster.
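The arithmetic behind that claim can be spelled out as follows. This is only a back-of-the-envelope sketch: it assumes lz4 ran at roughly 100% CPU (which I did not state explicitly) and that all of zstd's extra CPU time could be serialized onto one core in the worst case.

```python
# Figures quoted earlier in this comment.
lz4_wall_s = 1.95
zstd_wall_s = 1.50
zstd_cpu_fraction = 1.13  # 113% CPU

# Worst case: every bit of zstd's CPU time runs back to back on one core.
zstd_serialized_s = zstd_wall_s * zstd_cpu_fraction

print(f"zstd, fully serialized: {zstd_serialized_s:.2f}s")
print(f"lz4 / zstd ratio: {lz4_wall_s / zstd_serialized_s:.2f}")
```

Even under that pessimistic assumption zstd would need about 1.70s, so lz4 would still take roughly 15% longer.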
Furthermore, if I use `zstd -d -T4` then zstd shows this message: "Warning : decompression does not support multi-threading".
Perhaps zstd's `--single-thread` option is simply not working correctly in terms of serializing I/O with the compression/decompression activity. And even though no actual disk I/O is happening (due to the files being cached in RAM), perhaps the latency of doing I/O synchronously is significant enough to affect the results (due to the context switching + all kernel CPU mitigations).
> Additionally you're using some exotic and not representative hardware.
No, I'm using a standard Lenovo T495 laptop.
> In theory, the only single thread decompression case where zstd can beat lz4 is when you have slow (reading) i/o.
I don't think any slow I/O happened because, as I mentioned, the files were cached in memory. This laptop has 24 GB of RAM and was being lightly used, the files were relatively small (300-500 MB compressed each), and they had been read several times before (since I repeated the benchmark several times in a row). No significant I/O actually happened during my benchmarks (I always verify this because I have a graphical I/O monitor).
But perhaps the memory bandwidth limit and/or the userspace <-> kernel I/O latency are significant enough to affect the results.
> lz4 & zstd are mostly used as libraries and real benchmarks are not considering I/O or multithreading.
Well, I mostly use them as command-line tools, and real programs have to do I/O (otherwise how do they input and output the data to compress/decompress?) but I understand what you mean.
> This is what I'm getting with TurboBench & enwik9. lz4 decompressing 2,8x faster than zstd.
I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks:
According to this benchmark, for decompression lz4 is even faster and zstd is even slower on my machine than what TurboBench shows on your machine.
However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?
If it's the input, then note that since lz4 produced a compressed file that was 43% bigger than zstd's, then it means that it has to ingest more data to produce the same output, so this can also explain a large portion of the discrepancy.
But in any case, what matters to me as a user is how quickly I can compress and decompress a file (and how small the resulting compressed file is), not what the theoretical speed of an algorithm is in some synthetic benchmark that does not perform I/O nor accounts for whether I can compress/decompress in multithreaded mode.
And on this particular machine, at this particular zstd compression level, zstd beats lz4's default level on all metrics when using their respective CLI tools, even when I handicap zstd's CLI tool by not taking advantage of multiple CPUs (and multithreaded I/O, possibly).
And if I explicitly enable zstd's multithreading support, then zstd beats lz4 by a large factor when compressing, while also producing smaller files and decompressing faster (for whatever reason).
I understand that my environment may not be representative and that zstd may not be able to beat lz4 on decompression in other hardware/software configurations, but it may still beat lz4 on the other metrics and still be competitive (or at least, fast enough) on decompression, especially if I/O or memory bandwidth becomes the bottleneck rather than raw compression/decompression speed.
If you compress/decompress files frequently, you should really benchmark the command-line tools with the appropriate options (enable multithreading when possible and test various compression levels) instead of just relying on a synthetic benchmark that, for multiple reasons, does not reflect real-world performance.
And if you're using lz4/zstd in some other fashion (for example, in ZFS), you should really benchmark the whole system, because you may be surprised with the result. As an example, perhaps lz4's decompression speed advantage over zstd doesn't matter at all if the rest of the system can't process data faster than zstd can decompress it.
So when benchmarking the whole system, zstd may still win in every metric compared to lz4, including decompression -- which could happen if zstd has to process a smaller compressed input compared to lz4, thus using less memory bandwidth and I/O bandwidth for the compressed input and therefore having more memory bandwidth and I/O bandwidth available to produce the decompressed output (on a memory bandwidth or I/O bandwidth-constrained system).
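To make that argument concrete, here is a toy model, not a measurement. It assumes decompression time is the larger of the codec's CPU time and the time needed to move the compressed input plus the decompressed output over a shared bus. The function name, the 350 MB zstd file size, and both bus speeds are hypothetical; the codec speeds are rough stand-ins based on the ~4 GB/s and 2.8x figures mentioned in this thread.

```python
def decompress_time(original_bytes, compressed_bytes, codec_bytes_per_s, bus_bytes_per_s):
    """Toy model: time is bounded by the slower of CPU work and data movement."""
    cpu_time = original_bytes / codec_bytes_per_s
    transfer_time = (compressed_bytes + original_bytes) / bus_bytes_per_s
    return max(cpu_time, transfer_time)


ORIGINAL = 1_000_000_000          # enwik9 is 10^9 bytes
ZSTD_FILE = 350_000_000           # hypothetical compressed size
LZ4_FILE = int(ZSTD_FILE * 1.43)  # lz4's file was 43% bigger

LZ4_SPEED = 4.0e9                 # ~4 GB/s, relative to the original size
ZSTD_SPEED = LZ4_SPEED / 2.8      # stand-in for "lz4 is 2.8x faster"

for label, bus in (("fast bus (20 GB/s)", 20e9), ("slow bus (1 GB/s)", 1e9)):
    lz4_t = decompress_time(ORIGINAL, LZ4_FILE, LZ4_SPEED, bus)
    zstd_t = decompress_time(ORIGINAL, ZSTD_FILE, ZSTD_SPEED, bus)
    print(f"{label}: lz4 {lz4_t:.2f}s, zstd {zstd_t:.2f}s")
```

In this model lz4 wins comfortably when the bus is fast, but once the bus becomes the bottleneck the codec with the smaller compressed input finishes first.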
> Obviously it's not impossible, since it's happening on my laptop.
Well, the zstd CLI is more optimized for multithreading and asynchronous I/O: it overlaps CPU decompression with reading and writing. Additionally, zstd-compressed data is denser. It's possible that on your notebook lz4 decompression including I/O is slower than zstd's. But this is not how LZ codecs are compared in general. Otherwise nobody would use lz4 anymore.
> I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks
You can see, even on your machine lz4 is faster than zstd.
Now it's obvious that asynchronous I/O + multithreading is making the difference in the CLI case.
> However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?
In data compression the metrics are always relative to the original uncompressed size.
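As a small illustration of that convention, the CLI wall-clock times quoted earlier in the thread can be converted to the same unit. This is only a sketch: the function name is made up for this example, and it uses the fact that enwik9 is 10^9 bytes.

```python
def throughput_mb_s(original_bytes, elapsed_s):
    """Speed in MB/s, measured against the uncompressed size."""
    return original_bytes / elapsed_s / 1e6


ENWIK9_BYTES = 1_000_000_000

# Wall-clock decompression times reported for the CLI tools in this thread.
for name, seconds in (("lz4 -d", 1.95), ("zstd -d --single-thread", 1.50)):
    print(f"{name}: {throughput_mb_s(ENWIK9_BYTES, seconds):.0f} MB/s")
```

Both results are far below the GB/s range of an in-memory benchmark, which suggests how much of the CLI time is spent outside the decompression loop itself.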
> But in any case, what matters to me as a user is how quickly I can compress and decompress a file.
Well, this depends on the use case. I'm using 7zip (GUI) and it's working fine for me.
It's very hard to make benchmarks when multithreading or I/O is involved. You'll often get different results depending on the hardware configuration and the number of threads.