No, No and No. It's algorithmically impossible that zstd is decompressing faster...

someplaceguy · on July 23, 2023

> No, No and No. It's algorithmically impossible that zstd is decompressing faster than lz4 by using the same environment.

Obviously it's not impossible, since it's happening on my laptop.

> You are comparing the zstd with asynchronous I/O + multithreading against a single thread simple lz4 cli.

As I mentioned, I used `--single-thread` both when compressing and decompressing, which according to the zstd man page, does the following:

> --single-thread: Use a single thread for both I/O and compression. As compression is serialized with I/O, this can be slightly slower. (...)

> (...) this mode is different from -T1, which spawns 1 compression thread in parallel with I/O.

As far as I understand, this shouldn't be doing asynchronous I/O nor multithreading.

Additionally, even if `zstd -d --single-thread` does in fact use more than one thread due to some bug (which appears to be true to some extent, given the >100% CPU consumption), it never used more than 113% CPU, so it could have at most a 13% advantage over lz4 in CPU time during decompression, which cannot fully explain why zstd is faster.

Furthermore, if I use `zstd -d -T4` then zstd shows this message: "Warning : decompression does not support multi-threading".

Perhaps zstd's `--single-thread` option is simply not working correctly in terms of serializing I/O with the compression/decompression activity. And even though no actual disk I/O is happening (due to the files being cached in RAM), perhaps the latency of doing I/O synchronously is significant enough to affect the results (due to the context switching + all kernel CPU mitigations).

> Additionally you're using some exotic and not representative hardware.

No, I'm using a standard Lenovo T495 laptop.

> In theory, the only single thread decompression case where zstd can beat lz4 is when you have slow (reading) i/o.

I don't think any slow I/O happened because as I mentioned, the files were cached in memory. This laptop has 24 GB of RAM and the machine was being lightly used, the files were relatively small (300-500 MB compressed each), they had been read previously several times (since I repeated the benchmark several times in a row) and no actual significant I/O happened during my benchmarks (I always verify this because I have a graphical I/O monitor).

But perhaps the memory bandwidth limit and/or the userspace <-> kernel I/O latency are significant enough to affect the results.

> lz4 & zstd are mostly used as libraries and real benchmarks are not considering I/O or multithreading.

Well, I mostly use them as command-line tools, and real programs have to do I/O (otherwise how do they input and output the data to compress/decompress?) but I understand what you mean.

> This is what I'm getting with TurboBench & enwik9. lz4 decompressing 2,8x faster than zstd.

I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks:

  $ lz4 -b1
   1#Synthetic 50%     :  10000000 ->   5859357 (1.707), 683.8 MB/s ,4652.3 MB/s

  $ zstd -b1
   1#Synthetic 50%     :  10000000 ->   3152996 (x3.172),  882.1 MB/s, 1338.1 MB/s

According to this benchmark, for decompression lz4 is even faster and zstd is even slower on my machine than what TurboBench shows on your machine.

However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?

If it's the input, note that lz4 produced a compressed file that was 43% bigger than zstd's, so it has to ingest more data to produce the same output, which could also explain a large portion of the discrepancy.
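To make the two readings concrete, here is a minimal sketch using only the figures from the `lz4 -b1` and `zstd -b1` output above. The function name is made up for illustration, and which reading the tools actually use is not something this snippet can settle: it only shows how much the answer changes the comparison.

```python
def output_rate_if_input_metric(reported_mb_s, original_bytes, compressed_bytes):
    """If the reported speed counted compressed bytes consumed,
    return the implied rate of decompressed bytes produced."""
    return reported_mb_s * original_bytes / compressed_bytes

# (reported decompression MB/s, original size, compressed size),
# copied from the benchmark output above.
runs = {
    "lz4": (4652.3, 10_000_000, 5_859_357),
    "zstd": (1338.1, 10_000_000, 3_152_996),
}

# Reading A: the reported speed already is the output rate.
advantage_a = runs["lz4"][0] / runs["zstd"][0]

# Reading B: the reported speed is the input rate, so scale by the ratio.
implied = {name: output_rate_if_input_metric(*r) for name, r in runs.items()}
advantage_b = implied["lz4"] / implied["zstd"]

print(f"lz4 advantage if the metric is output: {advantage_a:.2f}x")
print(f"lz4 advantage if the metric is input:  {advantage_b:.2f}x")
```

Under either reading lz4 stays ahead in this synthetic benchmark; the input reading only shrinks the gap, because lz4 has more compressed bytes to get through for the same output.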

But in any case, what matters to me as a user is how quickly I can compress and decompress a file (and how small the resulting compressed file is), not what the theoretical speed of an algorithm is in some synthetic benchmark that does not perform I/O nor accounts for whether I can compress/decompress in multithreaded mode.

And on this particular machine, at this particular zstd compression level, zstd beats lz4's default level on all metrics when using their respective CLI tools, even when I handicap zstd's CLI tool by not taking advantage of multiple CPUs (and multithreaded I/O, possibly).

And if I explicitly enable zstd's multithreading support, then zstd compresses several times faster than lz4 while producing smaller files, and it also decompresses faster (for whatever reason).

I understand that my environment may not be representative and that zstd may not be able to beat lz4 on decompression in other hardware/software configurations, but it may still beat lz4 on the other metrics and still be competitive (or at least, fast enough) on decompression, especially if I/O or memory bandwidth becomes the bottleneck rather than raw compression/decompression speed.

If you compress/decompress files frequently, you should really benchmark the command-line tools, using the appropriate options (enable multithreading when possible and test various compression levels) instead of just relying on a synthetic benchmark that, for multiple reasons, does not reflect real-world performance.
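One way to do that is a small timing harness around the CLI tools. The sketch below is only an illustration: the helper names `time_command` and `compare` are invented here, and it measures wall-clock time of whatever command it is given, keeping the best of several runs so that a warm page cache is what gets measured.

```python
import shutil
import subprocess
import sys
import time

def time_command(cmd, runs=5):
    """Return the best wall-clock time (seconds) over `runs` executions of cmd."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the decompressed output so writing it is not part of the timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best

def compare(candidates, runs=5):
    """Time each labelled command; skip tools that are not installed."""
    results = {}
    for label, cmd in candidates.items():
        if shutil.which(cmd[0]) is None:
            print(f"skipping {label}: {cmd[0]} not found", file=sys.stderr)
            continue
        results[label] = time_command(cmd, runs)
    return results
```

A call could look like `compare({"lz4": ["lz4", "-d", "-c", "sample.lz4"], "zstd": ["zstd", "-d", "--single-thread", "-c", "sample.zst"]})`, where `sample.lz4` and `sample.zst` are placeholder file names for the same data compressed with each tool. Sending the output to a null device leaves out the cost of writing the result, so add a real output path if that matters for your use case.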

And if you're using lz4/zstd in some other fashion (for example, in ZFS), you should really benchmark the whole system, because you may be surprised by the result. As an example, perhaps lz4's decompression speed advantage over zstd doesn't matter at all if the rest of the system can't process data faster than zstd can decompress it.

So when benchmarking the whole system, zstd may still win in every metric compared to lz4, including decompression -- which could happen if zstd has to process a smaller compressed input compared to lz4, thus using less memory bandwidth and I/O bandwidth for the compressed input and therefore having more memory bandwidth and I/O bandwidth available to produce the decompressed output (on a memory bandwidth or I/O bandwidth-constrained system).
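A toy model of that argument, as a sketch only: assume the decompressed output rate is capped either by the decoder itself or by how fast compressed bytes can be delivered, whichever is lower. The decoder speeds and ratios are the synthetic `-b1` figures quoted earlier (treated here as output rates, which is an assumption); the function names and the bandwidth values tried are made up, and the model ignores output bandwidth, caching, and overlapped I/O.

```python
def output_rate(decoder_mb_s, ratio, input_bw_mb_s):
    """Decompressed MB/s when compressed input arrives at input_bw_mb_s."""
    return min(decoder_mb_s, input_bw_mb_s * ratio)

# Synthetic benchmark figures from earlier in the thread.
LZ4 = {"decoder_mb_s": 4652.3, "ratio": 10_000_000 / 5_859_357}
ZSTD = {"decoder_mb_s": 1338.1, "ratio": 10_000_000 / 3_152_996}

def faster_codec(input_bw_mb_s):
    """Name the codec with the higher modelled output rate."""
    lz4 = output_rate(input_bw_mb_s=input_bw_mb_s, **LZ4)
    zstd = output_rate(input_bw_mb_s=input_bw_mb_s, **ZSTD)
    return "zstd" if zstd > lz4 else "lz4"

# Hypothetical input bandwidths, in MB/s of compressed data.
for bw in (100, 500, 1000, 4000):
    print(f"{bw:>5} MB/s input -> {faster_codec(bw)} is faster")
```

With these particular numbers the model favours zstd whenever compressed input arrives slower than roughly 1338.1 / 1.707 MB/s, and lz4 above that. Real systems will not follow the model exactly, which is the reason to measure the whole system rather than trust either the model or the synthetic figures.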

powturbo · on July 23, 2023

> Obviously it's not impossible, since it's happening on my laptop.

Well, the zstd CLI is more optimized for multithreading and asynchronous I/O. zstd overlaps CPU decompression with reading/writing. Additionally, zstd-compressed data is denser. It's possible that on your notebook lz4 decompression including I/O is slower than zstd. This is not the way LZ compressors are compared in general. Otherwise nobody would use lz4 anymore.

> I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks

As you can see, even on your machine lz4 is faster than zstd. Now it's obvious that asynchronous I/O + multithreading is making the difference in the CLI case.

> However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?

In data compression the metrics are always relative to the original uncompressed size.

> But in any case, what matters to me as a user is how quickly I can compress and decompress a file.

Well, this depends on the use case. I'm using 7zip (GUI) and it's working fine for me.

It's very hard to make benchmarks when multithreading or I/O is involved. You'll often get different results depending on the hardware configuration and the number of threads.

aleksv · on Aug 6, 2023

Please retest lzav 2.9 on enwik9; it should be better.