I assume you wanted to link to TurboBench and not that particular issue, which for some reason also contains a link to some car listing?
Secondly, did you not see my answer to ebiggers under the comment you replied to? Yes, I can confirm that libdeflate is faster for Silesia, but there are at least two cases for which igzip is faster and one for which igzip is twice as fast. But yes, it heavily depends on the input data.
Edit: I was then wondering why I could not find any igzip benchmarks in the repository's README and then found https://github.com/powturbo/TurboBench/issues/43 , so I guess this is the one you wanted to link to and the 3 got cut off.
Well, correct benchmarking is not done with special data, but with datasets that represent a large set of distributions, for example enwik8/9 for text or Silesia for a mixed dataset.
As a corner-case example, RLE-compressible data is not representative when benchmarking compression libraries.
If you provide a link to a 10-100 MB dataset, I can verify your claims, because I'm not aware of any dataset where igzip is twice as fast as libdeflate. In TurboBench there is no I/O or other overhead involved; additionally, it's single-threaded.
It's also possible that you're comparing two different CLI programs, one (igzip) optimized for I/O and the other a simple CLI.
EDIT: I've seen that the file you're referencing is 4Gi-Base64. This file is not very compressible (75% with gzip). It's possible that igzip is simply storing the file, or parts of it, without compression. This would explain why it can be faster than libdeflate, because in that case igzip is just using memcpy at decompression.
That is not the case. I have compressed the file with gzip and then tested decompression with multiple gzip decompressors, i.e., they are all fed the same compressed binary. Furthermore, I have used `rapidgzip --analyze` to print out all deflate blocks, their compression types, and other information and statistics. All of the blocks are Dynamic Huffman compressed blocks.
I have already incorporated the ISA-L Huffman decoder into rapidgzip, but that did not give full ISA-L speed. Ergo, I think the remaining performance comes from the inflate loop (decode Huffman symbol, decode distance code if necessary, resolve references, repeat). This part is written in assembly and seems to do some kind of speculative prefetching, i.e., it already fetches the next Huffman code symbol assuming that the current symbol is a literal, or something like that. It's quite interesting, but I have already tried a bit to reproduce this kind of prefetching inside the C++ code and failed to do so. The compiler is probably rearranging everything anyway.
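To illustrate what I mean, here is a minimal C++ sketch of that kind of speculative lookup. This is my reading of the idea, not ISA-L's actual code; `Entry`, the single-level lookup table, and the omitted bit-buffer refill are all simplifications I made up for the example:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical single-level Huffman lookup entry: the low bits of the bit
// buffer index directly into a table of (decoded symbol, code length) pairs.
struct Entry { uint16_t symbol; uint8_t length; };

// Decode a run of literals. Bit-buffer refill and further bounds checks are
// intentionally omitted to keep the sketch short.
size_t decodeLiteralRun( uint64_t bitBuffer, const Entry* table, int tableBits,
                         uint8_t* output, size_t maxOutput )
{
    const uint64_t mask = ( 1ULL << tableBits ) - 1;
    size_t produced = 0;
    Entry current = table[bitBuffer & mask];
    while ( produced < maxOutput ) {
        // The speculative part: look up the *next* symbol from the bits behind
        // the current code before we even know whether the current symbol
        // really is a literal. The table load overlaps with the store below.
        const Entry next = table[( bitBuffer >> current.length ) & mask];
        if ( current.symbol >= 256 ) {
            break;  // End-of-block or a length code: leave the fast literal loop.
        }
        output[produced++] = static_cast<uint8_t>( current.symbol );
        bitBuffer >>= current.length;
        current = next;
    }
    return produced;
}
```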
```
rapidgzip --analyze 4GiB-base64.gz | tail -100
== Benchmark Profile (Cumulative Times) ==
readDynamicHuffmanCoding : 0.903727 s (3.38252 %)
readData : 25.8139 s (96.6175 %)
Dynamic Huffman Initialization in Detail:
Read precode : 0.00975829 s (1.07978 %)
Create precode HC : 0.0341786 s (3.78196 %)
Apply precode HC : 0.0667695 s (7.38823 %)
Create distance HC : 0.114097 s (12.6252 %)
Create literal HC : 0.678924 s (75.1249 %)
== Alphabet Statistics ==
Precode : 123274 duplicates out of 126748 (97.2591 %)
Distance : 597 duplicates out of 126748 (0.471013 %)
Literals : 13799 duplicates out of 126748 (10.887 %)
== Precode Code Length Count Distribution ==
16 |==================== (126748)
== Distance Code Length Count Distribution ==
30 |==================== (126748)
== Literal Code Length Count Distribution ==
         259 |============= (50635)
2.601250e+02 |==================== (74281)
             | (1809)
         262 | (23)
== Encoded Block Size Distribution ==
185942 bits | (1)
            |
            |
            |
            |
            |
            | (2)
207391 bits |==================== (126745)
== Decoded Block Size Distribution ==
30579 Bytes | (1)
            |
            |
            |
            |
            |
            | (1)
34112 Bytes |==================== (126746)
== Compression Ratio Distribution ==
1.314967e+00 Bytes | (9)
                   | (1327)
                   |====== (17906)
1.315808e+00 Bytes |==================== (52698)
                   |================ (42890)
                   |==== (10876)
                   | (1001)
1.316891e+00 Bytes | (41)
== Deflate Block Compression Types ==
Dynamic Huffman : 126748
```
Thanks for your analysis. It seems it's a corner case; benchmarking a base64-encoded random file is not a typical case.
igzip no longer has an advantage over libdeflate. It only has fast compression at levels 0, 1, and 2, but with a mediocre compression ratio.
Yes, especially for SIMD NEON, where gcc produces horrible NEON code in all versions < gcc-12, even when using SIMD intrinsics. From version 12 on, gcc is at the same level as clang.
No, lz4 consistently decompresses slower than zstd under these particular conditions: `lz4 -d` consistently finishes in ~1.95s wall-clock time while `zstd -d --single-thread` consistently finishes in ~1.50s wall-clock time.
For reference, I'm using the latest versions of both programs in nixpkgs/NixOS (lz4 1.9.4 and zstd 1.5.5), both compiled with `-march=znver1`. I used the `-1 --single-thread` options in zstd, all files were cached in memory, and the dataset was `enwik9`, as I mentioned.
Except for choosing a compression level that is competitive with lz4, I didn't cherry-pick the above scenario; it's just what I had available at hand. The reason why I chose `enwik9` as a dataset is that I wanted something that could be referenced, rather than just some random files on my computer. I chose `enwik9` specifically because that's what is mentioned in the README of this LZAV project.
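For reproducibility, the commands were roughly as follows. This is a sketch: the compressed file names are assumed, and each `time` invocation was repeated several times so the inputs stay in the page cache:

```
# Compress once at the levels being compared:
zstd -1 --single-thread enwik9 -o enwik9.zst
lz4 -1 enwik9 enwik9.lz4

# Time decompression to /dev/null (-f is needed to "overwrite" /dev/null):
time zstd -d --single-thread -f enwik9.zst -o /dev/null
time lz4 -d -f enwik9.lz4 /dev/null
```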
There are a few things to note:
1. For some reason, zstd uses around 145-150% CPU during compression and consistently uses 113% CPU during decompression, so the `--single-thread` option doesn't seem to be doing what it advertises [1]. This may be giving zstd an unfair advantage.
2. However, even taking into account the unfair 13% extra CPU usage, zstd still comes out ahead of lz4 in terms of decompression speed.
3. Notably, I've used one of the "normal" zstd compression levels (level 1), at which zstd seems to beat lz4 on every metric. If I use one of the "fast" zstd compression levels, zstd is even faster. For example, with level `--fast=1`, `zstd -d --single-thread` finishes in 1.24s on my machine compared to lz4's 1.95s.
4. Not to mention, if I use multiple threads for I/O and compression, then zstd can compress faster than lz4 by an order of magnitude. Of course, in one sense this is quite an unfair advantage, but on the other hand the lz4 CLI tool has no option to compress with multiple threads, so it is quite relevant in terms of usability.
5. Interestingly, when I run the exact same benchmark under the exact same conditions on my AMD Zen 3 server rather than my Zen 1 laptop, lz4 consistently decompresses almost 2x faster than zstd at this compression level. I'm not sure why there is such a large discrepancy.
My only guesses for the large discrepancy are that my Zen 1 laptop has the `RETBleed: Mitigation: untrained return thunk` CPU mitigation enabled in the Linux kernel, which can cause a very large performance degradation [2], as well as slightly different `Spectre v2` mitigations and 2x-16x smaller CPU caches, and perhaps this somehow affects lz4 more than zstd for some reason (apparently the RETBleed mitigation is not necessary for Zen 3 CPUs).
[1] Adding `-T1` before `--single-thread` makes no difference in either compression or decompression. Adding it after `--single-thread` makes no difference in decompression speed/CPU usage, but it does make compression even faster and use more CPU. My guess is that `-T1` is overriding `--single-thread`, using only one compression/decompression thread but more threads for I/O.
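Concretely, the two orderings I compared were along these lines (a sketch; compressing to /dev/null assumed just for timing):

```
# -T1 before --single-thread: behaves the same as --single-thread alone.
zstd -1 -T1 --single-thread -f enwik9 -o /dev/null
# -T1 after --single-thread: compression gets faster and uses more CPU.
zstd -1 --single-thread -T1 -f enwik9 -o /dev/null
```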
> No, No and No. It's algorithmically impossible that zstd is decompressing faster than lz4 by using the same environment.
Obviously it's not impossible, since it's happening on my laptop.
> You are comparing the zstd with asynchrounous I/O + multithreading against a single thread simple lz4 cli.
As I mentioned, I used `--single-thread` both when compressing and decompressing, which, according to the zstd man page, does the following:
> --single-thread: Use a single thread for both I/O and compression. As compression is serialized with I/O, this can be slightly slower. (...)
> (...) this mode is different from -T1, which spawns 1 compression thread in parallel with I/O.
As far as I understand, this shouldn't be doing asynchronous I/O nor multithreading.
Additionally, even if `zstd -d --single-thread` does in fact use multithreading due to some bug (which appears to be true to some extent, given the >100% CPU consumption), it did not use more than 113% CPU, so it could only have a 13% CPU-usage advantage over lz4 during decompression, which cannot fully explain why zstd is faster.
Furthermore, if I use `zstd -d -T4` then zstd shows this message: "Warning : decompression does not support multi-threading".
Perhaps zstd's `--single-thread` option is simply not working correctly in terms of serializing I/O with the compression/decompression activity. And even though no actual disk I/O is happening (since the files are cached in RAM), perhaps the latency of doing I/O synchronously is significant enough to affect the results (due to context switching plus all the kernel CPU mitigations).
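One way to probe this hypothesis, assuming a Linux box with perf available (file name assumed), would be to compare the two modes directly:

```
perf stat -e context-switches,task-clock zstd -d --single-thread -f enwik9.zst -o /dev/null
perf stat -e context-switches,task-clock zstd -d -T1 -f enwik9.zst -o /dev/null
```

Comparing wall time and context-switch counts between the two modes might show whether the synchronous I/O path is where the extra time goes.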
> Additionally you're using some exotic and not representative hardware.
No, I'm using a standard Lenovo T495 laptop.
> In theory, the only single thread decompression case where zstd can beats lz4 is when you have slow (reading) i/o.
I don't think any slow I/O happened because, as I mentioned, the files were cached in memory. This laptop has 24 GB of RAM, the machine was lightly used, the files were relatively small (300-500 MB compressed each), they had been read several times before (since I repeated the benchmark several times in a row), and no significant actual I/O happened during my benchmarks (I always verify this with a graphical I/O monitor).
But perhaps the memory bandwidth limit and/or the userspace <-> kernel I/O latency are significant enough to affect the results.
> lz4 & zstd are mostly used as libraries and real benchmarks are not considering I/O or multithreading.
Well, I mostly use them as command-line tools, and real programs have to do I/O (otherwise, how would they input and output the data to compress/decompress?), but I understand what you mean.
> This is what I'm getting with TurboBench & enwik9. lz4 decompressing 2,8x faster than zstd.
I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks:
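For reference, the built-in benchmarks are invoked roughly like this (a sketch; the `enwik9` file name is carried over from the setup above):

```
lz4 -b1 enwik9    # lz4's in-memory benchmark at compression level 1
zstd -b1 enwik9   # zstd's in-memory benchmark at compression level 1
```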
According to this benchmark, for decompression lz4 is even faster and zstd is even slower on my machine than what TurboBench shows on your machine.
However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?
If it's the input, then note that since lz4 produced a compressed file that was 43% bigger than zstd's, it has to ingest more data to produce the same output, so this could also explain a large portion of the discrepancy.
But in any case, what matters to me as a user is how quickly I can compress and decompress a file (and how small the resulting compressed file is), not the theoretical speed of an algorithm in some synthetic benchmark that performs no I/O and doesn't account for whether I can compress/decompress in multithreaded mode.
And on this particular machine, at this particular zstd compression level, zstd beats lz4's default level on all metrics when using their respective CLI tools, even when I handicap zstd's CLI tool by not taking advantage of multiple CPUs (and possibly multithreaded I/O).
And if I explicitly enable zstd's multithreading support, then zstd beats lz4 by a multiple-factor difference when compressing, while producing smaller files and also decompressing faster (for whatever reason).
I understand that my environment may not be representative and that zstd may not be able to beat lz4 on decompression in other hardware/software configurations, but it may still beat lz4 on the other metrics and still be competitive (or at least, fast enough) on decompression, especially if I/O or memory bandwidth becomes the bottleneck rather than raw compression/decompression speed.
If you compress/decompress files frequently, you should really benchmark the command-line tools using the appropriate options (enable multithreading when possible and test various compression levels) instead of just relying on a synthetic benchmark that, for multiple reasons, does not reflect real-world performance.
And if you're using lz4/zstd in some other fashion (for example, in ZFS), you should really benchmark the whole system, because you may be surprised by the result. As an example, perhaps lz4's decompression speed advantage over zstd doesn't matter at all if the rest of the system can't process data faster than zstd can decompress it.
So when benchmarking the whole system, zstd may still win on every metric compared to lz4, including decompression. This could happen if zstd has to process a smaller compressed input than lz4, thus using less memory and I/O bandwidth for the compressed input and therefore having more of both available for producing the decompressed output (on a memory-bandwidth- or I/O-bandwidth-constrained system).
> Obviously it's not impossible, since it's happening on my laptop.
Well, the zstd CLI is more optimized, with multithreading and asynchronous I/O: zstd overlaps CPU decompression with reading/writing. Additionally, zstd-compressed data is more dense. It's possible that on your notebook lz4 decompression including I/O is slower than zstd. But this is not the way LZ compressors are compared in general; otherwise nobody would use lz4 anymore.
> I haven't tried TurboBench, but this is what I get when running the built-in lz4 and zstd benchmarks
You can see that even on your machine, lz4 is faster than zstd.
Now it's obvious that asynchronous I/O + multithreading is making the difference in the CLI case.
> However, I don't understand exactly what this metric means. Does it mean that lz4 can ingest the input at ~4 GB/s or that it produces the output at ~4 GB/s?
In data compression the metrics are always relative to the original uncompressed size.
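For example: enwik9 is 10^9 bytes uncompressed, so a decompressor that reconstructs it in 0.25 s is reported at 4000 MB/s, regardless of whether the compressed input was 350 MB or 500 MB. A codec with the worse compression ratio therefore has to ingest correspondingly more compressed bytes per second to reach the same reported speed.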
> But in any case, what matters to me as a user is how quickly I can compress and decompress a file.
Well, this depends on the use case. I'm using 7zip (GUI) and it's working fine for me.
It's very hard to do benchmarking when multithreading or I/O is involved. You'll often get different results with different hardware configurations and numbers of threads.