Pigz: Parallel gzip for modern multi-processor, multi-core machines (zlib.net)
311 points by ingve on May 12, 2023 | 190 comments



I heard of pigz in the discussions following my interview of Yann Collet, creator of LZ4 and zstd.

If you'll excuse the plug, here is the LZ4 story:

Yann was bored and working as a project manager. So he started working on a game for his old HP 48 graphing calculator.

Eventually, this hobby led him to revolutionize the field of data compression, releasing LZ4, ZStandard, and Finite State Entropy coders.

His code ended up everywhere: in games, databases, file systems, and the Linux Kernel because Yann built the world's fastest compression algorithms. And he got started just making a fun game for a graphing calculator he'd had since high school.

https://corecursive.com/data-compression-yann-collet/


Side note: In the 1990s, everyone in my engineering school had an HP48 calculator. There was a healthy selection of pretty decent games available.

One fine day, I finished my physics exam an hour early and so opened up an enjoyable game on my calculator. 45 minutes went by, and so I went up and handed in my paper. It was at this point that the professor noted, “Were you planning on leaving the second page blank?”

Oh.


I had an HP48, and I had lots of fun with Bjorn Gahm's IR remote control, which could mimic the remote control signals of most common television brands, including the ones at my high school. The HP48 also had the ability to set an alarm to execute an arbitrary piece of code, e.g. turning on the TV to some channel and turning the volume up to max at some predefined time in the middle of class.

Teachers were completely stumped. Their initial suspicions were always that someone brought a universal remote control to class, but they would painstakingly search everyone's desks to find nothing. And then after asking everyone to put their hands up in the air, the TV would still have a mind of its own.

Yeah, your TI-89 was no fun.


Still have mine. For a while on my phone I used an emulator. Eventually I found PCalc and it was customizable enough to recreate the parts of the hp48g that I cared about on a day-to-day basis.


One wild thing is how many performance wins were available compared to ZLib. Pigz is parallel, but what if you just had a better way to compress and decompress than DEFLATE?

When zstd came out – and Brotli before it to a certain extent – they were 3x faster than ZLib with a slightly higher compression ratio. You'd think that such performance jumps in something as well explored as data compression would be hard to come by. It turns out we weren't that close to the efficiency frontier.


A part of this was that zstd and Brotli were able to use compression history windows of MBs not KBs, while DEFLATE maxes out at 32KB. RAM was thousands of times more expensive in the early 90s, so a smaller history window made sense.

There are also optimizations that only work on today's larger cores, and you have to actively update old code to get the advantages (happily some work is going into that): https://news.ycombinator.com/item?id=32533061 / https://news.ycombinator.com/item?id=32537545

That's not to minimize the clever ideas and amazing implementation work in new stuff. It's more that people were making smart decisions both then and now, more so than you might guess just from comparisons on today's hardware.


People take performance for granted. Even within gzip (and similarly .png), you can set compression level to 4 (default is 6) and get ~15-20% faster performance at the cost of ~5% larger file sizes.

No one ever tweaks that one setting even though they should; file sizes are a significantly smaller bottleneck than they were in the days of MB hard drives and dial-up modems.
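
If you want to sanity-check that trade-off on your own data, here is a minimal sketch using Python's zlib, which exposes the same DEFLATE levels as gzip ("sample.bin" is just a placeholder file name):

  import time
  import zlib
  data = open("sample.bin", "rb").read()  # placeholder: any reasonably large file
  for level in (4, 6, 9):
      t0 = time.perf_counter()
      out = zlib.compress(data, level)
      print(f"level {level}: {len(out)} bytes in {time.perf_counter() - t0:.3f}s")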

If your justification for not serving up larger .png is that not everyone has fast internet, then you should be either detecting and handling that case separately, downscaling the images, and/or serving .jpeg instead.

One time I was using Topaz AI to upscale video, and I spliced that into their ffmpeg filter and took a whole day off of a week long encode. Low hanging fruit.


In video games, long loading times for levels are a serious pain point, so video game developers put a lot of effort into tuning up compression algorithms to get the best wall clock time, considering both the time to fetch content from storage and the time to decompress.

If the target is a console you may know exactly what hardware is there so you can justify the effort in tuning. (it’s more complex today because you have a choice of what kind of storage to use with your XBOX). With a PC or phone your results may vary a lot more.


Don't most games/game engines use TGA format for their textures? Those are all RLE-encoded if I'm not mistaken (which is very fast but very inefficient space-wise). Or perhaps that is just at game creation and those will get baked to some other image format for distribution?


People use all kinds of compression schemes for textures

https://aras-p.info/blog/2020/12/08/Texture-Compression-in-2...


The most important thing for modern texture compression is that the GPU supports it without ever having to decompress - saves VRAM and memory bandwidth. So it's usually specialized formats like ASTC.


Economies of scale come into effect as well. Gzip decompression speed is also slightly better at higher levels. A one-time higher cost of compression can pay off pretty quickly when you are decompressing it a lot of times, or serving it to enough people.


I'm not so sure about this. Generally speaking there will be more work done on the CPU to decompress at higher levels (e.g. 6 through 9). It is possible (although unlikely) that you will get higher decompression speed, but only if the bottleneck wasn't CPU to begin with (e.g. network or disc).

My gut feeling is that if you are pulling down data faster than 40 Megabits and have a CPU made within the past 7 years (possibly including mobile), you won't be bottlenecked by I/O generally speaking.


Most compression algorithms don't take more work to decompress at higher levels, and actually perform better due to having less data to work through. Gzip consistently benchmarks better at decompression for higher levels.

It's not just about bottlenecks, but aggregate energy expenditure from millions of decompressions. On the whole, it can make a real measurable difference. My point was only really that it's not so cut and dry that it's a good trade off to take a 5% file size loss for 20% improved compression performance. You'd have to benchmark and actually estimate the total number of decompressions to see the tipping point.
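
If you want to run that benchmark on your own data, here is a rough sketch of the decompression side, with Python's zlib standing in for gzip ("sample.bin" is a placeholder file name):

  import time
  import zlib
  data = open("sample.bin", "rb").read()   # placeholder input
  for level in (1, 6, 9):
      blob = zlib.compress(data, level)
      t0 = time.perf_counter()
      for _ in range(20):                  # repeat for a more stable timing
          zlib.decompress(blob)
      ms = (time.perf_counter() - t0) / 20 * 1000
      print(f"level {level}: {len(blob)} bytes, decompress ~{ms:.1f} ms")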


That is not the case with zstd.

According to this benchmark [1] zstd does not drop its decompression speed as the compression ratio increases. It stays about the same level.

[1] https://www.truenas.com/community/threads/zstd-speed-ratio-b...


> No one ever tweaks that one setting even though they should

That entirely depends on the use-case. Most people running FFmpeg do it as a once-off thing - and if those people are like me, when I rip a movie I want the highest quality and lowest size I can get, and I'm happy that the default sacrifices speed for quality and size. The processing can be slow because I'm doing it only once. If you're in the business of encoding video and do it all day every day, your calculus will be different and you won't be using the defaults regardless.


I agree, but to be pedantic, the cost of storage may work out to be lower than the cost of energy to encode even in that use case.


That depends on the conditions under which the energy is used. In a place where you need extra energy for cooling, yes. In my home, no. I live in a cold place, so I need heating the bigger part of the year. And I heat using electricity (which might be stupid, but that's how the house was built 30 years ago). So whatever energy my computer wastes, I save in my heating bill. Computer energy is free, except during some warm summer weeks.


> as well explored as data compression

What's well explored is compression rate, where indeed it's difficult to improve, and true innovations, like arithmetic coding, are rare.

Compression speed, on the other hand, is not very interesting to academics; it's more of an engineering problem. And there is plenty of work to do here, starting with stuff as simple as multi-threading and SIMD.

ZLib and ZStandard are probably in the same complexity class, but with different constant factors, which academics don't care about but which have massive practical consequences.


> What's well explored is compression rate not performance

Exactly! And this seems like a shame to me with something burning so many cpu cycles.

> true innovations, like arithmetic coding, are rare.

Yeah, Yann tried to explain arithmetic coding to me, but I didn't get it.


I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).

Consider this example: you want to transmit two completely random variables, both of which can have 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.

But alternatively, you can count the total number of states possible for both variables together, 5*5 = 25 and encode it as a single integer of size ceil(log2(25)) = 5, so both variables can be stored with just 5 bits.

So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).
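
A toy sketch of that counting argument in Python (just the joint-packing step, no probability modelling):

  import math
  STATES = 5                                # each symbol has 5 equally likely values
  def pack(a, b):
      return a * STATES + b                 # joint value in 0..24
  def unpack(code):
      return divmod(code, STATES)           # recover (a, b)
  print(2 * math.ceil(math.log2(STATES)))   # naive encoding: 3 + 3 = 6 bits
  print(math.ceil(math.log2(STATES ** 2)))  # joint encoding: 5 bits
  assert unpack(pack(3, 4)) == (3, 4)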

The other part is just assigning shorter sequences of bits to more common symbols, except, of course, that unlike in Huffman coding, our symbols can have a fractional number of bits. This allows them to match the actual symbol probabilities more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.

The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.


> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.

PPM-based [0] compression can work quite well for text but on its own it's not enough to unseat the Lempel-Ziv family as the top general purpose compressor.

[0] https://en.wikipedia.org/wiki/Prediction_by_partial_matching


Yeah arithmetic coding on its own is not enough, typically you would have it as the last step, with dictionary and other compressors before it.


That must be why you left out all the technical details ;) At least you piqued my interest, I'll just ask chatgpt to explain the general concepts.


zlib/gzip did not choose DEFLATE because it was the best algorithm. Rather it was, at the time, the only decent algorithm that could be implemented in a manner not covered by patents. (See the tragedy of LZW for why that was important.)

We're now more than two decades later, so all the important data compression patents should have expired.


Probably a lot of it is due to hardware changes since zlib was first written. Mainly the importance of cache and branch prediction, which were either less of a big deal or non-existent back then. IOW, zlib probably leaves a lot more on the table now than when it was written.


Even though the opposite is recited more frequently, in some cases good can be the enemy of perfect


zlib is above all portable, and runs in small memory footprints. Speed vs. space and platform-specific functionality all have costs associated with them.


Middle out compression changed everything.


Just listened to that episode, what a great story. The dry way he tells how he unexpectedly and almost accidentally transitioned from a project manager to a software engineer is really a treat. Thanks for your podcast!


Thanks for listening!


Fastest open source compression algorithms. RAD game tools have proprietary ones that are faster and have better compression ratios, but since you have to pay for a license, they will never be widespread.


Interesting. Are there any benchmarks you can share? On their website they only compare decompression speed, and only with zlib and LZMA. It would be interesting to compare to LZ4 HC mode, which Unity uses.


Build or download TurboBench [1] executables for Linux and Windows from the releases [2] and make your own tests comparing Oodle, zstd, and other compressors.

[1] https://github.com/powturbo/TurboBench

[2] https://github.com/powturbo/TurboBench/releases


I've been following RAD for a long time and I love Charles Bloom's blog. They are proprietary but he also makes a lot of code public. For instance, he showed how RAD switched from arithmetic coders to Asymmetric Numeral Systems (ANS) and added code.


This episode was fascinating. I had heard of LZ4 but not Zstd. It spurred me to make changes to our system at work that are reducing file sizes by as much as 25%. It’s great to have a podcast in which I learn practical stuff!


Probably the most underrated feature of zstd (likely because it's so unusual) is the ability to create a separate compression dictionary. This allows you to develop customized, highly efficient dictionaries that are specific to a type of data AND to compress elements of that data without including an entire separate dictionary in every compression output.

So for example take logfiles. You can train up a dictionary on some sample log data. Then you can compress individual log rows, and all it actually stores is a diff of the compression dictionary (if any new entries were added) and the compressed data. So you get very efficient compression of small amounts of data which are part of a collection that may be very self-similar, but with the option of decompressing any individual element at will. (Of course, you'd need to hold onto the original trained dictionary for both compression and decompression, for any row you want to be able to decompress in the future. And you might want to retrain the dictionary every so often for slowly-changing types of data, which might prevent "drift" of the efficiency towards less-efficient over time)
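
To make that concrete, here is a minimal sketch using the python-zstandard bindings; the sample log lines and the dictionary size are made up for illustration:

  import zstandard as zstd
  # Hypothetical training set: many small, self-similar log rows.
  samples = [f"GET /item/{i} 200 {i % 97}ms user={i % 13}\n".encode() for i in range(5000)]
  dict_data = zstd.train_dictionary(4096, samples)  # training wants a large, varied sample set
  cctx = zstd.ZstdCompressor(dict_data=dict_data)
  dctx = zstd.ZstdDecompressor(dict_data=dict_data)
  row = b"GET /item/42 200 17ms user=3\n"
  blob = cctx.compress(row)                # small output: the dictionary is not embedded
  assert dctx.decompress(blob) == row      # the same dictionary is needed to decompress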

I believe Postgres already uses this under the hood for some columnar data. It wouldn't take much to index it before compressing it and just decompress it at will. Or maybe it just got added? https://devm.io/databases/postgresql-release


I do this to make extraordinarily small UDP packets for a low latency system. I record the raw payload then build a dictionary for the data, then share it on both sides. It reduces the packet overhead by removing the dictionary and it does a much better job than other approaches.


I saw that zstd and brotli both suppport creating custom dictionaries but I couldn't find any tutorials showing how to do this. Perhaps you could share code?


Basically,

`zstd --train <path/to/directory/of/many/small/example/files/>`

will output a dictionary file, and then the `-D <path/to/dictionary/file>` option when used for either compression or decompression will then use that dictionary first.

You can also investigate "man zstd" or google "zstd --train" for more details. The directory for the training must consist of many small files each of which is an example artifact; if you want to split, say, a single log file into files of each line, you can use, say, a bash script like this (note that I just created this with ChatGPT and eyeballed it, it looks correct but I haven't run it yet!): https://gist.github.com/pmarreck/91124e761e45d6860834eb046d6... (Also, don't forget to set it as executable with `chmod +x split_file.bash` before you try to run it directly)


Thank you so much. I was trying to create a dictionary last night and your comment was sent by God. You're doing the Lord's work frfr! I followed you on GitHub!


remember that if you don't understand a particular line of code, you can have chatgpt explain it... have fun


There are also parallel versions of bzip2 (pbzip2), lzip (plzip), and xz (pixz).

Depending upon the data, the non-threaded versions of these utilities can have higher performance when run with some kind of dispatcher on multiple files.

The GNU xargs utility is able to do this, and the relevant features are also in busybox.


GORDON BELL, ADAM! That was a great episode. The most amazing thing to me was how this guy was just messing around, a compression hobbyist, if you will - and then he is being courted by FAANG companies. He just walked into it, almost by accident.

I work in VMWare Fusion on a Mac, in a Mint guest OS, and zipping these huge instances for backup will take forever with a single core. Pigz punishes all 12 cores on my Mac mini and saves me a ton of time.


Thanks! Part of the strength of the story is the humility of Yann. He could have presented himself as a genius who was destined for greatness but he's humble enough to look at himself as an everyman.


All the episodes of your podcast are excellent. Thank you for making it and keep it up! Just became a Patreon supporter, been meaning to for a while.


Thanks for listening and for supporting me!


I loved this episode, it was very engaging to the very end, I wish there were more episodes! (I already listened to all of them so far) Thank you for doing this podcast!


That was a great read! Very inspiring to hear about a near-middle-age person keeping the flame stoked on a little side hobby, and having it turn into something world-changing. So cool!


Thanks for reading!


If you are interested in optimizing parallel decompression and you happen to have a suitable NVIDIA GPU, GDeflate [1] is interesting. The target market for this is PC games using DirectStorage to quickly load game assets. The graph in [1] shows DirectStorage maxing out the throughput of a PCIe Gen 3 drive at about 3 GiB/s when compression is not used. When GPU GDeflate is used, the effective rate hits 12 GiB/s.

If you have suitable hardware running Windows, you can try this out for yourself using Microsoft's DirectStorage GPU decompression benchmark [2].

A reference implementation of a single threaded compressor and multi (CPU) threaded decompressor can be found at [3]. It is Apache-2 licensed.

1. https://developer.nvidia.com/blog/accelerating-load-times-fo...

2. https://github.com/microsoft/DirectStorage/tree/main/Samples...

3. https://github.com/microsoft/DirectStorage/blob/main/GDeflat...

Disclaimer: I work for NVIDIA, have nothing to do with this, and am not speaking for NVIDIA.

Edit: oops, lost the last sentence in the first paragraph during an edit.


Nvidia also has nvcomp: https://github.com/NVIDIA/nvcomp


> the effective rate hits 12 GiB/s

I assume this is for decompressing multiple independent deflate streams in parallel?

What's the throughput if you only have a single stream? I realise this is the unhappy-case for GPU acceleration, hence my question! (I've been thinking about some approaches to parallelize decompression of a single stream, it's not easy)


The data is compressed with GDeflate, not deflate. The single stream is designed to use the parallelism of a GPU. It is described here:

https://github.com/microsoft/DirectStorage/blob/main/GDeflat...

The GPU decompression benchmark I linked earlier allows you to specify a single file that it will compress with GDeflate (and zlib for comparison). The numbers presented in the docs that come with the benchmark and presented elsewhere are consistent with my own runs using a source file that is highly compressible.

Part of the trick of achieving this speedup is to read the data fast enough. I don't know of any NVMe drive that can reach full speed with a queue depth of 1. While running the benchmark in a windows VM with a GPU passed through, on the linux host I observed that the average read size was about 512k and the queue depth was sometimes over 30.


> I've been thinking about some approaches to parallelize decompression of a single stream, it's not easy

You saw this, right?

https://news.ycombinator.com/item?id=35915285


I consider using an index to be "cheating" - or rather, my intended use-case is decompression of a stream that you've never seen before, which was generated by a "dumb" compressor.

That said, the approach I intend to take is similar. The idea is that one thread is dedicated to "looking ahead", parsing as fast as it can (or even jumping far ahead and using heuristics to re-sync the parse state. There will be false-positives but you can verify them later), building an index but not actually doing decompression, while secondary threads are spawned to do decompression from the identified block start points. The hard part is dealing with missing LZ references to data that hasn't yet been decompressed. Worst-case performance will be abysmal, but I think on most real-world data, you'll be able to beat a serial decompressor if you can throw enough threads at it.


There also is this: https://github.com/mxmlnkn/pragzip I did some benchmarks on some really beefy machines with 128 cores and was able to reach almost 20 GB/s decompression bandwidth.


Interesting. It looks like https://github.com/zrajna/zindex became public about a year after my searches for parallel uncompression came up empty and I started hacking on pigz.


Posted a question (now deleted) asking if it could be done on the gpu not noticing you already posted this. Thanks for sharing.


I wonder if ML uses this to speed up training, and if not, why not.


The model weights (the thing being updated by the training process) stay loaded in gpu memory during training (the slow part). This could be useful to serialize the model weights to disk when checkpointing or completed, but it's a drop in the bucket compared to the rest of the time spent training.


I meant it more for the image data


John Carmack just had a tweet today on this problem:

>I started a tar with bzip command on a big directory, and it has been running for two days. Of course, it is only using 1.07 cores out of the 128 available. The Unix pipeline tool philosophy often isn’t aligned with parallel performance.

https://twitter.com/ID_AA_Carmack/status/1656708636570271768...


I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.

You can complain about philosophies but this is just using the wrong tool for the job. Complain about bzip if you feel the bzip authors should have made multithreaded implementation for you.


Unix generally favors processes over threads, at least the old school Unix. Threads are a more recent innovation.

The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.

Eg, tar is running on one core dealing with the filesystem, gzip is running on another core compressing stuff.

In the early days, Windows would have a single program doing everything (eg, Winzip) on a single core, while Unix would have a pipeline with the task split into multiple programs that would allow for this implicit parallelism and perform noticeably better.

Today this is all old and doesn't cut it anymore on 128 core setups.


“Recent” as in the 80s or 90s, sure. Threads are older than unix was when threads were introduced.


I’m not sure I understand. Typically I’ve thought of threads called “threads” being formalized with POSIX threads. Before that, while my memory is vague on this, it was just lumped into multiprocessing under various names.


Even if you ignore threading models prior to pthreads, pthreads itself dates to 1995.


And Unix 1973


So: Unix was ~22 when Pthreads were introduced in 1995, and Pthreads are 28 years old now.


I agree with most of these points except blaming this on processes vs threads. The only difference is all memory being shared by default, vs explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.


That’s fairly new, with threads and processes becoming basically the same. Historically threads didn’t exist, then they were horrifically implemented and non-standard, then they were standardized and were horrifically implemented, then they were better implemented but the APIs were difficult to use safely, etc etc. Also threads were much more lightweight than a process. This shifted with lightweight processes, etc.


That's true, but if you're looking that far back, multicore is new too


I guess “that far back” becomes different as you get older :-) it doesn’t seem that long ago to me :-)


Well, if you split the file into chunks you could fan it across cores by compressing each individually.


Having to do things like this is exactly the problem.


Processes can still be fine. If you're lucky you can be debating over whether you'd like something to be 99x or 100x faster.


With all due respect to Carmack he’s using bzip in 2023, that’s pretty outdated on every front.


You'd be surprised. There are some workloads - for me, it's geospatial data - where bzip2 clobbers all of the alternatives.


I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while being very close on the compression rate.

Other algorithms like zstd and gz resulted in much lower compression rates.

I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.

On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?


I've not seen one that picks the best compression algorithm, but I've seen ones that perform a test to try and determine if it is worth compressing. For example, the borg-backup software can be configured to try a light & fast compression algorithm on each chunk of data. If the chunk compresses, it then uses a more computationally expensive algorithm to really squash it down.


I've also noticed for some text documents (was it json? I don't remember) that bzip compresses significantly better than xz (and of course gzip/pigz). Not sure if I tested zstd with high/extreme settings at that time.


For some reason, bzip compresses text incredibly well. And it has for years, I remember noticing this almost 20 years ago.


It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.


Oh, that's interesting, I stopped using bzip2 at the time kernel sources started shipping in xz.

Do you know if there are any tests showing which compressor is better (compression wise) for which data?


With all respect to John Carmack (and it is really a lot of respect!) I'm surprised he seems unaware of pbzip2. It's a parallel implementation that scales almost linearly with the number of cores, and it has been around since ~2010, so it's not yet old enough to drive, but anyone dealing with bzip2'ing large amounts of data should have discovered it long ago.

And yes, use zstandard (or xz, where the default binary in your distro is already multithreaded) where you can.


> anyone dealing with bzip2'ing large amounts of data

... should really reconsider their choice of compression algorithm at this point.


But pigz shows that the unix pipeline philosophy works just fine. (of course compressing before tarring is probably better than compressing the tarred file, but that should be pipelinable as well)


For many small files, compressing first will compress worse, because each file has its own dictionary; compressing last means you can take advantage of similarities in files to improve compression ratio.

Compressing first can also be slower if the average file size is smaller than the block size, because the main thread cannot queue new jobs as fast as cores complete them (this happens e.g. with 7zip at fast compression settings with solid archive turned off). Tarring then compressing means small files can be aggregated into a single block, giving both good speed and compression ratio.


Zstandard has a dictionary functionality which allows you to pre-train on sample data to achieve higher compression ratios and faster compression on large numbers of small files.


Don't you then need to store that dictionary somewhere out of band? It seems like you would still need a tar-like format to manage that, at which point as an archive format it seems more complicated with a worse separation of concerns.


Interesting - does it do that automatically or is there a manual step?


> compressing before tarring is probably better

Not if the files are similar. If you're compressing the files separately you'll start with a clean state rather than reusing previous fragments. Compressing a BMP after a TXT may not be beneficial, but compressing 3 tar'ed TXTs is definitely better than doing them separately.


That is true, and it can easily halve the total size with small and similar files, but it also means you have to unpack the whole archive when you need the last file in a tarball.

AFAIK the gzip command still cannot compress directory information and therefore needs tar in front of it if you want to retain a folder structure.


Sometimes you need fast indexed access to a specific file in the compressed content without decompressing the entire file (let's say JARs, that are just ZIPs).

TIL: you can use method 93 - Zstandard (zstd) Compression - with ZIPs


The only point of using .zip is for maximum compatibility, and you lose that if you use anything other than deflate or no compression. If you are going to use something else you might as well use a less wasteful and better defined archive container - or something entirely custom like most games do.


93?



There is a _parallel_ bzip2:

http://compression.great-site.net/pbzip2/

which should solve the 'my cores are idle' issue.


He should just be using pbzip2 :)

https://linux.die.net/man/1/pbzip2


How big is that file... I have 2TB files compressed down to ~300GB and gunzip'ing them takes ~2-3 hours. Granted, that's still a long ass time, but not 2-3 days.

If anything, I wonder what kind of hard drive John has. If you're reading them off a network drive backed by tape drums it's probably going to take a while ;P


bzip2 is much slower than gzip.


The real bzip2 problem is that even decompression is slower.


bzip2 is not much slower at compression than gzip; gzip is far faster at decompression though. Either way since bzip2 is a block based compressor, parallelization is trivial, and parallel implementations started appearing about 20 years ago; pbzip2 is almost certainly in whichever package manager is in use for TFA.


Of course the problem there is that `tar` is outputting a single stream. You might, in similar situations, start multiple `tar` running on subsets of the input, which pipelines then become fully parallel again.


Shame he didn't discover pbzip2 before starting that job.


If it's less than 98% complete he could still stop it, start over and still finish sooner.


I'd bet tar becomes the bottleneck before pbzip2 does on that multicore monster. It can be surprisingly slow and I don't think any version of tar uses more than one core.


There are several programs to solve this problem. He just wanted to complain and write something built around that.


Is the .07 of a core a margin of error or some kind of status report done on a different core?


IIRC it's because core usage is an average over time, and you can get weird situations where switching between cores (which happens frequently depending on the scheduler) causes the average usage of one core by the program to not be zero yet, and the usage on another core to already be 100.

This is purely speculation though


Unless the recipient of whatever you are compressing absolutely requires gzip, you should not use gzip or pigz.

Instead you should use zstd as it compresses faster, decompresses faster, and yields smaller files. It also supports parallelism (via “-T”) which supplants the pigz use case. There literally are no trade-offs; it is better in every objective way.

In 2023, friends don’t let friends use gzip.


> There literally are no trade-offs; it is better in every objective way.

There literally are trade-offs, you started your comment describing one of them. If you want as wide out-of-the-box support as possible, you'd go with gzip.

The Compression Streams browser API only supports gzip (+ deflate) so if you wanna compress something natively in the browser without 3rd party libraries (or slow JS implementation), gzip seems to be the only option.


People had your exact sentiments, concerns and hesitations after gzip showed up in the early 90s. Eventually they moved on from pkzip/lzh/etc. to better, modern software - some on their own, owing to being reasonable people, and some being dragged along with their claws in the ground while screaming about "breaking support".


It was easier back then, because there were fewer people developing new compressors.

I've lost count of how many compressors have been marketed as a replacement for gzip over the decades. And it's always a replacement for gzip. Every time a new compressor starts getting popular, people start promoting a new even better replacement, and gzip never gets properly replaced.

zstd finally has some potential to replace gzip, but only if people accept it's good enough and stop trying to replace it with something even better.


"zstd finally has some potential to replace gzip"

bzip2 and xz have had the potential to replace gzip for the vast majority of users and use cases for more than a decade - and in many cases they have.


What I'm trying to say is that the excessive focus on cutting-edge technology is holding back progress.

gzip is still the default compressor people use when they are not sure about the appropriateness of other compressors in their specific use case, and they don't have the time or energy to find out. To replace it, a compressor must satisfy two requirements:

* It must not be substantially worse than gzip on any relevant metric. bzip2 failed this by being slow.

* It must be ubiquitous enough that the idea of installing it no longer makes sense. xz never reached this point, before people started replacing it with better compressors.


I think these arguments are a bit contrived. bzip2 made progress on compression ratio. xz (and zstd) made progress on both compression ratio and speed. Neither hovered around the idea of cutting-edge "technology" (they all fall into the same technology: general data compression algorithms) because they aren't niche oddities like e.g. paq. But I don't understand why a successor must trump gzip in both aspects. gzip certainly didn't trump all of its predecessors on both aspects, and both aspects take turns being the more important one depending on user and scenario.


By cutting edge technology, I meant the latest products that are better than the earlier ones.

I work in bioinformatics, where people typically use either gzip or domain-specific compressors. gzip is used for the reasons I mentioned. It works, it's usually good enough, and if people in another organization you've never heard of want to use your compressed files, they can do so without bothering you with support requests.

zstd would be faster and compress better, but because you can't be sure everyone else can use it, you don't even bother thinking about it. The saved computational resources are probably not worth it. On the other hand, anything that makes gzip faster is valuable, as it allows saving computational resources without taking any interoperability risks.

I didn't say the gzip replacement must be better than gzip in every aspect. I said it must not be substantially worse. bzip2 was substantially worse, because it was substantially slower.


Well, if you care about supporting compress/decompress in the browser (natively) then you pretty much have the choice of using gzip or gzip, so there is still the limitation of support, no matter where you try to drag me.


gzip has the advantage of being ubiquitous. It's pretty much guaranteed to be available on every modern Unix-alike. And is good enough for most purposes.

Zstd is getting there but I personally don't bother with it on a daily basis except in situations where both performance and compression ratio are important, like build artifact pipelines or large archives.


You're reading old comp-sci usenet discussions from 1993 and you come across this statement: "pkzip/lzh/arc/zoo have the advantage of being ubiquitous. We should not encourage the use of gzip". You chortle.


You make it sound like installing zstd is a big deal. Which it is not.


It definitely can be on legacy systems.


zstd had a data corruption bug until quite recently. Eventually it may supplant gzip as the de facto standard, but it's too soon to declare it better in every objective way. Give it time.

https://news.ycombinator.com/item?id=35446847


Not really.

"The recipients" are for example millions of browsers that don't understand zstd.


I agree. I wanted to use Brotli for my startup because it allows creating custom dictionaries but I had to resort to gzip because Brotli was difficult to setup on my CDN.


`tar czf` is a lot easier to remember than `tar -I zstd cf`


GNU tar can autodetect the compression algorithm, both for compression and decompression.

  $ tar -caf dst.tar.zst /src
  $ tar -xaf src.tar.zst
(it's fine to omit -a for decompression)


I never remember either, so I might as well look up the latter rather than the former.


  tar --create --zstd --file


Similarly, for bzip2 there is pbzip2 (http://compression.great-site.net/pbzip2/?i=1).

zstd & xz support the "-T" argument for setting thread count. If you pass "-T 0" it will attempt to detect and use a thread per physical core.


AFAIK (not 100% sure), multithreading support is different - parallel versions split the file into multiple segments and compress each independently, while multithreaded functionality applies to the same stream (no hard splitting). For this reason, there's for example pzstd, in addition to zstd.


Yes, pbzip2 divides up the file into blocks per core. Though I think some versions (older?) of bzip2 are unable to handle pbzip2 archives.

I used pbzip2 on an old octo core xeon server with a decent sas raid and was able to compress at well over 200MB/sec, closer to 300MB in some cases.


bzip2 is already block based though, so there is no compatibility issue in that specific case (vs pbzip2) though I think pbzip2 supports larger blocks than the original bzip2.


Tangential question: compressed files behave like hashes, in the sense that if something changes at the beginning, all the other parts are different, right?


Usually but it depends on the compression scheme. There’s usually a “window” of how far back they look, so they can resync after a while, but it’s unlikely and the offsets will likely have changed so you need to handle that.

You can force this property by introducing synchronisation points though, gzip has an `--rsyncable` which makes that a lot more likely, at a small compression cost.

Edit: apparently zstd has also had --rsyncable for the last 5 years.


For compression efficiency, it makes sense to have one large common dictionary. For compression speed, it is easier to have a dictionary per chunk. I still hope they use the common dictionary; if so, any change in the beginning likely affects further parts if it affects the dictionary and thus the way how the later parts are compressed. Same for farther parts affecting the way earlier parts are compressed.


I have seen at least one case where pbzip2 created files which could not be opened by some .NET implementation of the decoder, but the same decoder could open files created by lbzip2 just fine. No idea why.


Useful with Docker, see https://github.com/moby/moby/pull/35697

I’ve integrated pigz into different build and CI pipelines a few times. Don’t expect wonders since some steps still need to run serially, but a few seconds here and there might still add up to a few minutes on a large build.


Am I reading correctly that Docker just automatically uses pigz if it’s in the system path? I’ve used both for years and had no idea. I’m definitely going to make sure it’s installed in CI pipelines going forward, I know of some bloated image builds it will definitely help with!


Correct, if it detects unpigz it will use it. It will not compress layers with it, though.


I built a custom dpkg with parallel xz for speeding up the compression of large omni style deb packages. Totally worth it.


Turns out it was longer ago than I thought— way back in the Ubuntu 14.04 and 16.04 days:

https://launchpad.net/~mikepurvis/+archive/ubuntu/dpkg


Still blows my mind that people still use gzip. 20 years ago I was expecting that by this point in time there would be lots of effort put into increasing compression and working towards getting that fast; instead it's been a push for speed. It makes sense with how the internet has changed. These days gzip isn't even in the top 100 as far as compression goes; hell, even something like RAR or 7zip is far behind the best.

Take something like enwik8 (100megs), gzip will get that down to 36megs, with LZMA down to ~24-25. The top of the line stuff will get it down to the ~15meg range. Thats a huge difference.


It shouldn't, it's still plenty fast for 95% of the stuff out there that you wanna make a zipped archive of. It's pretty much installed everywhere too


I remember moving a HUGE mysql table (>500GB) with a pipe chain of mysqldump > pigz > scp (compression disabled) > pigz > mysql

If you've ever screwed around with mysqldump -> tar -> scp -> untar -> mysql, you'll appreciate the speedup on this, in cases where you're setting up a slave and want to have the freshest possible data before kicking off binlog replication - this is the best.


I update/upgrade/switch over to zstd (from older compressors) wherever I'm updating or revamping any of my data pipelines. Looks like a win^3 for me: 1) It's probably either in the top-X or #1 in any of the usual compression metrics size/speed/convenience/ease etc. 2) Can do --rsyncable and create rsync friendly files at tiny size cost. 3) On the rare occasion I need it, there's $ zstd -c file1 >file.zst; zstd -c file2 >>file.zst, and then $ zstd -dc file.zst will produce the output of $ cat file{1,2}


The issue with pigz is that uncompressing doesn't really parallelize beyond a three stage read/uncompress/write pipeline.

This is of course more of a problem of the gz format than pigz although last time I looked hacks are possible to parallelize decompression.


I implemented parallel decompression a while back. It is in Solaris 11.3 and later.

https://github.com/oracle/solaris-userland/blob/master/compo...

Shortly after submitting a PR the code went through major surgery, and my patch then needed a similar amount of surgery. Oracle then whacked most of the Solaris org, and I don’t think this ever got updated to work with the current pigz.


Would you mind creating a fork of pigz in GitHub and add this patch? I would be interested in testing it out!


You can grab the version from the solaris userland repo I linked and use it without me completing a homework assignment. Just grab the pigz-2.3.4 source then apply the patches from [1] in the proper order. Maybe some of them aren't needed for non-Solaris.

1. https://github.com/oracle/solaris-userland/tree/master/compo...

I thought I had opened a PR for that a long while ago, but it doesn't show up on github these days. In any case, I did ask Mark Adler to review it. It was never a priority, then the code changed in ways that I don't really want to deal with.

While looking through the PRs, I noticed a PR for Blocked GZip Format (BGZF) [2]. That's very interesting, and perhaps suggests that bgzip is a tool you would be interested in.

2. https://github.com/madler/pigz/pull/19


Nice! You should be able to do it without an index by periodically restarting the dictionary on compression and then looking for something resembling the dictionary, right?


Yeah, probably so at the cost of compatibility. As implemented, the .gz file can be used with `gzip -d`.


I have not only implemented parallel decompression but also random access to offsets in the stream with https://github.com/mxmlnkn/pragzip I did some benchmarks on some really beefy machines with 128 cores and was able to reach over 10 GB/s decompression bandwidth. This works without any kind of additional metadata but if such an index file with metadata exists, it can double the decompression bandwidth and reduce the memory usage. The single-core decoder has lots of potential for optimization because I had to write it from scratch, though.


> It is not pronounced like the plural of pig.

I've got news for you buddy :)


> exploits multiple processors and multiple cores to the hilt when compressing data

As a side note, this isn't always desirable for this class of coders. In some scenarios (like a web server) you might want to favor throughput over response time.


`zstd --adapt` is pretty cool as it detects how much output buffer it has and changes compression effort on the fly to try to achieve maximum throughput.


I think GP is talking about the case where you have more than one client and don’t want to throw all the server’s threads at serving just one.


Hardware-accelerated pigz: "GZIP Acceleration with AIX on Power Systems" [1]

[1] https://community.ibm.com/community/user/power/blogs/brian-v...


I see someone else read the Carmack post complaining about single threaded compression performance on Unix.

Hopefully my tweet response was the one to tip you off! ;P Though in all likelihood I'm quite sure a number of people commented pointing at pigz.

Hats off to all who write extraordinarily performant multithreaded versions of originally-slow-at-scale UNIX system tools.


It's been a very useful piece of software for over a decade.

I am only disappointed with this one: "It is not pronounced like the plural of pig."

Me and my colleagues always pronounced it like pigs, die Schweine - and it was so much fun!


Similarly, for zipping files in JS, I have added the ability to compress zip files on several cores in zip.js [1]. The approach is simpler, as it consists of compressing the entries in parallel. It still offers a significant performance gain though when compressing multiple files in a zip file, which is often the nominal case.

[1] https://github.com/gildas-lormeau/zip.js


While pigz is great as a general replacement for gzip, for most purposes nowadays either LZ4 or zstd are better choices for fast compression+decompression.


Because I can never remember to use pigz I have to have this in my dotfiles:

  function ccm() {
    tar -cf - $1 | pigz > $1.tar.gz
  }


Okay, I feel that my first response was too snarky. I'm sorry. In its place, I'll say:

You wouldn't go outside without pants. You shouldn't use a variable without quotes. Put pants on all variables. Also, you shouldn't use () with 'function'; it "works" in Bash but it's not standard:

  ccm() {
    tar c "${1}" | pigz > "${1}.tar.gz"
  }
You could further improve it with a loop to accept multiple files:

  ccm() {
    for i in $@; do
      tar c "${i}" | pigz > "${i}.tar.gz"
    done
  }
Now you can run it like: `ccm file1 file2 file3`

See: https://mywiki.wooledge.org/Quotes

See: https://mywiki.wooledge.org/BashGuide/CompoundCommands#Funct...


Good thing you never use file names with spaces ;)


Containerd will utilize unpigz if it’s on your PATH, thank me later: https://github.com/containerd/containerd/blob/main/archive/c...


I always install aria2c and set package manager + wget to use it for any system file downloads... basically it will open X connections to download files based on the file size and in the process give a pretty notable speed up on those slow single-connection package repos or download URLs. For reference it can cut 2-3 minutes off of an ubuntu dist-upgrade and even more if you're on a fast-but-far connection.


My understanding is that this all works because you can concat two gzip files and the outcome is the same as concatenating the original files

  $ gzip -c a > a.gz
  $ gzip -c b > b.gz
  $ cat a b > c1
  $ cat a.gz b.gz > c.gz
  $ gzip -dc c.gz > c2
  $ cmp c1 c2
  [no output, files match]


I'm a big fan of pigz. I use it in my home-grown backup script for my Linux laptop. It can compress the incremental tar output from my filesystem snapshot fast enough to saturate the I/O to my external USB3 hard drive. This is a low bar, but single-threaded gzip (or bzip2) could not do it!


Give xz and zstd a go. You'll love them.


One approach is to perform the bitstream parsing and Huffman decoding concurrently, while carrying out the LZ77 decoding sequentially. This method does not rely on a specialized encoder, such as pigz, that isolates LZ references into chunks.


I was mostly interested in the name and the pronunciation section kind of ruined it for me


If it's to be pig-zee (pixie) for the Americans, it can be pig-zed (pig's head) for the rest of us.


That's funny. Exactly, in French we pronounce z like "zed", and so that allows us to keep the funny part. Even funnier than "pigs". Thanks


That's really confusing since `pixz` exists and its "pixie" pronunciation actually works

https://github.com/vasi/pixz


I will pronounce it like "pigs" anyway. More fun.


> I'm glad you asked. It is pronounced “pig-zee”. It is not pronounced like the plural of pig.

Then don't spell it like that. And I'll continue to say GIF with a hard G also.


How does it compare perfomance-wise to Intel ISA-L? https://github.com/intel/isa-l


I'm starting to get sick of those cartoon project names. Not sure what the alternative would be but it's increasingly rubbing me the wrong way.


pigz ... parallel implementation of gzip - that's not even a stretch of the meaning.

Also - you'll never forget it.


"gzip-parallel"?


I am so used to bzip2 by now, it'll be hard to switch to something else.


Also works with zlib-ng


Does pigz offer any advantages over bgzip?


Best of luck with the implementation, but I do hope the authors realise they should avoid naming their software tools like an old-school pornographic film.

I thought we had learned that from the GIMP[1].

[1] https://www.theregister.com/2019/08/28/gimp_open_source_imag...


There is nothing pornographic about “pigz.”


Pigs elicits images of heavy exchange of bodily fluids between the participants



