pigz: A parallel implementation of gzip for multi-core machines (github.com/madler)
289 points by firloop on Oct 17, 2022 | 106 comments



Funny this comes up again so soon after I needed it! I recently did a proof-of-concept related to bioinformatics (gene assembly, etc...), and one quirk of that space is that they work with enormous text files. Think tens of gigabytes being a "normal" size. Just compressing and copying these around is a pain.

One trick I discovered is that tools like pigz can be used to both accelerate the compression step and also copy to cloud storage in parallel! E.g.:

    pigz input.fastq -c | azcopy copy --from-to PipeBlob "https://myaccountname.blob.core.windows.net/inputs/input.fastq.gz?..."
There is a similar pipeline available for s3cmd as well with the same benefit of overlapping the compression and the copy.
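For anyone curious, a minimal sketch of that s3cmd variant, assuming a reasonably recent s3cmd with stdin support and a placeholder bucket/key:

    pigz -c input.fastq | s3cmd put - s3://my-bucket/inputs/input.fastq.gz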

However, if your tools support zstd, then it's more efficient to use that instead. Try the "zstd -T0" option or the "pzstd" tool for even higher throughput, but with some minor caveats.

PS: In case anyone here is working on the above tools, I have a small request! What would be awesome is to automatically tune the compression ratio to match the available output bandwidth. With the '-c' output option, this is easy: just keep increasing the compression level by one notch whenever the output buffer is full, and reduce it by one level whenever the output buffer is empty. This will automatically tune the system to get the maximum total throughput given the available CPU performance and network bandwidth.


zstd has --adapt:

       --adapt[=min=#,max=#]
              zstd will dynamically adapt compression level to perceived I/O conditions. Compression level adaptation can be observed live by using command -v. Adaptation can be constrained between supplied min and max levels. The feature works when combined with multi-threading and --long mode. It does not work with --single-thread. It sets window size to 8 MB by default (can be changed manually, see wlog). Due to the chaotic nature of dynamic adaptation, compressed result is not reproducible.
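For reference, a minimal invocation looks something like this (the level bounds and the remote host are placeholders):

    zstd -T0 --adapt=min=3,max=19 -v -c input.fastq | ssh user@remote 'cat > input.fastq.zst'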


I really should have read the documentation! That feature looks awesome, but in a quick test it could only use about 50% of the available output bandwidth. My upload speed is 50 Mbps, but zstd could only send about 25 Mbps.

Similarly, on a local speed test (SSD -> SSD), using a fixed compression level was much faster than --adapt.


My copy of that manual page has additional text:

  "" note : at the time of this writing, --adapt can  remain  stuck  at  low speed when combined with multiple worker threads (>=2). ""

There are some --zstd tunables under ADVANCED COMPRESSION OPTIONS that might help.

Leave wlog alone unless you're willing to store the value out of band and pass it in again during decompression.

hashLog: a bigger number uses more memory to compress, but is often faster.

chainLog: a smaller number compresses faster, but with a worse ratio.

In your use case monitoring general system utilization to identify bottlenecks might also help. My gut instinct is that you might already have hit a memory bandwidth limit for the platform, at which point REDUCING the hashLog until it fits within your intended performance budget might yield better bandwidth results. Reducing the chainLog value might have the same effect.
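As a sketch only, these are set through the --zstd= option; the values below are illustrative rather than recommendations:

    # smaller hashLog/chainLog trade compression ratio for speed and memory
    zstd -T0 --zstd=hashLog=22,chainLog=23 -c input.fastq > input.fastq.zst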


If you're running your test over the internet (fluctuating latency, some packet loss), try enabling the BBR [1] TCP congestion control algorithm on the sender side to utilize the available bandwidth more efficiently.

[1] https://en.wikipedia.org/wiki/TCP_congestion_control#TCP_BBR
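On Linux, assuming a kernel with the tcp_bbr module available, the switch is roughly (fq is the qdisc commonly paired with BBR):

    sysctl -w net.core.default_qdisc=fq
    sysctl -w net.ipv4.tcp_congestion_control=bbr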


Bioinformatician here. Consider using bgzf instead. It's fully backwards compatible with gzip (it is a subset of gzip), but de/compression can be implemented to be much faster, and it's also much easier to parallelize. Compression rates are slightly lower, i.e. it creates larger files.

The bgzf format was invented for bioinformatics, and BAM files are bgzipped by default.
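A minimal example with the bgzip tool from htslib (the thread count is arbitrary); the output is still readable by plain gunzip:

    bgzip --threads 8 input.fastq        # produces input.fastq.gz
    bgzip -d --threads 8 input.fastq.gz  # decompress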


These days, skill in IT is not about how you use tools, but simply knowing that a tool exists.


Hasn't that always been the case? When I lived in the Bay Area and networked, and was exposed to so much of how people did things, I was able to land many more contracts because I could come up with simple existing solutions. Once I moved away and wasn't exposed to cool people doing cool problem-solving things, I had to switch to more bog-standard consulting.

It blew people's minds when we could implement, in just a matter of months, huge projects that they could never get off the ground for years, because of 'this one cool trick'.

Man I miss Santa Cruz ಥ﹏ಥ worst mistake of my life to leave there.


bgzf has a clear advantage when it comes to sorted BED/VCF/GTF formats, especially if one does index these. But frankly I have no idea if it may improve IO times for the fastq files when read by say bwa or another mapper. Do you have any experience with that? I have seen some mapping times improvement using fastqs with clustered reads (by clumpify).


Yes, bgzf can be much faster in practice. Both because the underlying gzip implementation can be simplified (see e.g. libdeflate), and because it can be more effectively parallelised. It doesn't matter if it's IO-bound of course, but compression rarely is.


While the topic of compressing FASTQ comes up, you might be interested to know that k-mer sorting FASTQ files can lead to around an 8% improvement in compression, depending on the diversity.

I believe this is because it puts similar strings together in the tree gzip uses. clumpify.sh from bbtools is one example.


> one quirk of that space is that they work with enormous text files.

Why not use some binary format like BSON? Compressing with gzip works but then you can't query it compressed


It was probably fasta or fastq format, which is pretty much line-separated strings of the DNA letters. BSON won't help with that. You could try to squeeze each of the 4 letters into 2 bits, but they use a few more letters to indicate "unknown", so it's not that easy. And even the simplest compression algorithms just do that for you.

And text is easier to work with in any hacked up script.

There are some binary formats (BAM, I think), but people often prefer the text format anyway. When compressed, the size is pretty much the same


> And even the simplest compression algorithms just do that for you.

They do, but you need to uncompress to read the N-th letter


Because scientists are woefully undereducated on the finer points of computer science. "Does it (eventually) produce publishable papers?" is all they care about.


Apart from tuning up the compression you may gain a lot by clustering reads in FASTQ files using clumpify from BBMap:

https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb...

You may also filter out duplicates and, depending on what you do, correct base errors or get rid of low-complexity reads.
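A rough sketch of such a run, with hypothetical file names (dedupe is optional):

    clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz dedupe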


The bit I found most interesting was actually:

https://github.com/madler/pigz/blob/master/try.h

https://github.com/madler/pigz/blob/master/try.c

which implements try/catch for C99.


Postgres uses a similar custom implementation of try/catch [1].

[1] https://github.com/postgres/postgres/blob/master/src/include...


But why? Most modern languages try to get rid of exceptions (Go, Kotlin, Rust).


> But why? Most modern languages try to get rid of exceptions (Go, Kotlin, Rust).

All three of those languages actually have exceptions, they just don't encourage catching exceptions as a normal way of error handling

Also, while the trend now seems to be for newer languages to encourage use of things like result types, one of the main reasons for that is that in current languages it is easier to show in the type system that functions can potentially fail using result types rather than exceptions.

Otherwise, there isn't necessarily inherently a strong reason to prefer one or the other, and it's possible that future languages will go back to exceptions but have a way to express that in the type system using effects, etc.


No clue if this is the reason, but exceptions are actually really fast. With them (or longjmp-based error handling), the "happy path" gets to not even think about errors. With error returns (be it Go's error interface, Rust's Result type, or C's "this function returns a negative number on error" pattern), you need a branch after every call which might fail, which does have a measurable (if usually small) impact on performance. Given pigz's entire raison d'être is performance, it wouldn't surprise me if this impacted the choice of error handling style.


Golang has panic / recover / defer which are functionally similar to exceptions. It's actually a fun exercise to implement a pseudo-syntax for try/catch/finally in terms of those primitives.


Go has exceptions, but it's definitely not advised to use those as an error mechanism. Recover is really a last-chance effort for recovery, not a standard error catching method.


That's why they're called exceptions, because some errors are exceptional, otherwise they should be handled in the non-exceptional, standard flow of the program.



They said "try to get rid of", which to me is akin to de-emphasizing.

Of course you'd have to deal with exceptions since you're running on the JVM and want to interop with Java code; that doesn't mean it's idiomatic code.


If you really want to enable all cores for compression and decompression, give pbzip2 a try. pigz isn't as parallel as pbzip2

http://compression.ca/pbzip2/

*edit, as ac29 mentions below, just use zstdmt. In my quick testing it is approximately 8x faster than pbzip2 and gives better compression ratios. Wall clock time went from 41s to 3.5s for a 3.6GB tar of source, pdfs and images AND the resulting file was smaller.

    megs
    3781    test.tar
    3041    test.tar.zstd (default compression 3, 3.5s)
    3170    test.tar.bz2 (default compression, 8 threads, 40s)
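(Presumably the comparison was along these lines; shown only as a sketch, not the exact commands used:)

    zstd -T0 test.tar       # -> test.tar.zst, default level 3
    pbzip2 -k -p8 test.tar  # -> test.tar.bz2, 8 threads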


bzip2 is very very slow though. Some types of data compress quite well with bzip, but if high compression is needed, xz is usually as good or better and natively has multithreading available.

For everything else, there's zstd (also natively multithread)
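For completeness, the multi-threaded invocations are along the lines of:

    xz -T0 -9 archive.tar     # parallel xz, high compression
    zstd -T0 -19 archive.tar  # parallel zstd, high compression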



Decompression is multithreaded by default; compression with an argument. However, it is built in.


on the other hand, bzip2 is pretty much obsoleted now by xzip


Also see: http://enwp.org/zstd

parallel-friendly, trades off compression level for speed


What is xzip? are you talking about xz?



The author of lzip has written pointed criticism of the design choices of xz.

I generally use lzip for data that is important to me.

https://www.nongnu.org/lzip/xz_inadequate.html


We used this to great effect at Facebook for MySQL backups in the early 2010s. The backup hosts had far more CPU than needed so it was a very nice speed-up over gzip. Eventually we switched to zstd, of course, but pigz never failed us.


Pretty similar to that, we used pigz and netcat to bring up new MySQL read replicas in a chain at line speeds.

I recall learning the technique from Tumblr's eng blog.

https://engineering.tumblr.com/post/7658008285/efficiently-c...
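A rough reconstruction of that kind of pipeline (host, port, and paths are placeholders, and nc flag syntax varies between netcat flavours):

    # on the new replica (receiver):
    nc -l 9999 | pigz -d | tar -x -C /var/lib/mysql
    # on the source (sender):
    tar -c -C /var/lib/mysql . | pigz | nc replica-host 9999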


I wrote that Tumblr eng blog post, glad to see it's still making the rounds! I later joined FB's mysql team a few years after that, although I can't quite remember if FB was still using pigz by that time. (also, hi Eric!)

Separately, at Tumblr I vaguely remember examining some alternative to pigz that was consistently faster at the time (11 years ago) because pigz couldn't parallelize decompression. Can't quite remember the name of the alternative, but it had licensing restrictions which made it less attractive than pigz.

Edit: the old fast alternative I was thinking of is qpress, formerly hosted at http://www.quicklz.com/ but that's no longer online. Googling it now, there are some mirrors and also looks like Percona tools used/bundled it. Not sure if they still do or if they've since switched to zstd.


Small world! Thanks for writing that, it was a really clever way to do it and saved me a bunch of time. :)


Same, except we were at a small e-commerce boutique running Magento circa 2011-2013.

SQL backups were simply a bash script using Pigz, running on a cron job. Simple times!


Hey Eric! Hope you’re well!


I chuckled at the name, since out-of-order results are a typical output of parallelization. Kudos.


I also thought the name was clever, but your comment made it even more interesting. Also, my first thought was, "is this safe to use?" I've heard of gzip vulnerabilities before, but a parallel implementation sounds a lot easier to get wrong.


Gzip streams support dictionary resets, which means you can concatenate individually compressed blocks together to make a whole stream.

This is what pigz is doing: splitting the input into blocks, spreading the compression of these blocks over different threads so multiple cores can be used, then joining the results together in the right order.

It is the very same property of the format that gzip's own --rsyncable option makes use of to stop small changes forcing a full file send when rsync (or similar) is used to transfer updated files.

The idea is as simple as it is clever, one of those "why did I not think about that?" ideas that are obvious once someone else has thought of it, so adds little or no extra risk. A vulnerability that uses gzip (a "compression bomb") or can cause a gzip tool to errantly run arbitrary code, is no more likely to affect pigz than it is the standard gzip builds.
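A quick way to see the concatenation-friendliness of the format (this demonstrates member-level concatenation, which is not exactly what pigz does internally, but gunzip happily treats the result as one stream):

    printf 'part one\n' | gzip >  whole.gz
    printf 'part two\n' | gzip >> whole.gz
    gunzip -c whole.gz    # prints both parts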


Given that, why wouldn't this just be upstreamed into gzip? If it's a clean, simple solution that's just expanding the use of a technique that's already in the core binary?


gzip is a pretty old, pretty core program, so I imagine it's largely in maintenance mode, and that there is a lot of friction to pushing large changes into it. At one point, pigz required the pthreads library to build. If it still does, the gzip people would need to consider if that was appropriate for them, and if not, rewrite it to be buildable without it.

There are multiple implementations of zlib that are faster than the one that ships with GNU gzip, and yet they haven't been incorporated.

There are also just better algorithms if compatibility with gzip isn't needed. zstd, for example, supports parallel compression, and is both faster and compresses better than gzip.


> Given that, why wouldn't this just be upstreamed into gzip?

I suspect to keep the standard gzip as simple, small, and stable as possible. It does the job, does it well enough, with minimal dependencies, has done for many years, and can do so in a wide array of systems including very small environments (in part due to the minimal dependencies).

Core tools like that typically don't get major updates, just security & stability patches as needed and maybe the occasional safe & easy change for performance reasons or to widen the number of supported environments.


While I'm in agreement on all of those points, I find adoption of new tools extremely difficult. There are modern alternatives to many commands (ls, cat, grep, etc.), but if they're not "the default" it becomes near impossible to switch to them.

Given almost all desktops/servers/mobile phones are likely to be multi-core these days, if gzip gained multithreading on desktop it could potentially save time and energy for the whole planet; that seems like a worthwhile benefit?


When reading from media with high random-access latency (for instance traditional hard drives) going parallel could make things much slower, and it will reduce the compression achieved (if only by a small amount), and it will increase the amount of memory consumed during the process, so I wouldn't want it to be the default. Nor would I particularly want it to try be clever and detect the usefulness of going parallel as that could lead to unexpected inconsistency.

I think there is a case for including it as a selectable option, as with --rsyncable, unless this adds extra dependencies (pthreads was mentioned in other comments).


Interesting, I had no idea this would have such an effect, I simply assumed it would be performed in-memory and written out sequentially. I can agree with a flagged option being a nice idea, but what's so heinous about adding a dependency?

I try to avoid it generally in my code too, but for something that could potentially offer a measurable benefit in a world of multi-core, solid-state computing, would it not be "worth it"?


Ah yes, no guarantee of concurrency or ordering (in the headline, lol).

That’d be a pretty funny compression algorithm. You listen to a .mpfoo file, and you’ll hear the whole song, we promise!


Oh, that's the pun. I just saw Parallel Implementation of GZip...


Would not recommend using this in 2022, use zstandard or xzip instead.

zstandard is faster and gives slightly better compression at speed settings equivalent to gzip's, in addition to having the ability to compress stuff at a much greater ratio, optionally, if you allow it to take more time and CPU resources.

https://gregoryszorc.com/blog/2017/03/07/better-compression-...


pigz has the advantage of producing output that can be read by standard gzip processing tools (including, of course, gzip/gunzip), which are available by default on just about every OS out there so you get the faster archive creation speed without adding requirements to those who might be accessing the results later.

It works because gzip streams can be tacked together as a single stream: at the start of each block is an instruction to reset the compression dictionary as if it were the start of a file/stream (which in practice it is), so you just have to concatenate the parts coming out of the parallel threads in the right order. These resets cause a drop in overall compression ratio, but it is small and can be minimised by using large enough blocks.


yes, one consideration is whether you're creating archives for your own later use, or internal use where you also have zstandard and xz handling tools. Or to send somewhere else for wider use on unknown platforms.


Aye, pick the right tool for the target audience. If you are the target or you know everyone else who needs to read the output will have the ability to read zstd, go with that. If not consider pigz. If writing a script that others may run, have it default to gzip but use pigz if available (unless you really don't want that small % drop on compression).


I'm a little confused by this. My copy of zstd has the option --format=gzip; does choosing this option end up using a different, slower compression algorithm?


zstandard can indeed handle standard format gzip files to create and decompress them. From the zstandard compilation options:

HAVE_ZLIB : zstd can compress and decompress files in .gz format. This is ordered through command --format=gzip. Alternatively, symlinks named gzip or gunzip will mimic intended behavior. .gz support is automatically enabled when zlib library is detected at build time. It's possible to disable .gz support, by setting HAVE_ZLIB=0. Example : make zstd HAVE_ZLIB=0 It's also possible to force compilation with zlib support, using HAVE_ZLIB=1. In which case, linking stage will fail if zlib library cannot be found. This is useful to prevent silent feature disabling.
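So, assuming your zstd binary was built with HAVE_ZLIB, something like this produces an ordinary .gz file:

    zstd --format=gzip -c somefile > somefile.gz
    gzip -t somefile.gz   # verifies it is valid gzip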


Yes, indeed I had already read that in the man page, but I did not feel it answered my question, because in my mind, and state of naivety about compression, logically at least, a compression algorithm and final file format needn't stand in a 1-1 relation. But I suppose that ZLIB is just the DEFLATE algorithm, then?


yes, zlib is the library underpinning traditional gzip

zstandard's own compression format is its own separate, much newer thing.

zlib, for instance

https://packages.debian.org/source/buster/zlib


This was great in 2012. In 2022, most use-cases should be using parallelized zstd.


Gzip is everywhere.


Warning for the uninitiated. Be cautious using this on a production machine. I recently caused a production system to crash because disk throughput was so high that it started delaying read/writes on a PostgreSQL server. There was panic!


I use this all the time. It's a big time saver on multi-core machines (which is pretty much every desktop made in the past 20 years). It's available in all the repos, but not included by default (at least in Ubuntu/Mint). It is most useful for compressing disk images on-the-fly while backing them up to network storage. It's usually a good idea to zero unused space first:

(unprivileged commands follow)

dd if=/dev/zero of=~/zeros bs=1M; sync; rm ~/zeros

Compressing on the fly can be slower than your network bandwidth depending on your network speed, your processor(s) speed, and the compression level, so you typically tune the compression level (because the other two variables are not so easy to change). Example backup:

(privileged commands follow)

pv < /dev/sda | pigz -9 | ssh user@remote.system dd of=compressed.sda.gz bs=1M

(Note that on slower systems the ssh encryption can also slow things down.)

Some sharp people may notice that it's not necessarily a good idea to back up a live system this way because the filesystem is changing while the system runs. It's usually just fine on an unloaded system that uses a journaling filesystem.


Alternative way of zeroing unused space without consuming all disk space: https://manpages.ubuntu.com/manpages/trusty/man8/zerofree.8....


Thanks. If run as an unprivileged user, the dd command will not consume ALL of the disk space (so privileged processes will not be disrupted). It will consume up to the free space limit (default 5%) as described here: http://blog.serverbuddies.com/using-tune2fs-to-free-up-disk-...

The zerofree command looks useful, but I don't know how portable it is. The dd method works across many platforms (such as AIX).


Protip: if you're on a massively-multicore system and need to tar/gzip a directory full of node_modules, use pigz via `tar -I pigz` or a pipe. The performance increase is incredible.
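Concretely, either form works (archive name is up to you):

    tar -I pigz -cf node_modules.tar.gz node_modules
    # or, equivalently, as a pipe:
    tar -cf - node_modules | pigz > node_modules.tar.gz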


One interesting bit of trivia is that since ~2020 Docker will transparently use pigz for decompressing container image layers if it's available on the host. This was a nice speedup for us, since we use large container images and automatic scaling for incoming traffic surges.


pigz only parallelizes compression. Decompressing with pigz is single threaded, except perhaps a separate thread is used for crc calculation.

A decade ago I implemented parallel decompression for pigz. This is used in Solaris kernel zone suspend and resume, which was the reason I did the work. I submitted a PR for it but madler never got around to reviewing and merging it. Since then there has been a lot of code churn that makes it a pain to apply to the current version.


Docker is indeed looking for a "unpigz" executable to use: https://github.com/moby/moby/blob/c9d2b7df777b38f7239a882c27...

So it's interesting if they implemented and tested that and only got a marginal CRC speedup.

edit: someone here seems to observe a ~ doubling with unpigz vs zcat: https://unix.stackexchange.com/a/363739


igzip is even faster. See also my answer at the bottom of that linked question, which contains a quick benchmark.


I recently implemented pragzip for parallel gzip decompression https://github.com/mxmlnkn/pragzip . I would be interested to know how your PR back then worked to parallelize the decompression.


I think dracut also uses pigz to create the initrd when installing a new Linux kernel rpm package.


Why do you need a heavy multi-thread compressor if modern initramfs systems (like https://github.com/anatol/booster) create small images of 2MiB and below?

You won't see any improvement from parallelization on this type of data.


Have you optimized the low-hanging fruit in your image size?

Because compression programs are as high-hanging fruit as you can get, and parallelizing them can only be done once.


If you ever run into the limitations of a single machine, dbz2 is also a fun little app for this sort of thing. You can run it across multiple machines and it'll automatically balance the workload across them.

https://github.com/hpc/mpifileutils/blob/master/man/dbz2.1


There is another nice multi-core gzip based library called BGZF[1]. It is commonly used in bioinformatics. BGZF has the added advantage that it is block compressed with built in indexing method to permit seeking in compressed files.

[1] https://github.com/samtools/htslib


Any comparative benchmarks or a write-up on the approach (other than "uses zlib and pthreads" from the README)?


Single-threaded gzip can outperform pigz, or at least come very close, when used with GNU xargs on separate files with no dependencies.

https://www.linuxjournal.com/content/parallel-shells-xargs-u...

https://news.ycombinator.com/item?id=26178257


pigz is most useful on a single stream of data, vs. the more obviously parallel case of files without dependencies.


Back in the day I wrote this about how it improved Solaris kernel zone suspend:

https://web.archive.org/web/20160313033123/https://blogs.ora...


I used it and it was noticeably faster. I didn't write down by how much.


Pigz has been around for a while. Since 2007 if the copyright on this[1] page is any indication.

[1] https://docs.oracle.com/cd/E88353_01/html/E37839/pigz-1.html


I used this recently with -0 (no compression) to pack* billions of files into a tar file before sending them over the network. It worked amazing.


Maybe I’m missing something, but why send the tar generated stream through a non-compressing compressor when you could just send the tar directly?


I didn't have the tar, I created it using:

tar --use-compress-program="pigz -0" ...


I think you are a little confused here.

Tar is a standalone archive format, you can create tar archives with the tar command and not use any compression utility at all. Here is an example that creates an uncompressed tar archive directly with no compression:

    tar -cf $directory.tar $directory
You can also pipe the created archive to stdout instead of to a file if you want to:

    tar -c $directory | wc -c
    # send to a remote system
    tar -c $directory | ssh dd of=~/$directory.tar
You can actually compress the archive by just piping it to a compression command:

    tar -c $directory | zstd --stdout > $directory.tar.zst
The above pipeline is probably very similar to what tar is doing internally when you pass "--use-compress-program".

So in your case using "pigz -0" is totally useless, since tar creates an uncompressed archive by default. You can totally omit the "--use-compress-program" flag to do what you want.


I know about all that. My mistake was thinking that I could take advantage of the multicore nature of pigz even when no compression is being used (I'm not sure if pigz only uses multiple cores for the compression itself, or if it somehow could speed up the packing as well). I'm pretty sure that disk is the bottleneck in most cases, but I wonder if that's not always the case, and a single core could become the bottleneck on very fast IO devices. I want to verify this and test it.


Oh yeah in that case tar itself is still creating the archive with a single thread, the compression gets applied "after" tar creates the archive almost exactly like in the pipeline in my comment.

By "after", I don't mean the whole archive is created before the compression part runs, but rather that tar can stream the the archive to stdout as it's being created. Tar needs to be able to do this since it was intended for tape drives, and seeking around on a tape drive is extremely slow.

It's actually pretty interesting that we still use tar so widely even though tape drives are not in common use! A new archive format that supports random access during reads (and writes maybe?) would be pretty interesting, but tar works pretty well so there isn't that much of a reason to create alternatives.


But if you don't specify the -z flag when using tar, then it won't be compressed. Why type all that out when omitting one flag does the same thing?


Why use tar | pigz -0 when you can just use tar?


I used tar --use-compress-program="pigz" to create the tar out of billions of files


Tar is the archiver here (putting multiple files into one file); pigz with no compression isn't doing anything besides wasting CPU time.


If you're not going to compress at all, you don't need a compressor at all. All you needed was a .tar and not a .tar.gz


But what’s confusing everyone is that tar cf - will create the tar without any external compression program needed.


I could definitely be wrong here, apologies for the confusion. I run many of these tasks automated, in some cases I used low compression, in others zero compression. For low compression, that command really shines, for zero compression, I would have bet I also got improvement over regular tar without compression, but again, I could be wrong here. I'll test it again


Even the “f -“ option is unneeded as the default is to stream to stdout. Though it’s always a bit scary to not explicitly specify the destination in case your finger slips and the first target is itself a writeable file.


Use this all the time (or did when I was doing more sysadminy stuff). Useful in all sorts of backup pipelines


Pretty sure we used or still use pigz when it's time to create a db replica...


For maximum compression, pLzip offers lzma compression in parallel: https://www.nongnu.org/lzip/plzip.html
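A minimal example, with the thread count simply taken from nproc:

    plzip -9 -n "$(nproc)" bigfile    # produces bigfile.lz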


On Linux would it Just Work™ if you aliased pigz to gzip as a drop-in replacement?


In theory, most stuff should work as it's 99% compatible, but there might well be something that breaks. Rather than symlinking it or some such, it's better to configure the necessary tools to use the pigz command instead and then you'll at least find out what works.

FWIW, I configure BackupPC to use pigz instead of gzip without any issues.


I've done that without issue in the past. It's argument compatible so never seemed to be a problem


Well, gzip works on extremely old .Z files too (compress).


Funny, I just read about this yesterday. Time to try it on my pile of archived research data.


I had used pigz for a few years; now I've replaced it with `xz -T0`



