How are zlib, gzip and zip related? (stackoverflow.com)
300 points by damagednoob 11 months ago | 77 comments



What a great historical summary. Compression has moved on now, but having grown up marveling at PKZip, at squeezing usable space out of very early computers, and at compression in modems (v42bis ftw!), I've always found this field magical.

These days it is generally better to prefer Zstandard to zlib/gzip for many reasons. And if you need a seekable format, consider squashfs as a reasonable choice. These stand on the shoulders of the giants zlib and zip, but they do indeed stand much higher in the modern world.
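A rough sketch of both suggestions (directory name is a placeholder; assumes GNU tar 1.31+ for --zstd and squashfs-tools built with zstd support):

    $ tar --zstd -cf somedir.tar.zst somedir           # zstd in place of gzip
    $ mksquashfs somedir somedir.squashfs -comp zstd   # seekable, mountable image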


I had forgotten about modem compression. Back in the BBS days, when you had to upload files to get new files, you usually had a ratio (20 bytes of download for every byte you uploaded). I would always use PKZIP's no-compression option for the archive I uploaded, because Z-Modem would take care of compression over the wire. So I didn't burn my daily time limit uploading a large file, and I got more credit toward my download ratio.

I was a silly kid.


That's really clever and likely would have gone unnoticed by a lot of sysops!


Another download ratio trick was to use a file transfer client like Leech Modem, an XMODEM-compatible client that would, after downloading the final data block, tell the server the file transfer failed so it wouldn’t count against your download limit.

https://en.m.wikipedia.org/wiki/LeechModem


That's awesome! I totally would have used that as a young punk if I knew about it.


That sounds like it can be fooled by making a zip bomb that will compress down to a few KB (by the modem), but will be many MB uncompressed. Sounds great for your ratio, and will upload in a few seconds.


> These days it generally is better to prefer Zstandard to zlib/gzip for many reasons.

I'd agree for new applications, but just like MP3, .gz files (and by extension .tar.gz/.tgz) and zlib streams will probably be around for a long time for compatibility reasons.


I think zlib/gzip still has its place these days. It's still a decent choice for most use cases. If you don't know what usage patterns your program will see, zlib still might be a good choice. Plus, it's supported virtually everywhere, which makes it interesting for long-term storage. Often, using one of the modern alternatives is not worth the hassle.


Fun fact: in a sense, gzip can have multiple files, but not in a specially useful way ...

    $ echo meow > cat
    $ echo woof > dog
    $ gzip cat
    $ gzip dog
    $ cat cat.gz dog.gz > animals.gz
    $ gunzip animals.gz
    $ cat animals
    meow
    woof


> ... but not in a specially useful way ...

It can be very useful: https://github.com/google/crfs#introducing-stargz


It is specially useful, it is not especially/generally useful lol

It could be a typo, though I think when we say something "isn't specially/specifically/particularly useful" we mean "compared to the set of all features, this particular feature is not that useful", not that the feature isn't useful for specific things.


Indeed! I should have written "especially" not "specially"


IMO, all file formats should be concatenable when possible. Thankfully Zstandard purposefully supports this too, which is a huge boon for combining files.

Fun fact: tar files are also (semi-)concatenable; you just need to pass `-i` when extracting. This also means compressed (gz/zstd) tar files are (semi-)concatenable too!
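A minimal sketch of the tar case, with made-up filenames; `-i` (`--ignore-zeros`) tells GNU tar to keep reading past the end-of-archive blocks of the first member:

    $ tar czf a.tar.gz fileA
    $ tar czf b.tar.gz fileB
    $ cat a.tar.gz b.tar.gz > both.tar.gz
    $ tar -xzif both.tar.gz          # extracts both fileA and fileB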


WARC files (used by the Internet Archive to power the Wayback Machine, among others) use this trick too, to get a compressed file format that is seekable down to individual HTTP request/response records.


Wow, that's surprising (at least to me)!

Is there a limit in the default gunzip implementation? I'm aware of the concept of ZIP/tar bombs, but I wouldn't have expected gunzip to ever produce more than one output file, at least when invoked without options.


It only produces one output. It's just a stream of data.


Ah, I somehow imagined a second `cat` in there. That makes more sense, thank you!


The limit is that it doesn't restore per-member filenames or other metadata when extracting; it's limited to the concatenated contents.


Interesting -- I did not realize that the zip format supports lzma, bzip2, and zstd. What software supports those compression methods? Can Windows Explorer read zip files produced with those compression methods?

(I have been using 7zip for about 15 years to produce archive files that have an index and can quickly extract a single file and can use multiple cores for compression, but I would love to have an alternative, if one exists).


7zip has a dropdown called "Compression method" in the "Add to Archive" dialog that lets you choose.
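For the command line, a sketch of the same idea (switch names from memory, so double-check against your 7-Zip and Info-ZIP versions):

    $ 7z a -tzip -mm=BZip2 out.zip somefile    # zip container, bzip2 method
    $ zip -Z bzip2 out.zip somefile            # Info-ZIP, if built with bzip2 support
    $ unzip -v out.zip                         # lists the compression method per entry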


Until Windows 11, no; Windows' built-in zip support only seems to deal with COMPRESS/DEFLATE zip files.


For people reading this for the first time: the sweet part is in the comments :)


What's even more sad is that the SO community has since destroyed SO as the home for this type of info. This post would now be considered off topic as it's "not a good format for a Q&A site". You'd never see it happen today. Truly sad.


Thing is, it could only be that way in its early days, when the vanguard of users came to it by word of mouth, from following Joel Spolsky or Coding Horror or their joint podcast. The audience is much bigger now, and with the long tail of people the number willing to put effort into good questions is too low; on-topicness is a simple quality bar that can improve the signal-to-noise ratio.


They had a voting system. Having mods decide what was and wasn't a 'good' question undermined the whole point of the voting system. Mods should use their powers to filter out hate/spam/trolling/egregiously off-topic issues, not to determine relevance or usefulness. As others have pointed out, SO was a site with great answers but awful for asking questions. This is why ChatGPT is eating SO for breakfast.

Even a question that was super similar to one previously asked had value, precisely because it might be phrased slightly better and be a closer match to what people were Googling.


Except, I doubt anybody would argue that a lower signal to noise ratio has improved the site. (Plus, has the actual metric even improved and how is it measured?) And, did anybody ever stop to ask whether S:N should even be the champion metric in the first place, at a product level? With a philosophy of “Google is our homepage”, I honestly don’t understand why S:N even matters since search pretty effectively cuts out noise. I guess it makes a mod’s life easier though. The site is less useful today than it’s ever been. The road to hell…


Very broadly, I find the quality/value of a given thing is inversely proportional to how many people are involved.

So with regards to the internet: The 90s and early 00s were great, then the internet became mainstream and it all just became Cable TV 2.0.


Relatedly, I have seen a graph showing SO traffic dipping by ~30%, if I'm not wrong (and the corresponding hot takes attributing that to the rise of LLMs).

I know most people are pessimistic that LLMs will lead to SO and the web in general being overrun by hallucinated content and an AI-training-on-AI ouroboros, but I wonder if it might instead allow curious people to query an endlessly patient AI assistant about exactly this kind of information. (A custom GPT, perhaps?)


GPT info tools will fully replace SO in most dev workflows if it hasn’t already.


And what will GPT info tools learn from, once the public curated sources are gone?


Probably the great swaths of documentation out there that for most use cases people need not waste time sifting through if a computer can do it faster...


Isn't it fun, that ChatGPT's success poisoned the well for everyone else? :)


By then, AGI will be ready, right?


A rephrasing of this might be on-topic on retrocomputing: https://retrocomputing.stackexchange.com/q/3083/21450

But almost nobody reads that.


This is somewhat revisionist. They would mark stuff like this as off topic even in the early days.


His Stack Exchange profile is a gold mine itself.

https://stackexchange.com/users/1136690/mark-adler#top-answe...


Hah, imagine asking Mark Adler for gzip history references.


Found this hilarious:

> This post is packed with so much history and information that I feel like some citations need be added

> I am the reference

(extracted a part of the conversation)


Maybe a spoiler, but the "I" in "I am the reference" is Mark Adler:

https://en.wikipedia.org/wiki/Mark_Adler


It's awesome how he is active on Stack Overflow for almost anything DEFLATE-related. I once tried stuffing deflate-compressed vector graphics into PDFs. Among other things, it turns out an Adler-32 checksum is necessary for compliance (though some newer PDF viewers will ignore its absence).
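For reference, it's the zlib wrapper (RFC 1950) that carries the Adler-32 trailer, rather than raw deflate or gzip; one way to produce such a stream from the shell, assuming pigz is installed:

    $ pigz -z drawing.svg     # writes drawing.svg.zz: a 2-byte zlib header,
                              # raw deflate data, then the Adler-32 checksum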


That's disallowed on Wikipedia. There, you must reference some "source". That "source" doesn't need to be reliable or correct; it just needs to be some random website that isn't the actual person. Primary sources are disallowed.


I learned this when I tried correcting the Wikipedia page on Docker. I literally wrote the first prototype. But this wasn't enough of a source for Wikipedia. And to this day the English page is still not truthful (interestingly enough, the French version is closer to the truth).


You could publish a little webpage called "An historical note about the Docker prototype" under your own name, which you could then cite on Wikipedia.

I think it makes perfect sense as a general and strict policy for an encyclopedia. It would simply be too hard to audit every case to check if it's someone like you, or a crank.


Maybe I should write the story as a comment on hacker news, and link to it ;)

Joking aside, I should probably take you up on that advice.


Yes. Do this (make sure it's not a top-level submission) and cite the HN comment specifically. Stupid rules deserve stupid compliance.


Why would a top-level submission not be valid on Wikipedia?


It (presumably) would; the problem is that it could be considered more dignified and academically respectable than a random forum comment.


That may fall afoul of the "reputably published" requirement at https://en.wikipedia.org/wiki/Wikipedia:No_original_research...

Basically, Wikipedia wants a primary source's claims to be vetted by a third party, either a "reputable publisher" or a secondary source.


Doesn't that disqualify just about any personal blog as a source, or any academic preprint (like on Arxiv)? I can't imagine that's widely enforced.


Well, stuff like the ArXiv is exactly what they don't want cited


I don't see how requiring someone to set up a little webpage filters out cranks. If anything I might expect it to favor them.


The idea is that it's a separate, distinct source, which exists outside of and independently from the encyclopedia itself, and can be archived, mirrored, etc. Its veracity and usefulness can then be debated or discussed as needed.


You'd have to publish under another name. A reference to the author's own blog is also disallowed.


And that's for good reason. Encyclopedias are supposed to be tertiary sources, not primary sources. Having an explicit cited reference makes it easier to judge the veracity of a statement compared to digging through the page history to figure out if a line was added by a person who happens to be an expert.


> And that's for good reason. Encyclopedias are supposed to be tertiary sources, not primary sources. Having an explicit cited reference makes it easier to judge the veracity of a statement compared to digging through the page history to figure out if a line was added by a person who happens to be an expert.

But why is a reference to "[1] Blog post by XXX" (or, even worse, "[1] Blog post by YYY based on their tentative understanding of XXX") a more authoritative source than "[1] Added to Wikipedia personally by XXX"? Of course, Wikipedia potentially has no proof that the editor was actually XXX in the latter case; but they have even less proof that a blog post purporting to be by XXX actually is.


> Wikipedia potentially has no proof that the editor was actually XXX in the latter case; but they have even less proof that a blog post purporting to be by XXX actually is.

Wikipedia is not an authoritative identity layer, it provides no proof of identity and is thus strictly weaker than any other proof you can come up with. If you don't trust any arbitrary website that Wikipedia cites, then you have no more reason to trust any arbitrary Wikipedia editor.

As for what tertiary sources are and why they prefer not to cite primary sources in the first place, Wikipedia goes over this in their own guidelines: https://en.m.wikipedia.org/wiki/Wikipedia:No_original_resear...


> Wikipedia is not an authoritative identity layer, it provides no proof of identity and is thus strictly weaker than any other proof you can come up with. If you don't trust any arbitrary website that Wikipedia cites, then you have no more reason to trust any arbitrary Wikipedia editor.

It's the "strictly" with which I take issue. I certainly have no more reason to trust that a Wikipedia user whose username is BigImportantDude is actually a particular Big Important Dude than to trust the analogous fact about a blog post purporting to be authored by the Big Important Dude; but I dispute the fact that I should trust it less.


And then there's impostors, which people who denigrate sourcing rules never seem to even think of.


Reminds me of when I was inadvertently arguing here on HN with the inventor of the actor model about what actors are


That sounds like something I’d do too. If that makes you feel better.


Oh, I remember that conversation, it was fun.


"I'm the one who knocks".


"I am the hype."


Is there an archive format that supports appending diffs of an existing file, so that multiple versions of the same file are stored? PKZIP has a proprietary extension (supposedly), but I couldn't find any open equivalent.

(I was thinking of creating a version control system whose .git-directory equivalent is basically an archive file that can easily be emailed, etc.)


New versions of zstd allow you to produce patches using the trained dictionary feature
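If that's zstd's --patch-from mode, a rough sketch with made-up filenames (needs a fairly recent zstd; very large files may also need matching --long options on both sides):

    $ zstd --patch-from=old.bin new.bin -o new.patch.zst               # diff old -> new
    $ zstd -d --patch-from=old.bin new.patch.zst -o new.restored.bin   # reapply against old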



The real question is: how are zlib and libz related?


zlib is the name of the project. libz is an implementation-detail name of the library on Unix-like systems.
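For example, on a typical Linux system (paths will vary):

    $ cc demo.c -lz            # -lz resolves to libz.so / libz.a, i.e. zlib
    $ ldconfig -p | grep libz  # usually shows something like /lib/.../libz.so.1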


Similar: Xlib and libX11.


There's also pzip/punzip (https://github.com/ybirader) for those wanting more performant (concurrent) zip/unzip.

Disclaimer: I'm the author.


See a highly upvoted answer in a question about zlib related things, suspect it was probably posted by Mark Adler, and turn out to be correct.


The answer is good, but is missing a key section:

Salty form: They're all quite slow compared to modern competitors.


What are some of those modern competitors?


For zlib-compatible workloads, there are the Cloudflare patches, the Chromium and Intel forks, and zlib-ng, which are compatible but >50% faster. (I think the Cloudflare patches eventually made it into upstream zlib, but you may not see that in your distro for a decade.)

lz4 and zstd have both been very popular since their release; they're similar and by the same author, though zstd has had more thorough testing and fuzzing and is more featureful. lz4 maintains an extremely fast decompression speed.

Snappy also performs very well; when tuned to comparable compression levels, zstd and Snappy have very close performance.

In recent years zstd has made heavy inroads into broader OSS usage, with a number of distro package managers moving to it and seeing substantial benefits. There is an HTTP content-encoding extension to make it available, which Chrome originally resisted, but I believe it's now finally coming there too (https://chromestatus.com/feature/6186023867908096).

In gaming circles there's also Oodle and friends from RAD Game Tools, which are now available in Unreal Engine as built-in compression offerings (since 4.27). You could see the effects of this in, for example, Ark Survival Evolved (250GB) -> Ark Survival Ascended (75GB, with richer models and textures), and the associated improved load times.


zstd is over 4x faster than zlib, while having a better compression ratio.

http://facebook.github.io/zstd/
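Easy to check on your own data; zstd has a built-in benchmark mode (a sketch, with a placeholder filename):

    $ zstd -b1 -e19 somefile        # benchmark zstd levels 1 through 19
    $ time gzip -9 -kf somefile     # rough comparison against gzip's best level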


gzip can be used to (de)compress a directory tree into a shell variable:

    FOO=$(tar cf - folderToCompress | gzip | base64)

    echo "$FOO" | base64 -d | zcat | tar xf -


(2013)



