Why not tar? Limitations of the tar file format (nongnu.org)
79 points by gnosis on Sept 6, 2010 | 31 comments



The first two issues -- a lack of index and the fact that you can't seek within a deflated tarball -- are true but are easily handled by smarter compression. Tarsnap, for example, splits off archive headers and stores them separately in order to speed up archive scanning.

The third issue -- lack of support for modern filesystem features -- is just plain wrong. Sure, the tar in 7th edition UNIX didn't support these, but modern tars support modern filesystem features.

The fourth issue -- general cruft -- is correct but irrelevant on modern tars since the problems caused by the cruft are eliminated via pax extension headers.
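
For reference, the pax format is trivially reachable from modern tools; a minimal sketch with Python's tarfile (the archive and file names here are made up):

    import tarfile

    # PAX_FORMAT stores long names, large sizes, sub-second timestamps, etc.
    # in pax extension headers instead of the old GNU-specific hacks.
    with tarfile.open("example.tar", "w", format=tarfile.PAX_FORMAT) as tf:
        tf.add("some_file_with_a_rather_long_name.txt")

The GNU tar equivalent is `tar --format=pax -cf example.tar ...`.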


"Other archive formats like WinZip..."

The guy immediately loses credibility in my eyes for referring to the most popular archive format as 'WinZip'. It's the ZIP file format, designed by Phil Katz of PKWare Inc.

http://en.wikipedia.org/wiki/ZIP_(file_format)

To add injury to insult, the rest of his proposal is pretty similar to ZIP, which also accomplishes the nice-to-have things he mentions at the end.


What this article describes has already been solved with zip, gzip, 7z, bzip and forks of tar.

The problem is that at the moment there is no open standard (there are IETF proposals) since each of these is either patent, copyright or trademark encumbered.


It's very difficult to talk about 'tar' per se. Do you mean:

* GNU tar?

* BSD tar?

* Solaris tar?

Or even Schilly's 'star' program?

Each of these has different limits, advantages, and disadvantages.


This detail is wrong: the tar that ships on Mac OS X does indeed support resource forks.


> Because tar does not support encryption/compression on the inside of archives.

Yes it does? Just encrypt/compress all the files before tarring.
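
In case it's not obvious how that looks, a rough sketch with Python's gzip and tarfile modules (the file names are hypothetical):

    import gzip
    import shutil
    import tarfile

    files = ["notes.txt", "report.txt"]  # hypothetical inputs

    # Compress each file individually first...
    for name in files:
        with open(name, "rb") as src, gzip.open(name + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

    # ...then archive the already-compressed copies.
    with tarfile.open("backup.tar", "w") as tf:
        for name in files:
            tf.add(name + ".gz")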

> Not indexed

The reason tar doesn't have an index is so that tarballs can be concatenated. Also IIRC, you only have to jump through the headers for all files. Still O(n) where n is the number of files, but you don't have to scan through all of the data.
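
For an uncompressed tarball, "jumping through the headers" really is just one seek per file; a rough sketch (plain ustar only, no pax/GNU long-name handling, path made up):

    import os

    BLOCK = 512

    def list_tar(path):
        """List an uncompressed tar by reading headers and seeking past data."""
        with open(path, "rb") as f:
            while True:
                header = f.read(BLOCK)
                if len(header) < BLOCK or header == b"\0" * BLOCK:
                    break  # end-of-archive marker (or truncated file)
                name = header[0:100].rstrip(b"\0").decode()
                size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
                print(name, size)
                # Skip the data, rounded up to whole 512-byte blocks.
                f.seek((size + BLOCK - 1) // BLOCK * BLOCK, os.SEEK_CUR)

    list_tar("backup.tar")  # hypothetical archive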


> The reason tar doesn't have an index is so that tarballs can be concatenated.

I'm curious, what's the use-case for this? Offhand, the only use for that ability I can think of is if I forgot a file in a tarball and have already deleted the originals; I can tar the missing file and cat the two tarballs.


Don't think files, think tapes. Tar stands for Tape ARchive and was originally primarily used for backing up to tapes. When working with tapes, where deleting and re-writing archives is basically impossible, concatenating an archive onto the end of an already backed-up archive to create a new, updated archive is very useful.
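
The trick still works with current tools; a sketch of reading two tarballs catted together (file names made up), using Python's tarfile equivalent of GNU tar's --ignore-zeros:

    import tarfile

    # e.g. after `cat a.tar b.tar > both.tar` (or `tar -Af a.tar b.tar`).
    # ignore_zeros=True skips the end-of-archive marker of the first tarball
    # instead of stopping there.
    with tarfile.open("both.tar", ignore_zeros=True) as tf:
        for member in tf.getmembers():
            print(member.name)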


Ah, I see. Yes, that does sound very useful.


Compressing before tarring is a really dumb idea and you will get terrible compression ratios: you cannot exploit data patterns across files. It could work if you could ask gzip to write some sort of global table...
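
Easy to check for yourself; a rough sketch comparing per-file gzip against gzipping one big tar of the same (hypothetical) files:

    import glob
    import gzip
    import io
    import tarfile

    files = glob.glob("*.c")  # hypothetical: many small, similar source files

    # Per-file compression: every gzip stream starts with an empty dictionary,
    # so redundancy shared *between* files is never exploited.
    per_file = sum(len(gzip.compress(open(f, "rb").read())) for f in files)

    # tar-then-gzip: one continuous stream, so cross-file redundancy helps.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for f in files:
            tf.add(f)
    whole = len(buf.getvalue())

    print("per-file gzip:", per_file)
    print("tar then gzip:", whole)

(Tar adds 512-byte headers and padding per file, so the comparison only gets interesting once the files actually share content.)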


You also make a bad block potentially affect every file following it. Whereas if you compress before tarring, you can find the next file boundary and recover the rest.


I think raising these concerns is fair in a world where nearly all Unix-related source code and binaries are distributed in (g/bzipped) TAR format. Unfortunately, the author does not really explain why this is and what is wrong with ZIP (e.g. why a new format is needed).

I guess that one of the reasons for TAR's dominance is the lack of a free alternative? Apparently ZIP is not free enough (as I understand from http://en.wikipedia.org/wiki/ZIP_(file_format)#Standardizati...).

TAR is old however, and if ZIP cannot take its place, coming up with something new is not such a bad idea. I think Apple's DMG/UDIF file format deserves to be mentioned as well: it addresses all the concerns mentioned (it is essentially a mountable filesystem). I'm pretty sure there is a lot to be learned from that.


Xar addresses a lot of the issues presented in this article.

http://code.google.com/p/xar/wiki/xarformat and http://code.google.com/p/xar/wiki/whyxar

But not with the nice descriptive graphics found in the new archive format proposal.


"... Because tar does not support encryption/compression on the inside of archives ..."

That can be an advantage. Saving space isn't always what I want from backups - I want the original data back, and compression gone wrong (tar -zxvf) is just another way to lose data.


That is exactly why the lack of in-archive compression is bad: with a (gzipped) tar you lose the whole rest of the archive on a single bit error, while with in-archive compression you lose just the file the error is located in.


Also tar's behavior gives you better compression in the common (at least on Unix) case of many small text files with related content.

In other words the trade-off is exactly the opposite of what had been guessed above.


tar never cared about encryption or compression because mostly you would get them "for free" in your tape library anyway.

http://en.wikipedia.org/wiki/Linear_Tape-Open#Compression


The pkzip format allows you to "zip" data uncompressed if you are worried about that. Then you can trivially unpack your files using nothing but seek and read for those cases where you also accidentally misplace your last copy of unzip.
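
That's the ZIP_STORED method; a minimal sketch with Python's zipfile (file names made up):

    import zipfile

    # ZIP_STORED writes the data verbatim; only the local header and the
    # central directory entry are added around it.
    with zipfile.ZipFile("backup.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        zf.write("important.dat")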


It also has a file description (the local header) before each file's data and a second copy of it (the central directory) at the end of the whole archive, allowing fast listing of the contents or locating a file inside the archive, exactly as the article would want.

The only thing pkzip doesn't cover in the original format is Unix/Linux-specific metadata, but maybe this was/can be added. I use Info-ZIP when the metadata doesn't matter but tar when it does (though even tar has its limitations when working with Unix/Linux metadata).
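
The fast listing falls out of reading only the central directory at the end of the file; a rough sketch with Python's zipfile (archive name made up):

    import zipfile

    # Opening for reading parses just the central directory at the end of the
    # file, so listing is cheap even for huge archives.
    with zipfile.ZipFile("big-archive.zip") as zf:
        for info in zf.infolist():
            print(info.filename, info.file_size, info.compress_size,
                  "at offset", info.header_offset)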


pkzip does reserve the possibility of an arbitrary-length extra field attached to each file. According to the spec (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) this is for "additional information...for special needs or for specific platforms". All compatible zip tools are required to ignore any information in this field that they don't understand, so you can basically write whatever you want there (although the spec does offer a recommended format for writing to this field). So if you write a special ACL-preserving zip implementation, you can still unpack the file with any other zip implementation that knows nothing of your special version.
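
A sketch of stuffing custom data into that extra field with Python's zipfile; the 0x6d65 header ID and the payload are made up, not registered values:

    import struct
    import zipfile

    payload = b"fake ACL blob"  # hypothetical platform-specific metadata
    # Each extra-field entry is: 2-byte header ID, 2-byte length, then data.
    extra = struct.pack("<HH", 0x6d65, len(payload)) + payload

    with zipfile.ZipFile("meta.zip", "w") as zf:
        info = zipfile.ZipInfo("some_file.txt")
        info.extra = extra
        zf.writestr(info, b"file contents")

    # Readers that don't recognise the header ID simply skip the entry.
    with zipfile.ZipFile("meta.zip") as zf:
        print(zf.infolist()[0].extra)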


"... The pkzip format allows you to "zip" data uncompressed if you are worried about that. ..."

Didn't know that & must check the docs again.


Some Zip applications show uncompressed files as being compressed with the "Store" algorithm instead of "Deflate".


> That can be an advantage. Space isn't always what I want for backups

Most if not all "compression" formats (and software) offer a "store" compressor which stores the data as-is, without applying any compression filter.


Does anyone know how duplicity compares to XAR? DAR? Or CFS or 7z?


As long as they don't make me use cpio, I'm fine.


We should go back to that embedded shell script thingy that was common back in the day. Its name escapes me.



I was just imagining if we'd all settled on using shar files how big the virus/worm problem would be on Linux today ...

... Then I thought that effectively that's what Windows does (using *.exe for installers). No wonder they've got a problem.


Well, the existing Linux package managers aren't really safer as far as the archive formats go; for example, .debs can run arbitrary shell scripts during installation. The main thing that seems to add to the safety is the social practice of grabbing debs via trusted repositories using apt-get/aptitude/synaptic, rather than manually downloading them from random sites and doing dpkg -i. But if there is malware, it's even worse, because at least these shar installers are usually installed as non-root, while installing a .deb needs root.


Commercial games with Linux versions often still use this (or a variant). Not too sure why; perhaps because it's the closest Linux equivalent to the self-extracting installer archives they use on Windows?


When dealing with users who may not be completely familiar with package management, and creating a single cross-platform package, it can be very helpful to prevent the data and its installer from being separated. It's a very simple way to bundle some logic with an archive as well, as the script can be modified after it is generated.
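
A toy version of the idea in Python rather than shell, just to show the shape of it (the "docs" directory and file names are hypothetical): a stub script with the archive appended after a marker.

    import base64
    import io
    import tarfile

    STUB = '''\
    import base64, io, sys, tarfile
    # Everything after the marker line below is the base64-encoded payload.
    with open(sys.argv[0], "rb") as f:
        payload = f.read().split(b"#== PAYLOAD ==\\n", 1)[1]
    tarfile.open(fileobj=io.BytesIO(base64.b64decode(payload))).extractall("payload")
    print("extracted to ./payload/")
    #== PAYLOAD ==
    '''

    # Build a tar.gz in memory and append it, base64-encoded, to the stub.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        tf.add("docs")  # hypothetical directory to bundle
    with open("installer.py", "w") as out:
        out.write(STUB + base64.encodebytes(buf.getvalue()).decode())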



