I've been working on an OCaml library to read XLSX files, and something I thought was odd is that all strings in an Excel workbook are listed in a "shared strings" file and then referenced by index. This seemed strange to me, since I would expect the compression algorithm to do this kind of deduplication for you. But thinking about it made me better understand why the indirection is necessary, and also what the advantages and disadvantages of the ZIP and tar + compression formats are.

The main difference between the two formats is that in ZIP, compression is built in and happens independently for every file in the archive, while for tar, compression is a separate step that compresses the entire archive at once.

The advantage of ZIP is that you have random access to the files in the archive without having to decompress the whole thing, but as a side effect, files don't share their compression dictionaries: identical data appearing in two different files gets compressed, and stored, twice.
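To make that concrete, here's a minimal sketch of the ZIP side in OCaml, using the camlzip library's Zip module (the archive and member names are made up for illustration):

    (* Read a single member out of a ZIP archive. The central
       directory at the end of the archive lists every entry, so
       only the requested member is ever decompressed. *)
    let read_one_member archive member =
      let zip = Zip.open_in archive in
      Fun.protect
        ~finally:(fun () -> Zip.close_in zip)
        (fun () ->
          (* find_entry only consults the central directory; the
             rest of the archive is never read. *)
          let entry = Zip.find_entry zip member in
          Zip.read_entry zip entry)

    let () = print_string (read_one_member "archive.zip" "words")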

On the other hand, tar files get deduplication automatically, because gzip and xz see the entire tar file as one continuous stream. Unfortunately, being one continuous compressed stream means that if you want to read the last file in a .tar.gz, you have to read and decompress everything before it.
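For contrast, here's a simplified sketch of pulling one member out of a .tar.gz with camlzip's Gzip module (assuming the bytes-based Gzip.input of recent camlzip versions, and ignoring tar extensions like long file names). Every member before the one we want still has to be decompressed, only to be thrown away:

    (* Fill buf completely from the gzip stream. *)
    let really_input_gzip ic buf =
      let rec go pos =
        if pos < Bytes.length buf then
          match Gzip.input ic buf pos (Bytes.length buf - pos) with
          | 0 -> failwith "unexpected end of archive"
          | n -> go (pos + n)
      in
      go 0

    let find_member archive member =
      let ic = Gzip.open_in archive in
      let header = Bytes.create 512 in
      let rec scan () =
        really_input_gzip ic header;
        (* Name: bytes 0-99, NUL-padded. Size: bytes 124-135, octal. *)
        let name =
          List.hd (String.split_on_char '\000'
                     (Bytes.to_string (Bytes.sub header 0 100)))
        in
        if name = "" then raise Not_found (* end-of-archive block *)
        else begin
          let size =
            int_of_string
              ("0o" ^ String.trim
                 (List.hd (String.split_on_char '\000'
                             (Bytes.to_string (Bytes.sub header 124 12)))))
          in
          if name = member then begin
            let body = Bytes.create size in
            really_input_gzip ic body;
            Bytes.to_string body
          end else begin
            (* Data is padded to whole 512-byte blocks; decompress
               this member's data just to skip over it. *)
            really_input_gzip ic (Bytes.create ((size + 511) / 512 * 512));
            scan ()
          end
        end
      in
      Fun.protect ~finally:(fun () -> Gzip.close_in ic) scan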

One caveat to tar's better compression: it depends on the compression algorithm and the size of the files. For example, gzip can't find duplicates more than 32 KB apart (the size of DEFLATE's sliding window), but xz can (its maximum is something like 768 MB). Depending on how big your archives are, it may be useful to manually sort similar files to be close together (for example, group all the text files together).
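If you control how the tar file is built, the ordering fix can be as simple as this hypothetical helper, which groups files by extension before they're handed to tar:

    (* Group similar files (here: same extension) next to each other
       so duplicated content is more likely to fall within the
       compressor's window. A crude heuristic, not a real API. *)
    let sort_for_compression files =
      List.sort
        (fun a b -> compare (Filename.extension a) (Filename.extension b))
        files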

Update: I haven't had time to run experiments on this, but there are more modern formats like pixz that claim to give you both random access (via an index) and between-file compression.

Experiments

I created several archives using /usr/share/dict/words (a list of ~470,000 English words). Each archive contains one to three copies of the file, and for tar I applied several kinds of compression.
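A script along these lines reproduces the setup (it shells out to the command-line tools and assumes tar, gzip, xz, and zip are on PATH; it also tars the single-copy case, whereas the one-copy numbers in the table are for the bare file, but tar's header overhead is negligible here):

    (* Build the test archives for 1-3 copies of the word list. *)
    let run cmd =
      if Sys.command cmd <> 0 then failwith ("command failed: " ^ cmd)

    let () =
      for n = 1 to 3 do
        let files = List.init n (fun i -> Printf.sprintf "words%d" (i + 1)) in
        List.iter (fun f -> run ("cp /usr/share/dict/words " ^ f)) files;
        let files = String.concat " " files in
        run (Printf.sprintf "tar cf copies%d.tar %s" n files);
        run (Printf.sprintf "gzip -9 -k copies%d.tar" n); (* copies<n>.tar.gz *)
        run (Printf.sprintf "xz -9 -k copies%d.tar" n);   (* copies<n>.tar.xz *)
        run (Printf.sprintf "zip -9 copies%d.zip %s" n files)
      done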

Copies  Format        Size
1       uncompressed  4.8 MB
1       gzip          1.5 MB
1       xz            1.2 MB
1       zip           1.5 MB
2       tar           9.5 MB
2       tar + gzip    2.9 MB
2       tar + xz      1.2 MB
2       zip           2.9 MB
3       tar           15 MB
3       tar + gzip    4.3 MB
3       tar + xz      1.2 MB
3       zip           4.3 MB

For zip and gzip, I used the -9 option to increase compression; I tried the same with xz, but it didn't seem to have any effect.

Takeaways:

  • gzip's dictionary is just too small to deduplicate files of this size, and given that our file isn't particularly large, I suspect tar + gzip is very rarely going to significantly outperform ZIP.
  • xz, on the other hand, notices the duplication and eliminates it completely: a tar file with three copies of our file compresses to almost exactly the same size as the file compressed by itself.
  • ZIP seems to compress about as well as gzip, and given its superior random access, it seems strictly better than tar + gzip.

So going back to Excel, it seems like the reason they chose a more complicated file format is that it gets them the best of both worlds: they deduplicate strings manually, but because the container is ZIP, you can access a single sheet of an Excel workbook without having to read the entire (possibly large) file. Because of some optional padding in the file format, you can also update one piece of an XLSX file without having to rewrite the entire ZIP archive.
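Here's roughly what the shared-strings lookup looks like with camlzip. The regex-based parse_shared_strings below is a naive stand-in for real XML parsing (it ignores rich-text runs and XML escapes), but it shows the key point: only the shared-strings part is ever decompressed.

    (* Naive extraction of the shared-string table: the text between
       <t> and </t> tags, in document order. Real code needs a real
       XML parser. Requires the str library. *)
    let parse_shared_strings xml =
      let re = Str.regexp "<t[^>]*>\\([^<]*\\)</t>" in
      let rec go pos acc =
        match Str.search_forward re xml pos with
        | exception Not_found -> Array.of_list (List.rev acc)
        | i ->
          go (i + String.length (Str.matched_string xml))
            (Str.matched_group 1 xml :: acc)
      in
      go 0 []

    (* Resolve a cell's shared-string index without touching the
       workbook's (possibly huge) worksheet parts. *)
    let shared_string xlsx_path index =
      let zip = Zip.open_in xlsx_path in
      Fun.protect
        ~finally:(fun () -> Zip.close_in zip)
        (fun () ->
          let entry = Zip.find_entry zip "xl/sharedStrings.xml" in
          let table = parse_shared_strings (Zip.read_entry zip entry) in
          table.(index))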

Note: Sorry about the capitalization inconsistency between ZIP and tar, but PKWARE calls the format ZIP, and tar is consistently written in lowercase, even at the start of sentences.