I've been working on an OCaml library to read XLSX files, and something I thought was odd is that all strings in an Excel workbook are listed in a "shared strings" file and then referenced by index. This seemed strange to me, since I would expect the compression algorithm to do this kind of deduplication for you. But thinking about it made me better understand why the indirection is necessary, and also what the advantages and disadvantages of the ZIP and tar + compression formats are.

The main difference between the two formats is that in ZIP, compression is built in and happens independently for every file in the archive, while for tar, compression is a separate step that compresses the entire archive at once.

The advantage of ZIP is that you have random access to the files in the archive without having to decompress the whole thing, but as a side effect, files don't share their compression dictionaries: identical data appearing in two different files gets compressed, and stored, twice.
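To make that concrete, here's a minimal sketch of the ZIP side in OCaml, using the camlzip library's Zip module (the archive and member names are made up for illustration):

    (* Read a single member out of a ZIP archive. The central
       directory at the end of the archive lists every entry, so
       only the requested member is ever decompressed. *)
    let read_one_member archive member =
      let zip = Zip.open_in archive in
      Fun.protect
        ~finally:(fun () -> Zip.close_in zip)
        (fun () ->
          (* find_entry only consults the central directory; the
             rest of the archive is never read. *)
          let entry = Zip.find_entry zip member in
          Zip.read_entry zip entry)

    let () = print_string (read_one_member "archive.zip" "words")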

On the other hand, tar files get deduplication automatically, because gzip and xz see the entire tar file as one continuous stream. Unfortunately, being one continuous compressed stream means that if you want to read the last file in a .tar.gz, you have to read and decompress everything before it.
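For contrast, here's a simplified sketch of pulling one member out of a .tar.gz with camlzip's Gzip module (assuming the bytes-based Gzip.input of recent camlzip versions, and ignoring tar extensions like long file names). Every member before the one we want still has to be decompressed, only to be thrown away:

    (* Fill buf completely from the gzip stream. *)
    let really_input_gzip ic buf =
      let rec go pos =
        if pos < Bytes.length buf then
          match Gzip.input ic buf pos (Bytes.length buf - pos) with
          | 0 -> failwith "unexpected end of archive"
          | n -> go (pos + n)
      in
      go 0

    let find_member archive member =
      let ic = Gzip.open_in archive in
      let header = Bytes.create 512 in
      let rec scan () =
        really_input_gzip ic header;
        (* Name: bytes 0-99, NUL-padded. Size: bytes 124-135, octal. *)
        let name =
          List.hd (String.split_on_char '\000'
                     (Bytes.to_string (Bytes.sub header 0 100)))
        in
        if name = "" then raise Not_found (* end-of-archive block *)
        else begin
          let size =
            int_of_string
              ("0o" ^ String.trim
                 (List.hd (String.split_on_char '\000'
                             (Bytes.to_string (Bytes.sub header 124 12)))))
          in
          if name = member then begin
            let body = Bytes.create size in
            really_input_gzip ic body;
            Bytes.to_string body
          end else begin
            (* Data is padded to whole 512-byte blocks; decompress
               this member's data just to skip over it. *)
            really_input_gzip ic (Bytes.create ((size + 511) / 512 * 512));
            scan ()
          end
        end
      in
      Fun.protect ~finally:(fun () -> Gzip.close_in ic) scan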

One caveat to tar's better compression: it depends on the compression algorithm and the size of the files. For example, gzip can't find duplicates more than 32 KB apart (the size of DEFLATE's sliding window), but xz can (its maximum is something like 768 MB). Depending on how big your archives are, it may be useful to manually sort similar files to be close together (for example, group all the text files together).
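If you control how the tar file is built, the ordering fix can be as simple as this hypothetical helper, which groups files by extension before they're handed to tar:

    (* Group similar files (here: same extension) next to each other
       so duplicated content is more likely to fall within the
       compressor's window. A crude heuristic, not a real API. *)
    let sort_for_compression files =
      List.sort
        (fun a b -> compare (Filename.extension a) (Filename.extension b))
        files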

Update: I haven't had time to run experiments on this, but there are more modern formats like pixz that claim to give you both random access (via an index) and between-file compression.

Experiments

I created several archives using /usr/share/dict/words (a list of ~470,000 English words). Each archive contains one to three copies of the file, and for tar I applied several kinds of compression.
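A script along these lines reproduces the setup (it shells out to the command-line tools and assumes tar, gzip, xz, and zip are on PATH; it also tars the single-copy case, whereas the one-copy numbers in the table are for the bare file, but tar's header overhead is negligible here):

    (* Build the test archives for 1-3 copies of the word list. *)
    let run cmd =
      if Sys.command cmd <> 0 then failwith ("command failed: " ^ cmd)

    let () =
      for n = 1 to 3 do
        let files = List.init n (fun i -> Printf.sprintf "words%d" (i + 1)) in
        List.iter (fun f -> run ("cp /usr/share/dict/words " ^ f)) files;
        let files = String.concat " " files in
        run (Printf.sprintf "tar cf copies%d.tar %s" n files);
        run (Printf.sprintf "gzip -9 -k copies%d.tar" n); (* copies<n>.tar.gz *)
        run (Printf.sprintf "xz -9 -k copies%d.tar" n);   (* copies<n>.tar.xz *)
        run (Printf.sprintf "zip -9 copies%d.zip %s" n files)
      done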

Copies  Format        Size
1       uncompressed  4.8 MB
1       gzip          1.5 MB
1       xz            1.2 MB
1       zip           1.5 MB
2       tar           9.5 MB
2       tar + gzip    2.9 MB
2       tar + xz      1.2 MB
2       zip           2.9 MB
3       tar           15 MB
3       tar + gzip    4.3 MB
3       tar + xz      1.2 MB
3       zip           4.3 MB

For zip and gzip, I used the -9 option to increase compression; I tried the same with xz, but it didn't seem to have any effect.

Takeaways:

  • gzip's dictionary is just too small to deduplicate files of this size, and given that our file isn't particularly large, I suspect tar + gzip is very rarely going to significantly outperform ZIP.
  • xz, on the other hand, notices the duplication and eliminates it completely: a tar file with three copies of our file compresses to almost exactly the same size as the file compressed by itself.
  • ZIP seems to compress about as well as gzip, and given its superior random access, it seems strictly better than tar + gzip.

So going back to Excel, it seems like the reason they chose a more complicated file format is that it gets them the best of both worlds: they deduplicate strings manually, but because the container is ZIP, you can access a single sheet of an Excel workbook without having to read the entire (possibly large) file. Because of some optional padding in the file format, you can also update one piece of an XLSX file without having to rewrite the entire ZIP archive.
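Here's roughly what the shared-strings lookup looks like with camlzip. The regex-based parse_shared_strings below is a naive stand-in for real XML parsing (it ignores rich-text runs and XML escapes), but it shows the key point: only the shared-strings part is ever decompressed.

    (* Naive extraction of the shared-string table: the text between
       <t> and </t> tags, in document order. Real code needs a real
       XML parser. Requires the str library. *)
    let parse_shared_strings xml =
      let re = Str.regexp "<t[^>]*>\\([^<]*\\)</t>" in
      let rec go pos acc =
        match Str.search_forward re xml pos with
        | exception Not_found -> Array.of_list (List.rev acc)
        | i ->
          go (i + String.length (Str.matched_string xml))
            (Str.matched_group 1 xml :: acc)
      in
      go 0 []

    (* Resolve a cell's shared-string index without touching the
       workbook's (possibly huge) worksheet parts. *)
    let shared_string xlsx_path index =
      let zip = Zip.open_in xlsx_path in
      Fun.protect
        ~finally:(fun () -> Zip.close_in zip)
        (fun () ->
          let entry = Zip.find_entry zip "xl/sharedStrings.xml" in
          let table = parse_shared_strings (Zip.read_entry zip entry) in
          table.(index))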

Note: Sorry about the capitalization inconsistency between ZIP and tar, but PKWARE calls the format ZIP, and tar is consistently written in lowercase, even at the start of sentences.