7 years ago (!) I wrote a post comparing ZIP and tar plus compression (gz or xz), and concluded that ZIP is ideal if you need to quickly access individual files in the compressed archive, while tar + compression (like tar + xz) is ideal if you need maximum compression. Since then, I've discovered pixz, which seems to provide the best of both worlds: near-maximum compression with an index for quick seeking.
Below I compare compressing 1000 copies of a dictionary file with tar, tar + xz, pixz, and ZIP, and show that pixz achieves similar speed and compression to tar + xz while extracting individual files at similar speed to ZIP.
Note that even with indexed compression, extracting files at the end of the archive is fast, but editing the archive is still very slow, so ZIP still wins at random writes.
Format | Size | Compress time | Time to extract last file
---|---|---|---
tar | 897 MB | 2 sec | 0.01 sec
tar + xz | 7.3 MB | 37 sec | 0.4 sec
pixz | 11 MB | 38 sec | 0.04 sec
ZIP | 245 MB | 119 sec | 0.02 sec
Why do we need indexed compression?
When you compress with tar and then xz, you first add all of the files to an archive (tar) and then compress that archive as a single stream (xz). When decompressing, you can't jump around: to decompress any part of the stream, you have to decompress everything before it. This means that to extract the last file in a tar + xz archive, you need to decompress the entire thing.
ZIP handles this by compressing each file individually and then archiving it, but that comes at the expense of much worse compression. See [my previous post][1].
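The gap is easy to reproduce in miniature: compressing ten copies of the same data as one stream lets the compressor exploit the redundancy across copies, while compressing each copy separately (as ZIP does) cannot. A rough sketch using gzip (exact sizes will vary; the ratio is the point):

```shell
# Single-stream vs. per-file compression of redundant data.
set -e
seq 1 2000 > w.txt                                # stand-in "dictionary"
for i in 1 2 3 4 5 6 7 8 9 10; do cat w.txt; done > all.txt

# One stream over all copies (the tar + xz approach): later copies are
# nearly free because the compressor has already seen the earlier ones.
one_stream=$(gzip -c all.txt | wc -c)

# Each copy compressed on its own (the ZIP approach): roughly 10x.
per_file=$(for i in 1 2 3 4 5 6 7 8 9 10; do gzip -c w.txt; done | wc -c)

echo "one stream: $one_stream bytes, per file: $per_file bytes"
```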
Indexed compression handles this by compressing the archive as a single file, but also saving an index of what information you need to resume compression for each file in the archive. This means that if you want to read a file at the end of the archive, you can use the index to know how to jump to it without decompressing everything else.
Tedious details
Feel free to skip this section. I'm just documenting what I did exactly to make this reproducible.
To compare ZIP vs. pixz, I took the contents of this word list and added 1000 copies of it to a ZIP file and a tar file:
$ time ./zip-test.sh 1000
real 1m59.802s
user 0m50.797s
sys 1m8.758s
See zip-test.sh
I add each file to the ZIP individually, so it's possible the slow speed here is my fault for not using a more efficient method.
$ time ./tar-test.sh 1000
real 0m1.764s
user 0m0.212s
sys 0m1.494s
See tar-test.sh
Note that tar is much faster because it's not compressing yet.
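Similarly, a minimal sketch of what tar-test.sh plausibly does (again my reconstruction; tar's -r flag appends entries to an existing archive):

```shell
# Hypothetical reconstruction of tar-test.sh: append n renamed copies
# of a word list to a tar archive. No compression happens here.
set -e
n=3                                        # the post uses 1000
printf 'apple\nbanana\ncherry\n' > words   # stand-in for the word list
i=1
while [ "$i" -le "$n" ]; do
  cp words "words_copy_$i"
  tar -rf "words-$n.tar" "./words_copy_$i" # -r appends to the archive
  rm "words_copy_$i"
  i=$((i + 1))
done
```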
Then I compressed the tar file with both xz and pixz (judging by the user vs. real times below, both ran multi-threaded on my machine):
$ time xz --keep words-1000.tar
real 0m37.453s
user 11m11.499s
sys 0m2.017s
$ time pixz -k words-1000.tar
real 0m38.306s
user 11m27.567s
sys 0m2.482s
The sizes look like this:
897M words-1000.tar
7.3M words-1000.tar.xz
11M words-1000.tpxz
245M words-1000.zip
You can see that pixz compresses somewhat worse than xz, likely because it splits the stream into independently compressed blocks and stores the index, but it still does much better than ZIP.
Now the really interesting thing with pixz is that we should be able to access a random file extremely quickly. So we're going to test extracting the last file from each archive.
$ time unzip words-1000.zip words_1000
Archive: words-1000.zip
inflating: words_1000
real 0m0.016s
user 0m0.013s
sys 0m0.003s
$ time tar -xf words-1000.tar ./words_copy_1000
real 0m0.009s
user 0m0.000s
sys 0m0.009s
$ time tar -xf words-1000.tar.xz ./words_copy_1000
real 0m0.396s
user 0m2.039s
sys 0m1.202s
pixz -x writes a tar stream of the matched files to stdout, so we pipe it through tar to actually extract:
$ time pixz -x ./words_copy_1000 < words-1000.tpxz | tar -x
real 0m0.043s
user 0m0.019s
sys 0m0.017s