7 years ago (!) I wrote a post comparing ZIP and tar plus compression (gz or xz), and concluded that ZIP is ideal if you need to quickly access individual files in the compressed archive, while tar + compression (like tar + xz) is ideal if you need maximum compression. Since then, I've discovered pixz, which seems to provide the best of both worlds: near-maximum compression with an index for quick seeking.
Below I compare compressing 1000 copies of a dictionary file with tar, tar + xz, pixz, and ZIP, and show that pixz achieves similar speed and compression to tar + xz while extracting individual files at similar speed to ZIP.
Note that even with indexed compression, extracting files at the end of the archive is fast, but editing the archive is still very slow, so ZIP still wins at random writes.
Format | Size | Compress time | Time to extract last file
---|---|---|---
tar | 897 MB | 2 sec | 0.01 sec
tar + xz | 7.3 MB | 37 sec | 0.4 sec
pixz | 11 MB | 38 sec | 0.04 sec
ZIP | 245 MB | 119 sec | 0.02 sec
Why do we need indexed compression?
When you compress with tar and then xz, you first add all of the files to an archive (tar) and then compress that archive as a single stream (xz). When decompressing, you can't jump around: to decompress any part of the stream, you have to decompress everything before it. This means that to extract the last file in a tar + xz archive, you need to decompress the entire thing.
ZIP handles this by compressing each file individually and then archiving it, but that comes at the expense of much worse compression. See [my previous post][1].
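The gap is easy to reproduce in miniature: compressing ten copies of the same data as one stream lets the compressor exploit the redundancy across copies, while compressing each copy separately (as ZIP does) cannot. A rough sketch using gzip (exact sizes will vary; the ratio is the point):

```shell
# Single-stream vs. per-file compression of redundant data.
set -e
seq 1 2000 > w.txt                                # stand-in "dictionary"
for i in 1 2 3 4 5 6 7 8 9 10; do cat w.txt; done > all.txt

# One stream over all copies (the tar + xz approach): later copies are
# nearly free because the compressor has already seen the earlier ones.
one_stream=$(gzip -c all.txt | wc -c)

# Each copy compressed on its own (the ZIP approach): roughly 10x.
per_file=$(for i in 1 2 3 4 5 6 7 8 9 10; do gzip -c w.txt; done | wc -c)

echo "one stream: $one_stream bytes, per file: $per_file bytes"
```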
Indexed compression handles this by compressing the archive as a single file, but also saving an index of what information you need to resume compression for each file in the archive. This means that if you want to read a file at the end of the archive, you can use the index to know how to jump to it without decompressing everything else.
Tedious details
Feel free to skip this section. I'm just documenting what I did exactly to make this reproducible.
To compare ZIP vs. pixz, I took the contents of this word list and added 1000 copies of it to a ZIP file and a tar file:
$ time ./zip-test.sh 1000
real 1m59.802s
user 0m50.797s
sys 1m8.758s
See zip-test.sh
I add each file to the ZIP individually, so it's possible the slow speed here is my fault for not using a more efficient method.
$ time ./tar-test.sh 1000
real 0m1.764s
user 0m0.212s
sys 0m1.494s
See tar-test.sh
Note that tar is much faster because it's not compressing yet.
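Similarly, a minimal sketch of what tar-test.sh plausibly does (again my reconstruction; tar's -r flag appends entries to an existing archive):

```shell
# Hypothetical reconstruction of tar-test.sh: append n renamed copies
# of a word list to a tar archive. No compression happens here.
set -e
n=3                                        # the post uses 1000
printf 'apple\nbanana\ncherry\n' > words   # stand-in for the word list
i=1
while [ "$i" -le "$n" ]; do
  cp words "words_copy_$i"
  tar -rf "words-$n.tar" "./words_copy_$i" # -r appends to the archive
  rm "words_copy_$i"
  i=$((i + 1))
done
```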
Then I compressed the tar file with both xz and pixz (judging by the user vs. real times below, both ran multi-threaded on my machine):
$ time xz --keep words-1000.tar
real 0m37.453s
user 11m11.499s
sys 0m2.017s
$ time pixz -k words-1000.tar
real 0m38.306s
user 11m27.567s
sys 0m2.482s
The sizes look like this:
897M words-1000.tar
7.3M words-1000.tar.xz
11M words-1000.tpxz
245M words-1000.zip
You can see that pixz compresses somewhat worse than xz, likely because it splits the stream into independently compressed blocks and stores the index, but it still does much better than ZIP.
Now the really interesting thing with pixz is that we should be able to access a random file extremely quickly. So we're going to test extracting the last file from each archive.
$ time unzip words-1000.zip words_1000
Archive: words-1000.zip
inflating: words_1000
real 0m0.016s
user 0m0.013s
sys 0m0.003s
$ time tar -xf words-1000.tar ./words_copy_1000
real 0m0.009s
user 0m0.000s
sys 0m0.009s
$ time tar -xf words-1000.tar.xz ./words_copy_1000
real 0m0.396s
user 0m2.039s
sys 0m1.202s
pixz -x writes a tar stream of the matched files to stdout, so we pipe it through tar to actually extract:
$ time pixz -x ./words_copy_1000 < words-1000.tpxz | tar -x
real 0m0.043s
user 0m0.019s
sys 0m0.017s