Poor man's parallel gzip
dev/sh
There are several tools that use multiple CPU cores to accelerate the compression of large files. To name a few: pigz, zstd, xz.
Here I suggest a minimalist approach to parallel compression.
The core idea:
- Split a file into chunks.
- Compress the chunks in parallel.
- Bundle the results in a tarball.
This approach is remarkably easy to implement and surprisingly effective for many use cases, especially when you just want “good enough” speed without extra software.
Here are the hackable shell functions that implement compression and decompression:
# Write a tarball containing gzipped chunks to stdout.
# Note: redirect stdin from a regular file rather than piping through cat,
# because split -n needs a seekable input to compute the chunk sizes.
# Usage: pgzip < FILE > OUTPUT.tar
pgzip() {
    local workdir
    workdir="$(mktemp -d "${TMPDIR:-/tmp}/tmpXXXXXX")"
    if [ $? -ne 0 ]; then
        echo "Failed to mktemp -d" >&2
        return 1
    fi
    # Some fancy strategy to decide the number of chunks to split into.
    local ncpu="$(nproc)"                # Get the number of processors.
    ncpu=$(( ncpu * 8 / 10 ))            # Use 80% of all processors.
    ncpu=$(( ncpu < 10 ? ncpu : 10 ))    # Use at most 10 processors.
    ncpu=$(( ncpu > 1 ? ncpu : 1 ))      # But always at least 1.
    # Step 1: split stdin into chunks. A single-digit suffix (-a 1) is
    # enough because we never create more than 10 chunks.
    split -d -a 1 -n "$ncpu" - "$workdir/chunk_"
    # Step 2: gzip the chunks in parallel.
    find "$workdir" -name 'chunk_*' | xargs -n1 -P "$ncpu" gzip
    # Step 3: bundle the chunks, in order, into a tarball on stdout.
    # GNU tar accepts -C lines in a -T file list.
    { echo "-C $workdir"; find "$workdir" -name 'chunk_*' | xargs -n1 basename | sort; } | tar cf - -T -
    rm -rf "$workdir"
}
# Decompress the output tarball to stdout.
# Usage: pgunzip OUTPUT.tar > FILE
pgunzip() {
    local input="$1"
    if [ -z "$input" ]; then
        echo "Missing input file" >&2
        return 1
    fi
    # Stream the chunks out in archive order; gzip transparently decompresses
    # the concatenated gzip members back into the original byte stream.
    tar xOf "$input" | gzip -cd
}
That’s it. No compilation, no installation, just shell and standard tools.
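For instance, a quick round trip, using big.dat purely as an illustrative file name:
# Compress: redirect from the file so split can seek it.
pgzip < big.dat > big.tar
# Decompress and check the result byte-for-byte.
pgunzip big.tar | cmp - big.dat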
What’s more, the resulting tarball can be easily handled from Python with builtin modules only (i.e. tarfile and gzip).
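As a minimal sketch (the function name and the tarball path are made up for illustration), reassembling the original bytes looks like this:
import gzip
import tarfile

def read_pgzip_tarball(path):
    """Reassemble the original bytes from a tarball of gzipped chunks."""
    with tarfile.open(path) as tar:
        # Chunk members are named chunk_0.gz, chunk_1.gz, ...; sorting by
        # name restores the original order.
        members = sorted(tar.getmembers(), key=lambda m: m.name)
        return b"".join(gzip.decompress(tar.extractfile(m).read())
                        for m in members)

data = read_pgzip_tarball("OUTPUT.tar")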
The limitation is obvious:
- Not robust against failures mid-process: errors in individual steps go undetected, and an interrupted run leaves the temporary chunk directory behind.