This answer is going to look a bit weird; please bear with me. I'm going to assume that you have 5 files, each 10GB in size, and the disk is nearly full (let's say you have 20MB free).
Furthermore I'm going to assume that each file is compressible down to 5/6ths (~ 83.33%) of its original size (that is, down to ~8533 MB), or more efficiently than that (that is, to a smaller ratio than 5/6).
The idea is that once you have the 5 files of ~8533 MB each, they're going to occupy ~41.67GB together, and then it becomes possible to concatenate compressed file#2 to compressed file#1. That way, your cumulative disk usage temporarily returns to 50GB, but you can delete file#2 right after. Then you can individually concatenate and delete the remaining files. The point is that the compression is supposed to free up enough space so that you can always duplicate (by way of concatenation) just one of the remaining (compressed) files.
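For illustration, here's a rough sketch of that space accounting in shell arithmetic (whole MBs, 10GB = 10240MB per file, 50GB disk; nothing here touches the disk, the numbers are hypothetical):
disk_mb=$((50*1024)); orig_mb=$((10*1024))
compr_mb=$((orig_mb*5/6))         # ~8533 MB per compressed file
used_mb=$((5*compr_mb))           # all five files compressed in-place
for ((i=2; i<=5; i++)); do
  used_mb=$((used_mb + compr_mb))   # append a copy of compressed file #i to file #1
  echo "appending file #$i: $used_mb MB used of $disk_mb"
  used_mb=$((used_mb - compr_mb))   # delete compressed file #i
done
echo "single concatenated compressed file: $used_mb MB"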
In the end, you're supposed to have all five compressed files concatenated. This file should then be decompressed, producing a concatenated decompressed file (all three of xz, gzip and bzip2 deal transparently with concatenation).
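For example, with gzip (a.gz and b.gz are just throwaway names for the demonstration):
printf 'foo\n' | gzip > a.gz
printf 'bar\n' | gzip > b.gz
cat a.gz b.gz | gzip -d    # prints "foo" then "bar"
rm a.gz b.gz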
For compression of such large files, multi-threaded utilities should be used (pigz, lbzip2, xz -T).
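For reference, standalone invocations might look like this (big.dat is a placeholder name; pigz and lbzip2 use all available cores by default, while xz typically needs -T):
pigz -k big.dat       # -> big.dat.gz
lbzip2 -k big.dat     # -> big.dat.bz2
xz -T0 -k big.dat     # -> big.dat.xz; -T0 means one thread per core
In the actual procedure below they're run as filters behind ipf rather than standalone, since there's no room for separate output files.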
Of course, the question is: if you have only ~20MB free in the initial state, how can you compress the first 10GB file down to ~8533 MB? There seems to be no room for storing the compressed output.
Similarly, in the end, when you have all five compressed files concatenated (and the individual files removed), your disk usage will be ~41.67GB (with ~8533 MB free); how is the full decompressed file (50GB) going to fit in that ~8533 MB free space?
The answer is that you can compress and decompress regular files in-place with the ipf utility. (I wrote ipf.)
Here's a proof of concept with 1GB files.
First I need to generate input files. Base64 encoding blows up the data to 4/3rds of its original size, and a good compressor will invert that inflation; in other words, base64-encoded pseudo-random data will compress down to 3/4ths, that is, to 75%, which satisfies the above requirement of at most ~83.33%.
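As a quick sanity check of that 4/3 blow-up (300 is just an arbitrary small byte count):
head -c 300 /dev/urandom | base64 --wrap=0 | wc -c
# expect 400 (or 401 if your base64 appends a trailing newline)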
# generate the input files
for ((k=0; k<5; k++)); do
head -c $((1024*1024*1024*3/4)) /dev/urandom \
| base64 --wrap=0 \
> file-$k.dat
done
# check input file sizes
du -mc file-?.dat
Expected output:
1025 file-0.dat
1025 file-1.dat
1025 file-2.dat
1025 file-3.dat
1025 file-4.dat
5121 total
Check each file's compressibility before modifying anything:
for f in file-?.dat; do
origsz=$(stat -c %s -- "$f")
comprsz=$(pigz -c -- "$f" | wc -c)
# calculate the ratio (comprsz/origsz) as a percentage, rounded up
# to a whole percent
percentage=$(((comprsz*100 + origsz - 1)/origsz))
printf '%s: %u%%\n' "$f" $percentage
done
Expected output for the example files:
file-0.dat: 76%
file-1.dat: 76%
file-2.dat: 76%
file-3.dat: 76%
file-4.dat: 76%
This is better than 5/6ths (~83.33%); thus we can proceed.
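To see the headroom in numbers (treating the files' ~5121MB total as the scaled-down "full disk"; these are rough estimates derived from the 76% figure):
echo $(( 1025 * 76 / 100 ))             # ~779 MB per compressed file (du below shows 776)
echo $(( 5121 - 5 * 1025 * 76 / 100 ))  # ~1226 MB free once all five are compressed,
                                        # comfortably more than one compressed file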
Capture a checksum for the concatenated file in advance:
cat file-?.dat | sha256sum > all.sha256
Now compress each file in-place:
mkdir queue-dir
mkfifo orig-fifo filtered-fifo
(
set -e
for f in file-?.dat; do
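# start the filter for this file: it reads original data from
# orig-fifo and writes compressed data to filtered-fifo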
pigz <orig-fifo >filtered-fifo &
compressor_pid=$!
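# ipf streams the file through the filter via the two fifos and
# writes the result back over the file in place, staging data in
# queue-dir (1 MiB queue files, per -s) while doing so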
ipf -f "$f" -w orig-fifo -r filtered-fifo -d queue-dir \
-s $((1024*1024))
wait $compressor_pid
mv -v -- "$f" "$f".gz
done
)
Expected output (with the example files):
renamed 'file-0.dat' -> 'file-0.dat.gz'
renamed 'file-1.dat' -> 'file-1.dat.gz'
renamed 'file-2.dat' -> 'file-2.dat.gz'
renamed 'file-3.dat' -> 'file-3.dat.gz'
renamed 'file-4.dat' -> 'file-4.dat.gz'
Concatenate one by one, checking total disk usage after each step:
(
set -e
shopt -s nullglob
first=1
for f in file-?.dat.gz; do
if [ $first -ne 0 ]; then
mv -v -- "$f" all.gz
first=0
else
cat -- "$f" >> all.gz
rm -v -- "$f"
fi
du -mc all.gz file-?.dat.gz
done
)
Expected output (with the example files):
renamed 'file-0.dat.gz' -> 'all.gz'
776 all.gz
776 file-1.dat.gz
776 file-2.dat.gz
776 file-3.dat.gz
776 file-4.dat.gz
3879 total
removed 'file-1.dat.gz'
1552 all.gz
776 file-2.dat.gz
776 file-3.dat.gz
776 file-4.dat.gz
3879 total
removed 'file-2.dat.gz'
2328 all.gz
776 file-3.dat.gz
776 file-4.dat.gz
3879 total
removed 'file-3.dat.gz'
3103 all.gz
776 file-4.dat.gz
3879 total
removed 'file-4.dat.gz'
3879 all.gz
3879 total
As the next step, decompress the concatenated file in-place.
Assume the worst acceptable compression ratio of ~83.33%. The queue directory will reach its maximum disk space usage just before the decompression finishes: at that point, ~41.67GB will have been written back to the concatenated file, and ~8533MB will be held in the queue; in total: 50GB. The max overhead of the queue directory will be 4MB (we choose 4MB as the queue file size), which the initial 20MB of free space can accommodate. Just before decompression completes, the queue directory entry count (i.e. the number of queue files held at the same time) will peak at ~8533/4 ≃ 2134; the total number of queue files that will have passed through the queue directory is 50GB/4MB = 12800.
Numbers for the example files (where the compression ratio is ~76%): the queue directory again reaches its maximum disk space usage just before the decompression finishes, when 3879MB will have been written back to the concatenated file and ~1242MB will be held in the queue; in total: 5121MB. Just before decompression completes, the queue directory entry count will peak at ~1242/4 ≃ 311; the total number of queue files that will have passed through the queue directory is ~5121MB/4MB ≃ 1280.
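Those counts can be double-checked with shell arithmetic (sizes in MB, 4MB queue files; peaks rounded up, totals rounded down):
echo $(( (8533 + 3) / 4 ))   # worst case: peak of ~2134 queue files held at once
echo $(( 50 * 1024 / 4 ))    # worst case: 12800 queue files in total
echo $(( (1242 + 3) / 4 ))   # example files: peak of ~311
echo $(( 5121 / 4 ))         # example files: ~1280 in total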
(
set -e
pigz -d <orig-fifo >filtered-fifo &
decompressor_pid=$!
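# same plumbing as for compression, but decompressing, and with the
# 4 MiB queue files discussed above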
ipf -f all.gz -w orig-fifo -r filtered-fifo -d queue-dir \
-s $((4*1024*1024))
wait $decompressor_pid
mv -v -- all.gz all
)
Verify the pre-calculated checksum (all.sha256 records its input as -, i.e. standard input, so we feed the decompressed file on stdin):
sha256sum -c all.sha256 <all
Clean up:
rmdir queue-dir
rm orig-fifo filtered-fifo all.sha256