51

There are 5 huge files (file1, file2, .. file5), about 10G each, with extremely little free space left on the disk, and I need to concatenate all these files into one. There is no need to keep the original files, only the final one.

The usual way to concatenate would be to run cat in sequence for files file2 .. file5:

cat file2 >> file1 ; rm file2

Unfortunately this way requires at least 10G of free space, which I don't have. Is there a way to concatenate the files without actually copying them, but instead tell the filesystem somehow that file1 doesn't end at its original end and continues at the start of file2?

PS: the filesystem is ext4, if that matters.

7
  • 2
    I'd be interested to see a solution, but I suspect it's not possible without messing with the filesystem directly. Commented Jun 23, 2013 at 15:51
  • 1
    Why do you need to have a single physical file that is so large? I'm asking because maybe you can avoid concatenating—which, as current answers show, is pretty bothersome. Commented Jun 23, 2013 at 17:05
  • 7
    @rush: then this answer might help: serverfault.com/a/487692/16081 Commented Jun 23, 2013 at 20:02
  • 1
    An alternative to device-mapper that is less efficient but easier to implement, results in a partitionable device, and can be used from a remote machine is the "multi" mode of nbd-server. Commented Jun 23, 2013 at 20:33
  • 1
    They always call me stupid when I say that I think this would be cool. Commented Jul 12, 2013 at 13:54

5 Answers

21

AFAIK it is (unfortunately) not possible to truncate a file from the beginning (this may be true for the standard tools; at the syscall level, see here). But by adding some complexity you can use normal truncation (together with sparse files): you can write to the end of the target file without having written all the data in between.

Let's assume at first that both files are exactly 5 GiB (5120 MiB) in size and that you want to move 100 MiB at a time. You execute a loop which consists of:

  1. copying one block from the end of the source file to its final position after the original end of the target file (increasing the consumed disk space)
  2. truncating the source file by one block (freeing disk space)

    for ((i=5119; i>=0; i--)); do
      # copy source block i to just past the original end of the target;
      # conv=notrunc keeps the blocks copied in earlier iterations
      dd if=sourcefile of=targetfile bs=1M count=1 conv=notrunc \
         skip="$i" seek="$((5120 + i))"
      # truncate the source to i MiB, freeing the block just copied
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    

But give it a try with smaller test files first, please...

The files are probably neither the same size nor multiples of the block size. In that case the offset calculation becomes more complicated, and dd's byte-granular flags (iflag=skip_bytes and oflag=seek_bytes in GNU dd) should be used.
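
For illustration, here is a rough sketch of that general case (my own code, not from the answer above; it assumes GNU dd, stat and truncate, and the file and variable names are made up). Again, try it on small test files first:

    blocksize=$((100 * 1024 * 1024))        # move 100 MiB per step
    targetsize=$(stat -c %s targetfile)     # original end of the target
    offset=$(stat -c %s sourcefile)         # source bytes not yet moved

    while [ "$offset" -gt 0 ]; do
      step=$blocksize
      [ "$offset" -lt "$step" ] && step=$offset   # leading chunk may be smaller
      offset=$((offset - step))

      # copy the last $step bytes of the source to the matching position
      # after the original end of the target
      dd if=sourcefile of=targetfile bs="$step" count=1 conv=notrunc \
         iflag=skip_bytes,fullblock oflag=seek_bytes \
         skip="$offset" seek="$((targetsize + offset))"

      truncate -s "$offset" sourcefile      # cut the copied bytes off the source
    done
    rm sourcefile                           # now empty

Run the same loop once per remaining file, always with the grown first file as the target.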

If this is the way you want to go but need help for the details then ask again.

Warning

Depending on the dd block size the resulting file will be a fragmentation nightmare.

3
  • Looks like this is the most acceptable way to concatenate files. Thanks for the advice. Commented Jun 23, 2013 at 19:53
  • 4
    if there is no sparse file support then you could block-wise reverse the second file in place and then just remove the last block and append it to the first file Commented Jun 24, 2013 at 0:04
  • 1
    I haven't tried this myself (although I'm about to), but seann.herdejurgen.com/resume/samag.com/html/v09/i08/a9_l1.htm is a Perl script that claims to implement this algorithm. Commented Nov 26, 2013 at 17:58
18

Instead of catting the files together into one file, maybe simulate a single file with a named pipe, if your program can't handle multiple files.

mkfifo /tmp/file
cat file* >/tmp/file &
blahblah /tmp/file
rm /tmp/file

As Hauke suggests, losetup/dmsetup can also work. A quick experiment: I created files 'file0 .. file3' and, with a bit of effort, did:

for i in file*;do losetup -f ~/$i;done

numchunks=3
startsector=0
for i in `seq 0 $numchunks`; do
        # size of file$i in 512-byte sectors (must divide evenly; see below)
        sizeinsectors=$((`ls -l file$i | awk '{print $5}'`/512))
        echo "$startsector $sizeinsectors linear /dev/loop$i 0"
        startsector=$(($startsector+$sizeinsectors))
done | dmsetup create joined

Then /dev/mapper/joined (typically also visible as /dev/dm-0) is a virtual block device with your concatenated data as its contents.

I haven't tested this well.

Another edit: each file's size has to be evenly divisible by 512 or you'll lose some data. If it is, then you're good. I see he also noted that below.
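
For completeness, here is a rough, untested sketch of the fix-up Hauke describes in a comment below for the case where the first file's size is not a multiple of 512 (the glue file and the loop device numbers are purely illustrative):

s0=$(stat -c %s file0); rem=$((s0 % 512))
pad=$((512 - rem))                      # bytes borrowed from file1

tail -c "$rem" file0  >  glue           # dangling tail of file0...
head -c "$pad" file1  >> glue           # ...padded to a full 512-byte sector

losetup --sizelimit $((s0 - rem)) /dev/loop0 file0
losetup /dev/loop1 glue
losetup --offset "$pad" /dev/loop2 file1

full=$(( (s0 - rem) / 512 ))            # whole sectors of file0
rest=$(( ($(stat -c %s file1) - pad) / 512 ))
dmsetup create joined <<EOF
0 $full linear /dev/loop0 0
$full 1 linear /dev/loop1 0
$((full + 1)) $rest linear /dev/loop2 0
EOF

The remainder of file1 must itself end on a 512-byte boundary, otherwise the same padding trick has to be repeated at the next boundary; and the backing files must not be modified while the mapped device exists.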

3
  • It's a great idea if the file only has to be read once; unfortunately there is no way to seek backward/forward in a fifo, is there? Commented Jun 23, 2013 at 19:50
  • 7
    @rush The superior alternative may be to put a loop device on each file and combine them via dmsetup to a virtual block device (which allows normal seek operations but neither append nor truncate). If the size of the first file is not a multiple of 512 then you should copy the incomplete last sector and the first bytes from the second file (in sum 512) to a third file. The loop device for the second file would need --offset then. Commented Jun 23, 2013 at 20:30
  • elegant solutions. +1 also to Hauke Laging, who suggests a way to work around the problem if the first file(s)' size is not a multiple of 512 Commented Jun 24, 2013 at 18:09
10

You'll have to write something that copies data in chunks that are at most as large as the amount of free space you have. It should work like this (a rough shell sketch follows the list):

  • Read a block of data from file2 (using pread(), or by seeking to the correct location before the read).
  • Append the block to file1.
  • Use fcntl(F_FREESP) to deallocate the space from file2.
  • Repeat
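
A rough shell rendering of this loop (my own sketch, not the answer author's code; it takes the FALLOC_FL_PUNCH_HOLE route via the util-linux fallocate tool mentioned in the comments below, and the chunk size is arbitrary):

chunk=$((100 * 1024 * 1024))            # copy 100 MiB per iteration
size=$(stat -c %s file2)
offset=0

while [ "$offset" -lt "$size" ]; do
  len=$chunk
  [ $((size - offset)) -lt "$len" ] && len=$((size - offset))

  # append one chunk of file2 to file1
  dd if=file2 of=file1 bs="$len" count=1 conv=notrunc oflag=append \
     iflag=skip_bytes,fullblock skip="$offset"

  # free that chunk's blocks in file2 (its apparent size stays the same)
  fallocate --punch-hole --offset "$offset" --length "$len" file2

  offset=$((offset + len))
done
rm file2                                # fully sparse by now

The extra space needed at any moment stays around one chunk.
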
5
  • 1
    I know... but I couldn't think of any way that didn't involve writing code, and I figured writing what I wrote was better than writing nothing. I didn't think of your clever trick of starting from the end! Commented Jun 23, 2013 at 16:10
  • Yours, too, wouldn't work without starting from the end, would it? Commented Jun 23, 2013 at 16:16
  • No, it works from the beginning because of fcntl(F_FREESP) which frees the space associated with a given byte range of the file (it makes it sparse). Commented Jun 23, 2013 at 16:19
  • That's pretty cool. But seems to be a very new feature. It is not mentioned in my fcntl man page (2012-04-15). Commented Jun 23, 2013 at 16:31
  • 4
    @HaukeLaging F_FREESP is the Solaris one. On Linux (since 2.6.38), it's the FALLOC_FL_PUNCH_HOLE flag of the fallocate syscall. Newer versions of the fallocate utility from util-linux have an interface to that. Commented Jun 23, 2013 at 20:12
1

I know it's more of a workaround than what you asked for, but it would take care of your problem (and with little fragmentation or head-scratching):

#step 1
mount /path/to/... /the/new/fs #mount a new filesystem (from NFS? or an external usb disk?)

and then

#step 2:
cat file* > /the/new/fs/fullfile

or, if you think compression would help:

#step 2 (alternate):
cat file* | gzip -c - > /the/new/fs/fullfile.gz

Then (and ONLY then), finally

#step 3:
rm file*
mv /the/new/fs/fullfile  .   #or fullfile.gz if you compressed it
2
  • Unfortunately an external usb disk requires physical access and nfs requires additional hardware, and I have neither. Anyway, thanks. =) Commented Jun 24, 2013 at 17:58
  • I thought it would be that way... Rob Bos's answer is then what seems your best option (without risking losing data by truncating-while-copying, and without hitting FS limitations as well) Commented Jun 24, 2013 at 18:12
1

This answer is going to look a bit weird; please bear with me. I'm going to assume that you have 5 files, each 10GB in size, and the disk is nearly full (let's say you have 20MB free).

Furthermore I'm going to assume that each file is compressible down to 5/6ths (~ 83.33%) of its original size (that is, down to ~8533 MB), or more efficiently than that (that is, to a smaller ratio than 5/6).

The idea is that once you have the 5 files of 8533 MB size each, they're going to occupy ~41.67GB together, and then it becomes possible to concatenate compressed file#2 to compressed file#1. That way, your cumulative disk usage will return to 50GB, but you can delete file#2 right after. And then you can individually concatenate and delete the remaining files. The idea is that the compression is supposed to free up enough space so that you can always duplicate (by way of concatenation) just one of the remaining (compressed) files.

In the end, you're supposed to have all five compressed files concatenated. This file should then be de-compressed, producing a concatenated decompressed file (all three of xz, gzip and bzip2 deal transparently with concatenation).
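
To illustrate that last point (a tiny demonstration only, not part of the procedure): concatenated gzip streams decompress to the concatenation of the original data, and xz and bzip2 behave the same way.

printf 'foo' | gzip  > demo.gz
printf 'bar' | gzip >> demo.gz
gzip -dc demo.gz      # prints "foobar"
rm demo.gz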

For compression of such large files, multi-threaded utilities should be used (pigz, lbzip2, xz -T).

Of course, the question is: if you have only ~20MB free in the initial state, how can you compress the first 10GB file down to ~8533 MB? There seems to be no room for storing the compressed output.

Similarly, in the end, when you have all five compressed files concatenated (and the individual files removed), your disk usage will be ~41.67GB (with ~8533 MB free); how is the full decompressed file (50GB) going to fit in that ~8533 MB free space?

The answer is that you can compress and decompress regular files in-place with the ipf utility. (I wrote ipf.)

Here's a proof of concept with 1GB files.

First I need to generate input files. Base64 encoding blows up the data to 4/3rds of the original size, and a good compressor will invert that inflation; in other words, base64-encoded pseudo-random data will be compressed down to 3/4ths, that is, to 75% -- and that will satisfy the above requirement of 83.33%.

# generate the input files
for ((k=0; k<5; k++)); do
  head -c $((1024*1024*1024*3/4)) /dev/urandom \
  | base64 --wrap=0 \
  > file-$k.dat
done

# check input file sizes
du -mc file-?.dat

Expected output:

1025    file-0.dat
1025    file-1.dat
1025    file-2.dat
1025    file-3.dat
1025    file-4.dat
5121    total

Check each file's compressibility before modifying anything:

for f in file-?.dat; do
  origsz=$(stat -c %s -- "$f")
  comprsz=$(pigz -c -- "$f" | wc -c)
  # calculate the ratio (comprsz/origsz) as a percentage, rounded up
  # to a whole percent
  percentage=$(((comprsz*100 + origsz - 1)/origsz))
  printf '%s: %u%%\n' "$f" $percentage
done

Expected output for the example files:

file-0.dat: 76%
file-1.dat: 76%
file-2.dat: 76%
file-3.dat: 76%
file-4.dat: 76%

This is better than 5/6ths (~83.33%); thus we can proceed.

Capture a checksum for the concatenated file in advance:

cat file-?.dat | sha256sum > all.sha256

Now compress each file in-place:

mkdir queue-dir
mkfifo orig-fifo filtered-fifo

(
  set -e
  for f in file-?.dat; do
    pigz <orig-fifo >filtered-fifo &
    compressor_pid=$!
    ipf -f "$f" -w orig-fifo -r filtered-fifo -d queue-dir \
        -s $((1024*1024))
    wait $compressor_pid
    mv -v -- "$f" "$f".gz
  done
)

Expected output (with the example files):

renamed 'file-0.dat' -> 'file-0.dat.gz'
renamed 'file-1.dat' -> 'file-1.dat.gz'
renamed 'file-2.dat' -> 'file-2.dat.gz'
renamed 'file-3.dat' -> 'file-3.dat.gz'
renamed 'file-4.dat' -> 'file-4.dat.gz'

Concatenate one by one, checking total disk usage after each step:

(
  set -e
  shopt -s nullglob
  first=1
  for f in file-?.dat.gz; do
    if [ $first -ne 0 ]; then
      mv -v -- "$f" all.gz
      first=0
    else
      cat -- "$f" >> all.gz
      rm -v -- "$f"
    fi

    du -mc all.gz file-?.dat.gz
  done
)

Expected output (with the example files):

renamed 'file-0.dat.gz' -> 'all.gz'
776     all.gz
776     file-1.dat.gz
776     file-2.dat.gz
776     file-3.dat.gz
776     file-4.dat.gz
3879    total
removed 'file-1.dat.gz'
1552    all.gz
776     file-2.dat.gz
776     file-3.dat.gz
776     file-4.dat.gz
3879    total
removed 'file-2.dat.gz'
2328    all.gz
776     file-3.dat.gz
776     file-4.dat.gz
3879    total
removed 'file-3.dat.gz'
3103    all.gz
776     file-4.dat.gz
3879    total
removed 'file-4.dat.gz'
3879    all.gz
3879    total

As the next step, decompress the concatenated file in-place.

Assuming the worst acceptable compression rate of 83.33%: the queue directory will reach its maximum disk space usage just before the decompression finishes: ~41.67GB will have been written back to the concatenated file, and ~8533MB will have been queued; in total: 50GB. The max overhead of the queue directory will be 4MB (we choose 4MB as queue file size), which can be accommodated by the initial 20MB free space. Just before decompression completes, the queue directory entry count (i.e. number of queue files held at the same time) will peak at ~8533/4≃2134 queue files. The total number of queue files that will have passed through the queue directory is 50GB/4MB=12800.

Numbers for the example files (where the compression rate is ~76%): the queue directory will reach its maximum disk space usage just before the decompression finishes: 3879MB will have been written back to the concatenated file, and ~1242MB will have been queued; in total: 5121MB. Just before decompression completes, the queue directory entry count will peak at ~1242/4≃311 queue files. The total number of queue files that will have passed through the queue directory is ~5121MB/4MB≃1280.

(
  set -e
  pigz -d <orig-fifo >filtered-fifo &
  decompressor_pid=$!
  ipf -f all.gz -w orig-fifo -r filtered-fifo -d queue-dir \
      -s $((4*1024*1024))
  wait $decompressor_pid
  mv -v -- all.gz all
)

Verify the pre-calculated checksum:

sha256sum -c all.sha256 <all

Clean up:

rmdir queue-dir
rm orig-fifo filtered-fifo all.sha256
