I have a directory that contains 400 million files. Using find, I created a list of all the files, which looks like this:
/output/custom/31/7/31767937
/output/custom/31/7/317537a
/output/custom/31/7/317537
/output/custom/31/7/317ab
/output/custom/31/7/317bo
/output/custom/31/7/317je
/output/custom/31/7/317ma
/output/custom/31/7/31763
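For context, a command along these lines would produce such a list; the output filename is only a placeholder, since the original invocation wasn't shown:

find /output/custom -type f > filelist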
I then split the file into 20 different files, and ran a script to create 20 different tarballs:
for i in $(ls x*)
do
tar -cf /tar/$i.tar -T $i &
done
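The x* names in the loop are split's default output prefix, so the splitting step mentioned above was probably something like this sketch (assuming GNU split and the placeholder filelist name from the earlier example):

split -n l/20 filelist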
The input files are on a different drive than the /tar mount point. The script has now been running for 2 days, and it's about 1/4 of the way done. I'll probably just leave it running at this point. However, for future reference, I'm wondering if there's a better way to do this than using tar?
My end goal here is to move these tarballs to 20 different servers, untar them, and run some scripts on the files. Oh, and since I'll then have the tarballs anyway, I'll be putting them on S3 storage too.
Watch iostat 5 while gradually starting additional tar processes until one of the disks' throughput tops out. Consider running ionice on all the tar processes except one, or even kill -STOP on most of the tar processes. Try to ensure only one process is doing I/O on a given disk at a time.

Also, don't loop over the output of ls! If you don't quote its output, your script breaks when any filename contains whitespace; when you do quote it, you can't process more than one file. You can't win. Just use for i in * instead, and make it a habit to quote every variable you use, always:

tar -cf "/tar/$i.tar" -T "$i" &
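Putting those two suggestions together, a hedged sketch: the PIDs below are hypothetical placeholders, and the loop reuses the x* glob from the original script with every expansion quoted.

# lower the I/O priority of all but one tar process (example PIDs)
ionice -c 3 -p 12346 12347
# or pause the extra tar processes entirely until the first finishes
kill -STOP 12348

# rewritten loop: no ls, every expansion quoted
for i in x*
do
    tar -cf "/tar/$i.tar" -T "$i" &
done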