Any improvements to this are welcome. I have it working and I'm happy with it, but I'm not all that proficient with bash shell scripts.
Problem: AWS can copy multiple files roughly as quickly as it can copy a single one, so if you have a lot of large files the fastest approach is to have Amazon copy them in parallel.
Solution:
Run the aws s3 cp or mv commands as background processes, and monitor them for completion. Limit the number that can be run in parallel.
# Check for aws s3 commands running that are owned by this shell process
function counts3 #
{
    local plist=( $( ps -ef | grep -E "aws s3 \w+ " | awk "\$3 == $$ { print \$2 }" ) )
    echo ${#plist[@]}
}
# Submit an aws s3 command in the background, once fewer than "threads" are currently running
function multis3 # (threads cp|mv source target [other params]...)
{
    local threads=$1; shift
    # Turn off monitor mode so you don't get the job-completion output from the background jobs
    set +m
    local pcount=$(counts3)
    # Wait until there are fewer s3 commands running than the threads limit
    while (( pcount >= threads )); do
        sleep 1
        pcount=$(counts3)
    done
    # Now below the limit, so run the command ("$@" keeps arguments with spaces intact)
    nohup aws s3 "$@" >& /dev/null &
    # Introduce a small delay to stagger the start of the cp/mv commands
    sleep 1
}
# Check if there are any aws s3 processes running, and wait until there aren't
function multis3wait #
{
    local pcount=$(counts3)
    # Wait until there are no s3 commands running
    while (( pcount > 0 )); do
        sleep 1
        pcount=$(counts3)
    done
    # Turn monitor mode back on again
    set -m
}
Usage example, instead of this:
aws s3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
aws s3 cp s3://from_bucket/from_path/file-2.txt s3://to_bucket/to_path/file-2.txt
aws s3 cp s3://from_bucket/from_path/file-3.txt s3://to_bucket/to_path/file-3.txt
aws s3 cp s3://from_bucket/from_path/file-4.txt s3://to_bucket/to_path/file-4.txt
aws s3 cp s3://from_bucket/from_path/file-5.txt s3://to_bucket/to_path/file-5.txt
aws s3 cp s3://from_bucket/from_path/file-6.txt s3://to_bucket/to_path/file-6.txt
aws s3 cp s3://from_bucket/from_path/file-7.txt s3://to_bucket/to_path/file-7.txt
aws s3 cp s3://from_bucket/from_path/file-8.txt s3://to_bucket/to_path/file-8.txt
aws s3 cp s3://from_bucket/from_path/file-9.txt s3://to_bucket/to_path/file-9.txt
do this:
multis3 3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
multis3 3 cp s3://from_bucket/from_path/file-2.txt s3://to_bucket/to_path/file-2.txt
multis3 3 cp s3://from_bucket/from_path/file-3.txt s3://to_bucket/to_path/file-3.txt
multis3 3 cp s3://from_bucket/from_path/file-4.txt s3://to_bucket/to_path/file-4.txt
multis3 3 cp s3://from_bucket/from_path/file-5.txt s3://to_bucket/to_path/file-5.txt
multis3 3 cp s3://from_bucket/from_path/file-6.txt s3://to_bucket/to_path/file-6.txt
multis3 3 cp s3://from_bucket/from_path/file-7.txt s3://to_bucket/to_path/file-7.txt
multis3 3 cp s3://from_bucket/from_path/file-8.txt s3://to_bucket/to_path/file-8.txt
multis3 3 cp s3://from_bucket/from_path/file-9.txt s3://to_bucket/to_path/file-9.txt
multis3wait
The 3 is the number of permitted threads. It can be as high as you like (3 is just an example), but I have found that going over about 50 doesn't gain much. You can also use mv instead of, or as well as, cp.
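For example, mixing mv and cp with a higher thread limit (the bucket and file names here are just placeholders):

multis3 10 cp s3://from_bucket/from_path/report.csv s3://to_bucket/to_path/report.csv
multis3 10 mv s3://from_bucket/from_path/old-1.log s3://archive_bucket/old-1.log
multis3wait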
Remember, if you put these functions in a shell library script (e.g. multis3.sh) you need to do this first to load the functions:
. multis3.sh
Perhaps I should remove the set +m and set -m and leave that up to the calling script to decide.
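If the functions dropped those two lines, the calling script would manage it itself, roughly like this (a sketch, assuming set +m and set -m have been removed from multis3 and multis3wait):

set +m   # suppress job-control completion messages while the copies run
multis3 3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
multis3wait
set -m   # restore monitor mode afterwards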
Alternative approaches
- GNU parallel can be used to run the aws s3 commands
- The aws s3 sync command will use the max_concurrent_requests setting to copy multiple files in parallel
I still feel this library has a place: it allows finer control over what gets copied than aws s3 sync (and can do mv as well as cp), and the advantage over parallel is that you can kick off a bunch of copy commands, do something else in the script, and then wait for them all to complete before doing something that needs the files.
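For reference, the alternatives might look something like this (I haven't benchmarked these; the buckets and file names are placeholders):

# GNU parallel: run up to 3 aws s3 cp jobs at a time
parallel -j 3 aws s3 cp s3://from_bucket/from_path/{} s3://to_bucket/to_path/{} ::: file-{1..9}.txt

# aws s3 sync: copies everything under the prefix, parallelised per max_concurrent_requests
aws configure set default.s3.max_concurrent_requests 50
aws s3 sync s3://from_bucket/from_path/ s3://to_bucket/to_path/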
Possible enhancements
This could be generalised to run any process, and just count the number of child processes without filtering for aws s3. The child process count would have to exclude the ps command that is counting child processes...
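A minimal sketch of what that generalisation might look like. The names countprocs and multirun are my own placeholders, and pgrep matches on the process name rather than the full command line, so counting all children with no filter at all would still need the exclusion mentioned above:

# Count child processes of this shell whose command name matches $1
# (pgrep -c prints the count, -P limits the search to children of this shell)
function countprocs # (command-name)
{
    pgrep -c -P $$ "$1"
}

# Run any command in the background once fewer than "threads" copies of it are running
function multirun # (threads command [args]...)
{
    local threads=$1; shift
    while (( $(countprocs "$1") >= threads )); do
        sleep 1
    done
    nohup "$@" >& /dev/null &
}

e.g. multirun 3 rsync -a local_dir/ remote_host:/backup/ would behave much like multis3, but for any command.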