
Any improvements to this are welcome. I have got it working and I'm happy with it, but I'm not really all that proficient with bash shell scripts.

Problem: AWS can copy multiple files pretty much as quickly as it can copy one file, so if you have a lot of large files then the fastest way to move them is to get Amazon to copy them in parallel.

Solution: Run the aws s3 cp or mv commands as background processes, and monitor them for completion. Limit the number that can be run in parallel.

# Count the aws s3 commands running that are owned by this shell process
function counts3 #
{
    # $3 is the PPID column of "ps -ef"; count children of this shell whose
    # command line looks like "aws s3 <subcommand> ..."
    ps -ef | awk -v ppid=$$ '$3 == ppid && /aws s3 [[:alnum:]_]+ / { count++ } END { print count + 0 }'
}

# Submit an aws s3 command in the background, once fewer than "threads" are running
function multis3 # (threads cp|mv source target [other params]...)
{
    local threads=$1; shift

    # Turn off monitor mode so you don't get the job-completion output from nohup
    set +m

    # Wait until the number of running s3 commands is below the threads limit
    while (( $(counts3) >= threads )); do
        sleep 1
    done

    # Run the command in the background, now that we are below the limit
    nohup aws s3 "$@" > /dev/null 2>&1 &

    # Introduce a small delay to stagger the start of the cp/mv commands
    sleep 1
}

# Check if there are any aws s3 processes running, and wait until there aren't
function multis3wait #
{
    # Wait until there are no s3 commands running
    while (( $(counts3) > 0 )); do
        sleep 1
    done

    # Turn monitor mode back on again
    set -m
}

Usage example, instead of this:

aws s3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
aws s3 cp s3://from_bucket/from_path/file-2.txt s3://to_bucket/to_path/file-2.txt
aws s3 cp s3://from_bucket/from_path/file-3.txt s3://to_bucket/to_path/file-3.txt
aws s3 cp s3://from_bucket/from_path/file-4.txt s3://to_bucket/to_path/file-4.txt
aws s3 cp s3://from_bucket/from_path/file-5.txt s3://to_bucket/to_path/file-5.txt
aws s3 cp s3://from_bucket/from_path/file-6.txt s3://to_bucket/to_path/file-6.txt
aws s3 cp s3://from_bucket/from_path/file-7.txt s3://to_bucket/to_path/file-7.txt
aws s3 cp s3://from_bucket/from_path/file-8.txt s3://to_bucket/to_path/file-8.txt
aws s3 cp s3://from_bucket/from_path/file-9.txt s3://to_bucket/to_path/file-9.txt

do this:

multis3 3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
multis3 3 cp s3://from_bucket/from_path/file-2.txt s3://to_bucket/to_path/file-2.txt
multis3 3 cp s3://from_bucket/from_path/file-3.txt s3://to_bucket/to_path/file-3.txt
multis3 3 cp s3://from_bucket/from_path/file-4.txt s3://to_bucket/to_path/file-4.txt
multis3 3 cp s3://from_bucket/from_path/file-5.txt s3://to_bucket/to_path/file-5.txt
multis3 3 cp s3://from_bucket/from_path/file-6.txt s3://to_bucket/to_path/file-6.txt
multis3 3 cp s3://from_bucket/from_path/file-7.txt s3://to_bucket/to_path/file-7.txt
multis3 3 cp s3://from_bucket/from_path/file-8.txt s3://to_bucket/to_path/file-8.txt
multis3 3 cp s3://from_bucket/from_path/file-9.txt s3://to_bucket/to_path/file-9.txt
multis3wait

The 3 is the number of permitted threads. It can be as high as you like (3 is just an example), but I have found that going over about 50 doesn't gain much. You can also use mv instead of, or as well as, cp.
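
For example, cp and mv calls can be mixed under the same throttle; the bucket names, paths and thread count here are just placeholders:

multis3 10 cp s3://from_bucket/from_path/big-file.txt s3://to_bucket/to_path/big-file.txt
multis3 10 mv s3://from_bucket/from_path/old-file.txt s3://archive_bucket/archive_path/old-file.txt
multis3wait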

Remember, if you put these functions in a shell library script (e.g. multis3.sh) you need to do this first to load the functions:

. multis3.sh

Perhaps I should remove the set +m and set -m and leave that up to the calling script to decide.
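
If the library left monitor mode alone, the calling script would wrap the calls itself; a minimal sketch of that alternative (not part of the library above):

set +m      # caller suppresses the job-control completion messages
multis3 3 cp s3://from_bucket/from_path/file-1.txt s3://to_bucket/to_path/file-1.txt
multis3 3 cp s3://from_bucket/from_path/file-2.txt s3://to_bucket/to_path/file-2.txt
multis3wait
set -m      # and restores monitor mode when the copies have finished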

Alternative approaches

  • GNU parallel can be used to run the aws s3 commands
  • The aws s3 sync command will use the max_concurrent_requests setting to copy multiple files in parallel (both alternatives are sketched below)
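
Rough sketches of those two alternatives, for comparison; the paths and concurrency values are just illustrative:

# GNU parallel: run up to 10 copies at once, one job per listed file
parallel -j 10 aws s3 cp s3://from_bucket/from_path/{} s3://to_bucket/to_path/{} ::: file-1.txt file-2.txt file-3.txt

# aws s3 sync: raise max_concurrent_requests, then copy a whole prefix in one command
aws configure set default.s3.max_concurrent_requests 50
aws s3 sync s3://from_bucket/from_path/ s3://to_bucket/to_path/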

I still feel that this library has a place, as it allows finer control over what gets copied than aws s3 sync (and can do mv as well as cp), and the advantage over parallel is that you can kick off a bunch of copy commands, do something else in the script, and then wait for them all to complete before doing something that needs the files.

Possible enhancements

This could be generalised to run any process, and just count the number of child processes without filtering for aws s3. The child process count would have to exclude the ps command that is counting child processes...
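
A minimal sketch of that generalisation, counting the shell's own background jobs (via jobs -rp) instead of parsing ps, so there is no ps process to exclude; the function names are made up for illustration:

# Count running background jobs belonging to this shell
function countjobs #
{
    jobs -rp | wc -l
}

# Run any command in the background, once fewer than "threads" jobs are running
function multirun # (threads command [args]...)
{
    local threads=$1; shift
    while (( $(countjobs) >= threads )); do
        sleep 1
    done
    nohup "$@" > /dev/null 2>&1 &
}

# Wait for all background jobs started by this shell to finish
function multiwait #
{
    wait
}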

Comments
  • It is great that you implemented this as a shell library. That will give you the most flexibility with reusing this code in the long term. Kudos. (Commented Aug 21, 2020 at 12:17)
  • Would you not get the same advantage with GNU Parallel simply by using & and wait: parallel ... & do other stuff; wait; do stuff that needs the cp to finish? (Commented Aug 24, 2020 at 6:34)
  • GNU parallel might work, but I'd have to prepare a file or variable with all the commands and feed them all in at the same time. With this, I can drip-feed the files in a shell. I sometimes do this manually: prepare a few hundred commands, paste 50 at a time into a shell, and let the multis3 command manage it. (Commented Aug 25, 2020 at 14:03)
  • Doesn't aws s3 cp use the max_concurrent_requests setting? Have you tested that setting for cp? docs.aws.amazon.com/cli/latest/topic/… (Commented Jan 10, 2024 at 22:28)
  • Gosh @Munesh, it's a long time since I did anything with S3; I think I've forgotten nearly everything I knew about it. (Commented Jan 12, 2024 at 16:52)
