
I would like to utilize all 48 cores of my AWS instance to run my job. I have 6 million list entries to process, and each job finishes in less than a second [real 0m0.004s user 0m0.005s sys 0m0.000s]. The following invocation uses all the cores, but NOT at 100%:

gnu_parallel -a list.lst --load 100% --joblog process.log sh job_run.sh {} >>score.out

job_run.sh

#!/bin/bash
i=$1
TMP_DIR=/home/ubuntu/test/$i
mkdir -p "$TMP_DIR"
cd "$TMP_DIR" || exit 1
# Pull the second and third '-'-separated fields out of the job name
m=$(echo "$i" | awk -F '-' '{print $2}')
n=$(echo "$i" | awk -F '-' '{print $3}')
# Stage the two input files in the per-job scratch directory
cp /home/ubuntu/aligned/"$m" "$TMP_DIR"/
cp /home/ubuntu/aligned/"$n" "$TMP_DIR"/
# Print the job name followed by the scorer's 'GA' lines
printf '%s ' "$i"
/home/ubuntu/test/prog -s1 "$m" -s2 "$n" | grep 'GA'
# Clean up the scratch directory
cd "$TMP_DIR/.." || exit 1
rm -rf "$TMP_DIR"
exit 0
3 Comments
  • The slow part is almost certainly /home/ubuntu/test/prog. How are we supposed to know how to speed that up? Commented Apr 24, 2019 at 22:38
  • @Barmer prog is pretty fast. It runs in less than a second, and time shows [real 0m0.004s user 0m0.005s sys 0m0.000s]. What I'm asking is: how can I utilize the cores at 100%? Commented Apr 25, 2019 at 3:34
  • Try removing --load 100%. AFAIK that is meant as a throttle that can slow things down, not a target to speed things up to; by default GNU Parallel uses all cores fully anyway. Get rid of the 2 awk processes too, and use bash parameter substitution instead (see the sketch after these comments). Commented Apr 25, 2019 at 6:43
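As a side note on that last comment: the two awk calls can indeed be replaced with bash-only string splitting. A minimal sketch, assuming the job names are '-'-separated like abc-dec-fih-44hh-hhh-odjd (the format shown in the comments below):

#!/bin/bash
i=$1                      # e.g. abc-dec-fih-44hh-hhh-odjd

# Split on '-' in pure bash instead of forking awk twice per job
IFS='-' read -r -a f <<< "$i"
m=${f[1]}                 # second field, same as awk '{print $2}'
n=${f[2]}                 # third field, same as awk '{print $3}'

echo "$m" "$n"

At 6 million jobs, saving two fork+exec pairs per job adds up.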

1 Answer


Your problem is GNU Parallel's overhead: it takes 5-10 ms to start a job. So you will likely see GNU Parallel running at 100% on one core while the rest sit idle.
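A rough back-of-the-envelope estimate shows why this dominates (taking ~7.5 ms as the midpoint of the 5-10 ms figure above): 6,000,000 jobs × 7.5 ms ≈ 45,000 s, i.e. about 12.5 hours of single-core startup overhead, compared with 6,000,000 × 4 ms ≈ 6.7 hours of actual work in prog.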

But you can run multiple GNU Parallels: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Speeding-up-fast-jobs

So split the list into smaller chunks and run those in parallel:

cat list.lst | parallel --block 100k -q -I,, --pipe parallel --joblog process.log{#} sh job_run.sh {} >>score.out

This should run 48+1 GNU Parallels, so it should use all your cores. Most of the core time will be spent on overhead, because your jobs are so fast.
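For reference, here is the same command restated with the flags broken out into whole-line comments (flag semantics per the GNU Parallel man page; the approach is unchanged):

# --pipe --block 100k : the outer parallel chops list.lst into ~100 KB chunks
# -q                  : quote the inner command so it is passed through intact
# -I,,                : make ',,' the outer replacement string, leaving {} for
#                       the inner parallel to fill with one line of list.lst
# process.log{#}      : {#} is the outer job number, so each inner parallel
#                       writes its own joblog (process.log1, process.log2, ...)
cat list.lst |
  parallel --pipe --block 100k -q -I,, \
    parallel --joblog process.log{#} sh job_run.sh {} >> score.out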

If you are not using the process.log, then it can be done with less overhead:

perl -pe 's/^/sh job_run.sh /' list.lst | parallel --pipe --block 100k sh >>score.out

This prepends sh job_run.sh to each line and hands ~100 KB batches of the resulting command lines to 48 sh processes running in parallel.
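As a quick check of what the perl stage feeds to those shells (the job names here are made-up placeholders):

$ printf '%s\n' abc-dec-fih def-ghi-jkl | perl -pe 's/^/sh job_run.sh /'
sh job_run.sh abc-dec-fih
sh job_run.sh def-ghi-jkl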


2 Comments

  • The perl script produces an error: sh: 0: Can't open sh job_run.sh abc-dec-fih-44hh-hhh-odjd. Also, specifying the path to sh and to the list produces Unknown regexp modifier "/h" at -e line 1, at end of line.
  • Sorry, --pipe was missing. Edited.
