5

I have a script, dataProcessing.pl, that accepts a tab-delimited .txt file and performs extensive processing on the data it contains. There are multiple input files (file1.txt, file2.txt, file3.txt), which a bash script currently loops over, invoking perl once per iteration (i.e. the input files are processed one at a time).

However, I would like to run multiple instances of Perl (if possible) and process all input files simultaneously via xargs. I'm aware that you can run something akin to:

perl -e 'print "Test" x 100' | xargs -P 100

However, I want to pass a different file to each parallel instance of Perl (one instance works on file1.txt, one on file2.txt, and so forth). A file handle or file path can be passed to Perl as an argument. How can I do this? I am not sure, for example, how I would pass the file names to xargs.

3 Answers

12

Use xargs with -n 1 meaning "only pass one single argument to each invocation of the utility".

Something like:

printf '%s\n' file*.txt | xargs -n 1 -P 100 perl dataProcessing.pl

which assumes that the filenames don't contain literal newlines.

If you have GNU xargs, or another implementation of xargs that understands -0 (for reading nul-delimited arguments, which allows for filenames with newlines) and -r (for not running the utility with an empty argument list, when file*.txt doesn't match anything and nullglob is in effect), you may use

printf '%s\0' file*.txt | xargs -r0 -n 1 -P 100 perl dataProcessing.pl

Note that both of these variations may start up to 100 parallel instances of the script, which may not be what you want. You may want to limit it to a reasonable number related to the number of CPUs on your machine (or related to the total amount of available RAM divided by the expected memory usage per task, if it's memory bound).
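As a rough sketch of that advice, the job count can be derived from the CPU count instead of hard-coding 100. The helper name run_limited and the fallback value of 4 are assumptions made for illustration; nproc is part of GNU coreutils, and getconf _NPROCESSORS_ONLN is a common but not universal fallback:

```shell
# run_limited is a hypothetical helper name, not part of xargs.
# It reads NUL-delimited file names on stdin and runs the given command
# once per file, with at most one job per online CPU.
run_limited() {
    # Ask for the CPU count; fall back to 4 (an arbitrary assumption)
    # if neither tool is available.
    njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
    xargs -r0 -n 1 -P "$njobs" "$@"
}
```

It would be used as printf '%s\0' file*.txt | run_limited perl dataProcessing.pl. For a memory-bound script, replace the CPU count with however many instances fit in RAM.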

5

No need to get fancy here. In your bash for-loop, just background the perl process:

for f in file*.txt; do
    perl dataProcessing.pl "$f" &
done
# wait for them to complete
wait
echo "All done."
1
  • 4
    The issue with this is that you don't limit the number of parallel tasks, and if there are many input files and the processing is CPU or memory intensive, you will likely tank the system. Commented Mar 26, 2018 at 17:34
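One way to address that concern while staying with a plain loop is to cap the number of background jobs. This is only a sketch: process_file and process_all are hypothetical helper names, the cap is whatever you pass in, and wait -n needs bash 4.3 or newer:

```shell
# process_file and process_all are hypothetical helper names.
# Requires bash 4.3 or newer for `wait -n`.
process_file() {
    perl dataProcessing.pl "$1"
}

# usage: process_all MAX_JOBS FILE...
process_all() {
    max_jobs=$1
    shift
    running=0
    for f in "$@"; do
        if [ "$running" -ge "$max_jobs" ]; then
            # All slots are busy: block until any one background job exits.
            # Its exit status is ignored here, as in the plain loop.
            wait -n || true
            running=$((running - 1))
        fi
        process_file "$f" &
        running=$((running + 1))
    done
    # Wait for the remaining jobs to complete.
    wait
}
```

For example, process_all 4 file*.txt; echo "All done." would keep at most four perl processes running at any time.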
2

GNU Parallel is made for exactly this:

parallel some_command {} ::: *.txt

It defaults to one job per CPU core. If you want to run 100 jobs in parallel:

parallel -j100 some_command {} ::: *.txt

Knowing Perl, you will feel right at home using even the more advanced features of GNU Parallel. What do you think this does:

parallel echo '{= s/(\d+)/$1*2/e; s/(.)/uc($1)/e; s/bar/baz/; s/foo/bar/ =}' \
  ::: 'my foo' 'i went to a baraar to get a 12" crowfoo'

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: Simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]

Installation

For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
