
I have access to a distributed computing cluster/server farm with a job scheduler (Slurm) that gives each parallel job an integer ID from 1 to n (I know the value of n; in the example below, n = 10).

I am using find -maxdepth 1 -name '2019 - *' to find the list of file names I want to pass to my program as an argument.

Sample file names:

2019 - Alphabet
2019 - Foo Bar
2019 - Reddit
2019 - StackExchange

The order does not matter, but each matching file should be used exactly once.

This is an example of a "template" script I can use:

#!/bin/bash

# in this case, from i = 1 to i = 10
#SBATCH --array=1-10

# pseudocode begins
    # it is given that filename_array has 10 unique elements
    filename_array="$(find -maxdepth 1 -name '2019 - *')"

    # SLURM_ARRAY_TASK_ID is the value of i, from i = 1 to i = 10
    filename=filename_array[$SLURM_ARRAY_TASK_ID]
# pseudocode ends

./a.out "$filename"

This is more or less what it does (but with each process running on a different computer in parallel):

./a.out "./2019 - Alphabet" &
./a.out "./2019 - Foo Bar" &
./a.out "./2019 - Reddit" &
./a.out "./2019 - StackExchange" &

How can I write a bash script that would run the template script exactly once for each of the file names given by find -maxdepth 1 -name '2019 - *'?

2 Answers


Using find is probably a mistake here, particularly as you are only interested in files in the current directory. You can just use a shell glob pattern.

#!/bin/sh

# run a.out in the background once for each matching file
for f in '2019 - '*
do
    [ -f "$f" ] && ./a.out "$f" &
done

The [ -f "$f" ] test is there for portability: if nothing matches, the pattern expands to itself, and the test stops a.out from being run on that non-existent name. If you are using bash you could instead use shopt -s nullglob to make a non-matching pattern expand to nothing, so the loop runs zero times rather than once when there are no matching files (see the sketch below). The portable test also has the benefit of skipping directories whose names happen to match the pattern.
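For illustration, here is a minimal sketch of that bash variant (note that it drops the [ -f ] test, so directories matching the pattern would no longer be skipped):

#!/bin/bash
# with nullglob set, a non-matching pattern expands to nothing,
# so the loop simply runs zero times instead of once
shopt -s nullglob
for f in '2019 - '*
do
    ./a.out "$f" &
done
wait   # block until all background jobs have finished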

Apparently what is required is a "template script", but I have only a limited idea of what this means.

Perhaps

#!/bin/bash
# magic string for slurm to run on 10 hosts
#SBATCH --array=1-10

filename_array=( '2019 - '* )
# SLURM_ARRAY_TASK_ID runs from 1 to 10, but bash array indices start at 0
filename=${filename_array[$SLURM_ARRAY_TASK_ID-1]}
./a.out "$filename"

is what is wanted?
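Assuming the script above is saved as template.sh (a name chosen here for illustration), it would be submitted once and Slurm would fan it out into ten array tasks:

# one submission; Slurm creates tasks 1..10, each with its own
# SLURM_ARRAY_TASK_ID, typically scheduled on different nodes
sbatch template.sh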

Edit: Another requirement change. Support regular expressions for the patterns.

#!/bin/bash
# magic string for slurm to run on 10 hosts
#SBATCH --array=1-10

# NUL-delimited names cope with spaces; sort -z gives every task the same order
# (readarray -d requires bash 4.4 or later)
readarray -d '' filename_array < <( find . -maxdepth 1 -regex '.*2019 -.*' -print0 | sort -z )
filename=${filename_array[$SLURM_ARRAY_TASK_ID-1]}
./a.out "$filename"
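If the number of matching files is not always exactly 10, one possible refinement (a sketch, not part of the answer above) is to count the matches at submission time; sbatch also accepts --array on the command line, where it takes precedence over the #SBATCH directive:

# hypothetical wrapper: size the job array to the number of matches
# (assumes the file names contain no newlines, as in the examples above)
n=$(find . -maxdepth 1 -regex '.*2019 -.*' | wc -l)
sbatch --array=1-"$n" template.sh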
  • I was not clear. I need to use a template script so that each of the processes is run on a different computer. Commented Jul 23, 2020 at 3:04
  • OK, is the issue that you have n (10 in this case) CPUs and an unknown number of files, which may be more or less than n? I am sorry, but I am struggling to understand what the problem is. Commented Jul 23, 2020 at 4:01
  • I have (for example) exactly 10 files that match the pattern (2019 - *) and a large number of CPUs on physically different computers (for example, 1000 computers with 2 CPUs each). I want to spread the processes across the computers. I need to do this with Slurm Workload Manager. Commented Jul 23, 2020 at 4:19
  • Did my updated answer provide you with a solution? If this script is run on 10 different hosts with access to the same filesystem, but each host has a different SLURM_ARRAY_TASK_ID (in the range 1 to 10), then it provides a different filename to your program. Each file will be used once. Commented Jul 23, 2020 at 6:30
  • This works for me, but what if I have a more complicated pattern than 2019 - * that can't be expressed as a glob? (or I don't know how to convert a find regex pattern to a glob) Commented Jul 23, 2020 at 15:50

Can you use $SLURM_JOB_NODELIST?

In that case GNU Parallel seems like an obvious solution:

find -maxdepth 1 -name '2019 - *' |
  parallel --slf "$SLURM_JOB_NODELIST" --wd . ./a.out {}
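One caveat: --slf (short for --sshloginfile) expects a file with one sshlogin per line, whereas SLURM_JOB_NODELIST usually holds a compressed range such as node[01-10]. A possible bridge (a sketch, assuming scontrol is available on the submitting host) is to expand the list first:

# expand the compressed nodelist into one hostname per line, then
# hand that file to GNU Parallel as its sshlogin file
# (nodefile.txt is a name chosen here for illustration)
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodefile.txt
find -maxdepth 1 -name '2019 - *' |
  parallel --slf nodefile.txt --wd . ./a.out {}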
