I'm trying to come up with a script for managing jobs on a supercomputer. The details don't matter much, but a key point is that the script starts to tail -f a file once it appears. Now this would run forever, but I want to cleanly stop it and exit the script once I detect that the job is finished.

Unfortunately I'm stuck. I tried multiple solutions, but none of them exits the script; it keeps running even after the job has been detected as finished. The version below seemed the most logical one, but it too keeps running forever.

How should I tackle this issue? I'm OK with bash, but not really advanced.

#!/bin/bash

# get the path to the job script, print help if not passed
jobscr="$1"
[[ -z "$jobscr" ]] && echo "Usage: submit-and-follow [script to submit]" && exit -2

# submit job via SLURM (the job scheduler), and get the
# job ID (4-5-digit number) from its output, exit if failed
jobmsg=$(sbatch "$jobscr")
ret=$?
echo "$jobmsg"
if [ ! $ret -eq 0 ]; then exit $ret; fi
jobid=$(echo "$jobmsg" | cut -d " " -f 4)

# get the stdout and stderr file the job is using, we will log them in another 
# file while we `tail -f` them (this is necessary due to a file corruption
# bug in the supercomputer, just assume it makes sense)
outf="$(scontrol show job $jobid | awk -F= '/StdOut=/{print $2}')"
errf="$(scontrol show job $jobid | awk -F= '/StdErr=/{print $2}')"
logf="${outf}.bkp"

# wait for job to start
echo "### Waiting for job $jobid to start..."
until [ -f "$outf" ]; do sleep 5; done


# ~~~~ HERE COMES THE PART IN QUESTION ~~~~ #

# Once it started, start printing the content of stdout and stderr 
# and copy them into the log file
echo "### Job $jobid started, stdout and stderr:"
tail -f -n 100000 $outf $errf | tee $logf &
tail_pid=$! # catch the pid of the child process

# watch for job end (the job id disappears from the queue; consider this 
# detection working), and kill the tail process
while : ; do
    sleep 5
    if [[ -z "$(squeue | grep $jobid)" ]]; then
        echo "### Job $jobid finished!"
        kill -2 $tail_pid
        break
    fi  
done   

I also tried another version where tail was in the main process, and the while loop was running in a child process instead, which killed the main process once the job ended, but it didn't work out. Either way, the script never terminates.

  • I suspect tail | tee in Bash runs a subshell, and does not return the PID of either the tail or the tee. SIGINT (kill -2) will be ignored by that subshell. Commented May 31, 2023 at 9:50
  • @Paul_Pedant Correct! Knowing this, I found the answer: stackoverflow.com/a/8048493/5099168. Thanks for the tip! Commented May 31, 2023 at 10:09
  • I'm not sure if such things as "cross-site duplicates" exist, but turns out the core problem is exactly this: How to get the PID of a process that is piped to another process in Bash? Commented May 31, 2023 at 10:19
  • By the way - you don't need the loop waiting for $outf to appear - you can use tail's -F flag (equivalent to --follow=name --retry in GNU tail) instead of -f, which will keep trying to open the file until it appears. Commented May 31, 2023 at 11:19
  • @aviro Ah, that's actually pretty cool. Commented May 31, 2023 at 11:26
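
The -F suggestion in the comments above can be tried without a cluster. This is only a minimal sketch, assuming GNU tail; the temporary directory and the file names job.out and captured.txt are made up for the demo and stand in for the real SLURM output file:

```shell
#!/bin/bash
# Demo only: a temp directory stands in for the job's working directory.
tmpdir=$(mktemp -d)
outf="$tmpdir/job.out"           # hypothetical output file; does not exist yet
captured="$tmpdir/captured.txt"  # hypothetical file collecting what tail prints

# -F keeps retrying until the file shows up, so no "until [ -f ... ]" loop
# is needed. stderr is silenced because tail complains about the missing
# file while it waits.
tail -F -n 100000 "$outf" > "$captured" 2>/dev/null &
tail_pid=$!                      # no pipe here, so this is tail's own PID

sleep 2
echo "first line" > "$outf"      # the "job" starts writing
sleep 3                          # give tail time to notice the new file

kill "$tail_pid"                 # plain SIGTERM
wait "$tail_pid" 2>/dev/null
```

Because tail is started before the file exists, the "first line" written later still ends up in captured.txt.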

1 Answer

Thanks to @Paul_Pedant's comment, I managed to find the issue. As I was piping tail to tee in my original script, $! contained the PID of tee, not tail, so only tee was killed. tail would then get a SIGPIPE once tee is gone, but apparently that is not sufficient to stop it.

The solution is in the following answer: https://stackoverflow.com/a/8048493/5099168

Implemented in my script, the relevant lines take the following shape:

tail -f -n 100000 "$outf" "$errf" > >(tee "$logf") &
tail_pid=$!
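
To check the behaviour without SLURM, here is a minimal self-contained sketch. It assumes bash and a tail that supports -f; the temp file job.out is a stand-in for the real job output, and the sleeps merely simulate a job that writes one more line and then finishes:

```shell
#!/bin/bash
# Demo only: a temp file stands in for the job's stdout file.
tmpdir=$(mktemp -d)
outf="$tmpdir/job.out"        # hypothetical stand-in for the SLURM output file
logf="${outf}.bkp"
echo "line 1" > "$outf"

# With process substitution, tail itself is the background command,
# so $! holds tail's PID (with "tail | tee &" it would be tee's PID).
tail -f -n 100000 "$outf" > >(tee "$logf") &
tail_pid=$!

sleep 2
echo "line 2" >> "$outf"      # the "job" writes some more output
sleep 2                       # let tail pick it up

# "Job finished": stop tail with the default SIGTERM.
kill "$tail_pid"
wait "$tail_pid" 2>/dev/null
tail_status=$?                # 143 = 128 + 15 (SIGTERM)

sleep 1                       # tee sees EOF once tail is gone and exits
echo "### tail exited with status $tail_status"
```

Once tail is gone, the pipe to tee is closed, so tee exits by itself and the script can finish.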
  • It's not only the SIGPIPE; the tail command will only get a SIGPIPE if it tries to write something to the pipe. If there are no new lines in the file it follows, it will not try to write anything and won't receive the signal. Also, I'm not sure the SIGINT signal you're sending (-2) would work, since bash disables SIGINT on background processes. Commented May 31, 2023 at 10:22
  • @aviro That makes even more sense. After the job is killed, there are no new lines, so tail gets nothing. Of course it didn't stop. Re: sigint, it actually worked, but you're right, there's no need for it. It was a leftover from another version where the loop was running in the background and tail in the main process. Commented May 31, 2023 at 10:29
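
The SIGPIPE point from the comments above can be seen in isolation. This is a rough sketch assuming bash (it relies on PIPESTATUS); the two brace groups are stand-ins for tail, and true plays the role of the tee that has already been killed:

```shell
#!/bin/bash
# Writer 1: the reader (true) exits at once, but the writer never writes
# anything afterwards, much like tail -f on a finished job's output.
{ sleep 2; } | true
idle_status=${PIPESTATUS[0]}   # 0: no write attempted, so no SIGPIPE

# Writer 2: same setup, but it tries to write after the reader is gone.
{ sleep 2; echo "late line"; } 2>/dev/null | true
late_status=${PIPESTATUS[0]}   # non-zero; usually 141 = 128 + 13 (SIGPIPE)

echo "idle writer: $idle_status, late writer: $late_status"
```

The idle writer is never signalled, which matches what happened in the original script: after the job ended there were no new lines, so tail never touched the broken pipe.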
