We have a cluster at our university that is managed by SLURM. I have observed that some resources are sometimes not released, even though the jobs no longer show up in the squeue output. For example, a lot of CPUs on one of the nodes are still assigned to me even though I cancelled the allocation with scancel a couple of days ago. I want to find the leftover processes and kill them.
On my local machine I have generated public and private SSH keys with ssh-keygen, so I can now log in to any of those machines with ssh foo, ssh [email protected], ssh [email protected] ... but the node names do not follow a regular pattern. If I log in to one of those nodes and run sinfo, this is the result:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 3 mix baz[080-081,083]
debug up infinite 2 alloc grault,baz082
debug up infinite 13 idle baz[061-070],corge,bar,quux
gpu_p100 up infinite 1 mix baz080
gpu_titan-x up infinite 2 mix baz[081,083]
gpu_titan-x up infinite 1 alloc baz082
r730 up infinite 1 mix baz080
t630 up infinite 2 mix baz[081,083]
t630 up infinite 1 alloc baz082
r930 up infinite 1 alloc grault
m610 up infinite 10 idle baz[061-070]
r720 up infinite 1 idle corge
r815 up infinite 1 idle bar
sm1u up infinite 1 idle quux
main* up infinite 3 mix baz[080-081,083]
main* up infinite 2 alloc grault,baz082
main* up infinite 12 idle baz[061-070],bar,quux
where a bracket expression such as baz[081-083] refers to the 3 nodes baz081, baz082 and baz083.
Now, if I ssh into any of these nodes, I can list all the processes belonging to a specific user with:
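To make the bracket notation concrete, here is a minimal sketch of a shell function that expands such a node list into one name per line. The name expand_nodelist is made up for illustration, and the function only handles the simple "prefix[ranges]" form seen in the output above. On a machine with the Slurm tools installed, scontrol show hostnames 'baz[080-081,083]' should do the same job without any custom code.

```shell
#!/bin/sh
# expand_nodelist: hypothetical helper, not part of Slurm.
# Expands e.g. "baz[080-081,083],grault" into one node name per line.
# Assumes at most one [...] group per name and numeric ranges only.
expand_nodelist() {
    printf '%s\n' "$1" | awk '
    {
        depth = 0; token = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            if (ch == "[") depth = 1
            if (ch == "]") depth = 0
            # A comma outside brackets separates two node expressions.
            if (ch == "," && depth == 0) { expand(token); token = "" }
            else token = token ch
        }
        if (token != "") expand(token)
    }
    function expand(tok,    p, prefix, body, parts, m, k, r, lo, hi, j, fmt) {
        p = index(tok, "[")
        if (p == 0) { print tok; return }          # plain name, e.g. "grault"
        prefix = substr(tok, 1, p - 1)
        body = substr(tok, p + 1, length(tok) - p - 1)
        m = split(body, parts, ",")
        for (k = 1; k <= m; k++) {
            if (split(parts[k], r, "-") == 2) { lo = r[1]; hi = r[2] }
            else { lo = r[1]; hi = r[1] }
            # Keep the zero padding of the lower bound ("080" stays 3 digits).
            fmt = "%s%0" length(lo) "d\n"
            for (j = lo + 0; j <= hi + 0; j++) printf fmt, prefix, j
        }
    }'
}
```

For example, expand_nodelist 'baz[080-081,083]' prints baz080, baz081 and baz083 on separate lines.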
ps -A | grep user1
but that would take a lot of time. How can I automate this process:
- log in to one of the nodes
- run sinfo
- extract the information and make a list of strings from the last column of the sinfo output
- find all the running processes of a specific user, user1, and print them to the terminal
How can I write a script, preferably compatible with Cmder/ConEmu, to automate these steps?
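As a starting point, the steps might be sketched roughly as below. This is only a sketch under several assumptions: the function names list_nodes and report_user_processes are made up, the node names printed by sinfo are assumed to be directly reachable with key-based ssh from the machine running the script, and the script is assumed to run in a POSIX shell (for example the bash shipped with Cmder). It asks sinfo for a node-oriented listing (-N, -h, -o '%N') so that no bracket expressions need to be parsed.

```shell
#!/bin/sh
# Sketch only: function names and arguments are placeholders.

# Print every node name known to Slurm, one per line, without duplicates.
# sinfo -N: one line per node, -h: no header, -o '%N': node name only.
# $1 is any host on which sinfo can be run (e.g. "foo").
list_nodes() {
    ssh "$1" "sinfo -N -h -o '%N'" | sort -u
}

# For every node, print the processes owned by the given user.
# $1: host used to run sinfo, $2: user name to look for (e.g. "user1").
report_user_processes() {
    login_node=$1
    target_user=$2
    list_nodes "$login_node" | while read -r node; do
        echo "=== $node ==="
        # -n stops ssh from consuming the node list on stdin;
        # BatchMode/ConnectTimeout avoid hanging on password prompts or dead nodes.
        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" \
            "ps -u $target_user -o pid,etime,pcpu,args" \
            || echo "(no processes for $target_user, or node unreachable)"
    done
}

# Example call (commented out so that sourcing the file has no side effects):
#   report_user_processes foo user1
```

The output is one "=== node ===" header per node followed by that node's ps listing, which makes it easy to spot where leftover processes are still running before killing them by PID.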