
For some reason, when launching an MPI job with SLURM on a CentOS 8 cluster, SLURM always pins the MPI processes to CPUs starting from CPU 0.

Say there are 128 CPU cores on a compute node. I launch an MPI job asking for 64 CPUs on that node. Fine, it gets allocated the first 64 cores (the 1st socket) and runs there without problems.
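
For context, the job is submitted roughly like this (the partition name and the binary are placeholders, not my actual setup):

    #!/bin/bash
    #SBATCH --job-name=mpi64        # hypothetical job name
    #SBATCH --nodes=1               # single 2-socket, 128-core node
    #SBATCH --ntasks=64             # 64 MPI ranks, one core each
    #SBATCH --partition=compute     # placeholder partition name

    # srun inherits the allocation; --cpu-bind=cores asks SLURM to pin
    # each rank to its own core
    srun --cpu-bind=cores ./my_mpi_app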

Now, if I submit another 64-CPU MPI job to the same node, SLURM places it on the 1st socket again, so CPUs 0-63 are shared by both jobs while CPUs 64-127 on the 2nd socket are not used at all.

I have played with various MPI parameters to no avail. The only way I was able to pin the two jobs to different sockets was by using rank files with Open MPI. But that should not be necessary if SLURM works correctly.
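
For reference, the rankfile workaround for the second job looks roughly like this; node01 is a placeholder hostname, and the slot syntax is rank N=host slot=socket:core, so these lines pin ranks onto socket 1:

    rank 0=node01 slot=1:0
    rank 1=node01 slot=1:1
    rank 2=node01 slot=1:2
    rank 3=node01 slot=1:3

with the remaining ranks continuing the same pattern up to rank 63, launched via mpirun -np 64 --rankfile ./rankfile ./my_mpi_app.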

The consumable resource type in SLURM is CR_Core, and TaskPlugin=task/affinity.
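
The relevant slurm.conf excerpt, as far as these settings go (the node line is a typical definition for such a machine, not copied verbatim from my config):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    TaskPlugin=task/affinity
    # 2-socket, 128-core node as described above
    NodeName=node01 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1

(SLURM also allows combining the task plugins as TaskPlugin=task/affinity,task/cgroup, but that is not what this cluster runs.)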

If I run the same two MPI codes on the same node without SLURM, the same Open MPI allocates CPUs correctly.
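
By "without SLURM" I mean launching the two jobs directly on the node, along these lines (the binary name is a placeholder); Open MPI's --report-bindings prints each rank's actual core binding, which makes the placement easy to check:

    # two 64-rank jobs started directly on the node, side by side
    mpirun -np 64 --bind-to core --report-bindings ./my_mpi_app &
    mpirun -np 64 --bind-to core --report-bindings ./my_mpi_app &
    wait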

What can make SLURM behave in such a bizarre way?

  • This seems more like an observation than a question. Are there logs, configuration files, or error messages? Keeping them secret makes it very hard for us to help you. Edit your question to improve it. Commented Apr 14, 2021 at 20:09
  • Yes... there are no errors whatsoever. I guess the question is: what can make SLURM behave in such a bizarre way? Commented Apr 14, 2021 at 20:11
