I have an mpi4py program that runs well with mpiexec -np 30 python3 -O myscript.py, at 100% CPU usage on each of the 30 CPUs.
Now I am launching 8 instances with mpiexec -np 16 python3 -O myscript.py. That should be fine: I have 64 cores with 4 units each (nproc shows 256). The load is near 128 (8x16).
Nothing else is running and most of the terabyte of RAM is free, but most of my processes are at 25%-30% CPU usage in state S (interruptible sleep). My mpi4py code is a loop that starts with a bcast of a value computed by the rank 0 node, followed by a scatter and a gather, twice. In between, the code that runs on each node uses onnx, numba, tensorflow.keras and other libraries (with OPENMPI_NUM_THREADS=1 and an onnx configuration that avoids further parallelisation). I do not use GPUs (everything is configured to use CPUs).
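To make the question concrete: one way to see which part of such a loop is waiting rather than computing is to compare wall-clock time with CPU time per section. The following is only a minimal sketch using the standard library; the names section, wall, cpu and cpu_fraction are hypothetical helpers, not part of mpi4py.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

wall = defaultdict(float)  # elapsed wall-clock seconds per section
cpu = defaultdict(float)   # CPU seconds used by this process per section


@contextmanager
def section(name):
    """Accumulate wall-clock and CPU time spent inside the with-block."""
    w0, c0 = time.perf_counter(), time.process_time()
    try:
        yield
    finally:
        wall[name] += time.perf_counter() - w0
        cpu[name] += time.process_time() - c0


def cpu_fraction(name):
    """CPU time / wall time for a section; values far below 1 mean waiting."""
    return cpu[name] / wall[name] if wall[name] > 0 else 0.0
```

Each step of the loop (the bcast, each scatter and gather, and the compute part) would be wrapped in its own with section("..."): block, and each rank would print cpu_fraction for every section at the end. Note that time.process_time() sums CPU time over all threads of the process, so the ratio can exceed 1 if a library starts extra threads.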
It is not clear to me (at least I cannot find much online) how to narrow down where in my Python program an MPI execution is waiting. Normally, I would run python -m cProfile -o prof, but I am unsure how to do that with mpiexec so that I get sensible (non-mangled) output files.
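For illustration, the kind of per-process profiling meant here could also be done from inside the script, so that every process writes its own file. This is a sketch under assumptions: it relies on the launcher exporting the rank in an environment variable (OMPI_COMM_WORLD_RANK under Open MPI, PMI_RANK under MPICH/Hydra) and falls back to the PID alone otherwise; profile_to_file and rank_label are hypothetical names.

```python
import cProfile
import os


def rank_label():
    """Build a label that is unique per process, even across several mpiexec runs."""
    # Assumption: the MPI launcher sets one of these variables for each rank.
    for var in ("OMPI_COMM_WORLD_RANK", "PMI_RANK"):
        if var in os.environ:
            return f"rank{os.environ[var]}.pid{os.getpid()}"
    return f"pid{os.getpid()}"  # fallback: the PID alone is still unique


def profile_to_file(func, prefix="prof"):
    """Run func() under cProfile and write the stats to '<prefix>.<label>'."""
    path = f"{prefix}.{rank_label()}"
    profiler = cProfile.Profile()
    try:
        profiler.runcall(func)
    finally:
        profiler.dump_stats(path)  # written even if func() raises
    return path
```

With a hypothetical entry point main() in myscript.py, the script would end with profile_to_file(main), and each resulting file could be inspected separately with python3 -m pstats. Because 8 mpiexec instances would each have a rank 0, the PID is included in the file name to keep them apart.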
The related question "Process is in interruptible sleep - how to find out what it is waiting for" uses a technique for C programs. Is there something similar for identifying where in a Python program sleep time is spent?
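One candidate for a Python-level equivalent might be the standard library's faulthandler module, which can dump the Python traceback of all threads when the process receives a signal. The sketch below assumes a Unix system and that SIGUSR1 is not used for anything else in the program; enable_stack_dump is a hypothetical helper name.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(signum=signal.SIGUSR1, file=sys.stderr):
    """On `signum`, write the Python traceback of every thread to `file`.

    `file` must stay open for as long as the handler is registered.
    """
    faulthandler.register(signum, file=file, all_threads=True)
```

After calling enable_stack_dump() once at the start of the script, sending kill -USR1 to the PID of a process shown in state S should print the Python line it is currently sitting in. If the process is blocked inside a C extension, the traceback only shows the Python line that made the call, not the C frames below it.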
With htop you might observe that only 16 out of 64 cores are busy.
You could try mpiexec -np 16 bash -c 'python3 -O -m cProfile -o prof.$$ myscript.py'. The important part is that $$ must be evaluated by the bash instance started by mpiexec, so that you get unique PIDs. I'm not convinced that this will help you solve the problem, so I just put it as a comment.