
I have an mpi4py program, which runs well with mpiexec -np 30 python3 -O myscript.py at 100% CPU usage on each of the 30 CPUs.

Now I am launching 8 instances with mpiexec -np 16 python3 -O myscript.py. That should be fine: I have 64 cores with 4 hardware threads each, and nproc shows 256. The load is near 128 (8x16).

Nothing else is running and most of the terabyte of RAM is free, but most of my processes are at 25%-30% CPU usage in state S (interruptible sleep). My mpi4py code is a loop that first broadcasts (bcast) a value computed by the rank 0 node, then performs a scatter and a gather twice. In between, the code running on each rank uses onnx, numba, tensorflow.keras and other libraries (with OPENMPI_NUM_THREADS=1 and an onnx configuration that avoids further parallelisation). I do not use GPUs (everything is configured to use CPUs).

It is not clear to me (at least I cannot find much online) how to narrow down where in my Python program an MPI execution spends its time. Normally, I would run python3 -m cProfile -o prof myscript.py, but I am unsure how to do that with mpiexec to get sensible (non-mangled) output files?

The related question "Process is in interruptible sleep - how to find out what it is waiting for" uses a technique for C programs. Is there something similar for identifying where in a Python program sleep time is spent?

2 Comments
  • Which MPI implementation do you use? Does your MPI bind the processes to cores? By default, the 8 instances of mpiexec are not aware of the other mpiexec instances. They might bind your processes all to the same cores (binding options that avoid this are sketched below). With htop you might observe that only 16 out of 64 cores are busy. Commented Aug 26 at 15:08
  • For a non-mangled output of cProfile, you can execute mpiexec -np 16 bash -c 'python3 -O -m cProfile -o prof.$$ myscript.py'. The important part is that $$ must be evaluated by the bash instance started by mpiexec, so that you get unique PIDs (a rank-based variant is sketched below). I'm not convinced that this will help you solve the problem, so I just put it as a comment. Commented Aug 26 at 15:16
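
To test the binding hypothesis from the first comment: with Open MPI (an assumption, since the question does not name the implementation), you can disable binding or give each of the 8 instances a disjoint core set. A sketch with illustrative core ranges:

    # let the 8 jobs float across all 256 hardware threads
    mpiexec --bind-to none -np 16 python3 -O myscript.py

    # or pin each instance to its own cores (ranges are illustrative)
    mpiexec --cpu-set 0-15  --bind-to core -np 16 python3 -O myscript.py
    mpiexec --cpu-set 16-31 --bind-to core -np 16 python3 -O myscript.py

Instead of relying on shell PIDs, you can also tag each profile with the MPI rank from inside the script. A minimal sketch, assuming mpi4py; main() is a hypothetical stand-in for the existing program body:

    import cProfile
    from mpi4py import MPI

    def main():          # stand-in for the existing program body
        pass

    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    # one stats file per rank: prof.0 ... prof.15
    profiler.dump_stats(f"prof.{MPI.COMM_WORLD.Get_rank()}")

The files can then be inspected one rank at a time, e.g. with python3 -m pstats prof.0.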

1 Answer


(1.) Use gdb to attach to a worker process and get a stack trace. Do this several times. It’s random sampling, but it will give you an idea of the typical call stack.

(2.) Run your current app with fewer cores and fewer processes, so there is no sleeping. Ramp it up slowly until it sleeps, so you can explain its behavior.

(3.) Your app is complex, too complex, and uses many libraries, each of which tries to be parallelism-aware. Strip it down to the ground, create a simple app whose performance you understand, then add one library at a time while you keep measuring. A minimal skeleton is sketched below.
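
A minimal sketch of such a stripped-down app, assuming the bcast/scatter/gather loop described in the question; compute() and the iteration count are placeholders, not taken from the original code. It separates time spent inside MPI calls from time spent computing, which is the split you want to watch as you add libraries back:

    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    def compute(x):              # placeholder for the onnx/numba/keras work
        return x * x

    t_mpi = t_work = 0.0
    for it in range(100):        # placeholder iteration count
        value = it if rank == 0 else None
        t0 = time.perf_counter()
        value = comm.bcast(value, root=0)
        t_mpi += time.perf_counter() - t0

        for _ in range(2):       # two scatter/gather rounds per iteration
            chunks = [value] * size if rank == 0 else None
            t0 = time.perf_counter()
            item = comm.scatter(chunks, root=0)
            t_mpi += time.perf_counter() - t0

            t0 = time.perf_counter()
            result = compute(item)
            t_work += time.perf_counter() - t0

            t0 = time.perf_counter()
            comm.gather(result, root=0)
            t_mpi += time.perf_counter() - t0

    print(f"rank {rank}: {t_mpi:.1f}s in MPI, {t_work:.1f}s computing")

If t_mpi dominates, the ranks spend their time sleeping inside collectives, waiting for a straggler or a contended resource, rather than burning CPU; that matches the 25%-30% usage in state S.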


The general rule for benchmarking is to "be gentle": start at a low throughput that is well understood, then ramp up to "break it".

There is a bottleneck somewhere, perhaps a component waiting on a mutex. As you keep ramping up, at some point you'll hit an inflection point where throughput takes a hit. Study that point and identify the bottleneck. Then you can predict how throughput will scale just past that critical point, and you can also dream up new designs that shift the bottleneck elsewhere.


2 Comments

Are you saying gdb will point me to a Python code line? I am not aware of this feature.
If the sleep is caused by a library (like MPI while waiting for a message?), you will see the corresponding MPI function in your stack trace. You don't need to re-attach for step (1) multiple times: just continue, interrupt (Ctrl+C), bt, and repeat several times (a sample session is sketched below).
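
A sketch of such a sampling session; 12345 stands for the PID of one of the sleeping workers, and the py-bt command additionally requires the CPython gdb extensions (e.g. a python3-dbg or debuginfo package):

    gdb -p 12345      # attach to a sleeping rank
    (gdb) bt          # C-level stack: look for MPI_Bcast, MPI_Scatter, ...
    (gdb) py-bt       # Python-level stack, if the gdb extensions are loaded
    (gdb) continue
    ^C                # let it run for a moment, then interrupt...
    (gdb) bt          # ...and sample again; repeat a few times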
