
Our Python application hangs on two particular machines after 10-20 minutes of use, and htop shows 100% CPU usage. I used PyStack to get a stack trace of the running process. The Python side of the trace shows nothing interesting: it is just some dictionary lookup, and the Python frames are in different code each time it hangs. At the last call, however, PyStack shows that the process is stuck at this particular line in the CPython source code (the while loop):

https://github.com/python/cpython/blob/v3.12.3/Modules/_asynciomodule.c#L3594

    module_traverse(PyObject *mod, visitproc visit, void *arg)
    {
        asyncio_state *state = get_asyncio_state(mod);
    
        Py_VISIT(state->FutureIterType);
        Py_VISIT(state->TaskStepMethWrapper_Type);
        Py_VISIT(state->FutureType);
        Py_VISIT(state->TaskType);
    
        Py_VISIT(state->asyncio_mod);
        Py_VISIT(state->traceback_extract_stack);
        Py_VISIT(state->asyncio_future_repr_func);
        Py_VISIT(state->asyncio_get_event_loop_policy);
        Py_VISIT(state->asyncio_iscoroutine_func);
        Py_VISIT(state->asyncio_task_get_stack_func);
        Py_VISIT(state->asyncio_task_print_stack_func);
        Py_VISIT(state->asyncio_task_repr_func);
        Py_VISIT(state->asyncio_InvalidStateError);
        Py_VISIT(state->asyncio_CancelledError);
    
        Py_VISIT(state->scheduled_tasks);
        Py_VISIT(state->eager_tasks);
        Py_VISIT(state->current_tasks);
        Py_VISIT(state->iscoroutine_typecache);
    
        Py_VISIT(state->context_kwname);
    
        // Visit freelist.
        PyObject *next = (PyObject*) state->fi_freelist;
        while (next != NULL) {
            // stuck inside this loop
            PyObject *current = next;
            Py_VISIT(current);
            next = (PyObject*) ((futureiterobject*) current)->future;
        }
        return 0;
    }

I believe this part of the code has something to do with garbage collection. What can I learn from this to troubleshoot the issue? Where should I look next?
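
For reference, the freelist loop quoted above walks a singly linked list in which the `future` field is reused as the "next" pointer, so it only ends when that chain reaches NULL. The following is a minimal Python model of that traversal, not CPython code: `Node`, `build_list`, and `freelist_has_cycle` are hypothetical names, and the idea that the freelist somehow ends up containing a cycle is only an unconfirmed hypothesis that would be consistent with a 100% CPU spin inside the loop.

```python
from typing import Optional


class Node:
    """Stand-in for a freelist entry; `next` models the reused `future` field."""

    def __init__(self) -> None:
        self.next: Optional["Node"] = None


def build_list(length: int) -> list[Node]:
    """Build a NULL-terminated chain of `length` nodes and return them in order."""
    nodes = [Node() for _ in range(length)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes


def freelist_has_cycle(head: Optional[Node]) -> bool:
    """Floyd's tortoise-and-hare check.

    Returns True if following `next` from `head` never reaches None,
    which is exactly the case where `while (next != NULL)` cannot exit.
    """
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


healthy = build_list(3)

corrupted = build_list(3)
# Hypothetical corruption: the tail points back at the head.
corrupted[-1].next = corrupted[0]
```

On the real process the same question can only be answered from a debugger or core dump, for example by following the `future` pointers starting at `state->fi_freelist` and checking whether an address repeats.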

  • 1
    My guess is that you are hitting some kind of "deadlock": for example, one asyncio task could be awaiting something which is indirectly awaiting the task itself. My suggestion is to take your full code and strip it out in batches, replacing I/O calls with asyncio.sleep, until you get a minimal amount of Python code which still shows the current behavior. (Then either update the question or, more likely, you will already have your answer.) Commented Jul 8, 2024 at 16:21
  • 2
    @jsbueno It only happens on our test servers and only after 10-20 minutes, and a redeploy takes another 30 minutes, so it's really hard to do such things. The code is also far too large for that kind of ablation study. What I have found is that if I use the pure-Python version of the asynciomodule instead of the C one (I just delete the asynciomodule.so library), it seems to run without a hiccup. It was also running fine before we migrated to Python 3.12. So it's likely some kind of bug in CPython. Commented Jul 8, 2024 at 19:00
  • 2
    "So it's likely some kind of bug in CPython" - well, I have to agree with that. Unfortunately, it will be hard to pinpoint. Commented Jul 8, 2024 at 19:24
  • @jsbueno Downgrading to Python 3.11.9 fixed the issue for us. Commented Jul 15, 2024 at 11:24
  • 1
    Yes. The problem is that this issue will keep looming there. Did you file a bug report for CPython? If not, please do so. Commented Jul 15, 2024 at 12:04
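
A less invasive way to try the pure-Python asyncio fallback mentioned in the comments, without deleting the compiled extension, might be to block the `_asyncio` import before anything imports `asyncio`. This is a sketch under assumptions: it relies on the standard behavior that a `None` entry in `sys.modules` makes the import raise `ImportError`, and on `asyncio` falling back to its Python classes when that import fails. `CHECK` and `uses_pure_python_asyncio` are hypothetical names, and the check runs in a child interpreter so that it does not disturb the current process.

```python
import subprocess
import sys

# Code run in a fresh interpreter. The first two lines are the part that
# would go at the very top of the application's entry point.
CHECK = """
import sys
sys.modules["_asyncio"] = None  # any later `import _asyncio` raises ImportError
import asyncio
print(asyncio.Future is asyncio.futures._PyFuture)
print(asyncio.Task is asyncio.tasks._PyTask)
"""


def uses_pure_python_asyncio() -> bool:
    """Return True if the child interpreter ended up on the Python Future/Task."""
    result = subprocess.run(
        [sys.executable, "-c", CHECK],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split() == ["True", "True"]


if __name__ == "__main__":
    print("pure-Python asyncio in use:", uses_pure_python_asyncio())
```

This is a diagnostic workaround rather than a fix: the pure-Python implementation is slower, and it only sidesteps the C code path where the hang was observed.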
