
I've written an MQTT client for an embedded system running an RTOS (i.e. not POSIX; typically FreeRTOS). There's no pthread in the system, and only 32-bit atomic instructions are supported.

The client's multi-thread/task usage is like this:

  1. One thread does the receiving/connecting/subscribing work (it calls a callback upon message reception). It's called the event loop thread.
  2. N threads can publish messages (this can happen in the event loop thread too, but not from within the receiving code).

The client holds a pointer to an implementation (the PImpl idiom), and that implementation is deleted upon error. A reconnect to the server creates a new instance of that implementation.

In most use cases there's only one thread in the system, so I don't want to protect the implementation pointer with a lock/mutex/semaphore that would have to be taken on every access in the event loop.

So, in order to handle the case of a publish from another thread, I've designed it as follows:

  1. The publishing thread only needs the implementation's pointer to send (thread-local) data to the implementation's socket. Socket usage is multithread-safe here, since only the publishing thread writes to the socket while the client is not in error.
  2. The event loop thread should detect an error and close/destroy the implementation only when no publishing thread is publishing.
  3. While it's doing so, no publishing thread can publish.

I've written an implementation like this pseudocode (there's a link at the bottom of the post to a real, working minimal example):

#include <atomic>
#include <cstdint>
#include <sched.h> // sched_yield()

using std::atomic;
using u32 = std::uint32_t;

struct Impl
{
   [...]
};

struct A
{
    atomic<u32> usage_count = 0;
    atomic<u32> errored = true;
    Impl *      impl = nullptr;

    void exchangeUsage(int step, u32 error)
    {
       u32 previous = usage_count.load();
       while (true)
       {
         u32 next = previous + step;
         if (usage_count.compare_exchange_strong(previous, next))
         {
            errored |= error; // atomic OR, so an error is latched together with the count update
            break;
         }
       }
    }

    Impl * acquire()
    {
       if (errored.load()) return nullptr; // Precheck: don't acquire impl if an error is already signaled
       exchangeUsage(1, false);
       // Postcheck: if the error happened during the CAS, we need to account for it
       if (errored.load()) {
          exchangeUsage(-1, true); // Release our usage since we won't use the implementation
          return nullptr;
       }
       return impl;
    }

    void release(bool withError) { exchangeUsage(-1, withError); }

    void closeIfError(bool errorHappened) {
        bool inError = errored.load();
        if (!inError && !errorHappened) return;

        errored.store(1); // Mark the object as errored so that, from now on, no publisher can acquire it

        // Wait until the usage count is 1 again, meaning all publishing threads are done with the implementation
        u32 previous = usage_count.load();
        while (previous != 1)
        {
            // Yield here
            sched_yield();
            previous = usage_count.load();
        }
        // OK, at this point no publisher can acquire anymore since the object is errored
        delete impl; impl = nullptr;
    }
    void resetState() { usage_count = 1; errored = false; }
};

A client;
// Usage is like this for publishing thread:
void publish()
{
    Impl * impl = client.acquire();
    if (impl) { bool error = impl->publish(...); client.release(error); }
}

// Usage is like this in the event loop thread
void eventLoop()
{
    while(true) {
      bool error = receiveAndCallCallback(client.impl);
      if (error) 
      {
          client.closeIfError(error);
          client.impl = reconnect(...);
          client.resetState();
      }
    }
}

// main() will call client.resetState() before creating threads.

I've tested this code with both GCC's thread sanitizer and Valgrind's DRD, and it runs without errors (Valgrind only complains about unrelated printf issues, not this code).

The question is therefore:

  1. Is this scheme safe to use?
  2. Do you spot a logic error in the pattern?

Minimal reproducible working example code on Godbolt, with a main which runs and prints some output.

2 Comments

  • [...] looks like a placeholder for some future code rather than real code for review. Also, atomic<> and u32 aren't defined, which leaves reviewers guessing. (Commented Oct 17 at 15:09)
  • usage is also never defined in exchangeUsage(), and error is never defined in acquire(), etc. (Commented Oct 17 at 21:41)

1 Answer


It's not safe

Consider the following order in which the two threads execute:

Event loop:                       Publisher:
-----------                       ----------
bool error = receive(…);
                                  Impl* impl = client.acquire();
                                    if (errored.load()) …;
if (error) {
  closeIfError(error);
                                    exchangeUsage(1, false);
  client.impl = reconnect(…);
  client.resetState();
                                    if (errored.load()) {…}
                                    return impl;

In the publisher thread, the first errored.load() returns false, because closeIfError() has not been called yet. Then the event loop thread calls that function, which sets errored to true. However, since the publisher thread did not increment usage_count yet, closeIfError() will delete impl and return. Then the publisher thread increments usage_count. Then the event loop thread assigns a new value to impl, and sets usage_count to 1 and errored to false. The second call to errored.load() in the publisher thread will therefore return false, and it returns impl.

There is nothing wrong with impl at this point; the publisher gets a valid pointer to the new instance, and there is no way it could have read impl before the earlier delete. However, usage_count is still 1, even though the publisher is using impl. That means the next time an error happens in the event loop thread, it can delete the impl that the publisher is still using. Or, if the publisher calls release() first, usage_count drops to 0, and if an error happens then, closeIfError() will never complete, because it waits for the count to return to exactly 1.

Consider making impl a std::atomic<Impl*>. Then, in closeIfError(), first do an impl.store(nullptr), and then wait for usage_count to drop to 1. At that point you know there are no users of the previous impl anymore, so you can safely delete it and then do impl.store(reconnect(…)). You can get rid of errored entirely.
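
For illustration, a minimal sketch of that suggestion (the name closeAndReconnect(), the placeholder Impl/reconnect() and the yield call are stand-ins for the question's own pieces; it keeps the original convention that usage_count == 1 is the event loop's baseline, and assumes a 32-bit target so atomic<Impl*> stays a single-word atomic):

#include <atomic>
#include <cstdint>
#include <thread> // std::this_thread::yield() stands in for the RTOS yield / select-with-timeout

struct Impl { bool publish() { return true; } }; // placeholder
Impl* reconnect() { return new Impl; }           // placeholder

struct A
{
    std::atomic<std::uint32_t> usage_count{1};   // 1 = the event loop's own baseline
    std::atomic<Impl*>         impl{nullptr};

    Impl* acquire()                              // called by publishing threads
    {
        usage_count.fetch_add(1);
        Impl* p = impl.load();
        if (p == nullptr)                        // closed or being replaced: back out
            usage_count.fetch_sub(1);
        return p;
    }

    void release() { usage_count.fetch_sub(1); }

    void closeAndReconnect()                     // event loop thread only
    {
        Impl* old = impl.exchange(nullptr);      // from now on acquire() returns nullptr
        while (usage_count.load() != 1)          // wait for in-flight publishers to release
            std::this_thread::yield();
        delete old;
        impl.store(reconnect());
    }
};

With the default sequentially consistent ordering, a publisher either increments usage_count before the event loop starts waiting (so the old impl stays alive until release()), or it reads nullptr or the freshly reconnected pointer and never touches the deleted instance; the errored flag disappears.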

Consider implementing std::shared_mutex instead

Your struct A is responsible both for holding the pointer to some data and for the atomics that grant access to that data. I would remove the data part from it entirely, and focus on implementing the equivalent of std::shared_mutex. In particular:

  • try_lock_shared() replaces your acquire().
  • lock() is called by the event loop thread on error, and will hold it while it deletes the old impl and replaces it with a new one.

If you say: "but I wanted my code to be mutex-free!", then please realize that what you implemented is a mutex, and conversely a mutex is just an atomic variable to track if it's locked, and some way to wait for it to be unlocked (which in your case is the while-loop that calls sched_yield()).
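
As a sketch of what that could look like with only 32-bit atomics, here is a hypothetical SharedSpinMutex whose method names mirror std::shared_mutex; it assumes a single writer (the event loop) and spins with a yield where the real system would sleep or use an RTOS primitive:

#include <atomic>
#include <cstdint>
#include <thread> // std::this_thread::yield() stands in for the RTOS yield

// One 32-bit word: the top bit marks the (single) writer, the low bits count readers.
class SharedSpinMutex
{
    static constexpr std::uint32_t WRITER = 0x80000000u;
    std::atomic<std::uint32_t> state{0};

public:
    bool try_lock_shared()                       // publishers
    {
        std::uint32_t s = state.load();
        while ((s & WRITER) == 0)
        {
            if (state.compare_exchange_weak(s, s + 1))
                return true;                     // reader count incremented
        }
        return false;                            // a writer is active: skip the publish
    }

    void unlock_shared() { state.fetch_sub(1); }

    void lock()                                  // event loop, on error
    {
        std::uint32_t s = state.load();
        while (!state.compare_exchange_weak(s, s | WRITER)) {} // block new readers
        while ((state.load() & ~WRITER) != 0)                  // wait for readers to drain
            std::this_thread::yield();
    }

    void unlock() { state.fetch_and(~WRITER); }
};

The event loop would take lock() around deleting and recreating the impl; publishers would bracket impl->publish(...) with try_lock_shared()/unlock_shared(), skipping the publish when try_lock_shared() fails, which mirrors the acquire()/release() pair above.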

1 Comment

  • You're right that the scheme is a mutex. I implemented it this way because I can't rely on the OS (which is unknown to the library) to provide a common mutex primitive. As you've guessed, the sched_yield in the example implementation is only there for Godbolt; in the library it's an empty select with a timeout. Anyway, thanks for your analysis, it's very valuable. (Commented Oct 22 at 9:18)
