EDIT TL;DR: Anyone considering the code below for production who can afford to require C++20 should use std::barrier instead, as suggested by G. Sliepen in his excellent answer.
I’m working on some OpenMP-parallelized C++ code made of 3 parts, with the constraint that no thread may begin part 3 before all threads have finished part 1. It is perfectly acceptable, however, for some threads to run part 2 while others are still running part 1, or for some threads to run part 3 while others are still running part 2. (See my question on StackOverflow for details.)
A synchronization barrier anywhere between part 1 and part 3 would satisfy the constraint, but there’s no need for such “hard” synchronization. So I thought it would be nice to have a “split” barrier: no thread can pass the second half before all threads have passed the first half.
I managed to implement such a thing with the following code:
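#include <condition_variable>
#include <mutex>

class split_barrier {
private:
    std::mutex m;                 // protects all members below
    std::condition_variable cv;
    int threads_in_section;       // threads between enter() and leave()
    int total_threads;            // set by init()
    bool may_enter;               // entry gate is open
    bool may_leave;               // exit gate is open
public:
    split_barrier():
        threads_in_section(0),
        total_threads(0),
        may_enter(false),
        may_leave(false)
    {}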
    void init(int threads) {
        std::lock_guard<std::mutex> lock(m);
        total_threads = threads;
        may_enter = true;
    }
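    // First half: blocks only while the previous round is still draining.
    void enter() {
        std::unique_lock lock(m);
        cv.wait(lock, [this]{return may_enter;});
        if (++threads_in_section == total_threads) {
            // Last thread in: close the entry gate, open the exit gate.
            may_enter = false;
            may_leave = true;
            lock.unlock();
            cv.notify_all();
        }
    }
    // Second half: blocks until every thread has called enter().
    void leave() {
        std::unique_lock lock(m);
        cv.wait(lock, [this]{return may_leave;});
        if (--threads_in_section == 0) {
            // Last thread out: close the exit gate, reopen the entry gate.
            may_leave = false;
            may_enter = true;
            lock.unlock();
            cv.notify_all();
        }
    }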
};
Then my code looks like:
#include <omp.h>  // omp_get_num_threads

int main() {
    split_barrier barrier;
    #pragma omp parallel
    {
        #pragma omp single
        barrier.init(omp_get_num_threads());
        part1();
        barrier.enter();
        part2();
        barrier.leave();
        part3();
    }
}
EDIT: Since it does not invalidate the only and accepted answer, I hope I am allowed to add that my “real” use-case looks more like:
int main() {
    split_barrier barrier;
    #pragma omp parallel
    {
        #pragma omp single
        barrier.init(omp_get_num_threads());
        while (…) {
            part1();
            barrier.enter();
            part2();
            barrier.leave();
            part3();
            #pragma omp barrier
            part4();
        }
    }
}
I consider synchronization code to be very error-prone. Is my code thread-safe?