brpc from the Bottom Up (2): Lock-Free Queues and How execution_queue Works

Tannin724

Last modified: 2023-01-19 16:57:12

Categories: C++, rpc. Tags: c++, rpc

First published: 2023-01-19 16:50:46

Article link: https://blog.csdn.net/weixin_42162340/article/details/128726855

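This article discusses how critical sections are handled in multithreaded environments, covering the concepts, strengths, and weaknesses of mutexes, lock-free algorithms, and wait-free algorithms. It focuses on execution_queue in brpc, a multi-producer single-consumer queue with wait-free enqueueing and efficient producer-side performance, whose consumer side is not fully lock-free because of the broken-link problem. The article stresses the importance of shrinking critical sections and presents the basic usage and implementation principles of execution_queue.

Summary generated automatically by CSDN.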

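This article is a set of study notes on the article "百度C++工程师的那些极限优化（并发篇）" (Extreme Optimizations by Baidu C++ Engineers: Concurrency) and on the source code of brpc's execution_queue.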

Handling critical sections in multithreaded code

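Multithreading lets a program use multiple cores in parallel to improve throughput. The hardest part of multithreading is distributing subtasks and merging their results, which usually involves shared data. That brings in a central concept of multithreaded programming: protecting critical sections.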

Mutual Exclusion

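A mutex is the basic technique for protecting a critical section, and it is a pessimistic locking algorithm. When there is no contention its overhead is very small. Mutual exclusion has a typical global blocking problem: when the thread inside the critical section blocks, or is switched out by the operating system, a global execution gap opens. During this gap not only is the lock holder unable to make progress, but every thread that has not acquired the lock can only wait as well, so the blocking is amplified.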
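
As a minimal sketch of mutex-based protection (the function name `locked_count` and its parameters are illustrative and not taken from brpc), several threads can increment a shared counter with every increment performed under a std::mutex:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative helper: num_threads threads each add 1 to a shared counter
// per_thread times. Every increment runs inside a mutex-protected section.
long locked_count(int num_threads, int per_thread) {
    long counter = 0;
    std::mutex mu;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // Only one thread at a time gets past this line. If the holder
                // is descheduled here, every other thread waits with it.
                std::lock_guard<std::mutex> guard(mu);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling `locked_count(4, 10000)` returns the exact total because the increments are serialized; the cost is that a stalled lock holder stalls everyone.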

Lock Free

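Lock-free algorithms are in essence optimistic locking: synchronization is done with the CAS (compare-and-swap) primitive. Unlike a mutex, where only one thread is inside the critical section, every thread in the critical section keeps executing. The catch is that although every thread is running, only one of them makes effective progress; the others are arbitrated as failed and must roll back. The advantage over a mutex is that a thread that rolls back is not switched out; it simply retries. So although some threads fail, every failure corresponds to the success of some other thread, and the system as a whole keeps moving forward instead of stalling together as it can with a mutex.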
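
The optimistic pattern can be sketched with a CAS retry loop. This is only an illustration under assumed names (`cas_count`, `CasResult`); it uses compare_exchange_strong so that a failed attempt really does mean another thread changed the counter first:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct CasResult {
    long value;    // final counter value
    long retries;  // number of failed (rolled back) attempts
};

// Illustrative helper: each thread increments the counter with a CAS loop.
CasResult cas_count(int num_threads, int per_thread) {
    std::atomic<long> counter{0};
    std::atomic<long> retries{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                long seen = counter.load(std::memory_order_relaxed);
                // Try to publish seen + 1. On failure `seen` is refreshed with
                // the current value and the attempt is simply repeated.
                while (!counter.compare_exchange_strong(
                        seen, seen + 1, std::memory_order_relaxed)) {
                    retries.fetch_add(1, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return {counter.load(), retries.load()};
}
```

The `retries` field makes the wasted attempts visible: the final value is always exact, but under contention some work is thrown away and redone.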

Wait Free

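Lock-free means that when multiple threads contend for a critical section, at least one of them makes progress. Wait-free is stronger: every thread completes its own operation in a bounded number of steps, so all threads make progress rather than just one.
fetch_add is in effect a wait-free primitive, at least on hardware that provides a native atomic add (x86, for example). Even today, though, wait-free queues are rare, and wait-free implementations come with far more restrictions than lock-free ones. For example, spsc_queue in boost supports only a single producer and a single consumer. The execution_queue in brpc, covered later in this article, is a multi-producer single-consumer queue, but only its producer side is wait-free; its consumer side is not even lock-free.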
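
A small sketch of the wait-free idea with fetch_add follows. The helper name `claim_slots` is hypothetical; the point is that each thread obtains a unique index with a single atomic add and never retries:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative helper: threads claim array slots through one shared ticket
// counter. Returns the slots; each should have been written exactly once.
std::vector<int> claim_slots(int num_threads, int per_thread) {
    std::vector<int> slots(static_cast<std::size_t>(num_threads) * per_thread, 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // One atomic add hands this thread a unique index.
                // There is no loop around it and nothing to roll back.
                std::size_t idx = next.fetch_add(1, std::memory_order_relaxed);
                slots[idx] += 1;  // safe: no other thread received this idx
            }
        });
    }
    for (auto& w : workers) w.join();
    return slots;
}
```

Unlike the CAS loop, no attempt is ever wasted here, which matches the article's point that wait-free operations avoid the cost of failed speculation.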

Recap

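The original article benchmarks these three ways of protecting a critical section and finds that the mutex and the lock-free version are within the same order of magnitude, the wait-free version is an order of magnitude faster, and the lock-free version uses more CPU. The numbers make sense: the lock-free version pays for too much "wasted speculative execution". They show that although a lock operation is somewhat heavier than an atomic operation, it is hardly a monster; what really hurts performance is the loss of parallelism when the critical section forces execution to be serialized.
So when facing a critical-section problem, do not chase lock-free for its own sake; shrinking the critical section comes first. Also, a CAS-based operation is a kind of two-phase commit: you have to handle rollback when arbitration fails, and it places stricter requirements on the data structure.
The industry has many other excellent approaches to concurrency in multithreaded code, such as thread-local storage and techniques that rely on CPU features. My own knowledge here is limited, so I will not risk misleading anyone; this section mainly serves as an introduction leading to brpc's execution_queue.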
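
To illustrate the advice about shrinking the critical section, here is a hedged sketch in which `expensive` is a made-up stand-in for real work. Each thread does its heavy computation without holding the lock and enters the critical section only once, for a single addition:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for expensive per-item work: the sum 0 + 1 + ... + x.
long expensive(int x) {
    long s = 0;
    for (int i = 0; i <= x; ++i) s += i;
    return s;
}

// Each thread accumulates into a local variable and merges once under the lock.
long sum_with_small_critical_section(int num_threads, int per_thread) {
    long total = 0;
    std::mutex mu;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            long local = 0;
            for (int i = 0; i < per_thread; ++i) {
                local += expensive(i);  // no lock held during the heavy part
            }
            std::lock_guard<std::mutex> guard(mu);  // critical section: one add
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

The lock is still a plain mutex; the gain comes entirely from how little work is done while holding it.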

execution_queue

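execution_queue provides the ability to execute tasks asynchronously. In brpc it is used where multiple threads write to the same fd, so it is designed mainly for the multi-producer single-consumer case, and its producer side is wait-free. This article does not cover its other features, such as high-priority tasks jumping the queue or cancelling a task, and the bthread implementation that actually runs the tasks will be covered in a later article. Here I only show the most basic usage and then walk through the implementation of its wait-free queue.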

// Start a ExecutionQueue. If |options| is NULL, the queue will be created with
// the default options. 
// Returns 0 on success, errno otherwise
// NOTE: type |T| can be non-POD but must be copy-constructible
template <typename T>
int execution_queue_start(
        ExecutionQueueId<T>* id, 
        const ExecutionQueueOptions* options,
        int (*execute)(void* meta, TaskIterator<T>& iter),
        void* meta);

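execution_queue_start creates a queue and returns its id through the id parameter. execute is the function pointer we register to define how the consumer processes tasks; its iter parameter is a task iterator, and the consumer callback walks this iterator to process the tasks. The implementation is not covered here.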

// Thread-safe and Wait-free.
// Execute a task with default TaskOptions (normal task);
template <typename T>
int execution_queue_execute(ExecutionQueueId<T> id, 
                            typename butil::add_const_reference<T>::type task);

// Thread-safe and Wait-free.
// Execute a task with options. e.g
// bthread::execution_queue_execute(queue, task, &bthread::TASK_OPTIONS_URGENT)
// If |options| is NULL, we will use default options (normal task)
// If |handle| is not NULL, we will assign it with the handler of this task.
template <typename T>
int execution_queue_execute(ExecutionQueueId<T> id, 
                            typename butil::add_const_reference<T>::type task,
                            const TaskOptions* options);
template <typename T>
int execution_queue_execute(ExecutionQueueId<T> id, 
                            typename butil::add_const_reference<T>::type task,
                            const TaskOptions* options,
                            TaskHandle* handle);

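These are the functions this article focuses on. They execute a task asynchronously; the arguments are the id that identifies the queue and the task object, and options carries the task priority and whether the task may be executed in place (in_place_if_possible). Let us look at the internal implementation: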

    int execute(typename butil::add_const_reference<T>::type task,
                const TaskOptions* options, TaskHandle* handle) {
        if (stopped()) {
            return EINVAL;
        }
        TaskNode* node = allocate_node();
        if (BAIDU_UNLIKELY(node == NULL)) {
            return ENOMEM;
        }
        void* const mem = allocator::allocate(node);
        if (BAIDU_UNLIKELY(!mem)) {
            return_task_node(node);
            return ENOMEM;
        }
        new (mem) T(task);
        node->stop_task = false;
        TaskOptions opt;
        if (options) {
            opt = *options;
        }
        node->high_priority = opt.high_priority;
        node->in_place = opt.in_place_if_possible;
        if (handle) {
            handle->node = node;
            handle->version = node->version;
        }
        start_execute(node);
        return 0;
    }

execute mainly allocates and initializes a node, which is one node of the queue, and then calls start_execute, which does the real work:

void ExecutionQueueBase::start_execute(TaskNode* node) {
    node->next = TaskNode::UNCONNECTED;
    node->status = UNEXECUTED;
    node->iterated = false;
    if (node->high_priority) {
        // Add _high_priority_tasks before pushing this task into queue to
        // make sure that _execute_tasks sees the newest number when this 
        // task is in the queue. Although there might be some useless for 
        // loops in _execute_tasks if this thread is scheduled out at this 
        // point, we think it's just fine.
        _high_priority_tasks.fetch_add(1, butil::memory_order_relaxed);
    }
    TaskNode* const prev_head = _head.exchange(node, butil::memory_order_release);
    if (prev_head != NULL) {
        node->next = prev_head;
        return;
    }
    // Get the right to execute the task, start a bthread to avoid deadlock
    // or stack overflow
    node->next = NULL;
    node->q = this;

    ExecutionQueueVars* const vars = get_execq_vars();
    vars->execq_active_count << 1;
    if (node->in_place) {
        int niterated = 0;
        _execute(node, node->high_priority, &niterated);
        TaskNode* tmp = node;
        // return if no more
        if (node->high_priority) {
            _high_priority_tasks.fetch_sub(niterated, butil::memory_order_relaxed);
        }
        if (!_more_tasks(tmp, &tmp, !node->iterated)) {
            vars->execq_active_count << -1;
            return_task_node(node);
            return;
        }
    }

    if (nullptr == _options.executor) {
        bthread_t tid;
        // We start the execution thread in background instead of foreground as
        // we can't determine whether the code after execute() is urgent (like
        // unlock a pthread_mutex_t) in which case implicit context switch may
        // cause undefined behavior (e.g. deadlock)
        if (bthread_start_background(&tid, &_options.bthread_attr,
                                     _execute_tasks, node) != 0) {
            PLOG(FATAL) << "Fail to start bthread";
            _execute_tasks(node);
        }
    } else {
        if (_options.executor->submit(_execute_tasks, node) != 0) {
            PLOG(FATAL) << "Fail to submit task";
            _execute_tasks(node);
        }
    }
}

First the node's next, status, and iterated fields are initialized. The few lines after that are the wait-free algorithm itself:

    TaskNode* const prev_head = _head.exchange(node, butil::memory_order_release);
    if (prev_head != NULL) {
        node->next = prev_head;
        return;
    }
    // Get the right to execute the task, start a bthread to avoid deadlock
    // or stack overflow
    node->next = NULL;
    node->q = this;

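The queue is organized as a linked list, and only one variable, _head, is shared between threads. The code above is the enqueue operation: the atomic exchange stores the new node into _head and returns the old value of _head as prev_head. Two cases follow: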

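If prev_head is NULL, the queue can be considered empty. This is guaranteed by the atomicity of _head: the exchange here, together with some operations described later, ensures that whether _head is NULL tells us whether the queue is empty. Finding the queue empty means no other task is being executed, so this thread obtains the right to execute and can run the current task directly.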

        bthread_t tid;
        // We start the execution thread in background instead of foreground as
        // we can't determine whether the code after execute() is urgent (like
        // unlock a pthread_mutex_t) in which case implicit context switch may
        // cause undefined behavior (e.g. deadlock)
        if (bthread_start_background(&tid, &_options.bthread_attr,
                                     _execute_tasks, node) != 0) {
            PLOG(FATAL) << "Fail to start bthread";
            _execute_tasks(node);
        }
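Leaving aside the option-specific logic (priority, in-place execution, and so on, which are not the focus of this article), the default is to start a bthread that runs _execute_tasks asynchronously. bthread will be covered in a future article; for now you can think of it as an ordinary thread that goes on to run _execute_tasks.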

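As described above, a thread obtains the right to execute only when the queue is empty. If prev_head is not NULL, exchange has already updated _head atomically, so all that remains is to point the current node's next at prev_head to link it into the list. This thread does not hold the right to execute, so it simply returns.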
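
The two outcomes of the enqueue can be reproduced in a standalone sketch. `Node`, `MiniQueue`, and `unconnected()` are simplified stand-ins invented for illustration, not brpc types, and `next` is declared std::atomic here so the sketch stays well-defined in standard C++:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Node {
    int value;
    std::atomic<Node*> next;
    explicit Node(int v) : value(v), next(nullptr) {}
};

// Sentinel playing the role of TaskNode::UNCONNECTED in this sketch.
inline Node* unconnected() {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(-1));
}

struct MiniQueue {
    std::atomic<Node*> head{nullptr};

    // Returns true when the caller found the queue empty and therefore
    // holds the right to execute; false when it only linked its node.
    bool push(Node* node) {
        node->next.store(unconnected(), std::memory_order_relaxed);
        Node* prev = head.exchange(node, std::memory_order_release);
        if (prev != nullptr) {
            // Someone else is the consumer: link to the older node and leave.
            node->next.store(prev, std::memory_order_release);
            return false;
        }
        node->next.store(nullptr, std::memory_order_relaxed);
        return true;
    }
};
```

The first push into an empty queue returns true, and every later push returns false with its `next` pointing at the node that was the head before it.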

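Looking at the implementation, the enqueue path involves no retry loop and no waiting on other threads: regardless of the level of concurrency, each producer finishes in a fixed number of steps and the correctness of the enqueue is unaffected. That is why the producer side of this queue is wait-free. If you look carefully, though, there is a problem.
Enqueueing takes two steps: exchange updates _head, and then next is pointed at the old _head. The two steps together are not atomic. If a thread is switched out by the operating system right after the first step, the list is temporarily broken at that node.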

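This can only be resolved on the consumer side. Moreover, only the producer that owns the unlinked node knows what its next should be, so the consumer has no choice but to wait in a loop until the link is completed. How the consumer handles this is described below.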
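
The window between the two steps can be made visible deterministically by splitting the enqueue into two functions. Everything here (`publish`, `link`, `chain_is_broken`, the `Node` type) is hypothetical illustration code, not part of brpc:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Node {
    int value;
    std::atomic<Node*> next;
    explicit Node(int v) : value(v), next(nullptr) {}
};

// Sentinel meaning "published in head but not linked yet".
inline Node* unconnected() {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(-1));
}

// Step 1 of enqueue: make the node the new head, return the previous head.
Node* publish(std::atomic<Node*>& head, Node* node) {
    node->next.store(unconnected(), std::memory_order_relaxed);
    return head.exchange(node, std::memory_order_release);
}

// Step 2 of enqueue: connect the node to the previous head.
void link(Node* node, Node* prev) {
    node->next.store(prev, std::memory_order_release);
}

// Consumer-side view: walk from the newest node towards the oldest and
// report whether a node that is still unlinked is encountered.
bool chain_is_broken(Node* newest) {
    for (Node* p = newest; p != nullptr;) {
        Node* next = p->next.load(std::memory_order_acquire);
        if (next == unconnected()) return true;
        p = next;
    }
    return false;
}
```

If a producer has run `publish` but not yet `link`, `chain_is_broken` reports true for the current head; once `link` runs, the chain is whole again. In the real queue that gap is exactly what the consumer has to wait out.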

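Next comes _execute_tasks. This function is called only by the thread that obtained the right to execute, and it is the consumer-side logic that takes tasks from the queue and runs them: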

void* ExecutionQueueBase::_execute_tasks(void* arg) {
    ExecutionQueueVars* vars = get_execq_vars();
    TaskNode* head = (TaskNode*)arg;
    ExecutionQueueBase* m = (ExecutionQueueBase*)head->q;
    TaskNode* cur_tail = NULL;
    bool destroy_queue = false;
    for (;;) {
        if (head->iterated) {
            CHECK(head->next != NULL);
            TaskNode* saved_head = head;
            head = head->next;
            m->return_task_node(saved_head);
        }
        int rc = 0;
        if (m->_high_priority_tasks.load(butil::memory_order_relaxed) > 0) {
            int nexecuted = 0;
            // Don't care the return value
            rc = m->_execute(head, true, &nexecuted);
            m->_high_priority_tasks.fetch_sub(
                    nexecuted, butil::memory_order_relaxed);
            if (nexecuted == 0) {
                // Some high_priority tasks are not in queue
                sched_yield();
            }
        } else {
            rc = m->_execute(head, false, NULL);
        }
        if (rc == ESTOP) {
            destroy_queue = true;
        }
        // Release TaskNode until uniterated task or last task
        while (head->next != NULL && head->iterated) {
            TaskNode* saved_head = head;
            head = head->next;
            m->return_task_node(saved_head);
        }
        if (cur_tail == NULL) {
            for (cur_tail = head; cur_tail->next != NULL; 
                    cur_tail = cur_tail->next) {}
        }
        // break when no more tasks and head has been executed
        if (!m->_more_tasks(cur_tail, &cur_tail, !head->iterated)) {
            CHECK_EQ(cur_tail, head);
            CHECK(head->iterated);
            m->return_task_node(head);
            break;
        }
    }
    if (destroy_queue) {
        CHECK(m->_head.load(butil::memory_order_relaxed) == NULL);
        CHECK(m->_stopped);
        // Add _join_butex by 2 to make it equal to the next version of the
        // ExecutionQueue from the same slot so that join with old id would
        // return immediately.
        // 
        // 1: release fence to make join sees the newest changes when it sees
        //    the newest _join_butex
        m->_join_butex->fetch_add(2, butil::memory_order_release/*1*/);
        butex_wake_all(m->_join_butex);
        vars->execq_count << -1;
        butil::return_resource(slot_of_id(m->_this_id));
    }
    vars->execq_active_count << -1;
    return NULL;
}

inline bool ExecutionQueueBase::_more_tasks(
        TaskNode* old_head, TaskNode** new_tail, 
        bool has_uniterated) {

    CHECK(old_head->next == NULL);
    // Try to set _head to NULL to mark that the execute is done.
    TaskNode* new_head = old_head;
    TaskNode* desired = NULL;
    bool return_when_no_more = false;
    if (has_uniterated) {
        desired = old_head;
        return_when_no_more = true;
    }
    if (_head.compare_exchange_strong(
                new_head, desired, butil::memory_order_acquire)) {
        // No one added new tasks.
        return return_when_no_more;
    }
    CHECK_NE(new_head, old_head);
    // Above acquire fence pairs release fence of exchange in Write() to make
    // sure that we see all fields of requests set.

    // Someone added new requests.
    // Reverse the list until old_head.
    TaskNode* tail = NULL;
    if (new_tail) {
        *new_tail = new_head;
    }
    TaskNode* p = new_head;
    do {
        while (p->next == TaskNode::UNCONNECTED) {
            // TODO(gejun): elaborate this
            sched_yield();
        }
        TaskNode* const saved_next = p->next;
        p->next = tail;
        tail = p;
        p = saved_next;
        CHECK(p != NULL);
    } while (p != old_head);

    // Link old list with new list.
    old_head->next = tail;
    return true;
}

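The process is easiest to follow step by step.

There is a member variable _head and a local variable head. head points to the current node; if no task has been enqueued in the meantime, head and _head both point to the only node in the queue.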

    TaskNode* head = (TaskNode*)arg;

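Of course, new tasks may keep arriving. As described above, there is only one real consumer, namely the bthread running _execute_tasks, and from its point of view the queue may now contain several nodes: head always points to the current node, which is the first node that obtained the right to consume, while _head always points to the most recently enqueued node.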

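Now the tasks have to be consumed. This is a FIFO queue, so the node head points to was enqueued first and is consumed first. But the list is singly linked, with each node pointing to the one enqueued before it, so there is no way to walk from head towards the newer nodes. The list therefore has to be reversed, starting from _head.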

    TaskNode* p = new_head;
    do {
        TaskNode* const saved_next = p->next;
        p->next = tail;
        tail = p;
        p = saved_next;
        CHECK(p != NULL);
    } while (p != old_head);

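However, _head keeps changing, because it is shared and updated by multiple threads, and enqueues continue while the consumer is working. So a separate variable is needed to record the position of _head at that moment; the list up to that point is reversed and then consumed. Consuming means running the task of the node head points to and then advancing node by node.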
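
The reversal step can be tried in isolation on a plain singly linked list. This sketch mirrors the loop in _more_tasks but leaves out all concurrency; `Node`, `reverse_onto`, and `values_from` are names made up for the example:

```cpp
#include <cassert>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Reverse the nodes from new_head down to (but not including) old_head, then
// hook the reversed run behind old_head.
// Preconditions: old_head is reachable from new_head, new_head != old_head,
// and old_head->next is nullptr.
void reverse_onto(Node* old_head, Node* new_head) {
    Node* tail = nullptr;
    Node* p = new_head;
    do {
        Node* saved_next = p->next;
        p->next = tail;   // flip the link so it points towards newer nodes
        tail = p;
        p = saved_next;
    } while (p != old_head);
    old_head->next = tail;  // oldest node now leads to the next-oldest
}

// Collect the values reachable from `from`, in link order.
std::vector<int> values_from(const Node* from) {
    std::vector<int> out;
    for (const Node* p = from; p != nullptr; p = p->next) out.push_back(p->value);
    return out;
}
```

With nodes enqueued in the order 1, 2, 3, 4 (so 4 links to 3, 3 to 2, 2 to 1), calling `reverse_onto` on node 1 and node 4 leaves a list that can be walked from node 1 in enqueue order.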

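Call the recorded position of _head tail. While tasks are being consumed, _head keeps being updated by enqueues, so each time head has advanced as far as it can, we check whether tail still equals _head. If it does not, new tasks were enqueued in the meantime. If it does, and head has no more tasks to iterate, then every task has been executed and the right to execute can be given up.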

    TaskNode* new_head = old_head;
    TaskNode* desired = NULL;
    bool return_when_no_more = false;
    if (has_uniterated) {
        desired = old_head;
        return_when_no_more = true;
    }
    if (_head.compare_exchange_strong(
                new_head, desired, butil::memory_order_acquire)) {
        // No one added new tasks.
        return return_when_no_more;
    }
    CHECK_NE(new_head, old_head);
    // Above acquire fence pairs release fence of exchange in Write() to make
    // sure that we see all fields of requests set.

    // Someone added new requests.
    // Reverse the list until old_head.
    TaskNode* tail = NULL;
    if (new_tail) {
        *new_tail = new_head;
    }
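Because _head is shared between threads, this check has to be done with CAS: if tail == _head and nothing is left to iterate, _head is reset to NULL and the consumer finishes; otherwise tail is updated to the new _head.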
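
A simplified sketch of that check follows. It deliberately ignores the has_uniterated branch of the real _more_tasks, and `try_release` and `Node` are illustrative names rather than brpc APIs:

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    Node* next;
};

// Try to give up the right to execute. Succeeds only if head still equals
// cur_tail, i.e. nothing was enqueued since cur_tail was recorded.
// On failure, *latest receives the newest node so the caller can continue.
bool try_release(std::atomic<Node*>& head, Node* cur_tail, Node** latest) {
    Node* expected = cur_tail;
    if (head.compare_exchange_strong(expected, nullptr,
                                     std::memory_order_acquire)) {
        return true;  // queue is now marked empty
    }
    // compare_exchange_strong stored the current head into `expected`.
    *latest = expected;
    return false;
}
```

Doing the comparison and the reset in one atomic step is what prevents a task from being enqueued between "the queue looks empty" and "the consumer leaves".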

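The broken-link problem seen on the producer side shows up here, while the list is being traversed and reversed after tail has been determined: a node's next may still hold its initial value, TaskNode::UNCONNECTED. The consumer has no good way around this; it can only wait until the producer completes the link.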

        while (p->next == TaskNode::UNCONNECTED) {
            // TODO(gejun): elaborate this
            sched_yield();
        }

That is the basic implementation of execution_queue. Many details have been left out; interested readers can study them alongside the source code.
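
Putting the pieces together, the following is a rough standalone sketch of the same idea using only the standard library. It is not brpc's code and makes several simplifying assumptions: the thread that obtains the right to execute drains the queue inline instead of starting a bthread, `next` is a std::atomic, acq_rel ordering is used on `head_` so the shared log is handed over safely between consumers, and priorities, stopping, and node recycling are omitted. All names (`MiniExecQueue`, `run_demo`, and so on) are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

struct Node {
    int producer;
    int seq;
    std::atomic<Node*> next;
    Node(int p, int s) : producer(p), seq(s), next(nullptr) {}
};

// Sentinel for "published but not linked yet".
inline Node* unconnected() {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(-1));
}

class MiniExecQueue {
public:
    // Producer side: one exchange plus one store, no retry loop.
    // The thread that finds the queue empty drains it inline.
    void execute(Node* node) {
        node->next.store(unconnected(), std::memory_order_relaxed);
        Node* prev = head_.exchange(node, std::memory_order_acq_rel);
        if (prev != nullptr) {
            node->next.store(prev, std::memory_order_release);
            return;
        }
        node->next.store(nullptr, std::memory_order_relaxed);
        drain(node);
    }
    const std::vector<std::pair<int, int>>& consumed() const { return consumed_; }

private:
    void drain(Node* cur) {
        for (;;) {
            consumed_.emplace_back(cur->producer, cur->seq);  // "run" the task
            Node* next = cur->next.load(std::memory_order_relaxed);
            if (next == nullptr) {  // cur is the newest node linked so far
                if (!more_tasks(cur)) return;
                next = cur->next.load(std::memory_order_relaxed);
            }
            cur = next;
        }
    }
    // Returns false after resetting head_ to nullptr (nothing new arrived).
    // Otherwise reverses the newly arrived nodes and links them after old_head.
    bool more_tasks(Node* old_head) {
        Node* new_head = old_head;
        if (head_.compare_exchange_strong(new_head, nullptr,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire)) {
            return false;
        }
        Node* tail = nullptr;
        Node* p = new_head;
        do {
            Node* next = p->next.load(std::memory_order_acquire);
            while (next == unconnected()) {  // producer is between its two steps
                std::this_thread::yield();
                next = p->next.load(std::memory_order_acquire);
            }
            p->next.store(tail, std::memory_order_relaxed);
            tail = p;
            p = next;
        } while (p != old_head);
        old_head->next.store(tail, std::memory_order_relaxed);
        return true;
    }
    std::atomic<Node*> head_{nullptr};
    std::vector<std::pair<int, int>> consumed_;  // touched only by the consumer
};

// Runs num_producers threads submitting per_producer tasks each and checks that
// every task ran exactly once and in FIFO order per producer.
bool run_demo(int num_producers, int per_producer) {
    MiniExecQueue q;
    std::vector<std::vector<std::unique_ptr<Node>>> nodes(num_producers);
    for (int p = 0; p < num_producers; ++p)
        for (int s = 0; s < per_producer; ++s)
            nodes[p].push_back(std::unique_ptr<Node>(new Node(p, s)));
    std::vector<std::thread> producers;
    for (int p = 0; p < num_producers; ++p)
        producers.emplace_back([&q, &nodes, p] {
            for (auto& n : nodes[p]) q.execute(n.get());
        });
    for (auto& t : producers) t.join();
    const auto& log = q.consumed();
    if (log.size() != static_cast<std::size_t>(num_producers) * per_producer) return false;
    std::vector<int> expected(num_producers, 0);
    for (const auto& e : log)
        if (e.second != expected[e.first]++) return false;
    return true;
}
```

The only place a thread ever waits is the `yield` loop in `more_tasks`, which is the consumer-side wait on an unlinked node described above; the producer path in `execute` never loops.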

Summary

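The implementation of execution_queue is quite distinctive: enqueueing is wait-free, which gives it very high producer-side performance. The consumer side, however, is not even lock-free because of the broken-link problem: if a producer is switched out between its two steps, the stall propagates to the consumer and consumption as a whole has to wait. In the scenario of queueing writes to an fd, though, optimizing specifically for producer concurrency is a reasonable trade-off and yields better overall efficiency. All in all, execution_queue is a very clever design and well worth studying in depth.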
