Lock-Free Programming in C++: A Lock-Free Queue

贺志国
2023.7.11

The previous post presented several lock-free implementations of the simplest C++ data structure, the stack. A queue poses a challenge somewhat different from a stack, because Push() and Pop() operate on different parts of the structure, so their synchronization requirements differ: a modification at one end must be performed correctly and must also be visible at the other end. The queue therefore needs two Node pointers, head_ and tail_. Both are atomic variables, so multiple threads can access them concurrently without taking a lock.
[Figure: queue layout]
In our implementation, the queue is considered empty when head_ and tail_ point to the same node (called the dummy node).

Let us first analyze the single-producer/single-consumer case.

1. A lock-free queue under the single-producer/single-consumer model

The single-producer/single-consumer model means that at any given moment at most one thread calls Push() and at most one thread calls Pop(). The code for this case (in a file named lock_free_queue.h) is as follows:

#pragma once

#include <atomic>
#include <memory>

template <typename T>
class LockFreeQueue {
 public:
  LockFreeQueue() : head_(new Node), tail_(head_.load()) {}
  ~LockFreeQueue() {
    while (Node* old_head = head_.load()) {
      head_.store(old_head->next);
      delete old_head;
    }
  }

  LockFreeQueue(const LockFreeQueue& other) = delete;
  LockFreeQueue& operator=(const LockFreeQueue& other) = delete;

  bool IsEmpty() const { return head_.load() == tail_.load(); }

  void Push(const T& data) {
    auto new_data = std::make_shared<T>(data);
    Node* p = new Node;             // 3
    Node* old_tail = tail_.load();  // 4
    old_tail->data.swap(new_data);  // 5
    old_tail->next = p;             // 6
    tail_.store(p);                 // 7
  }

  std::shared_ptr<T> Pop() {
    Node* old_head = PopHead();
    if (old_head == nullptr) {
      return std::shared_ptr<T>();
    }

    const std::shared_ptr<T> res(old_head->data);  // 2
    delete old_head;
    return res;
  }

 private:
  // The definition of Node must appear before the data members head_ and
  // tail_ that use it. If Node is only defined after those members, the
  // compiler reports:
  //
  // error: 'Node' has not been declared ...
  //
  // This is a language rule (a nested type must be declared before it is used
  // in a member declaration), so Node is defined here, ahead of the members
  // that refer to it.
  struct Node {
    // The dummy node starts with no data and no successor.
    Node() : data(nullptr), next(nullptr) {}

    std::shared_ptr<T> data;
    Node* next;
  };

 private:
  Node* PopHead() {
    Node* old_head = head_.load();
    if (old_head == tail_.load()) {  // 1
      return nullptr;
    }
    head_.store(old_head->next);
    return old_head;
  }

 private:
  std::atomic<Node*> head_;
  std::atomic<Node*> tail_;
};

At first glance this implementation looks fine, and as long as only one thread calls Push() and only one thread calls Pop(), it is indeed perfectly correct. The happens-before relationship between Push() and Pop() is essential here, since it is what makes it safe to retrieve the queued data. The store to tail_ ⑦ (corresponding to the comment // 7 in the listing above, and likewise below) synchronizes with the load of tail_ ①; the store to the preceding node's data pointer ⑤ happens before the store to tail_; and the load of tail_ happens before the load of the data pointer ②. Therefore the store to data happens before the load of data, and everything works. This is a perfectly serviceable single-producer/single-consumer (SPSC) queue.
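
To make the single-producer/single-consumer usage model concrete, here is a minimal usage sketch (an illustration added here, not part of the original listing; the element count of 100 is arbitrary) in which exactly one thread calls Push() and exactly one other thread calls Pop():

#include <iostream>
#include <thread>

#include "lock_free_queue.h"

int main() {
  LockFreeQueue<int> queue;

  // The only thread that ever calls Push().
  std::thread producer([&queue] {
    for (int i = 0; i < 100; ++i) {
      queue.Push(i);
    }
  });

  // The only thread that ever calls Pop(). Pop() returns an empty shared_ptr
  // when the queue is momentarily empty, so keep polling until all elements
  // have been received.
  std::thread consumer([&queue] {
    int received = 0;
    while (received < 100) {
      if (auto data = queue.Pop()) {
        std::cout << *data << ' ';
        ++received;
      }
    }
  });

  producer.join();
  consumer.join();
  return 0;
}
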
The problems appear when Push() and Pop() are called concurrently from multiple threads. Consider Push() first: if two threads call Push() concurrently, each allocates a new node as the dummy node ③, both read the same tail_ value ④, and both then modify the same node when setting its data and next pointers ⑤⑥, which is an obvious data race!
PopHead() has a similar problem. If two threads call it concurrently, both read the same head_ and both then try to overwrite it with the old node's next pointer. Both threads end up referring to the same node, which is a disaster. Not only must we guarantee that only one Pop() thread can claim a given item; we must also guarantee that other threads that have read head_ can still safely access next in that node. This is exactly the same problem as Pop() in the lock-free stack.
Suppose the Pop() problem has been solved; what about Push()? The trouble is that, to obtain the happens-before relationship between Push() and Pop(), the data item must be set on the dummy node before tail_ is updated. But then concurrent calls to Push() race on that same data item, because they have all read the same tail_.

Note
Happens-before and synchronizes-with are the two key relations for synchronizing memory between threads through atomic variables.
Happens-before
Regardless of threads, evaluation A happens-before evaluation B if any of the following is true: 1) A is sequenced-before B; 2) A inter-thread happens before B. The implementation is required to ensure that the happens-before relation is acyclic, by introducing additional synchronization if necessary (it can only be necessary if a consume operation is involved). If one evaluation modifies a memory location, and the other reads or modifies the same memory location, and if at least one of the evaluations is not an atomic operation, the behavior of the program is undefined (the program has a data race) unless there exists a happens-before relationship between these two evaluations.
Synchronizes-with
If an atomic store in thread A is a release operation, an atomic load in thread B from the same variable is an acquire operation, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B. Also, some library calls may be defined to synchronize-with other library calls on other threads.
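
As a standalone illustration of the synchronizes-with relation (a minimal sketch, independent of the queue; the variable names are made up), the release store below synchronizes with the acquire load, so the plain write to payload is guaranteed to be visible to the consumer thread once it observes ready == true:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // Ordinary, non-atomic data.
std::atomic<bool> ready{false};  // Flag used for the release/acquire handoff.

int main() {
  std::thread producer([] {
    payload = 42;                                  // A: plain write
    ready.store(true, std::memory_order_release);  // B: release store
  });

  std::thread consumer([] {
    while (!ready.load(std::memory_order_acquire)) {  // C: acquire load
      // Spin until B is observed; B synchronizes with C, so A happens
      // before the read of payload below.
    }
    assert(payload == 42);
  });

  producer.join();
  consumer.join();
  return 0;
}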

2. A lock-free queue under the multi-producer/multi-consumer model

2.1 Without relaxed memory ordering

To eliminate the data races caused by concurrent access, we can make the data pointer inside Node atomic and set it with a compare/exchange operation. If the compare/exchange succeeds, this thread owns the node referenced by tail_ and can safely set its next pointer and then update tail_. If the compare/exchange fails, another thread has stored its data first, so we re-read tail_ and loop again. If atomic operations on std::shared_ptr<> were lock-free, that would essentially be the end of it. However, on most platforms std::shared_ptr<> is currently not lock-free, so we need an alternative: have Pop() return std::unique_ptr<> and store the data in the queue as a plain pointer, that is, as std::atomic<T*>, which is what makes the calls to compare_exchange_strong() possible.
To cope with multiple threads calling Pop() and Push(), we use a reference-counting scheme similar to the one in the lock-free stack, with two counters per node: an internal count and an external count. Their sum is the total number of references to the node. The external count is bundled together with the node pointer and is incremented every time a thread reads that pointer; when a thread finishes accessing the node, it decrements the internal count. Once the counted pointer (the node pointer plus its bundled external count) is no longer reachable by other threads, the external count minus 2 is added to the internal count and the external count is discarded. When the internal count reaches zero, no thread is accessing the node any more and it can be deleted safely. The difference from the lock-free stack is that the queue has two pointers, head_ and tail_, so a node can be referenced by up to two counted pointers and needs to track two external counters. We therefore replace std::atomic<int> internal_count with
std::atomic<NodeCounter> counter (the NodeCounter struct is defined and explained later). The listing below (file lock_free_queue.h, adapted from C++ Concurrency in Action, 2nd edition, 2019, with its bugs fixed) shows the result:

#pragma once

#include <atomic>
#include <memory>

template <typename T>
class LockFreeQueue {
 public:
  LockFreeQueue() : head_(CountedNodePtr(new Node, 1)), tail_(head_.load()) {}
  ~LockFreeQueue();

  // Copy construct and assignment, move construct and assignment are
  // prohibited.
  LockFreeQueue(const LockFreeQueue& other) = delete;
  LockFreeQueue& operator=(const LockFreeQueue& other) = delete;
  LockFreeQueue(LockFreeQueue&& other) = delete;
  LockFreeQueue& operator=(LockFreeQueue&& other) = delete;

  bool IsEmpty() const { return head_.load().ptr == tail_.load().ptr; }
  bool IsLockFree() const {
    return std::atomic<CountedNodePtr>::is_always_lock_free;
  }

  void Push(const T& data);
  std::unique_ptr<T> Pop();

 private:
  // Forward class declaration
  struct Node;

  struct CountedNodePtr {
    explicit CountedNodePtr(Node* input_ptr = nullptr,
                            uint16_t input_external_count = 0)
        : ptr(reinterpret_cast<uint64_t>(input_ptr)),
          external_count(input_external_count) {}

    // We know that the platform has spare bits in a pointer (for example,
    // because the address space is only 48 bits but a pointer is 64 bits), we
    // can store the count inside the spare bits of the pointer to fit it all
    // back in a single machine word. Keeping the structure within a machine
    // word makes it more likely that the atomic operations can be lock-free on
    // many platforms.
    uint64_t ptr : 48;
    uint16_t external_count : 16;
  };

  struct NodeCounter {
    NodeCounter() : internal_count(0), external_counters(0) {}
    NodeCounter(const uint32_t input_internal_count,
                const uint8_t input_external_counters)
        : internal_count(input_internal_count),
          external_counters(input_external_counters) {}

    // external_counters occupies only 2 bits, where the maximum value stored
    // is 3. Note that we need only 2 bits for the external_counters because
    // there are at most two such counters. By using a bit field for this and
    // specifying internal_count as a 30-bit value, we keep the total counter
    // size to 32 bits. This gives us plenty of scope for large internal count
    // values while ensuring that the whole structure fits inside a machine word
    // on 32-bit and 64-bit machines. It’s important to update these counts
    // together as a single entity in order to avoid race conditions. Keeping
    // the structure within a machine word makes it more likely that the atomic
    // operations can be lock-free on many platforms.
    uint32_t internal_count : 30;
    uint8_t external_counters : 2;
  };

  struct Node {
    // There are only two counters in Node (counter and next), so the initial
    // value of external_counters is 2.
    Node()
        : data(nullptr), counter(NodeCounter(0, 2)), next(CountedNodePtr()) {}
    ~Node();
    void ReleaseRef();

    std::atomic<T*> data;
    std::atomic<NodeCounter> counter;
    std::atomic<CountedNodePtr> next;
  };

 private:
  static void IncreaseExternalCount(std::atomic<CountedNodePtr>* atomic_node,
                                    CountedNodePtr* old_node);
  static void FreeExternalCounter(CountedNodePtr* old_node);
  void SetNewTail(const CountedNodePtr& new_tail, CountedNodePtr* old_tail);

 private:
  std::atomic<CountedNodePtr> head_;
  std::atomic<CountedNodePtr> tail_;
};

template <typename T>
LockFreeQueue<T>::Node::~Node() {
  if (data.load() != nullptr) {
    T* old_data = data.exchange(nullptr);
    if (old_data != nullptr) {
      delete old_data;
    }
  }
}

template <typename T>
void LockFreeQueue<T>::Node::ReleaseRef() {
  NodeCounter old_node = counter.load();
  NodeCounter new_node;

  // the whole count structure has to be updated atomically, even though we
  // only want to modify the internal_count field. This therefore requires a
  // compare/exchange loop.
  do {
    new_node = old_node;
    if (new_node.internal_count > 0) {
      --new_node.internal_count;
    }
  } while (!counter.compare_exchange_strong(old_node, new_node));

  // Once we’ve decremented internal_count, if both the internal and
  // external counts are now zero, this is the last reference, so we can
  // delete the node safely.
  if (!new_node.internal_count && !new_node.external_counters) {
    delete this;
  }
}

template <typename T>
LockFreeQueue<T>::~LockFreeQueue() {
  while (Pop()) {
    // Do nothing
  }

  // Delete the last dummy node.
  if (head_.load().ptr == tail_.load().ptr) {
    CountedNodePtr old_head = head_.load();
    CountedNodePtr empty_head;
    auto* dummy_ptr = reinterpret_cast<Node*>(old_head.ptr);
    if (head_.compare_exchange_strong(old_head, empty_head)) {
      delete dummy_ptr;
    }
  }
}

template <typename T>
void LockFreeQueue<T>::Push(const T& data) {
  auto new_data = std::make_unique<T>(data);
  CountedNodePtr new_next(new Node, 1);
  CountedNodePtr old_tail = tail_.load();

  while (true) {
    IncreaseExternalCount(&tail_, &old_tail);

    T* old_data = nullptr;
    // We use compare_exchange_strong() to avoid looping. If the exchange
    // fails, we know that another thread has already set the next pointer, so
    // we don’t need the new node we allocated at the beginning, and we can
    // delete it. We also want to use the next value that the other thread set
    // for updating tail.
    if (reinterpret_cast<Node*>(old_tail.ptr)
            ->data.compare_exchange_strong(old_data, new_data.get())) {
      CountedNodePtr old_next;
      if (!reinterpret_cast<Node*>(old_tail.ptr)
               ->next.compare_exchange_strong(old_next, new_next)) {
        delete reinterpret_cast<Node*>(new_next.ptr);
        new_next = old_next;
      }
      SetNewTail(new_next, &old_tail);

      // Release ownership of the managed object so that the data is not
      // deleted when the local unique_ptr<T> goes out of scope.
      new_data.release();
      break;
    } else {
      // If the thread calling Push() failed to set the data pointer this time
      // through the loop, it can help the successful thread to complete the
      // update. First off, we try to update the next pointer to the new node
      // allocated on this thread. If this succeeds, we want to use the node
      // we allocated as the new tail node, and we need to allocate another
      // new node in anticipation of managing to push an item on the queue. We
      // can then try to set the tail node by calling SetNewTail before
      // looping around again.
      CountedNodePtr old_next;
      if (reinterpret_cast<Node*>(old_tail.ptr)
              ->next.compare_exchange_strong(old_next, new_next)) {
        old_next = new_next;
        new_next.ptr = reinterpret_cast<uint64_t>(new Node);
      }
      SetNewTail(old_next, &old_tail);
    }
  }
}

template <typename T>
std::unique_ptr<T> LockFreeQueue<T>::Pop() {
  // We prime the pump by loading the old_head value before we enter the loop,
  // and before we increase the external count on the loaded value.
  CountedNodePtr old_head = head_.load();
  while (true) {
    IncreaseExternalCount(&head_, &old_head);

    Node* ptr = reinterpret_cast<Node*>(old_head.ptr);
    if (ptr == nullptr) {
      return std::unique_ptr<T>();
    }
    // If the head node is the same as the tail node, we can release the
    // reference and return a null pointer because there’s no data in the
    // queue.
    if (ptr == reinterpret_cast<Node*>(tail_.load().ptr)) {
      ptr->ReleaseRef();
      return std::unique_ptr<T>();
    }

    // If there is data, we want to try to claim it and we do this with the
    // call to compare_exchange_strong(). It compares the external count and
    // pointer as a single entity; if either changes, we need to loop again,
    // after releasing the reference.
    CountedNodePtr next = ptr->next.load();
    if (head_.compare_exchange_strong(old_head, next)) {
      // If the exchange succeeded, we’ve claimed the data in the node as
      // ours, so we can return that to the caller after we’ve released the
      // external counter to the popped node.
      T* res = ptr->data.exchange(nullptr);
      FreeExternalCounter(&old_head);
      return std::unique_ptr<T>(res);
    }

    // Once both the external reference counts have been freed and the
    // internal count has dropped to zero, the node itself can be deleted.
    ptr->ReleaseRef();
  }
}

template <typename T>
void LockFreeQueue<T>::IncreaseExternalCount(
    std::atomic<CountedNodePtr>* atomic_node, CountedNodePtr* old_node) {
  CountedNodePtr new_node;

  // If `*old_node` is equal to `*atomic_node`, it means that no other thread
  // changes the `*atomic_node`, update `*atomic_node` to `new_node`. In fact
  // the `*atomic_node` is still the original node, only the `external_count`
  // of it is increased by 1. If `*old_node` is not equal to `*atomic_node`,
  // it means that another thread has changed `*atomic_node`, update
  // `*old_node` to `*atomic_node`, and keep looping until there are no
  // threads changing `*atomic_node`.
  do {
    new_node = *old_node;
    ++new_node.external_count;
  } while (!atomic_node->compare_exchange_strong(*old_node, new_node));

  old_node->external_count = new_node.external_count;
}

template <typename T>
void LockFreeQueue<T>::FreeExternalCounter(CountedNodePtr* old_node) {
  Node* ptr = reinterpret_cast<Node*>(old_node->ptr);
  // It’s important to note that the value we add is two less than the
  // external count. We’ve removed the node from the list, so we drop one off
  // the count for that, and we’re no longer accessing the node from this
  // thread, so we drop another off the count for that.
  const int increased_count = old_node->external_count - 2;
  NodeCounter old_counter = ptr->counter.load();
  NodeCounter new_counter;

  // Update two counters using a single compare_exchange_strong() on the
  // whole count structure, as we did when decreasing the internal_count
  // in ReleaseRef().
  // This has to be done as a single action (which therefore requires the
  // compare/exchange loop) to avoid a race condition. If they’re updated
  // separately, two threads may both think they are the last one and both
  // delete the node, resulting in undefined behavior.
  do {
    new_counter = old_counter;
    if (new_counter.external_counters > 0) {
      --new_counter.external_counters;
    }

    int temp_count = new_counter.internal_count + increased_count;
    // `internal_count` is a non-negative number, and if a negative number is
    // assigned to it, the result is undefined.
    new_counter.internal_count = (temp_count > 0 ? temp_count : 0);
  } while (!ptr->counter.compare_exchange_strong(old_counter, new_counter));

  // If both the values are now zero, there are no more references to the
  // node, so it can be safely deleted.
  if (!new_counter.internal_count && !new_counter.external_counters) {
    delete ptr;
  }
}

template <typename T>
void LockFreeQueue<T>::SetNewTail(const CountedNodePtr& new_tail,
                                  CountedNodePtr* old_tail) {
  // Use a compare_exchange_weak() loop to update the tail , because if other
  // threads are trying to push() a new node, the external_count part may have
  // changed, and we don’t want to lose it.
  Node* current_tail_ptr = reinterpret_cast<Node*>(old_tail->ptr);
  // compare_exchange_weak may fail spuriously: it can return false even
  // though the value stored in tail_ still equals *old_tail (for example when
  // the thread is preempted between the underlying load-linked and
  // store-conditional instructions), rather than because another thread
  // actually changed tail_. To tell the two cases apart we compare
  // old_tail->ptr with current_tail_ptr: if they are still equal, the failure
  // was spurious and we loop again; if they differ, another thread has
  // installed a new tail and we stop looping.
  while (!tail_.compare_exchange_weak(*old_tail, new_tail) &&
         reinterpret_cast<Node*>(old_tail->ptr) == current_tail_ptr) {
    // Do nothing
  }

  // We also need to take care that we don’t replace the value if another
  // thread has successfully changed it already; otherwise, we may end up with
  // loops in the queue, which would be a rather bad idea. Consequently, we
  // need to ensure that the ptr part of the loaded value is the same if the
  // compare/exchange fails. If the ptr is the same once the loop has exited,
  // then we must have successfully set the tail , so we need to free the old
  // external counter. If the ptr value is different, then another thread will
  // have freed the counter, so we need to release the single reference held
  // by this thread.
  if (reinterpret_cast<Node*>(old_tail->ptr) == current_tail_ptr) {
    FreeExternalCounter(old_tail);
  } else {
    current_tail_ptr->ReleaseRef();
  }
}

The class diagram of the lock-free queue is shown below:
[Figure: class diagram of the lock-free queue]

The flowchart of the Push operation:
[Figure: Push flowchart]

The flowchart of the Pop operation:
[Figure: Pop flowchart]

In the listing above, the definitions of the non-inline member functions are placed outside the class to keep the class definition easier to read, as in the following example:

template <typename T>
void LockFreeQueue<T>::Node::ReleaseRef() {
  // ...
}

Push函数中,else分支代码起到负载均衡的作用。也就是说,如果当前线程更新原尾部节点指针的old_tail.ptr->data失败时,就会帮成功的线程更新尾部节点指针的old_tail.ptr ->next,并且正确设置新的尾部节点指针。通过该操作,可避免未抢到控制权的线程反复循环,导致效率低下。事实上,即使删除该段代码也没什么问题,只是在高CPU负载下性能会下降(如下所示):

template <typename T>
void LockFreeQueue<T>::Push(const T& data) {
  auto new_data = std::make_unique<T>(data);
  CountedNodePtr new_next(new Node);
  new_next.external_count = 1;
  CountedNodePtr old_tail = tail_.load();

  while (true) {
    IncreaseExternalCount(&tail_, &old_tail);

    T* old_data = nullptr;    
    if (reinterpret_cast<Node*>(old_tail.ptr)
            ->data.compare_exchange_strong(old_data, new_data.get())) {
      // ...
    } else {
      CountedNodePtr old_next =
          reinterpret_cast<Node*>(old_tail.ptr)->next.load();
      if (reinterpret_cast<Node*>(old_tail.ptr)
              ->next.compare_exchange_strong(old_next, new_next)) {
        old_next = new_next;
        new_next.ptr = reinterpret_cast<uint64_t>(new Node);
      }
      SetNewTail(old_next, &old_tail);
    }
  }
}

It is worth pointing out that CountedNodePtr, the reference-counted node pointer struct, makes use of bit fields:

  struct CountedNodePtr {
    explicit CountedNodePtr(Node* input_ptr = nullptr)
        : ptr(reinterpret_cast<uint64_t>(input_ptr)), external_count(0) {}
        
    uint64_t ptr : 48;
    uint16_t external_count : 16;
  };

The real type of ptr is Node*, yet here it is declared as an unsigned integer (uint64_t) bit field occupying 48 bits. Why? Mainstream operating systems and compilers currently support lock-free atomic operations only on types of at most 8 bytes; that is, the is_lock_free member of std::atomic<CountedNodePtr> returns true only when sizeof(CountedNodePtr) <= 8. We therefore have to keep CountedNodePtr within 8 bytes, and bit fields are the way to do it. On mainstream platforms a pointer does not use more than 48 significant bits (if your platform uses more, the bit-field widths must be redesigned; consult the platform documentation), so external_count gets 16 bits (values up to 65535) and ptr gets 48 bits, 64 bits (8 bytes) in total. With this layout, is_lock_free of std::atomic<CountedNodePtr> returns true on mainstream platforms, so it is a genuinely lock-free atomic. To make this work, reinterpret_cast<Node*>(new_node.ptr) is used to convert ptr from uint64_t back to Node*, and reinterpret_cast<uint64_t>(new Node) is used to convert a Node* into a uint64_t so it can be stored in ptr. Note that external_count is only ever incremented, never decremented; once no thread accesses the node any more, external_count is simply discarded.
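
The following standalone sketch (illustrative only; Dummy and PackedPtr are made-up names) shows this round trip of a pointer through a 48-bit field, assuming a mainstream 64-bit platform where user-space addresses fit in 48 bits:

#include <cassert>
#include <cstdint>
#include <iostream>

struct Dummy {
  int value = 0;
};

struct PackedPtr {
  uint64_t ptr : 48;
  uint16_t count : 16;
};

int main() {
  Dummy* raw = new Dummy{42};

  PackedPtr packed{};
  // Store the pointer in the 48-bit field. On the assumed platforms the
  // user-space address fits into 48 bits, so nothing is lost.
  packed.ptr = reinterpret_cast<uint64_t>(raw);
  packed.count = 1;

  // Recover the pointer from the bit field and use it again.
  Dummy* restored = reinterpret_cast<Dummy*>(packed.ptr);
  assert(restored == raw);
  std::cout << restored->value << '\n';  // Prints 42.

  delete raw;
  return 0;
}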

Similarly, the other counter struct, NodeCounter, also uses bit fields:

  struct NodeCounter {
    NodeCounter() : internal_count(0), external_counters(0) {}
    NodeCounter(const uint32_t input_internal_count,
                const uint8_t input_external_counters)
        : internal_count(input_internal_count),
          external_counters(input_external_counters) {}
    
    uint32_t internal_count : 30;
    uint8_t external_counters : 2;
  };

The reason is again to make std::atomic<NodeCounter> a genuinely lock-free atomic. In this struct external_counters occupies only 2 bits, so the largest value it can hold is 3; since the queue has the two pointers head_ and tail_, a node is referenced by at most two counted pointers and external_counters never needs to exceed 2, so 2 bits are enough. internal_count gets 30 bits (values up to 1073741823). Together the two fields occupy 32 bits (4 bytes), so is_lock_free of std::atomic<NodeCounter> returns true on mainstream platforms and the atomic is genuinely lock-free. When internal_count and external_counters are both zero, no thread is using the node any more and it can be deleted safely to reclaim its memory. Note that internal_count is only ever decremented, never incremented, apart from having the external count external_count - 2 folded into it.
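
As a quick sanity check (an illustrative addition; bit-field layout is implementation-defined, so the assertions below assume a typical 64-bit Linux toolchain such as GCC or Clang), the following standalone snippet verifies at compile time that both structs fit into a single machine word and that their atomic wrappers can be lock-free:

#include <atomic>
#include <cstdint>

// Same field layout as the structs in the listing above.
struct CountedNodePtr {
  uint64_t ptr : 48;
  uint16_t external_count : 16;
};

struct NodeCounter {
  uint32_t internal_count : 30;
  uint8_t external_counters : 2;
};

static_assert(sizeof(CountedNodePtr) <= 8,
              "CountedNodePtr must fit into one machine word");
static_assert(sizeof(NodeCounter) <= 8,
              "NodeCounter must fit into one machine word");

// is_always_lock_free is a compile-time constant in C++17; on the assumed
// platforms both atomics are lock-free.
static_assert(std::atomic<CountedNodePtr>::is_always_lock_free,
              "std::atomic<CountedNodePtr> is expected to be lock-free");
static_assert(std::atomic<NodeCounter>::is_always_lock_free,
              "std::atomic<NodeCounter> is expected to be lock-free");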

2.2 With relaxed memory ordering

Based on an analysis of happens-before and synchronizes-with, the two most important ordering relations, a version with relaxed memory ordering is given below. It is not guaranteed to be correct on every platform:

#pragma once

#include <atomic>
#include <memory>

template <typename T>
class LockFreeQueue {
 public:
  LockFreeQueue()
      : head_(CountedNodePtr(new Node, 1)),
        tail_(head_.load(std::memory_order_relaxed)) {}
  ~LockFreeQueue();

  // Copy construct and assignment, move construct and assignment are
  // prohibited.
  LockFreeQueue(const LockFreeQueue& other) = delete;
  LockFreeQueue& operator=(const LockFreeQueue& other) = delete;
  LockFreeQueue(LockFreeQueue&& other) = delete;
  LockFreeQueue& operator=(LockFreeQueue&& other) = delete;

  bool IsEmpty() const {
    return head_.load(std::memory_order_relaxed).ptr ==
           tail_.load(std::memory_order_relaxed).ptr;
  }
  bool IsLockFree() const {
    return std::atomic<CountedNodePtr>::is_always_lock_free;
  }

  void Push(const T& data);
  std::unique_ptr<T> Pop();

 private:
  // Forward class declaration
  struct Node;

  struct CountedNodePtr {
    explicit CountedNodePtr(Node* input_ptr = nullptr,
                            uint16_t input_external_count = 0)
        : ptr(reinterpret_cast<uint64_t>(input_ptr)),
          external_count(input_external_count) {}

    // We know that the platform has spare bits in a pointer (for example,
    // because the address space is only 48 bits but a pointer is 64 bits), we
    // can store the count inside the spare bits of the pointer to fit it all
    // back in a single machine word. Keeping the structure within a machine
    // word makes it more likely that the atomic operations can be lock-free on
    // many platforms.
    uint64_t ptr : 48;
    uint16_t external_count : 16;
  };

  struct NodeCounter {
    NodeCounter() : internal_count(0), external_counters(0) {}
    NodeCounter(const uint32_t input_internal_count,
                const uint8_t input_external_counters)
        : internal_count(input_internal_count),
          external_counters(input_external_counters) {}

    // external_counters occupies only 2 bits, where the maximum value stored
    // is 3. Note that we need only 2 bits for the external_counters because
    // there are at most two such counters. By using a bit field for this and
    // specifying internal_count as a 30-bit value, we keep the total counter
    // size to 32 bits. This gives us plenty of scope for large internal count
    // values while ensuring that the whole structure fits inside a machine word
    // on 32-bit and 64-bit machines. It’s important to update these counts
    // together as a single entity in order to avoid race conditions. Keeping
    // the structure within a machine word makes it more likely that the atomic
    // operations can be lock-free on many platforms.
    uint32_t internal_count : 30;
    uint8_t external_counters : 2;
  };

  struct Node {
    // There are only two counters in Node (counter and next), so the initial
    // value of external_counters is 2.
    Node()
        : data(nullptr), counter(NodeCounter(0, 2)), next(CountedNodePtr()) {}
    ~Node();
    void ReleaseRef();

    std::atomic<T*> data;
    std::atomic<NodeCounter> counter;
    std::atomic<CountedNodePtr> next;
  };

 private:
  static void IncreaseExternalCount(std::atomic<CountedNodePtr>* atomic_node,
                                    CountedNodePtr* old_node);
  static void FreeExternalCounter(CountedNodePtr* old_node);
  void SetNewTail(const CountedNodePtr& new_tail, CountedNodePtr* old_tail);

 private:
  std::atomic<CountedNodePtr> head_;
  std::atomic<CountedNodePtr> tail_;
};

template <typename T>
LockFreeQueue<T>::Node::~Node() {
  if (data.load(std::memory_order_relaxed) != nullptr) {
    T* old_data = data.exchange(nullptr, std::memory_order_relaxed);
    if (old_data != nullptr) {
      delete old_data;
    }
  }
}

template <typename T>
void LockFreeQueue<T>::Node::ReleaseRef() {
  NodeCounter old_node = counter.load(std::memory_order_relaxed);
  NodeCounter new_node;

  // the whole count structure has to be updated atomically, even though we
  // only want to modify the internal_count field. This therefore requires a
  // compare/exchange loop.
  do {
    new_node = old_node;
    if (new_node.internal_count > 0) {
      --new_node.internal_count;
    }
  } while (!counter.compare_exchange_strong(old_node, new_node,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed));

  // Once we’ve decremented internal_count, if both the internal and
  // external counts are now zero, this is the last reference, so we can
  // delete the node safely.
  if (!new_node.internal_count && !new_node.external_counters) {
    delete this;
  }
}

template <typename T>
LockFreeQueue<T>::~LockFreeQueue() {
  while (Pop()) {
    // Do nothing
  }

  // Delete the last dummy node.
  if (head_.load(std::memory_order_relaxed).ptr ==
      tail_.load(std::memory_order_relaxed).ptr) {
    CountedNodePtr old_head = head_.load(std::memory_order_relaxed);
    CountedNodePtr empty_head;
    auto* dummy_ptr = reinterpret_cast<Node*>(old_head.ptr);
    if (head_.compare_exchange_strong(old_head, empty_head,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed)) {
      delete dummy_ptr;
    }
  }
}

template <typename T>
void LockFreeQueue<T>::Push(const T& data) {
  auto new_data = std::make_unique<T>(data);
  CountedNodePtr new_next(new Node, 1);
  CountedNodePtr old_tail = tail_.load(std::memory_order_relaxed);

  while (true) {
    IncreaseExternalCount(&tail_, &old_tail);

    T* old_data = nullptr;
    // We use compare_exchange_strong() to avoid looping. If the exchange
    // fails, we know that another thread has already set the next pointer, so
    // we don’t need the new node we allocated at the beginning, and we can
    // delete it. We also want to use the next value that the other thread set
    // for updating tail.
    if (reinterpret_cast<Node*>(old_tail.ptr)
            ->data.compare_exchange_strong(old_data, new_data.get(),
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
      CountedNodePtr old_next;
      if (!reinterpret_cast<Node*>(old_tail.ptr)
               ->next.compare_exchange_strong(old_next, new_next,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
        delete reinterpret_cast<Node*>(new_next.ptr);
        new_next = old_next;
      }
      SetNewTail(new_next, &old_tail);

      // Release ownership of the managed object so that the data is not
      // deleted when the local unique_ptr<T> goes out of scope.
      new_data.release();
      break;
    } else {
      // If the thread calling Push() failed to set the data pointer this time
      // through the loop, it can help the successful thread to complete the
      // update. First off, we try to update the next pointer to the new node
      // allocated on this thread. If this succeeds, we want to use the node
      // we allocated as the new tail node, and we need to allocate another
      // new node in anticipation of managing to push an item on the queue. We
      // can then try to set the tail node by calling SetNewTail before
      // looping around again.
      CountedNodePtr old_next;
      if (reinterpret_cast<Node*>(old_tail.ptr)
              ->next.compare_exchange_strong(old_next, new_next,
                                             std::memory_order_acq_rel,
                                             std::memory_order_relaxed)) {
        old_next = new_next;
        new_next.ptr = reinterpret_cast<uint64_t>(new Node);
      }
      SetNewTail(old_next, &old_tail);
    }
  }
}

template <typename T>
std::unique_ptr<T> LockFreeQueue<T>::Pop() {
  // We prime the pump by loading the old_head value before we enter the loop,
  // and before we increase the external count on the loaded value.
  CountedNodePtr old_head = head_.load(std::memory_order_relaxed);
  while (true) {
    IncreaseExternalCount(&head_, &old_head);

    Node* ptr = reinterpret_cast<Node*>(old_head.ptr);
    if (ptr == nullptr) {
      return std::unique_ptr<T>();
    }
    // If the head node is the same as the tail node, we can release the
    // reference and return a null pointer because there’s no data in the
    // queue.
    if (ptr ==
        reinterpret_cast<Node*>(tail_.load(std::memory_order_acquire).ptr)) {
      ptr->ReleaseRef();
      return std::unique_ptr<T>();
    }

    // If there is data, we want to try to claim it and we do this with the
    // call to compare_exchange_strong(). It compares the external count and
    // pointer as a single entity; if either changes, we need to loop again,
    // after releasing the reference.
    CountedNodePtr next = ptr->next.load(std::memory_order_relaxed);
    if (head_.compare_exchange_strong(old_head, next, std::memory_order_acquire,
                                      std::memory_order_relaxed)) {
      // If the exchange succeeded, we’ve claimed the data in the node as
      // ours, so we can return that to the caller after we’ve released the
      // external counter to the popped node.
      T* res = ptr->data.exchange(nullptr, std::memory_order_acquire);
      FreeExternalCounter(&old_head);
      return std::unique_ptr<T>(res);
    }

    // Once both the external reference counts have been freed and the
    // internal count has dropped to zero, the node itself can be deleted.
    ptr->ReleaseRef();
  }
}

template <typename T>
void LockFreeQueue<T>::IncreaseExternalCount(
    std::atomic<CountedNodePtr>* atomic_node, CountedNodePtr* old_node) {
  CountedNodePtr new_node;

  // If `*old_node` is equal to `*atomic_node`, it means that no other thread
  // changes the `*atomic_node`, update `*atomic_node` to `new_node`. In fact
  // the `*atomic_node` is still the original node, only the `external_count`
  // of it is increased by 1. If `*old_node` is not equal to `*atomic_node`,
  // it means that another thread has changed `*atomic_node`, update
  // `*old_node` to `*atomic_node`, and keep looping until there are no
  // threads changing `*atomic_node`.
  do {
    new_node = *old_node;
    ++new_node.external_count;
  } while (!atomic_node->compare_exchange_strong(*old_node, new_node,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_relaxed));

  old_node->external_count = new_node.external_count;
}

template <typename T>
void LockFreeQueue<T>::FreeExternalCounter(CountedNodePtr* old_node) {
  Node* ptr = reinterpret_cast<Node*>(old_node->ptr);
  // It’s important to note that the value we add is two less than the
  // external count. We’ve removed the node from the list, so we drop one off
  // the count for that, and we’re no longer accessing the node from this
  // thread, so we drop another off the count for that.
  const int increased_count = old_node->external_count - 2;
  NodeCounter old_counter = ptr->counter.load(std::memory_order_relaxed);
  NodeCounter new_counter;

  // Update two counters using a single compare_exchange_strong() on the
  // whole count structure, as we did when decreasing the internal_count
  // in ReleaseRef().
  // This has to be done as a single action (which therefore requires the
  // compare/exchange loop) to avoid a race condition. If they’re updated
  // separately, two threads may both think they are the last one and both
  // delete the node, resulting in undefined behavior.
  do {
    new_counter = old_counter;
    if (new_counter.external_counters > 0) {
      --new_counter.external_counters;
    }

    int temp_count = new_counter.internal_count + increased_count;
    // `internal_count` is a non-negative number, and if a negative number is
    // assigned to it, the result is undefined.
    new_counter.internal_count = (temp_count > 0 ? temp_count : 0);
  } while (!ptr->counter.compare_exchange_strong(old_counter, new_counter,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed));

  // If both the values are now zero, there are no more references to the
  // node, so it can be safely deleted.
  if (!new_counter.internal_count && !new_counter.external_counters) {
    delete ptr;
  }
}

template <typename T>
void LockFreeQueue<T>::SetNewTail(const CountedNodePtr& new_tail,
                                  CountedNodePtr* old_tail) {
  // Use a compare_exchange_weak() loop to update the tail , because if other
  // threads are trying to push() a new node, the external_count part may have
  // changed, and we don’t want to lose it.
  Node* current_tail_ptr = reinterpret_cast<Node*>(old_tail->ptr);
  // compare_exchange_weak may fail spuriously: it can return false even
  // though the value stored in tail_ still equals *old_tail (for example when
  // the thread is preempted between the underlying load-linked and
  // store-conditional instructions), rather than because another thread
  // actually changed tail_. To tell the two cases apart we compare
  // old_tail->ptr with current_tail_ptr: if they are still equal, the failure
  // was spurious and we loop again; if they differ, another thread has
  // installed a new tail and we stop looping.
  while (!tail_.compare_exchange_weak(*old_tail, new_tail,
                                      std::memory_order_release,
                                      std::memory_order_relaxed) &&
         reinterpret_cast<Node*>(old_tail->ptr) == current_tail_ptr) {
    // Do nothing
  }

  // We also need to take care that we don’t replace the value if another
  // thread has successfully changed it already; otherwise, we may end up with
  // loops in the queue, which would be a rather bad idea. Consequently, we
  // need to ensure that the ptr part of the loaded value is the same if the
  // compare/exchange fails. If the ptr is the same once the loop has exited,
  // then we must have successfully set the tail , so we need to free the old
  // external counter. If the ptr value is different, then another thread will
  // have freed the counter, so we need to release the single reference held
  // by this thread.
  if (reinterpret_cast<Node*>(old_tail->ptr) == current_tail_ptr) {
    FreeExternalCounter(old_tail);
  } else {
    current_tail_ptr->ReleaseRef();
  }
}

The following figure gives one reference diagram for the memory-ordering analysis:
[Figure: memory-ordering analysis]

3. Test code

Below is simple test code that checks whether the lock-free queue works correctly (file lock_free_queue.cpp):

#include "lock_free_queue.h"

#include <algorithm>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

namespace {
constexpr size_t kElementNum = 10;
constexpr size_t kThreadNum = 200;
constexpr size_t kLargeThreadNum = 2000;
}  // namespace

int main() {
  LockFreeQueue<int> queue;

  // Case 1: Single thread test
  for (size_t i = 0; i < kElementNum; ++i) {
    std::cout << "The data " << i << " is pushed in the queue.\n";
    queue.Push(i);
  }
  std::cout << "queue.IsEmpty() == " << std::boolalpha << queue.IsEmpty()
            << std::endl;
  while (auto data = queue.Pop()) {
    std::cout << "Current data is : " << *data << '\n';
  }

  // Case 2: multi-thread test. Producers and consumers are evenly distributed
  std::vector<std::thread> producers1;
  std::vector<std::thread> producers2;
  std::vector<std::thread> consumers1;
  std::vector<std::thread> consumers2;
  for (size_t i = 0; i < kThreadNum; ++i) {
    producers1.emplace_back(&LockFreeQueue<int>::Push, &queue, i * 10);
    producers2.emplace_back(&LockFreeQueue<int>::Push, &queue, i * 20);
    consumers1.emplace_back(&LockFreeQueue<int>::Pop, &queue);
    consumers2.emplace_back(&LockFreeQueue<int>::Pop, &queue);
  }
  for (size_t i = 0; i < kThreadNum; ++i) {
    producers1[i].join();
    consumers1[i].join();
    producers2[i].join();
    consumers2[i].join();
  }
  producers1.clear();
  producers1.shrink_to_fit();
  producers2.clear();
  producers2.shrink_to_fit();
  consumers1.clear();
  consumers1.shrink_to_fit();
  consumers2.clear();
  consumers2.shrink_to_fit();

  // Case 3: multi-thread test. Producers and consumers are randomly distributed
  std::vector<std::thread> producers3;
  std::vector<std::thread> consumers3;
  for (size_t i = 0; i < kLargeThreadNum; ++i) {
    producers3.emplace_back(&LockFreeQueue<int>::Push, &queue, i * 30);
    consumers3.emplace_back(&LockFreeQueue<int>::Pop, &queue);
  }
  std::vector<int> random_numbers(kLargeThreadNum);
  std::mt19937 gen(std::random_device{}());
  std::uniform_int_distribution<int> dis(0, 100000);
  auto rand_num_generator = [&gen, &dis]() mutable { return dis(gen); };
  std::generate(random_numbers.begin(), random_numbers.end(),
                rand_num_generator);
  for (size_t i = 0; i < kLargeThreadNum; ++i) {
    if (random_numbers[i] % 2) {
      producers3[i].join();
      consumers3[i].join();
    } else {
      consumers3[i].join();
      producers3[i].join();
    }
  }
  producers3.clear();
  producers3.shrink_to_fit();
  consumers3.clear();
  consumers3.shrink_to_fit();

  return 0;
}

The CMake build configuration file, CMakeLists.txt:

cmake_minimum_required(VERSION 3.0.0)
project(lock_free_queue VERSION 0.1.0)
set(CMAKE_CXX_STANDARD 17)

# If the debug option is not given, the program will not have debugging information.
SET(CMAKE_BUILD_TYPE "Debug")

# Check for memory leaks
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=leak -fsanitize=address")

add_executable(${PROJECT_NAME} ${PROJECT_NAME}.cpp)

find_package(Threads REQUIRED)
# libatomic should be linked to the program.
# Otherwise, the following link errors occur:
# /usr/include/c++/9/atomic:254: undefined reference to `__atomic_load_16'
# /usr/include/c++/9/atomic:292: undefined reference to `__atomic_compare_exchange_16'
# target_link_libraries(${PROJECT_NAME} ${CMAKE_THREAD_LIBS_INIT} atomic)
target_link_libraries(${PROJECT_NAME} ${CMAKE_THREAD_LIBS_INIT})

include(CTest)
enable_testing()
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

In the configuration above, SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=leak -fsanitize=address") is used to check the program for memory leaks. A link against the atomic library was also added at one time, because the first version of the reference-counted struct CountedNodePtr did not use bit fields and held the two data members int external_count; Node* ptr; (note: only that version needed the extra link against atomic; the current bit-field version no longer does). Those two members occupy 16 bytes, and a 16-byte structure used atomically requires linking against the atomic library, otherwise the following link errors appear:

/usr/include/c++/9/atomic:254: undefined reference to `__atomic_load_16'
/usr/include/c++/9/atomic:292: undefined reference to `__atomic_compare_exchange_16'
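
The situation can be reproduced with a small standalone snippet (illustrative only; WidePtr is a made-up name): wrap a 16-byte trivially copyable struct in std::atomic. With GCC on x86-64 this typically compiles but fails to link unless -latomic is added, producing exactly the undefined references shown above:

#include <atomic>
#include <cstdint>
#include <iostream>

struct WidePtr {
  uint64_t count;  // 8 bytes
  void* ptr;       // 8 bytes, so 16 bytes in total
};

int main() {
  std::atomic<WidePtr> node{WidePtr{0, nullptr}};

  WidePtr expected = node.load();  // May be lowered to __atomic_load_16.
  WidePtr desired{expected.count + 1, nullptr};
  // May be lowered to __atomic_compare_exchange_16.
  node.compare_exchange_strong(expected, desired);

  std::cout << std::boolalpha << node.is_lock_free() << '\n';
  return 0;
}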

The VSCode debug launch configuration, .vscode/launch.json:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "cpp_gdb_launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/build/${workspaceFolderBasename}",
            "args": [],
            "stopAtEntry": false,
            "cwd": "${fileDirname}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable neat printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                }
            ],
            // "preLaunchTask": "cpp_build_task",
            "miDebuggerPath": "/usr/bin/gdb"
        }
    ]
}

Build with CMake:

cd lock_free_queue
# Run only once
mkdir build
cd build
cmake .. && make

The output is as follows:

./lock_free_queue 
The data 0 is pushed in the queue.
The data 1 is pushed in the queue.
The data 2 is pushed in the queue.
The data 3 is pushed in the queue.
The data 4 is pushed in the queue.
The data 5 is pushed in the queue.
The data 6 is pushed in the queue.
The data 7 is pushed in the queue.
The data 8 is pushed in the queue.
The data 9 is pushed in the queue.
queue.IsEmpty() == false
Current data is : 0
Current data is : 1
Current data is : 2
Current data is : 3
Current data is : 4
Current data is : 5
Current data is : 6
Current data is : 7
Current data is : 8
Current data is : 9

The VSCode debugging session looks like this:
[Figure: VSCode debugging session]
