锁与原子操作CAS以及无锁队列的底层实现

小o魂

已于 2022-01-23 20:09:37 修改

阅读量527

点赞数 1

文章标签：后端

于 2021-11-28 14:17:42 首次发布

本文链接：https://blog.csdn.net/u012173846/article/details/121590938

版权

线程安全问题？

首先我们需要了解一下并发编程下，为什么会有线程安全问题？

当多个线程同时共享，同一个全局变量或静态变量，做写的操作时，可能会发生数据冲突问题，也就是线程安全问题。但是做读操作是不会发生数据冲突问题。

案例：

现在有一个计数器，有多个线程需要对计数器做累加操作，如果不做线程安全处理，请问结果是预期值吗？

代码：

#include <cstdio>
#include <thread>

#define MAX_THREAD			10000
#define COUNTER_SIZE		10000

int result = 0;
void counter() {
    for(int i = 0; i < COUNTER_SIZE; ++i){
        result  = result + 1;
    }
}

int main(int argc, char *argv[]) {
	int i = 0;
    
	std::thread threads[MAX_THREAD];
	for (i = 0;i < MAX_THREAD;i ++) {
		threads[i] = std::thread(counter);
	}

	 for (i = 0;i < MAX_THREAD;i ++) {
		threads[i].join();
	}

	printf("%d\n\r",result);
    return 1;
}

可以动手试一下，结果是否等于100000000？

解决办法？

在C++中主要有这几种常用的解决方法：

互斥锁(mutex)-如果获取不到锁就休眠，让出cpu
条件变量(condition_variable)-配合互斥锁使用
自旋锁(spinlock)-获取不到锁，就继续检测，尝试再次获取锁，不会让出cpu
无锁cas(compare and swap)
信号量(counting_semaphore)

自旋锁和互斥锁的应用

互斥锁用于临界区持锁时间比较长的操作，比如下面这些情况都可以考虑

临界区有IO操作
临界区代码复杂或者循环量大
临界区竞争非常激烈
单核处理器

自旋锁就主要用在临界区持锁时间非常短且CPU资源不紧张的情况下，等待时消耗cpu资源较多，自旋锁一般用于多核的服务器。

此外，原子操作适用于简单的加减等数学运算，属于粒度最小的操作。

cas 多线程同时竞争的时候效率并不会特别高，如果mutex和自旋锁能满足要求了尽量不要用cas

什么是CAS？

⽐较并交换(compare and swap, CAS)，是原⼦操作的⼀种，可⽤于在多线程编程中实现不被打断的数据交换操作，从⽽避免多线程同时改写某⼀数据时由于执⾏顺序不确定性以及中断的不可预知性产⽣的数据不⼀致问题。该操作通过将内存中的值与指定数据进⾏⽐较，当数值⼀样时将内存中的数据替换为新的值。

bool CAS( int * pAddr, int nExpected, int nNew)
atomically {
    if ( *pAddr == nExpected ) {
        *pAddr = nNew ;
        return true ;
    }
    else
        return false ;
}

上⾯的CAS返回bool告知原⼦性交换是否成功。

操作的原⼦性

这里引出一个问题"i++"是原子操作嘛？

要解决这个问题，我们可以先写个案例编译出来以后看反汇编即可。

#include <stdio.h>

int i = 0;
int main(int argc, char **argv)
{
    i++;
    return 0;
}

gcc -S test_i++.c 产⽣汇编指令，可以发现当我们操作i++的时候，其汇编指令为：

完整的汇编程序如下所示：

    .file "test_i++.c"
    .text
    .globl i
    .bss
    .align 4
    .type i, @object
    .size i, 4
 i:
    .zero 4
    .text
    .globl main
    .type main, @function
 main:
 .LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movl %edi, -4(%rbp)
    movq %rsi, -16(%rbp)
    movl i(%rip), %eax
    addl $1, %eax
    movl %eax, i(%rip)
    movl $0, %eax
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
 .LFE0:
    .size main, .-main
    .ident "GCC: (Ubuntu 8.3.0-6ubuntu1~18.10.1) 8.3.0"
    .section .note.GNU-stack,"",@progbits

从上图能看到⼀个i++对应的操作是：

把变量i从内存（RAM）加载到寄存器；
把寄存器的值加1；
把寄存器的值写回内存（RAM）

那如果有多个线程去做i++操作的时候，也就可能导致这样⼀种情况：

上图这种执⾏情况就导致i变量的结果仅仅⾃增了⼀次，⽽不是两次，导致实际结果与预期结果不对。

总结下上⾯的问题，也就是说我们整个存储的结构如下图：

我们所有的变量⾸先是存储在主存（RAM）上，CPU要去操作的时候⾸先会加载到寄存器，然后才操作，操作好了才写会主存。

关于CPU和cache的更详细介绍可以参考：https://www.cnblogs.com/jokerjason/p/10711022.html

原⼦操作

对于gcc、g++编译器来讲，它们提供了⼀组API来做原⼦操作：

type __sync_fetch_and_add (type *ptr, type value, ...)
type __sync_fetch_and_sub (type *ptr, type value, ...)
type __sync_fetch_and_or (type *ptr, type value, ...)
type __sync_fetch_and_and (type *ptr, type value, ...)
type __sync_fetch_and_xor (type *ptr, type value, ...)
type __sync_fetch_and_nand (type *ptr, type value, ...)

bool __sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval type newval, ...)
type __sync_lock_test_and_set (type *ptr, type value, ...)
void __sync_lock_release (type *ptr, ...)

详细⽂档⻅：https://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html#AtomicBuiltins

对于c++11来讲，我们也有⼀组atomic的接⼝：https://en.cppreference.com/w/cpp/atomic

原⼦操作的实现

知道了上面的原⼦操作，但是他们都是怎么实现的呢？

X86的架构

Intel X86指令集提供了指令前缀lock⽤于锁定前端串⾏总线FSB，保证了指令执⾏时不会收到其他处理器的⼲扰。⽐如：

static int lxx_atomic_add(int *ptr, int increment){
    int old_value = *ptr;
    __asm__ volatile("lock; xadd %0, %1 \n\t"
                        : "=r"(old_value), "=m"(*ptr)
                        : "0"(increment), "m"(*ptr)
                        : "cc", "memory");
    return *ptr;
}

使⽤lock指令前缀之后，处理期间对count内存的并发访问（Read/Write）被禁⽌，从⽽保证了指令的原⼦性。如图所示：

其他架构自行搜索，基本原理都是用到了芯片架构自己提供的锁机制。

用法

Mutex

pthread_mutex_t mutex;

void mutex_thread_main(void *argv)
{
    pthread_mutex_lock(&mutex);
    // ...
    pthread_mutex_unlock(&mutex);
}

spinlock

class atomic_flag_spinlock
{
    std::atomic_flag flag;
public:
    atomic_flag_spinlock():
        flag(ATOMIC_FLAG_INIT)
    {}
    void lock()
    {
        while(flag.test_and_set(std::memory_order_acquire));
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};

void spinlock_thread_main(void *argv)
{
    s_atomic_flag_spinlock.lock();
    // ...
    s_atomic_flag_spinlock.unlock();

    return NULL;
}

posix也提供⼀组⾃旋锁（spin lock）的API：

int pthread_spin_destroy(pthread_spinlock_t *);
int pthread_spin_init(pthread_spinlock_t *, int pshared);
int pthread_spin_lock(pthread_spinlock_t *);
int pthread_spin_trylock(pthread_spinlock_t *);
int pthread_spin_unlock(pthread_spinlock_t *);

pshared的取值：
PTHREAD_PROCESS_SHARED：该⾃旋锁可以在多个进程中的线程之间共享。
PTHREAD_PROCESS_PRIVATE：仅初始化本⾃旋锁的线程所在的进程内的线程才能够使⽤该⾃旋锁

pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);

pthread_spin_lock(&spinlock);
// ...
pthread_spin_unlock(&spinlock);

cas-原子操作

int inc(int *value, int add) {

    int old;

    __asm__ volatile ( 
        "lock; xaddl %2, %1;" // "lock; xchg %2, %1, %3;" 
        : "=a" (old)
        : "m" (*value), "a" (add)
        : "cc", "memory"
    );

    return old;
}

int main() {
    int *pcount = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
    inc(pcount, 1);

    return 1;
}

无锁队列

为什么需要⽆锁队列？

锁引起的问题：

Cache损坏(Cache trashing)
在同步机制上的争抢队列
动态内存分配

Cache损坏(Cache trashing)

在保存和恢复上下⽂的过程中还隐藏了额外的开销：Cache中的数据会失效,因为它缓存的是将被换出任务的数据,这些数据对于新换进的任务是没⽤的。处理器的运⾏速度⽐主存快N倍,所以⼤量的处理器时间被浪费在处理器与主存的数据传输上。这就是在处理器和主存之间引⼊Cache的原因。Cache是⼀种速度更快但容量更⼩的内存(也更加昂贵),当处理器要访问主存中的数据时,这些数据⾸先被拷⻉到Cache中，因为这些数据在不久的将来可能⼜会被处理器访问。Cache misses对性能有⾮常⼤的影响,因为处理器访问Cache中的数据将⽐直接访问主存快得多。线程被频繁抢占产⽣的Cache损坏将导致应⽤程序性能下降。

在同步机制上的争抢队列

阻塞不是微不⾜道的操作。它导致操作系统暂停当前的任务或使其进⼊睡眠状态(等待，不占⽤任何的处理器)。直到资源(例如互斥锁)可⽤，被阻塞的任务才可以解除阻塞状态(唤醒)。在⼀个负载较重的应⽤程序中使⽤这样的阻塞队列来在线程之间传递消息会导致严重的争⽤问题。也就是说，任务将⼤量的时间(睡眠，等待，唤醒)浪费在获得保护队列数据的互斥锁，⽽不是处理队列中的数据上。如果多个线程争夺同一个资源，还容易引起死锁等问题。

⾮阻塞机制⼤展伸⼿的机会到了。任务之间不争抢任何资源，在队列中预定⼀个位置，然后在这个位置上插⼊或提取数据。这种机制使⽤了⼀种被称之为CAS(⽐较和交换)的特殊操作，这个特殊操作是⼀种特殊的指令，它可以原⼦的完成以下操作:它需要3个操作数m，A，B，其中m是⼀个内存地址，操作将m指向的内存中的内容与A⽐较，如果相等则将B写⼊到m指向的内存中并返回true，如果不相等则什么也不做返回false。

volatile int a;
a = 1;
// this will loop while 'a' is not equal to 1.
// If it is equal to 1 the operation will atomically set a to 2 and return true
while (!CAS(&a, 1, 2)){}

动态内存分配

在多线程系统中,需要仔细的考虑动态内存分配。当⼀个任务从堆中分配内存时，标准的内存分配机制会阻塞所有与这个任务共享地址空间的其它任务(进程中的所有线程)。这样做的原因是让处理更简单，且它⼯作得很好。两个线程不会被分配到⼀块相同的地址的内存，因为它们没办法同时执⾏分配请求。显然线程频繁分配内存会导致应⽤程序性能下降(必须注意,向标准队列或map插⼊数据的时候都会导致堆上的动态内存分配)

ZMQ中⽆锁队列的实现

参考zmq,⻅源码的ypipe.hpp、yqueue.hpp，这些源码可以在⼯程项⽬使⽤，但要注意，这⾥只⽀持单写单读取的场景。

重点：

yqueue_t和ypipe_t的数据结构
chunk设计的优点，N值⼤⼩对性能的影响、spare_chunk的作⽤(局部性原理)，这些可以在⾯试的时候和⾯试官讲述

难点：
ypipe_t r指针的预读机制不好理解，r可以理解为read_end，r并不是读数据的位置索引，⽽是我们可以最多读到哪个位置的索引。读数据的索引位置还是begin_pos。

原⼦操作函数

template <typename T> class atomic_ptr_t{
public:
    inline void set (T *ptr_);            //⾮原⼦操作
    inline T *xchg (T *val_);             //原⼦操作，设置⼀个新的值，然后返回旧的值
    inline T *cas (T *cmp_, T *val_);     //原⼦操作
private:
    volatile T *ptr;
}

set函数，把私有成员ptr指针设置成参数ptr_的值，不是⼀个原⼦操作，需要使⽤者确保执⾏set过程没有其他线程使⽤ptr的值。
xchg函数，把私有成员ptr指针设置成参数val_的值，并返回ptr设置之前的值。原⼦操作，线程安全。
cas函数，原⼦操作，线程安全，把私有成员ptr指针与参数cmp_指针⽐较：
如果相等，就把ptr设置为参数val_的值，返回ptr设置之前的值；
如果不相等直接返回ptr值。

yqueue_t-类接⼝和变量

// T is the type of the object in the queue.队列中元素的类型
// N is granularity(粒度) of the queue，简单来说就是yqueue_t⼀个结点可以装载N个T类型的元素
template <typename T, int N> class yqueue_t{
public:
    inline yqueue_t ();     // Create the queue.
    inline ~yqueue_t ();    // Destroy the queue.
    inline T &front ();     // Returns reference to the front element of the queue. If the queue is empty, behaviour is undefined.
    inline T &back ();      // Returns reference to the back element of the queue.If the queue is empty, behaviour is undefined.
    inline void push ();    // Adds an element to the back end of the queue.
    inline void pop ();     // Removes an element from the front of the queue.
    inline void unpush ()// Removes element from the back end of the queue。 回滚时使⽤
private:
    // Individual memory chunk to hold N elements.
    struct chunk_t {
        T values [N];
        chunk_t *prev;
        chunk_t *next;
    };

    chunk_t *begin_chunk;
    int begin_pos;
    chunk_t *back_chunk;
    int back_pos;
    chunk_t *end_chunk;
    int end_pos;

    atomic_ptr_t<chunk_t> spare_chunk; //空闲块（我把所有元素都已经出队的块称为空闲块），读写线程的共享变量
};

yqueue_t的实现，每次批量分配⼀批元素，减少内存的分配和释放（解决不断动态内存分配的问题）。yqueue_t内部由⼀个⼀个chunk组成，每个chunk保存N个元素。

struct chunk_t{
    T values[N]; //每个chunk_t可以容纳N个T类型的元素，以后就以⼀个chunk_t为单位申请内存
    chunk_t *prev;
    chunk_t *next;
};

当队列空间不⾜时，每次分配⼀个chunk_t，每个chunk_t能存储N个元素。在数据出队列后，队列有多余空间的时候，回收的chunk也不是⻢上释放，⽽是根据局部性原理先回收到spare_chunk⾥⾯，当再次需要分配chunk_t的时候从spare_chunk中获取。

yqueue_t内部有三个chunk_t类型指针以及对应的索引位置：

begin_chunk/begin_pos：begin_chunk⽤于指向队列头的chunk，begin_pos⽤于指向队列第⼀个元素在当前chunk中的位置。
back_chunk/back_pos：back_chunk⽤于指向队列尾的chunk，back_pos⽤于指向队列最后⼀个元素在当前chunk的位置。
end_chunk/end_pos：由于chunk是批量分配的，所以end_chunk⽤于指向分配的最后⼀个chunk位置。

这⾥特别需要注意区分back_chunk/back_pos和end_chunk/end_pos的作⽤：

back_chunk/back_pos：对应的是元素存储位置；
end_chunk/end_pos：决定是否要分配chunk或者回收chunk。

这里有个有趣的地方，就是当chunk中的分配元素数量为1时，chunk就退化成了纯链表。

chunk中的分配元素数量为1000时,就表示插入1000个元素才会分配一次内存。

上图中，有三块chunk，分别由begin_chunk、back_chunk、end_chunk组成。

begin_pos指向begin_chunk中的第n个元素。
back_pos指向back_chunk的最后⼀个元素。
由于back_pos已经指向了back_chunk的最后⼀个元素，所以end_pos就指向了end_chunk的第⼀个元素。

另外还有⼀个spare_chunk指针，⽤于保存释放的chunk指针，当需要再次分配chunk的时候，会⾸先查看这⾥，从这⾥分配chunk。这⾥使⽤了原⼦的cas操作来完成，利⽤了操作系统的局部性原理。

yqueue_t构造函数

// Create the queue.
inline yqueue_t() {
    begin_chunk = (chunk_t *)malloc(sizeof(chunk_t)); // 预先分配chunk
    alloc_assert(begin_chunk);
    begin_pos = 0;
    back_chunk = NULL; //back_chunk总是指向队列中最后⼀个元素所在的chunk，现在还没有元素，所以初始为空
    back_pos = 0;
    end_chunk = begin_chunk; //end_chunk总是指向链表的最后⼀个chunk
    end_pos = 0;
}

end_chunk总是指向最后分配的chunk。刚分配出来的chunk，end_pos也总是为0。back_chunk需要chunk有元素插⼊的时候才指向对应的chunk。

front、back函数

// Returns reference to the front element of the queue.
// If the queue is empty, behaviour is undefined.
inline T &front() {       // 返回的是引⽤，是个左值，调⽤者可以通过其修改容器的值
    return begin_chunk->values[begin_pos]; // 队列⾸个chunk对应的的begin_pos
}

// Returns reference to the back element of the queue.
// If the queue is empty, behaviour is undefined.
inline T &back() {        // 返回的是引⽤，是个左值，调⽤者可以通过其修改容器的值
    return back_chunk->values[back_pos];
}

这⾥的front()或者back()函数，需要注意的返回的是左值引⽤，我们可以修改其值。对于先进后出队列⽽⾔：

begin_chunk->values[begin_pos]代表队列头可读元素，读取队列头元素即是读取begin_pos位置的元素；
back_chunk->values[back_pos]代表队列尾可写元素，写⼊元素时则是更新back_pos位置的元素，要确保元素真正⽣效，还需要调⽤push函数更新back_pos的位置，避免下次更新的时候⼜是更新当前back_pos位置对应的元素。

push函数

// Adds an element to the back end of the queue.
inline void push(){
    back_chunk = end_chunk;
    back_pos = end_pos;     // 更新可写的位置，根据end_pos取更新

    if (++end_pos != N)     //end_pos==N表明这个链表结点已经满了
        return;

    chunk_t *sc = spare_chunk.xchg(nullptr); // 为什么设置为NULL？ 因为如果把之前值取出来了则没有spare chunk了，所以设置为NULL
    if (sc) {               // 如果有spare chunk则继续复⽤它
        end_chunk->next = sc;
        qsc->prev = end_chunk;
    } else {                // 没有则重新分配
        end_chunk->next = (chunk_t *)malloc(sizeof(chunk_t));
        // 分配⼀个chunk
        alloc_assert(end_chunk->next);
        end_chunk->next->prev = end_chunk;
    }

    end_chunk = end_chunk->next;
    end_pos = 0;
}

这⾥分为两种情况：

第⼀种情况：++end_pos != N，说明当前chunk还有空余的位置可以继续插⼊新元素；
第⼆种情况：++end_pos == N，说明该chunk只有N-1的位置可以写⼊元素了，需要再分配⼀个chunk空间。
需要新分配chunk时，先尝试从spare_chunk获取，如果获取到则直接使⽤，如果spare_chunk为NULL则需要重新分配chunk。
最终都是要更新end_chunk和end_pos。

第⼀种情况 ++end_pos != N，此时 back_pos和end_pos相差⼀个位置，即是 (back_pos +1)%N == end_pos。

第⼆种情况：++end_pos == N，说明该chunk只有N-1的位置可以写⼊元素了，需要再分配⼀个chunk空间。

pop函数

// Removes an element from the front end of the queue.
2 inline void pop() {
    if (++begin_pos == N) { // 删除满⼀个chunk才回收chunk
        chunk_t *o = begin_chunk;
        begin_chunk = begin_chunk->next; // 更新begin_chunk位置
        begin_chunk->prev = NULL;
        begin_pos = 0;

        // 'o' has been more recently used than spare_chunk,
        // so for cache reasons we'll get rid of the spare and
        // use 'o' as the spare.
        chunk_t *cs = spare_chunk.xchg(o); //由于局部性原理，总是保存最新的空闲块⽽释放先前的空闲快
        free(cs);
    }
}

整个chunk的元素都被取出队列才去回收chunk，⽽且是把最后回收的chunk保存到spare_chunk，然后释放之前保存的chunk。这⾥有两个点需要注意：

pop掉的元素，其销毁⼯作交给调⽤者完成，即是pop前调⽤者需要通过front()接⼝读取并进⾏销毁（⽐如动态分配的对象）。
空闲块的保存，要求是原⼦操作。因为闲块是读写线程的共享变量，因为在push中也使⽤了spare_chunk

ypipe_t

ypipe_t在yqueue_t的基础上构建⼀个单写单读的⽆锁队列

类接⼝和变量

template <typename T, int N>
class ypipe_t {
public:
    // Initialises the pipe.
    inline ypipe_t();

    // The destructor doesn't have to be virtual. It is mad virtual
    // just to keep ICC and code checking tools from complaining.
    inline virtual ~ypipe_t();
    // Write an item to the pipe. Don't flush it yet. If incomplete is
    // set to true the item is assumed to be continued by items
    // subsequently written to the pipe. Incomplete items are neverflushed down the stream.
    // 写⼊数据，incomplete参数表示写⼊是否还没完成，在没完成的时候不会修改flush指针，即这部分数据不会让读线程看到。
    inline void write(const T &value_, bool incomplete_);

    // Pop an incomplete item from the pipe. Returns true is such
    // item exists, false otherwise.
    inline bool unwrite(T *value_);

    // Flush all the completed items into the pipe. Returns false if
    // the reader thread is sleeping. In that case, caller is obliged to
    // wake the reader up before using the pipe again.
    // 刷新所有已经完成的数据到管道，返回false意味着读线程在休眠，在这种情况下调⽤者需要唤醒读线程。
    inline bool flush();

    // Check whether item is available for reading.
    // 这⾥⾯有两个点，⼀个是检查是否有数据可读，⼀个是预取
    inline bool check_read();

    // Reads an item from the pipe. Returns false if there is novalue.
    // available.
    inline bool read(T *value_);

    // Applies the function fn to the first elemenent in the pipe
    // and returns the value returned by the fn.
    // The pipe mustn't be empty or the function crashes.
    inline bool probe(bool (*fn)(T &));

protected:
    //  Allocation-efficient queue to store pipe items.
    //  Front of the queue points to the first prefetched item, back of
    //  the pipe points to last un-flushed item. Front is used only by
    //  reader thread, while back is used only by writer thread.
    yqueue_t<T, N> queue;

    //  Points to the first un-flushed item. This variable is used
    //  exclusively by writer thread.
    T *w;//指向第一个未刷新的元素,只被写线程使用

    //  Points to the first un-prefetched item. This variable is used
    //  exclusively by reader thread.
    T *r;//指向第一个还没预提取的元素，只被读线程使用

    //  Points to the first item to be flushed in the future.
    T *f;//指向下一轮要被刷新的一批元素中的第一个

    //  The single point of contention between writer and reader thread.
    //  Points past the last flushed item. If it is NULL,
    //  reader is asleep. This pointer should be always accessed using
    //  atomic operations.
    atomic_ptr_t<T> c;//读写线程共享的指针，指向每一轮刷新的起点（看代码的时候会详细说）。当c为空时，表示读线程睡眠（只会在读线程中被设置为空）

    //  Disable copying of ypipe object.
    ypipe_t(const ypipe_t &);
    const ypipe_t &operator=(const ypipe_t &);
};

主要变量：

// Points to the first un-flushed item. This variable is used exclusively by writer thread.
T *w;    // 指向第⼀个未刷新的元素,只被写线程使⽤

// Points to the first un-prefetched item. This variable is used exclusively by reader thread.
T *r;    // 指向第⼀个还没预提取的元素，只被读线程使⽤

// Points to the first item to be flushed in the future.
T *f;    // 指向下⼀轮要被刷新的⼀批元素中的第⼀个

// The single point of contention between writer and reader thread.
// Points past the last flushed item. If it is NULL,reader is asleep.
// This pointer should be always accessed using atomic operations.
atomic_ptr_t<T> c;    // 读写线程共享的指针，指向每⼀轮刷新的起点（看代码的时候会详细说）。当c为空时，表示读线程睡眠（只会在读线程中被设置为空）

主要接⼝：

void write (const T &value, bool incomplete) // 写⼊数据，incomplete参数表示写⼊是否还没完成，在没完成的时候不会修改flush指针，即这部分数据不会让读线程看到。

bool flush () // 刷新所有已经完成的数据到管道，返回false意味着读线程在休眠，在这种情况下调⽤者需要唤醒读线程。

bool read (T *value_) // 读数据，将读出的数据写⼊value指针中，返回false意味着没有数据可读。

ypipe_t ()初始化

//  Initialises the pipe.
inline ypipe_t(){
    //  Insert terminator element into the queue.
    queue.push(); //yqueue_t的尾指针加1，开始back_chunk为空，现在back_chunk指向第一个chunk_t块的第一个位置

    //  Let all the pointers to point to the terminator.
    //  (unless pipe is dead, in which case c is set to NULL).
    r = w = f = &queue.back();//就是让r、w、f、c四个指针都指向这个end迭代器
    c.set(&queue.back());
}

write函数

//  Write an item to the pipe.  Don't flush it yet. If incomplete is
//  set to true the item is assumed to be continued by items
//  subsequently written to the pipe. Incomplete items are neverflushed down the stream.
// 写入数据，incomplete参数表示写入是否还没完成，在没完成的时候不会修改flush指针，即这部分数据不会让读线程看到。
inline void write(const T &value_, bool incomplete_)
{
    //  Place the value to the queue, add new terminator element.
    queue.back() = value_;
    queue.push();

    //  Move the "flush up to here" poiter.
    if (!incomplete_)
        f = &queue.back();      // 记录要刷新的位置
}

flush函数

主要是将w更新到f的位置，说明已经写到的位置

//  Flush all the completed items into the pipe. Returns false if
//  the reader thread is sleeping. In that case, caller is obliged to
//  wake the reader up before using the pipe again.
// 刷新所有已经完成的数据到管道，返回false意味着读线程在休眠，在这种情况下调用者需要唤醒读线程。
// 批量刷新的机制， 写入批量后唤醒读线程；
// flush,主要是为了实现反悔机制 unwrite
inline bool flush()
{
    //  If there are no un-flushed items, do nothing.
    if (w == f)     // 不需要刷新，即是还没有新元素加入
        return true;

    //  Try to set 'c' to 'f'.
    if (c.cas(w, f) != w)   // 尝试将c设置为f，即是准备更新w的位置
    {

        //  Compare-and-swap was unseccessful because 'c' is NULL.
        //  This means that the reader is asleep. Therefore we don't
        //  care about thread-safeness and update c in non-atomic
        //  manner. We'll return false to let the caller know
        //  that reader is sleeping.
        c.set(f);
        w = f;
        return false;       // 线程看到flush返回false之后会发送一个消息给读线程，这个是需要写业务去做处理
    }

    //  Reader is alive. Nothing special to do now. Just move
    //  the 'first un-flushed item' pointer to 'f'.
    w = f;
    return true;
}

刷新之后，w、f、c、r的关系：

read函数

//  Check whether item is available for reading.
// 这里面有两个点，一个是检查是否有数据可读，一个是预取
inline bool check_read()
{
    //  Was the value prefetched already? If so, return.
    if (&queue.front() != r && r) //判断是否在前几次调用read函数时已经预取数据了return true;
        return true;

    //  There's no prefetched value, so let us prefetch more values.
    //  Prefetching is to simply retrieve the
    //  pointer from c in atomic fashion. If there are no
    //  items to prefetch, set c to NULL (using compare-and-swap).
    r = c.cas(&queue.front(), NULL);//尝试预取数据

    //  If there are no elements prefetched, exit.
    //  During pipe's lifetime r should never be NULL, however,
    //  it can happen during pipe shutdown when items
    //  are being deallocated.
    if (&queue.front() == r || !r)//判断是否成功预取数据
        return false;

    //  There was at least one value prefetched.
    return true;
}

//  Reads an item from the pipe. Returns false if there is no value.
//  available.
inline bool read(T *value_)
{
    //  Try to prefetch a value.
    if (!check_read())
        return false;

    //  There was at least one value prefetched.
    //  Return it to the caller.
    *value_ = queue.front();
    queue.pop();
    return true;
}

如果指针r指向的是队头元素（r==&queue.front()）或者r没有指向任何元素（NULL）则说明队列中并没有可读的数据，这个时候check_read尝试去预取数据。所谓的预取就是令 r=c (cas函数就是返回c本身的值，看上⾯关于cas的实现)，⽽c在write中被指向f（⻅上图），这时从queue.front()到f这个位置的数据都被预取出来了，然后每次调⽤read都能取出⼀段。值得注意的是，当c==&queue.front()时，代表数据被取完了，这时把c指向NULL，接着读线程会睡眠，这也是给写线程检查读线程是否睡眠的标志。继续上⾯写⼊AB数据的场景，第⼀次调⽤read时，会先check_read，把指针r指向指针c的位置（所谓的预取），这时r,c,w,f的关系如下:

在"read(&ret),函数返回false，ret没有获取到值"的时候，front()和r相等。

如果此时在r = c.cas(&queue.front(), NULL); 执⾏时没有flush的操作。则说明没有数据可以读取，最终返回false；
如果在r = c.cas(&queue.front(), NULL); 之前写⼊⽅write新数据后并调⽤了flush，则r被更新，最终返回true

⽽_c指针，则是读写线程都可以操作，因此需要使⽤原⼦的CAS操作来修改，它的可能值有以下⼏种：

NULL：读线程设置，此时意味着已经没有数据可读，读线程在休眠。
⾮零：写线程设置，这⾥⼜区分两种情况：
- 旧值为_w的情况下，cas(_w,_f)操作修改为_f，意味着如果原先的值为_w，则原⼦性的修改为_f，表示有更多已被刷新的数据可读。
- 在旧值为NULL的情况下，此时读线程休眠，因此可以安全的设置为当前_f指针的位置。

基于循环循环数组的⽆锁队列

重点

ArrayLockFreeQueue数据结构，可以理解为⼀个环形数组；
多线程写⼊时候，m_maximumReadIndex、m_writeIndex索引如何更新
在更新m_maximumReadIndex的时候为什么要让出cpu；
多线程读取的时候，m_readIndex如何更新。
可读位置是由m_maximumReadIndex控制，⽽不是m_writeIndex去控制的。
- m_maximumReadIndex的更新由m_writeIndex。

类接⼝和变量

template <typename ELEM_T, QUEUE_INT Q_SIZE = ARRAY_LOCK_FREE_Q_DEFAULT_SIZE>
class ArrayLockFreeQueue
{
public:

	ArrayLockFreeQueue();
	virtual ~ArrayLockFreeQueue();

	QUEUE_INT size();

	bool enqueue(const ELEM_T &a_data);    // ⼊队列

	bool dequeue(ELEM_T &a_data);          // 出队列   

    bool try_dequeue(ELEM_T &a_data);      // 尝试⼊队列

private:
	ELEM_T m_thequeue[Q_SIZE];
	volatile QUEUE_INT m_count;            // 队列的元素个数
	volatile QUEUE_INT m_writeIndex;       // 新元素⼊列时存放位置在数组中的下标
	volatile QUEUE_INT m_readIndex;        // 下⼀个出列元素在数组中的下标
	volatile QUEUE_INT m_maximumReadIndex; //最后⼀个已经完成⼊列操作的元素在数组中的下标
	inline QUEUE_INT countToIndex(QUEUE_INT a_count);
};

如果m_maximumReadIndex的值跟m_writeIndex不⼀致，表明有写请求尚未完成。这意味着，有写请求成功申请了空间但数据还没完全写进队列。所以如果有线程要读取，必须要等到写线程将数据完全写⼊到队列之后。

必须指明的是使⽤3种不同的下标都是必须的，因为队列允许任意数量的⽣产者和消费者围绕着它⼯作。已经存在⼀种基于循环数组的⽆锁队列，使得唯⼀的⽣产者和唯⼀的消费者可以良好的⼯作[11]。它的实现相当简洁⾮常值得阅读。

该程序使⽤gcc内置的__sync_bool_compare_and_swap，但重新做了宏定义封装。
#define CAS(a_ptr, a_oldVal, a_newVal) __sync_bool_compare_and_swap(a_ptr, a_oldVal, a_newVal)

enqueue⼊队列

template <typename ELEM_T, QUEUE_INT Q_SIZE>
bool ArrayLockFreeQueue<ELEM_T, Q_SIZE>::enqueue(const ELEM_T &a_data)
{
	QUEUE_INT currentWriteIndex;		// 获取写指针的位置
	QUEUE_INT currentReadIndex;
	do
	{
		currentWriteIndex = m_writeIndex;
		currentReadIndex = m_readIndex;
		if(countToIndex(currentWriteIndex + 1) ==
			countToIndex(currentReadIndex))
		{
			return false;	//队列未空		
		}
	} while(!CAS(&m_writeIndex, currentWriteIndex, (currentWriteIndex+1)));
	// We know now that this index is reserved for us. Use it to save the data
	m_thequeue[countToIndex(currentWriteIndex)] = a_data;

 	// update the maximum read index after saving the data. It wouldn't fail if there is only one thread 
	// inserting in the queue. It might fail if there are more than 1 producer threads because this
	// operation has to be done in the same order as the previous CAS
	while(!CAS(&m_maximumReadIndex, currentWriteIndex, (currentWriteIndex + 1)))
	{
		 // this is a good place to yield the thread in case there are more
		// software threads than hardware processors and you have more
		// than 1 producer thread
		// have a look at sched_yield (POSIX.1b)
		sched_yield();		// 当线程超过cpu核数的时候如果不让出cpu导致一直循环在此。
	}

	AtomicAdd(&m_count, 1);

	return true;

}

以下插图展示了对队列执⾏操作时各下标是如何变化的。如果⼀个位置被标记为X，标识这个位置⾥存放了数据。空⽩表示位置是空的。对于下图的情况,队列中存放了两个元素。WriteIndex指示的位置是新元素将会被插⼊的位置。ReadIndex指向的位置中的元素将会在下⼀次pop操作中被弹出。

当⽣产者准备将数据插⼊到队列中,它⾸先通过增加WriteIndex的值来申请空间。MaximumReadIndex指向最后⼀个存放有效数据的位置(也就是实际的队列尾)。

⼀旦空间的申请完成,⽣产者就可以将数据拷⻉到刚刚申请到的位置中。完成之后增加MaximumReadIndex使得它与WriteIndex的⼀致。

现在队列中有3个元素，接着⼜有⼀个⽣产者尝试向队列中插⼊元素。

在第⼀个⽣产者完成数据拷⻉之前，⼜有另外⼀个⽣产者申请了⼀个新的空间准备拷⻉数据。现在有两个⽣产者同时向队列插⼊数据。

现在⽣产者开始拷⻉数据，在完成拷⻉之后，对MaximumReadIndex的递增操作必须严格遵循⼀个顺序：第⼀个⽣产者线程⾸先递增MaximumReadIndex，接着才轮到第⼆个⽣产者。这个顺序必须被严格遵守的原因是，我们必须保证数据被完全拷⻉到队列之后才允许消费者线程将其出列。
while(!CAS(&m_maximumReadIndex, currentWriteIndex, (currentWriteIndex + 1))){
    sched_yield();		// 让出cpu的⽬的也是为了让排在最前⾯的⽣产者完成m_maximumReadIndex的更新。
}
第⼀个⽣产者完成了数据拷⻉，并对MaximumReadIndex完成了递增，现在第⼆个⽣产者可以递增MaximumReadIndex了。

第⼆个⽣产者完成了对MaximumReadIndex的递增,现在队列中有5个元素。

dequeue出队列

template <typename ELEM_T, QUEUE_INT Q_SIZE>
bool ArrayLockFreeQueue<ELEM_T, Q_SIZE>::dequeue(ELEM_T &a_data)
{
	QUEUE_INT currentMaximumReadIndex;
	QUEUE_INT currentReadIndex;

	do
	{
		 // to ensure thread-safety when there is more than 1 producer thread
       	// a second index is defined (m_maximumReadIndex)
		currentReadIndex = m_readIndex;
		currentMaximumReadIndex = m_maximumReadIndex;

		if(countToIndex(currentReadIndex) ==
			countToIndex(currentMaximumReadIndex))
		{
			// the queue is empty or
			// a producer thread has allocate space in the queue but is 
			// waiting to commit the data into it
			return false;
		}
		// retrieve the data from the queue
		a_data = m_thequeue[countToIndex(currentReadIndex)];

		// try to perfrom now the CAS operation on the read index. If we succeed
		// a_data already contains what m_readIndex pointed to before we 
		// increased it
		if(CAS(&m_readIndex, currentReadIndex, (currentReadIndex + 1)))
		{
			AtomicSub(&m_count, 1);	// 真正读取到了数据
			return true;
		}
	} while(true);

	assert(0);
	 // Add this return statement to avoid compiler warnings
	return false;

}

以下插图展示了元素出列的时候各种下标是如何变化的,队列中初始有2个元素。WriteIndex指示的位置是新元素将会被插⼊的位置。ReadIndex指向的位置中的元素将会在下⼀次pop操作中被弹出。

消费者线程拷⻉数组ReadIndex位置的元素，然后尝试⽤CAS操作将ReadIndex加1。如果操作成功消费者成功的将数据出列。因为CAS操作是原⼦的，所以只有唯⼀的线程可以在同⼀时刻更新ReadIndex的值。如果操作失败，读取新的ReadIndex值，以重复以上操作(copy数据，CAS)。

现在⼜有⼀个消费者将元素出列，队列变成空。

现在有⼀个⽣产者正在向队列中添加元素。它已经成功的申请了空间，但尚未完成数据拷⻉。任何其它企图从队列中移除元素的消费者都会发现队列⾮空(因为writeIndex不等于readIndex)。但它不能读取readIndex所指向位置中的数据，因为readIndex与MaximumReadIndex相等。消费者将会在do循环中不断的反复尝试，直到⽣产者完成数据拷⻉增加MaximumReadIndex的值，或者队列变成空(这在多个消费者的场景下会发⽣)。

当⽣产者完成数据拷⻉，队列的⼤⼩是1，消费者线程可以读取这个数据了.

在多于⼀个⽣产者线程的情况下yielding处理器的必要性

读者可能注意到了enqueue函数中使⽤了sched_yield()来主动的让出处理器，对于⼀个声称⽆锁的算法⽽⾔，这个调⽤看起来有点奇怪。正如⽂章开始的部分解释过的，多线程环境下影响性能的其中⼀个因素就是Cache损坏。⽽产⽣Cache损坏的⼀种情况就是⼀个线程被抢占，操作系统需要保存被抢占线程的上下⽂，然后将被选中作为下⼀个调度线程的上下⽂载⼊。此时Cache中缓存的数据都会失效，因为它是被抢占线程的数据⽽不是新线程的数据。

所以，当此算法调⽤sched_yield()意味着告诉操作系统："我要把处理器时间让给其它线程，因为我要等待某件事情的发⽣"。⽆锁算法和通过阻塞机制同步的算法的⼀个主要区别在于⽆锁算法不会阻塞在线程同步上，那么为什么在这⾥我们要主动请求操作系统抢占⾃⼰呢?这个问题的答案没那么简单。它与有多少个⽣产者线程在并发的往队列中存放数据有关：每个⽣产者线程所执⾏的CAS操作都必须严格遵循FIFO次序，⼀个⽤于申请空间，另⼀个⽤于通知消费者数据已经写⼊完成可以被读取了。

如果我们的应⽤程序只有唯⼀的⽣产者操作这个队列，sche_yield()将永远没有机会被调⽤，第2个CAS操作永远不会失败。因为在⼀个⽣产者的情况下没有⼈能破坏⽣产者执⾏这两个CAS操作的FIFO顺序。

⽽当多于⼀个⽣产者线程往队列中存放数据的时候，问题就出现了。概括来说，⼀个⽣产者通过第1个CAS操作申请空间，然后将数据写⼊到申请到的空间中，然后执⾏第2个CAS操作通知消费者数据准备完毕可供读取了。这第2个CAS操作必须遵循FIFO顺序，也就是说，如果A线程第⾸先执⾏完第⼀个CAS操作，那么它也要第1个执⾏完第2个CAS操作，如果A线程在执⾏完第⼀个CAS操作之后停⽌，然后B线程执⾏完第1个CAS操作，那么B线程将⽆法完成第2个CAS操作，因为它要等待A先完成第2个CAS操作。⽽这就是问题产⽣的根源。让我们考虑如下场景，3个消费者线程和1个消费者线程：

线程1，2，3按顺序调⽤第1个CAS操作申请了空间。那么它们完成第2个CAS操作的顺序也应该与这个顺序⼀致，1，2，3。
线程2⾸先尝试执⾏第2个CAS，但它会失败，因为线程1还没完成它的第2此CAS操作呢。同样对于线程3也是⼀样的。
线程2和3将会不断的调⽤它们的第2个CAS操作，直到线程1完成它的第2个CAS操作为⽌。
线程1最终完成了它的第2个CAS，现在线程3必须等线程2先完成它的第2个CAS。
线程2也完成了，最终线程3也完成

在上⾯的场景中，⽣产者可能会在第2个CAS操作上⾃旋⼀段时间，⽤于等待先于它执⾏第1个CAS操作的线程完成它的第2次CAS操作。在⼀个物理处理器数量⼤于操作队列线程数量的系统上，这不会有太严重的问题：因为每个线程都可以分配在⾃⼰的处理器上执⾏，它们最终都会很快完成各⾃的第2次CAS操作。虽然算法导致线程处理忙等状态，但这正是我们所期望的，因为这使得操作更快的完成。也就是说在这种情况下我们是不需要sche_yield()的，它完全可以从代码中删除。但是，在⼀个物理处理器数量少于线程数量的系统上，sche_yield()就变得⾄关重要了。让我们再次考查上⾯3个线程的场景，当线程3准备向队列中插⼊数据：如果线程1在执⾏完第1个CAS操作，在执⾏第2个CAS操作之前被抢占，那么线程2，3就会⼀直在它们的第2个CAS操作上忙等(它们忙等，不让出处理器，线程1也就没机会执⾏，它们就只能继续忙等)，直到线程1重新被唤醒，完成它的第2个CAS操作。这就是需要sche_yield()的场合了，操作系统应该避免让线程2，3处于忙等状态。它们应该尽快的让出处理器让线程1执⾏，使得线程1可以把它的第2个CAS操作完成。这样线程2和3才能继续完成它们的操作。