1. Why do we need lock-free queues?
1) Problems caused by locks:
1) Cache thrashing // when the same thread is scheduled onto different CPUs, the CPU cache it warmed up is invalidated
- Saving and restoring context also hides an extra cost: the data in the cache becomes useless, because what it holds belongs to the task being swapped out and is of no use to the task being swapped in. The processor runs N times faster than main memory, so a great deal of processor time is wasted moving data between the processor and main memory. That is exactly why a cache is placed between them: the cache is a faster but smaller (and more expensive) memory; when the processor needs data from main memory, the data is first copied into the cache, because it is likely to be accessed again soon. Cache misses have an enormous impact on performance, since accessing data already in the cache is far faster than going to main memory.
- The cache thrashing caused by frequent thread preemption degrades application performance.
2) Contention on the synchronization mechanism
A blocking queue that switches between a mutex and condition variables suffers from contention on the synchronization mechanism itself.
- Blocking is not a trivial operation. It makes the operating system suspend the current task or put it to sleep (waiting, consuming no processor time at all). The blocked task cannot be unblocked (woken up) until the resource it waits for (e.g. a mutex) becomes available. In a heavily loaded application, using such a blocking queue to pass messages between threads causes severe contention: tasks spend most of their time (sleeping, waiting, waking) acquiring the mutex that protects the queue instead of processing the data inside it.
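For contrast, the blocking queue being criticized here can be sketched in a few lines. Every push and pop goes through one mutex, which is exactly where the contention arises (`BlockQueue` and its members are illustrative names, not taken from any particular library):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal mutex + condition-variable queue: every operation takes the same
// lock, so under load threads mostly sleep/wake on it instead of doing work.
template <typename T>
class BlockQueue
{
public:
    void push(const T &v)
    {
        {
            std::lock_guard<std::mutex> lk(mtx_);   // the contention point
            q_.push(v);
        }
        cond_.notify_one();                         // wake one blocked consumer
    }

    T pop()                                         // blocks while the queue is empty
    {
        std::unique_lock<std::mutex> lk(mtx_);
        cond_.wait(lk, [this] { return !q_.empty(); });
        T v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::queue<T> q_;
};
```

Every producer and every consumer serializes on `mtx_`, so with many threads the time spent sleeping and waking dominates the time spent on actual queue data.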
- This is where non-blocking mechanisms get their chance to shine. Tasks do not contend for any resource; instead each one reserves a position in the queue and then inserts or extracts data at that position. This mechanism relies on a special operation called CAS (Compare And Swap), a special instruction that atomically does the following: it takes three operands m, A, B, where m is a memory address; it compares the contents of the memory that m points to with A; if they are equal it writes B into that memory and returns true; if they are not equal it does nothing and returns false.
volatile int a;
a = 1;

// This will loop while 'a' is not equal to 1.
// If it is equal to 1, the operation will atomically set a to 2
// and return true.
while (!CAS(&a, 1, 2))
{
    ;
}
3) Dynamic memory allocation
If the queue uses dynamically allocated nodes, then for a high-performance queue:
① every enqueue allocates memory for one node;
② every dequeue frees one node's memory — which inevitably hurts the performance of a high-performance server.
- In a multithreaded system, dynamic memory allocation must be considered carefully. When a task allocates from the heap, the standard allocator blocks every other task that shares the address space (all threads in the process). The reason is that it keeps things simple, and it works well: two threads can never be handed the same address, because they cannot execute an allocation request at the same time. Obviously, threads that allocate frequently degrade application performance (note that inserting into a standard queue or map triggers dynamic heap allocation).
- 1000 nodes per chunk: 118 ms
- 1 node per chunk: 231 ms
- When N = 1, the queue above degenerates into a plain linked list: every insertion news a node and appends it to the queue, so every element requires one memory allocation;
- When N = 1000, one chunk can hold 1000 elements, so inserting 1000 elements needs only a single dynamically allocated chunk.
- When allocating memory, multiple threads race (contending for the resource) on calls to malloc; alternatively, a memory pool can be used in place of malloc.
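The benchmark above can be reproduced in spirit with a toy chunk arena: one `new` per N slots instead of one per slot (`ChunkArena` and `chunk_t` are illustrative names following the yqueue discussion, not zmq code):

```cpp
#include <cstddef>
#include <vector>

// Toy chunk allocator: elements are carved out of fixed-size chunks, so the
// heap is touched once every N requests instead of once per request.
template <typename T, std::size_t N>
class ChunkArena
{
public:
    ChunkArena() : pos_(N), allocations_(0) {}
    ~ChunkArena()
    {
        for (chunk_t *c : chunks_)
            delete c;
    }

    // Hand out the next slot; allocate a fresh chunk only every N calls.
    T *get()
    {
        if (pos_ == N)                  // current chunk exhausted
        {
            chunks_.push_back(new chunk_t);
            ++allocations_;             // count actual heap allocations
            pos_ = 0;
        }
        return &chunks_.back()->values[pos_++];
    }

    std::size_t allocations() const { return allocations_; }

private:
    struct chunk_t { T values[N]; };
    std::vector<chunk_t *> chunks_;
    std::size_t pos_;
    std::size_t allocations_;
};
```

With N = 1000, inserting 1000 elements costs one heap allocation; with N = 1 it costs 1000, which is the difference the two timings above are measuring.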
2. The yqueue implementation (lock-free queue basics: single writer, single reader)
Modeled on zmq.
1) A class encapsulating atomic pointer operations
// This class encapsulates several atomic operations on pointers.
template <typename T> class atomic_ptr_t
{
public:
    inline void set (T *ptr_);        // non-atomic: simply stores a new value
    inline T *xchg (T *val_);         // atomic: stores a new value and returns the old one (exchange)
    inline T *cas (T *cmp_, T *val_); // atomic: compares the current value with cmp_; stores the new value only if they are equal
private:
    volatile T *ptr;
};
- set: stores the parameter ptr_ into the private member ptr. It is not an atomic operation; the caller must guarantee that no other thread is using ptr while set executes.
- xchg: stores val_ into the private member ptr and returns the value ptr held before. Atomic and thread-safe.
- cas: atomic and thread-safe; compares the private member ptr with the parameter cmp_:
  - if they are equal, stores val_ into ptr and returns the value ptr held before;
  - if they are not equal, simply returns the current value of ptr.
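The same three operations can be sketched in C++11 with std::atomic instead of zmq's platform-specific inline assembly (this is an equivalent illustration, not the original zmq class):

```cpp
#include <atomic>

// C++11 sketch of zmq's atomic_ptr_t: set is a plain store, xchg and cas map
// onto std::atomic's exchange and compare_exchange_strong.
template <typename T>
class atomic_ptr_t
{
public:
    atomic_ptr_t() : ptr(nullptr) {}

    // Non-atomic in spirit: caller must ensure no concurrent users of ptr.
    void set(T *ptr_) { ptr.store(ptr_, std::memory_order_relaxed); }

    // Atomic exchange: store a new value, return the old one.
    T *xchg(T *val_) { return ptr.exchange(val_); }

    // Atomic compare-and-swap: if ptr == cmp_, store val_; either way,
    // return the value ptr held before the call.
    T *cas(T *cmp_, T *val_)
    {
        T *expected = cmp_;
        ptr.compare_exchange_strong(expected, val_);
        return expected;   // == cmp_ on success, the actual value on failure
    }

private:
    std::atomic<T *> ptr;
};
```

Note how compare_exchange_strong writes the observed value back into `expected` on failure, which is exactly the "return the old value" contract described above.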
2) The lock-free queue interface
3) Data-structure logic
- The spare chunk works like the idle-connection list of a connection pool: when a chunk is drained, it can be reused (N is fixed at construction time, and chunk_t nodes form a linked list).
- begin_chunk points to the head chunk; begin_pos points to the first queue element within that chunk, in the range [0, N-1].
- end and back_chunk only differ while no data has been stored yet.
- end_pos is used to probe whether the current chunk has been used up.
4) The lock-free queue's functions (constructor/destructor, etc.):
1) Constructor
- Initializes every member; all pointers initially point at the first chunk.
2) front and back
3) push
- spare_chunk holds at most one chunk.
- Notes on chunk handling.
4) pop
- Data in yqueue only becomes readable after a flush.
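The chunk mechanics of push and pop can be condensed into a single-threaded sketch. The names (begin_chunk, end_pos, spare_chunk, N) follow yqueue_t, but the body is a simplified illustration: the real class splits front/back/end positions across chunks and guards spare_chunk with atomic_ptr_t::xchg, while here the spare chunk is a plain pointer:

```cpp
#include <cstddef>

// Condensed sketch of yqueue_t's chunk recycling: push() grows by whole
// chunks, pop() hands a drained chunk to spare_chunk so the next grow can
// reuse it instead of calling new.
template <typename T, int N>
class yqueue_sketch
{
    struct chunk_t { T values[N]; chunk_t *next; };

public:
    yqueue_sketch()
        : begin_chunk(new_chunk()), end_chunk(begin_chunk),
          begin_pos(0), end_pos(0), spare_chunk(nullptr) {}

    ~yqueue_sketch()
    {
        for (chunk_t *c = begin_chunk; c != nullptr; )
        {
            chunk_t *next = c->next;
            delete c;
            c = next;
        }
        delete spare_chunk;
    }

    void push(const T &v)
    {
        end_chunk->values[end_pos] = v;
        if (++end_pos == N)                     // current chunk full: grow
        {
            // Reuse the chunk recycled by pop(), if any.
            chunk_t *c = spare_chunk ? spare_chunk : new_chunk();
            spare_chunk = nullptr;
            c->next = nullptr;
            end_chunk->next = c;
            end_chunk = c;
            end_pos = 0;
        }
        ++size_;
    }

    bool pop(T &v)
    {
        if (size_ == 0)
            return false;
        v = begin_chunk->values[begin_pos];
        if (++begin_pos == N)                   // head chunk drained: recycle
        {
            chunk_t *old = begin_chunk;
            begin_chunk = begin_chunk->next;
            delete spare_chunk;                 // keep at most one spare
            spare_chunk = old;
            begin_pos = 0;
        }
        --size_;
        return true;
    }

private:
    static chunk_t *new_chunk()
    {
        chunk_t *c = new chunk_t;
        c->next = nullptr;
        return c;
    }

    chunk_t *begin_chunk, *end_chunk;
    int begin_pos, end_pos;
    chunk_t *spare_chunk;                       // at most one recycled chunk
    std::size_t size_ = 0;
};
```

This is the "fluctuating by +1/-1" behavior summarized below: in steady state a chunk freed by pop is immediately reused by the next push, so the chunk count barely moves.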
5) Summary
1) Locality of reference
Under normal conditions the number of chunks stays stable, fluctuating by +1/-1.
2) What about keeping more than one spare chunk?
Managing several chunks raises atomicity problems that are hard to handle:
- take a lock (almost never done), or
- design yet another lock-free queue just to manage the recycled chunks.
3. The ypipe implementation (building a single-writer, single-reader lock-free queue)
- ypipe_t builds a single-writer, single-reader lock-free pipe on top of yqueue_t.
1) Class interface and member variables
2) ypipe_t() initialization
3) write
4) flush
- f points at the next element that needs to be published.
- If flush returns true there is nothing more to do; if it returns false, the reader thread must be notified.
- The read position is not tracked by r but by begin_pos; r is only used to decide whether the queue is empty (by checking whether r == begin_pos).
5) read
- If read finds the pipe empty, there is no data to consume.
- Truly no new data: queue.front() == r.
- New data has just arrived and been flushed: r != queue.front().
- r does not advance element by element; it advances in batches, at the moments flush is called.
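The batching behavior of write/flush/read can be shown with a small index-based model. The names (c, r) follow ypipe_t, but the body is a deliberate simplification: the sleep/wake handshake (flush returning false when the reader is asleep) is omitted, and elements live in a std::vector whose reallocation would race in a real two-thread setting — which is precisely why zmq stores elements in yqueue chunks whose addresses never move:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Minimal model of ypipe_t's batching: write() is producer-private, flush()
// publishes everything written since the last flush with one atomic store,
// and the consumer picks up a whole batch at once.
template <typename T>
class ypipe_sketch
{
public:
    void write(const T &v)                 // producer thread only
    {
        buf_.push_back(v);                 // not visible to the reader yet
    }

    void flush()                           // producer thread only
    {
        c_.store(buf_.size(), std::memory_order_release);
    }

    bool read(T &v)                        // consumer thread only
    {
        if (r_ == limit_)                  // local batch exhausted
        {
            limit_ = c_.load(std::memory_order_acquire);  // grab a new batch
            if (r_ == limit_)
                return false;              // nothing flushed: pipe looks empty
        }
        v = buf_[r_++];
        return true;
    }

private:
    std::vector<T> buf_;                   // grows only; a real pipe recycles
    std::atomic<std::size_t> c_{0};        // flush boundary, shared atomically
    std::size_t r_ = 0, limit_ = 0;        // consumer-private cursors
};
```

Note how the consumer's cursor advances freely inside a batch and only touches the shared atomic when the batch runs out — the "r advances in batches, at flush time" behavior described above.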
6) Comparison with multi-writer/multi-reader setups
- Characteristics of 4 writers/1 reader versus 1 writer/1 reader.
- zmq's 1-writer/1-reader pipe is the most efficient, but it is not suited to multi-producer scenarios; below is 2 writers/1 reader.
- 4w1r: 4 writers, 1 reader.
- Then 1 writer, multiple readers.
test4 involves memory barriers and related techniques.
4. Characteristics of lock-free queues
- Use case: extremely heavy traffic.
- Multiple producers do not gain much performance; lock-free queues are a poor fit for many producers.
- Best suited to one producer with multiple consumers (the CAS sits inside a while loop, so with many producers the CPUs burn cycles spinning and context switching).
- All lock-free structures rely on CAS (compare and swap); the two common layouts are linked list and array. The linked-list version is more complex, the array version simpler.
- How lock-free works:
Locking is a pessimistic strategy: it assumes that every access to the shared resource will conflict, so it is willing to sacrifice performance (time) to guarantee data safety.
Lock-free is an optimistic strategy: it assumes that accesses will not conflict, so no lock is taken and threads keep running without stopping. Whenever a conflict does occur, the operation is simply retried until it succeeds.
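The optimistic retry cycle looks like this in C++11: read, compute, CAS, and on conflict simply go around again instead of blocking:

```cpp
#include <atomic>

// Lock-free add via the classic optimistic loop: on conflict (another thread
// changed the value between our load and our CAS) we just retry.
inline int optimistic_add(std::atomic<int> &target, int delta)
{
    int oldv = target.load();
    while (!target.compare_exchange_weak(oldv, oldv + delta))
    {
        // compare_exchange_weak reloaded oldv with the current value;
        // loop and retry with the fresh snapshot.
    }
    return oldv + delta;
}
```

No thread ever sleeps here; a loser of the race pays only one extra loop iteration, which is the performance bet that lock-free programming makes.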
- The ABA problem of lock-free code
CAS has another failure mode, the well-known ABA problem: the current thread expects the value A; one thread changes A to B, and another thread changes B back to A, so the current thread mistakenly believes the value never changed, and the subsequent operation can go wrong.
Borrowing the optimistic-locking technique used by databases, we can maintain a global version number (or tag): an update is performed only when the expected value equals the value in memory and the tag has not changed.
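One common realization of this idea is to CAS on a (value, version) pair packed into a single word — the same trick the linked-list queue below uses when it packs a version into the high bits of its node index. The packing layout here is an illustrative choice, not taken from that code:

```cpp
#include <atomic>
#include <cstdint>

// Pack a 32-bit value and a 32-bit version into one 64-bit word so a single
// CAS checks both: even if the value goes A -> B -> A, the version has
// advanced, so a stale CAS fails.
inline uint64_t pack(uint32_t value, uint32_t version)
{
    return (static_cast<uint64_t>(version) << 32) | value;
}
inline uint32_t value_of(uint64_t word)   { return static_cast<uint32_t>(word); }
inline uint32_t version_of(uint64_t word) { return static_cast<uint32_t>(word >> 32); }

// Change the value from `expected` to `desired`; succeeds only if neither
// the value nor the version changed underneath us.
inline bool versioned_cas(std::atomic<uint64_t> &target,
                          uint32_t expected, uint32_t desired)
{
    uint64_t old = target.load();
    if (value_of(old) != expected)
        return false;
    uint64_t fresh = pack(desired, version_of(old) + 1);  // bump the version
    return target.compare_exchange_strong(old, fresh);
}
```

A thread holding a stale snapshot sees the same value A but an older version, so its CAS fails and it must re-read — which is exactly what defeats ABA.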
Lock-free (CAS) programming is not very friendly; unless you have mastered it thoroughly, you are better off writing the code with locks.
CAS is more of a mindset, and one path to high-performance programming; several open-source lock-free libraries are already available, and they may well be our best choice.
8) Code: the array-based queue
#ifndef _ARRAYLOCKFREEQUEUEIMP_H___
#define _ARRAYLOCKFREEQUEUEIMP_H___

#include "ArrayLockFreeQueue.h"

#include <assert.h>
#include <sched.h>      // sched_yield
#include "atom_opt.h"   // CAS, AtomicAdd, AtomicSub

template <typename ELEM_T, QUEUE_INT Q_SIZE>
ArrayLockFreeQueue<ELEM_T, Q_SIZE>::ArrayLockFreeQueue() :
    m_writeIndex(0),
    m_readIndex(0),
    m_maximumReadIndex(0)
{
    m_count = 0;
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
ArrayLockFreeQueue<ELEM_T, Q_SIZE>::~ArrayLockFreeQueue()
{
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
inline QUEUE_INT ArrayLockFreeQueue<ELEM_T, Q_SIZE>::countToIndex(QUEUE_INT a_count)
{
    return (a_count % Q_SIZE);
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
QUEUE_INT ArrayLockFreeQueue<ELEM_T, Q_SIZE>::size()
{
    QUEUE_INT currentWriteIndex = m_writeIndex;
    QUEUE_INT currentReadIndex  = m_readIndex;

    if (currentWriteIndex >= currentReadIndex)
        return currentWriteIndex - currentReadIndex;
    else
        return Q_SIZE + currentWriteIndex - currentReadIndex;
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
bool ArrayLockFreeQueue<ELEM_T, Q_SIZE>::enqueue(const ELEM_T &a_data)
{
    QUEUE_INT currentWriteIndex;    // snapshot of the write index
    QUEUE_INT currentReadIndex;

    do
    {
        currentWriteIndex = m_writeIndex;
        currentReadIndex  = m_readIndex;
        if (countToIndex(currentWriteIndex + 1) == countToIndex(currentReadIndex))
        {
            return false;   // the queue is full
        }
    } while (!CAS(&m_writeIndex, currentWriteIndex, (currentWriteIndex + 1)));

    // We know now that this index is reserved for us. Use it to save the data.
    m_thequeue[countToIndex(currentWriteIndex)] = a_data;

    // Update the maximum read index after saving the data. It wouldn't fail if
    // there is only one thread inserting into the queue. It might fail if there
    // is more than one producer thread, because this operation has to be done
    // in the same order as the previous CAS.
    while (!CAS(&m_maximumReadIndex, currentWriteIndex, (currentWriteIndex + 1)))
    {
        // This is a good place to yield the thread in case there are more
        // software threads than hardware processors and you have more than
        // one producer thread. Have a look at sched_yield (POSIX.1b).
        sched_yield();  // if threads outnumber cores and we never yield, we could spin here forever
    }

    AtomicAdd(&m_count, 1);
    return true;
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
bool ArrayLockFreeQueue<ELEM_T, Q_SIZE>::try_dequeue(ELEM_T &a_data)
{
    return dequeue(a_data);
}

template <typename ELEM_T, QUEUE_INT Q_SIZE>
bool ArrayLockFreeQueue<ELEM_T, Q_SIZE>::dequeue(ELEM_T &a_data)
{
    QUEUE_INT currentMaximumReadIndex;
    QUEUE_INT currentReadIndex;

    do
    {
        // To ensure thread safety when there is more than one producer thread,
        // a second index is defined (m_maximumReadIndex).
        currentReadIndex        = m_readIndex;
        currentMaximumReadIndex = m_maximumReadIndex;

        if (countToIndex(currentReadIndex) == countToIndex(currentMaximumReadIndex))
        {
            // The queue is empty, or a producer thread has allocated space in
            // the queue but is still waiting to commit the data into it.
            return false;
        }

        // Retrieve the data from the queue.
        a_data = m_thequeue[countToIndex(currentReadIndex)];

        // Try to perform now the CAS operation on the read index. If we
        // succeed, a_data already contains what m_readIndex pointed to before
        // we increased it.
        if (CAS(&m_readIndex, currentReadIndex, (currentReadIndex + 1)))
        {
            AtomicSub(&m_count, 1); // we really consumed an element
            return true;
        }
    } while (true);

    assert(0);
    // Add this return statement to avoid compiler warnings.
    return false;
}

#endif
9) Code: the linked-list queue
#ifndef SIMPLE_LOCK_FREE_QUEUE_H
#define SIMPLE_LOCK_FREE_QUEUE_H

#include <atomic>
#include <cstdint>
#include <new>
#include <utility>

// Fairly simple, yet correct, implementation of a simple lock-free queue based on linked pointers with CAS
template <typename T>
class SimpleLockFreeQueue
{
public:
// Total maximum capacity: 2**39 (half a terabyte's worth -- off-by-one aligned indices)
static const int UBER_BLOCKS = 256;
static const int UBER_BLOCK_SIZE = 256;
static const int ULTRA_BLOCK_SIZE = 256;
static const int SUPER_BLOCK_SIZE = 256;
static const int BLOCK_SIZE = 128;
private:
static const uint64_t VERSION_MASK = 0xFFFFFF0000000000ULL;
static const uint64_t VERSION_INCR = 0x0000010000000000ULL;
static const uint64_t UBER_BLOCK_IDX_MASK = 0xFF00000000ULL;
static const uint64_t UBER_BLOCK_MASK = 0x00FF000000ULL;
static const uint64_t ULTRA_BLOCK_MASK = 0x0000FF0000ULL;
static const uint64_t SUPER_BLOCK_MASK = 0x000000FF00ULL;
static const uint64_t BLOCK_MASK = 0x00000000FEULL;
static const uint64_t UBER_BLOCK_IDX_SHIFT = 32;
static const uint64_t UBER_BLOCK_SHIFT = 24;
static const uint64_t ULTRA_BLOCK_SHIFT = 16;
static const uint64_t SUPER_BLOCK_SHIFT = 8;
static const uint64_t BLOCK_SHIFT = 1;
typedef std::uint64_t idx_t;
public:
SimpleLockFreeQueue()
: nextNodeIdx(2), freeListHead(0)
{
// Invariants: Head and tail are never null
auto initialNode = allocate_blank_node();
head.store(set_consumed_flag(initialNode), std::memory_order_relaxed);
tail.store(initialNode, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
}
~SimpleLockFreeQueue()
{
std::atomic_thread_fence(std::memory_order_seq_cst);
idx_t idx = head.load(std::memory_order_relaxed);
if (is_consumed(idx))
{
idx = clear_consumed_flag(idx);
auto node = get_node_at(idx);
auto next = node->next.load(std::memory_order_relaxed);
node->~Node();
idx = next;
}
while (idx != 0)
{
auto node = get_node_at(idx);
auto next = node->next.load(std::memory_order_relaxed);
node->item()->~T();
node->~Node();
idx = next;
}
idx = freeListHead.load(std::memory_order_relaxed);
while (idx != 0)
{
auto node = get_node_at(idx);
auto next = node->next.load(std::memory_order_relaxed);
node->~Node();
idx = next;
}
}
template <typename U>
inline bool enqueue(U &&item)
{
idx_t nodeIdx = allocate_node_for(std::forward<U>(item));
auto tail_ = tail.load(std::memory_order_relaxed);
while (!tail.compare_exchange_weak(tail_, nodeIdx, std::memory_order_release, std::memory_order_relaxed))
continue;
get_node_at(tail_)->next.store(nodeIdx, std::memory_order_release);
return true;
}
inline bool try_dequeue(T &item)
{
while (true)
{
auto rawHead_ = head.load(std::memory_order_acquire);
auto head_ = clear_consumed_flag(rawHead_);
auto headNode = get_node_at(head_);
auto next = headNode->next.load(std::memory_order_relaxed);
if (next == 0)
{
// Can't move head (that would make head null), but can try to dequeue the node at head anyway
if (is_consumed(rawHead_))
{
return false;
}
if (head.compare_exchange_strong(head_, set_consumed_flag(head_), std::memory_order_release, std::memory_order_relaxed))
{
// Whee, we own the right to dequeue this item
item = std::move(*headNode->item());
headNode->item()->~T();
return true;
}
}
else
{
// Remove node whether it's already been consumed or not; if it hasn't been consumed, consume it!
// head_->next can't possibly change, since once it's not null nobody writes to it (and ABA is avoided with versioning)
if (head.compare_exchange_weak(rawHead_, next, std::memory_order_acq_rel, std::memory_order_relaxed))
{
// Aha, we successfully moved the head. But does it have anything in it?
if (!is_consumed(rawHead_))
{
item = std::move(*headNode->item());
headNode->item()->~T();
}
add_node_to_free_list(head_, headNode);
if (!is_consumed(rawHead_))
{
return true;
}
}
}
}
}
private:
struct Node
{
std::atomic<idx_t> next;
alignas(T) char rawItem[sizeof(T)];
template <typename U>
Node(U &&item)
: next(0)
{
new (this->item()) T(std::forward<U>(item));
}
Node()
: next(0)
{
}
inline T *item() { return reinterpret_cast<T *>(rawItem); }
};
struct Block
{
alignas(Node) char nodes[sizeof(Node) * BLOCK_SIZE];
inline char *node_pos(idx_t idx) { return nodes + ((idx & BLOCK_MASK) >> BLOCK_SHIFT) * sizeof(Node); }
};
template <typename TSubBlock, int BlockSize>
struct HigherOrderBlock
{
std::atomic<TSubBlock *> subblocks[BlockSize];
HigherOrderBlock()
{
for (int i = 0; i != BlockSize; ++i)
{
subblocks[i].store(nullptr, std::memory_order_release);
}
}
~HigherOrderBlock()
{
for (int i = 0; i != BlockSize; ++i)
{
if (subblocks[i].load(std::memory_order_relaxed) != nullptr)
{
delete subblocks[i].load(std::memory_order_relaxed);
}
}
}
};
typedef HigherOrderBlock<Block, SUPER_BLOCK_SIZE> SuperBlock;
typedef HigherOrderBlock<SuperBlock, ULTRA_BLOCK_SIZE> UltraBlock;
typedef HigherOrderBlock<UltraBlock, UBER_BLOCK_SIZE> UberBlock;
typedef HigherOrderBlock<UberBlock, UBER_BLOCKS> UberBlockContainer;
private:
inline idx_t set_consumed_flag(idx_t idx)
{
return idx | (idx_t)1;
}
inline idx_t clear_consumed_flag(idx_t idx)
{
return idx & ~(idx_t)1;
}
inline bool is_consumed(idx_t idx)
{
return (idx & 1) != 0;
}
inline void add_node_to_free_list(idx_t idx, Node *node)
{
auto head = freeListHead.load(std::memory_order_relaxed);
do
{
node->next.store(head, std::memory_order_relaxed);
} while (!freeListHead.compare_exchange_weak(head, idx, std::memory_order_release, std::memory_order_relaxed));
}
inline idx_t try_get_node_from_free_list()
{
auto head = freeListHead.load(std::memory_order_acquire);
while (head != 0 && !freeListHead.compare_exchange_weak(head, get_node_at(head)->next.load(std::memory_order_relaxed), std::memory_order_acquire, std::memory_order_acquire))
{
continue;
}
if (head != 0)
{
// Increment version
head = (head & ~VERSION_MASK) | ((head + VERSION_INCR) & VERSION_MASK);
}
return head;
}
inline Node *get_node_at(idx_t idx)
{
auto uberBlock = uberBlockContainer.subblocks[(idx & UBER_BLOCK_IDX_MASK) >> UBER_BLOCK_IDX_SHIFT].load(std::memory_order_relaxed);
auto ultraBlock = uberBlock->subblocks[(idx & UBER_BLOCK_MASK) >> UBER_BLOCK_SHIFT].load(std::memory_order_relaxed);
auto superBlock = ultraBlock->subblocks[(idx & ULTRA_BLOCK_MASK) >> ULTRA_BLOCK_SHIFT].load(std::memory_order_relaxed);
auto block = superBlock->subblocks[(idx & SUPER_BLOCK_MASK) >> SUPER_BLOCK_SHIFT].load(std::memory_order_relaxed);
return reinterpret_cast<Node *>(block->node_pos(idx));
}
template <typename U>
inline idx_t allocate_node_for(U &&item)
{
auto idx = try_get_node_from_free_list();
if (idx != 0)
{
auto node = get_node_at(idx);
node->next.store(0, std::memory_order_relaxed);
new (node->item()) T(std::forward<U>(item));
return idx;
}
new (new_node_address(idx)) Node(std::forward<U>(item));
return idx;
}
inline idx_t allocate_blank_node()
{
idx_t idx;
new (new_node_address(idx)) Node();
return idx;
}
inline char *new_node_address(idx_t &idx)
{
idx = nextNodeIdx.fetch_add(static_cast<idx_t>(1) << BLOCK_SHIFT, std::memory_order_relaxed);
std::size_t uberBlockContainerIdx = (idx & UBER_BLOCK_IDX_MASK) >> UBER_BLOCK_IDX_SHIFT;
std::size_t uberBlockIdx = (idx & UBER_BLOCK_MASK) >> UBER_BLOCK_SHIFT;
std::size_t ultraBlockIdx = (idx & ULTRA_BLOCK_MASK) >> ULTRA_BLOCK_SHIFT;
std::size_t superBlockIdx = (idx & SUPER_BLOCK_MASK) >> SUPER_BLOCK_SHIFT;
auto uberBlock = lookup_subblock<UberBlockContainer, UberBlock>(&uberBlockContainer, uberBlockContainerIdx);
auto ultraBlock = lookup_subblock<UberBlock, UltraBlock>(uberBlock, uberBlockIdx);
auto superBlock = lookup_subblock<UltraBlock, SuperBlock>(ultraBlock, ultraBlockIdx);
auto block = lookup_subblock<SuperBlock, Block>(superBlock, superBlockIdx);
return block->node_pos(idx);
}
template <typename TBlock, typename TSubBlock>
inline TSubBlock *lookup_subblock(TBlock *block, std::size_t idx)
{
auto ptr = block->subblocks[idx].load(std::memory_order_acquire);
if (ptr == nullptr)
{
auto newBlock = new TSubBlock();
if (!block->subblocks[idx].compare_exchange_strong(ptr, newBlock, std::memory_order_release, std::memory_order_acquire))
{
delete newBlock;
}
else
{
ptr = newBlock;
}
}
return ptr;
}
private:
std::atomic<idx_t> nextNodeIdx;
std::atomic<idx_t> head; // head of the list
std::atomic<idx_t> tail; // tail of the list
std::atomic<idx_t> freeListHead;
UberBlockContainer uberBlockContainer;
};
#endif