An Efficient C++ Lock-Free Queue Implementation: moodycamel::ConcurrentQueue

weixin_30347009

Original link: http://www.cnblogs.com/lvdongjie/p/9679168.html

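Written by a very capable developer abroad. It is cross-platform, supports multiple writer threads and multiple reader threads, and lets you specify producer and consumer tokens. Reposted here.

The author also seems to maintain the project actively: I raised a few questions, and each time I got a prompt, careful, and thorough reply. Very impressive!

Link (source code available for download): https://github.com/cameron314/concurrentqueue

The author's benchmark results: http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++#benchmarks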

moodycamel::ConcurrentQueue

An industrial-strength lock-free queue for C++.

Note: If all you need is a single-producer, single-consumer queue, I have one of those too.

Features

Knock-your-socks-off blazing fast performance.
Single-header implementation. Just drop it in your project.
Fully thread-safe lock-free queue. Use concurrently from any number of threads.
C++11 implementation -- elements are moved (instead of copied) where possible.
Templated, obviating the need to deal exclusively with pointers -- memory is managed for you.
No artificial limitations on element types or maximum count.
Memory can be allocated once up-front, or dynamically as needed.
Fully portable (no assembly; all is done through standard C++11 primitives).
Supports super-fast bulk operations.
Includes a low-overhead blocking version (BlockingConcurrentQueue).
Exception safe.

Reasons to use

There are not that many full-fledged lock-free queues for C++. Boost has one, but it's limited to objects with trivial assignment operators and trivial destructors, for example. Intel's TBB queue isn't lock-free, and requires trivial constructors too. There are many academic papers that implement lock-free queues in C++, but usable source code is hard to find, and tests even more so.

This queue not only has fewer limitations than others (for the most part), but it's also faster. It's been fairly well-tested, and offers advanced features like bulk enqueueing/dequeueing (which, with my new design, is much faster than one element at a time, approaching and even surpassing the speed of a non-concurrent queue even under heavy contention).

In short, there was a lock-free queue shaped hole in the C++ open-source universe, and I set out to fill it with the fastest, most complete, and well-tested design and implementation I could. The result is moodycamel::ConcurrentQueue :-)

Reasons not to use

The fastest synchronization of all is the kind that never takes place. Fundamentally, concurrent data structures require some synchronization, and that takes time. Every effort was made, of course, to minimize the overhead, but if you can avoid sharing data between threads, do so!

Why use concurrent data structures at all, then? Because they're gosh darn convenient! (And, indeed, sometimes sharing data concurrently is unavoidable.)

My queue is not linearizable (see the next section on high-level design). The foundations of its design assume that producers are independent; if this is not the case, and your producers co-ordinate amongst themselves in some fashion, be aware that the elements won't necessarily come out of the queue in the same order they were put in relative to the ordering formed by that co-ordination (but they will still come out in the order they were put in by any individual producer). If this affects your use case, you may be better off with another implementation; either way, it's an important limitation to be aware of.

My queue is also not NUMA aware, and does a lot of memory re-use internally, meaning it probably doesn't scale particularly well on NUMA architectures; however, I don't know of any other lock-free queue that is NUMA aware (except for SALSA, which is very cool, but has no publicly available implementation that I know of).

Finally, the queue is not sequentially consistent; there is a happens-before relationship between when an element is put in the queue and when it comes out, but other things (such as pumping the queue until it's empty) require more thought to get right in all eventualities, because explicit memory ordering may have to be done to get the desired effect. In other words, it can sometimes be difficult to use the queue correctly. This is why it's a good idea to follow the samples where possible. On the other hand, the upside of this lack of sequential consistency is better performance.

High-level design

Elements are stored internally using contiguous blocks instead of linked lists for better performance. The queue is made up of a collection of sub-queues, one for each producer. When a consumer wants to dequeue an element, it checks all the sub-queues until it finds one that's not empty. All of this is largely transparent to the user of the queue, however -- it mostly just works™.

One particular consequence of this design, however (which seems to be non-intuitive), is that if two producers enqueue at the same time, there is no defined ordering between the elements when they're later dequeued. Normally this is fine, because even with a fully linearizable queue there'd be a race between the producer threads and so you couldn't rely on the ordering anyway. However, if for some reason you do extra explicit synchronization between the two producer threads yourself, thus defining a total order between enqueue operations, you might expect that the elements would come out in the same total order, which is a guarantee my queue does not offer. At that point, though, there semantically aren't really two separate producers, but rather one that happens to be spread across multiple threads. In this case, you can still establish a total ordering with my queue by creating a single producer token, and using that from both threads to enqueue (taking care to synchronize access to the token, of course, but there was already extra synchronization involved anyway).

I've written a more detailed overview of the internal design, as well as the full nitty-gritty details of the design, on my blog. Finally, the source itself is available for perusal for those interested in its implementation.

Basic use

The entire queue's implementation is contained in one header, concurrentqueue.h. Simply download and include that to use the queue. The blocking version is in a separate header, blockingconcurrentqueue.h, that depends on the first. The implementation makes use of certain key C++11 features, so it requires a fairly recent compiler (e.g. VS2012+ or g++ 4.8; note that g++ 4.6 has a known bug with std::atomic and is thus not supported). The algorithm implementations themselves are platform independent.

Use it like you would any other templated queue, with the exception that you can use it from many threads at once :-)

Simple example: