Thread Pool: Analysis and Reimplementation

夜半读核

Last modified 2024-02-28 22:37:51

Category: Multithreaded programming. Tags: c++, multithreading

First published 2024-02-27 22:53:17

Article link: https://blog.csdn.net/dengwodaer/article/details/136331037

https://github.com/progschj/ThreadPool.git
This is a C++ thread pool with 7.2K stars on GitHub. It is short and well crafted, so I studied and analyzed it.

Concurrency analysis

I have recently been reading the book "Boost.Asio C++ Network Programming Cookbook", and one sentence in it helped me understand a key point of multithreaded programming:
“The application running more threads than the number of cores or processors installed in the computer may slow down the application due to the effect of the thread switching overhead.”
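This sentence made me appreciate the relationship between threads and the processors and cores of a server.
The number of processors and cores in a server is fixed. According to the principles of computer organization (the part on instruction pipelines), multiple processors and multiple cores allow instructions to execute in parallel, so a multiprocessor, multicore machine helps a great deal in improving a program's running efficiency.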

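So multithreaded programming is a good model, but one thing must be clear: the point of multithreading is to let a program run in parallel on the capacity that multiple processors and cores provide, which can speed it up several times over. Does that mean the more threads the better?
Certainly not. The machine's capacity is fixed. If you start several hundred threads, only some of them can actually run at any given moment. Worse, the operating system's scheduler still has to switch the threads that are not running onto the cores, and frequent thread switching is expensive. At that point, more threads actually reduce the program's throughput.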

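On a single processor (single core), multithreading is only pseudo-parallel: it is concurrency, not parallelism. Its benefit shows when one thread blocks, because the processor can then switch to another thread. That kind of switch is worthwhile; it makes the most of limited capacity. (In short, the processor is kept busy.)
Therefore, when using a thread pool to process tasks, first consider how much parallelism the machine offers, and then choose a suitable number of threads.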
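
As a minimal sketch of that last point, the standard library can report how many hardware threads are available. The helper name chooseWorkerCount and its fallback value are assumptions made for illustration; they are not part of the ThreadPool library.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper (not part of progschj/ThreadPool): pick a worker count
// from the number of hardware threads the implementation reports.
std::size_t chooseWorkerCount(std::size_t fallback = 2) {
    // hardware_concurrency() is only a hint and may return 0 when the
    // value cannot be determined, so a fallback is needed.
    const unsigned int hw = std::thread::hardware_concurrency();
    return hw != 0 ? static_cast<std::size_t>(hw) : fallback;
}
```

For CPU-bound tasks this value is a reasonable starting point; tasks that block on I/O may benefit from more threads, which is best decided by measurement.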

Source code analysis

Core idea

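A std::queue serves as the task queue. A producer puts a task into the queue and wakes up one blocked thread in the pool (std::vector<std::thread> workers); that thread takes the task from the queue and executes it.
Let us break this down: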

The producer puts a task into the task queue
One blocked thread in the pool is woken up
A very simple process.
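First, why do we need a thread pool? Because we have many tasks to execute. One thread is not enough; we need several threads that can execute tasks for us concurrently.
Second, what is a complete thread pool project aiming for? Generality. If I use it for arithmetic, it works. If I then use it to read and write the disk, does it need to be modified? If it does, it is no longer general. So how can one pool serve different needs? Let us look at how this thread pool does its abstraction.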

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args) 
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...)
        );
        
    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queue_mutex);

        // don't allow enqueueing after stopping the pool
        if(stop)
            throw std::runtime_error("enqueue on stopped ThreadPool");

        tasks.emplace([task](){ (*task)(); });
    }
    condition.notify_one();
    return res;
}

The above is the logic for putting a task into the task queue.

Converting a task before it is enqueued

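A task is usually a procedure or a piece of program, so what we enqueue is a function to be executed together with its arguments. To be general, the same enqueue interface must accept functions with different signatures, which is why it is a template.
As mentioned earlier, the task queue is the std::queue class template, and it must be instantiated before it can be used as a queue. Once the element type T of std::queue<T> is fixed, a function of any signature passed in by the user must be converted into an object of type T before it is enqueued.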
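
The idea of fixing T once and converting every task to it can be sketched as follows. This is an illustrative example, not code from the library; runMixedJobs and the two lambdas are hypothetical names, and the queue here is used from a single thread only.

```cpp
#include <functional>
#include <queue>
#include <string>

// Hypothetical demo: two functions with different signatures share one queue.
std::string runMixedJobs() {
    std::string log;
    std::queue<std::function<void()>> jobs;   // T is fixed to std::function<void()>

    auto add = [](int a, int b) { return a + b; };
    auto greet = [](const std::string& name) { return "hi " + name; };

    // The arguments are supplied up front, so every job has the signature void().
    jobs.emplace([&log, add] { log += std::to_string(add(1, 2)); });
    jobs.emplace([&log, greet] { log += "," + greet("pool"); });

    // Run the jobs in FIFO order.
    while (!jobs.empty()) {
        jobs.front()();
        jobs.pop();
    }
    return log;
}
```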

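In addition, the user wants not only for the task to be executed but also to obtain its result. The task does not run on the business thread, however; it runs to completion on a worker thread in the pool, asynchronously with respect to the business thread. How can the business thread get the result produced on the worker thread? The two threads have to agree, before the task runs, on a place where the result will be shared. Threads in the same process share memory, so they can place an object on the heap: when the worker thread finishes, it writes the result into that object, and the business thread reads the result from it. The C++ standard library provides exactly this mechanism: std::future and std::packaged_task.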

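An introduction to these two templates:
[1] https://en.cppreference.com/w/cpp/thread/future
“The class template std::future provides a mechanism to access the result of asynchronous operations:”
In other words, a std::future is the handle through which the business thread later reads the result.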

[2] https://en.cppreference.com/w/cpp/thread/packaged_task
“The class template std::packaged_task wraps any Callable target (function, lambda expression, bind expression, or another function object) so that it can be invoked asynchronously. Its return value or exception thrown is stored in a shared state which can be accessed through std::future objects.”
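In short, std::packaged_task wraps any callable target (a function, lambda expression, bind expression, or other function object) so that it can run as an asynchronous operation. After it runs, its return value or the exception it threw is stored in a shared state, which a std::future can access.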
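
A minimal sketch of the two templates working together, independent of any thread pool. The function name addOnWorkerThread is an assumption made for this example.

```cpp
#include <future>
#include <thread>
#include <utility>

// Runs an addition on another thread and fetches the result via std::future.
int addOnWorkerThread(int a, int b) {
    std::packaged_task<int(int, int)> task([](int x, int y) { return x + y; });
    std::future<int> result = task.get_future();   // linked to the task's shared state

    // std::packaged_task is move-only, so it is moved into the thread.
    std::thread worker(std::move(task), a, b);
    const int value = result.get();                // blocks until the task has run
    worker.join();
    return value;
}
```

If the wrapped callable throws, the exception is stored in the shared state and rethrown by get() on the calling thread.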

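So, let us summarize the whole conversion process:
[Figure: the conversion steps a task goes through before it is enqueued]
As the figure shows,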

Step 1 -> step 2:

using return_type = typename std::result_of<F(Args...)>::type;
auto task = std::make_shared< std::packaged_task<return_type()> >(
    std::bind(std::forward<F>(f), std::forward<Args>(args)...)
);

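std::result_of<F(Args...)>
https://en.cppreference.com/w/cpp/types/result_of
It deduces, at compile time, the return type of invoking a callable with the given argument types. (std::result_of is deprecated since C++17 and removed in C++20; std::invoke_result is its replacement.)
We use it to obtain the return type of the function the user passed in. As noted earlier, the return value matters to us: the asynchronous result is accessed through std::future<T>, so determining the return type is essential.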

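std::packaged_task<T> wraps the function the user passed in. A transformation happens here: the user's function becomes a callable object whose return type is still return_type but which takes no parameters. To get there, the packaged_task is constructed from std::bind combined with std::forward, which binds every argument to the function in advance.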

std::bind(std::forward<F>(f), std::forward<Args>(args)...)

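The trailing ... here is a C++11 parameter pack expansion (not a C++17 fold expression), which is very useful in variadic template programming.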
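
To make the terminology concrete: the ... after std::forward<Args>(args) is a parameter pack expansion, which repeats a pattern once per argument. A fold expression is a separate C++17 feature that combines the pack with an operator. The sketch below contrasts the two; callWith and sumAll are hypothetical names, and sumAll requires C++17.

```cpp
#include <utility>

// Pack expansion (C++11): the pattern before "..." is repeated per argument,
// so this forwards every argument to f.
template <class F, class... Args>
auto callWith(F&& f, Args&&... args)
    -> decltype(std::forward<F>(f)(std::forward<Args>(args)...)) {
    return std::forward<F>(f)(std::forward<Args>(args)...);
}

// Fold expression (C++17): a binary operator is folded over the pack.
template <class... Ts>
auto sumAll(Ts... values) {
    return (values + ... + 0);
}
```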

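Step 2 -> step 3: as the figure shows, everything is wrapped uniformly in a lambda that captures, by value, the std::shared_ptr to the std::packaged_task. The shared_ptr is needed because std::packaged_task is move-only, while the target stored in a std::function must be copyable.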
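
The whole conversion can be sketched in isolation, without a queue or worker threads. The names wrapTask and CallResult are hypothetical. CallResult uses decltype instead of std::result_of so that the sketch also compiles under C++20, and as written it handles only callables invocable with function-call syntax (not pointers to members).

```cpp
#include <functional>
#include <future>
#include <memory>
#include <utility>

// Assumed helper alias: the type produced by calling F with Args.
template <class F, class... Args>
using CallResult = decltype(std::declval<F>()(std::declval<Args>()...));

// Hypothetical helper: returns the type-erased job plus the future for its result.
template <class F, class... Args>
std::pair<std::function<void()>, std::future<CallResult<F, Args...>>>
wrapTask(F&& f, Args&&... args) {
    using return_type = CallResult<F, Args...>;

    // Step 1 -> 2: bind the arguments, yielding a callable that takes none.
    auto task = std::make_shared<std::packaged_task<return_type()>>(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    std::future<return_type> result = task->get_future();

    // Step 2 -> 3: std::function needs a copyable target. The packaged_task
    // is move-only, but a shared_ptr to it can be copied into the lambda.
    std::function<void()> job = [task]() { (*task)(); };
    return std::make_pair(std::move(job), std::move(result));
}
```

Calling the returned job (on any thread) runs the user's function and makes the result available through the returned future.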

This completes the analysis of how the pool enqueues tasks. Next comes how the pool works.

How the thread pool works

1. Enqueueing a task wakes a thread

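Once the task has been wrapped, the user puts it into the task queue, and the business thread must then notify the threads in the pool: a task has arrived, whoever is free should take it and run it. That is all the logic there is. So this involves communication between threads. As in my previous multithreading article, "C++ version: two threads alternately printing 1~100", a condition variable handles this communication.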

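With a condition variable, threads block while the condition is not satisfied and proceed once it is. What is our condition? That the queue contains a task. So as soon as a task is enqueued, a blocked thread must be woken up.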

The snippet below uses condition.notify_one(), because only one thread needs to be woken to do the work.

{
	std::unique_lock<std::mutex> lock(queue_mutex);
	// don't allow enqueueing after stopping the pool
	if(stop)
		throw std::runtime_error("enqueue on stopped ThreadPool");
	tasks.emplace([task](){ (*task)(); });
}
condition.notify_one();

2. A thread takes a task and executes it

std::function<void()> task;
{
    std::unique_lock<std::mutex> lock(this->queue_mutex);
    this->condition.wait(lock,
        [this]{ return this->stop || !this->tasks.empty(); });
    if(this->stop && this->tasks.empty())
        return;
    task = std::move(this->tasks.front());
    this->tasks.pop();
}
task();

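After a thread starts, it may block because the queue is empty. When it is woken, it first checks again whether the condition holds, because the wakeup may not have come from the business thread at all; it may be a spurious wakeup.
If the business thread did wake it and the condition holds, it takes the task from the head of the tasks queue and then runs it. As described earlier, the task object is a lambda wrapped in std::function<void()>, so it can simply be invoked.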
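
The re-check after waking is what the predicate overload of wait does internally: condition.wait(lock, pred) behaves like while (!pred()) condition.wait(lock);. A minimal sketch with one producer and one consumer follows; receiveOne is a hypothetical name, and the explicit loop is written out only to show the equivalence.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical single-consumer demo: returns the value handed over by a producer.
int receiveOne() {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> q;
    int received = 0;

    std::thread consumer([&] {
        std::unique_lock<std::mutex> lock(m);
        // Hand-written form of cv.wait(lock, pred): re-check the condition
        // after every wakeup, because a wakeup may be spurious.
        while (q.empty())
            cv.wait(lock);
        received = q.front();
        q.pop();
    });

    {
        std::lock_guard<std::mutex> lock(m);   // producer side: modify under the lock
        q.push(42);
    }
    cv.notify_one();
    consumer.join();
    return received;
}
```

Checking the condition before the first wait also covers the case where the producer pushes before the consumer ever blocks.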

3. The thread pool and the worker routine

// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads)
    :   stop(false)
{
    for(size_t i = 0;i<threads;++i)
        workers.emplace_back(
            [this]
            {
                for(;;)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock,
                            [this]{ return this->stop || !this->tasks.empty(); });
                        if(this->stop && this->tasks.empty())
                            return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }
                    task();
                }
            }
        );
}

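As shown above, ThreadPool represents the thread pool object. Its constructor takes the number of threads, threads, and creates that many threads one by one.
emplace_back constructs each std::thread object in place inside the vector, with no copy. std::thread is not copyable, so passing an lvalue std::thread to std::vector<std::thread>::push_back fails to compile; push_back only works with an rvalue, for example push_back(std::thread(...)).
https://en.cppreference.com/w/cpp/container/vector/emplace_back
Each thread routine runs in a loop and uses the condition variable to decide whether a task is waiting; if there is none, it blocks.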
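
The difference between emplace_back and push_back for std::thread can be checked with a small sketch. runThreeWorkers is a hypothetical name; the two commented-out lines are the forms that do not compile.

```cpp
#include <atomic>
#include <thread>
#include <utility>
#include <vector>

// Starts three threads in three different ways and returns how many ran.
int runThreeWorkers() {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    auto body = [&counter] { ++counter; };

    workers.emplace_back(body);              // constructs the std::thread in place
    workers.push_back(std::thread(body));    // fine: moves a temporary
    std::thread t(body);
    workers.push_back(std::move(t));         // fine: explicit move of an lvalue
    // workers.push_back(t);                 // error: the copy constructor is deleted
    // workers.push_back(body);              // error: std::thread's constructor is explicit

    for (std::thread& w : workers)
        w.join();
    return counter.load();
}
```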

4. Mutual exclusion

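To give threads mutually exclusive access to the shared stop and tasks, the pool uses std::mutex queue_mutex. Any access to these shared resources must acquire the lock first.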

{
        std::unique_lock<std::mutex> lock(this->queue_mutex);
        this->condition.wait(lock,
            [this]{ return this->stop || !this->tasks.empty(); });
        if(this->stop && this->tasks.empty())
            return;
        task = std::move(this->tasks.front());
        this->tasks.pop();
}

{
	std::unique_lock<std::mutex> lock(queue_mutex);
	// don't allow enqueueing after stopping the pool
	if(stop)
		throw std::runtime_error("enqueue on stopped ThreadPool");
	tasks.emplace([task](){ (*task)(); });
}

5. Destroying the thread pool

When the program exits and the thread pool is destroyed, every thread must be told to stop running.

inline ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        stop = true;
    }
    condition.notify_all();
    for(std::thread &worker: workers)
        worker.join();
}

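As shown above, the destructor sets the stop flag to true and calls notify_all on the condition variable to wake every blocked thread. Earlier, the condition was described only as whether the task queue is empty, but the stop variable is part of the condition too. When the task queue is empty and stop is true, the thread routine executes the return statement and the thread exits.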

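The implementation also shows a key point: even after the threads have been asked to stop, any tasks still in the queue must be executed before the threads exit. In other words, this is a thread pool that does not lose tasks. Some other implementations let threads discard pending tasks and exit immediately once they see the stop condition.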
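
The two shutdown policies differ only in the exit test inside the worker loop. The sketch below is a hypothetical MiniPool written for illustration, not the library's class; the drainOnStop flag and the countExecuted helper are assumptions.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical mini pool with a selectable shutdown policy.
class MiniPool {
public:
    MiniPool(std::size_t threads, bool drainOnStop) : drain_(drainOnStop) {
        for (std::size_t i = 0; i < threads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    void post(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& w : workers_)
            w.join();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                // Draining: leave only when stopped AND the queue is empty.
                // Discarding: leave as soon as stop is observed.
                if (stop_ && (!drain_ || jobs_.empty()))
                    return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    bool drain_;
    bool stop_ = false;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
};

// Posts `jobs` increments, destroys the pool, and reports how many ran.
int countExecuted(bool drainOnStop, int jobs) {
    std::atomic<int> done{0};
    {
        MiniPool pool(2, drainOnStop);
        for (int i = 0; i < jobs; ++i)
            pool.post([&done] { ++done; });
    }   // the destructor runs here
    return done.load();
}
```

With drainOnStop set to true every posted job runs before the destructor returns; with false, the number that ran depends on timing.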

Reimplementation and application

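Now that I understood the techniques needed to implement a thread pool like the one above, I tried writing one by hand to reinforce it, using global variables and a procedural style.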

#include <cassert>
#include <atomic>
#include <mutex>
#include <future>
#include <vector>
#include <thread>
#include <queue>
#include <deque>
#include <random>
#include <ctime>
#include <algorithm>
#include <iostream>
#include <type_traits>
#include <functional>
#include <condition_variable>
#include <chrono>
#include <memory>
#include <stdexcept>
#include <string>

std::mutex mtx_;
std::vector<std::thread> workers_;
std::condition_variable condition_;
std::atomic<bool> stop_(false);
std::queue<std::function<void()>> taskQueue_;

template<class F, class... Args>
auto enqueue(F&& f, Args&&... args)
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;
    auto task = std::make_shared<std::packaged_task<return_type()>> (
        std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );
    //task <==> std::function<return_type(void)>;
    std::future<return_type> res = task->get_future();

    {
        std::unique_lock<std::mutex> guard(mtx_);
        if (stop_)
            throw std::runtime_error("enqueue after threadpool exited.");
        taskQueue_.emplace([task]{(*task)();});
    }

    condition_.notify_one();
    return res;
}

void start(size_t threadCnt)
{
    for (size_t i = 0; i < threadCnt; ++i) {
        workers_.emplace_back([](){
            for (;;) {
                std::function<void()> task;

                {
                    std::unique_lock<std::mutex> guard(mtx_);
                    condition_.wait(guard, [](){
                        return stop_ || not taskQueue_.empty();
                    });

                    if (stop_ && taskQueue_.empty()) {
                        std::cout <<"thread exited!\n";
                        return;
                    }
                    task = std::move(taskQueue_.front()); // move instead of copying the std::function
                    taskQueue_.pop();
                }
                task();
            }
        });
    }
}

void stop()
{
    {
        std::unique_lock<std::mutex> guard(mtx_);
        stop_ = true;
    }
    condition_.notify_all();
    for (auto& work : workers_)
        work.join();
}

/******************************************test with large numbers*************************************/

void mergeSort(std::vector<int>& a, std::vector<int>& b, size_t begin, size_t split, size_t end) {
    size_t i = begin;
    size_t j = split;
    size_t k = begin;

    while (i < split && j < end) {
        if (a[i] < a[j]) {
            b[k++] = a[i++];
        } else {
            b[k++] = a[j++];
        }
    }

    while (i < split) b[k++] = a[i++];
    while (j < end) b[k++] = a[j++];

    for (k = begin; k < end; ++k) {
        a[k] = b[k];
    }
}

void mergeSortRange(std::vector<int>& numbers, std::vector<int>& tempBuffer, size_t begin, size_t end) {
  if (end - begin < 2) {
      return;
  }

  size_t split = begin + (end - begin) / 2;
  mergeSortRange(numbers, tempBuffer, begin, split);
  mergeSortRange(numbers, tempBuffer, split, end);
  mergeSort(numbers, tempBuffer, begin, split, end);
}


int getRandomInt(int min, int max) {
    // Seed the engine once; constructing std::random_device on every call is slow.
    static std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<> distrib(min, max);

    return distrib(gen);
}

std::shared_ptr<std::vector<int>> generateRamdomVec() {
    int minRange = -1000000;
    int maxRange = 1000000;
    const int totalSize = 5000000;
    std::shared_ptr<std::vector<int>> randomVec = std::make_shared<std::vector<int>>();
    for (int i = 0; i < totalSize; i++)
        randomVec->emplace_back(getRandomInt(minRange, maxRange));
    // std::cout << "size: " << randomVec->size();
    return randomVec;
}

void getCurTimeStr(std::string head) {
    auto currentTimePoint = std::chrono::high_resolution_clock::now();
    auto currentTimeNs = std::chrono::time_point_cast<std::chrono::milliseconds>(currentTimePoint);
    auto duration = currentTimeNs.time_since_epoch();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(duration).count();
    std::cout << head << ": " << milliseconds << "\n";
}

void sortTestWithoutThreadPool()
{
    auto randomVecPtr = generateRamdomVec();
    int totalSize = randomVecPtr->size();
    std::cout << "totalSize: " << totalSize << "\n";

    getCurTimeStr("start sort[1]");
    std::sort(randomVecPtr->begin(), randomVecPtr->end());
    getCurTimeStr("stop sort[1]");
}

void sortTestWithThreadPool()
{
    auto randomVecPtr = generateRamdomVec();
    int totalSize = randomVecPtr->size();
    std::cout << "totalSize: " << totalSize << "\n";
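    unsigned int threadCounter = std::thread::hardware_concurrency();
    if (threadCounter == 0)
        threadCounter = 2; // hardware_concurrency() may return 0 when the count is unknown
    start(threadCounter);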

    const int chunkSize = 1000000;

    getCurTimeStr("start sort[1]");
    int i = 0;
    while (i*chunkSize < totalSize) {
        enqueue([](auto p1, auto p2){ std::sort(p1, p2);}, randomVecPtr->begin() + i*chunkSize, randomVecPtr->begin() + (i+1)*chunkSize + std::min(0, totalSize - (i+1)*chunkSize));
        i++;
    }
    stop(); // blocks until all tasks have finished
    getCurTimeStr("stop sort[1]");

    std::shared_ptr<std::vector<int>> tempBufferPtr = std::make_shared<std::vector<int>>(totalSize);

    getCurTimeStr("start sort[2]");
    mergeSortRange(*randomVecPtr.get(), *tempBufferPtr.get(), 0, totalSize);
    getCurTimeStr("stop sort[2]");
}

int main(int argc, char**argv)
{
    assert(argc == 2);
    if (std::string(argv[1]) == "with")
    {
        std::cout << "run with thread pool!\n";
        sortTestWithThreadPool();
    }
    else
    {
        std::cout << "run in master thread!\n";
        sortTestWithoutThreadPool();
    }
    return 0;
}

The code above sorts 5 million ints.
1. Without the thread pool, the output is:

run in master thread!
totalSize: 5000000
start sort[1]: 1709130309826
stop sort[1]: 1709130311534

Sorting 5 million ints took 1708 ms.

2. With the thread pool, the output is:

run with thread pool!
totalSize: 5000000
start sort[1]: 1709130832624
thread exited!
thread exited!
thread exited!
thread exited!
thread exited!
thread exited!
stop sort[1]: 1709130832951
start sort[2]: 1709130832965
stop sort[2]: 1709130834180

Sorting the sub-arrays (size = 1 million each) of the 5 million ints took 327 ms.
The merge sort that then merges them took 1215 ms.
Total: 1542 ms

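This model is a poor fit, because the final cost is dominated by the merge sort. Once the threads have sorted the sub-arrays, merging k sorted chunks of n elements in total needs about O(n log k) comparisons, yet mergeSortRange as written recursively re-sorts the entire array from single elements and does not exploit the chunks already being sorted.
Either way, the merge still has to scan the entire array. A thread pool is best applied to a model whose tasks are independent of one another, which is true parallelism.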
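
One way to reduce the merge cost, assuming the chunks are consecutive and already sorted, is to merge them pairwise with std::inplace_merge rather than re-running a full merge sort. This is a sketch of an alternative, not what the benchmark above measured; mergeSortedChunks is a hypothetical name, and no timing is claimed for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: v consists of consecutive sorted chunks of chunkSize
// elements (the last chunk may be shorter). Neighbouring chunks are merged
// pairwise, doubling the sorted width on every pass.
void mergeSortedChunks(std::vector<int>& v, std::size_t chunkSize) {
    if (chunkSize == 0)
        return;
    const std::size_t n = v.size();
    for (std::size_t width = chunkSize; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t mid = lo + width;
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(v.begin() + static_cast<std::ptrdiff_t>(lo),
                               v.begin() + static_cast<std::ptrdiff_t>(mid),
                               v.begin() + static_cast<std::ptrdiff_t>(hi));
        }
    }
}
```

Each pass still scans the whole array, so the point made above stands: the merge phase is sequential work that the pool does not speed up in this design.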