A C++ Multithreaded Multi-way Merge Sort Function Template
Design
Approach
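- Each thread sorts the chunk of data assigned to it with std::sort.
- After sorting its own chunk, the main thread waits (std::future, std::promise) for all the other threads to finish.
- The main thread then performs a multi-way merge (this has to be written by hand; the standard library only offers two-way merges).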
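The wait step in the list above can be sketched on its own as follows. This is only an illustration of the std::promise / std::future handshake, not the article's sort: the helper name run_and_wait and the trivial "work" each thread does are assumptions made for the example.

```cpp
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: starts `workers` threads, lets each one write its own
// slot, and blocks until every thread has signalled completion.
std::vector<int> run_and_wait(std::size_t workers)
{
    std::vector<int> results(workers, -1);

    // One promise/future pair per worker thread.
    std::vector<std::promise<void>> done(workers);
    std::vector<std::future<void>> ready;
    ready.reserve(workers);
    for (auto& p : done)
        ready.push_back(p.get_future());

    std::vector<std::thread> threads;
    threads.reserve(workers);
    for (std::size_t i = 0; i < workers; ++i) {
        threads.emplace_back([&results, &done, i] {
            results[i] = static_cast<int>(i); // stand-in for sorting one chunk
            done[i].set_value();              // signal "my chunk is finished"
        });
    }

    for (auto& f : ready)
        f.wait();   // blocks until the matching promise has been fulfilled
    for (auto& t : threads)
        t.join();   // make sure every thread has exited before cleanup
    return results;
}
```

Calling run_and_wait(3) should return a vector whose element i was written by thread i. Each thread touches only its own slot, so no extra locking is needed in this sketch.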
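Implementation notes
- This implementation makes heavy use of C++11 features.
- Some implementations of C's qsort use a static variable internally, which makes them not thread-safe. A known workaround is to call qsort once in the main thread before starting any other threads; search online for the details. std::sort, which is used here, does not have this problem.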
Function parameters
- A pointer to the first element of the data
- The length of the data
- The number of new threads to create
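Applicability
- Only integer types are supported (enforced by a static_assert); readers can extend this themselves.
- The data must support random access (it is passed as a pointer to a contiguous array).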
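Chunk size
Let the data length (data_len) be D, the number of new threads (thread_num) be T, the size of one chunk (chunk_size) be C, and the size of the last chunk (last_chunk_size) be LC.
Intuitively, if the total number of threads, T+1 (the new threads plus the main thread), divides the data length evenly, then the quotient is the chunk size.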
LC = C = D/(T+1)
If it does not divide evenly (with / denoting integer division):
C = D/(T+1)+1
=> (C-1) = D/(T+1)
=> (C-1)*(T+1) + K = D, K ∈ (0,T+1)
LC = D - C*T
= (C-1)*(T+1) + K - C*T
= C-(T+1) + K
∈ (C-(T+1), C)
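Because T is very small, the last chunk is smaller than the others by at most T elements, so this implementation simply lets the main thread handle the last chunk.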
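The formulas above can be checked with a small helper. This is a sketch under assumptions: the names ChunkLayout and compute_chunks are made up for the example, and it assumes D >= (T+1)*(T+1), the same condition the implementation tests before going multithreaded, which keeps LC positive.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type holding C and LC from the derivation above.
struct ChunkLayout {
    std::size_t chunk_size;      // C
    std::size_t last_chunk_size; // LC
};

// Computes C and LC for data length D (data_len) and T new threads (thread_num).
ChunkLayout compute_chunks(std::size_t data_len, std::size_t thread_num)
{
    const std::size_t total = thread_num + 1;  // T+1 threads including main
    assert(data_len >= total * total);         // assumed precondition

    std::size_t chunk_size = data_len / total; // integer division
    if (data_len % total != 0)
        ++chunk_size;                          // round up when not divisible

    // The T new threads take C elements each; the main thread takes the rest.
    return {chunk_size, data_len - chunk_size * thread_num};
}
```

For example, with D = 10 and T = 2 the division leaves a remainder K = 1, so C = 4 and LC = 10 - 4*2 = 2, which matches LC = C - (T+1) + K = 4 - 3 + 1.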
Implementation
// multi_threaded_sort.h
#ifndef MULTI_THREADED_SORT_H
#define MULTI_THREADED_SORT_H

#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <type_traits>
#include <vector>

template <typename IntegerType>
void multi_threaded_sort(IntegerType* in_data, std::size_t data_len, std::size_t thread_num)
{
    // compile-time type check
    static_assert(std::is_integral<IntegerType>::value,
                  "Data type must be integer");

    // With this little data (or no new threads) multithreading makes no sense,
    // so call std::sort directly. The (T+1)^2 bound also keeps the last chunk
    // non-empty.
    if (data_len <= 1 || thread_num == 0
        || data_len < (thread_num + 1) * (thread_num + 1))
    {
        std::sort(in_data, in_data + data_len);
        return;
    }

    // each new thread sorts one chunk; the main thread sorts the last chunk
    std::size_t chunk_size = data_len / (thread_num + 1);
    if (data_len % (thread_num + 1) != 0)
        ++chunk_size;

    // for thread synchronization
    std::vector<std::promise<void>> sort_promise(thread_num);
    std::vector<std::future<void>> sort_future(thread_num);
    for (std::size_t i = 0; i < thread_num; ++i)
        sort_future[i] = sort_promise[i].get_future();

    // create the threads
    std::vector<std::thread> threads;
    threads.reserve(thread_num);
    for (std::size_t i = 0; i < thread_num; ++i) {
        threads.emplace_back([&sort_promise, in_data, chunk_size, i] {
            std::sort(in_data + i * chunk_size, in_data + (i + 1) * chunk_size);
            sort_promise[i].set_value();
        });
    }

    // sort the last chunk in the main thread
    std::sort(in_data + chunk_size * thread_num, in_data + data_len);

    // before blocking, do the work that does not depend on the sorted data
    std::vector<IntegerType> out_data(data_len);
    std::vector<std::size_t> index(thread_num + 1);     // read position in each chunk
    std::vector<std::size_t> chunk_end(thread_num + 1); // one past the end of each chunk
    for (std::size_t i = 0; i <= thread_num; ++i) {
        index[i] = i * chunk_size;
        chunk_end[i] = (i == thread_num) ? data_len : (i + 1) * chunk_size;
    }

    // wait for all threads, then join them so they have exited before the
    // promises are destroyed
    for (std::size_t i = 0; i < thread_num; ++i)
        sort_future[i].wait();
    for (auto& th : threads)
        th.join();

    // multi-way merge: repeatedly take the smallest head element of all chunks
    for (std::size_t i = 0; i < data_len; ++i)
    {
        std::size_t min_index = thread_num + 1; // sentinel: no chunk chosen yet
        for (std::size_t j = 0; j <= thread_num; ++j)
        {
            if (index[j] < chunk_end[j]
                && (min_index > thread_num
                    || in_data[index[j]] < in_data[index[min_index]]))
            {
                min_index = j;
            }
        }
        out_data[i] = in_data[index[min_index]];
        ++index[min_index];
    }
    std::copy(out_data.begin(), out_data.end(), in_data);
}

#endif // MULTI_THREADED_SORT_H
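The merge loop in the header looks at the head of every chunk for each output element, which costs about D*(T+1) comparisons. That is fine while T is tiny. For larger chunk counts, a common alternative is a min-heap of chunk heads, sketched below. This is not the article's method: the function name heap_merge and the bounds vector (k+1 offsets delimiting k sorted runs) are assumptions made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical alternative merge: `data` holds k sorted runs back to back,
// and run j occupies positions bounds[j] .. bounds[j+1]-1.
template <typename T>
std::vector<T> heap_merge(const std::vector<T>& data,
                          const std::vector<std::size_t>& bounds)
{
    std::vector<T> out;
    if (bounds.size() < 2)
        return out;                            // no runs to merge
    const std::size_t runs = bounds.size() - 1;

    using Entry = std::pair<T, std::size_t>;   // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // pos[j] is the next unread position of run j.
    std::vector<std::size_t> pos(bounds.begin(), bounds.begin() + runs);
    for (std::size_t j = 0; j < runs; ++j)
        if (pos[j] < bounds[j + 1])
            heap.push(Entry(data[pos[j]], j)); // seed with each run's head

    out.reserve(data.size());
    while (!heap.empty()) {
        const Entry top = heap.top();          // smallest head of all runs
        heap.pop();
        out.push_back(top.first);
        const std::size_t j = top.second;
        if (++pos[j] < bounds[j + 1])
            heap.push(Entry(data[pos[j]], j)); // refill from the same run
    }
    return out;
}
```

Each element is pushed and popped once, so the merge takes on the order of D*log(k) comparisons instead of D*k. With only a handful of chunks the linear scan used in the header may well be just as fast, so this is worth measuring before switching.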
Testing
Test code
// main.cpp
#include "multi_threaded_sort.h"
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <ratio>
#include <vector>
using namespace std;

int main() {
    random_device rd;
    default_random_engine rng{rd()};
    {
        cout << "this example check the correctness:";
        short data[] = {2, 5, 5, 3, 93, 43, 3, 0, -3, 43};
        const size_t N = sizeof(data) / sizeof(short);
        cout << "\ninput " << N << " data: ";
        for (size_t i = 0; i < N; ++i) cout << data[i] << ' ';
        multi_threaded_sort(data, N, 1);
        cout << "\n\tsort with 2 threads: ";
        for (size_t i = 0; i < N; ++i) cout << data[i] << ' ';
        shuffle(data, data + N, rng); // std::random_shuffle was removed in C++17
        cout << "\nafter shuffle: ";
        for (size_t i = 0; i < N; ++i) cout << data[i] << ' ';
        multi_threaded_sort(data, N, 2);
        cout << "\n\tsort with 3 threads: ";
        for (size_t i = 0; i < N; ++i) cout << data[i] << ' ';
    }
    // -----------------------------------------------------
    {
        const size_t N = 654321; // numbers to generate
        const size_t T = 6;      // runs: std::sort, then 1 to T-1 new threads
        cout << "\n\nthis example check the efficiency:\n"
             << "randomly generate " << N
             << " natural number and sort them...\n";
        // T copies of N shorts are too large for the stack, so use the heap
        vector<vector<short>> random_data(T, vector<short>(N));
        uniform_int_distribution<short> dis; // default range: [0, max]
        for (size_t i = 0; i < N; ++i)
            random_data[0][i] = dis(rng);
        // every run gets its own unsorted copy of the same input
        for (size_t i = 1; i < T; ++i)
            random_data[i] = random_data[0];
        chrono::duration<double, std::milli> elapsed[T];
        for (size_t i = 0; i < T; ++i) {
            auto start_t = chrono::high_resolution_clock::now();
            if (i)
                multi_threaded_sort(random_data[i].data(), N, i); // i new threads
            else
                sort(random_data[0].begin(), random_data[0].end());
            auto end_t = chrono::high_resolution_clock::now();
            elapsed[i] = end_t - start_t;
        }
        cout << "Use std::sort() cost: " << elapsed[0].count() << " ms\n";
        for (size_t i = 1; i < T; ++i)
            cout << "Add " << i << " threads cost: " << elapsed[i].count() << " ms\n";
    }
    return 0;
}
Sample output (timings vary by machine and run)
this example check the correctness:
input 10 data: 2 5 5 3 93 43 3 0 -3 43
sort with 2 threads: -3 0 2 3 3 5 5 43 43 93
after shuffle: 5 -3 3 5 43 43 3 0 2 93
sort with 3 threads: -3 0 2 3 3 5 5 43 43 93
this example check the efficiency:
randomly generate 654321 natural number and sort them...
Use std::sort() cost: 36.1254 ms
Add 1 threads cost: 25.5484 ms
Add 2 threads cost: 9.9246 ms
Add 3 threads cost: 10.2634 ms
Add 4 threads cost: 11.2967 ms
Add 5 threads cost: 14.7214 ms
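Afterword
- Creating and destroying threads takes time, so for small inputs multithreading is not worth it and can even be slower.
- The number of CPU cores is limited, so only a few threads can actually run in parallel and creating more than that does not help. Once the thread count passes some (usually small) number, sorting gradually gets slower as more threads are added.
Readers can check both points by changing N and T in the test code.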
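One way to avoid guessing the thread count is to ask the standard library how many hardware threads are available. The sketch below is a suggestion rather than part of the original code: the helper name suggest_new_threads is made up, and the (T+1)*(T+1) bound is borrowed from the guard at the top of multi_threaded_sort.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper: suggests how many *new* threads to request for
// `data_len` elements, leaving one hardware thread for the main thread.
std::size_t suggest_new_threads(std::size_t data_len)
{
    // hardware_concurrency() is only a hint and may return 0 if unknown.
    const unsigned hw = std::thread::hardware_concurrency();
    if (hw <= 1)
        return 0;                        // fall back to a plain std::sort

    std::size_t t = hw - 1;              // the main thread sorts a chunk too
    // Shrink T until data_len >= (T+1)^2, the bound the sort itself checks.
    while (t > 0 && data_len < (t + 1) * (t + 1))
        --t;
    return t;
}
```

The result could then be passed as the third argument, for example multi_threaded_sort(data, N, suggest_new_threads(N)). Whether the full hardware thread count is actually the fastest choice still depends on the machine, so measuring with the test program remains the safer guide.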
The author's knowledge is limited; if you spot any mistakes, please point them out.