The OP's question caught my attention because speed-up of number crunching by multi-threading is one of the top items on my personal to-do list.

I have to admit that my experience with the Eigen library is very limited. (I once used the decomposition of 3×3 rotation matrices into Euler angles, which is solved very elegantly in a generic way in the Eigen library.)

Hence, I defined another sample task consisting of a stupid counting of values in a sample data set.

This is done multiple times (using one identical evaluation function):

>single-threaded (to get a value for comparison)
>starting each sub-task in an extra thread (in an admittedly rather stupid way)
>starting threads with interleaved access to the sample data
>starting threads with partitioned access to the sample data.
test-multi-threading.cc:
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <thread>
#include <utility>
#include <vector>
// a sample function to process a certain amount of data
template <typename T>
size_t countFrequency(
size_t n, const T data[], const T &begin, const T &end)
{
size_t result = 0;
for (size_t i = 0; i < n; ++i) result += data[i] >= begin && data[i] < end;
return result;
}
typedef std::uint16_t Value;
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds MuSecs;
typedef decltype(std::chrono::duration_cast<MuSecs>(Clock::now() - Clock::now())) Time;
Time duration(const Clock::time_point &t0)
{
return std::chrono::duration_cast<MuSecs>(Clock::now() - t0);
}
std::vector<Time> makeTest()
{
const Value SizeGroup = 4, NGroups = 10000, N = SizeGroup * NGroups;
const size_t NThreads = std::thread::hardware_concurrency();
// make a test sample
std::vector<Value> sample(N);
for (Value &value : sample) value = (Value)rand();
// prepare result vectors
std::vector<size_t> results4[4] = {
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0)
};
// make test
std::vector<Time> times{
[&]() { // single threading
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[0];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment single-threaded
for (size_t i = 0; i < NGroups; ++i) {
results[i] = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - stupid approach
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[1];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value i = 0; i < NGroups;) {
size_t nT = 0;
for (; nT < NThreads && i < NGroups; ++nT, ++i) {
threads[nT] = std::move(std::thread(
[i, &results, &data, SizeGroup]() {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}));
}
for (size_t iT = 0; iT < nT; ++iT) threads[iT].join();
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - interleaved
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[2];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::move(std::thread(
[iT, &results, &data, NGroups, SizeGroup, NThreads]() {
for (Value i = iT; i < NGroups; i += NThreads) {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}
}));
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}(),
[&]() { // multi-threading - grouped
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[3];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::move(std::thread(
[iT, &results, &data, NGroups, SizeGroup, NThreads]() {
for (Value i = iT * NGroups / NThreads,
iN = (iT + 1) * NGroups / NThreads; i < iN; ++i) {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}
}));
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}()
};
// check results (must be equal for any kind of computation)
const unsigned nResults = sizeof results4 / sizeof *results4;
for (unsigned i = 1; i < nResults; ++i) {
size_t nErrors = 0;
for (Value j = 0; j < NGroups; ++j) {
if (results4[0][j] != results4[i][j]) {
++nErrors;
#ifdef _DEBUG
std::cerr
<< "results4[0][" << j << "]: " << results4[0][j]
<< " != results4[" << i << "][" << j << "]: " << results4[i][j]
<< "!\n";
#endif // _DEBUG
}
}
if (nErrors) std::cerr << nErrors << " errors in results4[" << i << "]!\n";
}
// done
return times;
}
int main()
{
std::cout << "std::thread::hardware_concurrency(): "
<< std::thread::hardware_concurrency() << '\n';
// heat up
std::cout << "Heat up...\n";
for (unsigned i = 0; i < 3; ++i) makeTest();
// repeat NTrials times
const unsigned NTrials = 10;
std::cout << "Measuring " << NTrials << " runs...\n"
<< " "
<< " | " << std::setw(10) << "Single"
<< " | " << std::setw(10) << "Multi 1"
<< " | " << std::setw(10) << "Multi 2"
<< " | " << std::setw(10) << "Multi 3"
<< '\n';
std::vector<double> sumTimes;
for (unsigned i = 0; i < NTrials; ++i) {
std::vector times = makeTest();
std::cout << std::setw(2) << (i + 1) << ".";
for (const Time &time : times) {
std::cout << " | " << std::setw(10) << time.count();
}
std::cout << '\n';
sumTimes.resize(times.size(), 0.0);
for (size_t j = 0; j < times.size(); ++j) sumTimes[j] += times[j].count();
}
std::cout << "Average Values:\n";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(1)
<< sumTime / NTrials;
}
std::cout << '\n';
std::cout << "Ratio:\n";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(3)
<< sumTime / sumTimes.front();
}
std::cout << '\n';
// done
return 0;
}
Compiled and tested in cygwin64 on Windows 10:
$g++ --version
g++ (GCC) 7.3.0
$g++ -std=c++11 -O2 -o test-multi-threading test-multi-threading.cc
$./test-multi-threading
std::thread::hardware_concurrency(): 8
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 384008 | 1052937 | 130662 | 138411
2. | 386500 | 1103281 | 133030 | 132576
3. | 382968 | 1078988 | 137123 | 137780
4. | 395158 | 1120752 | 138731 | 138650
5. | 385870 | 1105885 | 144825 | 129405
6. | 366724 | 1071788 | 137684 | 130289
7. | 352204 | 1104191 | 133675 | 130505
8. | 331679 | 1072299 | 135476 | 138257
9. | 373416 | 1053881 | 138467 | 137613
10. | 370872 | 1096424 | 136810 | 147960
Average Values:
| 372939.9 | 1086042.6 | 136648.3 | 136144.6
Ratio:
| 1.000 | 2.912 | 0.366 | 0.365
I did the same on coliru.com. (I had to reduce the heat-up cycles and the sample size, as I exceeded the time limit with the original values.):
g++ (GCC) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
std::thread::hardware_concurrency(): 4
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 224684 | 297729 | 48334 | 39016
2. | 146232 | 337222 | 66308 | 59994
3. | 195750 | 344056 | 61383 | 63172
4. | 198629 | 317719 | 62695 | 50413
5. | 149125 | 356471 | 61447 | 57487
6. | 155355 | 322185 | 50254 | 35214
7. | 140269 | 316224 | 61482 | 53889
8. | 154454 | 334814 | 58382 | 53796
9. | 177426 | 340723 | 62195 | 54352
10. | 151951 | 331772 | 61802 | 46727
Average Values:
| 169387.5 | 329891.5 | 59428.2 | 51406.0
Ratio:
| 1.000 | 1.948 | 0.351 | 0.303
I wondered why the ratios on coliru (with only 4 threads) were even better than on my PC (with 8 threads). Actually, I don't know how to explain this. However, there are a lot of other differences between the two set-ups, which may or may not be responsible. At least, both measurements show a rough speed-up of 3 for the 3rd and 4th approach, whereas the 2nd uniquely consumes every potential speed-up (probably by starting and joining all those threads).
Looking at the sample code, you will notice that there is no mutex or any other explicit locking. This is intentional. As I learned (many, many years ago), every attempt at parallelization may cause a certain extra amount of communication overhead (for concurrent tasks which have to exchange data). If the communication overhead becomes too big, it simply consumes the speed advantage of concurrency. So, the best speed-up can be achieved by:
>the least communication overhead, i.e. concurrent tasks operating on independent data
>the least effort for the post-merging of the concurrently computed results.
In my sample code, I
>prepared every data item and storage before starting the threads,
>never changed shared data which is read while the threads are running,
>wrote data in a thread-local manner (no two threads write to the same data address),
>evaluated the computed results only after having joined all threads.
Concerning 3., I struggled a bit with whether this is legal, i.e. whether data written in a thread is granted to appear correctly in the main thread after joining. (The fact that something seems to work fine is illusive in general, but especially illusive concerning multi-threading.)
cppreference.com provides the following explanations:
The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
The completion of the thread identified by *this synchronizes with the corresponding successful return from join().
On Stack Overflow, I found a related Q/A which convinced me that it's OK.
However, the downside is:
>the creation and joining of threads causes additional effort (which is not that cheap).
An alternative approach might be the usage of a thread pool to overcome this. I googled a bit and found e.g. Jakob Progsch's ThreadPool on github. However, I guess that with a thread pool, the locking issue would be back "in the game".
Whether this will work for Eigen functions as well depends on how the Eigen functions are implemented. If they access global variables (which become shared when the same function is called concurrently), this will cause a data race.
Googling a bit, I found the following doc.:
In the case your own application is multithreaded, and multiple threads make calls to Eigen, then you have to initialize Eigen by calling the following routine before creating the threads:
Eigen::initParallel();
Note
With Eigen 3.3, and a fully C++11 compliant compiler (i.e., thread-safe static local variable initialization), then calling initParallel() is optional.
Warning
note that all functions generating random matrices are not re-entrant nor thread-safe. Those include DenseBase::Random(), and DenseBase::setRandom() despite a call to Eigen::initParallel(). This is because these functions are based on std::rand which is not re-entrant. For a thread-safe random generator, we recommend the use of boost::random or the C++11 random feature.