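How to: Write a parallel_for Loop
This example shows how to use Concurrency::parallel_for to compute the product of two matrices.
The following example shows the matrix_multiply function, which computes the product of two square matrices.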
// Computes the product of two square matrices.
void matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            // k is a size_t so that it is compared against size without
            // a signed/unsigned mismatch.
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}
The following example shows the parallel_matrix_multiply function, which uses the parallel_for algorithm to run the outer loop in parallel.
// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    // Each value of i (one row of the result) may run on a different thread.
    parallel_for (size_t(0), size, [&](size_t i)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    });
}
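This example parallelizes only the outer loop because each iteration of that loop performs enough work to outweigh the overhead of parallel processing. Parallelizing the inner loop as well would not improve performance, because the small amount of work done by each inner iteration cannot offset that overhead. On most systems, parallelizing only the outer loop is therefore the best way to get the most benefit from concurrency.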
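To make "parallelizing only the outer loop" concrete, here is a minimal portable sketch that splits the rows of the result into contiguous blocks and hands each block to a std::thread. It swaps in std::thread and std::vector for PPL and double**, so it is not how parallel_for is implemented (the runtime schedules and balances work itself). The names Matrix, multiply_serial, and multiply_rows_in_threads are hypothetical and introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical matrix type for this sketch (the article uses double**).
using Matrix = std::vector<std::vector<double>>;

// Serial reference: the same triple loop as matrix_multiply.
Matrix multiply_serial(const Matrix& a, const Matrix& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double temp = 0;
            for (std::size_t k = 0; k < n; ++k)
                temp += a[i][k] * b[k][j];
            c[i][j] = temp;
        }
    return c;
}

// Splits only the outer (row) loop across threads. Each thread writes to
// its own rows of c, so no locking is needed.
Matrix multiply_rows_in_threads(const Matrix& a, const Matrix& b,
                                unsigned thread_count)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    thread_count = std::max(1u, thread_count);
    // Ceiling division: every row is assigned to exactly one block.
    const std::size_t rows_per_thread = (n + thread_count - 1) / thread_count;

    std::vector<std::thread> workers;
    for (std::size_t first = 0; first < n; first += rows_per_thread)
    {
        const std::size_t last = std::min(n, first + rows_per_thread);
        workers.emplace_back([&a, &b, &c, n, first, last] {
            for (std::size_t i = first; i < last; ++i)
                for (std::size_t j = 0; j < n; ++j)
                {
                    double temp = 0;
                    for (std::size_t k = 0; k < n; ++k)
                        temp += a[i][k] * b[k][j];
                    c[i][j] = temp;
                }
        });
    }
    for (auto& worker : workers)
        worker.join();
    return c;
}
```

Each thread here is created once and then processes many full rows, so the startup cost is spread over a large amount of work. Creating parallel work inside the inner loops instead would pay that cost once per element or per multiplication, which is the overhead the paragraph above warns about.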
The following more complete example compares the performance of the matrix_multiply function with that of the parallel_matrix_multiply function.
// parallel-matrix-multiply.cpp
// compile with: /EHsc
#include <windows.h>
#include <ppl.h>
#include <iostream>
#include <random>

using namespace Concurrency;
using namespace std;

// Calls the provided work function and returns the number of milliseconds
// that it takes to call that function.
template <class Function>
__int64 time_call(Function&& f)
{
    __int64 begin = GetTickCount();
    f();
    return GetTickCount() - begin;
}

// Creates a square matrix with the given number of rows and columns.
double** create_matrix(size_t size);

// Frees the memory that was allocated for the given square matrix.
void destroy_matrix(double** m, size_t size);

// Initializes the given square matrix with values that are generated
// by the given generator function.
template <class Generator>
double** initialize_matrix(double** m, size_t size, Generator& gen);

// Computes the product of two square matrices.
void matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}

// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    parallel_for (size_t(0), size, [&](size_t i)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    });
}

int wmain()
{
    // The number of rows and columns in each matrix.
    // TODO: Change this value to experiment with serial
    // versus parallel performance.
    const size_t size = 750;

    // Create a random number generator.
    mt19937 gen(42);

    // Create and initialize the input matrices and the matrix that
    // holds the result.
    double** m1 = initialize_matrix(create_matrix(size), size, gen);
    double** m2 = initialize_matrix(create_matrix(size), size, gen);
    double** result = create_matrix(size);

    // Print to the console the time it takes to multiply the
    // matrices serially.
    wcout << L"serial: " << time_call([&] {
        matrix_multiply(m1, m2, result, size);
    }) << endl;

    // Print to the console the time it takes to multiply the
    // matrices in parallel.
    wcout << L"parallel: " << time_call([&] {
        parallel_matrix_multiply(m1, m2, result, size);
    }) << endl;

    // Free the memory that was allocated for the matrices.
    destroy_matrix(m1, size);
    destroy_matrix(m2, size);
    destroy_matrix(result, size);
}

// Creates a square matrix with the given number of rows and columns.
double** create_matrix(size_t size)
{
    double** m = new double*[size];
    for (size_t i = 0; i < size; ++i)
    {
        m[i] = new double[size];
    }
    return m;
}

// Frees the memory that was allocated for the given square matrix.
void destroy_matrix(double** m, size_t size)
{
    for (size_t i = 0; i < size; ++i)
    {
        delete[] m[i];
    }
    // The row-pointer array was allocated with new[], so release it with delete[].
    delete[] m;
}

// Initializes the given square matrix with values that are generated
// by the given generator function.
template <class Generator>
double** initialize_matrix(double** m, size_t size, Generator& gen)
{
    for (size_t i = 0; i < size; ++i)
    {
        for (size_t j = 0; j < size; ++j)
        {
            m[i][j] = static_cast<double>(gen());
        }
    }
    return m;
}
The following sample output is for a computer that has four processors.
serial: 3853
parallel: 1311
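The timings above come from GetTickCount, whose resolution is limited by the system timer (typically around 10 to 16 milliseconds), and they correspond to a speedup of roughly 3853 / 1311, or about 2.9 times, on four processors. As a hedged alternative for experiments outside Windows, the following sketch measures elapsed time with std::chrono::steady_clock. The names time_call_ms and speedup are hypothetical helpers for illustration, not part of PPL or of the original sample.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical portable variant of time_call: calls f once and returns the
// elapsed wall-clock time in whole milliseconds, measured with a monotonic
// clock.
template <class Function>
std::int64_t time_call_ms(Function&& f)
{
    const auto begin = std::chrono::steady_clock::now();
    std::forward<Function>(f)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
}

// Ratio of serial time to parallel time; values above 1 mean the parallel
// version was faster.
inline double speedup(double serial_ms, double parallel_ms)
{
    return serial_ms / parallel_ms;
}
```

With these helpers, time_call_ms can take the same lambdas that the sample passes to time_call, and speedup(3853, 1311) reproduces the ratio quoted above. Results will vary with the matrix size, the number of cores, and whatever else the machine is running.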
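As Intel and AMD keep releasing multi-core CPUs, multiple cores on a single chip have become increasingly common. From single core to dual core, from dual core to quad core, and on to eight cores and beyond, we have clearly entered the multi-core era, and programmers now have to think about how to parallelize their software to make full use of the computing power of multi-core CPUs. When we upgrade a project from Visual C++ 6.0 to Visual C++ 2010, we naturally have to meet this challenge. Visual C++ 2010 comes prepared for it with the new Parallel Patterns Library (PPL). PPL abstracts parallel computation at a level above operating system threads, so programmers no longer work directly with error-prone threads. Instead, at this higher level of abstraction, they use tasks to encapsulate pieces of work that can run at the same time, which makes parallel programs easier to understand and develop. With PPL, a serial application upgraded from Visual C++ 6.0 can also be parallelized with little effort so that it takes full advantage of multi-core CPUs.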
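PPL consists mainly of two parts: parallel algorithms and parallel tasks. The parallel algorithms include parallel_for(), parallel_for_each(), and parallel_invoke(), which make it simple to parallelize a time-consuming serial for loop or for_each() call. For example, suppose our original code contains the following time-consuming for loop, which converts an image to grayscale: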