Windows C++ Parallel Programming: C++ AMP (Part 2)

sului

Column: Windows C++ parallel programming. Tags: c++, programming languages

Article link: https://blog.csdn.net/m0_72813396/article/details/141539118

Tile synchronization: tile_static and tile_barrier::wait

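The previous example (see C++ AMP, Part 1) demonstrated tile layout and indexing, but that structure is not very useful by itself. Tiling becomes useful when the tiles are integral to the algorithm and the code uses tile_static variables. Because every thread in a tile can access the tile's tile_static variables, calls to tile_barrier::wait are used to synchronize access to them. Although every thread in a tile can access the tile_static variables, the order in which the threads in a tile execute is not guaranteed. The following example shows how to use tile_static variables and the tile_barrier::wait method to calculate the average value of each tile. These are the keys to understanding the example: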

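rawData is stored in an 8x8 matrix.
The tile size is 2x2. This creates a 4x4 grid of tiles, so the averages can be stored in a 4x4 matrix by using an array object. Only a limited number of types can be captured by reference in an AMP-restricted function, and the array class is one of them.
The matrix size and the sample size are defined with #define statements, because the type parameters of array, array_view, extent, and tiled_index must be constant values. You can also use static const int declarations. As a side benefit, it is trivial to change the sample size and calculate the average over 4x4 tiles.
A tile_static 2x2 array of float values is declared for each tile. Although the declaration sits in the code path of every thread, only one array is created for each tile in the matrix.
One line of code copies the values of each tile into the tile_static array. On each thread, after the value has been copied into the array, execution stops at the call to tile_barrier::wait.
When all threads in a tile have reached the barrier, the average can be calculated. Because the code runs on every thread, an if statement restricts the calculation to a single thread. The average is stored in the averages variable. The barrier is essentially the construct that controls the calculation per tile, much as you might use a for loop.
Because the data in averages is held in an array object, it must be copied back to the host. This example uses the vector conversion operator.
In the complete example, you can change SAMPLESIZE to 4 and the code runs correctly without any other changes.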

#include <iostream>
#include <vector>
#include <amp.h>
using namespace concurrency;

#define SAMPLESIZE 2
#define MATRIXSIZE 8
void SamplingExample() {

    // Create data and array_view for the matrix.
    std::vector<float> rawData;
    for (int i = 0; i < MATRIXSIZE * MATRIXSIZE; i++) {
        rawData.push_back((float)i);
    }
    extent<2> dataExtent(MATRIXSIZE, MATRIXSIZE);
    array_view<float, 2> matrix(dataExtent, rawData);

    // Create the array for the averages.
    // There is one element in the output for each tile in the data.
    std::vector<float> outputData;
    int outputSize = MATRIXSIZE / SAMPLESIZE;
    for (int j = 0; j < outputSize * outputSize; j++) {
        outputData.push_back((float)0);
    }
    extent<2> outputExtent(MATRIXSIZE / SAMPLESIZE, MATRIXSIZE / SAMPLESIZE);
    array<float, 2> averages(outputExtent, outputData.begin(), outputData.end());

    // Use tiles that are SAMPLESIZE x SAMPLESIZE.
    // Find the average of the values in each tile.
    // The only reference-type variable you can pass into the parallel_for_each call
    // is a concurrency::array.
    parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
        [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
    {
        // Copy the values of the tile into a tile-sized array.
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait for the tile-sized array to load before you calculate the average.
        t_idx.barrier.wait();

        // If you remove the if statement, then the calculation executes for every
        // thread in the tile, and makes the same assignment to averages each time.
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
            for (int trow = 0; trow < SAMPLESIZE; trow++) {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                    averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
        }
    });

    // Print out the results.
    // You cannot access the values in averages directly. You must copy them
    // back to a CPU variable.
    outputData = averages;
    for (int row = 0; row < outputSize; row++) {
        for (int col = 0; col < outputSize; col++) {
            std::cout << outputData[row*outputSize + col] << " ";
        }
        std::cout << "\n";
    }
    // Output for SAMPLESIZE = 2 is:
    //  4.5  6.5  8.5 10.5
    // 20.5 22.5 24.5 26.5
    // 36.5 38.5 40.5 42.5
    // 52.5 54.5 56.5 58.5

    // Output for SAMPLESIZE = 4 is:
    // 13.5 17.5
    // 45.5 49.5
}

int main() {
    SamplingExample();
}
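
To check the expected output listed in the comments without a GPU, the same averages can be computed on the CPU. The following is a minimal sketch in standard C++, not part of the original sample: the helper names tileAveragesCpu and makeRawData are hypothetical, and the code assumes a square, row-major matrix whose size is a multiple of the tile size. It uses the same index decomposition that tiled_index exposes, where a global index equals tile * tileSize + local.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of C++ AMP): averages every
// tileSize x tileSize block of a row-major matrixSize x matrixSize matrix.
std::vector<float> tileAveragesCpu(const std::vector<float>& data,
                                   int matrixSize, int tileSize) {
    // Assumption: the matrix is square and divides evenly into tiles.
    assert(tileSize > 0 && matrixSize % tileSize == 0);
    assert(data.size() == static_cast<std::size_t>(matrixSize) * matrixSize);

    const int tilesPerSide = matrixSize / tileSize;
    std::vector<float> averages(
        static_cast<std::size_t>(tilesPerSide) * tilesPerSide, 0.0f);

    for (int row = 0; row < matrixSize; ++row) {
        for (int col = 0; col < matrixSize; ++col) {
            // Same split as tiled_index: global = tile * tileSize + local.
            const int tileRow = row / tileSize;
            const int tileCol = col / tileSize;
            averages[tileRow * tilesPerSide + tileCol] +=
                data[row * matrixSize + col];
        }
    }
    // Turn each per-tile sum into an average.
    for (float& value : averages) {
        value /= static_cast<float>(tileSize * tileSize);
    }
    return averages;
}

// Builds the same input as rawData in the sample: 0, 1, 2, ...
std::vector<float> makeRawData(int matrixSize) {
    std::vector<float> data(static_cast<std::size_t>(matrixSize) * matrixSize);
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = static_cast<float>(i);
    }
    return data;
}
```

Calling tileAveragesCpu(makeRawData(8), 8, 2) yields the 16 values shown for SAMPLESIZE = 2, and passing 4 as the tile size yields the four values shown for SAMPLESIZE = 4.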

Race conditions

It might be tempting to create a tile_static variable named total and add to it from every thread, like this:

// Do not do this.
tile_static float total;
total += matrix[t_idx];
t_idx.barrier.wait();

averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);

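The first problem with this approach is that tile_static variables cannot have initializers, so total starts with an undefined value. The second problem is that the update of total is a race condition, because all threads in the tile can access the variable in no particular order. You could write the algorithm so that only one thread accesses total at each barrier, as shown next. However, this solution does not scale: it needs one if block and one barrier for every thread in the tile.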

// Do not do this.
tile_static float total;
if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
    total = matrix[t_idx];
}
t_idx.barrier.wait();

if (t_idx.local[0] == 0 && t_idx.local[1] == 1) {
    total += matrix[t_idx];
}
t_idx.barrier.wait();

// etc.
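
A pattern commonly used instead of one barrier per thread is a tree reduction: at each step, half of the remaining threads each add one pair of values, so a tile of N values needs about log2(N) barriers rather than N. The following is a minimal CPU-side sketch of that idea in standard C++. It is not C++ AMP code and is not taken from the original sample: the names tileTreeSum and treeSteps are hypothetical, the inner loop stands in for the threads of one tile, and each pass of the outer loop stands for the work done between two tile_barrier::wait calls. It assumes the tile holds a power-of-two number of values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums one tile's values with a tree reduction.
// The by-value parameter plays the role of the tile_static array.
float tileTreeSum(std::vector<float> tileValues) {
    const std::size_t n = tileValues.size();
    // Assumption: the tile size is a power of two.
    assert(n > 0 && (n & (n - 1)) == 0);

    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On a GPU, the threads with local index < stride would run this
        // body in parallel. Each one writes only its own element and reads
        // an element that no thread writes in this step, so there is no race.
        for (std::size_t local = 0; local < stride; ++local) {
            tileValues[local] += tileValues[local + stride];
        }
        // A tile barrier would go here, before the next step reads the sums.
    }
    return tileValues[0];
}

// Number of reduction steps (and therefore barriers) for n values.
int treeSteps(std::size_t n) {
    int steps = 0;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        ++steps;
    }
    return steps;
}
```

For the 2x2 tile in the sample this takes two steps, and a 4x4 tile takes four, whereas the one-thread-per-barrier version would need 4 and 16 barriers respectively.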

Memory fences

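Two kinds of memory access must be synchronized: global memory access and tile_static memory access. A concurrency::array object allocates only global memory. A concurrency::array_view can reference global memory, tile_static memory, or both, depending on how it was constructed.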

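A memory fence ensures that memory accesses are visible to the other threads in the thread tile and that memory accesses are executed in program order. To guarantee this, compilers and processors do not reorder reads and writes across the fence. In C++ AMP, a memory fence is created by calling one of these methods: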

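tile_barrier::wait method: creates a fence around both global and tile_static memory.
tile_barrier::wait_with_all_memory_fence method: creates a fence around both global and tile_static memory.
tile_barrier::wait_with_global_memory_fence method: creates a fence around global memory only.
tile_barrier::wait_with_tile_static_memory_fence method: creates a fence around tile_static memory only.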

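Calling only the specific fence you need can improve the performance of your app. The barrier type affects how the compiler and the hardware reorder statements. For example, a global memory fence applies only to global memory accesses, so the compiler and the hardware may still reorder reads and writes to tile_static variables across the fence.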

In the next example, the barrier synchronizes the writes to tileValues, a tile_static variable. For that reason, tile_barrier::wait_with_tile_static_memory_fence is called instead of tile_barrier::wait.

// Using a tile_static memory fence.
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
    [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
{
    // Copy the values of the tile into a tile-sized array.
    tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
    tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the tile-sized array to load before calculating the average.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, then the calculation executes
    // for every thread in the tile, and makes the same assignment to
    // averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
            }
        }
        averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
    }
});
