oneAPI GPU 优化指南 - 原子操作

TechVideoGPU

已于 2023-12-08 14:37:14 修改

阅读量218

点赞数

分类专栏： oneAPI GPU 优化指南文章标签： oneapi c++ 1024程序员节

于 2023-12-08 13:33:32 首次发布

原文链接：https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-1/atomic-operations.html

版权

oneAPI GPU 优化指南专栏收录该内容

46 篇文章

订阅专栏

本文探讨了SYCL中的原子操作，包括不同数据类型（整型和浮点数）的处理，以及在全局和本地地址空间中的应用。强调了硬件支持和性能差异，以及如何使用VTune和Advisor工具进行性能分析和优化。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

本章节翻译by chenshusmail@163.com 原文：Atomic Operations (intel.com)

原子操作中的数据类型

全局和本地地址空间中的原子操作

基于 USM 数据的原子操作

原子操作允许多个 work-item 通过内存进行任何跨 work-item 通信。 SYCL 原子类似于 C++ 原子，保证被原子保护的资源的访问作为单个单元执行。以下因素影响原子操作的性能和合法性：

数据类型
本地与全局地址空间
主机端、共享和设备端分配的 USM

原子操作中的数据类型

下面的 kernel 展示了在 SYCL 中实现归约（reduction）操作的方法，其中每个 work-item 原子地更新全局累加器。这种加法的输入数据类型和应用此 reduction 操作的向量是整型数。与用于 reduction 的其他技术（如阻塞 blocking）相比，此 kernel 的性能更为合理。

    q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(data_size, [=](auto index) {
        size_t glob_id = index[0];
        auto v = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>(
            sum_acc[0]);
        v.fetch_add(buf_acc[glob_id]);
      });
    });

如果向量的数据类型是浮点数或双精度浮点数，如下面的 kernel 所示，由于某些加速器缺乏对浮点数或双精度原子的硬件支持，对应的性能会受到影响。基于硬件是否支持原子操作，下面两个 kernel 演示了原子加操作（add）的执行时间有巨大差异。

//
int VectorInt(sycl::queue &q, int iter) {
  VectorAllocator<int> alloc;
  AlignedVector<int> a(array_size, alloc);
  AlignedVector<int> b(array_size, alloc);

  InitializeArray<int>(a);
  InitializeArray<int>(b);
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      // InpuGt accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_write);
      sycl::accessor b_acc(a_buf, h, sycl::read_only);

      h.parallel_for(num_items, [=](auto i) {
        auto v = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>(
            a_acc[0]);
        v += b_acc[i];
      });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector int completed on device - took " << (end - start).count()
            << " u-secs\n";
  return ((end - start).count());
}

在使用原子操作时，必须注意确保硬件支持并且能够有效执行。在 Gen9 和 Intel® Iris® Xe 集成显卡上不支持浮点数或双精度数据类型上的原子操作， VectorDouble 的性能将非常差。

//
int VectorDouble(sycl::queue &q, int iter) {
  VectorAllocator<double> alloc;
  AlignedVector<double> a(array_size, alloc);
  AlignedVector<double> b(array_size, alloc);

  InitializeArray<double>(a);
  InitializeArray<double>(b);
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      // InpuGt accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_write);
      sycl::accessor b_acc(a_buf, h, sycl::read_only);

      h.parallel_for(num_items, [=](auto i) {
        auto v = sycl::atomic_ref<double, sycl::memory_order::relaxed,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>(
            a_acc[0]);
        v += b_acc[i];
      });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector Double completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}

通过使用 VTune Profiler 分析这些 kernel，我们可以测出硬件是否支持原子操作的影响。您可以看到 VectorInt kernel 比 VectorDouble 和 VectorFloat 快得多。

VTune Profiler 动态指令的分析让我们看到当没有原子操作的硬件支持时，指令计数会发生巨大变化。

这是我们 VectorInt kernel 的汇编代码。

与 VectorDouble 的汇编代码相比，当我们执行 VectorDouble kernel 时，需要多执行 3300 万个 GPU 指令。

Intel® Advisor 工具有推荐窗格栏，可提供有关如何提高 GPU kernel 性能的建议。

Intel® Advisor 提供的一项建议是“存在低效原子操作”。当硬件中不支持原子操作时，它们会由软件来模拟。这可以被检测到， Intel® Advisor 会给出可能解决方案的建议。

全局和本地地址空间中的原子操作

标准 C++ 内存模型假定应用程序在具有单个地址空间的单个设备上执行。但是，这两个假设都不适用于 SYCL 应用程序：应用程序的不同部分在不同设备上执行（即主机端和一个或多个加速器）；每个设备都有多个地址空间（即私有、本地和全局）；并且每个设备的全局地址空间可能是不相交的（取决于 USM 的支持）。

在全局地址空间中使用原子操作时，同样要小心，因为全局更新比本地更新慢得多。

#include <CL/sycl.hpp>
#include <iostream>
int main() {
  constexpr int N = 256 * 256;
  constexpr int M = 512;
  int total = 0;
  int *a = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++)
    a[i] = 1;
  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> buf(&total, 1);
  sycl::buffer<int> bufa(a, N);
  auto e = q.submit([&](sycl::handler &h) {
    sycl::accessor acc(buf, h);
    sycl::accessor acc_a(bufa, h, sycl::read_only);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id();
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
                       sycl::memory_scope_device,
                       sycl::access::address_space::global_space>
          atomic_op(acc[0]);
      atomic_op += acc_a[i];
    });
  });
  sycl::host_accessor h_a(buf);
  std::cout << "Reduction Sum : " << h_a[0] << "\n";
  std::cout
      << "Kernel Execution Time of Global Atomics Ref: "
      << e.get_profiling_info<sycl::info::event_profiling::command_end>() -
             e.get_profiling_info<sycl::info::event_profiling::command_start>()
      << "\n";
  return 0;
}

您可以重构代码以使用本地内存空间，如下面的示例所示。

#include <CL/sycl.hpp>
#include <iostream>
int main() {
  constexpr int N = 256 * 256;
  constexpr int M = 512;
  constexpr int NUM_WG = N / M;
  int total = 0;
  int *a = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++)
    a[i] = 1;
  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> global(&total, 1);
  sycl::buffer<int> bufa(a, N);
  auto e1 = q.submit([&](sycl::handler &h) {
    sycl::accessor b(global, h);
    sycl::accessor acc_a(bufa, h, sycl::read_only);
    auto acc = sycl::local_accessor<int, 1>(NUM_WG, h);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id(0);
      auto group_id = it.get_group(0);
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
                       sycl::memory_scope_device,
                       sycl::access::address_space::local_space>
          atomic_op(acc[group_id]);
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
                       sycl::memory_scope_device,
                       sycl::access::address_space::global_space>
          atomic_op_global(b[0]);
      atomic_op += acc_a[i];
      it.barrier(sycl::access::fence_space::local_space);
      if (it.get_local_id() == 0)
        atomic_op_global += acc[group_id];
    });
  });
  sycl::host_accessor h_global(global);
  std::cout << "Reduction Sum : " << h_global[0] << "\n";
  int total_time =
      (e1.get_profiling_info<sycl::info::event_profiling::command_end>() -
       e1.get_profiling_info<sycl::info::event_profiling::command_start>());
  std::cout << "Kernel Execution Time of Local Atomics : " << total_time
            << "\n";
  return 0;
}