oneAPI GPU 优化指南 - 内存分配

最新推荐文章于 2024-08-13 15:53:36 发布

TechVideoGPU

最新推荐文章于 2024-08-13 15:53:36 发布

阅读量115

点赞数

文章标签： oneapi

本章节翻译by chenchensmail@163.com 原文：

本节将探讨各种分配内存的方式，以及支持的分配类型。主机上的指针与设备上的指针大小相同。

主机分配 由主机拥有，预计将从系统内存中分配。主机分配可以由主机和所有支持的设备访问。因此，可以在主机和所有支持的设备上使用指向主机分配的相同指针。预计主机分配不会在系统内存和设备本地内存之间迁移。当在设备上访问指向主机分配的指针时，数据通常会通过连接设备和主机的总线（如 PCI-Express ）发送。

设备分配 由特定设备拥有，预计将从设备本地内存中分配。在该设备上可以读取和写入分配的存储空间，但不能直接从主机或任何其他支持的设备访问。

共享分配 可以由主机和所有支持的设备访问。因此，可以在主机和所有支持的设备上使用指向共享分配的相同指针，就像在主机分配中一样。然而，共享分配并不由任何特定设备拥有，而是预计会在主机和一个或多个设备之间迁移。这意味着在迁移发生后，在设备上的访问会从更快的设备本地内存中进行，而不是通过高延迟的总线连接远程访问系统内存。

共享系统分配 是共享分配的子类，其中内存由系统分配器（如 malloc 或 new ）而不是分配 API （如 OpenMP 内存分配 API ）分配。共享系统分配没有关联的设备；它们本质上是跨设备的。像其他共享分配一样，共享系统分配预计会在主机和支持的设备之间迁移，并且可以在主机和所有支持的设备上使用指向共享系统分配的相同指针。

注意事项:

目前，在 Intel® Data Center GPU Max Series 系统上不支持共享系统分配。然而，通过分配 API 进行共享分配是被支持的。

下表总结了各种类型的内存分配的特性

+——————–+——————+————+————-+ |

Type of allocation | Initial location | Accessible | Accessible | | | | on host? | on device? | +====================+==================+============+=============+ | Host | Host | Yes | Yes | +——————–+——————+————+————-+ | Device | Device | No | Yes | +——————–+——————+————+————-+ | Shared | Host, Device, | Yes | Yes | | | or Unspecified | | | +——————–+——————+————+————-+ | Shared-System | Host | Yes | Yes | | | | | | +——————–+——————+————+————-+

主机分配提供了广泛的可访问性（可以直接从主机和所有支持的设备访问），但每次访问的成本可能较高，因为数据通常通过如 PCI Express* 之类的总线发送。

共享分配也提供了广泛的可访问性，但每次访问的成本可能低于主机分配，因为数据会迁移到访问设备。

设备分配有访问限制（不能直接从主机或其他支持的设备访问），但提供了更高的性能，因为访问是对设备本地内存进行的。

OpenMP Runtime 内存分配例程

Intel 编译器支持一些OpenMP运行时例程来执行内存分配。这些例程在下表中显示。

OpenMP memory allocation routine	Intel extension?	Type of allocation
`omp_target_alloc`	No	Device
`omp_target_alloc_device`	Yes	Device
`omp_target_alloc_host`	Yes	Host
`omp_target_alloc_shared`	Yes	Shared

请注意，三个例程 omp_target_alloc_device, omp_target_alloc_host 和 omp_target_alloc_shared 是对 OpenMP 规范的 Intel 扩展。

以下示例使用了上述 OpenMP 内存分配例程。将这些与使用 map 子句的示例进行比较。

有关内存分配的更多信息，请参阅：

Data Parallel C++, by James Reinders et al.
SYCL 2020 Specification
oneAPI Level Zero Specification
本指南的 SYCL 部分

使用 `map` 子句

第一个示例使用 map 子句在设备上分配内存并在主机和设备之间拷贝数据。

在以下示例中，通过调用 C/C++ 标准库例程， malloc 在系统内存中分配数组 A, B 和 C。

在第 58 行的 target 构造是在设备上计算数组 C 值的主 kernel。由于需要在计算之前将 C 的值从主机传输到设备，并在计算结束时从设备传输到主机，因此在此 target 构造上指定了 map(tofrom:C[0:length) 子句。对于数组 A 和 B ，指定了 map(to: A[0:length], B[0:length])，因为需要将这些数组的值从主机传输到设备，而设备只读取这些值。在底层， map 子句导致在设备上为数组分配存储空间，并将数据从主机拷贝到设备，反之亦然。

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <stdint.h>
 4#include <math.h>
 5#include <omp.h>
 6
 7#define iterations 100
 8#define length     64*1024*1024
 9
10int main(void)
11{
12  size_t bytes = length*sizeof(double);
13  double * __restrict A;
14  double * __restrict B;
15  double * __restrict C;
16  double scalar = 3.0;
17  double nstream_time = 0.0;
18
19  // Allocate arrays on the host using plain malloc()
20
21  A = (double *) malloc(bytes);
22  if (A == NULL){
23     printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
24     exit(1);
25  }
26
27  B = (double *) malloc(bytes);
28  if (B == NULL){
29     printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
30     exit(1);
31  }
32
33  C = (double *) malloc(bytes);
34  if (C == NULL){
35     printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
36     exit(1);
37  }
38
39  // Initialize the arrays
40
41  #pragma omp parallel for
42  for (size_t i=0; i<length; i++) {
43      A[i] = 2.0;
44      B[i] = 2.0;
45      C[i] = 0.0;
46  }
47
48  // Perform the computation
49
50  nstream_time = omp_get_wtime();
51  for (int iter = 0; iter<iterations; iter++) {
52      #pragma omp target teams distribute parallel for \
53        map(to: A[0:length], B[0:length]) \
54        map(tofrom: C[0:length])
55      for (size_t i=0; i<length; i++) {
56          C[i] += A[i] + scalar * B[i];
57      }
58  }
59  nstream_time = omp_get_wtime() - nstream_time;
60
61  // Validate and output results
62
63  double ar = 2.0;
64  double br = 2.0;
65  double cr = 0.0;
66  for (int iter = 0; iter<iterations; iter++) {
67      for (int i=0; i<length; i++) {
68          cr += ar + scalar * br;
69      }
70  }
71
72  double asum = 0.0;
73  #pragma omp parallel for reduction(+:asum)
74  for (size_t i=0; i<length; i++) {
75      asum += fabs(C[i]);
76  }
77
78  free(A);
79  free(B);
80  free(C);
81
82  double epsilon=1.e-8;
83  if (fabs(cr - asum)/asum > epsilon) {
84      printf("Failed Validation on output array\n"
85             "       Expected checksum: %lf\n"
86             "       Observed checksum: %lf\n"
87             "ERROR: solution did not validate\n", cr, asum);
88      return 1;
89  } else {
90      printf("Solution validates\n");
91      double avgtime = nstream_time/iterations;
92      printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
93  }
94
95  return 0;
96}

编译命令:

icpx -fiopenmp -fopenmp-targets=spir64 test_target_map.cpp

运行命令:

OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out

在 iterations 循环内的 target 构造上的 map 子句导致数据（ A, B, C 的值）在每个 target 区域开始时从主机传输到设备，并导致数据（ C 的值）在每个 target 区域结束时从设备传输到主机。这些数据传输产生了显著的性能开销。使用 map 子句的更好方法是将整个 iterations 循环放入带有 map 子句的 target data 构造中。这导致在 iterations 循环开始时发生一次传输，在 iterations 循环结束时再发生一次传输。下面显示了使用 target data 和 map 子句的修改后的示例。

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <stdint.h>
 4#include <math.h>
 5#include <omp.h>
 6
 7#define iterations 100
 8#define length     64*1024*1024
 9
10int main(void)
11{
12  size_t bytes = length*sizeof(double);
13  double * __restrict A;
14  double * __restrict B;
15  double * __restrict C;
16  double scalar = 3.0;
17  double nstream_time = 0.0;
18
19  // Allocate arrays on the host using plain malloc()
20
21  A = (double *) malloc(bytes);
22  if (A == NULL){
23     printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
24     exit(1);
25  }
26
27  B = (double *) malloc(bytes);
28  if (B == NULL){
29     printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
30     exit(1);
31  }
32
33  C = (double *) malloc(bytes);
34  if (C == NULL){
35     printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
36     exit(1);
37  }
38
39  // Initialize the arrays
40
41  #pragma omp parallel for
42  for (size_t i=0; i<length; i++) {
43      A[i] = 2.0;
44      B[i] = 2.0;
45      C[i] = 0.0;
46  }
47
48  // Perform the computation
49
50  nstream_time = omp_get_wtime();
51  #pragma omp target data  map(to: A[0:length], B[0:length]) \
52    map(tofrom: C[0:length])
53  {
54     for (int iter = 0; iter<iterations; iter++) {
55         #pragma omp target teams distribute parallel for
56         for (size_t i=0; i<length; i++) {
57             C[i] += A[i] + scalar * B[i];
58         }
59     }
60  }
61  nstream_time = omp_get_wtime() - nstream_time;
62
63  // Validate and output results
64
65  double ar = 2.0;
66  double br = 2.0;
67  double cr = 0.0;
68  for (int iter = 0; iter<iterations; iter++) {
69      for (int i=0; i<length; i++) {
70          cr += ar + scalar * br;
71      }
72  }
73
74  double asum = 0.0;
75  #pragma omp parallel for reduction(+:asum)
76  for (size_t i=0; i<length; i++) {
77      asum += fabs(C[i]);
78  }
79
80  free(A);
81  free(B);
82  free(C);
83
84  double epsilon=1.e-8;
85  if (fabs(cr - asum)/asum > epsilon) {
86      printf("Failed Validation on output array\n"
87             "       Expected checksum: %lf\n"
88             "       Observed checksum: %lf\n"
89             "ERROR: solution did not validate\n", cr, asum);
90      return 1;
91  } else {
92      printf("Solution validates\n");
93      double avgtime = nstream_time/iterations;
94      printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
95  }
96
97  return 0;
98}

`omp_target_alloc`

接下来，将上面的示例修改为使用设备分配而不是 map 子句。通过调用 OpenMP runtime 例程 omp_target_alloc 直接在设备上为数组 A, B 和 C 分配存储空间。该例程接受两个参数：要在设备上分配的字节数，以及要分配存储空间的设备编号。该例程返回一个设备指针，该指针引用了在设备上分配的存储空间的设备地址。如果调用 omp_target_alloc 返回 NULL ，则表示分配未成功。

T要在一个 target 构造中访问已分配的内存，可以在该构造上列出一个调用 omp_target_alloc 返回的设备指针，并使用一个 is_device_ptr 子句。这确保了 kernel 执行前后没有数据传输，因为 kernel 操作的是已经在设备上的数据。

在程序结束时，运行时例程 omp_target_free 用于在设备上释放数组 A, B 和 C 的存储空间。

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <stdint.h>
 4#include <math.h>
 5#include <omp.h>
 6
 7#define iterations 100
 8#define length     64*1024*1024
 9
10int main(void)
11{
12  int device_id = omp_get_default_device();
13  size_t bytes = length*sizeof(double);
14  double * __restrict A;
15  double * __restrict B;
16  double * __restrict C;
17  double scalar = 3.0;
18  double nstream_time = 0.0;
19
20  // Allocate arrays in device memory
21
22  A = (double *) omp_target_alloc(bytes, device_id);
23  if (A == NULL){
24     printf(" ERROR: Cannot allocate space for A using omp_target_alloc().\n");
25     exit(1);
26  }
27
28  B = (double *) omp_target_alloc(bytes, device_id);
29  if (B == NULL){
30     printf(" ERROR: Cannot allocate space for B using omp_target_alloc().\n");
31     exit(1);
32  }
33
34  C = (double *) omp_target_alloc(bytes, device_id);
35  if (C == NULL){
36     printf(" ERROR: Cannot allocate space for C using omp_target_alloc().\n");
37     exit(1);
38  }
39
40  // Initialize the arrays
41
42  #pragma omp target teams distribute parallel for \
43    is_device_ptr(A,B,C)
44  for (size_t i=0; i<length; i++) {
45      A[i] = 2.0;
46      B[i] = 2.0;
47      C[i] = 0.0;
48  }
49
50  // Perform the computation 'iterations' number of times
51
52  nstream_time = omp_get_wtime();
53  for (int iter = 0; iter<iterations; iter++) {
54      #pragma omp target teams distribute parallel for \
55         is_device_ptr(A,B,C)
56      for (size_t i=0; i<length; i++) {
57          C[i] += A[i] + scalar * B[i];
58      }
59  }
60  nstream_time = omp_get_wtime() - nstream_time;
61
62  // Validate and output results
63
64  double ar = 2.0;
65  double br = 2.0;
66  double cr = 0.0;
67  for (int iter = 0; iter<iterations; iter++) {
68      for (int i=0; i<length; i++) {
69          cr += ar + scalar * br;
70      }
71  }
72
73  double asum = 0.0;
74  #pragma omp target teams distribute parallel for reduction(+:asum) \
75    map(tofrom: asum) is_device_ptr(C)
76  for (size_t i=0; i<length; i++) {
77      asum += fabs(C[i]);
78  }
79
80  omp_target_free(A, device_id);
81  omp_target_free(B, device_id);
82  omp_target_free(C, device_id);
83
84  double epsilon=1.e-8;
85  if (fabs(cr - asum)/asum > epsilon) {
86      printf("Failed Validation on output array\n"
87             "       Expected checksum: %lf\n"
88             "       Observed checksum: %lf\n"
89             "ERROR: solution did not validate\n", cr, asum);
90      return 1;
91  } else {
92      printf("Solution validates\n");
93      double avgtime = nstream_time/iterations;
94      printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
95  }
96
97  return 0;
98}

注意事项:

在调用 omp_target_alloc 时，指定的设备编号必须是支持的设备之一，而不能是主机设备。这将是分配存储空间的设备。

由于数组 A, B 和 C 不可从主机访问，因此需要在 OpenMP target 区域内完成数组的初始化、kernel 执行和 C 元素的求和。

设备分配只能由 omp_target_alloc 调用中指定的设备访问，但可以通过调用 omp_target_memcpy 将其拷贝到主机或其他设备上分配的内存中。

`omp_target_alloc_device`

Intel 扩展 omp_target_alloc_device 与 omp_target_alloc 相似。它也是用两个参数调用的：在设备上分配的字节数，以及要在其上分配存储的设备的编号。该例程返回一个设备指针，该指针引用设备上分配的存储的设备地址。如果对 omp_target_alloc_device 的调用返回 NULL ，则表示分配不成功。

上述的 omp_target_alloc 示例可以通过简单地将对 omp_target_alloc 的调用替换为对 omp_targer_alloc_device 的调用来重写，如下所示。

在程序结束时， runtime 例程 omp_target_free 用于在设备上释放数组 A, B 和 C 的存储空间。

26  // Allocate arrays in device memory
27
28  A = (double *) omp_target_alloc_device(bytes, device_id);
29  if (A == NULL){
30     printf(" ERROR: Cannot allocate space for A using omp_target_alloc_device().\n");
31     exit(1);
32  }
33
34  B = (double *) omp_target_alloc_device(bytes, device_id);
35  if (B == NULL){
36     printf(" ERROR: Cannot allocate space for B using omp_target_alloc_device().\n");
37     exit(1);
38  }
39
40  C = (double *) omp_target_alloc_device(bytes, device_id);
41  if (C == NULL){
42     printf(" ERROR: Cannot allocate space for C using omp_target_alloc_device().\n");
43     exit(1);
44  }

注意事项:

所有适用于omp_target_alloc的上述注意事项也适用于 omp_target_alloc_device。

`omp_target_alloc_host`

上述示例也可以通过对 A, B 和 C 进行主机分配来重写。这允许内存可以被主机和所有支持的设备访问。

在以下修改后的示例中，调用了 runtime 例程 omp_target_alloc_host （一个 Intel 扩展）来为数组 A, B 和 C 分配存储空间。该例程接受两个参数：要分配的字节数，以及设备编号。设备编号必须是支持的设备之一，而不能是主机设备。该例程返回一个指向主机内存中存储位置的指针。如果调用 omp_target_alloc_host 返回 NULL ，这表示分配未成功。

注意在程序顶部指定了指令 requires unified_address。这要求实现保证所有通过 OpenMP API 例程和指令可访问的设备使用统一地址空间。在此地址空间中，从所有设备看到的指针将始终引用内存中的同一位置，而且不需要使用 is_device_ptr 子句从设备指针获取设备地址以便在目标区域内使用。当使用 Intel 编译器时，实际上并不需要 requires unified_address 指令，因为默认情况下就保证了统一地址空间。然而，为了可移植性，代码包含了该指令。

通过调用 omp_target_alloc_host 返回的指针可以从主机和所有支持的设备访问已分配的内存。由于使用了统一地址空间，因此在设备上访问内存时不需要在 target 构造上使用任何 map 子句和任何 is_device_ptr 子句。

在程序结束时， runtime 例程 omp_target_free 用于释放数组 A, B 和 C 的存储空间。

 1#include <stdio.h>
 2#include <stdlib.h>
 3#include <stdint.h>
 4#include <math.h>
 5#include <omp.h>
 6
 7#pragma omp requires unified_address
 8
 9#define iterations 100
10#define length     64*1024*1024
11
12int main(void)
13{
14  int device_id = omp_get_default_device();
15  size_t bytes = length*sizeof(double);
16  double * __restrict A;
17  double * __restrict B;
18  double * __restrict C;
19  double scalar = 3.0;
20  double nstream_time = 0.0;
21
22  // Allocate arrays in host memory
23
24  A = (double *) omp_target_alloc_host(bytes, device_id);
25  if (A == NULL){
26     printf(" ERROR: Cannot allocate space for A using omp_target_alloc_host().\n");
27     exit(1);
28  }
29
30  B = (double *) omp_target_alloc_host(bytes, device_id);
31  if (B == NULL){
32     printf(" ERROR: Cannot allocate space for B using omp_target_alloc_host().\n");
33     exit(1);
34  }
35
36  C = (double *) omp_target_alloc_host(bytes, device_id);
37  if (C == NULL){
38     printf(" ERROR: Cannot allocate space for C using omp_target_alloc_host().\n");
39     exit(1);
40  }
41
42  // Initialize the arrays
43
44  #pragma omp parallel for
45  for (size_t i=0; i<length; i++) {
46      A[i] = 2.0;
47      B[i] = 2.0;
48      C[i] = 0.0;
49  }
50
51  // Perform the computation
52
53  nstream_time = omp_get_wtime();
54  for (int iter = 0; iter<iterations; iter++) {
55      #pragma omp target teams distribute parallel for
56      for (size_t i=0; i<length; i++) {
57          C[i] += A[i] + scalar * B[i];
58      }
59  }
60  nstream_time = omp_get_wtime() - nstream_time;
61
62  // Validate and output results
63
64  double ar = 2.0;
65  double br = 2.0;
66  double cr = 0.0;
67  for (int iter = 0; iter<iterations; iter++) {
68      for (int i=0; i<length; i++) {
69          cr += ar + scalar * br;
70      }
71  }
72
73  double asum = 0.0;
74  #pragma omp parallel for reduction(+:asum)
75  for (size_t i=0; i<length; i++) {
76      asum += fabs(C[i]);
77  }
78
79  omp_target_free(A, device_id);
80  omp_target_free(B, device_id);
81  omp_target_free(C, device_id);
82
83  double epsilon=1.e-8;
84  if (fabs(cr - asum)/asum > epsilon) {
85      printf("Failed Validation on output array\n"
86             "       Expected checksum: %lf\n"
87             "       Observed checksum: %lf\n"
88             "ERROR: solution did not validate\n", cr, asum);
89      return 1;
90  } else {
91      printf("Solution validates\n");
92      double avgtime = nstream_time/iterations;
93      printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
94  }
95
96  return 0;
97}

注意事项:

当调用 omp_target_alloc_host 时，指定的设备号必须是支持的设备之一，不能是主机设备。

由于数组 A、 B``和 ``C 可以从主机和设备访问，因此可以在主机（在 target 结构体外部）或设备（在 target 结构体内部）上初始化数组并对 C 的元素进行求和。

Intel® Data Center GPU Max 系列不支持对主机分配（即通过 omp_target_alloc_host 分配内存）进行原子操作（或使用原子操作的算法，如某些 reduction ）。请在通过 omp_target_alloc_device 分配的内存上使用原子操作。

`omp_target_alloc_shared`

上述示例已修改，以使用共享分配代替主机分配。runtime 例程 omp_target_alloc_shared 被调用来为数组 A、 B``和 ``C 分配存储空间。该例程接受两个参数：在设备上分配的字节数和设备号。设备号必须是支持的设备之一，不能是主机设备。该例程返回一个指向共享内存中存储位置的指针。如果调用 omp_target_alloc_shared 返回 NULL ，则表示分配未成功。

注意，在程序顶部指定了 requires unified_addres 指令，以确保可移植性。

通过调用 omp_target_alloc_share 返回的指针可以从主机和所有支持的设备访问存储。由于使用了统一地址空间，因此在 targe 结构体上访问设备的内存时，不需要 map 子句和 is_device_pt 子句。

在程序结束时， runtime 例 omp_target_fre 用于为 A、 B 和 C 释放存储空间。

28  // Allocate arrays in shared memory
29
30  A = (double *) omp_target_alloc_shared(bytes, device_id);
31  if (A == NULL){
32     printf(" ERROR: Cannot allocate space for A using omp_target_alloc_shared().\n");
33     exit(1);
34  }
35
36  B = (double *) omp_target_alloc_shared(bytes, device_id);
37  if (B == NULL){
38     printf(" ERROR: Cannot allocate space for B using omp_target_alloc_shared().\n");
39     exit(1);
40  }
41
42  C = (double *) omp_target_alloc_shared(bytes, device_id);
43  if (C == NULL){
44     printf(" ERROR: Cannot allocate space for C using omp_target_alloc_shared().\n");
45     exit(1);
46  }

注意事项:

当调用 omp_target_alloc_shared 时，指定的设备号必须是支持的设备之一，不能是主机设备。

由于数组可以从主机和设备访问，因此初始化和验证可以在主机或设备（在 target 结构体内部）上完成。

不支持从主机和设备同时访问通过 omp_target_alloc_shared 分配的内存。

`omp_target_memcpy`

以下示例展示了如何使用 runtime 例程 omp_target_memcpy 将内存从主机拷贝到设备，以及从设备拷贝到主机。首先使用普通的 malloc 在系统内存中为数组 h_A、 h_B 和 h_C 分配空间，并进行初始化。然后使用 omp_target_alloc 在设备上为对应的数组 d_A 、 d_B 和 d_C 分配空间。

在第 104 行的 target 结构体开始之前，通过调用 omp_target_memcpy，将 h_A、 h_B 和 h_C 中的值拷贝到 d_A 、 d_B 和 d_C 。在 target 区域之后，在设备上计算出的新的 d_C 值通过调用 omp_target_memcpy 拷贝到 h_C。

  1#include <stdio.h>
  2#include <stdlib.h>
  3#include <stdint.h>
  4#include <math.h>
  5#include <omp.h>
  6
  7#define iterations 100
  8#define length     64*1024*1024
  9
 10int main(void)
 11{
 12  int device_id = omp_get_default_device();
 13  int host_id = omp_get_initial_device();
 14  size_t bytes = length*sizeof(double);
 15  double * __restrict h_A;
 16  double * __restrict h_B;
 17  double * __restrict h_C;
 18  double * __restrict d_A;
 19  double * __restrict d_B;
 20  double * __restrict d_C;
 21  double scalar = 3.0;
 22  double nstream_time = 0.0;
 23
 24  // Allocate arrays h_A, h_B, and h_C on the host using plain malloc()
 25
 26  h_A = (double *) malloc(bytes);
 27  if (h_A == NULL){
 28     printf(" ERROR: Cannot allocate space for h_A using plain malloc().\n");
 29     exit(1);
 30  }
 31
 32  h_B = (double *) malloc(bytes);
 33  if (h_B == NULL){
 34     printf(" ERROR: Cannot allocate space for h_B using plain malloc().\n");
 35     exit(1);
 36  }
 37
 38  h_C = (double *) malloc(bytes);
 39  if (h_C == NULL){
 40     printf(" ERROR: Cannot allocate space for h_C using plain malloc().\n");
 41     exit(1);
 42  }
 43
 44  // Allocate arrays d_A, d_B, and d_C on the device using omp_target_alloc()
 45
 46  d_A = (double *) omp_target_alloc(bytes, device_id);
 47  if (d_A == NULL){
 48     printf(" ERROR: Cannot allocate space for d_A using omp_target_alloc().\n");
 49     exit(1);
 50  }
 51
 52  d_B = (double *) omp_target_alloc(bytes, device_id);
 53  if (d_B == NULL){
 54     printf(" ERROR: Cannot allocate space for d_B using omp_target_alloc().\n");
 55     exit(1);
 56  }
 57
 58  d_C = (double *) omp_target_alloc(bytes, device_id);
 59  if (d_C == NULL){
 60     printf(" ERROR: Cannot allocate space for d_C using omp_target_alloc().\n");
 61     exit(1);
 62  }
 63
 64  // Initialize the arrays on the host
 65
 66  #pragma omp parallel for
 67  for (size_t i=0; i<length; i++) {
 68      h_A[i] = 2.0;
 69      h_B[i] = 2.0;
 70      h_C[i] = 0.0;
 71  }
 72
 73  // Call omp_target_memcpy() to copy values from host to device
 74
 75  int rc = 0;
 76  rc = omp_target_memcpy(d_A, h_A, bytes, 0, 0, device_id, host_id);
 77  if (rc) {
 78     printf("ERROR: omp_target_memcpy(A) returned %d\n", rc);
 79     exit(1);
 80  }
 81
 82  rc = omp_target_memcpy(d_B, h_B, bytes, 0, 0, device_id, host_id);
 83  if (rc) {
 84     printf("ERROR: omp_target_memcpy(B) returned %d\n", rc);
 85     exit(1);
 86  }
 87
 88  rc = omp_target_memcpy(d_C, h_C, bytes, 0, 0, device_id, host_id);
 89  if (rc) {
 90     printf("ERROR: omp_target_memcpy(C) returned %d\n", rc);
 91     exit(1);
 92  }
 93
 94  // Perform the computation
 95
 96  nstream_time = omp_get_wtime();
 97  for (int iter = 0; iter<iterations; iter++) {
 98      #pragma omp target teams distribute parallel for \
 99        is_device_ptr(d_A,d_B,d_C)
100      for (size_t i=0; i<length; i++) {
101          d_C[i] += d_A[i] + scalar * d_B[i];
102      }
103  }
104  nstream_time = omp_get_wtime() - nstream_time;
105
106  // Call omp_target_memcpy() to copy values from device to host
107
108  rc = omp_target_memcpy(h_C, d_C, bytes, 0, 0, host_id, device_id);
109  if (rc) {
110     printf("ERROR: omp_target_memcpy(A) returned %d\n", rc);
111     exit(1);
112  }
113
114  // Validate and output results
115
116  double ar = 2.0;
117  double br = 2.0;
118  double cr = 0.0;
119  for (int iter = 0; iter<iterations; iter++) {
120      for (int i=0; i<length; i++) {
121          cr += ar + scalar * br;
122      }
123  }
124
125  double asum = 0.0;
126  #pragma omp parallel for reduction(+:asum)
127  for (size_t i=0; i<length; i++) {
128      asum += fabs(h_C[i]);
129  }
130
131  free(h_A);
132  free(h_B);
133  free(h_C);
134  omp_target_free(d_A, device_id);
135  omp_target_free(d_B, device_id);
136  omp_target_free(d_C, device_id);
137
138  double epsilon=1.e-8;
139  if (fabs(cr - asum)/asum > epsilon) {
140      printf("Failed Validation on output array\n"
141             "       Expected checksum: %lf\n"
142             "       Observed checksum: %lf\n"
143             "ERROR: solution did not validate\n", cr, asum);
144      return 1;
145  } else {
146      printf("Solution validates\n");
147      double avgtime = nstream_time/iterations;
148      printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
149  }
150
151  return 0;
152}

性能考虑因素

在上述示例中（使用 map 子句， omp_target_alloc ， omp_target_alloc_device ， omp_target_alloc_host ， omp_target_alloc_shared ， omp_target_memcpy ），主 kernel 是计算数组 C 值的 target 结构。为了获得更准确的时间，这个 target 结构被包含在一个循环中，所以部署发生了 iterations 次（其中 iterations = 100 ）。通过将循环 iterations 所用的总时间除以 100 ，计算出 kernel 所用的平均时间。

56  // Perform the computation 'iterations' number of times
57
58  nstream_time = omp_get_wtime();
59  for (int iter = 0; iter<iterations; iter++) {
60      #pragma omp target teams distribute parallel for \
61         is_device_ptr(A,B,C)
62      for (size_t i=0; i<length; i++) {
63          C[i] += A[i] + scalar * B[i];
64      }
65  }
66  nstream_time = omp_get_wtime() - nstream_time;

LIBOMPTARGET_DEBUG=1 输出显示，所有上述示例都具有相同的 ND_range 分区。

Target LEVEL0 RTL --> Allocated a device memory 0xff00000020200000
Libomptarget --> omp_target_alloc returns device ptr 0xff00000020200000
Libomptarget --> Call to omp_target_alloc for device 0 requesting 536870912 bytes
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.

下表显示了在使用的特定 GPU （仅 1 堆栈）上运行各个版本时， kernel 所用的平均时间。

Version	Time (seconds)
`map`	0.183604
`map` + `target data`	0.012757
`omp_target_alloc`	0.002501
`omp_target_alloc_device`	0.002499
`omp_target_alloc_host`	0.074412
`omp_target_alloc_shared`	0.012491
`omp_target_memcpy`	0.011072

上述性能数据显示，使用 map 子句的版本是最慢的版本（ 0.183604秒）。这是因为在每次 kernel 启动的开始和结束时都会发生数据传输。主 kernel 启动了 100 次。在每次 kernel 启动的开始，设备上为数组 A、 B``和 ``C 分配了存储空间，并将这些数组的值从主机拷贝到设备。在 kernel 结束时，将数组 C 的值从设备拷贝到主机。将整个 iterations 循环放在带有 map 子句的 target data 结构中，可以将运行时间减少到 0.012757 秒，因为传输只在 iterations 循环中的第一个 kernel 启动时发生一次，然后在该循环中的最后一个 kernel 之后再发生一次。

omp_target_alloc 和 omp_target_alloc_device 版本具有最佳性能（分别为 0.002501 秒和 0.002499 秒）。在这些版本中，直接在设备内存中为 A、 B 和 C 分配存储空间，因此设备上的访问是从设备本地内存进行的。这对于在设备端使用临时数组的应用程序来说是一个有用的模型。这些数组永远不需要在主机上访问。在这种情况下，建议将临时数组分配到设备上，并不用担心数据传输，如本示例所示。

使用 omp_target_alloc_shared 版本也表现良好，但稍慢（ 0.012491 秒）。在此版本中，为 A、 B 和 C 分配了共享内存中的存储空间。因此，数据可以在主机和设备之间迁移。迁移有一定的开销，但迁移后，在设备上的访问会从更快的设备本地内存进行。在此版本中，数组的初始化发生在主机上。在第一次 kernel 启动时，数组被迁移到设备上，并且 kernel 在设备上本地访问数组。最后，在主机执行 reduction 计算之前，整个 C 数组被迁回到主机。

使用 omp_target_alloc_host 版本（ 0.074412 秒）所需时间几乎是 omp_target_alloc_shared 版本所需时间的 6 倍。这是因为在主机内存中分配的数据不会从主机迁移到设备。当 kernel 试图访问数据时，数据通常通过连接设备和主机的总线（如 PCI Express ）发送。这比访问本地设备内存要慢。如果设备只偶尔访问数组的一小部分，则可以使用 omp_target_alloc_host 将该数组分配到主机内存中。然而，如果设备端频繁访问数组，则应将其保留在设备内存中。将数据保留在主机内存中并通过 PCI 访问它会降低性能。

最后，关于数据传输的注意事项：可以通过搜索 "Libomptarget --> Moving" 来查看 map 版本中传输的数据量。注意每次启动主 kernel 都会产生以下数据传输:

$ grep "Libomptarget --> Moving" test_target_map.debug
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a5fc8b010) -> (tgt:0xff00000000200000)
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a9fc8d010) -> (tgt:0xff00000020200000)
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a7fc8c010) -> (tgt:0xff00000040200000)
Libomptarget --> Moving 536870912 bytes (tgt:0xff00000000200000) -> (hst:0x00007f1a5fc8b010)

另一方面， omp_target_alloc_... 版本中的数据传输由 runtime 系统的较低层处理。因此，在 LIBOMPTARGET_DEBUG=1 输出中搜索 "Libomptarget --> Moving" 将不会显示发生的数据传输。

Fortran 示例

下面展示了使用 target data 和 map 子句的 Fortran 示例。

 1    program main
 2    use iso_fortran_env
 3    use omp_lib
 4    implicit none
 5
 6    integer, parameter :: iterations=100
 7    integer, parameter :: length=64*1024*1024
 8    real(kind=REAL64), parameter ::  epsilon=1.D-8
 9    real(kind=REAL64), allocatable ::  A(:)
10    real(kind=REAL64), allocatable ::  B(:)
11    real(kind=REAL64), allocatable ::  C(:)
12    real(kind=REAL64) :: scalar=3.0
13    real(kind=REAL64) :: ar, br, cr, asum
14    real(kind=REAL64) :: nstream_time, avgtime
15    integer :: err, i, iter
16
17    !
18    ! Allocate arrays on the host using plain allocate
19
20    allocate( A(length), stat=err )
21    if (err .ne. 0) then
22      print *, "Allocation of A returned ", err
23      stop 1
24    endif
25
26    allocate( B(length), stat=err )
27    if (err .ne. 0) then
28      print *, "Allocation of B returned ", err
29      stop 1
30    endif
31
32    allocate( C(length), stat=err )
33    if (err .ne. 0) then
34      print *, "Allocation of C returned ", err
35      stop 1
36    endif
37
38    !
39    ! Initialize the arrays
40
41    !$omp parallel do
42    do i = 1, length
43       A(i) = 2.0
44       B(i) = 2.0
45       C(i) = 0.0
46    end do
47
48    !
49    ! Perform the computation
50
51    nstream_time = omp_get_wtime()
52    !$omp target data  map(to: A, B) map(tofrom: C)
53
54    do iter = 1, iterations
55       !$omp target teams distribute parallel do
56       do i = 1, length
57          C(i) = C(i) + A(i) + scalar * B(i)
58       end do
59    end do
60
61    !$omp end target data
62    nstream_time = omp_get_wtime() - nstream_time
63
64    !
65    ! Validate and output results
66
67    ar = 2.0
68    br = 2.0
69    cr = 0.0
70    do iter = 1, iterations
71       do i = 1, length
72          cr = cr + ar + scalar * br
73       end do
74    end do
75
76    asum = 0.0
77    !$omp parallel do reduction(+:asum)
78    do i = 1, length
79       asum = asum + abs(C(i))
80    end do
81
82    if (abs(cr - asum)/asum > epsilon) then
83       write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum
84    else
85       avgtime = nstream_time/iterations
86       write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =",  avgtime
87    endif
88
89110 format (A, F20.6, A, F20.6)
90120 format (A, F20.6, A, F10.6)
91
92    deallocate(A)
93    deallocate(B)
94    deallocate(C)
95
96    end program main

下面展示了使用 omp_target_alloc_device 的 Fortran 示例。在此示例中，使用分配器 omp_target_device_mem_alloc 的 allocate 指令在设备上为数组 A、 B 和 C 分配空间。在 target data 指令（第 37 行）上使用了 use_device_addr(A, B, C) 子句，以指示数组具有设备地址，并且应在 target 区域中使用这些地址。

 1    use iso_fortran_env
 2    use omp_lib
 3    implicit none
 4
 5    integer, parameter :: iterations=100
 6    integer, parameter :: length=64*1024*1024
 7    real(kind=REAL64), parameter ::  epsilon=1.D-8
 8    real(kind=REAL64), allocatable ::  A(:)
 9    real(kind=REAL64), allocatable ::  B(:)
10    real(kind=REAL64), allocatable ::  C(:)
11    real(kind=REAL64) :: scalar=3.0
12    real(kind=REAL64) :: ar, br, cr, asum
13    real(kind=REAL64) :: nstream_time, avgtime
14    integer :: i, iter
15
16    !
17    ! Allocate arrays in device memory
18
19    !$omp allocate allocator(omp_target_device_mem_alloc)
20    allocate(A(length))
21
22    !$omp allocate allocator(omp_target_device_mem_alloc)
23    allocate(B(length))
24
25    !$omp allocate allocator(omp_target_device_mem_alloc)
26    allocate(C(length))
27
28    !
29    ! Begin target data
30
31    !$omp target data use_device_addr(A, B, C)
32
33    !
34    ! Initialize the arrays
35
36    !$omp target teams distribute parallel do
37    do i = 1, length
38       A(i) = 2.0
39       B(i) = 2.0
40       C(i) = 0.0
41    end do
42
43    !
44    ! Perform the computation
45
46    nstream_time = omp_get_wtime()
47    do iter = 1, iterations
48       !$omp target teams distribute parallel do
49       do i = 1, length
50          C(i) = C(i) + A(i) + scalar * B(i)
51       end do
52    end do
53    nstream_time = omp_get_wtime() - nstream_time
54
55    !
56    ! Validate and output results
57
58    ar = 2.0
59    br = 2.0
60    cr = 0.0
61    do iter = 1, iterations
62       do i = 1, length
63          cr = cr + ar + scalar * br
64       end do
65    end do
66
67    asum = 0.0
68    !$omp target teams distribute parallel do reduction(+:asum) &
69    !$omp map(tofrom: asum)
70    do i = 1, length
71       asum = asum + abs(C(i))
72    end do
73
74    !
75    ! End target data
76
77    !$omp end target data
78
79    if (abs(cr - asum)/asum > epsilon) then
80       write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum
81    else
82       avgtime = nstream_time/iterations
83       write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =",  avgtime
84    endif
85
86110 format (A, F20.6, A, F20.6)
87120 format (A, F20.6, A, F10.6)
88
89    deallocate(A)
90    deallocate(B)
91    deallocate(C)
92
93    end program main