This chapter was translated by chenchensmail@163.com. Original: Offloading oneMKL Computations onto the GPU (intel.com)
The Intel® oneAPI Math Kernel Library (oneMKL) improves performance by providing math routines for software applications that solve large computational problems. oneMKL provides BLAS, Sparse BLAS, and LAPACK linear algebra routines, fast Fourier transforms, vectorized math functions, random number generation functions, and other functionality.
The oneMKL distribution includes an examples directory that contains samples of various calls to oneMKL routines.
For more information about Intel® oneAPI MKL, see:

- Developer Reference for Intel® oneAPI Math Kernel Library - C
- Developer Reference for Intel® oneAPI Math Kernel Library - Fortran
Compile and Link Commands When Using oneMKL OpenMP Offload
The information presented in this section is specific to Linux. For Windows-specific information and further details, refer to the Intel® oneAPI Math Kernel Library Link Line Advisor.
Notes:

- The link commands shown below link dynamically to the oneMKL library.
- The Intel® oneMKL LP64 libraries index arrays with the 32-bit integer type, whereas the Intel® oneMKL ILP64 libraries index arrays with the 64-bit integer type (for indexing large arrays with more than 2^31 - 1 elements).
C/C++ (Linux)
The compile and link commands for a C/C++ program that uses OpenMP threads and calls the oneMKL C/C++ API with 32-bit integers are as follows:
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -c source.cpp
Link:    icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
If the program calls the oneMKL C/C++ API with 64-bit integers, the compile and link commands are as follows:
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DMKL_ILP64 -c source.cpp
Link:    icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
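For illustration only (this listing is not part of the original article), a minimal source.cpp that the commands above could build might look like the following sketch. It maps the matrices to the device and issues one cblas_dgemm call through the Intel-specific target variant dispatch construct described later in this chapter; the matrix names and sizes are made up. When compiled with -DMKL_ILP64, MKL_INT is a 64-bit integer.

// Hypothetical source.cpp: offload one DGEMM call to the GPU.
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

int main() {
    const MKL_INT m = 512, k = 256, n = 384;   // MKL_INT is 64-bit when -DMKL_ILP64 is used
    double *A = (double *)mkl_malloc(m * k * sizeof(double), 64);
    double *B = (double *)mkl_malloc(k * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc(m * n * sizeof(double), 64);
    for (MKL_INT i = 0; i < m * k; i++) A[i] = 1.0;
    for (MKL_INT i = 0; i < k * n; i++) B[i] = 2.0;
    for (MKL_INT i = 0; i < m * n; i++) C[i] = 0.0;

    // Map the matrices to the device, then dispatch the GEMM call to the GPU.
    #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    {
        #pragma omp target variant dispatch use_device_ptr(A, B, C)
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, 1.0, A, k, B, n, 0.0, C, n);
        }
    }

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}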
Fortran (Linux)
The compile and link commands for a Fortran program that uses OpenMP threads and calls the oneMKL Fortran API with 32-bit integers are as follows:
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c source.f
Link:    ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
If the program calls the oneMKL Fortran API with 64-bit integers, the compile and link commands are as follows:
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -i8 -fpp -free -c source.f
Link:    ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL source.o
Notes About the -qmkl Compiler Option

Use the -qmkl option (equivalent to -qmkl=parallel) to link with a certain Intel® oneAPI MKL threading layer, depending on the threading option provided:

- For -fiopenmp, the OpenMP threading layer for Intel® compilers
- For -tbb, the Intel® Threading Building Blocks (Intel® TBB) threading layer

Use -qmkl=sequential to link with the sequential version of Intel® oneAPI MKL.

Note that -qmkl=parallel/sequential affects threading on the CPU only. Offloaded MKL computations are always parallelized as appropriate and occupy as many of the execution units (EUs) on the GPU as possible.
OpenMP Directives to Offload oneMKL Computations
You can offload oneMKL computations onto the GPU using OpenMP directives. There are two ways to do this.

One way involves using the Intel-specific OpenMP extension, the target variant dispatch directive. You place the call to the oneMKL routine inside a target variant dispatch construct, as shown in the example below. In this example, the matrices A, B, and C used in the multiplication are mapped to the device before the oneMKL routine cblas_dgemm is called. The use_device_ptr(A,B,C) clause on the target variant dispatch directive indicates that A, B, and C point to objects that have corresponding storage on the device. When cblas_dgemm is called, the corresponding device pointers of A, B, and C are passed as arguments, and the device copies are used in the computation.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

#define min(x,y) (((x) < (y)) ? (x) : (y))
#define EPSILON 0.0001

int main()
{
    double *A, *B, *C, *C_fl;
    int64_t m, n, k;
    double alpha, beta;
    double sum;
    int64_t i, j, q;
    int fail;

    printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
            " Intel oneMKL function dgemm, where A, B, and C are matrices and \n"
            " alpha and beta are double precision scalars\n\n");

    m = 2000, k = 200, n = 1000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%li x %li) and matrix B(%li x %li)\n\n", m, k, k, n);
    alpha = 1.0; beta = 0.0;

    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m * k * sizeof( double ), 64 );
    B = (double *)mkl_malloc( k * n * sizeof( double ), 64 );
    C = (double *)mkl_malloc( m * n * sizeof( double ), 64 );
    C_fl = (double *)mkl_malloc( m*n*sizeof( double ), 64 );

    if (A == NULL || B == NULL || C == NULL || C_fl == NULL) {
        printf( "\n ERROR: Cannot allocate memory for matrices. Exiting... \n\n");
        return 1;
    }

    printf (" Intializing matrices \n\n");
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }

    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }

    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
        C_fl[i] = 0.0;
    }

    printf (" Computing matrix product using Intel oneMKL dgemm function via CBLAS interface \n\n");

    #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    {
        #pragma omp target variant dispatch use_device_ptr(A, B, C)
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, A, k, B, n, beta, C, n);
        }
    }

    printf ("\n Top left corner of matrix C: \n");
    for (i=0; i<min(m,6); i++) {
        for (j=0; j<min(n,6); j++) {
            printf ("%12.5G", C[j+i*n]);
        }
        printf ("\n");
    }

    printf (" Computing matrix product using for-loops \n");

    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (q = 0; q < k; q++) {
                sum += A[k*i+q] * B[n*q+j];
            }
            C_fl[n*i+j] = alpha * sum + beta * C_fl[n*i+j];
        }
    }

    printf ("\n Top left corner of matrix C_fl: \n");
    for (i=0; i<min(m,6); i++) {
        for (j=0; j<min(n,6); j++) {
            printf ("%12.5G", C_fl[j+i*n]);
        }
        printf ("\n");
    }

    printf (" Computations completed. Verifying... \n\n");

    fail = 0;
    for (i = 0; i < (m*n); i++) {
        if (fabs(C[i] - C_fl[i]) > EPSILON) {
            fail = 1;
            break;
        }
    }

    if (fail)
        printf ("\n **** FAIL **** \n");
    else
        printf ("\n **** PASS **** \n");

    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);

    return fail;
}
The other way to inform the compiler that oneMKL computations should be offloaded to the GPU is to use the OpenMP 5.1 dispatch directive, as shown in the example below. In this example too, the matrices A, B, and C are mapped to the device before the oneMKL routine cblas_dgemm is called. When cblas_dgemm is called, the corresponding device pointers of A, B, and C are passed as arguments, so the device copies are used in the computation.

No use_device_ptr clause is needed on the dispatch directive. In OpenMP 5.1, the list of device pointers required by a oneMKL routine is given in the oneMKL OpenMP offload header file, mkl_omp_offload.h, where the GPU variant functions are declared. The user should carefully check the list of required device pointers in the oneMKL header file and make sure the corresponding matrices are accessible from the device before the oneMKL routine is called.

Note that, depending on the compiler version you are using, the compiler option -fopenmp-version=51 may need to be added for the dispatch directive to be accepted.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

#define min(x,y) (((x) < (y)) ? (x) : (y))
#define EPSILON 0.0001

int main()
{
    double *A, *B, *C, *C_fl;
    int64_t m, n, k;
    double alpha, beta;
    double sum;
    int64_t i, j, q;
    int fail;

    printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
            " Intel oneMKL function dgemm, where A, B, and C are matrices and \n"
            " alpha and beta are double precision scalars\n\n");

    m = 2000, k = 200, n = 1000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%li x %li) and matrix B(%li x %li)\n\n", m, k, k, n);
    alpha = 1.0; beta = 0.0;

    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m * k * sizeof( double ), 64 );
    B = (double *)mkl_malloc( k * n * sizeof( double ), 64 );
    C = (double *)mkl_malloc( m * n * sizeof( double ), 64 );
    C_fl = (double *)mkl_malloc( m*n*sizeof( double ), 64 );

    if (A == NULL || B == NULL || C == NULL || C_fl == NULL) {
        printf( "\n ERROR: Cannot allocate memory for matrices. Exiting... \n\n");
        return 1;
    }

    printf (" Intializing matrices \n\n");
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }

    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }

    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
        C_fl[i] = 0.0;
    }

    printf (" Computing matrix product using Intel oneMKL dgemm function via CBLAS interface \n\n");

    #pragma omp target data map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    {
        #pragma omp dispatch
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, A, k, B, n, beta, C, n);
    }

    printf ("\n Top left corner of matrix C: \n");
    for (i=0; i<min(m,6); i++) {
        for (j=0; j<min(n,6); j++) {
            printf ("%12.5G", C[j+i*n]);
        }
        printf ("\n");
    }

    printf (" Computing matrix product using for-loops \n");

    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (q = 0; q < k; q++) {
                sum += A[k*i+q] * B[n*q+j];
            }
            C_fl[n*i+j] = alpha * sum + beta * C_fl[n*i+j];
        }
    }

    printf ("\n Top left corner of matrix C_fl: \n");
    for (i=0; i<min(m,6); i++) {
        for (j=0; j<min(n,6); j++) {
            printf ("%12.5G", C_fl[j+i*n]);
        }
        printf ("\n");
    }

    printf (" Computations completed. Verifying... \n\n");

    fail = 0;
    for (i = 0; i < (m*n); i++) {
        if (fabs(C[i] - C_fl[i]) > EPSILON) {
            fail = 1;
            break;
        }
    }

    if (fail)
        printf ("\n **** FAIL **** \n");
    else
        printf ("\n **** PASS **** \n");

    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);

    return fail;
}
Notes:

- oneMKL routines expect that the arrays/matrices are already on the device before the computation is run. So, before calling a oneMKL routine, the user must map the data to the device or allocate the data directly on the device.
- If the oneMKL routine is not called from within a target variant dispatch (or dispatch) region, or if offload is disabled, then the oneMKL computation is executed on the CPU.
- Only one call to a oneMKL routine may be issued from an OpenMP target variant dispatch (or dispatch) construct. If there are two consecutive calls to oneMKL routines, the calls should be placed in separate target variant dispatch (or dispatch) constructs, as in the sketch below.
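As an illustration of the last note, below is a minimal sketch (not taken from the original article; the matrix names and sizes are arbitrary) in which two independent DGEMM calls are issued from separate dispatch constructs inside one target data region:

// Sketch: two independent DGEMMs, each issued from its own dispatch construct.
#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"

int main() {
    const MKL_INT m = 256, k = 128, n = 192;          // sizes chosen arbitrarily
    double *A1 = (double *)mkl_malloc(m * k * sizeof(double), 64);
    double *B1 = (double *)mkl_malloc(k * n * sizeof(double), 64);
    double *C1 = (double *)mkl_malloc(m * n * sizeof(double), 64);
    double *A2 = (double *)mkl_malloc(m * k * sizeof(double), 64);
    double *B2 = (double *)mkl_malloc(k * n * sizeof(double), 64);
    double *C2 = (double *)mkl_malloc(m * n * sizeof(double), 64);
    for (MKL_INT i = 0; i < m * k; i++) { A1[i] = 1.0; A2[i] = 3.0; }
    for (MKL_INT i = 0; i < k * n; i++) { B1[i] = 2.0; B2[i] = 4.0; }
    for (MKL_INT i = 0; i < m * n; i++) { C1[i] = 0.0; C2[i] = 0.0; }

    #pragma omp target data map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \
                            map(tofrom: C1[0:m*n], C2[0:m*n])
    {
        #pragma omp dispatch                 // first call: its own construct
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A1, k, B1, n, 0.0, C1, n);

        #pragma omp dispatch                 // second call: a separate construct
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A2, k, B2, n, 0.0, C2, n);
    }

    mkl_free(A1); mkl_free(B1); mkl_free(C1);
    mkl_free(A2); mkl_free(B2); mkl_free(C2);
    return 0;
}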
Fortran
When calling oneMKL routines from Fortran code, be sure to add the following include statement:
include "mkl_omp_offload.f90"
In addition, if the oneMKL Fortran API is called with 32-bit integers, add the following module use statement:
use onemkl_blas_omp_offload_lp64
If, on the other hand, the oneMKL Fortran API is called with 64-bit integers, add the following module use statement:
use onemkl_blas_omp_offload_ilp64
The following Fortran example shows how DGEMM may be called from a Fortran program, along with the include and use statements mentioned above.
  include "mkl_omp_offload.f90"

  program DGEMM_MAIN

#if defined(MKL_ILP64)
  use onemkl_blas_omp_offload_ilp64
#else
  use onemkl_blas_omp_offload_lp64
#endif
  use omp_lib
  use iso_fortran_env
  implicit none

  integer, parameter :: m = 20
  integer, parameter :: k = 5
  integer, parameter :: n = 10
  double precision a(m,k), b(k,n), c1(m,n), c2(m,n)
  double precision alpha, beta
  integer i, j

  print*
  print*,'   D G E M M  EXAMPLE PROGRAM'

  ! Initialize

  alpha = 1.025
  beta  = 0.75

  do i = 1, m
    do j = 1, k
      a(i,j) = (i-1) - (0.25 * k)
    end do
  end do

  do i = 1, k
    do j = 1, n
      b(i,j) = -((i-1) + j)
    end do
  end do

  do i = 1, m
    do j = 1, n
      c1(i,j) = 0.2 + i - j
      c2(i,j) = 0.2 + i - j
    end do
  end do

  ! Execute DGEMM on host.

  call DGEMM('N','N',m,n,k,alpha,a,m,b,k,beta,c1,m)

  print *
  print *, 'c1 - After DGEMM host execution'

  do i=1,m
    print 110, (c1(i,j),j=1,n)
  end do
  print*

  ! Execute DGEMM on device

!$omp target data map(to: a, b) map(tofrom: c2)

!$omp dispatch
  call DGEMM('N','N',m,n,k,alpha,a,m,b,k,beta,c2,m)

!$omp end target data

  print *
  print *, 'c2 - After DGEMM device execution'

  do i=1,m
    print 110, (c2(i,j),j=1,n)
  end do
  print *

110 format(7x,10(f10.2,2x))

  end
To compile and link the above Fortran example using 32-bit integers:
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c dgemm_dispatch_f.f90
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fsycl -L${MKLROOT}/lib/intel64 -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -lmkl_sycl dgemm_dispatch_f.o
To compile and link the above Fortran example using 64-bit integers:
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -m64 -DMKL_ILP64 -i8 -fpp -free -c dgemm_dispatch_f.f90
ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fsycl -L${MKLROOT}/lib/intel64 -liomp5 -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl -lmkl_sycl dgemm_dispatch_f.o
After the executable (a.out) has been generated from the C/C++ or Fortran program, you can run the executable under onetrace and search the generated trace for the heading "Device Timing Results". Under that heading we should see the oneMKL kernels listed, which confirms that the oneMKL computations were offloaded onto the GPU.
An example run command:
OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 onetrace -h -d ./a.out
Batching oneMKL GEMM Calls
The oneMKL library includes "batch" routines that allow the user to batch several oneMKL calls into a single oneMKL call. At run time, oneMKL will intelligently execute all of the matrix operations so as to optimize overall performance.
For example, the cblas_dgemm routine computes the matrix product of two general matrices a and b, returning the result matrix c. The cblas_dgemm interface is shown below.
void cblas_dgemm (const CBLAS_LAYOUT layout,
                  const CBLAS_TRANSPOSE transa, const CBLAS_TRANSPOSE transb,
                  const MKL_INT m, const MKL_INT n, const MKL_INT k,
                  const double alpha, const double *a, const MKL_INT lda,
                  const double *b, const MKL_INT ldb,
                  const double beta, double *c, const MKL_INT ldc);
The cblas_dgemm_batch routine is similar to the cblas_dgemm routine, except that the cblas_dgemm_batch routine performs matrix multiply operations on groups of matrices, processing a number of groups at once.

The cblas_dgemm_batch interface is shown below. Note that the interface resembles the cblas_dgemm interface. However, it involves passing the matrix arguments as arrays of pointers to matrices, and the other parameters as arrays of parameters.
void cblas_dgemm_batch (const CBLAS_LAYOUT layout,
                        const CBLAS_TRANSPOSE* transa_array, const CBLAS_TRANSPOSE* transb_array,
                        const MKL_INT* m_array, const MKL_INT* n_array, const MKL_INT* k_array,
                        const double* alpha_array,
                        const double **a_array, const MKL_INT* lda_array,
                        const double **b_array, const MKL_INT* ldb_array,
                        const double* beta_array,
                        double **c_array, const MKL_INT* ldc_array,
                        const MKL_INT group_count, const MKL_INT* group_size);
The batch operation is defined as follows:
idx = 0
for i = 0 .. group_count - 1
    alpha and beta in alpha_array[i] and beta_array[i]
    for j = 0 .. group_size[i] - 1
        a, b, and c matrices in a_array[idx], b_array[idx], and c_array[idx], respectively
        c := alpha*op(a)*op(b) + beta*c
        idx = idx + 1
    end for
end for
where:

- op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H,
- alpha and beta are scalar elements of alpha_array and beta_array,
- a, b, and c are matrices, with m, n, and k the corresponding elements of m_array, n_array, and k_array:
  - op(a) is an m-by-k matrix,
  - op(b) is a k-by-n matrix,
  - c is an m-by-n matrix.
- a, b, and c represent the matrices stored at the addresses pointed to by a_array, b_array, and c_array, respectively. The number of entries in a_array, b_array, and c_array (the total_batch_count) is equal to the sum of all of the group_size entries.
Multiplications of different shapes and with different parameters can be batched together by packing them into groups, where each group consists of matrix multiplications with the same shape (the same m, n, and k) and the same parameters, as the sketch below illustrates.
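To make the grouping concrete, here is a small host-side sketch (not part of the original article; the shapes and values are made up) of a batch with two groups of different shapes. The same cblas_dgemm_batch call can be offloaded to the GPU by issuing it from a dispatch construct with the pointer arrays mapped to the device, as the complete examples below show.

#include <stdio.h>
#include "mkl.h"

// Sketch: a batch with two groups of different shapes, run on the host.
// Group 0: two multiplications with m=4, k=3, n=5; group 1: one with m=2, k=2, n=2.
int main() {
    CBLAS_TRANSPOSE transa[2] = {CblasNoTrans, CblasNoTrans};
    CBLAS_TRANSPOSE transb[2] = {CblasNoTrans, CblasNoTrans};
    MKL_INT m[2] = {4, 2}, n[2] = {5, 2}, k[2] = {3, 2};
    MKL_INT lda[2] = {3, 2}, ldb[2] = {5, 2}, ldc[2] = {5, 2};   // row-major layout
    double alpha[2] = {1.0, 1.0}, beta[2] = {0.0, 0.0};
    MKL_INT group_count = 2;
    MKL_INT group_size[2] = {2, 1};                // total_batch_count = 3

    // One set of matrices per multiplication in the batch.
    double A0[4*3], B0[3*5], C0[4*5], A1[4*3], B1[3*5], C1[4*5], A2[2*2], B2[2*2], C2[2*2];
    for (int i = 0; i < 4*3; i++) A0[i] = A1[i] = 1.0;
    for (int i = 0; i < 3*5; i++) B0[i] = B1[i] = 2.0;
    for (int i = 0; i < 4*5; i++) C0[i] = C1[i] = 0.0;
    for (int i = 0; i < 2*2; i++) { A2[i] = 1.0; B2[i] = 2.0; C2[i] = 0.0; }

    const double *a_array[3] = {A0, A1, A2};
    const double *b_array[3] = {B0, B1, B2};
    double *c_array[3] = {C0, C1, C2};

    cblas_dgemm_batch(CblasRowMajor, transa, transb, m, n, k, alpha,
                      a_array, lda, b_array, ldb, beta,
                      c_array, ldc, group_count, group_size);

    printf("C0[0] = %f (expected %f)\n", C0[0], 2.0 * 3);   // each entry of C0 is k * 1.0 * 2.0
    return 0;
}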
A basic assumption of the batch API is that all the operations in the batch (whether in the same group or in different groups) are independent of one another. oneMKL therefore does not guarantee any particular ordering between the operations in a batch and will try to execute several operations in parallel.

In general, make the batches as large as possible. This allows oneMKL to parallelize the operations better and to spread the work across the GPU.
We illustrate how two calls to cblas_dgemm can be replaced with a single call to cblas_dgemm_batch. The following example includes two calls to cblas_dgemm.
1#include <stdio.h> 2#include <stdlib.h> 3#include <math.h> 4#include <omp.h> 5#include "mkl.h" 6#include "mkl_omp_offload.h" 7 8#define min(x,y) (((x) < (y)) ? (x) : (y)) 9#define epsilon 0.0000001f 10 11bool compare(double x, double y) 12{ 13 // returns true if x and y are the same 14 return fabs(x - y) <= epsilon; 15} 16 17int main() 18{ 19 double *A1, *B1, *C1, *C1_fl; 20 double *A2, *B2, *C2, *C2_fl; 21 int m, n, k, i, j, q; 22 double alpha, beta; 23 double sum; 24 int fail; 25 double t_start, t_end; 26 27 m = 2000, k = 200, n = 1000; 28 alpha = 1.0; beta = 0.0; 29 30 printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" 31 " performance \n\n"); 32 A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 33 B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 34 C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 35 C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 36 37 A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 38 B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 39 C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 40 C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 41 42 if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || 43 A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { 44 printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); 45 return 1; 46 } 47 48 printf (" Intializing matrix data \n\n"); 49 for (i = 0; i < (m*k); i++) { 50 A1[i] = A2[i] = (double)(i+1); 51 } 52 53 for (i = 0; i < (k*n); i++) { 54 B1[i] = B2[i] = (double)(-i-1); 55 } 56 57 for (i = 0; i < (m*n); i++) { 58 C1[i] = C2[i] = 0.0; 59 C1_fl[i] = C2_fl[i] = 0.0; 60 } 61 62 printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); 63 64 t_start = omp_get_wtime(); 65 66 #pragma omp target data \ 67 map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ 68 map(tofrom: C1[0:m*n], C2[0:m*n]) 69 { 70 #pragma omp dispatch 71 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 72 m, n, k, alpha, A1, k, B1, n, beta, C1, n); 73 74 #pragma omp dispatch 75 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 76 m, n, k, alpha, A2, k, B2, n, beta, C2, n); 77 } 78 79 t_end = omp_get_wtime(); 80 81 printf ("\n Top left corner of matrix C1: \n"); 82 for (i=0; i<min(m,6); i++) { 83 for (j=0; j<min(n,6); j++) { 84 printf ("%12.5G", C1[j+i*n]); 85 } 86 printf ("\n"); 87 } 88 89 printf ("\n Top left corner of matrix C2: \n"); 90 for (i=0; i<min(m,6); i++) { 91 for (j=0; j<min(n,6); j++) { 92 printf ("%12.5G", C2[j+i*n]); 93 } 94 printf ("\n"); 95 } 96 97 printf (" \nComputing matrix product using for-loops \n"); 98 99 for (i = 0; i < m; i++) { 100 for (j = 0; j < n; j++) { 101 sum = 0.0; 102 for (q = 0; q < k; q++) 103 sum += A1[k*i+q] * B1[n*q+j]; 104 C1_fl[n*i+j] = sum; 105 } 106 } 107 108 for (i = 0; i < m; i++) { 109 for (j = 0; j < n; j++) { 110 sum = 0.0; 111 for (q = 0; q < k; q++) 112 sum += A2[k*i+q] * B2[n*q+j]; 113 C2_fl[n*i+j] = sum; 114 } 115 } 116 117 printf ("\n Top left corner of matrix C1: \n"); 118 for (i=0; i<min(m,6); i++) { 119 for (j=0; j<min(n,6); j++) { 120 printf ("%12.5G", C1_fl[j+i*n]); 121 } 122 printf ("\n"); 123 } 124 125 printf ("\n Top left corner of matrix C2: \n"); 126 for (i=0; i<min(m,6); i++) { 127 for (j=0; j<min(n,6); j++) { 128 printf ("%12.5G", C2_fl[j+i*n]); 129 } 130 printf ("\n"); 131 } 132 133 printf ("\n Computations completed. Verifying... \n\n"); 134 135 fail = 0; 136 for (i = 0; i < (m*n); i++) { 137 if (! compare(C1[i], C1_fl[i]) || ! 
compare(C2[i], C2_fl[i])) { 138 fail = 1; 139 break; 140 } 141 } 142 143 if (fail) { 144 printf (" **** FAIL **** \n"); 145 } 146 else { 147 printf(" time = %lf seconds\n", t_end - t_start); 148 printf (" **** PASS **** \n"); 149 } 150 151 mkl_free(A1); 152 mkl_free(B1); 153 mkl_free(C1); 154 mkl_free(C1_fl); 155 mkl_free(A2); 156 mkl_free(B2); 157 mkl_free(C2); 158 mkl_free(C2_fl); 159 160 return 0; 161}
The two cblas_dgemm calls in the example above can be batched together, resulting in one call to cblas_dgemm_batch, as shown in the following example. Note that the batch consists of one group of size 2, since we have two matrix multiplications with the same set of parameters (layout, transa, transb, m, n, k, alpha, lda, ldb, beta, and ldc). The total_batch_size in this case is 2.
1#include <stdio.h> 2#include <stdlib.h> 3#include <math.h> 4#include <omp.h> 5#include "mkl.h" 6#include "mkl_omp_offload.h" 7 8#define min(x,y) (((x) < (y)) ? (x) : (y)) 9#define epsilon 0.0000001f 10 11bool compare(double x, double y) 12{ 13 // return true is x and y are the same 14 return (fabs(x - y) <= epsilon); 15} 16 17int main() 18{ 19 double *A1, *B1, *C1, *C1_fl; 20 double *A2, *B2, *C2, *C2_fl; 21 int m, n, k, i, j, q; 22 double alpha, beta; 23 double sum; 24 int fail; 25 double t_start, t_end; 26 27 m = 2000, k = 200, n = 1000; 28 alpha = 1.0; beta = 0.0; 29 30 printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" 31 " performance \n\n"); 32 A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 33 B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 34 C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 35 C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 36 37 A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 38 B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 39 C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 40 C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 41 42 if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || 43 A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { 44 printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); 45 return 1; 46 } 47 48 printf (" Intializing matrix data \n\n"); 49 for (i = 0; i < (m*k); i++) { 50 A1[i] = A2[i] = (double)(i+1); 51 } 52 53 for (i = 0; i < (k*n); i++) { 54 B1[i] = B2[i] = (double)(-i-1); 55 } 56 57 for (i = 0; i < (m*n); i++) { 58 C1[i] = C2[i] = 0.0; 59 C1_fl[i] = C2_fl[i] = 0.0; 60 } 61 62 printf (" \nComputing matrix product using Intel MKL cblas_dgemm_batch function \n"); 63 64 #define GRP_COUNT 1 // 1 group 65 66 MKL_INT group_count = GRP_COUNT; 67 MKL_INT group_sizes[GRP_COUNT] = {2}; // 8 matrix multiplications 68 69 CBLAS_TRANSPOSE transa_array[GRP_COUNT] = {CblasNoTrans}; 70 CBLAS_TRANSPOSE transb_array[GRP_COUNT] = {CblasNoTrans}; 71 72 MKL_INT m_array[GRP_COUNT] = {m}; 73 MKL_INT n_array[GRP_COUNT] = {n}; 74 MKL_INT k_array[GRP_COUNT] = {k}; 75 76 MKL_INT lda_array[GRP_COUNT] = {k}; 77 MKL_INT ldb_array[GRP_COUNT] = {n}; 78 MKL_INT ldc_array[GRP_COUNT] = {n}; 79 80 double alpha_array[GRP_COUNT] = {alpha}; 81 double beta_array[GRP_COUNT] = {beta}; 82 83 // Number of matrix multiplications = 2 84 double **a_array, **b_array, **c_array; 85 a_array = (double **)mkl_calloc(2, sizeof( double* ), 64); 86 b_array = (double **)mkl_calloc(2, sizeof( double* ), 64); 87 c_array = (double **)mkl_calloc(2, sizeof( double* ), 64); 88 89 t_start = omp_get_wtime(); 90 91 // Call cblas_dgemm_batch 92 #pragma omp target enter data \ 93 map(to: A1[0:m*k], B1[0:k*n], C1[0:m*n]) \ 94 map(to: A2[0:m*k], B2[0:k*n], C2[0:m*n]) 95 96 #pragma omp target data use_device_ptr(A1, B1, C1, A2, B2, C2) 97 { 98 a_array[0] = A1, a_array[1] = A2; 99 b_array[0] = B1, b_array[1] = B2; 100 c_array[0] = C1, c_array[1] = C2; 101 } 102 103 #pragma omp target data \ 104 map(to:a_array[0:2], b_array[0:2], c_array[0:2]) 105 { 106 #pragma omp dispatch 107 cblas_dgemm_batch ( 108 CblasRowMajor, 109 transa_array, 110 transb_array, 111 m_array, 112 n_array, 113 k_array, 114 alpha_array, 115 (const double **)a_array, 116 lda_array, 117 (const double **)b_array, 118 ldb_array, 119 beta_array, 120 c_array, 121 ldc_array, 122 group_count, 123 group_sizes); 124 } // end target data map 125 126 #pragma omp target exit data \ 127 map(from: C1[0:m*n], C2[0:m*n]) 128 
129 t_end = omp_get_wtime(); 130 131 printf ("\n Top left corner of matrix C1: \n"); 132 for (i=0; i<min(m,6); i++) { 133 for (j=0; j<min(n,6); j++) { 134 printf ("%12.5G", C1[j+i*n]); 135 } 136 printf ("\n"); 137 } 138 139 printf ("\n Top left corner of matrix C2: \n"); 140 for (i=0; i<min(m,6); i++) { 141 for (j=0; j<min(n,6); j++) { 142 printf ("%12.5G", C2[j+i*n]); 143 } 144 printf ("\n"); 145 } 146 147 printf (" \nComputing matrix product using for-loops \n"); 148 149 for (i = 0; i < m; i++) { 150 for (j = 0; j < n; j++) { 151 sum = 0.0; 152 for (q = 0; q < k; q++) 153 sum += A1[k*i+q] * B1[n*q+j]; 154 C1_fl[n*i+j] = sum; 155 } 156 } 157 158 for (i = 0; i < m; i++) { 159 for (j = 0; j < n; j++) { 160 sum = 0.0; 161 for (q = 0; q < k; q++) 162 sum += A2[k*i+q] * B2[n*q+j]; 163 C2_fl[n*i+j] = sum; 164 } 165 } 166 167 printf ("\n Top left corner of matrix C1: \n"); 168 for (i=0; i<min(m,6); i++) { 169 for (j=0; j<min(n,6); j++) { 170 printf ("%12.5G", C1_fl[j+i*n]); 171 } 172 printf ("\n"); 173 } 174 175 printf ("\n Top left corner of matrix C2: \n"); 176 for (i=0; i<min(m,6); i++) { 177 for (j=0; j<min(n,6); j++) { 178 printf ("%12.5G", C2_fl[j+i*n]); 179 } 180 printf ("\n"); 181 } 182 183 printf ("\n Computations completed. Verifying... \n\n"); 184 185 fail = 0; 186 for (i = 0; i < (m*n); i++) { 187 if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { 188 fail = 1; 189 break; 190 } 191 } 192 193 if (fail) { 194 printf (" **** FAIL **** \n"); 195 } 196 else { 197 printf(" time = %lf seconds\n", t_end - t_start); 198 printf (" **** PASS **** \n"); 199 } 200 201 mkl_free(A1); 202 mkl_free(B1); 203 mkl_free(C1); 204 mkl_free(C1_fl); 205 mkl_free(A2); 206 mkl_free(B2); 207 mkl_free(C2); 208 mkl_free(C2_fl); 209 210 return 0; 211}
The performance of the two examples above when running on a particular GPU (1-stack only) is shown below.
dgemm_example_01_c.cpp (two calls to cblas_dgemm):            2.976183 seconds
dgemm_batch_example_01_c.cpp (one call to cblas_dgemm_batch): 1.881641 seconds
A more complex batching example is shown below. In this example, we have a batch composed of 3 groups (GROUP_COUNT=3). The size of each group is a number chosen at random between 1 and 10. Several of the parameters (layout, transA, transB, m, n, and k) are chosen randomly, but within each group all of the multiplications have the same parameters. The total_batch_size is equal to the sum of all the group sizes.
1#include <stdio.h> 2#include <omp.h> 3#include "mkl.h" 4#include "mkl_omp_offload.h" 5#include "common.h" 6 7#define GROUP_COUNT 3 8 9int dnum = 0; 10 11int main() { 12 13 CBLAS_LAYOUT layout = (rand_int(0,1) == 0) ? CblasColMajor : CblasRowMajor; 14 CBLAS_TRANSPOSE *transA, *transB; 15 MKL_INT *m, *n, *k, *lda, *ldb, *ldc; 16 double *alpha, *beta; 17 MKL_INT *group_size, *sizea_array, *sizeb_array, *sizec_array, total_batch_size = 0, sizea, sizeb, sizec; 18 double **a_array, **b_array, **c_array, **c_ref_array; 19 double **a_array_dev, **b_array_dev, **c_array_dev; 20 21 transA = (CBLAS_TRANSPOSE *)mkl_malloc(GROUP_COUNT * sizeof(CBLAS_TRANSPOSE), 64); 22 transB = (CBLAS_TRANSPOSE *)mkl_malloc(GROUP_COUNT * sizeof(CBLAS_TRANSPOSE), 64); 23 24 m = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 25 n = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 26 k = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 27 lda = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 28 ldb = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 29 ldc = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 30 group_size = (MKL_INT *)mkl_malloc(GROUP_COUNT * sizeof(MKL_INT), 64); 31 alpha = (double *)mkl_malloc(GROUP_COUNT * sizeof(double), 64); 32 beta = (double *)mkl_malloc(GROUP_COUNT * sizeof(double), 64); 33 34 if ((m == NULL) || (n == NULL) || (k == NULL) || (lda == NULL) || (ldb == NULL) || (ldc == NULL) || 35 (group_size == NULL) || (alpha == NULL) || (beta == NULL)) { 36 printf("Cannot allocate input arrays\n"); 37 return 1; 38 } 39 40 MKL_INT i, j, p, idx; 41 42 for (i = 0; i < GROUP_COUNT; i++) { 43 transA[i] = (rand_int(0,1) == 0) ? CblasNoTrans : CblasTrans; 44 transB[i] = (rand_int(0,1) == 0) ? CblasNoTrans : CblasTrans; 45 alpha[i] = rand_double_scalar(); 46 beta[i] = rand_double_scalar(); 47 m[i] = rand_int(1, 20); 48 n[i] = rand_int(1, 20); 49 k[i] = rand_int(1, 20); 50 lda[i] = MAX(m[i], k[i]); 51 ldb[i] = MAX(k[i], n[i]); 52 ldc[i] = MAX(m[i], n[i]); 53 group_size[i] = rand_int(1, 10); 54 total_batch_size += group_size[i]; 55#ifdef MKL_ILP64 56 printf("Group %lld: layout = %s, transA = %s, transB = %s, m = %lld, n = %lld, k = %lld, lda = %lld, ldb = %lld, ldc = %lld, alpha = %lf, beta = %lf, group_size = %lld\n", 57 i, (layout == CblasColMajor) ? "Column Major" : "Row Major", 58 (transA[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", 59 (transB[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", 60 m[i], n[i], k[i], lda[i], ldb[i], ldc[i], alpha[i], beta[i], group_size[i]); 61#else 62 printf("Group %d: layout = %s, transA = %s, transB = %s, m = %d, n = %d, k = %d, lda = %d, ldb = %d, ldc = %d, alpha = %lf, beta = %lf, group_size = %d\n", 63 i, (layout == CblasColMajor) ? "Column Major" : "Row Major", 64 (transA[i] == CblasNoTrans) ? "Non Transpose" : "Transpose", 65 (transB[i] == CblasNoTrans) ? 
"Non Transpose" : "Transpose", 66 m[i], n[i], k[i], lda[i], ldb[i], ldc[i], alpha[i], beta[i], group_size[i]); 67#endif 68 } 69 70 sizea_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); 71 sizeb_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); 72 sizec_array = (MKL_INT *)mkl_malloc(sizeof(MKL_INT) * total_batch_size, 64); 73 74 a_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 75 b_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 76 c_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 77 a_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 78 b_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 79 c_array_dev = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 80 c_ref_array = (double **)mkl_malloc(sizeof(double *) * total_batch_size, 64); 81 82 if ((sizea_array == NULL) || (sizeb_array == NULL) || (sizec_array == NULL) || (a_array == NULL) || 83 (b_array == NULL) || (c_array == NULL) || (a_array_dev == NULL) || (b_array_dev == NULL) || 84 (c_array_dev == NULL) || (c_ref_array == NULL)) { 85 printf("Cannot allocate matrices and size arrays\n"); 86 return 1; 87 } 88 89 idx = 0; 90 for (i = 0; i < GROUP_COUNT; i++) { 91 sizea = (((layout == CblasRowMajor) && (transA[i] == CblasTrans)) || 92 ((layout == CblasColMajor) && (transA[i] == CblasNoTrans))) ? lda[i] * k[i] : m[i] * lda[i]; 93 sizeb = (((layout == CblasRowMajor) && (transB[i] == CblasTrans)) || 94 ((layout == CblasColMajor) && (transB[i] == CblasNoTrans))) ? ldb[i] * n[i] : k[i] * ldb[i]; 95 sizec = (layout == CblasColMajor) ? ldc[i] * n[i] : ldc[i] * m[i]; 96 for (j = 0; j < group_size[i]; j++) { 97 a_array[idx] = (double *)mkl_malloc(sizeof(double) * sizea, 64); 98 a_array_dev[idx] = a_array[idx]; 99 sizea_array[idx] = sizea; 100 if (a_array[idx] == NULL) { 101 printf("cannot allocate a matrices\n"); 102 return 1; 103 } 104 b_array[idx] = (double *)mkl_malloc(sizeof(double) * sizeb, 64); 105 b_array_dev[idx] = b_array[idx]; 106 sizeb_array[idx] = sizeb; 107 if (b_array[idx] == NULL) { 108 printf("cannot allocate b matrices\n"); 109 return 1; 110 } 111 c_array[idx] = (double *)mkl_malloc(sizeof(double) * sizec, 64); 112 c_array_dev[idx] = c_array[idx]; 113 sizec_array[idx] = sizec; 114 if (c_array[idx] == NULL) { 115 printf("cannot allocate c matrices\n"); 116 return 1; 117 } 118 c_ref_array[idx] = (double *)mkl_malloc(sizeof(double) * sizec, 64); 119 if (c_ref_array[idx] == NULL) { 120 printf("cannot allocate c_ref matrices\n"); 121 return 1; 122 } 123 init_double_array(sizea, a_array[idx], 1); 124 init_double_array(sizeb, b_array[idx], 1); 125 init_double_array(sizec, c_array[idx], 1); 126 for (p = 0; p < sizec_array[idx]; p++) c_ref_array[idx][p] = c_array[idx][p]; 127 idx++; 128 } 129 } 130 131 // run gemm_batch on host, use standard oneMKL interface 132 cblas_dgemm_batch(layout, transA, transB, m, n, k, alpha, (const double **) a_array, lda, 133 (const double **) b_array, ldb, beta, c_ref_array, ldc, GROUP_COUNT, group_size); 134 135 double *a, *b, *c; 136 for (i = 0; i < total_batch_size; i++) { 137 a = a_array[i]; 138 b = b_array[i]; 139 c = c_array[i]; 140#pragma omp target enter data map(to:a[0:sizea_array[i]],b[0:sizeb_array[i]],c[0:sizec_array[i]]) 141#pragma omp target data use_device_ptr(a,b,c) 142 { 143 a_array_dev[i] = a; 144 b_array_dev[i] = b; 145 c_array_dev[i] = c; 146 } 147 } 148#pragma omp target data 
map(to:a_array_dev[0:total_batch_size], \ 149 b_array_dev[0:total_batch_size], \ 150 c_array_dev[0:total_batch_size]) device(dnum) 151 { 152 153#pragma omp dispatch 154 cblas_dgemm_batch(layout, transA, transB, m, n, k, alpha, (const double **) a_array_dev, lda, (const double **) b_array_dev, ldb, beta, c_array_dev, ldc, GROUP_COUNT, group_size); 155 156 } 157 158 for (i = 0; i < total_batch_size; i++) { 159 a = a_array[i]; 160 b = b_array[i]; 161 c = c_array[i]; 162#pragma omp target exit data map(from:a[0:sizea_array[i]],b[0:sizeb_array[i]],c[0:sizec_array[i]]) 163 } 164 165 double computed, reference, diff; 166 MKL_INT l; 167 idx = 0; 168 for (p = 0; p < GROUP_COUNT; p++) { 169 for (l = 0; l < group_size[p]; l++) { 170 for (i = 0; i < m[p]; i++) { 171 for (j = 0; j < n[p]; j++) { 172 if (layout == CblasColMajor) { 173 computed = c_array[idx][i + j * ldc[p]]; 174 reference = c_ref_array[idx][i + j * ldc[p]]; 175 } 176 else { 177 computed = c_array[idx][j + i * ldc[p]]; 178 reference = c_ref_array[idx][j + i * ldc[p]]; 179 } 180 diff = computed - reference; 181 diff = (diff > 0) ? diff : -diff; 182 if (diff > 0.0001) { 183#ifdef MKL_ILP64 184 printf("Error in matrix %lld (group = %lld, matrix index in group = %lld) at index [%lld][%lld], computed = %lf, reference = %lf, difference = %lf\n", idx, p, l, i, j, computed, reference, diff); 185#else 186 printf("Error in matrix %d at index [%d][%d], computed = %lf, reference = %lf, difference = %lf\n", idx, i, j, computed, reference, diff); 187#endif 188 189 free_double_matrices(a_array, total_batch_size); 190 free_double_matrices(b_array, total_batch_size); 191 free_double_matrices(c_array, total_batch_size); 192 free_double_matrices(c_ref_array, total_batch_size); 193 mkl_free(a_array); 194 mkl_free(b_array); 195 mkl_free(c_array); 196 mkl_free(c_ref_array); 197 mkl_free(a_array_dev); 198 mkl_free(b_array_dev); 199 mkl_free(c_array_dev); 200 mkl_free(sizea_array); 201 mkl_free(sizeb_array); 202 mkl_free(sizec_array); 203 mkl_free(transA); mkl_free(transB); 204 mkl_free(m); mkl_free(n); mkl_free(k); 205 mkl_free(lda); mkl_free(ldb); mkl_free(ldc); mkl_free(group_size); 206 mkl_free(alpha); mkl_free(beta); 207 return 1; 208 } 209 } 210 } 211 idx++; 212 } 213 } 214 215 printf("Validation PASSED\n"); 216 free_double_matrices(a_array, total_batch_size); 217 free_double_matrices(b_array, total_batch_size); 218 free_double_matrices(c_array, total_batch_size); 219 free_double_matrices(c_ref_array, total_batch_size); 220 mkl_free(a_array); 221 mkl_free(b_array); 222 mkl_free(c_array); 223 mkl_free(c_ref_array); 224 mkl_free(a_array_dev); 225 mkl_free(b_array_dev); 226 mkl_free(c_array_dev); 227 mkl_free(sizea_array); 228 mkl_free(sizeb_array); 229 mkl_free(sizec_array); 230 mkl_free(transA); mkl_free(transB); 231 mkl_free(m); mkl_free(n); mkl_free(k); 232 mkl_free(lda); mkl_free(ldb); mkl_free(ldc); mkl_free(group_size); 233 mkl_free(alpha); mkl_free(beta); 234 return 0; 235}
Speeding Up Independent, Consecutive GEMM Calls
There are various ways to speed up the execution of independent, consecutive GEMM calls. One way is to batch the GEMM calls by calling the batch version of GEMM, as shown above.
Another way is to enclose the GEMM calls in an OpenMP parallel construct, so that each OpenMP thread executing the parallel region dispatches one of the GEMM calls. The following example illustrates this parallel approach.
1#include <stdio.h> 2#include <stdlib.h> 3#include <math.h> 4#include <omp.h> 5#include "mkl.h" 6#include "mkl_omp_offload.h" 7 8#define min(x,y) (((x) < (y)) ? (x) : (y)) 9#define epsilon 0.0000001f 10 11bool compare(double x, double y) 12{ 13 // returns true if x and y are the same 14 return fabs(x - y) <= epsilon; 15} 16 17int main() 18{ 19 double *A1, *B1, *C1, *C1_fl; 20 double *A2, *B2, *C2, *C2_fl; 21 int m, n, k, i, j, q; 22 double alpha, beta; 23 double sum; 24 int fail; 25 double t_start, t_end; 26 27 m = 2000, k = 200, n = 1000; 28 alpha = 1.0; beta = 0.0; 29 30 printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" 31 " performance \n\n"); 32 A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 33 B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 34 C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 35 C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 36 37 A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 38 B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 39 C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 40 C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 41 42 if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || 43 A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { 44 printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); 45 return 1; 46 } 47 48 printf (" Intializing matrix data \n\n"); 49 for (i = 0; i < (m*k); i++) { 50 A1[i] = A2[i] = (double)(i+1); 51 } 52 53 for (i = 0; i < (k*n); i++) { 54 B1[i] = B2[i] = (double)(-i-1); 55 } 56 57 for (i = 0; i < (m*n); i++) { 58 C1[i] = C2[i] = 0.0; 59 C1_fl[i] = C2_fl[i] = 0.0; 60 } 61 62 printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); 63 64 t_start = omp_get_wtime(); 65 66 #pragma omp target data \ 67 map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ 68 map(tofrom: C1[0:m*n], C2[0:m*n]) 69 { 70 #pragma omp parallel num_threads(2) 71 { 72 int id = omp_get_thread_num(); 73 74 if (id == 0) { 75 #pragma omp dispatch 76 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 77 m, n, k, alpha, A1, k, B1, n, beta, C1, n); 78 } 79 else if (id == 1) { 80 #pragma omp dispatch 81 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 82 m, n, k, alpha, A2, k, B2, n, beta, C2, n); 83 } 84 } 85 } 86 87 t_end = omp_get_wtime(); 88 89 printf ("\n Top left corner of matrix C1: \n"); 90 for (i=0; i<min(m,6); i++) { 91 for (j=0; j<min(n,6); j++) { 92 printf ("%12.5G", C1[j+i*n]); 93 } 94 printf ("\n"); 95 } 96 97 printf ("\n Top left corner of matrix C2: \n"); 98 for (i=0; i<min(m,6); i++) { 99 for (j=0; j<min(n,6); j++) { 100 printf ("%12.5G", C2[j+i*n]); 101 } 102 printf ("\n"); 103 } 104 105 printf (" \nComputing matrix product using for-loops \n"); 106 107 for (i = 0; i < m; i++) { 108 for (j = 0; j < n; j++) { 109 sum = 0.0; 110 for (q = 0; q < k; q++) 111 sum += A1[k*i+q] * B1[n*q+j]; 112 C1_fl[n*i+j] = sum; 113 } 114 } 115 116 for (i = 0; i < m; i++) { 117 for (j = 0; j < n; j++) { 118 sum = 0.0; 119 for (q = 0; q < k; q++) 120 sum += A2[k*i+q] * B2[n*q+j]; 121 C2_fl[n*i+j] = sum; 122 } 123 } 124 125 printf ("\n Top left corner of matrix C1: \n"); 126 for (i=0; i<min(m,6); i++) { 127 for (j=0; j<min(n,6); j++) { 128 printf ("%12.5G", C1_fl[j+i*n]); 129 } 130 printf ("\n"); 131 } 132 133 printf ("\n Top left corner of matrix C2: \n"); 134 for (i=0; i<min(m,6); i++) { 135 for (j=0; j<min(n,6); j++) { 136 printf ("%12.5G", C2_fl[j+i*n]); 137 } 138 printf ("\n"); 139 } 140 141 printf ("\n 
Computations completed. Verifying... \n\n"); 142 143 fail = 0; 144 for (i = 0; i < (m*n); i++) { 145 if (! compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { 146 fail = 1; 147 break; 148 } 149 } 150 151 if (fail) { 152 printf (" **** FAIL **** \n"); 153 } 154 else { 155 printf(" time = %lf seconds\n", t_end - t_start); 156 printf (" **** PASS **** \n"); 157 } 158 159 mkl_free(A1); 160 mkl_free(B1); 161 mkl_free(C1); 162 mkl_free(C1_fl); 163 mkl_free(A2); 164 mkl_free(B2); 165 mkl_free(C2); 166 mkl_free(C2_fl); 167 168 return 0; 169}
Yet another way to speed up independent, consecutive GEMM calls is to use the nowait clause on the dispatch construct, so that the host thread does not have to wait for a dispatched GEMM call to complete before dispatching the next one. After the last GEMM call, we insert an OpenMP taskwait directive to guarantee that all the dispatched MKL calls have completed before the host thread proceeds. The following example illustrates this nowait approach.
1#include <stdio.h> 2#include <stdlib.h> 3#include <math.h> 4#include <omp.h> 5#include "mkl.h" 6#include "mkl_omp_offload.h" 7 8#define min(x,y) (((x) < (y)) ? (x) : (y)) 9#define epsilon 0.0000001f 10 11bool compare(double x, double y) 12{ 13 // returns true if x and y are the same 14 return fabs(x - y) <= epsilon; 15} 16 17int main() 18{ 19 double *A1, *B1, *C1, *C1_fl; 20 double *A2, *B2, *C2, *C2_fl; 21 int m, n, k, i, j, q; 22 double alpha, beta; 23 double sum; 24 int fail; 25 double t_start, t_end; 26 27 m = 2000, k = 200, n = 1000; 28 alpha = 1.0; beta = 0.0; 29 30 printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" 31 " performance \n\n"); 32 A1 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 33 B1 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 34 C1 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 35 C1_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 36 37 A2 = (double *)mkl_malloc (m*k*sizeof( double ), 64 ); 38 B2 = (double *)mkl_malloc (k*n*sizeof( double ), 64 ); 39 C2 = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 40 C2_fl = (double *)mkl_malloc (m*n*sizeof( double ), 64 ); 41 42 if (A1 == NULL || B1 == NULL || C1 == NULL || C1_fl == NULL || 43 A2 == NULL || B2 == NULL || C2 == NULL || C2_fl == NULL) { 44 printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); 45 return 1; 46 } 47 48 printf (" Intializing matrix data \n\n"); 49 for (i = 0; i < (m*k); i++) { 50 A1[i] = A2[i] = (double)(i+1); 51 } 52 53 for (i = 0; i < (k*n); i++) { 54 B1[i] = B2[i] = (double)(-i-1); 55 } 56 57 for (i = 0; i < (m*n); i++) { 58 C1[i] = C2[i] = 0.0; 59 C1_fl[i] = C2_fl[i] = 0.0; 60 } 61 62 printf (" \nComputing matrix product using Intel MKL cblas_dgemm function \n"); 63 64 t_start = omp_get_wtime(); 65 66 #pragma omp target data \ 67 map(to: A1[0:m*k], B1[0:k*n], A2[0:m*k], B2[0:k*n]) \ 68 map(tofrom: C1[0:m*n], C2[0:m*n]) 69 { 70 #pragma omp dispatch nowait 71 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 72 m, n, k, alpha, A1, k, B1, n, beta, C1, n); 73 74 #pragma omp dispatch nowait 75 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 76 m, n, k, alpha, A2, k, B2, n, beta, C2, n); 77 78 #pragma omp taskwait 79 } 80 81 t_end = omp_get_wtime(); 82 83 printf ("\n Top left corner of matrix C1: \n"); 84 for (i=0; i<min(m,6); i++) { 85 for (j=0; j<min(n,6); j++) { 86 printf ("%12.5G", C1[j+i*n]); 87 } 88 printf ("\n"); 89 } 90 91 printf ("\n Top left corner of matrix C2: \n"); 92 for (i=0; i<min(m,6); i++) { 93 for (j=0; j<min(n,6); j++) { 94 printf ("%12.5G", C2[j+i*n]); 95 } 96 printf ("\n"); 97 } 98 99 printf (" \nComputing matrix product using for-loops \n"); 100 101 for (i = 0; i < m; i++) { 102 for (j = 0; j < n; j++) { 103 sum = 0.0; 104 for (q = 0; q < k; q++) 105 sum += A1[k*i+q] * B1[n*q+j]; 106 C1_fl[n*i+j] = sum; 107 } 108 } 109 110 for (i = 0; i < m; i++) { 111 for (j = 0; j < n; j++) { 112 sum = 0.0; 113 for (q = 0; q < k; q++) 114 sum += A2[k*i+q] * B2[n*q+j]; 115 C2_fl[n*i+j] = sum; 116 } 117 } 118 119 printf ("\n Top left corner of matrix C1: \n"); 120 for (i=0; i<min(m,6); i++) { 121 for (j=0; j<min(n,6); j++) { 122 printf ("%12.5G", C1_fl[j+i*n]); 123 } 124 printf ("\n"); 125 } 126 127 printf ("\n Top left corner of matrix C2: \n"); 128 for (i=0; i<min(m,6); i++) { 129 for (j=0; j<min(n,6); j++) { 130 printf ("%12.5G", C2_fl[j+i*n]); 131 } 132 printf ("\n"); 133 } 134 135 printf ("\n Computations completed. Verifying... \n\n"); 136 137 fail = 0; 138 for (i = 0; i < (m*n); i++) { 139 if (! 
compare(C1[i], C1_fl[i]) || ! compare(C2[i], C2_fl[i])) { 140 fail = 1; 141 break; 142 } 143 } 144 145 if (fail) { 146 printf (" **** FAIL **** \n"); 147 } 148 else { 149 printf(" time = %lf seconds\n", t_end - t_start); 150 printf (" **** PASS **** \n"); 151 } 152 153 mkl_free(A1); 154 mkl_free(B1); 155 mkl_free(C1); 156 mkl_free(C1_fl); 157 mkl_free(A2); 158 mkl_free(B2); 159 mkl_free(C2); 160 mkl_free(C2_fl); 161 162 return 0; 163}
Padding of Matrices Used in GEMM Computations
GEMM calls may be sped up by padding the leading dimensions, lda, ldb, and ldc, of the matrices a, b, and c.
The leading dimension of a matrix depends on the layout of the matrix in memory:

- Row-major layout (C/C++): The leading dimension is the number of columns of the matrix.
- Column-major layout (Fortran): The leading dimension is the number of rows of the matrix.
There are two rules to keep in mind when choosing the sizes of the matrices passed to GEMM calls.
Rule 1: For best performance, the leading dimensions of the matrices passed to a GEMM call should be a multiple of 64 bytes (a full cache line). For single-precision data this means the leading dimensions should be a multiple of (64 / sizeof(float)) = 16; for double-precision data the leading dimensions should be a multiple of (64 / sizeof(double)) = 8.
Ideally, all of the matrices (a, b, and c) should be padded. However, padding matrix c is less important than padding a and b.
Rule 2: For best performance, the leading dimensions should not be a multiple of a large power of two (for example, 4096 bytes). In some cases, increasing a leading dimension slightly (for example, from 4096 bytes to 4096+64 bytes) can improve performance.
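As a rough illustration of the two rules, a helper that pads a double-precision leading dimension could look like the sketch below (the function name and constants are illustrative only; the getld() helper in the C/C++ padding example later in this section plays a similar role):

#include <stdio.h>

// Sketch: pad a leading dimension (in doubles) following the two rules above.
// Rule 1: round up to a multiple of 8 doubles (64 bytes).
// Rule 2: steer away from multiples of a large power of two (4096 bytes = 512 doubles).
static int pad_ld_double(int ld) {
    ld = ((ld + 7) / 8) * 8;       // Rule 1: next multiple of 8 doubles
    if (ld % 512 == 0)             // Rule 2: avoid 4096-byte multiples
        ld += 8;                   // bump by one cache line (64 bytes)
    return ld;
}

int main() {
    printf("%d -> %d\n", 12001, pad_ld_double(12001));   // 12001 -> 12008
    printf("%d -> %d\n", 4096,  pad_ld_double(4096));    // 4096  -> 4104
    return 0;
}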
Padding Example (Fortran)
The following Fortran example illustrates how the matrices passed to a DGEMM call may be padded to improve performance.
1 include "mkl_omp_offload.f90" 2 3 ! This subroutine reads command line arguments m1, k1, and n1. 4 subroutine get_arguments (m1, k1, n1) 5 implicit none 6 integer :: m1, k1, n1 7 character(len=32) :: m1_char, k1_char, n1_char 8 9 ! First, make sure that the right number of command line arguments 10 ! have been provided. 11 if (command_argument_count() .ne. 3) then 12 print *, "ERROR: Three command-line arguments expected; stopping." 13 stop 14 endif 15 16 ! Get command line arguments. 17 call get_command_argument(1, m1_char) 18 call get_command_argument(2, k1_char) 19 call get_command_argument(3, n1_char) 20 21 ! Convert arguments to integers. 22 read (m1_char,*) m1 23 read (k1_char,*) k1 24 read (n1_char,*) n1 25 end subroutine get_arguments 26 27 28 ! This function returns the smallest multiple of 8 that is >= n. 29 ! Examples: 30 ! if n = 3, then get_mul8 = 8 31 ! if n = 9, then get_mul8 = 16 32 ! if n = 30, then get_mul8 = 32 33 ! if n = 80, then get_mul8 = 8 34 integer function get_mul8 (n) 35 implicit none 36 integer :: n 37 integer :: mod 38 if (mod(n,8) .eq. 0) then 39 get_mul8 = n 40 else 41 get_mul8 = ((n/8) + 1) * 8 42 endif 43 end function get_mul8 44 45 46 ! This subroutine initializes matrices. 47 subroutine init_matrix (m, k, n, a, b, c) 48 implicit none 49 integer :: m, k, n 50 double precision :: a(m,k), b(k,n), c(m,n) 51 integer :: i, j 52 53 do i = 1, m 54 do j = 1, k 55 a(i,j) = (i-1) - (0.25 * k) 56 end do 57 end do 58 59 do i = 1, k 60 do j = 1, n 61 b(i,j) = -((i-1) + j) 62 end do 63 end do 64 65 do i = 1, m 66 do j = 1, n 67 c(i,j) = 0.2 + i - j 68 end do 69 end do 70 end subroutine init_matrix 71 72 73 program DGEMM_MAIN 74 75#if defined(MKL_ILP64) 76 use onemkl_blas_omp_offload_ilp64 77#else 78 use onemkl_blas_omp_offload_lp64 79#endif 80 use omp_lib 81 use iso_fortran_env 82 implicit none 83 84 interface 85 integer function get_mul8 (n) 86 implicit none 87 integer :: n 88 end function get_mul8 89 end interface 90 91 double precision :: alpha, beta 92 integer :: m1, k1, n1, m2, k2, n2 93 double precision, allocatable :: a1(:,:) 94 double precision, allocatable :: b1(:,:) 95 double precision, allocatable :: c1(:,:) 96 97 double precision, allocatable :: a2(:,:) 98 double precision, allocatable :: b2(:,:) 99 double precision, allocatable :: c2(:,:) 100 101 double precision :: start_t1, end_t1 102 double precision :: start_t2, end_t2 103 104 ! Read command line arguments m1, k1, and n1. 105 106 call get_arguments (m1, k1, n1) 107 108 ! 109 ! Initialize alpha, beta, and m2, k2, n2 110 ! 111 112 alpha = 1.025 113 beta = 0.75 114 115 m2 = get_mul8(m1) 116 k2 = get_mul8(k1) 117 n2 = get_mul8(n1) 118 119 ! 120 ! Allocate and initialize matrices. 121 ! 122 allocate( a1(1:m1,1:k1) ) 123 allocate( b1(1:k1,1:n1) ) 124 allocate( c1(1:m1,1:n1) ) 125 allocate( a2(1:m2,1:k2) ) 126 allocate( b2(1:k2,1:n2) ) 127 allocate( c2(1:m2,1:n1) ) 128 call init_matrix (m1, k1, n1, a1, b1, c1) 129 call init_matrix (m2, k2, n2, a2, b2, c2) 130 131 132 !$omp target data map(to: a1, b1, a2, b2) map(tofrom: c1, c2) 133 134 ! Warm up run on device 135 !$omp dispatch 136 call DGEMM('N','N',m1,n1,k1,alpha,a1,m1,b1,k1,beta,c1,m1) 137 138 ! 139 ! Run DGEMM on device (using matrices a1, b1, and c1) 140 ! 141 start_t1 = omp_get_wtime() 142 143 !$omp dispatch 144 call DGEMM('N','N',m1,n1,k1,alpha,a1,m1,b1,k1,beta,c1,m1) 145 146 end_t1 = omp_get_wtime() 147 148 149 ! Warm up run on device 150 !$omp dispatch 151 call DGEMM('N','N',m2,n2,k2,alpha,a2,m2,b2,k2,beta,c2,m2) 152 153 ! 154 ! 
Run DGEMM on device (using padded matrices a2, b2, and c2) 155 ! 156 start_t2 = omp_get_wtime() 157 158 !$omp dispatch 159 call DGEMM('N','N',m2,n2,k2,alpha,a2,m2,b2,k2,beta,c2,m2) 160 161 end_t2 = omp_get_wtime() 162 163 !$omp end target data 164 165 print 100, alpha, beta 166 print * 167 print 101, m1, n1, k1 168 print 111, (end_t1 - start_t1) 169 print * 170 print 102, m2, n2, k2 171 print 112, (end_t2 - start_t2) 172 173 100 format(7x, "ALPHA =", f10.4, " BETA =",f10.4) 174 101 format(7x, "M1 =", i5," N1 =", i5, " K1 =",i5) 175 111 format(7x, "Time (non-padded arrays) =", f10.4, " sec") 176 102 format(7x, "M2 =", i5," N2 =", i5, " K2 =",i5) 177 112 format(7x, "Time (padded arrays) =", f10.4, " sec") 178 179 end
In the example above, the array bounds (m1, k1, and n1) are entered as command-line arguments. The matrices a1(1:m1, 1:k1), b1(1:k1, 1:n1), and c1(1:m1, 1:n1) are allocated and initialized.
In addition, the padded matrices a2(1:m2, 1:k2), b2(1:k2, 1:n2), and c2(1:m2, 1:n2) are allocated and initialized. m2 is the smallest integer that is greater than or equal to m1 and a multiple of 8. Likewise, k2 is the smallest integer that is greater than or equal to k1 and a multiple of 8, and n2 is the smallest integer that is greater than or equal to n1 and a multiple of 8.
The program compares the time taken by the DGEMM computation on a1, b1, and c1 with the time taken by the DGEMM computation on a2, b2, and c2.
The compile, link, and run commands used are shown below.
Compile: ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -fpp -free -c dgemm_pad_f_01.f
Link:    ifx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_f_01.o
Run:     ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./a.out 12001 12001 12001
The output on a particular GPU (1-stack only) is shown below.
ALPHA =    1.0250  BETA =    0.7500

M1 =12001  N1 =12001  K1 =12001
Time (non-padded arrays) =    0.2388 sec

M2 =12008  N2 =12008  K2 =12008
Time (padded arrays) =    0.1648 sec
The output above shows that, by padding the arrays a, b, and c, the time taken by the DGEMM call dropped from 0.2388 seconds to 0.1648 seconds.
Padding Example (C/C++)
The following C/C++ example illustrates how the matrices passed to a DGEMM call may be padded to improve performance.
1#include "mkl.h" 2#include "mkl_omp_offload.h" 3#include <float.h> 4#include <math.h> 5#include <omp.h> 6#include <stdio.h> 7#include <stdlib.h> 8#include <sys/time.h> 9 10#if defined(PRECISION) 11#if PRECISION == 1 12#define FLOAT double 13#else 14#define FLOAT float 15#endif 16#else 17#define PRECISION 1 18#define FLOAT double 19#endif 20 21#define index(i, j, ld) (((j) * (ld)) + (i)) 22 23#define RAND() ((FLOAT)rand() / (FLOAT)RAND_MAX * 2.0 - 1.0) 24 25#define MALLOC(x) mkl_malloc((x), 64); 26#define FREE mkl_free 27 28#define MALLOC_CHECK(p) \ 29 if (p == NULL) { \ 30 fprintf(stderr, "%s:%d: memory allocation error\n", __FILE__, __LINE__); \ 31 return EXIT_FAILURE; \ 32 } 33 34#ifndef max 35#define max(a, b) (((a) > (b)) ? (a) : (b)) 36#endif 37#ifndef min 38#define min(a, b) (((a) < (b)) ? (a) : (b)) 39#endif 40 41void printMat(FLOAT *P, int m, int n, int ld) { 42 int i, j; 43 for (i = 0; i < m; i++) { 44 printf("\n"); 45 for (j = 0; j < n; j++) 46 printf("%f ", P[index(i, j, ld)]); 47 } 48 printf("\n"); 49} 50 51void gemm_naive(int m, int n, int k, FLOAT alpha, const FLOAT *A, int lda, 52 const FLOAT *B, int ldb, FLOAT beta, FLOAT *C, int ldc) { 53 int i, j, kk; 54 FLOAT temp; 55 for (j = 0; j < n; j++) { 56 for (i = 0; i < m; i++) { 57 temp = 0.0; 58 for (kk = 0; kk < k; kk++) { 59 temp += alpha * A[index(i, kk, lda)] * B[index(kk, j, ldb)]; 60 } 61 C[index(i, j, ldc)] = temp + beta * C[index(i, j, ldc)]; 62 } 63 } 64} 65 66FLOAT infinity_norm_error(int m, int n, int ld, const FLOAT *ans, 67 const FLOAT *ref) { 68 int i, j, ind; 69 FLOAT norm = 0.0, temp; 70 for (i = 0; i < m; i++) { 71 temp = 0.0; 72 for (j = 0; j < n; j++) { 73 ind = index(i, j, ld); 74 temp += fabs(ref[ind] - ans[ind]); 75 } 76 norm = max(norm, temp); 77 } 78 return norm; 79} 80 81double mysecond() { 82 struct timeval tp; 83 struct timezone tzp; 84 int i; 85 i = gettimeofday(&tp, &tzp); 86 if (i != 0) { 87 fprintf(stderr, "%s:%d: timing error %d\n", __FILE__, __LINE__, i); 88 return EXIT_FAILURE; 89 } 90 return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6); 91} 92 93#if defined(USE_MKL) 94int dnum = 0; 95#endif 96 97#if PRECISION == 1 98#define LD_ALIGN 256 99#define LD_BIAS 8 100#else 101#define LD_ALIGN 512 102#define LD_BIAS 16 103#endif 104 105#define HPL_PTR(ptr_, al_) ((((size_t)(ptr_) + (al_)-1) / (al_)) * (al_)) 106 107#if defined(PAD_LD) 108static inline int getld(int x) { 109 int ld; 110 ld = HPL_PTR(x, LD_ALIGN); // Rule 1 111 if (ld - LD_BIAS >= x) 112 ld -= LD_BIAS; 113 else 114 ld += LD_BIAS; // Rule 2 115 return ld; 116} 117#else 118static inline int getld(int x) { return x; } 119#endif 120 121int main(int argc, char **argv) { 122 int i, j; 123 124 if ((argc < 4) || (argc > 4 && argc < 8)) { 125 printf("Performs a DGEMM test C = alpha*A*B + beta*C\n"); 126 printf("A matrix is MxK and B matrix is KxN\n"); 127 printf("All matrices are stored in column-major format\n"); 128 printf("Run as ./dgemm_cublas <M> <K> <N> [<alpha> <beta> <iterations> " 129 "<verify>]\n"); 130 printf("Required inputs are:\n"); 131 printf(" M: number of rows of matrix A\n"); 132 printf(" K: number of cols of matrix A\n"); 133 printf(" N: number of cols of matrix B\n"); 134 printf("Optional inputs are (all must be provided if providing any):\n"); 135 printf(" alpha: scalar multiplier (default: 1.0)\n"); 136 printf(" beta: scalar multiplier (default: 0.0)\n"); 137 printf(" iterations: number of blocking DGEMM calls to perform " 138 "(default: 10)\n"); 139 printf(" verify: set to 1 to check solution against CPU reference, " 
140 "not recommended for large M|K|N (default: 0)\n"); 141 return EXIT_FAILURE; 142 } 143 144 FLOAT alpha, beta; 145 int niter, verify; 146 int HA = atoi(argv[1]); 147 int WA = atoi(argv[2]); 148 int WB = atoi(argv[3]); 149 150 if ((HA == 0) || (WA == 0) || (WB == 0)) 151 exit(1); 152 153 if (argc > 4) { 154 155#if PRECISION == 1 156 sscanf(argv[4], "%lf", &alpha); 157 sscanf(argv[5], "%lf", &beta); 158#else 159 sscanf(argv[4], "%f", &alpha); 160 sscanf(argv[5], "%f", &beta); 161#endif 162 niter = atoi(argv[6]); 163 verify = atoi(argv[7]); 164 } else { 165 alpha = 1.0; 166 beta = 0.0; 167 niter = 10; 168 verify = 0; 169 } 170 171#if PRECISION == 1 172 printf("DGEMM performance test\n"); 173#else 174 printf("SGEMM performance test\n"); 175#endif 176 177 int HB = WA; 178 int WC = WB; 179 int HC = HA; 180 181 int ldA = getld(HA); 182 int ldB = getld(HB); 183 int ldC = getld(HC); 184 185 printf("M = %d, K = %d, N = %d, ldA = %d, ldB = %d, ldC = %d, alpha = %f, " 186 "beta = %f, iterations = %d, verify? = %d\n", 187 HA, WA, WB, ldA, ldB, ldC, alpha, beta, niter, verify); 188 189 double start_t, end_t, tot_t = 0.0, best_t = DBL_MAX; 190 191 /*ALLOCATE HOST ARRAYS*/ 192 FLOAT *A = (FLOAT *)MALLOC(ldA * WA * sizeof(FLOAT)); 193 MALLOC_CHECK(A); 194 FLOAT *B = (FLOAT *)MALLOC(ldB * WB * sizeof(FLOAT)); 195 MALLOC_CHECK(B); 196 FLOAT *C = (FLOAT *)MALLOC(ldC * WC * sizeof(FLOAT)); 197 MALLOC_CHECK(C); 198 FLOAT *Cref = NULL; 199 if (verify) { 200 Cref = (FLOAT *)MALLOC(ldC * WC * sizeof(FLOAT)); 201 MALLOC_CHECK(Cref); 202 } 203 204 printf("\n---------------------\n"); 205 printf("Array A: %d x %d\n", ldA, WA); 206 printf("Array B: %d x %d\n", ldB, WB); 207 printf("Array C: %d x %d\n", ldC, WC); 208 printf("---------------------\n"); 209 210 /*INITIALIZE WITH PSEUDO-RANDOM DATA*/ 211 srand(2864); 212 for (j = 0; j < WA; j++) 213 for (i = 0; i < HA; i++) 214 A[index(i, j, ldA)] = RAND(); 215 216 for (j = 0; j < WB; j++) 217 for (i = 0; i < HB; i++) 218 B[index(i, j, ldB)] = RAND(); 219 220 if (beta != 0.0) { 221 for (j = 0; j < WC; j++) 222 for (i = 0; i < HC; i++) 223 C[index(i, j, ldC)] = RAND(); 224 } else { 225 for (j = 0; j < WC; j++) 226 for (i = 0; i < HC; i++) 227 C[index(i, j, ldC)] = 0.0; 228 } 229 230 if (verify) { 231 for (j = 0; j < WC; j++) 232 for (i = 0; i < HC; i++) 233 Cref[index(i, j, ldC)] = C[index(i, j, ldC)]; 234 } 235 236#if defined(USE_MKL) 237 size_t sizea = (size_t)ldA * WA; 238 size_t sizeb = (size_t)ldB * WB; 239 size_t sizec = (size_t)ldC * WC; 240 241#pragma omp target data map(to: A [0:sizea], B [0:sizeb]) map(tofrom: C [0:sizec]) 242 { 243 244 // warm-up run 245 #pragma omp dispatch 246#if PRECISION == 1 247 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, 248 ldA, B, ldB, beta, C, ldC); 249#else 250 cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, A, 251 ldA, B, ldB, beta, C, ldC); 252#endif 253 254 // run gemm on gpu, using dispatch construct 255 for (i = 0; i < niter; i++) { 256 start_t = mysecond(); 257 258 #pragma omp dispatch 259#if PRECISION == 1 260 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, 261 A, ldA, B, ldB, beta, C, ldC); 262#else 263 cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, HA, WB, WA, alpha, 264 A, ldA, B, ldB, beta, C, ldC); 265#endif 266 267 end_t = mysecond(); 268 tot_t += end_t - start_t; 269 best_t = min(best_t, end_t - start_t); 270 } // end for 271 } 272#endif // end #pragma omp target data 273 274 double tflop_count; 275 tflop_count = (double)2.0 * 
HA * WB * WA; 276 if (beta != 0.0) 277 tflop_count += (double)HA * WB; 278 tflop_count *= 1.E-12; 279 280 printf("Total runtime for %d iterations: %f seconds.\n", niter, tot_t); 281 printf("Mean TFLOP/s: %f\n", (double)niter * tflop_count / tot_t); 282 printf("Best TFLOP/s: %f\n", (double)tflop_count / best_t); 283 284 if (verify) { 285 // compute reference solution on host (1 added to niter to account for the 286 // warm-up run) 287 for (i = 0; i < (niter + 1); i++) { 288 gemm_naive(HA, WB, WA, alpha, A, ldA, B, ldB, beta, Cref, ldC); 289 } 290 291 printf("Error Matrix Infinity Norm = %f\n", 292 infinity_norm_error(HA, WB, ldC, C, Cref)); 293 } 294 295 FREE(A); 296 FREE(B); 297 FREE(C); 298 if (verify) 299 FREE(Cref); 300 301 return EXIT_SUCCESS; 302}
The program reads command-line arguments specifying the dimensions of the matrices a, b, and c, as well as some other (optional) parameters. If the PAD_LD macro is defined, the leading dimensions of the matrices are padded according to Rule 1 and Rule 2 above. The program then calls cblas_dgemm or cblas_sgemm niter times and prints the total run time.
Compiling and Running (PAD_LD Macro Not Defined)
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DUSE_MKL -DPRECISION=1 -c dgemm_pad_c_01.cpp
Link:    icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_c_01.o -o dgemm_nopad.exe
Run:     ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./dgemm_nopad.exe 4001 4001 4001 1.0 0.0 100 0
The performance without padding on a particular GPU (1-stack only) was as follows:
DGEMM performance test
M = 15000, K = 15000, N = 15000, ldA = 15000, ldB = 15000, ldC = 15000, alpha = 1.000000, beta = 0.000000, iterations = 100, verify? = 0

---------------------
Array A: 15000 x 15000
Array B: 15000 x 15000
Array C: 15000 x 15000
---------------------
Total runtime for 100 iterations: 138.368065 seconds.
Mean TFLOP/s: 4.878293
Best TFLOP/s: 5.657242
Compiling and Running (PAD_LD Macro Defined)
Compile: icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -DUSE_MKL -DPRECISION=1 -DPAD_LD -c dgemm_pad_c_01.cpp
Link:    icpx -fiopenmp -fopenmp-targets=spir64 -qmkl -lOpenCL dgemm_pad_c_01.o -o dgemm_pad.exe
Run:     ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=0 ./dgemm_pad.exe 4001 4001 4001 1.0 0.0 100 0
The performance with padding, where the leading dimensions of the matrices a, b, and c were increased from 15000 to 15096, on the same GPU (1-stack only) was as follows.
DGEMM performance test
M = 15000, K = 15000, N = 15000, ldA = 15096, ldB = 15096, ldC = 15096, alpha = 1.000000, beta = 0.000000, iterations = 100, verify? = 0

---------------------
Array A: 15096 x 15000
Array B: 15096 x 15000
Array C: 15096 x 15000
---------------------
Total runtime for 100 iterations: 145.219459 seconds.
Mean TFLOP/s: 4.648137
Best TFLOP/s: 5.525268