HPC应用&NumPy兼容的多维数组GPU加速计算工具库CuPy详细安装与使用

技术瘾君子1573

于 2024-08-24 06:30:00 发布

阅读量767

点赞数 25

分类专栏： Linux并行计算&HPC高性能计算文章标签： numpy GPU加速工具库 HPC应用 DCU 海光 GPGPU

本文链接：https://blog.csdn.net/qq_27815483/article/details/141185843

版权

Linux并行计算&HPC高性能计算专栏收录该内容

70 篇文章 16 订阅

订阅专栏

CuPy简介

CuPy是一个NumPy兼容的多维数组GPU加速计算工具库，最初是一家日本人工智能公司Preferred Networks为其深度学习框架Chainer开发的GPU后端。Chainer最初使用的是PyCUDA，但由于PyCUDA是为使用python的CUDA用户设计的，并未对深度学习框架中广泛使用的多维数组提供广泛的支持，所以Preferred Networks放弃了PyCUDA，专门开发了CuPy。随后，CuPy不断发展，不仅提供NumPy API兼容的GPU实现，而且也可以像PyCUDA一样编写用户自定义核函数，CuPy自动包装并编译核函数代码段，生成可执行代码。CuPy在github开源，目前已经有了较好的社区支持。CuPy对ROCm的支持正在开发中，目前对ROCm 3.5以上版本已有基本的支持。

下载安装

CuPy可以从github下载：

git clone --recursive https://github.com/cupy/cupy.git

在昆山集群，加载rocm 3.9.1和anacoda3环境：

module rm compiler/rocm/2.9

module add compiler/rocm/3.9.1 apps/anaconda3·

在可以连接外网的情况下，创建虚拟环境：

conda create -n cupy python=3.8

加载虚拟环境：

source activate cupy

安装Cython 0.28.0或以上版本：

pip install Cython

进入CuPy源码目录切换到master分支：

git checkout master

设置所需环境变量：

export HCC_AMDGPU_TARGET=gfx906 # This value should be changed based on your GPU

export __HIP_PLATFORM_HCC__

export CUPY_INSTALL_USE_HIP=1

export ROCM_HOME=/public/software/compiler/rocm/rocm-3.9.1

最后，在CuPy源码顶层目录执行：

pip install .

将自动安装所需的其他依赖，然后从源码编译安装CuPy。安装完成后执行：

conda list

可以看到cupy 已经在已安装列表里了。

如无法连接外网，也可以下载源码包手动安装。pyCUDA需要以下依赖：

Cython >= 0.28.0

numpy >=1.15

fastrlock >= 0.3

功能测试

注意，在rocm平台运行前，需要设置以下环境变量：

export HCC_AMDGPU_TARGET=gfx906

NumPy 兼容API

cupy.ndarray基础

CuPy在GPU上实现了NumPy API的一个子集。

>>> import numpy as np

>>> import cupy as cp

>>> x_gpu = cp.array([1, 2, 3])

上述代码中，x_gpu是cupy.ndarray的一个实例，其创建方式与NumPy几乎完全一致，只是将np替换为cp。该语句会在GPU上分配内存创建数组。

>>> x_cpu = np.array([1, 2, 3])

>>> l2_cpu = np.linalg.norm(x_cpu)

>>> l2_cpu

3.7416573867739413

>>> l2_gpu = cp.linalg.norm(x_gpu)

>>> l2_gpu

array(3.74165739)

上述代码首先在主机端使用NumPy API创建了x_cpu数组，并调用NumPy API计算了该数组的L2 范数；然后调用CuPy 中的相同API计算了x_gpu数组的L2范数，可见其结果与x_cpu一致。

默认情况下CuPy使用当前GPU进行计算。cupy.cuda.Device.use()可以切换GPU，cupy.ndarray.device可以查看存储该对象的GPU编号：

>>> x_gpu.device

<CUDA Device 0>

>>> cp.cuda.Device(1).use()

>>> x_gpu1 = cp.array([4,5,6])

>>> x_gpu1.device

<CUDA Device 1>

使用cupy.cuda.Device.use() API切换设备后，后续操作将一直在切换后的设备上进行，直到再次切换：

>>> x_gpu1_again = cp.array([7,8,9])

>>> x_gpu1_again.device

<CUDA Device 1>

临时的一次性切换设备，可以使用with语句：

>>> with cp.cuda.Device(0): #before this clause, the current device is
Device 1

... x_on_gpu0 = cp.array([1,2,3])

...

>>> x_on_gpu0.device # x_on_gpu0 is created within the with
cp.cuda.Device(0)

<CUDA Device 0> #clause, so it’s on Device 0

>>> x_on_gpu1 = cp.array([1,2,3])

>>> x_on_gpu1.device # x_on_gpu1 is created out of the with
cp.cuda.Device(0)

<CUDA Device 1> #clause, so it’s on Device 1

如果操作的对象不在当前设备上，将会遇到报错：

>>> x_on_gpu1 * 2

array([2, 4, 6])

>>> x_on_gpu0 * 2

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "cupy/core/core.pyx", line 1084, in cupy.core.core.ndarray.__mul__

File "cupy/core/_kernel.pyx", line 1041, in cupy.core._kernel.ufunc.__call__

File "cupy/core/_kernel.pyx", line 107, in cupy.core._kernel._preprocess_args

File "cupy/core/_kernel.pyx", line 77, in
cupy.core._kernel._check_array_device_id

ValueError: Array device must be same as the current device: array device = 0
while current = 1

数据迁移

cupy.asarray()可以将numpy.ndarray 列表或者任何可以作为numpy.array()参数的对象拷贝到当前GPU：

>>> x_cpu = np.array([1, 2, 3])

>>> x_gpu = cp.asarray(x_cpu) # move the data to the current device.

>>> x_gpu.device

<CUDA Device 1>

cupy.asarray()也可以接受cupy.ndarray作为参数，在设备之间拷贝对象：

>>> with cp.cuda.Device(0):

... x_gpu_0 = cp.ndarray([1, 2, 3]) # create an array in GPU 0

>>> with cp.cuda.Device(1):

... x_gpu_1 = cp.asarray(x_gpu_0) # move the array to GPU 1

cupy.asnumpy()可以将设备端数组拷贝到主机端：

>>> x_gpu = cp.array([1, 2, 3]) # create an array in the current device

>>> x_cpu = cp.asnumpy(x_gpu) # move the array to the host.

cupy.ndarray.get()也可以起相同作用：

>>> x_cpu = x_gpu.get()

CPU/GPU兼容代码

cupy.get_array_module()根据参数实际存储位置返回numpy或者cupy对象，可以用于编写主机/设备无关代码：

>>> # Stable implementation of log(1 + exp(x))

>>> def softplus(x):

... xp = cp.get_array_module(x)

... return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))

用户自定义核函数

elementwise kernels

elementwise kernel可以通过ElememtwiseKernel类来定义，该类的实例定义一个可以通过call方法启动的GPU核函数。elementwise kernel的定义包括四部分：输入参数列表，输出参数列表，循环体代码，以及核函数名称。如下代码定义并测试了计算方差的核函数：

import numpy as np

import cupy as cp

squared_diff = cp.ElementwiseKernel(

'float32 x, float32 y',

'float32 z',

'z = (x - y) * (x - y)',

'squared_diff')

x = cp.arange(10, dtype=np.float32).reshape(2, 5)

y = cp.arange(5, dtype=np.float32)

z = squared_diff(x, y)

print(z)

执行结果：

[[ 0. 0. 0. 0. 0.]

[25. 25. 25. 25. 25.]]

参数列表包含逗号分隔的参数定义，每个参数定义由一个类型指示符和一个参数名称组成。NumPy的数据类型名称可以作为类型指示符。注意，字母n，i以及以下划线_开头的名称是内部使用的保留名称，不能用作用户自定义核函数中的变量名称。

如果类型指示符是一个单个的字母，它将被视为一个类型占位符，其值将在调用时根据实参类型推断得到。同一核函数中相同字母的类型占位符代表同一种类型。ElementwiseKernel类首先检查输出参数，然后检查输入参数，如果调用时没有指定输出参数，将使用输入实参进行参数类型推断。

>>> squared_diff_super_generic = cp.ElementwiseKernel(

... 'X x, Y y',

... 'Z z',

... 'z = (x - y) * (x - y)',

... 'squared_diff_super_generic')

调用上述函数时要求显式指定输出参数类型，因为类型Z无法自动从输入参数中推断。

ElementwiseKernel类通过广播机制自动执行索引，这对于定义大多数基于元素的计算很有用。通过在类型说明符之前添加raw关键字，可以告诉ElementwiseKernel类使用手动索引。使用特殊的变量i和_ind.size()方法进行手动索引。i表示循环内的索引。_ind.size()表示要应用elementwise操作的元素总数。注意，它表示广播操作之后的大小。

以下代码定义了计算两个向量反向相加的核函数：

import numpy as np

import cupy as cp

add_reverse = cp.ElementwiseKernel(

'T x, raw T y', 'T z',

'z = x + y[_ind.size() - i - 1]',

'add_reverse')

x = cp.arange(100000, dtype=np.float32)

y = cp.arange(100000, dtype=np.float32)

z = cp.zeros(100000, dtype=np.float32)

add_reverse(x, y, z)

print(z)

执行结果如下：

[99999. 99999. 99999. ... 99999. 99999. 99999.]

注意raw参数不会被广播，如果想让所有参数都加上raw特性，必须在调用时指定size参数，来定义_ind.size()的值。

reduction kernels

reduction核函数可以由ReductionKernel类定义。该类的实例需要指定7个参数：输入参数列表，输出参数列表， Mapping expression，Reduction expression， Post mapping expression，Identity value，以及核函数名称。其中，Mapping expression用于定义归约前每个元素的前处理过程；Reduction expression定义归约操作，特殊的变量名a和b在这里作为操作对象；Post mapping expression用于转换得到的归约值，特殊变量a被用作它的输入，输出应为输出参数；Identity value指定每个元素的初始值。

ReductionKernel类会自动插入高效灵活的归约实现所需要的其他代码片段。如下代码定义并测试了计算指定轴L2范数的代码：

import cupy as cp

import numpy as np

l2norm_kernel = cp.ReductionKernel(

'T x', # input params

'T y', # output params

'x * x', # map

'a + b', # reduce

'y = sqrt(a)', # post-reduction map

'0', # identity value

'l2norm' # kernel name

)

x = cp.arange(10, dtype=np.float32).reshape(2, 5)

print(l2norm_kernel(x, axis=1))

计算结果：

[ 5.4772253 15.9687195]

注意，如果需要对至少一个参数使用raw特性，axis参数必须是0或从0开始的连续递增的整数序列，如(0,1)、(0,1,2)等。

raw kernels

raw kernels可以由RawKernel类定义。通过使用raw kernels，可以从原始的的CUDA/HIP代码定义核函数。RawKernel对象允许使用CUDA/HIP的核函数发起接口调用核函数，因此可以控制grid、block、shared memory尺寸和stream。

如下代码定义并调用了一个向量加核函数：

import numpy as np

import cupy as cp

add_kernel = cp.RawKernel(r'''

extern "C" __global__

void my_add(const float* x1, const float* x2, float* y) {

int tid = blockDim.x * blockIdx.x + threadIdx.x;

y[tid] = x1[tid] + x2[tid];

}

''', 'my_add')

x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)

x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)

y = cp.zeros((5, 5), dtype=cp.float32)

add_kernel((5,), (5,), (x1, x2, y)) # grid, block and arguments

print(y)

执行结果：

[[ 0. 2. 4. 6. 8.]

[10. 12. 14. 16. 18.]

[20. 22. 24. 26. 28.]

[30. 32. 34. 36. 38.]

[40. 42. 44. 46. 48.]]

需要注意的是，CuPy不会对传递给核函数的参数(包括参数的类型和数量)执行任何检查。特别注意，当传递ndarray时，它的dtype应该与在CUDA/HIP代码中声明的参数的类型相匹配(除非有意转换数据类型)。例如，cupy.float32和cupy.uint64数组必须做为float和unsigned long long传参。对于Python的基本类型，int、float、complex和bool分别映射到long long、double、cuDoubleComplex和bool。

RawKernel支持CUDA的动态并行，但在ROCm平台，HIP不支持该特性，因此在ROCm平台CuPy也不支持该特性。此外，由于HIP的已知bug（见第4节），如果HIP代码中有头文件包含（#include XXX语句），需要指定后端为nvcc（在ROCm平台上该参数实际是将hiprtc切换为hipcc，即不使用运行时编译。CuPy计划修改这个令人迷惑的操作）。示例如下：

import numpy as np

import cupy as cp

complex_kernel = cp.RawKernel(r'''

#include <cupy/complex.cuh>

extern "C" __global__

void my_func(const complex<float>* x1, const complex<float>* x2,

complex<float>* y, float a) {

int tid = blockDim.x * blockIdx.x + threadIdx.x;

y[tid] = x1[tid] + a * x2[tid];

}

''', 'my_func', backend='nvcc') #specify backend as nvcc to avoid the hiprtc
bug

x1 = cp.arange(25, dtype=cp.complex64).reshape(5, 5)

x2 = 1j*cp.arange(25, dtype=cp.complex64).reshape(5, 5)

y = cp.zeros((5, 5), dtype=cp.complex64)

complex_kernel((5,), (5,), (x1, x2, y, cp.float32(2.0))) # grid, block and
arguments

print(y)

执行结果：

[[ 0. +0.j 1. +2.j 2. +4.j 3. +6.j 4. +8.j]

[ 5.+10.j 6.+12.j 7.+14.j 8.+16.j 9.+18.j]

[10.+20.j 11.+22.j 12.+24.j 13.+26.j 14.+28.j]

[15.+30.j 16.+32.j 17.+34.j 18.+36.j 19.+38.j]

[20.+40.j 21.+42.j 22.+44.j 23.+46.j 24.+48.j]]

raw modules

对于处理大型CUDA/HIP源文件或加载现有的CUDA/HIP二进制文件，RawModule类会更加方便。它可以接受CUDA/HIP源代码或CUDA/HIP二进制的路径（二选一）。可以通过调用get_function()方法来检索所需的核函数，该方法返回的RawKernel实例可以按照之前介绍的方法调用。示例如下：

import cupy as cp

import numpy as np

loaded_from_source = r'''

extern "C"{

__global__ void test_sum(const float* x1, const float* x2, float* y, \

unsigned int N)

{

unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;

if (tid < N)

{

y[tid] = x1[tid] + x2[tid];

}

}

__global__ void test_multiply(const float* x1, const float* x2, float* y, unsigned int N)

{

unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;

if (tid < N)

{

y[tid] = x1[tid] * x2[tid];

}

}

}'''

module = cp.RawModule(code=loaded_from_source)

ker_sum = module.get_function('test_sum')

ker_times = module.get_function('test_multiply')

N = 10

x1 = cp.arange(N**2, dtype=cp.float32).reshape(N, N)

x2 = cp.ones((N, N), dtype=cp.float32)

y = cp.zeros((N, N), dtype=cp.float32)

ker_sum((N,), (N,), (x1, x2, y, N**2)) # y = x1 + x2

print(y)

ker_times((N,), (N,), (x1, x2, y, N**2)) # y = x1 * x2

print(y)

计算结果：

[[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]

[ 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]

[ 21. 22. 23. 24. 25. 26. 27. 28. 29. 30.]

[ 31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]

[ 41. 42. 43. 44. 45. 46. 47. 48. 49. 50.]

[ 51. 52. 53. 54. 55. 56. 57. 58. 59. 60.]

[ 61. 62. 63. 64. 65. 66. 67. 68. 69. 70.]

[ 71. 72. 73. 74. 75. 76. 77. 78. 79. 80.]

[ 81. 82. 83. 84. 85. 86. 87. 88. 89. 90.]

[ 91. 92. 93. 94. 95. 96. 97. 98. 99. 100.]]

[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

[10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]

[20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]

[30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]

[40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]

[50. 51. 52. 53. 54. 55. 56. 57. 58. 59.]

[60. 61. 62. 63. 64. 65. 66. 67. 68. 69.]

[70. 71. 72. 73. 74. 75. 76. 77. 78. 79.]

[80. 81. 82. 83. 84. 85. 86. 87. 88. 89.]

[90. 91. 92. 93. 94. 95. 96. 97. 98. 99.]]

同RawKernel一样，如果核函数代码中有头文件包含，需加入backend=’nvcc’参数。

kernel fusion

cupy.fuse()是一个装饰器，可以用来比使用ElementwiseKernel或ReductionKernel类更容易地定义ementwisekernel或Reduction Kernel。3.2.1小节中的squared_diff函数可以采用cupy.fuse()定义如下：

>>> @cp.fuse()

... def squared_diff(x, y):

... return (x - y) * (x - y)

上述核函数可以像普通函数一样在标量、NumPy数组或CuPy数组上调用：

>>> x_cp = cp.arange(10)

>>> y_cp = cp.arange(10)[::-1]

>>> squared_diff(x_cp, y_cp)

array([81, 49, 25, 9, 1, 1, 9, 25, 49, 81])

>>> x_np = np.arange(10)

>>> y_np = np.arange(10)[::-1]

>>> squared_diff(x_np, y_np)

array([81, 49, 25, 9, 1, 1, 9, 25, 49, 81])

在squared_diff的第一次调用中，融合函数根据参数信息(例如它们的dtype和ndims)分析原始函数，创建并缓存一个实际的CUDA/HIP核函数。在具有相同输入类型的第二次squared_diff函数调用中，融合函数调用先前缓存的核函数，因此建议重用相同的修饰函数，而不是修饰多次定义的本地函数。

cupy. fuse()还支持简单reduction kernels，例如：

>>> @cp.fuse()

... def sum_of_products(x, y):

... return cp.sum(x * y, axis = -1)

目前，cupy.fuse()只支持简单的elementwise kernels和reduction kernels。大多数其他函数(例如，cupy.matmul()、cupy.reshape())不能支持。

ROCm平台支持情况

HIP的运行时编译存在一个bug，无法识别-I参数，导致核函数代码中如果有非系统路径下的头文件包含，会在运行时遇到头文件找不到的错误。例如如下代码片段：

static constexpr auto saxpy{

R"(

#include "test.h" //this cause the problem

#include <hip/hip_runtime.h>

extern "C"

__global__

void saxpy(float a, float* x, float* y, float* out, size_t n)

{

size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

if (tid < n) {

out[tid] = a * x[tid] + y[tid];

}

}

)"};

......

hiprtcProgram prog;

hiprtcCreateProgram(&prog, // prog

saxpy, // buffer

"saxpy.cu", // name

0, // numHeaders

nullptr, // headers

nullptr); // includeNames

hipDeviceProp_t props;

int device = 0;

hipGetDeviceProperties(\&props, device);

std::string sarg = "--gpu-architecture=gfx906";

std::string sarg1 = "-I.";

const char* options[2];

options[1] = sarg.c_str();

options[0] = sarg1.c_str();

for (int i = 0; i < 2; ++i)

printf("option %d is %s\n", i, options[i]);

hiprtcResult compileResult{hiprtcCompileProgram(prog, 2, options)};

包含的test.h在当前目录中，并且有一个-I.选项在hiprtcCompileProgram的options参数中。可以使用hipcc正常编译saxpy.cpp，但是生成的可执行文件在运行时失败，输出如下:

/tmp/comgr-77d2a4/input/CompileSource:2:10: fatal error: 'test.h' file not found

#include "test.h"  
^~~~~~~~  
1 error generated when compiling for gfx906.  
Error: Failed to compile opencl source (from CL or HIP source to LLVM IR).

error: Compilation failed.  
error: TEST FAILED  
Aborted

该问题已在github上向HIP开发者提出并得到了确认，目前直到ROCm 3.9仍存在上述问题。CuPy针对上述问题给出了一个暂时的解决办法，就是在自动生成GPU源代码时全部include头文件的绝对路径，不过这一搜索过程会使得运行时编译更慢。另外，如果使用的是raw kernels或者raw modules模块，且GPU代码里有头文件包含时，需要添加backend=’nvcc’关掉运行时编译，否则，即使在GPU代码中给出头文件的绝对路径，如果头文件中还有其他头文件包含，还是会报头文件找不到的错误。