文章目录
2.3组织并行线程
如果使用了合适的网格和块大小来正确地组织线程,那么可以对内核性能产生很大的影响。在向量加法的例子中,为了实现最佳性能我们调整了块的大小,并基于块大小和向量数据大小计算出了网格大小。
现在通过一个矩阵加法的例子来进一步说明这一点。对于矩阵运算,传统的方法是在内核中使用一个包含二维网格与二维块的布局来组织线程。但是,这种传统的方法无法获得最佳性能。在矩阵加法中使用以下布局将有助于了解更多关于网格和块的启发性的用法:
- 由二维线程块构成的二维网格
- 由一维线程块构成的一维网格
- 由一维线程块构成的二维网格
2.3.1使用块和线程建立矩阵索引
通常情况一个矩阵用行优先方法在全局内存中进行线性存储,如下图的例子。
在一个矩阵加法核函数中,一个线程通常被分配一个数据元素来处理,首先要完成的是从块和线程索引中得到数据在全局内存中的位置。对一个二维示例来说通常有需要3种索引:
- 线程和块索引;
- 矩阵给定点的坐标;
- 全局线性内存中的偏移量。
对于一个给定的线程,首先可以通过把线程和块索引映射到矩阵坐标上来获取线程块和线程索引的全局内存偏移量,然后将这些矩阵坐标映射到全局内存的存储单元中。
- 可以用以下公式把线程和块索引映射到矩阵坐标上:
ix = threadIdx.x + blockIdx.x * blockDim.x;
iy = threadIdx.y + blockIdx.y * blockDim.y;
- 可以用以下公式把矩阵坐标映射到全局内存中的索引/存储单元上:
idx = iy * nx + ix;
其中,nx = blockDim.x * gridDim.x
下图说明了块和线程索引、矩阵坐标以及线性全局内存索引之间的对应关系
2.3.1使用二维网格和二维块对矩阵求和
使用一个二维网格和二维块来编写一个矩阵加法核函数
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>
/*
* This example demonstrates a simple vector sum on the GPU and on the host.
* sumArraysOnGPU splits the work of the vector sum across CUDA threads on the
* GPU. A 2D thread block and 2D grid are used. sumArraysOnHost sequentially
* iterates through vector elements on the host.
*/
void initialData(float *ip, const int size)
{
int i;
for(i = 0; i < size; i++)
{
ip[i] = (float)(rand() & 0xFF) / 10.0f;
}
return;
}
void sumMatrixOnHost(float *A, float *B, float *C, const int nx,
const int ny)
{
float *ia = A;
float *ib = B;
float *ic = C;
for (int iy = 0; iy < ny; iy++)
{
for (int ix = 0; ix < nx; ix++)
{
ic[ix] = ia[ix] + ib[ix];
}
ia += nx;
ib += nx;
ic += nx;
}
return;
}
void checkResult(float *hostRef, float *gpuRef, const int N)
{
double epsilon = 1.0E-8;
bool match = 1;
for (int i = 0; i < N; i++)
{
if (abs(hostRef[i] - gpuRef[i]) > epsilon)
{
match = 0;
printf("host %f gpu %f\n", hostRef[i], gpuRef[i]);
break;
}
}
if (match)
printf("Arrays match.\n\n");
else
printf("Arrays do not match.\n\n");
}
// grid 2D block 2D
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC, int nx,
int ny)
{
unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
unsigned int idx = iy * nx + ix;
if (ix < nx && iy < ny)
MatC[idx] = MatA[idx] + MatB[idx];
}
int main(int argc, char **argv)
{
printf("%s Starting...\n", argv[0]);
// set up device
int dev = 0;
cudaDeviceProp deviceProp;
CHECK(cudaGetDeviceProperties(&deviceProp, dev));
printf("Using Device %d: %s\n", dev, deviceProp.name);
CHECK(cudaSetDevice(dev));
// set up data size of matrix
int nx = 1 << 12;
int ny = 1 << 12;
int nxy = nx * ny;
int nBytes = nxy * sizeof(float);
printf("Matrix size: nx %d ny %d\n", nx, ny);
// malloc host memory
float *h_A, *h_B, *hostRef, *gpuRef;
h_A = (float *)malloc(nBytes);
h_B = (float *)malloc(nBytes);
hostRef = (float *)malloc(nBytes);
gpuRef = (float *)malloc(nBytes);
// initialize data at host side
double iStart = seconds();
initialData(h_A, nxy);
initialData(h_B, nxy);
double iElaps = seconds() - iStart;
printf("Matrix initialization elapsed %f sec\n", iElaps);
memset(hostRef, 0, nBytes);
memset(gpuRef, 0, nBytes);
// add matrix at host side for result checks
iStart = seconds();
sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
iElaps = seconds() - iStart;
printf("sumMatrixOnHost elapsed %f sec\n", iElaps);
// malloc device global memory
float *d_MatA, *d_MatB, *d_MatC;
CHECK(cudaMalloc((void **)&d_MatA, nBytes));
CHECK(cudaMalloc((void **)&d_MatB, nBytes));
CHECK(cudaMalloc((void **)&d_MatC, nBytes));
// transfer data from host to device
CHECK(cudaMemcpy(d_MatA, h_A, nBytes, cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(d_MatB, h_B, nBytes, cudaMemcpyHostToDevice));
// invoke kernel at host side
int dimx = 32;
int dimy = 32;
dim3 block(dimx, dimy);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
iStart = seconds();
sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
CHECK(cudaDeviceSynchronize());
iElaps = seconds() - iStart;
printf("sumMatrixOnGPU2D <<<(%d,%d), (%d,%d)>>> elapsed %f sec\n", grid.x,
grid.y,
block.x, block.y, iElaps);
// check kernel error
CHECK(cudaGetLastError());
// copy kernel result back to host side
CHECK(cudaMemcpy(gpuRef, d_MatC, nBytes, cudaMemcpyDeviceToHost));
// check device results
checkResult(hostRef, gpuRef, nxy);
// free device global memory
CHECK(cudaFree(d_MatA));
CHECK(cudaFree(d_MatB));
CHECK(cudaFree(d_MatC));
// free host memory
free(h_A);
free(h_B);
free(hostRef);
free(gpuRef);
// reset device
CHECK(cudaDeviceReset());
return (0);
}
关键部分
// grid 2D block 2D
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC, int nx,
int ny)
{
unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
unsigned int idx = iy * nx + ix;
if (ix < nx && iy < ny)
MatC[idx] = MatA[idx] + MatB[idx];
}
增加块的数量不一定能提升内核性能
2.3.3使用一维网格和一维块对矩阵求和
为了使用一维网格和一维块,你需要写一个新的核函数,其中每个线程处理ny个数据元素
// grid 1D block 1D
__global__ void sumMatrixOnGPU1D(float *MatA, float *MatB, float *MatC, int nx,
int ny)
{
unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
if (ix < nx )
for (int iy = 0; iy < ny; iy++)
{
int idx = iy * nx + ix;
MatC[idx] = MatA[idx] + MatB[idx];
}
}
int dimx = 32;
dim3 block(dimx, 1);
dim3 grid((nx + block.x - 1) / block.x, 1);
2.3.4使用二维网格和一维块对矩阵求和
当使用一个包含一维块的二维网格时,每个线程都只关注一个数据元素并且网格的第二个维数等于ny,如图2-14所示。这可以看作是含有一个二维块的二维网格的特殊情况,其中块的第二个维数是1。因
此,从块和线程索引到矩阵坐标的映射就变成
// grid 2D block 1D
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC, int nx,
int ny)
{
unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int iy = blockIdx.y;
if (ix<nx && iy<ny)
{
unsigned int idx = iy * nx + ix;
MatC[idx] = MatA[idx] + MatB[idx];
}
}
int dimx = 32;
dim3 block(dimx, 1);
dim3 grid((nx + block.x - 1) / block.x, ny);
小结
·改变执行配置对内核性能有影响
·传统的核函数实现一般不能获得最佳性能
·对于一个给定的核函数,尝试使用不同的网格和线程块大小可以获得更好的性能
2.4设备管理
NVIDIA提供了几种查询和管理GPU设备的方法,学会如何查询GPU设备,可以在运行时使用它来帮助设置内核的执行配置。
本节介绍两种设备查询方法,一种是CUDA Runtime API,另一种是系统管理界面命令行程序nvidia-smi
2.4.1使用运行时API查询GPU信息
使用cudaGetDeviceProperties
函数可以查询关于GPU设备的所有信息,其原型如下:
cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int device);
cudaDeviceProp结构体返回GPU设备的属性
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>
/*
* Display a variety of information on the first CUDA device in this system,
* including driver version, runtime version, compute capability, bytes of
* global memory, etc.
*/
int main(int argc, char **argv)
{
printf("%s Starting...\n", argv[0]);
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0)
{
printf("There are no available device(s) that support CUDA\n");
}
else
{
printf("Detected %d CUDA Capable device(s)\n", deviceCount);
}
int dev = 0, driverVersion = 0, runtimeVersion = 0;
CHECK(cudaSetDevice(dev));
cudaDeviceProp deviceProp;
CHECK(cudaGetDeviceProperties(&deviceProp, dev));
printf("Device %d: \"%s\"\n", dev, deviceProp.name);
cudaDriverGetVersion(&driverVersion);
cudaRuntimeGetVersion(&runtimeVersion);
printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n",
driverVersion / 1000, (driverVersion % 100) / 10,
runtimeVersion / 1000, (runtimeVersion % 100) / 10);
printf(" CUDA Capability Major/Minor version number: %d.%d\n",
deviceProp.major, deviceProp.minor);
printf(" Total amount of global memory: %.2f GBytes (%llu "
"bytes)\n", (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
(unsigned long long)deviceProp.totalGlobalMem);
printf(" GPU Clock rate: %.0f MHz (%0.2f "
"GHz)\n", deviceProp.clockRate * 1e-3f,
deviceProp.clockRate * 1e-6f);
printf(" Memory Clock rate: %.0f Mhz\n",
deviceProp.memoryClockRate * 1e-3f);
printf(" Memory Bus Width: %d-bit\n",
deviceProp.memoryBusWidth);
if (deviceProp.l2CacheSize)
{
printf(" L2 Cache Size: %d bytes\n",
deviceProp.l2CacheSize);
}
printf(" Max Texture Dimension Size (x,y,z) 1D=(%d), "
"2D=(%d,%d), 3D=(%d,%d,%d)\n", deviceProp.maxTexture1D,
deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1],
deviceProp.maxTexture3D[2]);
printf(" Max Layered Texture Size (dim) x layers 1D=(%d) x %d, "
"2D=(%d,%d) x %d\n", deviceProp.maxTexture1DLayered[0],
deviceProp.maxTexture1DLayered[1], deviceProp.maxTexture2DLayered[0],
deviceProp.maxTexture2DLayered[1],
deviceProp.maxTexture2DLayered[2]);
printf(" Total amount of constant memory: %lu bytes\n",
deviceProp.totalConstMem);
printf(" Total amount of shared memory per block: %lu bytes\n",
deviceProp.sharedMemPerBlock);
printf(" Total number of registers available per block: %d\n",
deviceProp.regsPerBlock);
printf(" Warp size: %d\n",
deviceProp.warpSize);
printf(" Maximum number of threads per multiprocessor: %d\n",
deviceProp.maxThreadsPerMultiProcessor);
printf(" Maximum number of threads per block: %d\n",
deviceProp.maxThreadsPerBlock);
printf(" Maximum sizes of each dimension of a block: %d x %d x %d\n",
deviceProp.maxThreadsDim[0],
deviceProp.maxThreadsDim[1],
deviceProp.maxThreadsDim[2]);
printf(" Maximum sizes of each dimension of a grid: %d x %d x %d\n",
deviceProp.maxGridSize[0],
deviceProp.maxGridSize[1],
deviceProp.maxGridSize[2]);
printf(" Maximum memory pitch: %lu bytes\n",
deviceProp.memPitch);
exit(EXIT_SUCCESS);
}
./checkDeviceInfor Starting...
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1050"
CUDA Driver Version / Runtime Version 11.2 / 11.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 1.95 GBytes (2097479680 bytes)
GPU Clock rate: 1493 MHz (1.49 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
2.4.2确定最优GPU
一些系统支持多GPU。在每个GPU都不同的情况下,选择性能最好的GPU运行核函数是非常重要的。通过比较GPU包含的多处理器的数量选出计算能力最佳的GPU。如果你有一个多GPU系统,可以使用以下代码来选择计算能力最优的设备:
2.4.3使用nvidia-smi查询GPU信息
nvidia-smi是一个命令行工具,用于管理和监控GPU设备,并允许查询和修改设备状态。
- 确定系统中安装了多少个GPU以及每个GPU的设备ID,可以使用以下命令
nvidia-smi -L
- 可以使用下面的命令获取GPU 0的详细信息
nvidia-smi -q -i 0
可以使用下面的参数精简nvidia-smi的显式信息:
比如,只显示设备内存信息,可以使用:
nvidia-smi -q -i 0 -d MEMORY | tail -n 5
只显示设备使用信息,可以使用:
nvidia-smi -q -i 0 -d UTILIZATION | tail -n 4
2.4.4在运行时设置设备
在一个有N个GPU的系统中,可以使用环境变量CUDA_VISIBLE_DEVICES
来使用指定ID的设备,而无需更改程序。
如果运行时设置CUDA_VISIBLE_DEVICES=2
,nvidia驱动程序会屏蔽其他GPU,这是设备2作为设备0出现在程序中。
也可以指定多个设备,比如设置CUDA_VISIBLE_DEVICES =2,3
,运行时Nvidia驱动程序将只使用ID为2和3的设备,且分别映射为0和1。