CUDA C编程手册: 编程接口（二）

最新推荐文章于 2024-08-23 23:22:28 发布

xholes

最新推荐文章于 2024-08-23 23:22:28 发布

阅读量390

点赞数

分类专栏： CUDA 文章标签： cuda gpu runtime 多线程共享内存

本文链接：https://blog.csdn.net/xholes/article/details/106226084

版权

CUDA 专栏收录该内容

9 篇文章 4 订阅

订阅专栏

CUDA C编程手册: 编程接口（二）

CUDA C 运行时

CUDA C 运行时

运行时封装在 _cudart_库中，应用可以通过静态库_cudart.lib_或者_cudart.a_链接运行时库，或者通过动态库_cudart.dll_或者_cudart.so_来链接。如果应用是通过动态链接来引入运行时库，在打包应用的依赖时需要包含cuda 的动态运行时库。

该库中所有的入口点都是以cuda 作为前缀。CUDA编程模型假定一个系统由主机和设备构成，都有各自独立的内存。

初始化

对于运行时来说，没有明显的初始化函数。当一个运行时函数被调用时就是进行初始化。当对运行时函数的调用进行计时或者对错误代码进行分析的时候，需要记住这一点。在初始化的时候，运行时会为每一个设备创建CUDA上下文。这个上下文是设备的基本上下文primary context, 这个上下文会被应用的所有线程共享。设备端代码作为上下文创建的一部分，必要的时候会进行即时编译并加载进入设备内存中。这些都会在底层进行，并不会将基本上下文暴露给应用程序。当一个主机端线程调用cudaDeviceReset(), 这会销毁被这个设备端线程所使用的设备的基本上下文。下一个运行时函数在调用的时候，同时会创建一个新的上下文。

设备内存

核函数在设备内存上运行，因此运行时提供函数来分配、释放设备内存，同时负责在设备内存和主机内存之间进行数据传输。设备内存可以分配线性内存linear memory<\kbd>或者CUDA数组CUDA array。

CUDA数组是一种不透明的，它优化了内存布局以便纹理访问。

线性内存存在于一个40位地址空间的设备上，因此可以通过指针来进行内存的访问。典型的线性内存分配方法是通过cudaMalloc()，同时可以使用cudaFree()来
进行释放。

设备内存和主机内存之间的数据传输可以通过cudaMemCpy()来进行。

线性内存也可以通过cudaMallocPitch() 和 cudaMalloc3D()。推荐使用这些函数来分配2D或者3D的数组，它们会确保自动填充，以满足内村对齐的需求，这样能够在访问行地址或者进行2D数组与其他设备内存拷贝时能够获得更好的性能。数组元素的访问需要借助函数返回的pitch/.stride。

2D数组示例。

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
	for (int r = 0; r < height; ++r) {
		float* row = (float*)((char*)devPtr + r * pitch);
		for (int c = 0; c < width; ++c) {
			float element = row[c];
		}
	}
}

3D数组示例。

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
int width, int height, int depth)
{
char* devPtr = devPitchedPtr.ptr;
size_t pitch = devPitchedPtr.pitch;
size_t slicePitch = pitch * height;
for (int z = 0; z < depth; ++z) {
	char* slice = devPtr + z * slicePitch;
	for (int y = 0; y < height; ++y) {
		float* row = (float*)(slice + y * pitch);
		for (int x = 0; x < width; ++x) {
			float element = row[x];
		}
		}
	}
}

参考手册中列举了所有的关于内存拷贝的变体，对应于不同的内存，如通过cudaMalloc()分配的线性内存、通过cudaMallocPtich()或者cudaMalloc3D()分配的线性内存、CUDA 数组、全局内存、常量内存等。

如下示例了用运行时API访问全局变量的不同方法：

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

cudaGetSymbolAddress()函数用来恢复在全局内存空间中声明并分配的变量地址。其大小可以通过cudaGetSymbolSize()来获得。

共享内存

共享内存的分配通过说明符__shared__来标记。共享内存的访问相较于全局内存的访问速度来说，速度会快很多。在任何合适的场合可以使用共享内存的话就使用共享内存。

下列的实例代码实例直接实现了矩阵的乘法。每个线程读取A矩阵的一行和B矩阵的一列，然后计算C矩阵中对应元素的值。当整个C矩阵计算完成时， A被读取了B.width次数， B 被读取了A.height 次。

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
	int width;
	int height;
	float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
	// Load A and B to device memory
	Matrix d_A;
	d_A.width = A.width; d_A.height = A.height;
	size_t size = A.width * A.height * sizeof(float);
	cudaMalloc(&d_A.elements, size);
	cudaMemcpy(d_A.elements, A.elements, size,
	cudaMemcpyHostToDevice);
	Matrix d_B;
	d_B.width = B.width; d_B.height = B.height;
	size = B.width * B.height * sizeof(float);
	cudaMalloc(&d_B.elements, size);
	cudaMemcpy(d_B.elements, B.elements, size,
	cudaMemcpyHostToDevice);

	// Allocate C in device memory
	Matrix d_C;
	d_C.width = C.width; d_C.height = C.height;
	size = C.width * C.height * sizeof(float);
	cudaMalloc(&d_C.elements, size);
	// Invoke kernel
	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
	dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
	MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
	// Read C from device memory
	cudaMemcpy(C.elements, Cd.elements, size,
	cudaMemcpyDeviceToHost);
	// Free device memory
	cudaFree(d_A.elements);
	cudaFree(d_B.elements);
	cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
	// Each thread computes one element of C
	// by accumulating results into Cvalue
	float Cvalue = 0;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	for (int e = 0; e < A.width; ++e)
	Cvalue += A.elements[row * A.width + e]
	* B.elements[e * B.width + col];
	C.elements[row * C.width + col] = Cvalue;
}

当使用共享内存来进行实现时，每个线程块负责计算矩阵C的一个子矩阵 $C_{sub}$ ，块内的线程则负责计算这个子矩阵中对应的某一个元素。子矩阵 $C_{sub}$ 的计算其实对应的A矩阵的一个子矩阵和B的一个子矩阵的乘积。

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
	int width;
	int height;
	int stride;
	float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
	return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
	A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
	Matrix Asub;
	Asub.width = BLOCK_SIZE;
	Asub.height = BLOCK_SIZE;
	Asub.stride = A.stride;
	Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
	+ BLOCK_SIZE * col];
	return Asub;
}

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
	// Load A and B to device memory
	Matrix d_A;
	d_A.width = d_A.stride = A.width; d_A.height = A.height;
	size_t size = A.width * A.height * sizeof(float);
	cudaMalloc(&d_A.elements, size);
	cudaMemcpy(d_A.elements, A.elements, size,
	cudaMemcpyHostToDevice);
	Matrix d_B;
	d_B.width = d_B.stride = B.width; d_B.height = B.height;
	size = B.width * B.height * sizeof(float);
	cudaMalloc(&d_B.elements, size);
	cudaMemcpy(d_B.elements, B.elements, size,
	cudaMemcpyHostToDevice);
	
	// Allocate C in device memory
	Matrix d_C;
	d_C.width = d_C.stride = C.width; d_C.height = C.height;
	size = C.width * C.height * sizeof(float);
	cudaMalloc(&d_C.elements, size);
	
	// Invoke kernel
	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
	dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
	MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
	
	// Read C from device memory
	cudaMemcpy(C.elements, d_C.elements, size,
	cudaMemcpyDeviceToHost);
	
	// Free device memory
	cudaFree(d_A.elements);
	cudaFree(d_B.elements);
	cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
	// Block row and column
	int blockRow = blockIdx.y;
	int blockCol = blockIdx.x;
	
	// Each thread block computes one sub-matrix Csub of C
	Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
	
	// Each thread computes one element of Csub
	// by accumulating results into Cvalue
	float Cvalue = 0;
	
	// Thread row and column within Csub
	int row = threadIdx.y;
	int col = threadIdx.x;
	
	// Loop over all the sub-matrices of A and B that are
	// required to compute Csub
	// Multiply each pair of sub-matrices together
	// and accumulate the results
	for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
	// Get sub-matrix Asub of A
	Matrix Asub = GetSubMatrix(A, blockRow, m);
	// Get sub-matrix Bsub of B
	Matrix Bsub = GetSubMatrix(B, m, blockCol);
	// Shared memory used to store Asub and Bsub respectively
	__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
	
	// Load Asub and Bsub from device memory to shared memory
	// Each thread loads one element of each sub-matrix
	As[row][col] = GetElement(Asub, row, col);
	Bs[row][col] = GetElement(Bsub, row, col);
	// Synchronize to make sure the sub-matrices are loaded
	// before starting the computation
	__syncthreads();
	
	// Multiply Asub and Bsub together
	for (int e = 0; e < BLOCK_SIZE; ++e)
	Cvalue += As[row][e] * Bs[e][col];
	
	// Synchronize to make sure that the preceding
	// computation is done before loading two new
	// sub-matrices of A and B in the next iteration
	__syncthreads();
	}
	
	// Write Csub to device memory
	// Each thread writes one element
	SetElement(Csub, row, col, Cvalue);
}

锁页主机内存

运行时提供了可以使用锁页(paged-locked/pinned)主机内存的函数。我们常用的malloc()申请的是可分页的内存。

cudaHostAlloc()和cudaFreeHost()用来分配和使用锁页主机内存。
cudaHostRegister()来对使用malloc()分配的内存进行锁页，但也会存在一些缺陷。

使用锁页主机内存有以下几点优势：

在锁页主机内存和设备内存之间进行复制操作时，可以在核函数执行的同时进行。（通常这种拷贝操作会阻塞主机端代码）
在一些设备上，锁页内存可以被映射到设备内存地址空间，这样就不用显式地进行数据的复制。
对于总线前置的系统，锁页主机内存和设备内存之间具有较高的带宽。

锁页内存是稀有资源，因此相较于分配可分页的额内存，锁页内存的分配容易失败。除此之外，消耗太多锁页内存会减少系统可用的分页内存，这会影响系统的运行性能。

portable memory

锁页内存可以被系统中任意的设备使用，但默认情况下，锁页内存的优势只能被当时块所分配在的设备所获益。为了使得所有的设备获益，块需要在分配的时候传递一个标志cudaHostAllocPortable给cudaHostAlloc(), 或者在锁页的时候给cudaHostRegister()传递参数cudaHostRegisterPortable。

Write-Combining Memory

默认情况下，锁页的主机内存分配的是可缓存的。也可以通过传递参数cudaHostWriteComnined给cudaHostAlloc()从而将其分配为write-combining。write-combining内存会释放主机端的L1和L2缓存资源，从而为其他应用提供更多的可用缓存。除此之外，通过PCI-E总线转移数据这种内存数据的效率能提高40%。但是在主机端访问这样的内存，速度会变得非常慢，因此这样的内存应该只用于被写。

Mapped Memory

锁页内存可以被映射至设备的地址空间，仅需要在调用cudaHostAlloc()时传递参数cudaHostAllocMapped,或者传递参cudaHostRegisterMapped至cudaHostRegister。因此这样的内存块有两个地址：一个是cudaHostAlloc() / malloc()返回的主机端内存地址，一个是通过cudaHostGetDevicePointer()恢复的可以被核函数使用的设备内存地址。

直接访问主机内存具有以下几个优势：