CUDA Programming Interface (I)

   CUDA C supports the C/C++ languages. It adds only a small set of extensions to the C language and provides a C runtime library.

  To understand how CUDA runs, we first need to understand how a CUDA program is compiled.

  Compilation with NVCC

   Offline Compilation

    The NVCC workflow mainly consists of the following steps:

      1. Separate the host code and the device code in the program.

      2. Compile the device code into assembly form (PTX code) and further into binary objects that are handed to the GPU for execution.

      3. Replace the non-standard-C syntax in the host code (such as the <<<...>>> execution configuration) with calls to the CUDA runtime, then compile and link the host code through the normal compilation process so that it runs on the CPU.

   

    // Host code: launch the kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

// Device code: each thread adds one pair of elements.
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
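
    For completeness, here is a minimal host-side sketch around this launch. The wrapper name addWithCuda, the buffer names dev_a/dev_b/dev_c, and the omission of error checking are assumptions for illustration; they are not part of the original sample.

    // A sketch of the host code driving addKernel (assumed wrapper, not the original sample).
    void addWithCuda(int *c, const int *a, const int *b, int size)
    {
        int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
        size_t bytes = size * sizeof(int);

        // Allocate device memory for the two inputs and the output.
        cudaMalloc(&dev_a, bytes);
        cudaMalloc(&dev_b, bytes);
        cudaMalloc(&dev_c, bytes);

        // Copy the inputs from host memory to device memory.
        cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

        // Launch one block of `size` threads, one thread per element.
        addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

        // Copy the result back to host memory (this call waits for the kernel to finish).
        cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);

        // Free device memory.
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
    }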

     Initialization

     Generally speaking, there is no explicit initialization call: the runtime is initialized the first time a runtime function is called, and the GPU side is set up at that point.

     During initialization, a CUDA context is created on the device. This context is the primary context and is shared by all host threads.

      cudaDeviceReset() destroys the current primary context; a new primary context is created the next time a runtime function is called.
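
      A small sketch of this behavior (the buffer size here is arbitrary and error checking is omitted):

      // The first runtime call implicitly initializes the runtime and creates
      // the device's primary context.
      float *d_buf = NULL;
      cudaMalloc(&d_buf, 256 * sizeof(float));
      cudaFree(d_buf);

      // Destroy the primary context and release the device state.
      cudaDeviceReset();

      // The next runtime call re-creates the primary context.
      cudaMalloc(&d_buf, 256 * sizeof(float));
      cudaFree(d_buf);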

 

    Device Memory

      Device memory can be allocated either as linear memory or as CUDA arrays.

      CUDA arrays are related to texture and surface memory; we will discuss them later.

        Linear memory is typically allocated with cudaMalloc(), freed with cudaFree(), and data is transferred between host memory and device memory with cudaMemcpy().

        To allocate memory for 2D or 3D arrays, use cudaMallocPitch() and cudaMalloc3D(); the corresponding cudaMemcpy2D() and cudaMemcpy3D() are used to copy the data.
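
        A minimal sketch of a pitched 2D allocation and copy (the sizes, the host buffer, and the lack of error checking are assumptions for illustration; cudaMallocPitch() may return a pitch larger than the requested row width for alignment):

        // Allocate a width x height array of floats on the device with cudaMallocPitch()
        // and copy a tightly packed host array into it with cudaMemcpy2D().
        int width = 64, height = 64;
        size_t widthBytes = width * sizeof(float);

        float *h_data = (float*)malloc(widthBytes * height);  // host array, row-major

        float *d_data = NULL;
        size_t pitch = 0;                                      // actual device row stride in bytes
        cudaMallocPitch(&d_data, &pitch, widthBytes, height);

        // Rows on the host have stride widthBytes; rows on the device have stride pitch.
        cudaMemcpy2D(d_data, pitch, h_data, widthBytes,
                     widthBytes, height, cudaMemcpyHostToDevice);

        cudaFree(d_data);
        free(h_data);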

  

      Shared Memory

        Shared memory is declared with the __shared__ qualifier. It is much faster than global memory.

 

        As seen in the previous section, when we optimize the GPU computation with blocked (tiled) matrices, each thread performs only BLOCK_SIZE multiplications per tile. Therefore each element Cij has to be accumulated over several tiles, and this accumulation has to be done through shared memory and the thread-synchronization mechanism.

        

The following code sample is an implementation of matrix multiplication that does take
advantage of shared memory. In this implementation, each thread block is responsible
for computing one square sub-matrix Csub of C and each thread within the block is
responsible for computing one element of Csub. As illustrated in Figure 10, Csub is equal
to the product of two rectangular matrices: the sub-matrix of A of dimension (A.width,
block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension
(block_size, A.width) that has the same column indices as Csub. In order to fit into the
device's resources, these two rectangular matrices are divided into as many square
matrices of dimension block_size as necessary and Csub is computed as the sum of the
products of these square matrices. Each of these products is performed by first loading
the two corresponding square matrices from global memory to shared memory with one
thread loading one element of each matrix, and then by having each thread compute one
element of the product. Each thread accumulates the result of each of these products into
a register and once done writes the result to global memory.
By blocking the computation this way, we take advantage of fast shared memory and
save a lot of global memory bandwidth since A is only read (B.width / block_size) times
from global memory and B is read (A.height / block_size) times.
The Matrix type from the previous code sample is augmented with a stride field, so that
sub-matrices can be efficiently represented with the same type. __device__ functions are
used to get and set elements and build any sub-matrix from a matrix.
// Thread block size
#define BLOCK_SIZE 16

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub.
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

 

Reposted from: https://www.cnblogs.com/stormhan/p/5465149.html
