CUDA并行编程

iceylia

已于 2024-01-05 19:18:03 修改

阅读量1.2k

点赞数 17

文章标签： c语言

于 2024-01-05 16:15:06 首次发布

本文链接：https://blog.csdn.net/m0_46228204/article/details/134488990

版权

引入

并行能帮我们提高代码性能，而GPU在硬件上是能够更好地实现一些并行工作的，能够更大程度地并行。

CUDA是SIMT模型，只用写一个程序，其中就包含了CPU部分和GPU并行部分的代码，那么就需要指明哪些是CPU（一般称为HOST），哪些是在GPU上运行(一般称为device)。

device变量 对于变量，也就是数据，我们需要存放在GPU上，以便GPU端代码调用，我们没有创建特别的数据类似，而是使用指针，在分配内存时分配的是device端内存。

核函数 对于代码，如何让它在GPU上运行呢，CUDA定义了一种device函数类型，一般形式如下。如果是GPU的入口函数，也就是第一个调用的函数(其它的device函数一般在这个入口函数中调用)，cuda称为核函数。

__global__ void Kernel(参数)
{
	//GPU并行代码
}

核函数的调用也是不一样的。我们在main函数中这么调用它，比寻常的函数调用多了个<<<>>>，里面的参数是调用并行资源的参数，之后会提到。其中d_A, d_B此时都已经是指针，指向的是GPU中的地址(显存)，而不是CPU。而n只是一个int型的，只用传形参即可，不用指针。

vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);

GPU的结构

硬件层面：GPU被分成一个个的SM（Streaming Multiprocessor计算单元），资源是以一个SM为单位的，寄存器、控制单元和缓存在SM里面都有，而一个GPU中有许多个SM，这也是GPU并行的硬件支持。其中sp就是基础的CUDA核心，用于处理线程中的指令，一个SM有多个sp。

warp ：warp是cuda编程中的重要概念，一个warp是一个线程束，如今包含了32个线程。GPU最基本的调度单元不是线程，而是一个warp，也就是32个线程，多个sp会共同处理一个warp的指令。

软件层面：grid->block->thread。grid是最大的单位，包含了一个程序中所有的thread，block的意义在于一个block下的所有线程必定在同一个SM下工作，并且共享内存是以block为单位来划分的，也就是说只有一个block中的线程之间能共享内存。

在编程时，我们需要注意的就是如何划分block和thread，之前提到，在调用核函数时，就能分配一个grid中有多少block，一个block中有多少thread。如下所示，为vecAddKernel函数分配了5个block，每一个block中有256个thread，即当调用这个函数时，将会有5*256个线程并行。

vecAddKernel<<<5,256>>>(d_A, d_B, d_C, n);

3维结构：grid中的block与block中的thread是可以以2维以及3维的形式构成的。我们可以用cuda提供的dim3结构来表示。如下所示，一个grid中的block是以100502的形式排列，block中的thread是以161616的形式排列。对于这样的结构，在处理一些图像等特殊结构时能够更加清晰。

dim3 DimGrid(100, 50, 2);
dim3 DimBlock(16, 16, 16);
PictureKernel<<<DimGrid,DimBlock>>>(d_Pin, d_Pout, m, n);

地址映射：现在我们将一个10241024的矩阵中每个值平方，那么构建了一个grid是3232，block是32*32的结构，让每个线程处理一个值。
传入的是二维数组的指针和矩阵的宽1024，为什么是指针之前提到过，device变量在CUDA中用指针指向device内存，那么对于二维数组，我们其实在用一维的方式求解。

dim3 DimGrid(32, 32, 1);
dim3 DimBlock(32, 32, 1);
sqKernel<<<DimGrid,DimBlock>>>(matrix, width);

对于核函数，我们现在只知道有1024*1024个线程在工作，怎么让它们工作到对应的值，就是地址映射需要做的事情。一个块中的线程有编号，一个grid中的block也有编号，通过这些编号，我们可以对应到矩阵。

__global__ void sqKernel(int* matrix, int width)
{
	int tx = threadIdx.x;//获取x方向线程号：0到31
	int ty = threadIdx.y;//获取y方向线程号：0到31
	int blockWidth = blockDim.x;//获取一个块在x方向宽度：32
	int tid = threadIdx.y * blockDim.x + threadIdx.x;//获取一个块中全局线程号：y*width+x 0到1023
	int bx = blockIdx.x;//获取x方向块号：0到31
	int by = blockIdx.y;//获取y方向块号：0到31
	int x = threadIdx.x + blockIdx.x*blockDim.x;//获取全局的x编号：0到1023
	int y = threadIdx.y + blockIdx.y*blockDim.y;//获取全局的y编号：0到1023
	int id = y * width + x;//获取全局id：y*width+x 0到1024*1024-1

	//matrix[id]平方
}

最后我们通过全局id，即可对应上需要处理的值在哪个位置。

结构优化：一个SM中一般有1536个线程，已知一个block中所有线程必定在一个SM中运行，所以为了让SM中的线程全部用上，我们在设计时需要让block的线程数是1536的公约数，比如256就比较合适，或者16*16。

device变量

初始化

前文已经提到过device变量实质上是在host初始化的，用指针分配的device内存

下面是一个host二维数组的变量，大小为HEIGHT*WIDTH

int* h_A;
size_t bytesCount = HEIGHT * WIDTH * sizeof(int);
h_A = new int[bytesCount];

delete[] h_A;

下面是一个device二维数组的变量，大小也为HEIGHT*WIDTH

int* d_A;
size_t bytesCount = HEIGHT * WIDTH * sizeof(int);
cudaMalloc((void **)&d_A, bytesCount);//需要用cudaMalloc来分配内存。

cudaFree(d_A);//用cudaFree来释放内存。

可以发现其中有一个问题就是cudaMalloc使用了双重指针，这是由于cudaMalloc依旧有一个错误信息的返回值，那么如果需要将d_A修改，需要传的不是一个空壳，而是需要传地址，才能实现在函数中的变量值修改了，函数外的变量值也会改变。
所以就需要传地址，又由于传的本身就是一个指针，那么需要的是指针的地址，就是双重的void指针。

数据传递

对于一个device端的变量，d_A是不能直接在host下使用的，不能读也不能写。
那么怎么赋值，怎么接收数据，cuda中有一个函数可以做到cudaMemcpy

cudaMemcpy(d_A, h_A, bytesCount, cudaMemcpyHostToDevice);//d_A = h_A,从host拷贝到device

cudaMemcpy(h_A, d_A, bytesCount, cudaMemcpyDeviceToHost);//h_A = d_A,从device拷贝到host

此时已经明白了一个基本的cuda结构，可以去看一看文章最后的第一个矩阵加法是如何实现的。

共享内存与矩阵乘法

引入

我们直接用cudaMalloc创建的是位于global memory全局内存上的，最基本、容量最大，也是最慢的一个存储位置，share memory是比全局内存要快很多的，但缺点就是很小。
并且共享变量是在和函数上声明的，所以可以直接用二维数组，而不是一维转二维。

对于普通的矩阵乘法，在核函数中，我们可以直接将该线程对应的一整行和一整列都读取上来，然后相乘相加，如下所示。

float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k) {
Pvalue += M[y*Width+k]*N[k*Width+x];
}
P[y*Width+x] = Pvalue;

但是（0,0）线程和（0,1）线程都要读取第0行，分别与第一列和第二列相乘，显然重复读取了第0行。
于是定义了一个共享内存，想要直接先读取所有需要的数据到共享内存中，然后再通过共享内存计算。

//kernel假设块大小是16*16，矩阵是1024*1024，那么M需要16*1024的数据，N需要1024*16的数据。
__shared__ float ds_M[16][1024];//共享变量用__shared__修饰
__shared__ float ds_N[1024][16];
//...省略中间读取到共享内存部分.
for (int k = 0; k < Width; ++k) {
	Pvalue += ds_M[y*Width+k]*ds_N[k*Width+x];
}

但是共享内存是很小的，只能放很小的一部分。

算法

所以我们成几个部分读取，分块的大小为TILE_WIDTH，一般与块大小一致，所以这是也是16，如果是32也是没有问题的。

//
__shared__ float ds_M[16][TILE_WIDTH];//共享变量用__shared__修饰
__shared__ float ds_N[TILE_WIDTH][16];

那么现在就需要一个块一个块的读取，M矩阵向X方向读取，N向y方向读取
比如线程是(0,1)，那么M矩阵，读取的就是M[0, 0tilewidth+1], [0, 1tilewidth+1]…，那么线程(0,1)到(31,1)并行下，就能将第一列的所有数据读取进来。
N矩阵读取的就是N[0+tilewidth0, 1], N[0+tilewidth1, 1], N[0 + 16*tilewidth, 1]
写成代码就是

for (int p = 0; p < n/TILE_WIDTH; ++p) {//p代表的就是现在到第几个块了。
	ds_M[ty][tx] = M[y * width + p*TILE_WIDTH+tx];
	ds_N[ty][tx] = N[(p * TILE_WIDTH+ty)*width+ x];
	__syncthreads();//所有线程都读取完了第P块
	//然后就将这一块的结果暂存起来
	for (int i = 0; i < TILE_WIDTH; ++i)Pvalue += ds_M[ty][i] * ds_N[i][tx];
	__synchthreads();//等所有线程都计算完了这一个块再进行下一个块。
}
P[Row*Width+Col] = Pvalue;//最后所有块都执行完毕

一块一块的读取一块一块的计算。
还有一个问题没有提到就是边界处理，假如矩阵不是刚好1024而是1023，那么有些线程就需要进行边界处理。用if判断一下即可，具体处理方式可见最后完整代码。

control divergence

已知GPU调度的基本调度单位是warp，一个warp有32个线程，一个warp的线程一般都在一个块里面按照顺序的。
比如现在有（16*16）个线程在一个块里面，那么第(0,0)线程到(1,15)号线程就是一个warp，所以块一般是32的倍数。

而这一个warp假如工作内容是不同的，就会降低效率，本来一批的东西被分成多批，而不同warp本来就是不同批次。
我们称这种降低效率的情况叫control divergence
比如下面0到4干的活就与5到31干的事不一样。

if(tid < 5){
	printf("1");
}

假如我们改成下面就不会有这种问题。

if(tid < 32){
	printf("1");
}

原子操作与直方图

引入

现在的案例是有一串很长的英文字母，agfweifowjfei这样的，现在需要统计a到g的字母有多少个，g到o的字母有多少，以此类推
字符串已经长到我们不能让一个线程对应一个字母，所以我们还是需要分成多个部分。比如有102400个字母，分成1024个线程，一个线程处理100个字母。
那么第1个线程处理前100个，第二个处理第二批100个，实际上是这样吗？不是的，由于局部性原理，即假如缓存没有需要的数据，会将附近的数据读入到缓存中，所以我们在编程序时最好将要处理的数据放在一起。
现在的做法是第一个线程在处理第1个字母时，第二个线程在处理第101个，不能很好地利用缓存机制。

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride就是每一次处理多少字母，等于grid中所有线程。
  int stride = blockDim.x * gridDim.x;

  while (i < size)
  {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo[alphabet_position / 4]), 1);
    i += stride;
  }
}

可以看到上面的代码处理的就是相近的数据，通过stride步长来定位到第n次批次，比如现在线程全局id是i，第5次处理，那么对应的字母就是buffer[5*stride+i];
其中有一个新的函数atomicAdd（）,这个函数是原子操作，如果多个线程同时对一个类别进行加，就可能造成竞争，所以这个加法操作就是临界区
为了防止多线程对一个变量同时操作，cuda使用原子操作来处理这种情况。
int atomic(int * address, int val)这个函数有多个重构，可以处理int, unsigned int, unsigned long long , float。其返回值是原子操作之前的旧值，新值通过指针已经放在变量中了。
同样，可以看后文的完整代码。

共享直方图算法

对于一个算法在一步步地向后计算的途中，对于buffer每次读取的位置是不同的，无法优化，但是histo却只有7个，多次重复读取，所以可以使用共享内存。
可以在一个块中放一个共享的私有的histo，用来作为中转暂存，等将所有字母统计完毕，将私有的histo再汇集到全局的histo上。
大致代码如下，完整代码附在后文，前7个线程同时负责私有histo的初始化和汇集，不可避免地有control divergence

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];
    // 初始化值共享内存为0；
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();
    while (i < size)
    {
        atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }
    // 等待所有线程执行完毕
    __syncthreads();

    if (threadIdx.x < 7)
    {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}

卷积计算

引入

cuda并行也非常适合卷积计算。
卷积：一个图像有一个卷积核和一个输入数组，卷积核的大小一般比输入数组小很多。比如输入数组是一个2维的图像像素，10241024，那么卷积核可能是33，对于每一个像素点，卷积核将先将自己矩阵转置，再取周围3*3的像素点，与自身进行一定操作，最后得到的值就是这个像素点新的值。

卷积核：之前提到卷积核在进行卷积的时候需要先将自己进行转置，所以对称的卷积核方便操作。

例子：现在一维数组nums={1234567}，卷积核为M={34543}，现在需要对nums[2]进行卷积操作了，那么长度为5的卷积核将取以nums[2]为中心左右两边的数进行操作，最后新的nums[2] = nums[0]*M[0]+nums[1]*M[1]+nums[2]*M[2]+nums[3]*M[3]+nums[4]*M[4]=57

幽灵元素：假如上面例子中，轮到nums[0]进行卷积，左边已经没有元素，那么就直接补0，称为幽灵元素。

那么在写核函数的时候，只需要注意当前从哪个地方开始乘即可。

int start_point = i – (Mask_Width/2);//2为中心，即2 - 5 / 2 = 0,从0开始积和。

对于2d的卷积函数也是如此。

int N_start_col = Col - (maskwidth / 2);
int N_start_row = Row - (maskwidth / 2);

如果想要看完整代码附在后文。

共享内存卷积

与矩阵乘法共享类似，卷积也是在计算本像素点数值时需要其它像素点的数值，所以可以用类似的思想，将共用的地方用共享内存先读取。

与矩阵乘法不同的是，卷积需要多处理块的边界问题，比如现在一个块16*16大小，那么在块中第0号线程和第255号线程在进行读取时很有可能无法读取到周围块的元素。
接下来介绍的算法将共享内存=block大小>计算大小，就是需要计算的实际上比block线程小一圈，这样就能读取到周围的元素。

#define O_TILE_WIDTH 12//块大小是16*16，mask大小为5*5，那么12*12就是实际读取的大小
#define BLOCK_WIDTH 16

那么这个时候全局的x和全局的y已经不能和需要计算的xy对应起来了，这个时候一个很巧妙的做法就是直接想象块的大小就是12*12，而不是16，在计算地址的时候按照O_TILE_WIDTH 16来计算。

__shared__ int Ns[BLOCK_WIDTH][BLOCK_WIDTH];//16*16的共享内存
int row_o = blockIdx.y * O_TILE_WIDTH + ty;
int col_o = blockIdx.x * O_TILE_WIDTH + tx;
int row_i = row_o - 2;//输入地址的映射不变，照常-2，不过之前是y和x，现在是row_o
int col_i = col_o - 2;

由于读取时需要所有线程都工作，所以我们不必加什么限制,比如现在是第0块，那么o就是0到11，读取时就是-2到13，到第二块的时候输出index就是1*12+tx,就是12到23，读取的index就是12-2到12-2+16，完美对应。

if ((row_i >= 0) && (row_i < height) && (col_i >= 0) && (col_i < width))
{
    Ns[ty][tx] = in[row_i * width + col_i];
}
else
{
    Ns[ty][tx] = 0;
}

接下来的重要一环就是计算，首先只需要1212参与计算，这里选择了前面的1212个线程
不是选择中间的12*12那么就有一个问题就是当线程是0,0的时候，中心点其实在2，2，所以需要计算的是00到44与mask的00到44

int value = 0;
if (ty < O_TILE_WIDTH && tx < O_TILE_WIDTH)
{
    for (int i = 0; i < maskwidth; i++)
    {
        for (int j = 0; j < maskwidth; j++)
        {
            value += mask[i * width + j] * Ns[i + ty][j + tx];
        }
    }
}

restrict

对于只读的指针，我们可以通过__restrict__来修饰
如const int * __ restrict__ M

reduction规约

simple

规约和扫描都是cuda并行的计算模式。
现在是一个规约的例子，为了求一个矩阵每一行的和。直接用共享内存
reduction就是多个数逐渐减少，最后汇聚到一个变量上，这里的代码就是用一个块计算的是两个块大小的数，为了一个块计算一个数，所以一个块有512个线程，共1024个块，那么前512个线程就是第一行。

__global__ void rowSum(int *input, int *output, int width)
{
    __shared__ int partialSum[2 * BLOCK_SIZE];//共享内存大小是block的两倍,1024
    unsigned int tx = threadIdx.x;
    unsigned int start = blockIdx.x * 2 * blockDim.x;//块为0的时候，start=512，第一个块就读取了1024个数据。
    partialSum[tx] = input[start + tx];//一个线程负责读取两个数据。
    partialSum[blockDim.x + tx] = input[start + blockDim.x + tx];
}

一个块用一个stride，每次翻倍，通过tx % stride == 0。然后通过2*tx就能定位到，为什么是2倍就是因为共享内存是两倍。
stride=1的时候，就是(0,1), (2, 3), (4,5)
stride=2的时候，就是(0,2),(4, 6), (8, 10)
stride=512的时候，就是(0, 512)

// reduce
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (tx % stride == 0)
    {
        partialSum[2 * tx] += partialSum[2 * tx + stride];
    }
}
// 将最后的结果存放到
if (tx == 0)
{
    output[blockIdx.x] = partialSum[0];
}

better

前一个版本的在运行到后面的时候，运行的线程分布很开，不仅有control divergence而且不能很好地利用缓存。
下面的版本就是将原本stride从小到大改成了从大到小。

for (unsigned int stride = blockDim.x;stride > 0; stride /= 2)
{
	__syncthreads();
	if (t < stride)
		partialSum[t] += partialSum[t+stride];
}

现在比如一个block有1024个线程，那么在stride到32之前，都没有divergence，只有到16,8之后才会有。

scan扫描

work_inefficient

scan扫描的一个例子就是计算前缀和。现在有一个二维矩阵，需要计算其前缀和，那么我们首先需要计算每一行的前缀和。
下面就是每一行计算前缀和的代码，所以我们现在一个块的大小就是一行的大小，读取就是一一对应到一个共享内存中。
然后是迭代扫描计算，inefficient的扫描就是每次获取stide2的数据，比如现在是stride=8，那么tid=17的时候，获取的是28个数据，2到17的和。
stride = 1的时候0:0, 1:01, 2:12, 3:23, 4:34, 5:45, 6:56…
stride = 2的时候0:0, 1:01, 2:2+0=012, 3:1+3=0123, 4:2+4=1234, 5:5+3=2345
stride = 4的时候0:0, 1:01, 2:012, 3:0123, 4:1234, 5:5+1=012345

__global__ void work_inefficient_scan_kernel(int *inputMatrix, int *output, int width)
{
    __shared__ int XY[SECTION_SIZE];
    int tid = threadIdx.x;                 // 线程内的局部id
    int i = blockIdx.x * blockDim.x + tid; // 全局id
    if (i < width * width)
    {
        XY[tid] = inputMatrix[i] * inputMatrix[i];
    }
    // 执行迭代扫描
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
    {
        __syncthreads();
        int in1 = 0;
        if (tid >= stride)
        {
            in1 = XY[tid - stride];
        }
        __syncthreads();
        XY[tid] += in1;
    }
    __syncthreads();
    if (i < width * width)
    {
        output[i] = XY[tid];
    }
}

最后的效率是O(N* log2(N)) ，这串代码可能会在执行资源饱和的情况下，效率低于串行。

work_efficient

首先我们可以用之前的思想，一个块读取两个块的数据

#define BLOCK_SIZE 512
#define SECTION_SIZE 1024

 __shared__ int XY[SECTION_SIZE];
 unsigned int tx = threadIdx.x;
 unsigned int start = blockIdx.x * 2 * blockDim.x;
 XY[tx] = inputMatrix[start + tx];
 XY[blockDim.x + tx] = inputMatrix[start + blockDim.x + tx];

这是用到了二叉树的思想，首先将子节点的值都聚集到身上。
在这里插入图片描述

for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    int index = (threadIdx.x + 1) * 2 * stride - 1;
    if (index < SECTION_SIZE)
    {
        XY[index] += XY[index - stride];
    }
}

下面的从根开始遍历，第一个index的值就是第一层节点，让第一层节点和右边的第二层节点相加，形成新的第一层节点
在这里插入图片描述
到步长为1的时候，就轮到叶子节点了，叶子节点与之前已经得到的前缀根相加，叶子节点就也能存储前缀和。

代码如下

for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    __syncthreads();
    int index = (threadIdx.x + 1) * stride * 2 - 1;
    if (index + stride < SECTION_SIZE)
    {
        XY[index + stride] += XY[index];
    }
}

最后的时间复杂度差不多是O(log2(N)*log2(N))，并且更节省资源。
完整代码附在后文。

稀疏矩阵

下面是一个例子，当然这个矩阵不够大，零也不够多，只是方便学习。现在需要nn的稀疏矩阵与n1的矩阵X相乘
在这里插入图片描述
对于这种乘法，我们不需要关注零，只需要将第一行的非零元素与矩阵X对应位置相乘相加就是第一个值。

CSR

通过三个数组来存储稀疏矩阵，第一个数组data存放非零元素，可能就是data[7]={3,1,2,4,1,1,1}
第二个数组col_index对应的是上一个数组元素对应的列,col_index[7] = {0,2,1,2,3,0,3}
第三个数组用来存放行的信息，比如data的第一行元素data[0]和data[1],第二行没有，第三行是2到4，那么row_ptr={0,2,2,5,7}

CSR能不能做到随机读取，就是我想读到哪一行就读哪一行，而不需要遍历，显然是可以的，但是不能直接定位到列，只能定位到行。
比如我读取ROW 2的代码如下

for(int index = row_ptr[2]; index<row_ptr[3]; index++){
	int col = col_index[index];//对应原本矩阵的坐标就是(2, col)，值是value
	int value = data[index];
}

那么如何将稀疏矩阵乘法并行化就很清晰了，每一个线程处理一行。

int temp = 0;
int row = blockIdx.x * blockDim.x + threadIdx.x;
for(int index = row_ptr[row]; index<row_ptr[row+1]; index++){
	int col = col_index[index];//对应原本矩阵的坐标就是(row, col)，值是value
	temp += data[index] * X[col];
}

但是很显然，当稀疏矩阵一些行有特别多的非零元素，与一些行全是零的时候，其对应的线程就会出现负载不均衡的情况
并且control divergence的现象严重。

ELL

首先将进行压缩，将有最多非零元素的那一行作为矩阵的宽，非零元素全部向左边靠，列信息保留。
在这里插入图片描述
这样我们就知道一行是3个元素，然后遍历3次就可以处理完一行的数据。
但还有一个问题就是当多个线程都在处理一行的第一个元素的时候，没有很好利用缓存。
所以我们在存储values时，可以采用列优先的方式。321, 141,*1,对应的列坐标也是一样010,223,*3.
最后计算

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index, int num_elem, float *x, float *y)
{
	int row = blockIdx.x * blockDim.x + threadIdx.x;
	int temp = 0;
	if (row < num_rows) {//不要超过行数
		for (int i = 0; i < num_elem; i++) {
			col = col_index[row+i*num_rows];
			temp += data[row+i*num_rows] * X[col];//所以在计算data坐标和col时，用的是row+i*num_rows，将行优先的行列坐标转成列优先坐标
		}
	}

这个方案也有缺点，就是当你某一行的非零元素特别多的时候，需要很多存储空间，极端的就是某一行一个零没有，就是原本矩阵了，没有压缩一点。

COO

这种压缩矩阵的模式是将每个非零元素对应的行和列都用数组储存，
缺点是当你用这种方式存储时，无法直接得到它们位置上的对应关系，比如哪些数据是在同一行。
所以一般我们是将coo与其它的方式结合起来，用以减少其存储空间。

比如将ell和coo结合起来。
将ell中非零元素多的那几列用coo存储，比如上面的案例中，将第三列的1用coo单独存储，就是
data = 1, rowindex = 2, colindex = 3,那么coo的存储数组中只有2列，断了一截

//ell
for (i = 0; i < num_rows_ell; i++) {//此时num_rows = 2
	int col = col_index_ell[row + i * num_rows];
	value += data_ell[row+i*num_rows] * x[col];
}
//COO
for (i = 0; i < num_elem_coo; i++) {
	if (row_index_coo[i] == row) {//遍历coo中的所有元素，看哪个是在这一行里面的。
		dot += data_coo[i] * x[col_index_coo[i]];
	}
}

JDS

最后还有一种就是JDS模式，在ELL的基础上进行行变换，按照长度从长到短排列
存储方式采用的CSR的数组存储，只不过需要多一个行的数组。现在的行数组就是row_index={2,0,3,1}row2是最长的，所以放到第一行了。
data数组和col数组也是，不同的是rowptr存储不是行从哪到哪，而是改成了部分，相同非零元素的是同一部分，row0和row3就是一部分，而row1由于没有元素，这个部分就不参与计算。
每一部分单独有一个核函数进行

merge sort

merge sort就是将两个已经排好序的数组合成一个数组，听起来应该是顺序执行的。
如果我们将原本两个数组叫做AB，第三个数组叫做C，那么假如以9为块，前9个用一个线程跑，后9个用一个线程跑。
但问题在于，现在你知道C需要9个数，那么多少来源于A，多少来源于B呢。
这个计算多少数据来源于A是能求出来的，将这个函数称作co_rank
在这里插入图片描述
其中k是指需要从A和B中取K个数，m是A的长度，n是B的长度，最后返回的i是说从A中取i个数，即0到i-1。
那么从B中取多少个数就等于k-i

__global__ void merge_basic_kernel(int* A, int m, int* B, int n, int* C){
	int id= blockIdx.x*blockDim.x + threadIdx.x;
	// 一个线程处理一部分，这一部分就是m+n/所有线程数
	int k_curr = id*ceil((m+n)/(blockDim.x*gridDim.x);
	int k_next = min((id+1) * ceil((m+n)/(blockDim.x*gridDim.x)), m+n);
	int i_curr= co_rank(k_curr, A, m, B, n);
	// 执行两遍co_rank就能获取到需要从哪到哪
	int i_next = co_rank(k_next, A, m, B, n);
	int j_curr = k_curr -i_curr;
	int j_next = k_next-i_next;
	/* 知道这些信息后就能执行 */
	merge_sequential(&A[i_curr], i_next-i_curr, &B[j_curr], j_next-j_curr,
	&C[k_curr] );
}

完整代码实例：矩阵加法

两矩阵相加，实现2个矩阵（Width=2048, Height=1024）的相加

这个代码就是需要注意height是矩阵中的行，width是列，在计算全局id的时候需要用y*width + x

c++代码

#include <iostream>
#include <time.h>
#include <vector>
using namespace std;
int HEIGHT = 1024, WIDTH = 2048;
int main(void)
{
    clock_t start, end;
    double cpu_time_used;
    start = clock();
    vector<vector<int>> h_A(HEIGHT, vector<int>(WIDTH, 4)), h_B(HEIGHT, vector<int>(WIDTH, 2)), h_C(HEIGHT, vector<int>(WIDTH));
    for (int i = 0; i < HEIGHT; i++)
    {
        for (int j = 0; j < WIDTH; j++)
        {
            h_C[i][j] = h_A[i][j] + h_B[i][j];
        }
        cout << endl;
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    for (int i = 0; i < HEIGHT; i++)
    {
        for (int j = 0; j < WIDTH; j++)
        {
            cout << h_C[i][j] << "\t";
        }
    }
    printf("\nTime taken: %f seconds\n", cpu_time_used);
    return 0;
}

cuda代码

#include <iostream>
#include <time.h>
#include <cuda_runtime.h>
#include <vector>

using namespace std;

int HEIGHT = 1024, WIDTH = 2048;

__global__ void matrix_addition(int *A, int *B, int *C, const int height, const int width)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int id = y * width + x;
    if (x < width && y < height)
    {
        C[id] = A[id] + B[id];
    }
}

int main(void)
{
    float elapsedTime;
    cudaEvent_t start, stop;
    // 初始化 CUDA 事件
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // 记录开始事件
    cudaEventRecord(start, 0);

    int *h_A, *h_B, *h_C;
    int *d_A, *d_B, *d_C;
    size_t bytesCount = HEIGHT * WIDTH * sizeof(int);
    h_A = new int[bytesCount];
    h_B = new int[bytesCount];
    h_C = new int[bytesCount];

    // 初始化输入矩阵
    for (int i = 0; i < HEIGHT; i++)
    {
        for (int j = 0; j < WIDTH; j++)
        {
            h_A[i * WIDTH + j] = 4;
            h_B[i * WIDTH + j] = 2;
        }
    }

    // 分配设备内存，将数据从主机复制到设备
    cudaMalloc((void **)&d_A, bytesCount);
    cudaMalloc((void **)&d_B, bytesCount);
    cudaMalloc((void **)&d_C, bytesCount);
    cudaMemcpy(d_A, h_A, bytesCount, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytesCount, cudaMemcpyHostToDevice);

    // 核函数计算
    dim3 block(32, 32);
    dim3 grid(64, 32);
    matrix_addition<<<grid, block>>>(d_A, d_B, d_C, HEIGHT, WIDTH);

    // 记录结束事件
    cudaEventRecord(stop, 0);
    // 同步事件，确保所有任务完成
    cudaEventSynchronize(stop);
     // 计算经过的时间
    cudaEventElapsedTime(&elapsedTime, start, stop);


    // 将数据从设备复制回主机
    cudaMemcpy(h_C, d_C, bytesCount, cudaMemcpyDeviceToHost);

    // 输出结果
    for (int i = 0; i < HEIGHT; i++)
    {
        for (int j = 0; j < WIDTH; j++)
        {
            printf("%d ", h_C[i * WIDTH + j]);
        }
        printf("\n");
    }
    // 输出时间
    printf("Elapsed time: %3.1f ms\n", elapsedTime);
    // 销毁事件
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // 释放内存
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaDeviceReset();

    return 0;
}

完整代码实例：用块并行计算矩阵乘法

#include <iostream>
#include <time.h>
#include <cuda_runtime.h>
#include <vector>
#define TILE_WIDTH 16
#define TEMP_WIDTH 2047 //
using namespace std;
int WIDTH_A = 1023, HEIGHT_A = 2047, WIDTH_B = 2047, HEIGHT_B = 511, HEIGHT_C = 511, WIDTH_C = 1023;

__global__ void matrix_addition(int *A, int *B, int *C, const int height, const int width)
{
    __shared__ int ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ int ds_B[TILE_WIDTH][TILE_WIDTH];

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    int value = 0;
    for (int i = 0; i < (TEMP_WIDTH - 1) / TILE_WIDTH + 1; i++)
    { // 每次遍历一个块
        if (y < height && i * TILE_WIDTH + tx < TEMP_WIDTH)
        {
            ds_B[ty][tx] = B[y * TEMP_WIDTH + (i * TILE_WIDTH + tx)]; // id = y * width + x，每次遍历x+Tile_WIDTH,B的坐标用的是B的width
        }
        else
        {
            ds_B[ty][tx] = 0;
        }
        if (i * TILE_WIDTH + ty < TEMP_WIDTH && x < width) // 新的x不能越B的x界，新的y不能越A的y界
        {
            ds_A[ty][tx] = A[(i * TILE_WIDTH + ty) * width + x]; // A的坐标用的是A的width，即C的width
        }
        else
        {
            ds_A[ty][tx] = 0;
        }
        __syncthreads(); // 等待这个块内所有线程执行完毕,
        if (y < height && x < width)
        {
            for (int k = 0; k < TILE_WIDTH; k++)
            {
                value += ds_B[ty][k] * ds_A[k][tx]; // 再将第i个块所有线程的结果放到value里面。
            }
        }

        __syncthreads();
    }
    if (y < height && x < width)
    {
        C[y * width + x] = value;
    }
}

int main(void)
{
    float elapsedTime;
    cudaEvent_t start, stop;
    // 初始化 CUDA 事件
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // 记录开始事件
    cudaEventRecord(start, 0);

    int *h_A, *h_B, *h_C;
    int *d_A, *d_B, *d_C;
    size_t bytesCountA = HEIGHT_A * WIDTH_A * sizeof(int);
    size_t bytesCountB = HEIGHT_B * WIDTH_B * sizeof(int);
    size_t bytesCountC = HEIGHT_C * WIDTH_C * sizeof(int);
    h_A = new int[bytesCountA];
    h_B = new int[bytesCountB];
    h_C = new int[bytesCountC];

    // 初始化输入矩阵
    for (int i = 0; i < HEIGHT_A; i++)
    {
        for (int j = 0; j < WIDTH_A; j++)
        {
            h_A[i * WIDTH_A + j] = 4;
        }
    }
    for (int i = 0; i < HEIGHT_B; i++)
    {
        for (int j = 0; j < WIDTH_B; j++)
        {
            h_B[i * WIDTH_B + j] = 2;
        }
    }

    // 分配设备内存，将数据从主机复制到设备
    cudaMalloc((void **)&d_A, bytesCountA);
    cudaMalloc((void **)&d_B, bytesCountB);
    cudaMalloc((void **)&d_C, bytesCountC);
    cudaMemcpy(d_A, h_A, bytesCountA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytesCountB, cudaMemcpyHostToDevice);

    // 核函数计算
    dim3 block(16, 16);
    dim3 grid(64, 32);
    matrix_addition<<<grid, block>>>(d_A, d_B, d_C, HEIGHT_C, WIDTH_C);

    // 记录结束事件
    cudaEventRecord(stop, 0);
    // 同步事件，确保所有任务完成
    cudaEventSynchronize(stop);
    // 计算经过的时间
    cudaEventElapsedTime(&elapsedTime, start, stop);

    // 将数据从设备复制回主机
    cudaMemcpy(h_C, d_C, bytesCountC, cudaMemcpyDeviceToHost);

    // 输出结果
    for (int i = 0; i < HEIGHT_C; i++)
    {
        for (int j = 0; j < WIDTH_C; j++)
        {
            printf("%d ", h_C[i * WIDTH_C + j]);
        }
        printf("\n");
    }

    // 输出时间
    printf("Elapsed time: %3.1f ms\n", elapsedTime);

    // 销毁事件
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // 释放内存
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaDeviceReset();

    return 0;
}

完整代码实例：一般的直方图统计字母频率

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride就是每一次处理多少字母，等于grid中所有线程。
  int stride = blockDim.x * gridDim.x;

  while (i < size)
  {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo[alphabet_position / 4]), 1);
    i += stride;
  }
}
int main(void)
{
  unsigned char *buffer, *d_buffer;
  unsigned int *histo, *d_histo;
  int histoSize = 7;
  long size = 1024;
  size_t byteSize = sizeof(unsigned char) * size;
  buffer = new unsigned char[byteSize];
  histo = new unsigned int[sizeof(unsigned int) * histoSize];
  cudaMalloc((void **)&d_buffer, sizeof(unsigned char) * size);
  cudaMalloc((void **)&d_histo, sizeof(unsigned int) * histoSize);

  for (long i = 0; i < size; i++)
  {
    buffer[i] = (char)(i % 26 + 'a');
  }
  for (long i = 0; i < histoSize; i++)
  {
    histo[i] = 0;
  }
  cudaMemcpy(d_buffer, buffer, byteSize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_histo, histo, sizeof(unsigned int) * histoSize, cudaMemcpyHostToDevice);
  histo_kernel<<<2, 256>>>(d_buffer, size, d_histo);
  cudaDeviceSynchronize();
  cudaMemcpy(histo, d_histo, sizeof(unsigned int) * histoSize, cudaMemcpyDeviceToHost);
  for (long i = 0; i < histoSize; i++)
  {
    printf("histo[%ld]:%d\n", i, histo[i]);
  }
  cudaDeviceReset();
  delete[] buffer;
  delete[] histo;
  cudaFree(d_buffer);
  cudaFree(d_histo);
  return 0;
}

完整代码实例：共享内存的直方图

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // 初始化值共享内存为0；
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride=grid上所有线程
    int stride = blockDim.x * gridDim.x;
    while (i < size)
    {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }
    // 等待所有线程执行完毕
    __syncthreads();

    if (threadIdx.x < 7)
    {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}

int main(void)
{
    unsigned char *buffer, *d_buffer;
    unsigned int *histo, *d_histo;
    int histoSize = 7;
    long size = 1024;
    size_t byteSize = sizeof(unsigned char) * size;
    buffer = new unsigned char[byteSize];
    histo = new unsigned int[sizeof(unsigned int) * histoSize];
    cudaMalloc((void **)&d_buffer, sizeof(unsigned char) * size);
    cudaMalloc((void **)&d_histo, sizeof(unsigned int) * histoSize);

    for (long i = 0; i < size; i++)
    {
        buffer[i] = (char)(i % 26 + 'a');
    }
    for (long i = 0; i < histoSize; i++)
    {
        histo[i] = 0;
    }
    cudaMemcpy(d_buffer, buffer, byteSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_histo, histo, sizeof(unsigned int) * histoSize, cudaMemcpyHostToDevice);
    histo_kernel<<<2, 256>>>(d_buffer, size, d_histo);
    cudaDeviceSynchronize();
    cudaMemcpy(histo, d_histo, sizeof(unsigned int) * histoSize, cudaMemcpyDeviceToHost);
    for (long i = 0; i < histoSize; i++)
    {
        printf("histo[%ld]:%d\n", i, histo[i]);
    }
    cudaDeviceReset();
    delete[] buffer;
    delete[] histo;
    cudaFree(d_buffer);
    cudaFree(d_histo);
    return 0;
}

完整代码实例：2D卷积模板计算

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__global__ void convolution_2D_basic_kernel(unsigned int *in, unsigned int *mask, unsigned int *out, int maskwidth, int w, int h)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h)
    {
        int pixVal = 0;
        int N_start_col = Col - (maskwidth / 2);
        int N_start_row = Row - (maskwidth / 2);
        // Get the of the surrounding box
        for (int j = 0; j < maskwidth; ++j)
        {
            for (int k = 0; k < maskwidth; ++k)
            {
                int curRow = N_start_row + j;
                int curCol = N_start_col + k;
                // Verify we have a valid image pixel
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w)
                {
                    pixVal += in[curRow * w + curCol] * mask[j * maskwidth + k];
                }
            }
        }
        // Write our new pixel value out
        out[Row * w + Col] = (unsigned int)(pixVal);
    }
}
int main()
{
    unsigned int *mask, *d_mask, *d_output;
    unsigned int *input, *d_input, *output;
    int width = 1024, height = 1024;
    int maskwidth = 5;
    size_t inputSize = sizeof(unsigned int) * width * height;
    size_t maskSize = sizeof(unsigned int) * maskwidth * maskwidth;
    mask = new unsigned int[maskSize];
    input = new unsigned int[inputSize];
    output = new unsigned int[inputSize];

    cudaMalloc((void **)&d_mask, maskSize);
    cudaMalloc((void **)&d_input, inputSize);
    cudaMalloc((void **)&d_output, inputSize);

    for (int y = 0; y < maskwidth; y++)
    {
        for (int x = 0; x < maskwidth; x++)
        {
            mask[y * width + x] = 2;
        }
    }
    srand(time(NULL));
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            input[y * width + x] = rand() % 200;
            output[y * width + x] = 0;
        }
    }
    cudaMemcpy(d_mask, mask, maskSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_input, input, inputSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_output, output, inputSize, cudaMemcpyHostToDevice);

    dim3 DimGrid(64, 64, 1);
    dim3 DimBlock(16, 16, 1);
    convolution_2D_basic_kernel<<<DimGrid, DimBlock>>>(d_input, d_mask, d_output, maskwidth, width, height);
    cudaDeviceSynchronize();
    cudaMemcpy(output, d_output, inputSize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            printf("outpu[%d][%d]:%d\n", i, j, output[i * width + j]);
        }
    }
    cudaDeviceReset();
    return 0;
}

完整代码实例：共享内存的卷积模板计算

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define O_TILE_WIDTH 12
#define BLOCK_WIDTH 16
__global__ void convolution_2D_basic_kernel(unsigned int *in, unsigned int *mask, unsigned int *out, int maskwidth, int width, int height)
{
    __shared__ int Ns[BLOCK_WIDTH][BLOCK_WIDTH];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row_o = blockIdx.y * O_TILE_WIDTH + ty;
    int col_o = blockIdx.x * O_TILE_WIDTH + tx;
    int row_i = row_o - 2;
    int col_i = col_o - 2;
    if ((row_i >= 0) && (row_i < height) && (col_i >= 0) && (col_i < width))
    {
        Ns[ty][tx] = in[row_i * width + col_i];
    }
    else
    {
        Ns[ty][tx] = 0;
    }
    __syncthreads();
    int value = 0;
    if (ty < O_TILE_WIDTH && tx < O_TILE_WIDTH)
    {
        for (int i = 0; i < maskwidth; i++)
        {
            for (int j = 0; j < maskwidth; j++)
            {
                value += mask[i * width + j] * Ns[i + ty][j + tx];
            }
        }
    }
    if (row_o < height && col_o < width)
        out[row_o * width + col_o] = value;
}
int main()
{
    unsigned int *mask, *d_mask, *d_output;
    unsigned int *input, *d_input, *output;
    int width = 1024, height = 1024;
    int maskwidth = 5;
    size_t inputSize = sizeof(unsigned int) * width * height;
    size_t maskSize = sizeof(unsigned int) * maskwidth * maskwidth;
    mask = new unsigned int[maskSize];
    input = new unsigned int[inputSize];
    output = new unsigned int[inputSize];

    cudaMalloc((void **)&d_mask, maskSize);
    cudaMalloc((void **)&d_input, inputSize);
    cudaMalloc((void **)&d_output, inputSize);

    for (int y = 0; y < maskwidth; y++)
    {
        for (int x = 0; x < maskwidth; x++)
        {
            mask[y * width + x] = 2;
        }
    }
    srand(time(NULL));
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            input[y * width + x] = rand() % 200;
            output[y * width + x] = 0;
        }
    }
    cudaMemcpy(d_mask, mask, maskSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_input, input, inputSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_output, output, inputSize, cudaMemcpyHostToDevice);

    dim3 DimGrid(86, 86, 1);
    dim3 DimBlock(16, 16, 1);
    convolution_2D_basic_kernel<<<DimGrid, DimBlock>>>(d_input, d_mask, d_output, maskwidth, width, height);
    cudaDeviceSynchronize();
    cudaMemcpy(output, d_output, inputSize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            printf("outpu[%d][%d]:%d\n", i, j, output[i * width + j]);
        }
    }
    cudaDeviceReset();
    return 0;
}

完整代码实例：better版规约

#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK_SIZE 512

__global__ void rowSum(int *input, int *output, int width)
{
    __shared__ int partialSum[2 * BLOCK_SIZE];
    unsigned int tx = threadIdx.x;
    unsigned int start = blockIdx.x * 2 * blockDim.x;
    partialSum[tx] = input[start + tx];
    partialSum[blockDim.x + tx] = input[start + blockDim.x + tx];

    // reduce better
    for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
    {
        __syncthreads();
        if (tx < stride)
            partialSum[tx] += partialSum[tx + stride];
    }
    // 将最后的结果存放到
    if (tx == 0)
    {
        output[blockIdx.x] = partialSum[0];
    }
}

int main()
{
    float elapsedTime;
    cudaEvent_t start, stop;
    // 初始化 CUDA 事件
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int matrixSize = 1024 * 1024;
    const int blockSize = BLOCK_SIZE;
    const int numBlocks = (matrixSize + blockSize - 1) / blockSize / 2;
    const int outputSize = numBlocks;

    int *inputMatrix, *output;
    int *d_inputMatrix, *d_output;

    // 为input和output分配内存
    inputMatrix = new int[matrixSize];
    output = new int[outputSize];

    // 为d_input和d_output分配内存
    cudaMalloc((void **)&d_inputMatrix, matrixSize * sizeof(int));
    cudaMalloc((void **)&d_output, outputSize * sizeof(int));

    // 初始化input
    for (int i = 0; i < matrixSize; i++)
    {
        inputMatrix[i] = 1;
    }
    // 记录开始事件
    cudaEventRecord(start, 0);
    // 复制input到d_input
    cudaMemcpy(d_inputMatrix, inputMatrix, matrixSize * sizeof(int), cudaMemcpyHostToDevice);

    // 启动核函数
    rowSum<<<numBlocks, blockSize>>>(d_inputMatrix, d_output, matrixSize);

    // 将output结果返回
    cudaMemcpy(output, d_output, outputSize * sizeof(int), cudaMemcpyDeviceToHost);

    // 记录结束事件
    cudaEventRecord(stop, 0);
    // 同步事件，确保所有任务完成
    cudaEventSynchronize(stop);
    // 计算经过的时间
    cudaEventElapsedTime(&elapsedTime, start, stop);

    for (int i = 0; i < 10; i++)
    {
        printf("output[%d]: %d\n", i, output[i]);
    }
    // 输出时间
    printf("Elapsed time: %3.4f ms\n", elapsedTime);
    // 最后释放内存。
    delete[] inputMatrix;
    delete[] output;
    cudaFree(d_inputMatrix);
    cudaFree(d_output);

    return 0;
}

完整代码实例：work_efficient二维前缀和scan

#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK_SIZE 512
#define SECTION_SIZE 1024

__global__ void work_efficient_scan_kernel(int *inputMatrix, int *output, int width)
{
    __shared__ int XY[SECTION_SIZE];
    unsigned int tx = threadIdx.x;
    unsigned int start = blockIdx.x * 2 * blockDim.x;
    XY[tx] = inputMatrix[start + tx];
    XY[blockDim.x + tx] = inputMatrix[start + blockDim.x + tx];

    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
    {
        __syncthreads();
        int index = (threadIdx.x + 1) * 2 * stride - 1;
        if (index < SECTION_SIZE)
        {
            XY[index] += XY[index - stride];
        }
    }
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        __syncthreads();
        int index = (threadIdx.x + 1) * stride * 2 - 1;
        if (index + stride < SECTION_SIZE)
        {
            XY[index + stride] += XY[index];
        }
    }
    if (start + blockDim.x + tx < width * width)
    {
        output[start + tx] = XY[tx];
        output[start + blockDim.x + tx] = XY[blockDim.x+tx];
    }
}

int main()
{
    float elapsedTime;
    cudaEvent_t start, stop;
    // 初始化 CUDA 事件
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int matrixSize = 1024 * 1024;
    const int width = 1024;

    int *inputMatrix, *output;
    int *d_inputMatrix;
    int *d_output;

    inputMatrix = new int[matrixSize];
    output = new int[matrixSize];

    // 初始化input
    for (int i = 0; i < matrixSize; i++)
    {
        inputMatrix[i] = 1;
        output[i] = 0;
    }

    // 分配 GPU 内存
    cudaMalloc((void **)&d_inputMatrix, matrixSize * sizeof(int));
    cudaMalloc((void **)&d_output, matrixSize * sizeof(int));

    // 将输入数据从主机内存拷贝到 GPU 内存
    cudaMemcpy(d_inputMatrix, inputMatrix, matrixSize * sizeof(int), cudaMemcpyHostToDevice);

    // 记录开始事件
    cudaEventRecord(start, 0);

    dim3 gridSize(1024, 1);
    dim3 blockSize(BLOCK_SIZE, 1);
    work_efficient_scan_kernel<<<gridSize, blockSize>>>(d_inputMatrix, d_output, width);

    // 将输出数据从 GPU 内存拷贝到主机内存
    cudaMemcpy(output, d_output, matrixSize * sizeof(int), cudaMemcpyDeviceToHost);

    // y方向相加
    for (int x = 0; x < width; x++)
    {
        for (int y = 1; y < width; y++)
        {
            output[y * width + x] += output[(y - 1) * width + x];
        }
    }

    // 记录结束事件
    cudaEventRecord(stop, 0);
    // 同步事件，确保所有任务完成
    cudaEventSynchronize(stop);
    // 计算经过的时间
    cudaEventElapsedTime(&elapsedTime, start, stop);

    // 打印结果
    for (int i = 0; i < 5; i++)
    {
        for (int j = 0; j < 5; j++)
        {
            printf("output[%d][%d]: %d\n", i, j, output[i * width + j]);
        }
    }
    printf("output[%d][%d]: %d\n", 1023, 1023, output[1023 * width + 1023]);

    // 输出时间
    printf("Elapsed time: %3.4f ms\n", elapsedTime);

    // 释放 GPU 内存
    cudaFree(d_inputMatrix);
    cudaFree(d_output);

    // 最后释放主机内存
    delete[] inputMatrix;
    delete[] output;

    return 0;
}