Introduction to CUDA Programming

AI小兵

Article link: https://blog.csdn.net/edric1261234/article/details/81674230

This article is particularly well written, and I essentially followed it throughout: https://blog.csdn.net/xiaohu2022/article/details/79599947

The hardest part to understand is the concept of Grid and Block. The figure below shows how they relate.

[Figure: the relationship between Grid and Block]

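A GPU processes images quickly because each region of the image can be assigned to its own Thread. Each Thread handles only its region, and once every Thread has finished, the whole image is done, which avoids a large number of for loops.

Grid and Block together form one large net. Once this net has "swept" over the image, the entire image has been processed. When the image is small and GPU resources are plentiful, a single sweep covers everything.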

To get the position that each Thread corresponds to, use

int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

A simple example: suppose my image is 12*12 and I want to normalize each of its 144 pixels in a single pass. Then I can set

dim3 gridSize(3, 3);
dim3 blockSize(4, 4);

This gives exactly 12*12 Threads, each handling one pixel value, so the work finishes in one pass.
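
To check the claim that this configuration covers every pixel exactly once, here is a small host-side sketch in plain C++ (no GPU needed). It replays the row/col formula above for every block and thread. The names `Dim2`, `globalRow`, `globalCol`, and `coverage` are made up for this illustration and are not part of CUDA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for CUDA's dim3 / threadIdx / blockIdx.
struct Dim2 { int x; int y; };

// Same arithmetic as in the kernel: the global row of one thread.
inline int globalRow(Dim2 tIdx, Dim2 bIdx, Dim2 bDim) {
    return tIdx.y + bIdx.y * bDim.y;
}

// Same arithmetic as in the kernel: the global column of one thread.
inline int globalCol(Dim2 tIdx, Dim2 bIdx, Dim2 bDim) {
    return tIdx.x + bIdx.x * bDim.x;
}

// Count how many threads land on each pixel of a height x width image.
std::vector<int> coverage(Dim2 gDim, Dim2 bDim, int height, int width) {
    assert(height > 0 && width > 0);
    std::vector<int> hits(height * width, 0);
    for (int by = 0; by < gDim.y; ++by)
        for (int bx = 0; bx < gDim.x; ++bx)
            for (int ty = 0; ty < bDim.y; ++ty)
                for (int tx = 0; tx < bDim.x; ++tx) {
                    int row = globalRow(Dim2{tx, ty}, Dim2{bx, by}, bDim);
                    int col = globalCol(Dim2{tx, ty}, Dim2{bx, by}, bDim);
                    // Threads that fall outside the image do nothing.
                    if (row < height && col < width) ++hits[row * width + col];
                }
    return hits;
}
```

Calling `coverage(Dim2{3, 3}, Dim2{4, 4}, 12, 12)` returns 144 entries that are all 1, i.e. one Thread per pixel.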

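When the input image is too large to finish in one pass, each Thread has to process several pixels, and this is where "stride" comes in. For example, if the image grows to 24*24 while gridSize and blockSize stay the same, each Thread must process 4 pixels: the Thread that handles (0,0) also handles (0,12), (12,0), and (12,12).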
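
The stride idea can be sketched on the host as well. The function below, with the hypothetical name `pixelsForThread`, lists the pixels one Thread would visit, assuming the stride in each direction equals blockDim * gridDim (12 in the example above).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// All (row, col) pixels visited by one thread in a strided sweep.
// start_row / start_col are the thread's global row and column.
std::vector<std::pair<int, int>> pixelsForThread(int start_row, int start_col,
                                                 int stride_row, int stride_col,
                                                 int height, int width) {
    assert(stride_row > 0 && stride_col > 0);
    std::vector<std::pair<int, int>> visited;
    for (int row = start_row; row < height; row += stride_row)
        for (int col = start_col; col < width; col += stride_col)
            visited.emplace_back(row, col);
    return visited;
}
```

For the 24*24 image, `pixelsForThread(0, 0, 12, 12, 24, 24)` yields (0,0), (0,12), (12,0), (12,12), matching the four pixels listed above.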

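The example below is the first CUDA code I wrote myself. The input is a 1920*1080 heat map and the output is its local maxima, i.e. the points whose value is at least as large as the values of the four points above, below, left, and right. First define kernel_fun, then call it on the host in the form kernel_fun<<< grid, block >>>(params...)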

	// Read element (row, col) of the row-major matrix A
	__device__ float getElement(const float *A, int height, int width, int row, int col)
	{
		return A[row * width + col];
	}

	// Write element (row, col) of the row-major matrix A
	__device__ void setElement(bool *A, int height, int width, int row, int col, bool value)
	{
		A[row * width + col] = value;
	}
	
	__global__ void getPeak(float *d_in, int height, int width, float thre,  bool *d_out)
	{
		// Distance between two consecutive pixels handled by the same thread
		int stride_row = blockDim.y * gridDim.y;
		int stride_col = blockDim.x * gridDim.x;
		
		// Strided loops: the bounds live in the loop conditions, so no
		// `continue` can skip the increment of row or col
		for (int row = threadIdx.y + blockIdx.y * blockDim.y; row < height; row += stride_row)
		{
			for (int col = threadIdx.x + blockIdx.x * blockDim.x; col < width; col += stride_col)
			{
				bool is_peak = false;
				// Only pixels with all four neighbors inside the image can be peaks
				if (row >= 1 && col >= 1 && row + 1 < height && col + 1 < width)
				{
					float map_cen_val = getElement(d_in, height, width, row, col);
					is_peak =
						(map_cen_val >= thre) &&
						(map_cen_val >= getElement(d_in, height, width, row - 1, col)) &&
						(map_cen_val >= getElement(d_in, height, width, row + 1, col)) &&
						(map_cen_val >= getElement(d_in, height, width, row, col - 1)) &&
						(map_cen_val >= getElement(d_in, height, width, row, col + 1));
				}
				// Write every pixel so d_out never holds uninitialized values
				setElement(d_out, height, width, row, col, is_peak);
			}
		}
	}
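
A plain CPU version is handy for checking the kernel's output on small inputs. The sketch below is my own reference, not part of the original code. It assumes that a pixel counts as a peak when it is at least `thre` and at least as large as its four neighbors, and that border pixels (those without four in-bounds neighbors) are never peaks. The name `getPeakCpu` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: out[row * width + col] is true when the pixel is >= thre
// and >= its up, down, left, and right neighbors. Border pixels stay false.
std::vector<bool> getPeakCpu(const std::vector<float>& in, int height, int width, float thre)
{
    assert(height > 0 && width > 0);
    assert(in.size() == static_cast<std::size_t>(height) * static_cast<std::size_t>(width));
    std::vector<bool> out(in.size(), false);
    for (int row = 1; row + 1 < height; ++row) {
        for (int col = 1; col + 1 < width; ++col) {
            const float center = in[row * width + col];
            out[row * width + col] =
                center >= thre &&
                center >= in[(row - 1) * width + col] &&
                center >= in[(row + 1) * width + col] &&
                center >= in[row * width + (col - 1)] &&
                center >= in[row * width + (col + 1)];
        }
    }
    return out;
}
```

Comparing this mask element by element with the mask copied back from the device is a cheap way to catch indexing mistakes in the kernel.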

The call from the host looks like this

	void findPeakPose(float* input_data, int height, int width, float thre, std::vector<Point>& peak_points)
	{
		size_t nBytes = (size_t)width * height * sizeof(float);
		size_t nBytes_output = (size_t)width * height * sizeof(bool);
		// Allocate device memory
		float *d_in;
		bool  *d_out;
		cudaMalloc((void**)&d_in, nBytes);
		cudaMalloc((void**)&d_out, nBytes_output);
		
		// Copy the host data to the device
		cudaMemcpy((void*)d_in, (void*)input_data, nBytes, cudaMemcpyHostToDevice);
		// cudaMalloc does not zero memory, so clear the output mask first
		cudaMemset(d_out, 0, nBytes_output);

		// Define the kernel's execution configuration
		dim3 blockSize(32, 32);
		dim3 gridSize(5, 5);
		
		// Launch the kernel
		getPeak<<<gridSize, blockSize>>>(d_in, height, width, thre, d_out);
		
		// Synchronize the device so the results can be read safely
		cudaDeviceSynchronize();
		
		bool* output_data = (bool*)malloc(nBytes_output);
		cudaMemcpy((void*)output_data, (void*)d_out, nBytes_output, cudaMemcpyDeviceToHost);
		 
		for(int i = 0; i < width * height; i++)
		{
			if (output_data[i])
			{
				Point p;
				// Recover (column, row) from the row-major index
				p.width = i % width;
				p.height = i / width;
				peak_points.push_back(p);
			}	
		}
	
		cudaFree(d_in);
		cudaFree(d_out);
		free(output_data);
	}
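
The launch configuration and the image size are tied together by one ceiling division, the same `(n + d - 1) / d` pattern the kernel text uses for its iteration counts. The helper below, with the hypothetical name `ceilDiv`, is a rough sketch of that arithmetic and nothing more.

```cpp
#include <cassert>

// Smallest count such that count * step >= extent (ceiling division).
inline int ceilDiv(int extent, int step) {
    assert(extent > 0 && step > 0);
    return (extent + step - 1) / step;
}
```

With blockSize(32, 32) and gridSize(5, 5) the stride is 160 in each direction, so each Thread makes `ceilDiv(1920, 160)` = 12 steps along the columns and `ceilDiv(1080, 160)` = 7 steps along the rows. If one wanted a single sweep instead, `ceilDiv(1920, 32)` = 60 and `ceilDiv(1080, 32)` = 34 would be the grid dimensions to try, assuming the GPU has the resources for it.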

A brief note on the meaning of function qualifiers such as __device__ and __global__ (quoted from https://blog.csdn.net/xiaohu2022/article/details/79599947)

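__global__: runs on the device and is called from the host (some specific GPUs can also call it from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is asynchronous, meaning the host does not wait for the kernel to finish before moving on to the next step.
__device__: runs on the device and can only be called from the device. It cannot be combined with __global__.
__host__: runs on the host and can only be called from the host. It is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both device and host.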
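
As a small illustration of combining __host__ and __device__, here is a sketch of an index helper that can be shared by host code and kernels. The function name `flatIndex` is hypothetical. The macro fallback is an assumption that makes the file also build with a plain C++ compiler, where `__CUDACC__` is not defined.

```cpp
#include <cassert>

// Under nvcc, __CUDACC__ is defined and the qualifiers keep their CUDA meaning.
// Under a plain C++ compiler they are defined away so the file still builds.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Row-major offset of element (row, col); the same formula as getElement.
__host__ __device__ inline int flatIndex(int row, int col, int width) {
    assert(width > 0 && col >= 0 && col < width);
    return row * width + col;
}
```

On the host, `flatIndex(1, 2, 1920)` gives 1922. When compiled by nvcc, the same function could be called from inside a kernel as well.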
