GPU高性能计算CUDA编程：使用共享内存作为缓冲区

阿里猫的编程小辈

于 2024-09-08 14:19:55 发布

阅读量400

点赞数 10

分类专栏： 007 文章标签： c++ 开发语言性能优化 gpu算力人工智能

本文链接：https://blog.csdn.net/m0_51165837/article/details/142026377

版权

007 专栏收录该内容

37 篇文章 0 订阅

订阅专栏

GPU高性能计算CUDA编程：使用共享内存作为缓冲区

声明：本文不做商用

代码10.1给出了两个使用共享内存翻转图像的核函数Hflip6()和Vflip6()，分别是Hflip()和 Vflip()核函数的改进版本。这两个核函数首先从GM读取一个像素(3个字节)到共享内存，然后从共享内存写回GM(到翻转的位置)来翻转图像。在两个核函数中，用以下代码分配共享内存:

__shared__ uch PixBuffer[3072];  // 存放3*1024字节(1024 个像素)

它分配了3072个类型为unsignedchar(uc)的元素。这里总共分配了3072个字节，以线程块为单位。所以，你启动的任何像这样运行多线程的线程块只会分配一个3072个字节的共享内存区域。例如，如果你的线程块有128个线程，则会为这128个线程分配3072个字节的缓冲区，对应于每个线程3072/128-24个字节。核函数的初始部分与原来的Hflip()和Vflip()相同，但是，每个线程访问共享内存的地址取决于其tid。根据核函数Hflip6()和Vflip6()的写人方式，它们只处理一个像素，即3个字节。因此，如果它们以每块128个线程启动，将只需384个字节的共享内存，从而使其余2688个字节的共享内存在其执行期间处于空闲状态。如果将分配的共享内存降为384个字节，那么不能以超过128个线程/块来启动这个核函数。因此，确定要声明多少共享内存需要一个复杂的公式。

// Improved Hflip3() kernel that flips the given image horizontally
// Each kernel: copies a pixel from GlobalMem into shared memory (PixBuffer[])
// and writes back into the flipped Global Memory location
__global__
void Hflip6(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui RowBytes)
{
	__shared__ uch PixBuffer[3072]; // holds 3*1024 Bytes (1024 pixels).

	ui ThrPerBlk = blockDim.x;
	ui MYbid = blockIdx.x;
	ui MYtid = threadIdx.x;
	ui MYtid3 = MYtid * 3;
	//ui MYgtid = ThrPerBlk * MYbid + MYtid;

	//ui NumBlocks = gridDim.x;
	//ui BlkPerRow = CEIL(Hpixels,ThrPerBlk);
	//ui RowBytes = (Hpixels * 3 + 3) & (~3);
	//ui MYrow = MYbid / BlkPerRow;
	//ui MYcol = MYgtid - MYrow*BlkPerRow*ThrPerBlk;
	ui MYrow = blockIdx.y;
	ui MYcol = MYbid*ThrPerBlk + MYtid;
	if (MYcol >= Hpixels) return;			// col out of range
	ui MYmirrorcol = Hpixels - 1 - MYcol;
	ui MYoffset = MYrow * RowBytes;
	ui MYsrcIndex = MYoffset + 3 * MYcol;
	ui MYdstIndex = MYoffset + 3 * MYmirrorcol;

	// swap pixels RGB   @MYcol , @MYmirrorcol
	PixBuffer[MYtid3] = ImgSrc[MYsrcIndex];
	PixBuffer[MYtid3 + 1] = ImgSrc[MYsrcIndex + 1];
	PixBuffer[MYtid3 + 2] = ImgSrc[MYsrcIndex + 2];
	__syncthreads();
	ImgDst[MYdstIndex] = PixBuffer[MYtid3];
	ImgDst[MYdstIndex + 1] = PixBuffer[MYtid3 + 1];
	ImgDst[MYdstIndex + 2] = PixBuffer[MYtid3 + 2];
}

// Improved Vflip3() kernel that flips the given image vertically
// Each kernel: copies a pixel from GlobalMem into shared memory (PixBuffer[])
// and writes back into the flipped Global Memory location
__global__
void Vflip6(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels, ui RowBytes)
{
	__shared__ uch PixBuffer[3072]; // holds 3*1024 Bytes (1024 pixels).

	ui ThrPerBlk = blockDim.x;
	ui MYbid = blockIdx.x;
	ui MYtid = threadIdx.x;
	ui MYtid3 = MYtid*3;
	//ui MYgtid = ThrPerBlk * MYbid + MYtid;

	//ui NumBlocks = gridDim.x;
	//ui BlkPerRow = CEIL(Hpixels,ThrPerBlk);
	//ui RowBytes = (Hpixels * 3 + 3) & (~3);
	//ui MYrow = MYbid / BlkPerRow;
	//ui MYcol = MYgtid - MYrow*BlkPerRow*ThrPerBlk;
	ui MYrow = blockIdx.y;
	ui MYcol = MYbid*ThrPerBlk + MYtid;
	if (MYcol >= Hpixels) return;			// col is out of range
	ui MYmirrorrow = Vpixels - 1 - MYrow;
	ui MYsrcOffset = MYrow       * RowBytes;
	ui MYdstOffset = MYmirrorrow * RowBytes;
	ui MYsrcIndex = MYsrcOffset + 3 * MYcol;
	ui MYdstIndex = MYdstOffset + 3 * MYcol;

	// swap pixels RGB   @MYrow , @MYmirrorrow
	PixBuffer[MYtid3] = ImgSrc[MYsrcIndex];
	PixBuffer[MYtid3 + 1] = ImgSrc[MYsrcIndex + 1];
	PixBuffer[MYtid3 + 2] = ImgSrc[MYsrcIndex + 2];
	__syncthreads();
	ImgDst[MYdstIndex] = PixBuffer[MYtid3];
	ImgDst[MYdstIndex + 1] = PixBuffer[MYtid3 + 1];
	ImgDst[MYdstIndex + 2] = PixBuffer[MYtid3 + 2];
}

参见【0voice C++】一旦声明了共享内存，SM在启动块之前就从其整个共享内存中分配相应大小的共享内存。在执行期间，下面代码

PixBuffer[MYtid3] = ImgSrc[MYsrcIndex]; ...

将GM中的像素(由mgSrc指向)复制到共享内存(PixBuffer数组)中。下面代码将其复制回GM(复制到翻转后的位置，由指针ImgDst指向):

ImgDst[MYdstIndex] = PixBuffer[MYtid3]; ...

下面这行代码确保在允许每个线程继续处理之前，块中的所有线程读入共享内存的操作都已完成。

__syncthreads();

阿里猫的编程小辈

关注

10
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录