GPU高性能计算CUDA编程:使用共享内存作为缓冲区
声明:本文不做商用
代码10.1给出了两个使用共享内存翻转图像的核函数Hflip6()和Vflip6(),分别是Hflip()和 Vflip()核函数的改进版本。这两个核函数首先从GM读取一个像素(3个字节)到共享内存,然后从共享内存写回GM(到翻转的位置)来翻转图像。在两个核函数中,用以下代码分配共享内存:
__shared__ uch PixBuffer[3072]; // 存放3*1024字节(1024 个像素)
它分配了3072个类型为unsignedchar(uc)的元素。这里总共分配了3072个字节,以线程块为单位。所以,你启动的任何像这样运行多线程的线程块只会分配一个3072个字节的共享内存区域。例如,如果你的线程块有128个线程,则会为这128个线程分配3072个字节的缓冲区,对应于每个线程3072/128-24个字节。核函数的初始部分与原来的Hflip()和Vflip()相同,但是,每个线程访问共享内存的地址取决于其tid。根据核函数Hflip6()和Vflip6()的写人方式,它们只处理一个像素,即3个字节。因此,如果它们以每块128个线程启动,将只需384个字节的共享内存,从而使其余2688个字节的共享内存在其执行期间处于空闲状态。如果将分配的共享内存降为384个字节,那么不能以超过128个线程/块来启动这个核函数。因此,确定要声明多少共享内存需要一个复杂的公式。
// Improved Hflip3() kernel that flips the given image horizontally
// Each kernel: copies a pixel from GlobalMem into shared memory (PixBuffer[])
// and writes back into the flipped Global Memory location
__global__
void Hflip6(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui RowBytes)
{
__shared__ uch PixBuffer[3072]; // holds 3*1024 Bytes (1024 pixels).
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYtid3 = MYtid * 3;
//ui MYgtid = ThrPerBlk * MYbid + MYtid;
//ui NumBlocks = gridDim.x;
//ui BlkPerRow = CEIL(Hpixels,ThrPerBlk);
//ui RowBytes = (Hpixels * 3 + 3) & (~3);
//ui MYrow = MYbid / BlkPerRow;
//ui MYcol = MYgtid - MYrow*BlkPerRow*ThrPerBlk;
ui MYrow = blockIdx.y;
ui MYcol = MYbid*ThrPerBlk + MYtid;
if (MYcol >= Hpixels) return; // col out of range
ui MYmirrorcol = Hpixels - 1 - MYcol;
ui MYoffset = MYrow * RowBytes;
ui MYsrcIndex = MYoffset + 3 * MYcol;
ui MYdstIndex = MYoffset + 3 * MYmirrorcol;
// swap pixels RGB @MYcol , @MYmirrorcol
PixBuffer[MYtid3] = ImgSrc[MYsrcIndex];
PixBuffer[MYtid3 + 1] = ImgSrc[MYsrcIndex + 1];
PixBuffer[MYtid3 + 2] = ImgSrc[MYsrcIndex + 2];
__syncthreads();
ImgDst[MYdstIndex] = PixBuffer[MYtid3];
ImgDst[MYdstIndex + 1] = PixBuffer[MYtid3 + 1];
ImgDst[MYdstIndex + 2] = PixBuffer[MYtid3 + 2];
}
// Improved Vflip3() kernel that flips the given image vertically
// Each kernel: copies a pixel from GlobalMem into shared memory (PixBuffer[])
// and writes back into the flipped Global Memory location
__global__
void Vflip6(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels, ui RowBytes)
{
__shared__ uch PixBuffer[3072]; // holds 3*1024 Bytes (1024 pixels).
ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;
ui MYtid3 = MYtid*3;
//ui MYgtid = ThrPerBlk * MYbid + MYtid;
//ui NumBlocks = gridDim.x;
//ui BlkPerRow = CEIL(Hpixels,ThrPerBlk);
//ui RowBytes = (Hpixels * 3 + 3) & (~3);
//ui MYrow = MYbid / BlkPerRow;
//ui MYcol = MYgtid - MYrow*BlkPerRow*ThrPerBlk;
ui MYrow = blockIdx.y;
ui MYcol = MYbid*ThrPerBlk + MYtid;
if (MYcol >= Hpixels) return; // col is out of range
ui MYmirrorrow = Vpixels - 1 - MYrow;
ui MYsrcOffset = MYrow * RowBytes;
ui MYdstOffset = MYmirrorrow * RowBytes;
ui MYsrcIndex = MYsrcOffset + 3 * MYcol;
ui MYdstIndex = MYdstOffset + 3 * MYcol;
// swap pixels RGB @MYrow , @MYmirrorrow
PixBuffer[MYtid3] = ImgSrc[MYsrcIndex];
PixBuffer[MYtid3 + 1] = ImgSrc[MYsrcIndex + 1];
PixBuffer[MYtid3 + 2] = ImgSrc[MYsrcIndex + 2];
__syncthreads();
ImgDst[MYdstIndex] = PixBuffer[MYtid3];
ImgDst[MYdstIndex + 1] = PixBuffer[MYtid3 + 1];
ImgDst[MYdstIndex + 2] = PixBuffer[MYtid3 + 2];
}
参见【0voice C++】一旦声明了共享内存,SM在启动块之前就从其整个共享内存中分配相应大小的共享内存。在执行期间,下面代码
PixBuffer[MYtid3] = ImgSrc[MYsrcIndex]; ...
将GM中的像素(由mgSrc指向)复制到共享内存(PixBuffer数组)中。下面代码将其复制回GM(复制到翻转后的位置,由指针ImgDst指向):
ImgDst[MYdstIndex] = PixBuffer[MYtid3]; ...
下面这行代码确保在允许每个线程继续处理之前,块中的所有线程读入共享内存的操作都已完成。
__syncthreads();