Why Hflip7 in "GPU Parallel Program Development Using CUDA" goes out of bounds

hijackedbycsdn

Published 2023-12-26 10:44:19

Article link: https://blog.csdn.net/hijackedbycsdn/article/details/135206465

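The out-of-bounds problem

Running the kernel produced an out-of-bounds error. Testing showed that MYdstIndex can be -1, which becomes 2^32 - 1 when stored in an unsigned variable, a very large index.

The fix is to check whether MYdstIndex is negative right after computing it: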

// Improved Hflip6() kernel that flips the given image horizontally
// Each kernel: uses Shared Memory (PixBuffer[]) to read in 12 Bytes
// (4 pixels). 12Bytes are flipped inside Shared Memory 
// After that, they are written into Global Mem as 3 int's
// Horizontal resolution MUST BE A MULTIPLE OF 4 pixels.
__global__
void Hflip7(ui* ImgDst32, ui* ImgSrc32, ui RowInts)
{
	__shared__ ui PixBuffer[3072]; // holds 3*1024*4 Bytes (1024*4 pixels).

	ui ThrPerBlk = blockDim.x;
	ui MYbid = blockIdx.x;
	ui MYtid = threadIdx.x;
	ui MYtid3 = MYtid * 3;
	ui MYrow = blockIdx.y;
	ui MYcolIndex = (MYbid * ThrPerBlk + MYtid) * 3;
	if (MYcolIndex >= RowInts) return;			// index is out of range
	ui MYmirrorcol = RowInts - 1 - MYcolIndex;
	ui MYoffset = MYrow * RowInts;
	ui MYsrcIndex = MYoffset + MYcolIndex;
	ui MYdstIndex = MYoffset + MYmirrorcol - 2; // -2 is to copy 3 int's (12 Bytes) at a time
	if ((int)MYdstIndex < 0) return;			// added check: index wrapped around below 0

	uch SwapB;
	uch* SwapPtr;

	// read 4 pixel blocks (12B = 3 int's) into Shared Memory
	// PixBuffer:  [B0 G0 R0 B1] [G1 R1 B2 G2] [R2 B3 G3 R3]
	// Our Target: [B3 G3 R3 B2] [G2 R2 B1 G1] [R1 B0 G0 R0]
	PixBuffer[MYtid3] = ImgSrc32[MYsrcIndex];
	PixBuffer[MYtid3 + 1] = ImgSrc32[MYsrcIndex + 1];
	PixBuffer[MYtid3 + 2] = ImgSrc32[MYsrcIndex + 2];
	__syncthreads();

	// swap these 4 pixels inside Shared Memory
	SwapPtr = (uch*)(&PixBuffer[MYtid3]);      // [B0 G0 R0 B1] [G1 R1 B2 G2] [R2 B3 G3 R3]
	SWAP(SwapPtr[0], SwapPtr[9], SwapB)			// [B3 G0 R0 B1] [G1 R1 B2 G2] [R2 B0 G3 R3]
	SWAP(SwapPtr[1], SwapPtr[10], SwapB)		// [B3 G3 R0 B1] [G1 R1 B2 G2] [R2 B0 G0 R3]
	SWAP(SwapPtr[2], SwapPtr[11], SwapB)		// [B3 G3 R3 B1] [G1 R1 B2 G2] [R2 B0 G0 R0]
	SWAP(SwapPtr[3], SwapPtr[6], SwapB)			// [B3 G3 R3 B2] [G1 R1 B1 G2] [R2 B0 G0 R0]
	SWAP(SwapPtr[4], SwapPtr[7], SwapB)			// [B3 G3 R3 B2] [G2 R1 B1 G1] [R2 B0 G0 R0]
	SWAP(SwapPtr[5], SwapPtr[8], SwapB)			// [B3 G3 R3 B2] [G2 R2 B1 G1] [R1 B0 G0 R0]

	__syncthreads();

	//write the 4 pixels (3 int's) from Shared Memory into Global Memory
	ImgDst32[MYdstIndex] = PixBuffer[MYtid3];
	ImgDst32[MYdstIndex + 1] = PixBuffer[MYtid3 + 1];
	ImgDst32[MYdstIndex + 2] = PixBuffer[MYtid3 + 2];
}
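
To make the wraparound concrete, here is a small host-side sketch in plain C++ (it is not part of the book's code). The helper names `dst_index_as_kernel` and `looks_negative` are hypothetical, and the sketch assumes that `ui` is a 32-bit unsigned int. It repeats the kernel's index arithmetic for a single thread.

```cpp
#include <cassert>
#include <cstdint>

using ui = std::uint32_t;  // assumption: `ui` in the kernel is a 32-bit unsigned int

// Repeats the kernel's index arithmetic for one thread, in unsigned math.
// col_index plays the role of MYcolIndex and must already be < row_ints,
// as the kernel's first range check guarantees.
inline ui dst_index_as_kernel(ui row, ui row_ints, ui col_index) {
    assert(col_index < row_ints);
    ui mirror_col = row_ints - 1 - col_index;
    ui offset = row * row_ints;
    return offset + mirror_col - 2;  // wraps modulo 2^32 if the sum is below 2
}

// The guard added to the kernel: reinterpret the unsigned value as signed.
inline bool looks_negative(ui index) {
    return static_cast<std::int32_t>(index) < 0;
}
```

With 5 ints per row, `dst_index_as_kernel(0, 5, 3)` evaluates to 4294967295 (2^32 - 1) and `looks_negative` flags it. For the same thread in row 1, `dst_index_as_kernel(1, 5, 3)` evaluates to 4, which passes the guard even though it points at the last int of row 0.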

Index calculation logic

Although I found which index went negative, I did not know why it was negative.

So here is an analysis.

The original function example

#define	IPH			ip.Hpixels
#define	IPV			ip.Vpixels
#define	CEIL(a,b)		((a+b-1)/b)

...

BlkPerRow = CEIL(IPH,ThrPerBlk);
NumBlocks = IPV*BlkPerRow;

...

Hflip <<< NumBlocks, ThrPerBlk >>> (GPUCopyImg, GPUImg, IPH);

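Here each thread processes one pixel, so the basic idea is that you must launch at least as many threads as there are pixels.

The number of threads per block is fixed, so some threads will inevitably be wasted,

which is why the block count is rounded up.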
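
As a rough illustration of the rounding, the sketch below rewrites the `CEIL` macro as a function and counts the idle threads per row. The names `ceil_div` and `idle_threads_per_row` are made up for this example. A function also avoids a pitfall of the macro, whose arguments are not parenthesized, so passing an expression such as `x+1` as `b` would expand incorrectly.

```cpp
#include <cassert>

// Function form of the CEIL(a,b) macro: integer division rounded up.
inline unsigned ceil_div(unsigned a, unsigned b) {
    assert(b != 0);
    return (a + b - 1) / b;
}

// Number of launched threads in one row that have no pixel to process.
inline unsigned idle_threads_per_row(unsigned h_pixels, unsigned thr_per_blk) {
    return ceil_div(h_pixels, thr_per_blk) * thr_per_blk - h_pixels;
}
```

For a hypothetical row of 1000 pixels with 256 threads per block, `ceil_div(1000, 256)` is 4 blocks, and `idle_threads_per_row(1000, 256)` is 24 threads that exit at the range check.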

Now split the grid into two dimensions and divide the first dimension by 4, so that each thread processes 4 pixels:

dim3 dimGrid2D4(CEIL(BlkPerRow,4), ip.Vpixels);

With one thread per pixel:

ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;

ui MYrow = blockIdx.y;
ui MYcol = MYbid*ThrPerBlk + MYtid;

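The MYcol computed here is the column of the pixel to process in the current row.

But now one thread processes 4 pixels. One pixel is 3 bytes, so 4 pixels are 12 bytes, which equals 3 ints.

So instead of the pixel column, I convert the row's pixel count into a count of ints,

and what I compute is the int column in the current row, that is, which int to process: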

ui ThrPerBlk = blockDim.x;
ui MYbid = blockIdx.x;
ui MYtid = threadIdx.x;

ui MYrow = blockIdx.y;
ui MYcolIndex = (MYbid * ThrPerBlk + MYtid) * 3;

This MYcolIndex is the index of the first int the thread processes in the current row.
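
The conversion from pixels to ints can be sketched on the host as follows. This assumes that each row is padded to a multiple of 4 bytes, which is consistent with the 2 trailing bytes per row in the 6-pixel dumps further down. The function names are hypothetical and do not come from the book.

```cpp
#include <cassert>
#include <vector>

// Bytes per row for 3-byte pixels, rounded up to a multiple of 4.
// Assumption: this is how IPHB is derived in the host code.
inline unsigned padded_row_bytes(unsigned h_pixels) {
    assert(h_pixels > 0);
    return (h_pixels * 3 + 3) & ~3u;
}

// Number of 32-bit ints per row (the kernel's RowInts).
inline unsigned row_ints(unsigned h_pixels) {
    return padded_row_bytes(h_pixels) / 4;
}

// Starting int index (MYcolIndex) of every thread in one row that passes
// the `MYcolIndex >= RowInts` check.
inline std::vector<unsigned> active_col_indices(unsigned h_pixels) {
    std::vector<unsigned> result;
    for (unsigned col = 0; col < row_ints(h_pixels); col += 3) {
        result.push_back(col);
    }
    return result;
}
```

An 8-pixel row gives 24 bytes, 6 ints, and threads starting at ints 0 and 3. A 6-pixel row gives 20 bytes and 5 ints. Threads again start at 0 and 3, but the second thread has only ints 3 and 4 left in its row.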

From here the rest is clear:

ui MYoffset = MYrow * RowInts;
ui MYsrcIndex = MYoffset + MYcolIndex;
ui MYdstIndex = MYoffset + MYmirrorcol - 2;

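MYcolIndex is always a multiple of 3, so when RowInts is not a multiple of 3, the last thread that passes the range check gets a MYmirrorcol (RowInts - 1 - MYcolIndex) of 0 or 1. In row 0, MYoffset = 0,

so MYdstIndex can be -2 or -1.

That is why the check is needed.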
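
The dependence on RowInts can be checked with a few lines of signed arithmetic. The helper `last_thread_dst_col` below is a hypothetical name. It computes MYmirrorcol - 2 for the last thread of a row that passes the range check.

```cpp
#include <cassert>

// Signed value of (MYmirrorcol - 2) for the last thread of a row that passes
// the range check, i.e. the thread with the largest MYcolIndex < row_ints.
inline int last_thread_dst_col(int row_ints) {
    assert(row_ints > 0);
    int last_col_index = ((row_ints - 1) / 3) * 3;  // largest multiple of 3 below row_ints
    int mirror_col = row_ints - 1 - last_col_index; // 0, 1, or 2
    return mirror_col - 2;                          // -2, -1, or 0
}
```

When `row_ints` is a multiple of 3 the result is 0, so the write stays inside the row. A remainder of 2 (for example `row_ints = 5`) gives -1, and a remainder of 1 (for example `row_ints = 4`) gives -2.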

Errors in the horizontal flip

Original image

[image not preserved]

Result after the horizontal flip

[image not preserved]

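This kind of error only appeared after I started using Hflip7,

so it is worth thinking about why...

After a day of thinking I still had not figured it out.

For example, I printed the contents of the img pointers at the start and at the end of the program: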

...

printf("TheImg:\n");
for (int i = 0; i < IPV; ++i) {
	for (int j = 0; j < IPH; ++j) {
		printf("(%d, ", *(TheImg + i * IPHB + 3 * j));
		printf("%d, ", *(TheImg + i * IPHB + 3 * j + 1));
		printf("%d) ", *(TheImg + i * IPHB + 3 * j + 2));
	}
	printf("\n");
}

...

printf("CopyImg:\n");
for (int i = 0; i < IPV; ++i) {
	for (int j = 0; j < IPH; ++j) {
		printf("(%d, ", *(CopyImg + i * IPHB + 3 * j));
		printf("%d, ", *(CopyImg + i * IPHB + 3 * j + 1));
		printf("%d) ", *(CopyImg + i * IPHB + 3 * j + 2));
	}
	printf("\n");
}

Output:

TheImg:
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0)
(255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255)
(255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255)
(0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) (255, 255, 255)
CopyImg:
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255)
(0, 0, 255) (255, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255)
(0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0)

The rows in this output run from the bottom of the image to the top.

If I add a check in the kernel so that only the thread at row 0, column 0 executes:

if (MYsrcIndex != 0 || MYrow != 0) return;

TheImg:
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0)
(255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255)
(255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255)
(0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) (255, 255, 255)
CopyImg:
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 0)

If the check is changed to

if (MYcolIndex != 0) return;

Output 2

TheImg:
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0)
(255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0)
(255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255)
(255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255)
(0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) (255, 255, 255)
CopyImg:
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255)
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0)

So what happens here is that 2 bytes at the end of each row get swallowed.

Laying out the final output for comparison:

(0, 0, 0) (0, 0, 0)         (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 255) (0, 0, 0)     (0, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 0) (255, 0, 255)   (0, 0, 0) (0, 255, 255) (255, 255, 255) (255, 255, 255)
(0, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255)
(0, 0, 255) (255, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255)
(0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0)

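This at least confirms that, going from output 2 to the final output, the region written in output 2 was not overwritten.

So let's print the trailing bytes of each row as well: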

...

printf("TheImg:\n");
for (int i = 0; i < IPV; ++i) {
	for (int j = 0; j < IPH; ++j) {
		printf("(%d, ", *(TheImg + i * IPHB + 3 * j));
		printf("%d, ", *(TheImg + i * IPHB + 3 * j + 1));
		printf("%d) ", *(TheImg + i * IPHB + 3 * j + 2));
	}

	for (int j = 3 * IPH; j < IPHB; ++j) {
		printf("%d ", *(TheImg + i * IPHB + j));
	}
	printf("\n");
}

...

printf("CopyImg:\n");
for (int i = 0; i < IPV; ++i) {
	for (int j = 0; j < IPH; ++j) {
		printf("(%d, ", *(CopyImg + i * IPHB + 3 * j));
		printf("%d, ", *(CopyImg + i * IPHB + 3 * j + 1));
		printf("%d) ", *(CopyImg + i * IPHB + 3 * j + 2));
	}

	for (int j = 3 * IPH; j < IPHB; ++j) {
		printf("%d ", *(CopyImg + i * IPHB + j));
	}
	printf("\n");
}

Output:

TheImg:
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) 0 0
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) 0 0
(255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) 0 0
(255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) 0 0
(255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) 0 0
(0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) (255, 255, 255) 0 0
CopyImg:
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) 255 255
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) 255 255
(0, 0, 0) (0, 0, 0) (0, 0, 0) (0, 255, 255) (255, 255, 255) (255, 255, 255) 255 255
(0, 0, 0) (0, 0, 0) (0, 0, 255) (0, 0, 0) (0, 255, 255) (255, 255, 255) 255 255
(0, 0, 0) (0, 0, 0) (0, 0, 0) (255, 0, 255) (0, 0, 0) (0, 255, 255) 255 255
(0, 0, 0) (0, 0, 0) (0, 0, 255) (255, 255, 0) (255, 0, 255) (0, 0, 0) 0 255

This confirms that 2 bytes at the end of each row are indeed swallowed.
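
The first row of that dump can be reproduced on the host. The sketch below emulates only the MYcolIndex == 0 thread on one 20-byte row (6 pixels plus 2 padding bytes). It is an illustration, not the book's code: `Row` and `flip_first_thread_only` are hypothetical names, and the destination is assumed to start zero-filled, as the untouched positions in the dump suggest.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

using Row = std::array<std::uint8_t, 20>;  // 6 pixels * 3 bytes + 2 padding bytes

// Host-side emulation of what the MYcolIndex == 0 thread does to one row.
inline Row flip_first_thread_only(const Row& src) {
    const unsigned row_ints = static_cast<unsigned>(src.size() / 4);  // 5
    const unsigned col_index = 0;
    const unsigned dst_col = row_ints - 1 - col_index - 2;  // MYmirrorcol - 2 = 2

    std::uint8_t buf[12];
    std::memcpy(buf, src.data() + col_index * 4, 12);  // load 3 ints = 4 pixels
    // Same six swaps as the kernel: pixel 0 <-> pixel 3, pixel 1 <-> pixel 2.
    std::swap(buf[0], buf[9]);
    std::swap(buf[1], buf[10]);
    std::swap(buf[2], buf[11]);
    std::swap(buf[3], buf[6]);
    std::swap(buf[4], buf[7]);
    std::swap(buf[5], buf[8]);

    Row dst{};  // zero-filled destination row
    assert(dst_col * 4 + 12 <= dst.size());
    std::memcpy(dst.data() + dst_col * 4, buf, 12);  // store at bytes 8..19
    return dst;
}
```

The four flipped pixels belong at bytes 6..17 (pixels 2 to 5), but they are stored at bytes 8..19 because the mirror position is computed over the padded 20-byte row. The last 2 bytes of pixel data therefore end up in the padding, which matches the trailing `255 255` in the dump.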

Suppose the condition in the kernel is changed to

if (MYcolIndex == 0) return;

Output:

TheImg:
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) 0 0
(255, 255, 255) (255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) 0 0
(255, 255, 255) (255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) 0 0
(255, 255, 255) (255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) 0 0
(255, 255, 255) (0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) 0 0
(0, 0, 255) (255, 0, 0) (0, 255, 0) (255, 255, 255) (255, 255, 255) (255, 255, 255) 0 0
CopyImg:
(0, 0, 0) (0, 0, 0) (0, 0,         0) (0, 0, 0) (0, 0, 0) (0, 255, 255) 255 0
(0, 255, 255) (0, 0, 0) (0, 255,   0) (0, 0, 0) (0, 0, 0) (0, 255, 255) 255 0
(0, 255, 0) (255, 0, 255) (0, 0,     0) (0, 0, 0) (0, 0, 0) (0, 255, 255) 0 0
(0, 255, 255) (255, 255, 0) (255, 0, 0) (0, 0, 0) (0, 0, 0) (0, 0, 255) 255 0
(0, 0, 255) (255, 255, 255) (255, 255,   0) (0, 0, 0) (0, 0, 0) (0, 0, 0) 0 0
(0, 0, 255) (255, 255, 255) (255, 255,   0) (0, 0, 0) (0, 0, 0) (0, 0, 0) 0 0

This is why, even though MYmirrorcol - 2 can be negative, only one position reports an error:

when computing

ui MYsrcIndex = MYoffset + MYcolIndex;
ui MYdstIndex = MYoffset + MYmirrorcol - 2;

a MYoffset is also added.

So when MYmirrorcol - 2 is negative, what actually happens is that the range touched by MYsrcIndex spans two rows, and the range touched by MYdstIndex spans two rows as well.
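
The row crossing can be made explicit with signed index ranges. The sketch below uses hypothetical helpers (`IntRange`, `src_range`, `dst_range`, `stays_in_row`) and `long` arithmetic so that a negative start stays visible instead of wrapping.

```cpp
#include <cassert>

struct IntRange { long first; long last; };  // inclusive range of int indices

// Source ints read by the thread (MYsrcIndex .. MYsrcIndex + 2).
inline IntRange src_range(long row, long row_ints, long col_index) {
    long start = row * row_ints + col_index;
    return {start, start + 2};
}

// Destination ints written by the thread (MYdstIndex .. MYdstIndex + 2),
// computed in signed arithmetic.
inline IntRange dst_range(long row, long row_ints, long col_index) {
    long start = row * row_ints + (row_ints - 1 - col_index) - 2;
    return {start, start + 2};
}

// True if the whole range lies inside the given row.
inline bool stays_in_row(IntRange r, long row, long row_ints) {
    assert(row_ints > 0);
    return r.first >= row * row_ints && r.last < (row + 1) * row_ints;
}
```

With 5 ints per row, row 2 owns ints 10..14. The thread with MYcolIndex = 3 reads ints 13..15, so int 15 belongs to row 3, and it writes ints 9..11, so int 9 belongs to row 1. Only in row 0 does the write start at -1 and trip the out-of-bounds report.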

This can be seen in the output, for example:

[image not preserved]

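So the result is... completely wrong.

Now I think I fully understand why it is wrong... but I don't want to fix it. I have already spent a lot of time on this...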
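
For anyone who wants to avoid the problem without touching the kernel, one possible host-side guard is sketched below. It is not from the book, and the function names are hypothetical. It only enforces the width requirement stated in the kernel's header comment, so each row is a whole number of 4-pixel (12-byte) groups and needs no padding.

```cpp
#include <cassert>

// True if each row is a whole number of 4-pixel (12-byte, 3-int) groups,
// which also means the row needs no padding bytes.
inline bool hflip7_width_supported(unsigned h_pixels) {
    return h_pixels > 0 && h_pixels % 4 == 0;
}

// Intended to be called on the host before launching Hflip7.
inline void require_hflip7_width(unsigned h_pixels) {
    assert(hflip7_width_supported(h_pixels) && "Hflip7 needs a width that is a multiple of 4");
}
```

The 6-pixel test image used above fails this check, while an 8-pixel-wide image passes it.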
