难理解的bank conflict

最新推荐文章于 2022-05-08 17:28:28 发布

元气少女缘结神

最新推荐文章于 2022-05-08 17:28:28 发布

阅读量5.9k

点赞数 13

分类专栏： Improve Performance 文章标签： OpenCL

本文链接：https://blog.csdn.net/wd1603926823/article/details/78326570

版权

Improve Performance 专栏收录该内容

39 篇文章 9 订阅

订阅专栏

之前看Nvidia-OpenCL-SDK里有一个例子讲到过bank conflict，但没怎么明白，它选择的是用奇数来避免。

#define BLOCK_DIM 16

// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel below.  Note that the shared memory array is sized to 
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory 
// so that bank conflicts do not occur when threads address the array column-wise.
__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{
	// read the matrix tile into shared memory
	unsigned int xIndex = get_global_id(0);
	unsigned int yIndex = get_global_id(1);

	if((xIndex + offset < width) && (yIndex < height))
	{
		unsigned int index_in = yIndex * width + xIndex + offset;
		block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];
	}

	barrier(CLK_LOCAL_MEM_FENCE);

	// write the transposed matrix tile to global memory
	xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);
	yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);
	if((xIndex < height) && (yIndex + offset < width))
    {
		unsigned int index_out = yIndex * height + xIndex;
		odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
	}
}



// This naive transpose kernel suffers from completely non-coalesced writes.
// It can be up to 10x slower than the kernel above for large matrices.
__kernel void transpose_naive(__global float *odata, __global float* idata, int offset, int width, int height)
{
    unsigned int xIndex = get_global_id(0);
    unsigned int yIndex = get_global_id(1);
    
    if (xIndex + offset < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + offset + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in]; 
    }
}

__kernel void simple_copy(__global float *odata, __global float* idata, int offset, int width, int height)
{
    unsigned int xIndex = get_global_id(0);
    unsigned int yIndex = get_global_id(1);
    
    if (xIndex + offset < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + offset + width * yIndex;
        odata[index_in] = idata[index_in]; 
    }
}

__kernel void shared_copy(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{
	// read the matrix tile into shared memory
	unsigned int xIndex = get_global_id(0);
	unsigned int yIndex = get_global_id(1);

    unsigned int index_in = yIndex * width + xIndex + offset;
	if((xIndex + offset< width) && (yIndex < height))
	{
		//avoid bank conflicts
		block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];
	}

	barrier(CLK_LOCAL_MEM_FENCE);

	if((xIndex < height) && (yIndex+ offset < width))
    {
		odata[index_in] = block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)];
	}
}


__kernel void uncoalesced_copy(__global float *odata, __global float* idata, int offset, int width, int height)
{
    unsigned int xIndex = get_global_id(0);
    unsigned int yIndex = get_global_id(1);
    
    if (xIndex + offset < width && yIndex < height)
    {
        unsigned int index_in  = yIndex + height * (xIndex+ offset);
        odata[index_in] = idata[index_in]; 
    }
}

大神也叫我记住这一点就好了。但还是自己想弄明白，今天看了：但都是以CUDA的视角讲解的：

http://blog.csdn.net/lucky_greenegg/article/details/9992129 http://blog.csdn.net/qqlu_did/article/details/45883159

http://blog.csdn.net/lingerlanlan/article/details/32712749 http://blog.csdn.net/endlch/article/details/47043069

http://blog.csdn.net/smsmn/article/details/6336060 http://blog.csdn.net/o_oxo_o/article/details/4296281 http://www.cnblogs.com/leohan2013/p/3333950.html http://blog.sina.com.cn/s/blog_735f29100102vq84.html http://blog.csdn.net/u014800094/article/details/54290100 http://blog.163.com/volcanolin%40126/blog/static/17086553120110482221978/ http://www.cnblogs.com/biglucky/p/4235009.html

http://blog.csdn.net/u011934885/article/details/54706812

http://blog.csdn.net/endlch/article/details/47043069 这个人写得最好

这几人写得各有千秋，等我真的弄明白再好好讲清楚

看了一会儿，貌似茅塞顿开：我整理了一下，应该是这样理解：

再讲清晰点就是这样：

（3-way 图中写错了，速度是1/3，不是1/4）

对于1所示的存取方式：每次的half-warp中没有多个线程访问同一个bank的不同位置的情况，所以无bank conflict！对于存取方式2：bank0下有不同的线程访问它的不同位置，故发生bank conflict，同样bank4、bank8和bank12都同样发生了bank conflict，所以这是所谓的3 way bank conflict 即速度变为原来的1/3。

上面是根据查到的有关bank conflict相关的资料站在几年前CUDA视角理解的！

话说：以前怎么都不明白，包括自己查资料、看实例、问大神，但就是不明白；现在还是一样的看资料而且是同样的资料看了一点点竟然就明白了，这时候又非常想不通当初的自己为何一直不明白。发现：如别人所说：有的东西，懂了就是懂了，不懂的时候怎样都不会懂。

*************************************************************************************************************************************************

但大神说：

所以对于OpenCL而言，我上面的图要改动的：

1、现在的卡的bank至少都有32个banks，不变的是每个bank依旧是32bit带宽；

2、bank conflict发生在Local memory即LDS的存取时候！

3、执行方式时32个线程即一个warp或者half-wave，即当half-wave=32个线程中有多个线程访问同一个bank的不同位置时发生bank conflict！

4、另外图像常使用uchar4、char4、int、float等都是32bit

上面的图该改善为：

（2 way 不是8 way，图中写错了）

方式3是像CUDA中类似的广播，无conflict！

*********************************************************************************************************

对于开头的矩阵转置例子，这个例子默认是以前老式的16个banks以half-warp访问的老卡：

#define BLOCK_DIM 16
__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{
	// read the matrix tile into shared memory
	unsigned int xIndex = get_global_id(0);
	unsigned int yIndex = get_global_id(1);

	if((xIndex + offset < width) && (yIndex < height))
	{
		unsigned int index_in = yIndex * width + xIndex + offset;
		block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];
	}

	barrier(CLK_LOCAL_MEM_FENCE);

	// write the transposed matrix tile to global memory
	xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);
	yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);
	if((xIndex < height) && (yIndex + offset < width))
    {
		unsigned int index_out = yIndex * height + xIndex;
		odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
	}
}

对于写入block变量那里，即使用16X16不会发生bank conflict；但读出来是会发生conflict，所以这也是为什么用16X17的原因！看图1：

对于图1 ：写入时：half-warp中所有线程不会发生conflict，比如itemID[0---15]负责写block[0---15]所以不会conflict！！！但读出来时，因为是矩阵转置，那么itemID[0]负责读block[0]；itemID[1]负责读block[16];......所以发生了16-way的bank conflict（即16个线程访问了同一个bank的不同位置）！所以改进方式是图2使用16X17的block，画X的地方表示没有实际使用的数。写入时，itemID[0--15]负责block[0---15]，itemID[16---32]负责block[17---32]......所以不会conflict！读出时itemID[0]访问block[0];itemID[1]访问block[17]；itemID[2]访问block[34].....从图2知，它们不在一个bank中，故不会conflict！！！所以这个例子对于老式的卡可以用奇数法规避conflict！！！那么后面那个kernel ：shared_copy 即可以自行理解了！很简单了！！！

*****************************************************************************************************************

对于新式卡，可以仿照上面用32X33来规避bank conflict！

********************************完结**********************************************

2020.7.13

因为重新学习《CUDA C编程权威指南》看到第5.2.2节（按列主序存/取时），又讲到CUDA中的bank conflict，因为是以warp（不再以half-warp访问），所以我看得有些疑惑：

书上说这是16 way bank conflict？可是我觉得这是32 way啊？！因为warp里的所有线程都访问了同一个bank的不同地址。

而且我查到这里也说是32 way啊，是书上错了吗？？？？？？？？

然后我看到网上这个[32][33]来避免bank conflict的图明白了

我是这样理解的，所以这样子可以避免bank conflict。

但是书上第197页说矩形共享内存Data[16][32]，按列主序读存会有8 way bank conflict（我觉得是16 way，书上总是我分析的一半，书上错了吗？？），然后解决办法是Data[16][34]，我纸上分析了一下，这样其实和上图差别不大，就是变成warp0访问0_B0, 32_B2,64_B4...480_B30，这样子就是warp内的线程（只使用了16个）访问了不同bank，所以没有conflict。

可是书上说如果使用Data[16][33]就会有2 way bank conflict？？？我分析了一下没有啊，warp0访问的变成了0_B0,32_B1,64_B2...480_B15而已，也没有conflict啊！！！？？？求大神解答

经过大神点拨，我已明白我的症结所在：我的症结就在以为warp0只访问第一列，有16个线程没有使用。现在从你们这知道了原来是两列，warp0里32个线程必定都存在使用！！！可以看到Data[16][33]按列访问时1_B1,32_B1都访问B1；33_B2,64_B2都访问B2....即图中紫色部分都是访问了同一个bank，而这些都在同一个warp内。所以出现conflict！！！！经过大神点拨，明白我之前卡在那个点，现在茅塞顿开。太开心了。

/*******************************************************************************************************************************/

2020.11.13

我以为我真的都理解了，然而 https://bbs.gpuworld.cn/index.php?topic=73410.0 这里大神又给我上了一课，并不是现在所有的卡都是warp访问。如果如大神所说4B是按照warp访问, 8B按照half warp来的, 而16B按照1/4 warp访问。那么这个网友这种情况的确不会出现conflict，因为a[0]放在bank0和bank1简称b0、b1;a[1]--b2、b3;依次类推，a[15]在b30、b31所以这16个线程中并没有出现不同线程访问同一个bank的情况！是我轻率了，看来还有很多要学习的地方。

再次和大神交流，依旧获益良多，包括回顾自己的笔记也感触连连。一下就回到几年前大家一起讨论的日子，虽然当时很菜但交流过程中慢慢明白很多原理，这是一种发自内心的快乐。现在依然很菜，没有大神帮助时，更多的是自己去查书、查网上的解答、查自己当初的笔迹，看当初的交流记录，有时干脆自己推论、验证，这也许就是学习中的成长。