CUDA-GPU programming Introduction (3)

最新推荐文章于 2022-02-06 01:48:01 发布

NewCoder_BL

最新推荐文章于 2022-02-06 01:48:01 发布

阅读量360

点赞数 1

分类专栏： cuda 文章标签：线程优化 gpu

本文链接：https://blog.csdn.net/u011934885/article/details/71746278

版权

cuda 专栏收录该内容

7 篇文章 0 订阅

订阅专栏

关于提高performance的一些建议：
Important caveat：number of threads

并不是越多并行线程效率越高，因为每个线程都消耗一定的resource，主要是register和shared memory。所以开出再多的线程，GPU也只能在有限的资源下让一部分并行。优化应该根据资源需求。

unavoidable bottleneck: transfer between cpu and gpu

CPU和GPU之间的transfer是一个很大的bottleneck，而且长时间存在且无法避免，因为GPU能力有限，无法独立作业，必须依靠CPU来调用。简单的例子就是GPU无法access file system。针对这个瓶颈，基本的两个处理就是，使用pinned memory，以及asynchronous transfers（overlapping computation and transfer）。

Optimizing access to global memory

GPU有很多的core来进行大量的计算，但是数据必须从global memory获取，如果当core的计算量少于和global memory之间的传输量，瓶颈就会出现，因为这时候大部分的时间都花在data的传输上，GPU的使用变得不划算。而且这种瓶颈在很多问题无法避免。
Utilizing the memory architecture effectively tends to be the biggest challenge in CUDA algorithms

GPU和global memory之间的传输可以有很高的带宽high bandwidth，但同时也有很高的延迟，high latency。所以读写内存的方式很重要。

Using many threads, latency can be overcome by hiding it among many threads. The pattern of global memory access is also very important, as cache size of the GPU is very limited.

Global memory access is fast when coalesced
It is best for adjacent threads belonging to the same warp (group of 32 threads) to be accessing locations adjacent in memory (or as close as possible)
• Good access pattern: thread i accesses global memory array member a[i]
• Inferior access pattern: thread i accesses global memory array member as a[i*nstride] where nstride >1
• Clearly, random access of memory is a particularly bad paradigm on the GPU

但有些问题本质上就有不连续的内存读写，使得优化变得困难。
典型案例：矩阵旋转 matrix transpose
A bandwidth-limited problem that is dominated by memory access。
这里写图片描述

一个基本的写法就是：

__global__ void transpose_naive(float *odata, float *idata, int width,int height)
{
    int xIndex, yIndex, index_in, index_out;
    xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)
    {
        index_in = xIndex + width * yIndex;
        index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

实际上数组在内存都是连续存储的，二维还是三维都只是直观上的表现，上述代码在内存中的实际表现如下：
这里写图片描述

这样看，不连续性就暴露的很彻底。这个问题还是可以解决的，因为我们还可以利用shared memory，因为shared memory不需要coalesced 读写，即使不是coalesced的也比global memory下的要快很多。尤其是需要多次访问一些数据的时候，放在shared memory比较好。但shared memory就是比较小，一般就是48kB或者16kB,而且必须要自己写代码的时候调整。

Each multiprocessor has some fast on-chip shared memory
• Threads within a thread block can communicate using the shared memory
• Each thread in a thread block has R/W access to all of the shared memory allocated to a block
• Threads can synchronize using the intrinsic __syncthreads();

这里写图片描述

具体代码如下：

__global__ void transpose(float *odata, float *idata,
int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM];
    unsigned int xIndex, yIndex, index_in, index_out;
    /* read the matrix tile into shared memory */
    xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }
    __syncthreads();
    /* write the transposed matrix tile to global memory */
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

这里面，shared memory就是个中介的作用，让core和global memory之间不可避免的非连续读写在它这完成。单独说明一下odata的index计算。我们发现，xIndex和yIndex在计算block的位置的时候是符合转置关系的，但是在block内的位置依旧是原来的x,y关系，因为这里必须保证连续性，所以把转置关系留在shared memory里来做。同时补充一点，这里的x,y是一般意义上的x,y方向，即x是row方向，y是column方向。所以在一般内存里矩阵寻址的时候还是应该[y][x]。

这里写图片描述

同时，对于高度优化的代码还应该考虑bank conflict，因为shared memory中分了32个bank, 可以4 bytes和8 bytes，对应字长的。详细见另一篇博客。32是为了对应一个warp size，保证同时进来的warp里的thread不会落入同一个bank，不会有冲突。简单的解决方法就是把32x32大小的block加一个padding，32x33，这样就把可能有的冲突错开了。这个矩阵转置有冲突就是体现在从shared memory把内存拷到global memory的时候都是按列读取的，全在同一个bank,这样的话这些threads都从并行变成了serial处理，latency很高。简单处理后的结果是：

__global__ void transpose(float *odata, float *idata,
int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];
    unsigned int xIndex, yIndex, index_in, index_out;
    /* read the matrix tile into shared memory */
    xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }
    __syncthreads();
    /* write the transposed matrix tile to global memory */
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}