Accelerated Ray Tracing (Part 9)

https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

Matrix Transpose

This example transposes a matrix of single-precision values, operating out-of-place, i.e., the input and output are separate arrays in memory. For simplicity of presentation, we consider only square matrices whose dimensions are integral multiples of 32. The code consists of several kernels as well as host code that performs typical tasks: allocation and data transfer between host and device, launching and timing the kernels, validating their results, and freeing host and device memory.

In addition to performing several different matrix transposes, we also run a simple matrix copy kernel, because copy performance indicates the performance we would like the transpose to achieve. For both matrix copy and transpose, the relevant performance metric is the effective bandwidth, calculated in GB/s by taking twice the size of the matrix in GB (once for loading the matrix and once for storing it) and dividing by the seconds of execution.
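
In symbols, for an n × n matrix of 4-byte floats measured over t seconds of execution, this is

    effective bandwidth (GB/s) = 2 * n^2 * 4 bytes / (10^9 * t)

where the factor of 2 accounts for one load and one store of each element; the postprocess routine in the listing below computes exactly this quantity.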

All kernels in this study launch blocks of 32×8 threads (TILE_DIM = 32 and BLOCK_ROWS = 8 in the code), and each thread block transposes (or copies) a tile of size 32×32. Using a thread block with fewer threads than elements in a tile is advantageous for the matrix transpose: each thread transposes four matrix elements, so much of the index-calculation cost is amortized over those elements.

The kernels in this example map threads to matrix elements using a Cartesian (x, y) mapping rather than a row/column mapping, to simplify the meaning of the components of the automatic CUDA C variables: threadIdx.x is horizontal and threadIdx.y is vertical. This mapping is up to the programmer; the important thing to remember is that, to ensure memory coalescing, we want to map the quickest-varying component to contiguous elements in memory. In C/C++, arrays are stored in row-major order, so contiguous addresses correspond to the last index of a multidimensional array, and threadIdx.x and blockIdx.x vary quickest within blocks and grids, respectively.
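
A minimal sketch of this mapping (using a hypothetical readTile kernel with one thread per element, unlike the 32×8 blocks used in the full listing below):

__global__ void readTile(float* odata, const float* idata, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal index, varies fastest
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // vertical index
    // Row-major storage: consecutive threadIdx.x values touch consecutive
    // addresses, so both the load and the store below are coalesced.
    odata[y * width + x] = idata[y * width + x];
}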

Simple Matrix Copy

Each thread copies four elements of the matrix in a loop at the end of the copy kernel (see the full listing below), because the number of threads per block is smaller by a factor of four (TILE_DIM/BLOCK_ROWS) than the number of elements in a tile. Note also that TILE_DIM must be used in the calculation of the matrix index y rather than BLOCK_ROWS or blockDim.y. The loop iterates over the second dimension rather than the first so that contiguous threads load and store contiguous data, and all reads from idata and writes to odata are coalesced.

Naive Matrix Transpose

In transposeNaive the reads from idata are the same as in the copy kernel, but for our 1024×1024 test matrix the writes to odata have a stride of 1024 elements, or 4096 bytes, between contiguous threads. This puts us well into the asymptote of the strided-memory-access plot, and we expect the performance of this kernel to suffer accordingly. The results of the copy and transposeNaive kernels bear this out.

The transposeNaive kernel achieves only a fraction of the effective bandwidth of the copy kernel. Because this kernel does very little other than copying, we would like it to get closer to copy throughput. Let's look at how we can do that.

Coalesced Transpose Via Shared Memory

The remedy for the poor transpose performance is to use shared memory to avoid the large strides through global memory: a tile of the matrix is staged in shared memory so that both the reads from and the writes to global memory can be contiguous. (A figure in the original post depicts how shared memory is used in the transpose.)

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
        assert(result == cudaSuccess);
    }
#endif
    return result;
}

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;

// Check errors and print GB/s
void postprocess(const float* ref, const float* res, int n, float ms)
{
    bool passed = true;
    for (int i = 0; i < n; i++)
        if (res[i] != ref[i]) {
            printf("%d %f %f\n", i, res[i], ref[i]);
            printf("%25s\n", "*** FAILED ***");
            passed = false;
            break;
        }
    if (passed)
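        // effective bandwidth in GB/s: total bytes moved (one read + one write
        // per element, over NUM_REPS reps) * 1e-6 / elapsed ms == bytes/1e9 / seconds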
        printf("%20.2f\n", 2 * n * sizeof(float) * 1e-6 * NUM_REPS / ms);
}

// simple copy kernel
// Used as reference case representing best effective bandwidth.
__global__ void copy(float* odata, const float* idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    // note that TILE_DIM, not BLOCK_ROWS or blockDim.y, must be used in the
    // calculation of the matrix index y
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // each thread copies four elements (TILE_DIM / BLOCK_ROWS); looping over the
    // second dimension keeps contiguous threads on contiguous data, so all reads
    // from idata and writes to odata are coalesced
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = idata[(y + j) * width + x];
}

// copy kernel using shared memory
// Also used as reference case, demonstrating effect of using shared memory.
__global__ void copySharedMem(float* odata, const float* idata)
{
    __shared__ float tile[TILE_DIM * TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y + j) * TILE_DIM + threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();  // technically not needed here, since each element is read and
                      // written by the same thread, but included to mimic the transpose

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[(threadIdx.y + j) * TILE_DIM + threadIdx.x];
}

// naive transpose
// Simplest transpose; doesn't use shared memory.
// Global memory reads are coalesced but writes are not.
__global__ void transposeNaive(float* odata, const float* idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // looks very similar to the copy kernel; the only difference is that the
    // indices for odata are swapped
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[x * width + (y + j)] = idata[(y + j) * width + x];
}

// coalesced transpose
// Uses shared memory to achieve coalescing in both reads and writes
// Tile width == #banks causes shared memory bank conflicts.
__global__ void transposeCoalesced(float* odata, const float* idata)
{
    // This kernel performs the "tiled" transpose. transposeCoalesced improves on
    // transposeNaive, but still falls well short of the copy kernel. One might
    // guess that the gap is the overhead of using shared memory and the required
    // __syncthreads(); the copySharedMem kernel above tests exactly that.
    //
    // With a 32x32 shared memory tile, all elements in a column of the tile map
    // to the same shared memory bank, the worst case: reading a column of the
    // tile causes a 32-way bank conflict.
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // in the first loop, each warp reads contiguous data from idata into rows
    // of the shared memory tile
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    // because threads write different data to odata than they read from idata,
    // we must use the block-wise barrier __syncthreads()
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

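    // in the second loop, each warp writes one row of odata (coalesced) from one
    // column of the tile, which is what triggers the bank conflicts noted above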
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}


// No bank-conflict transpose
// Same as transposeCoalesced except each tile row is padded by one element
// to avoid shared memory bank conflicts.
__global__ void transposeNoBankConflicts(float* odata, const float* idata)
{
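    // padding each row to TILE_DIM+1 words shifts successive rows by one bank,
    // so the elements of a tile column fall in different banks and column
    // accesses are conflict-free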
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}

int main(int argc, char** argv)
{
    const int nx = 1024;
    const int ny = 1024;
    const int mem_size = nx * ny * sizeof(float);

    dim3 dimGrid(nx / TILE_DIM, ny / TILE_DIM, 1);
    dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

    int devId = 0;
    if (argc > 1) devId = atoi(argv[1]);

    cudaDeviceProp prop;
    checkCuda(cudaGetDeviceProperties(&prop, devId));
    printf("\nDevice : %s\n", prop.name);
    printf("Matrix size: %d %d, Block size: %d %d, Tile size: %d %d\n",
        nx, ny, TILE_DIM, BLOCK_ROWS, TILE_DIM, TILE_DIM);
    printf("dimGrid: %d %d %d. dimBlock: %d %d %d\n",
        dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);

    checkCuda(cudaSetDevice(devId));

    float* h_idata = (float*)malloc(mem_size);
    float* h_cdata = (float*)malloc(mem_size);
    float* h_tdata = (float*)malloc(mem_size);
    float* gold = (float*)malloc(mem_size);

    float* d_idata, * d_cdata, * d_tdata;
    checkCuda(cudaMalloc(&d_idata, mem_size));
    checkCuda(cudaMalloc(&d_cdata, mem_size));
    checkCuda(cudaMalloc(&d_tdata, mem_size));

    // check parameters and calculate execution configuration
    if (nx % TILE_DIM || ny % TILE_DIM) {
        printf("nx and ny must be a multiple of TILE_DIM\n");
        goto error_exit;
    }

    if (TILE_DIM % BLOCK_ROWS) {
        printf("TILE_DIM must be a multiple of BLOCK_ROWS\n");
        goto error_exit;
    }

    // host
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            h_idata[j * nx + i] = j * nx + i;

    // correct result for error checking
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            gold[j * nx + i] = h_idata[i * nx + j];

    // device
    checkCuda(cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice));

    // events for timing
    cudaEvent_t startEvent, stopEvent;
    checkCuda(cudaEventCreate(&startEvent));
    checkCuda(cudaEventCreate(&stopEvent));
    float ms;

    // ------------
    // time kernels
    // ------------
    printf("%25s%25s\n", "Routine", "Bandwidth (GB/s)");

    // ----
    // copy 
    // ----
    printf("%25s", "copy");
    checkCuda(cudaMemset(d_cdata, 0, mem_size));
    // warm up
    copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
    checkCuda(cudaEventRecord(startEvent, 0));
    for (int i = 0; i < NUM_REPS; i++)
        copy<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost));
    postprocess(h_idata, h_cdata, nx * ny, ms);

    // -------------
    // copySharedMem 
    // -------------
    printf("%25s", "shared memory copy");
    checkCuda(cudaMemset(d_cdata, 0, mem_size));
    // warm up
    copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
    checkCuda(cudaEventRecord(startEvent, 0));
    for (int i = 0; i < NUM_REPS; i++)
        copySharedMem<<<dimGrid, dimBlock>>>(d_cdata, d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost));
    postprocess(h_idata, h_cdata, nx * ny, ms);

    // --------------
    // transposeNaive 
    // --------------
    printf("%25s", "naive transpose");
    checkCuda(cudaMemset(d_tdata, 0, mem_size));
    // warmup
    transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(startEvent, 0));
    for (int i = 0; i < NUM_REPS; i++)
        transposeNaive<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost));
    postprocess(gold, h_tdata, nx * ny, ms);

    // ------------------
    // transposeCoalesced 
    // ------------------
    printf("%25s", "coalesced transpose");
    checkCuda(cudaMemset(d_tdata, 0, mem_size));
    // warmup
    transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(startEvent, 0));
    for (int i = 0; i < NUM_REPS; i++)
        transposeCoalesced<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost));
    postprocess(gold, h_tdata, nx * ny, ms);

    // ------------------------
    // transposeNoBankConflicts
    // ------------------------
    printf("%25s", "conflict-free transpose");
    checkCuda(cudaMemset(d_tdata, 0, mem_size));
    // warmup
    transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(startEvent, 0));
    for (int i = 0; i < NUM_REPS; i++)
        transposeNoBankConflicts<<<dimGrid, dimBlock>>>(d_tdata, d_idata);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
    checkCuda(cudaMemcpy(h_tdata, d_tdata, mem_size, cudaMemcpyDeviceToHost));
    postprocess(gold, h_tdata, nx * ny, ms);

error_exit:
    // cleanup
    checkCuda(cudaEventDestroy(startEvent));
    checkCuda(cudaEventDestroy(stopEvent));
    checkCuda(cudaFree(d_tdata));
    checkCuda(cudaFree(d_cdata));
    checkCuda(cudaFree(d_idata));
    free(h_idata);
    free(h_tdata);
    free(h_cdata);
    free(gold);
}
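
Assuming the listing is saved as transpose.cu, it can be compiled and run with nvcc as follows (the -DDEBUG flag enables the error checking in checkCuda, which is otherwise compiled out; the optional argument selects the device):

nvcc -O3 -DDEBUG transpose.cu -o transpose
./transpose 0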

In this post we presented three kernels that represent various optimizations of the matrix transpose. The kernels show how to use shared memory to coalesce global memory accesses and how to pad arrays to avoid shared memory bank conflicts. Judging by the relative gains of our kernels, coalescing global memory accesses is by far the most critical aspect of achieving good performance, which is true in many applications. Because global memory coalescing is so important, we revisit it in the next post, when we look at a finite difference computation on a 3D mesh.
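
As a quick check of why the padding works, assume shared memory has 32 banks of 4-byte words, so a float at word offset w lives in bank w mod 32. For tile[TILE_DIM][TILE_DIM], element (r, c) sits at word offset 32r + c, giving bank

    (32r + c) mod 32 = c

so an entire tile column (fixed c, varying r) hits a single bank: a 32-way conflict. For tile[TILE_DIM][TILE_DIM + 1] the offset becomes 33r + c and the bank is

    (33r + c) mod 32 = (r + c) mod 32

which takes 32 distinct values as r runs over a column, so column accesses are conflict-free.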
