CUDA Study Notes (9): A Careful Look at matrixMul with Shared Memory

zhouzhouasishuijiao

Article link: https://blog.csdn.net/zhouzhouasishuijiao/article/details/84678035

The code comes from CUDA Samples\v9.1\0_Simple\matrixMul.

My annotations mainly cover matrixMulCUDA().

C=A*B

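The idea is to split the matrices into small square blocks (sub-matrices). Each thread runs two nested loops: the outer loop steps through the pairs of sub-matrices of A and B, and the inner loop walks along one row of the current A sub-matrix and one column of the current B sub-matrix, accumulating their products.

You can think of the Csub computed by one thread as a single element of C. Imagine all the threads running in parallel and you get every element of C. A thread cannot do this on its own, though: its inner loop reads a whole row and a whole column of the sub-matrices, and most of those elements were loaded into shared memory by other threads in the same block. That is what __syncthreads() is for: it makes every thread in the block wait until the sub-matrices are fully loaded before anyone reads them, and again until everyone has finished reading before they are overwritten by the next pair. With a global-memory-only version you do not need it, because each thread works independently and reads everything it needs from global memory itself. This is also the point of shared memory: each element of a sub-matrix is fetched from global memory once per block and then reused by BLOCK_SIZE threads, which reduces global memory accesses.

If I have said anything wrong, please point it out. Using shared memory is the part I find hardest.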

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Matrix multiplication: C = A * B.
 * Host code.
 *
 * This sample implements matrix multiplication as described in Chapter 3
 * of the programming guide.
 * It has been written for clarity of exposition to illustrate various CUDA
 * programming principles, not with the goal of providing the most
 * performant generic kernel for matrix multiplication.
 *
 * See also:
 * V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
 * in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
 * Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
 */

// System includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// Helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>

/**
 * Matrix multiplication (CUDA Kernel) on the device: C = A * B
 * wA is A's width and wB is B's width
 */
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
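    // Index of A's first sub-matrix for this block, selected by by = blockIdx.y.
    // This is where the work starts to differ between blocks:
    // each block row has its own aBegin, so threads in different blocks
    // start the sub-matrix loop at a different offset into A.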
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
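    // Upper bound for the loop over A's sub-matrices in this block:
    // aEnd is the last element of the row of A that starts at aBegin.
    // Note that wA is A's width.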
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    // Step between successive sub-matrices of A: BLOCK_SIZE columns to the right
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    // Index of B's first sub-matrix for this block, selected by bx = blockIdx.x
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
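    // Step between successive sub-matrices of B: BLOCK_SIZE rows down,
    // which is BLOCK_SIZE * wB elements in row-major storage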
    int bStep  = BLOCK_SIZE * wB;
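
    // Worked example of the five values above. The numbers are illustrative
    // assumptions, not taken from the sample: BLOCK_SIZE = 16, wA = 64,
    // wB = 32, and the block at (bx, by) = (1, 1).
    //   aBegin = 64 * 16 * 1    = 1024  (row 16, column 0 of A)
    //   aEnd   = 1024 + 64 - 1  = 1087  (row 16, column 63 of A)
    //   aStep  = 16                     (one sub-matrix to the right in A)
    //   bBegin = 16 * 1         = 16    (row 0, column 16 of B)
    //   bStep  = 16 * 32        = 512   (one sub-matrix down in B)
    // With these values a loop from aBegin to aEnd in steps of aStep
    // visits 1024, 1040, 1056, 1072: 64 / 16 = 4 pairs of sub-matrices.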

    // Csub is used t