The code comes from CUDA Samples\v9.1\0_Simple\matrixMul.
I have mainly annotated matrixMulCUDA().
It computes C = A * B.
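The idea is to split the matrices into small tiles (sub-matrices). Each thread runs two nested loops: the outer loop steps through the sub-matrices, and the inner loop steps through the row of the A tile and the column of the B tile that the thread needs.
You can think of the Csub computed by one thread as exactly one element of C. Since all threads run in parallel, every element of C gets computed this way. A thread cannot do this on its own, though: in the inner loop it reads elements of the shared tiles, and most of those were loaded by other threads in the same block. That is what __syncthreads() is for. It makes every thread in the block wait until the tiles are fully loaded before anyone reads them, and wait again until everyone has finished reading before the next iteration overwrites the tiles. A version that uses only global memory does not need this barrier, because each thread reads its own operands directly from global memory and never depends on data loaded by another thread. That is also the point of shared memory: each tile element is read from global memory once per block and then reused by BLOCK_SIZE threads, which reduces global memory traffic.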
Please point out anything I got wrong. The use of shared memory is the part I find hardest.
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Matrix multiplication: C = A * B.
* Host code.
*
* This sample implements matrix multiplication as described in Chapter 3
* of the programming guide.
* It has been written for clarity of exposition to illustrate various CUDA
* programming principles, not with the goal of providing the most
* performant generic kernel for matrix multiplication.
*
* See also:
* V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
* in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
* Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
*/
// System includes
#include <stdio.h>
#include <assert.h>
// CUDA runtime
#include <cuda_runtime.h>
// Helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>
/**
* Matrix multiplication (CUDA Kernel) on the device: C = A * B
* wA is A's width and wB is B's width
*/
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
// (by = blockIdx.y). aBegin depends on the block: each block starts its
// loop over sub-matrices at a different offset, namely the first element
// of the tile row of A that this block handles.
int aBegin = wA * BLOCK_SIZE * by;
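// Index of the last sub-matrix of A processed by the block.
// Note that wA is the width of A, so aEnd is the index of the last element
// in the row where aBegin starts; it serves as the upper bound of the loop.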
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
// (the tiles of A advance to the right, BLOCK_SIZE columns at a time)
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
// (also chosen per block, here through bx = blockIdx.x)
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
// (the tiles of B advance downward, BLOCK_SIZE rows = BLOCK_SIZE * wB elements at a time)
int bStep = BLOCK_SIZE * wB;
// Csub is used t