CUDA学习(九)想好好解释一下利用shared memory去做matrixMul

 代码来自CUDA Samples\v9.1\0_Simple\matrixMul

主要注释了matrixMulCUDA()

C=A*B

它的理念是把矩阵分成小块,每个线程利用两层循环,大循环在迭代subMatrix,小循环则是迭代每个小subMatrix的横纵坐标,

可以理解为一个线程计算出的Csub就是对应一个元素,然后你想象所有线程是并行的,所以所有的元素计算就可以理解过来了,但是单纯的一个线程是解决不了这个问题的,因为它小循环迭代的时候是需要所有元素的内容的,这也是使用__syncthreads()的作用,而你在使用global memory的时候,不需要用这个,因为每个线程各做各的,都需要去读global memory,这也是使用shared memory的作用,它可以减少对global的访问。

说的不对的地方请大家给我指出来,这各shard memory的使用是我觉得最难的

 

 

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Matrix multiplication: C = A * B.
 * Host code.
 *
 * This sample implements matrix multiplication as described in Chapter 3
 * of the programming guide.
 * It has been written for clarity of exposition to illustrate various CUDA
 * programming principles, not with the goal of providing the most
 * performant generic kernel for matrix multiplication.
 *
 * See also:
 * V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
 * in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
 * Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
 */

// System includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// Helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>

/**
 * Matrix multiplication (CUDA Kernel) on the device: C = A * B
 * wA is A's width and wB is B's width
 */
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    //这是A的第一个细分矩阵的索引,利用block处理(by=blockIdx.y)
    //也是从这里让aBegin根据不同block进行处理
    //不同block拥有不同的aBegin,即不同的block线程下的循环初始值的不同
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    //这是A的最后一个细分矩阵的索引,利用block处理
    //注意wA是A的width
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    //A的细分矩阵的步距
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    //这是B的第一个细分矩阵的索引,也是利用block处理(bx=blockIdx.x)
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    //B的细分矩阵的迭代用步距
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used t
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值