A warp divergence example
Basic concepts
Because of the GPU's SIMT instruction issue, the 32 threads of a warp execute the same instruction in every cycle. When a warp reaches a branch such as an if statement and only some of its threads satisfy the condition, the warp still follows a single instruction stream: it steps through each taken path in turn, and the threads that do not belong to the path currently being executed are marked inactive. This serialization lowers the warp's execution efficiency.
#include <stdlib.h>
#include <stdio.h>

__global__ void mathKernel1(int *arr, int nElem)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < nElem)
    {
        //- Even and odd indices sit next to each other in the same warp,
        //- so every warp diverges on this branch.
        if (index % 2 == 0)
        {
            for (int i = 0; i < 10000; i++)
            {
                arr[index] = 0;
            }
        }
        else
        {
            for (int i = 0; i < 10000; i++)
            {
                arr[index] = 1;
            }
        }
    }
}
__global__ void mathKernel2(int *arr, int nElem)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < nElem)
    {
        //- All 32 threads of a warp share the same value of index / warpSize,
        //- so a whole warp takes the same path and never diverges.
        if ((index / warpSize) % 2 == 0)
        {
            for (int i = 0; i < 10000; i++)
            {
                arr[index] = 0;
            }
        }
        else
        {
            for (int i = 0; i < 10000; i++)
            {
                arr[index] = 1;
            }
        }
    }
}
int main(int argc, char **argv)
{
    int *d_arr;
    int nElem = 1 << 14;
    cudaMalloc((void **)&d_arr, nElem * sizeof(int));
    dim3 block(atoi(argv[1]));
    dim3 grid((nElem + block.x - 1) / block.x);
    mathKernel1<<<grid, block>>>(d_arr, nElem);
    mathKernel2<<<grid, block>>>(d_arr, nElem);
    cudaFree(d_arr);
    cudaDeviceReset();
    return 0;
}
Performance analysis
[mmhe@k231 chapter3]$ nvcc -g -G simpleDivergence.cu -o test
[mmhe@k231 chapter3]$ nvprof ./test 1024
==11640== NVPROF is profiling process 11640, command: ./test 1024
==11640== Profiling application: ./test 1024
==11640== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 66.70% 27.732ms 1 27.732ms 27.732ms 27.732ms mathKernel1(int*, int)
33.30% 13.843ms 1 13.843ms 13.843ms 13.843ms mathKernel2(int*, int)
mathKernel1 runs almost twice as long as mathKernel2. The reason is that the branch condition in the first kernel differs fundamentally from that in the second:
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index % 2 == 0) //- mathKernel1
if ((index / warpSize) % 2 == 0) //- mathKernel2
- First, recall that the threads of a block are grouped into warps by threadIdx.x, so in this example the threadIdx.x values within a warp are consecutive, and therefore so are the index values. For mathKernel1 this means half of a warp's threads satisfy the if branch and the other half the else branch. Because every thread of the warp steps through the instructions of both paths, warp execution efficiency drops to 50%.
- For mathKernel2, index is first divided by warpSize and the integer quotient is tested for parity, so all threads of a warp satisfy either the if branch or the else branch together. In this case the warp never diverges.
The branch behaviour of a kernel can be inspected with nvprof's --metrics warp_execution_efficiency,inst_per_warp:
[mmhe@k231 chapter3]$ nvprof --metrics warp_execution_efficiency,inst_per_warp ./test 1024
==24533== NVPROF is profiling process 24533, command: ./test 1024
==24533== Profiling application: ./test 1024
==24533== Profiling result:
==24533== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: mathKernel2(int*, int)
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 inst_per_warp Instructions per warp 4.3015e+05 4.3015e+05 4.3015e+05
Kernel: mathKernel1(int*, int)
1 warp_execution_efficiency Warp Execution Efficiency 50.00% 50.00% 50.00%
1 inst_per_warp Instructions per warp 8.6011e+05 8.6011e+05 8.6011e+05
The instruction counts and execution efficiencies of the two kernels match the expectation. In most real applications warp divergence cannot be avoided entirely, but being aware of it lets us design algorithms that mitigate its cost as much as possible.
Occupancy
Basic theory
\begin{equation}
\text{occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}}
\end{equation}
The maximum number of warps is derived from the maximum number of resident threads per SM. This metric can be obtained from nvprof as achieved_occupancy. Matrix addition is used as the test case below.
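The sumMat kernel profiled below is not listed in these notes. A minimal sketch of a 2D-grid, 2D-block matrix addition that matches the signature in the nvprof output (the body itself is my assumption; row-major int matrices of size nx by ny):

__global__ void sumMat(int *A, int *B, int *C, int nx, int ny)
{
    //- Map the 2D thread coordinates to a linear index into the row-major matrices.
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < nx && iy < ny)
    {
        int idx = iy * nx + ix;
        C[idx] = A[idx] + B[idx];
    }
}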
Performance analysis
A series of results can be obtained by varying the thread-organization parameters.
[mmhe@k231 chapter3]$ nvcc -g -G -Xptxas -dlcm=ca -arch=sm_35 sumMatrixOnGPU-2D-grid-2D-block.cu -o test
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,achieved_occupancy ./test 32 32
==14181== NVPROF is profiling process 14181, command: ./test 32 32
matrix (16384, 16384) size is 268435456, memory size is 1024.00MB
gridDim:(512,512,1) blockDim:(32,32,1)
==14181== Some kernel(s) will be replayed on device 10 in order to collect all events/metrics.
Replaying kernel "sumMat(int*, int*, int*, int, int)" (done)
result is correct
==14181== Profiling application: ./test 32 32
==14181== Profiling result:
==14181== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (10)"
Kernel: sumMat(int*, int*, int*, int, int)
1 gld_transactions Global Load Transactions 16777216 16777216 16777216
1 gst_transactions Global Store Transactions 8388608 8388608 8388608
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 achieved_occupancy Achieved Occupancy 0.930404 0.930404 0.930404
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,achieved_occupancy ./test 32 16
==14262== NVPROF is profiling process 14262, command: ./test 32 16
matrix (16384, 16384) size is 268435456, memory size is 1024.00MB
gridDim:(512,1024,1) blockDim:(32,16,1)
==14262== Some kernel(s) will be replayed on device 10 in order to collect all events/metrics.
Replaying kernel "sumMat(int*, int*, int*, int, int)" (done)
result is correct
==14262== Profiling application: ./test 32 16
==14262== Profiling result:
==14262== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (10)"
Kernel: sumMat(int*, int*, int*, int, int)
1 gld_transactions Global Load Transactions 16777216 16777216 16777216
1 gst_transactions Global Store Transactions 8388608 8388608 8388608
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 achieved_occupancy Achieved Occupancy 0.939280 0.939280 0.939280
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,achieved_occupancy ./test 16 32
==14308== NVPROF is profiling process 14308, command: ./test 16 32
matrix (16384, 16384) size is 268435456, memory size is 1024.00MB
gridDim:(1024,512,1) blockDim:(16,32,1)
==14308== Some kernel(s) will be replayed on device 10 in order to collect all events/metrics.
Replaying kernel "sumMat(int*, int*, int*, int, int)" (done)
result is correct
==14308== Profiling application: ./test 16 32
==14308== Profiling result:
==14308== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (10)"
Kernel: sumMat(int*, int*, int*, int, int)
1 gld_transactions Global Load Transactions 33554432 33554432 33554432
1 gst_transactions Global Store Transactions 16777216 16777216 16777216
1 gld_efficiency Global Memory Load Efficiency 50.00% 50.00% 50.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 achieved_occupancy Achieved Occupancy 0.942908 0.942908 0.942908
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,achieved_occupancy ./test 16 16
==14356== NVPROF is profiling process 14356, command: ./test 16 16
matrix (16384, 16384) size is 268435456, memory size is 1024.00MB
gridDim:(1024,1024,1) blockDim:(16,16,1)
==14356== Some kernel(s) will be replayed on device 10 in order to collect all events/metrics.
Replaying kernel "sumMat(int*, int*, int*, int, int)" (done)
result is correct
==14356== Profiling application: ./test 16 16
==14356== Profiling result:
==14356== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (10)"
Kernel: sumMat(int*, int*, int*, int, int)
1 gld_transactions Global Load Transactions 33554432 33554432 33554432
1 gst_transactions Global Store Transactions 16777216 16777216 16777216
1 gld_efficiency Global Memory Load Efficiency 50.00% 50.00% 50.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 achieved_occupancy Achieved Occupancy 0.949228 0.949228 0.949228
As the number of blocks increases, the achieved occupancy rises. What is a little odd is that the (16,32) configuration reports a higher occupancy than (32,16), even though the two configurations launch exactly the same number of blocks. A likely explanation is that achieved_occupancy is an average over the whole kernel run rather than an instantaneous value.
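As a side note that is not part of the original tests, the theoretical bound in the occupancy formula above can also be queried programmatically and compared against achieved_occupancy. A minimal sketch using the CUDA occupancy API, assuming the sumMat kernel sketched earlier and <stdio.h> are in scope:

void reportTheoreticalOccupancy(int blockSize)
{
    int numBlocksPerSm = 0;
    //- Maximum number of resident blocks of sumMat per SM for this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, sumMat, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocksPerSm * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical occupancy = %.2f\n", (float)activeWarps / (float)maxWarps);
}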
Parallel reduction
Problem description
Sum a one-dimensional array. The parallel algorithm is: (1) each block handles one segment of the data; (2) each block sums its own segment and stores the partial sum in the segment's first element; (3) the per-block sums are copied back to the host, where host code adds them up to produce the final answer.
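The host code for these tests is not reproduced in the notes. A minimal sketch of the three steps above, assuming the kernel reduceNeighbored1 listed below (the data initialization and everything else are my assumptions, and error checking is omitted):

#include <stdlib.h>
#include <stdio.h>

__global__ void reduceNeighbored1(int *arr, int *out, int nElem); //- defined below

int main(int argc, char **argv)
{
    int nElem = 1 << 14;                                //- 16384 elements, as in the tests below
    dim3 block(atoi(argv[1]));
    dim3 grid((nElem + block.x - 1) / block.x);

    int *h_arr = (int *)malloc(nElem * sizeof(int));
    int *h_out = (int *)malloc(grid.x * sizeof(int));
    int cpuSum = 0;
    for (int i = 0; i < nElem; i++) { h_arr[i] = 1; cpuSum += h_arr[i]; } //- all ones: expected sum = nElem
    printf("CPU result = %d\n", cpuSum);

    //- Step (1): copy the data to the device; each block will own one segment of blockDim.x elements.
    int *d_arr, *d_out;
    cudaMalloc((void **)&d_arr, nElem * sizeof(int));
    cudaMalloc((void **)&d_out, grid.x * sizeof(int));
    cudaMemcpy(d_arr, h_arr, nElem * sizeof(int), cudaMemcpyHostToDevice);

    //- Step (2): each block reduces its segment and writes its partial sum to out[blockIdx.x].
    reduceNeighbored1<<<grid, block>>>(d_arr, d_out, nElem);

    //- Step (3): copy the per-block sums back and finish the reduction on the host.
    cudaMemcpy(h_out, d_out, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
    int gpuSum = 0;
    for (unsigned int i = 0; i < grid.x; i++) gpuSum += h_out[i];
    printf("GPU result = %d\n", gpuSum);

    free(h_arr); free(h_out);
    cudaFree(d_arr); cudaFree(d_out);
    cudaDeviceReset();
    return 0;
}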
The simplest implementation
Threads are mapped one-to-one onto the data. Each step adds neighbouring elements, with the gap doubling until it covers all of the block's data. This approach suffers from warp divergence.
__global__ void reduceNeighbored1(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x; //- start of the data handled by this block
    int index = threadIdx.x;
    if (index < blockDim.x)
    {
        for (int stride = 1; stride < blockDim.x; stride = stride * 2)
        {
            if (threadIdx.x % (2 * stride) == 0)
            {
                local_arr[index] += local_arr[index + stride];
            }
            __syncthreads();
        }
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}
Suppose a block contains 64 threads and is assigned a segment of 64 data elements. The stride is initialized to 1.
- Step 1: threads with even threadIdx.x perform additions while the odd-numbered threads wait, so warp divergence occurs. With a stride of 1 the pairs added are neighbours, i.e. (0,1), (2,3), (4,5), ..., and each sum is stored at the even-numbered position. The stride then doubles to 2.
- Step 2: threads whose threadIdx.x is a multiple of 4 participate while the other threads idle. With a stride of 2 the pairs are (0,2), (4,6), (8,10), ..., the results are stored at the left operand, and the stride doubles to 4.
- ...
- Last step: the stride has reached 32 and only threads whose threadIdx.x is a multiple of 64 participate (here just thread 0), so a single addition (0,32) remains. The result lands in thread 0's element, the stride doubles to 64, exceeds blockDim.x, and the loop exits.
- Finally the block's sum is written to the element of out that corresponds to this block.
Performance analysis
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,inst_per_warp,warp_execution_efficiency ./test 1024
==31073== NVPROF is profiling process 31073, command: ./test 1024
CPU result = 16384
==31073== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "reduceNeighbored1(int*, int*, int)" (done)
GPU result = 16384
==31073== Profiling application: ./test 1024
==31073== Profiling result:
==31073== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: reduceNeighbored1(int*, int*, int)
1 gld_transactions Global Load Transactions 6128 6128 6128
1 gst_transactions Global Store Transactions 3072 3072 3072
1 gld_efficiency Global Memory Load Efficiency 25.01% 25.01% 25.01%
1 gst_efficiency Global Memory Store Efficiency 25.00% 25.00% 25.00%
1 inst_per_warp Instructions per warp 974.218750 974.218750 974.218750
1 warp_execution_efficiency Warp Execution Efficiency 71.95% 71.95% 71.95%
Ideas for improvement
The root cause is that the threads of a block are bound one-to-one to the elements of arr, and this mapping stays fixed for the whole lifetime of the kernel. Instead, we can remap threads to data at every step, so that the threads still doing additions are packed into the same warps and the warps with nothing left to do can retire.
Improvement 1
__global__ void reduceNeighbored2(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x; //- start of the data handled by this block
    for (int stride = 1; stride < blockDim.x; stride = stride * 2)
    {
        int index = threadIdx.x * stride * 2;
        if (index < blockDim.x) //- the index must stay within the block's data range
        {
            local_arr[index] += local_arr[index + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}
The key feature of this kernel is that the mapping between threadIdx.x and the data index changes as the reduction deepens, keeping the threads that still perform additions packed into the same warps, which avoids warp divergence to a large extent.
Performance analysis
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,inst_per_warp,warp_execution_efficiency ./test 1024
==30261== NVPROF is profiling process 30261, command: ./test 1024
CPU result = 16384
==30261== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "reduceNeighbored2(int*, int*, int)" (done)
GPU result = 16384
==30261== Profiling application: ./test 1024
==30261== Profiling result:
==30261== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: reduceNeighbored2(int*, int*, int)
1 gld_transactions Global Load Transactions 6128 6128 6128
1 gst_transactions Global Store Transactions 3072 3072 3072
1 gld_efficiency Global Memory Load Efficiency 25.01% 25.01% 25.01%
1 gst_efficiency Global Memory Store Efficiency 25.00% 25.00% 25.00%
1 inst_per_warp Instructions per warp 366.812500 366.812500 366.812500
1 warp_execution_efficiency Warp Execution Efficiency 98.14% 98.14% 98.14%
Warp execution efficiency improves markedly. However, once blockDim.x/(2*stride) = 32, the threads that are still computing have all converged into a single warp, and continuing the strided loop inside that warp can no longer avoid divergence. The textbook unrolls this last warp into explicit steps:
// unrolling warp
if (tid < 32)
{
    //- volatile forces every read and write to go to memory, so each line
    //- sees the values written by the previous one without an explicit barrier
    //- (this relies on the warp executing in lock step).
    volatile int *vmem = idata;
    vmem[tid] += vmem[tid + 32];
    vmem[tid] += vmem[tid + 16];
    vmem[tid] += vmem[tid + 8];
    vmem[tid] += vmem[tid + 4];
    vmem[tid] += vmem[tid + 2];
    vmem[tid] += vmem[tid + 1];
}
Although the divergence is gone in the formal sense, most of the additions performed by the threads of this warp are useless, which is essentially the same waste that the divergence caused.
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,inst_per_warp,warp_execution_efficiency ./test 1024
==48149== NVPROF is profiling process 48149, command: ./test 1024
CPU result = 16384
gridDim:(16,1,1) blockDim:(1024,1,1)
==48149== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "reduceNeighbored3(int*, int*, int)" (done)
GPU result = 16384
==48149== Profiling application: ./test 1024
==48149== Profiling result:
==48149== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: reduceNeighbored3(int*, int*, int)
1 gld_transactions Global Load Transactions 7200 7200 7200
1 gst_transactions Global Store Transactions 3600 3600 3600
1 gld_efficiency Global Memory Load Efficiency 22.49% 22.49% 22.49%
1 gst_efficiency Global Memory Store Efficiency 22.48% 22.48% 22.48%
1 inst_per_warp Instructions per warp 545.750000 545.750000 545.750000
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
With this strategy the warp execution efficiency does reach 100%, but the number of instructions executed per warp goes up, which is divergence in disguise (the extra instructions do no useful work).
Ideas for improvement
Looking at the memory statistics of the kernels above, the GMEM load and store efficiencies are both quite low. Examining the global indices touched by the 32 threads of one warp shows that, with this neighboured-pair scheme, the GMEM addresses a warp loads are not contiguous. Take reduceNeighbored2 as an example, where int index = threadIdx.x * stride * 2;
- When stride = 1, the data indices of the 32 threads in a warp are 2 apart, so one 128-byte load transaction serves only 16 threads' requests. Each addition loads two operands A and B, so $\frac{16384}{2}\times 2$ values are loaded in total; each operand stream needs $\frac{16384}{2\times16}=512$ transactions, giving 1024 load transactions for this step.
- When stride = 2, one load transaction covers only 8 useful int requests; $\frac{16384}{4}\times 2$ ints must be loaded, each stream needing $\frac{16384}{4\times8}=512$ transactions, so this step needs 1024 load transactions.
- When stride = 4, one load transaction covers only 4 useful int requests; $\frac{16384}{8}\times 2$ ints must be loaded, each stream needing $\frac{16384}{8\times4}=512$ transactions, so this step needs 1024 load transactions.
- When stride = 8, one load transaction covers only 2 useful int requests; $\frac{16384}{16}\times 2$ ints must be loaded, each stream needing $\frac{16384}{16\times2}=512$ transactions, so this step needs 1024 load transactions.
- When stride = 16, one load transaction covers only 1 useful int request; $\frac{16384}{32}\times 2$ ints must be loaded, each stream needing $\frac{16384}{32\times1}=512$ transactions, so this step needs 1024 load transactions.
- When stride = 32, one load transaction covers only 1 useful int request; $\frac{16384}{64}\times 2$ ints must be loaded, each stream needing $\frac{16384}{64\times1}=256$ transactions, so this step needs 512 load transactions.
- When stride = 64, one load transaction covers only 1 useful int request; $\frac{16384}{128}\times 2$ ints must be loaded, each stream needing $\frac{16384}{128\times1}=128$ transactions, so this step needs 256 load transactions.
- When stride = 128, one load transaction covers only 1 useful int request; $\frac{16384}{256}\times 2$ ints must be loaded, each stream needing $\frac{16384}{256\times1}=64$ transactions, so this step needs 128 load transactions.
- When stride = 256, one load transaction covers only 1 useful int request; $\frac{16384}{512}\times 2$ ints must be loaded, each stream needing $\frac{16384}{512\times1}=32$ transactions, so this step needs 64 load transactions.
- When stride = 512, one load transaction covers only 1 useful int request; $\frac{16384}{1024}\times 2$ ints must be loaded, each stream needing $\frac{16384}{1024\times1}=16$ transactions, so this step needs 32 load transactions.
- In the final step, the thread with threadIdx.x = 0 of each block reads its block's sum and writes it to the out array; each block's request clearly needs its own transaction, and with 16 blocks that is 16 transactions.
Adding up all of the load transactions above gives a total of 6128 128-byte load transactions. This also shows that in a divergent warp the disabled threads do not issue loads: even though the hardware nominally has them execute the same instruction, each thread's operands are private, so masked-off threads fetch nothing. The amount of data the whole reduction actually needs is
\begin{equation}
(16384+8192+4096+2048+1024+512+256+128+64+32+16)\times 4\,\text{byte} = 131008\,\text{byte},
\end{equation}
whereas 6128 128-byte load transactions move 784384 bytes in total, giving a load efficiency of about 16.7%, which does not match the 25.01% reported by nvprof.
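For reference, the per-step counts and the efficiency figure quoted above work out as:
\begin{equation}
5 \times 1024 + 512 + 256 + 128 + 64 + 32 + 16 = 6128
\end{equation}
\begin{equation}
\frac{131008\,\text{byte}}{6128 \times 128\,\text{byte}} \approx 16.7\%
\end{equation}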
In short, this strided access pattern defeats GMEM coalescing, so a different organization is needed. The scheme above is called the neighboured-pair approach; the next one is the interleaved-pair approach.
Reduction with interleaved pairs
- The mapping between thread index and GMEM index is still dynamic.
- But the GMEM addresses mapped to the threads of one warp should be as contiguous as possible.
Implementation
__global__ void reduceInterleaved1(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x;
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            local_arr[threadIdx.x] += local_arr[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}
Execution results
[mmhe@k231 chapter3]$ nvprof --metrics gld_transactions,gst_transactions,gld_efficiency,gst_efficiency,inst_per_warp,warp_execution_efficiency ./test 1024
==17732== NVPROF is profiling process 17732, command: ./test 1024
CPU result = 16384
gridDim:(16,1,1) blockDim:(1024,1,1)
==17732== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "reduceInterleaved1(int*, int*, int)" (done)
GPU result = 16384
==17732== Profiling application: ./test 1024
==17732== Profiling result:
==17732== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: reduceInterleaved1(int*, int*, int)
1 gld_transactions Global Load Transactions 1168 1168 1168
1 gst_transactions Global Store Transactions 592 592 592
1 gld_efficiency Global Memory Load Efficiency 98.04% 98.04% 98.04%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
1 inst_per_warp Instructions per warp 345.062500 345.062500 345.062500
1 warp_execution_efficiency Warp Execution Efficiency 98.24% 98.24% 98.24%
Performance analysis
- Memory load analysis (blockDim.x = 1024, 16 blocks):
  - When stride = 512, each warp loads the two operand groups A and B; each group of 32 consecutive ints is served by one 128-byte load transaction, so 2 transactions per warp. A block has 16 active warps, i.e. 32 transactions, and the kernel's 16 blocks need 512 transactions for this step.
  - When stride = 256, 8 warps per block are active, 16 transactions per block, 256 in total.
  - When stride = 128, 128 transactions in total.
  - When stride = 64, 64 in total.
  - When stride = 32, 32 in total.
  - At this point all of a block's partial sums have gathered in its first warp; the remaining steps run inside that single warp and cause divergence.
  - The step with stride = 16 needs $2 \times 16 = 32$ load transactions,
  - stride = 8 needs $2 \times 16 = 32$,
  - stride = 4 needs $2 \times 16 = 32$,
  - stride = 2 needs $2 \times 16 = 32$,
  - stride = 1 needs $2 \times 16 = 32$.
  - Writing out the per-block sums at the end needs 16 more.
  - Total: $512 + 256 + 128 + 64 + 32 + 32 \times 5 + 16 = 1168$, which matches the nvprof result.
- Instruction execution analysis
  - This kernel also diverges while reducing the last warp, just like the previous kernels. I have not unrolled it here, for the same reason as before (a sketch of what the unrolled version could look like follows this list).
  - Note that none of these programs is particularly robust; they only serve to illustrate the basic analysis.
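For reference, if one did want to unroll that last warp here, the interleaved layout is exactly the situation the book's volatile trick targets. The following is a sketch, not one of the kernels measured in these notes; it assumes blockDim.x is a power of two of at least 64, and it relies on implicit warp-synchronous execution, which holds on the Kepler K80 used here but would need __syncwarp() on newer architectures:

__global__ void reduceInterleavedUnroll(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x;
    //- Stop the strided loop once 64 partial sums are left; they fit within one warp's reach.
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            local_arr[threadIdx.x] += local_arr[threadIdx.x + stride];
        }
        __syncthreads();
    }
    //- Unrolled tail: the 32 threads of the first warp finish the last 64 elements.
    //- volatile forces every read and write to memory, so each line sees the previous line's results.
    if (threadIdx.x < 32)
    {
        volatile int *vmem = local_arr;
        vmem[threadIdx.x] += vmem[threadIdx.x + 32];
        vmem[threadIdx.x] += vmem[threadIdx.x + 16];
        vmem[threadIdx.x] += vmem[threadIdx.x + 8];
        vmem[threadIdx.x] += vmem[threadIdx.x + 4];
        vmem[threadIdx.x] += vmem[threadIdx.x + 2];
        vmem[threadIdx.x] += vmem[threadIdx.x + 1];
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}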
One thread handling multiple elements
So far the threads of a block handled blockDim.x elements; the improvement in this part is to let each block handle several multiples of blockDim.x.
One block handling 1 blockDim of data
This kernel serves as the performance baseline:
__global__ void reduceInterleaved1(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x;
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            local_arr[threadIdx.x] += local_arr[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}
One block handling 2 blockDim of data
__global__ void reduceInterleaved2(int *arr, int *out, int nElem)
{
    int *local_arr = arr + blockIdx.x * blockDim.x * 2;
    //- Key point: first add the element one blockDim away, folding two blocks'
    //- worth of data into this block before the reduction loop runs.
    local_arr[threadIdx.x] += local_arr[threadIdx.x + blockDim.x];
    __syncthreads(); //- make the folded values visible before the strided loop reads them
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            local_arr[threadIdx.x] += local_arr[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}
One block handling 4 or 8 blockDim of data
The kernels follow the same pattern, so they are not repeated here; a sketch of the 4x version is given below.
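As an illustration, here is a sketch of the 4x version, following the same pattern as reduceInterleaved2 above (the kernel name matches the profiled one, but the exact body, and the assumption that the host launches nElem / (4 * blockDim.x) blocks, are mine):

__global__ void reduceInterleaved4(int *arr, int *out, int nElem)
{
    //- Each block now owns 4 * blockDim.x consecutive elements.
    int *local_arr = arr + blockIdx.x * blockDim.x * 4;
    int tid = threadIdx.x;
    //- First fold the four segments into the first blockDim.x elements.
    local_arr[tid] += local_arr[tid + blockDim.x]
                    + local_arr[tid + 2 * blockDim.x]
                    + local_arr[tid + 3 * blockDim.x];
    __syncthreads();
    //- Then reduce the remaining blockDim.x elements exactly as before.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            local_arr[tid] += local_arr[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0)
    {
        out[blockIdx.x] = local_arr[0];
    }
}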
Summary of results
The array size is $2^{14} = 16384$:
==40306== Profiling application: ./test 1024
==40306== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 30.76% 65.664us 4 16.416us 16.352us 16.480us [CUDA memcpy HtoD]
18.34% 39.135us 1 39.135us 39.135us 39.135us reduceInterleaved8(int*, int*, int)
16.27% 34.720us 1 34.720us 34.720us 34.720us reduceInterleaved4(int*, int*, int)
14.96% 31.936us 1 31.936us 31.936us 31.936us reduceInterleaved1(int*, int*, int)
14.86% 31.711us 1 31.711us 31.711us 31.711us reduceInterleaved2(int*, int*, int)
4.81% 10.272us 4 2.5680us 2.3680us 2.8480us [CUDA memcpy DtoH]
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (11)"
Kernel: reduceInterleaved1(int*, int*, int)
1 gld_transactions Global Load Transactions 1168 1168 1168
1 gst_transactions Global Store Transactions 592 592 592
1 gld_efficiency Global Memory Load Efficiency 98.04% 98.04% 98.04%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
1 inst_per_warp Instructions per warp 345.062500 345.062500 345.062500
1 warp_execution_efficiency Warp Execution Efficiency 98.24% 98.24% 98.24%
1 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sm_efficiency Multiprocessor Activity 75.65% 75.65% 75.65%
Kernel: reduceInterleaved2(int*, int*, int)
1 gld_transactions Global Load Transactions 1096 1096 1096
1 gst_transactions Global Store Transactions 552 552 552
1 gld_efficiency Global Memory Load Efficiency 99.01% 99.01% 99.01%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 394.062500 394.062500 394.062500
1 warp_execution_efficiency Warp Execution Efficiency 98.46% 98.46% 98.46%
1 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sm_efficiency Multiprocessor Activity 49.11% 49.11% 49.11%
Kernel: reduceInterleaved4(int*, int*, int)
1 gld_transactions Global Load Transactions 804 804 804
1 gst_transactions Global Store Transactions 276 276 276
1 gld_efficiency Global Memory Load Efficiency 99.34% 99.34% 99.34%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 456.031250 456.031250 456.031250
1 warp_execution_efficiency Warp Execution Efficiency 98.67% 98.67% 98.67%
1 dram_read_throughput Device Memory Read Throughput 49.032MB/s 49.032MB/s 49.032MB/s
1 sm_efficiency Multiprocessor Activity 24.74% 24.74% 24.74%
Kernel: reduceInterleaved8(int*, int*, int)
1 gld_transactions Global Load Transactions 658 658 658
1 gst_transactions Global Store Transactions 138 138 138
1 gld_efficiency Global Memory Load Efficiency 99.60% 99.60% 99.60%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 552.031250 552.031250 552.031250
1 warp_execution_efficiency Warp Execution Efficiency 98.90% 98.90% 98.90%
1 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sm_efficiency Multiprocessor Activity 12.76% 12.76% 12.76%
Trends:
- Kernel time: the more memory operations are unrolled, the longer the kernel runs;
- Load transactions: the more unrolling, the fewer load transactions;
- Store transactions: with 4x and 8x unrolling the store transactions also drop.
Analysis:
- Load transactions: the count for the kernel without unrolling was worked out earlier. For the 4x-unrolled kernel with 16384 elements there are 4 blocks. Each block first folds four blockDim-sized segments into its own segment, which reads every element once; one warp load costs one transaction, so this stage costs $4 \times 32 \times 4 = 512$ load transactions. In the reduction that follows, at stride = 512 only half of each block's warps issue loads, two coalesced elements (two transactions) per warp, so the 4 blocks need 128 transactions; at stride = 256 this shrinks to 64, at stride = 128 to 32, at stride = 64 to 16, and at stride = 32 to 8; after that warp divergence sets in, and each of the steps stride = 16, 8, 4, 2, 1 still needs 8; finally writing out the block sums needs 4 more. The total is 804 load transactions, the same as nvprof reports (the full sums are written out after this list). The reduction in load transactions comes mainly from (1) fewer blocks, hence fewer reduction steps overall and fewer repeated loads of partial sums, and (2) fewer blocks, hence fewer transactions for the final write-out.
- Store transactions: again taking reduceInterleaved4 as the example, folding the data into the first quarter of the array costs 128 4-sector 32-byte store transactions. In the reduction, at stride = 512 half of each block's warps store, once per thread, i.e. one 4-sector 32-byte transaction per warp, so the 4 blocks need 64; at stride = 256, 32; at stride = 128, 16; at stride = 64, 8; at stride = 32, 4. Then divergence sets in, but store transactions also come in 2-sector and 1-sector variants: stride = 16 needs 4 2-sector transactions; stride = 8 needs 4 1-sector transactions; stride = 4, 2 and 1 each need 4 1-sector transactions; the final write-out of the block sums costs another 4 1-sector transactions. The total is 276 store transactions, matching nvprof, and the trend is the same as for the loads.
- sm_efficiency: this metric reflects how many SMs actually get work relative to the total number of SMs. reduceInterleaved2 launches only 8 blocks, so only 8 SMs are used, one block each, i.e. 32 warps per SM, only half of the SM's maximum resident warps. Once a warp stalls on a memory operation there may be no eligible warp to switch to and the SM sits idle; indeed, although the 2x, 4x and 8x kernels report different sm_efficiency values, their eligible_warps_per_cycle values are all similar and all rather small.
- None of this, however, explains why in this small test the kernel actually gets slower as the degree of unrolling increases.
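For reference, the reduceInterleaved4 transaction totals quoted above add up as follows (loads first, then stores):
\begin{equation}
512 + 128 + 64 + 32 + 16 + 8 + 5 \times 8 + 4 = 804
\end{equation}
\begin{equation}
128 + 64 + 32 + 16 + 8 + 4 + 4 + 4 + 3 \times 4 + 4 = 276
\end{equation}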
Enlarging the test array
Because the number of blocks was too small, the results above look somewhat anomalous. This time the array size is set to $2^{30} = 1073741824$ and the test is repeated:
==22483== Profiling result:
==22483== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (11)"
Kernel: reduceInterleaved1(int*, int*, int)
1 gld_transactions Global Load Transactions 76546048 76546048 76546048
1 gst_transactions Global Store Transactions 38797312 38797312 38797312
1 gld_efficiency Global Memory Load Efficiency 98.04% 98.04% 98.04%
1 gst_efficiency Global Memory Store Efficiency 97.71% 97.71% 97.71%
1 inst_per_warp Instructions per warp 345.062500 345.062500 345.062500
1 warp_execution_efficiency Warp Execution Efficiency 97.95% 97.95% 97.95%
1 dram_read_throughput Device Memory Read Throughput 4.1380GB/s 4.1380GB/s 4.1380GB/s
1 sm_efficiency Multiprocessor Activity 99.49% 99.49% 99.49%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.847439 1.847439 1.847439
Kernel: reduceInterleaved2(int*, int*, int)
1 gld_transactions Global Load Transactions 71827456 71827456 71827456
1 gst_transactions Global Store Transactions 36175872 36175872 36175872
1 gld_efficiency Global Memory Load Efficiency 99.01% 99.01% 99.01%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 394.062500 394.062500 394.062500
1 warp_execution_efficiency Warp Execution Efficiency 98.20% 98.20% 98.20%
1 dram_read_throughput Device Memory Read Throughput 7.7045GB/s 7.7045GB/s 7.7045GB/s
1 sm_efficiency Multiprocessor Activity 99.39% 99.39% 99.39%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.971404 1.971404 1.971404
Kernel: reduceInterleaved4(int*, int*, int)
1 gld_transactions Global Load Transactions 52690944 52690944 52690944
1 gst_transactions Global Store Transactions 18087936 18087936 18087936
1 gld_efficiency Global Memory Load Efficiency 99.34% 99.34% 99.34%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 456.031250 456.031250 456.031250
1 warp_execution_efficiency Warp Execution Efficiency 98.45% 98.45% 98.45%
1 dram_read_throughput Device Memory Read Throughput 13.887GB/s 13.887GB/s 13.887GB/s
1 sm_efficiency Multiprocessor Activity 99.41% 99.41% 99.41%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 2.019229 2.019229 2.019229
Kernel: reduceInterleaved8(int*, int*, int)
1 gld_transactions Global Load Transactions 43122688 43122688 43122688
1 gst_transactions Global Store Transactions 9043968 9043968 9043968
1 gld_efficiency Global Memory Load Efficiency 99.60% 99.60% 99.60%
1 gst_efficiency Global Memory Store Efficiency 98.84% 98.84% 98.84%
1 inst_per_warp Instructions per warp 552.031250 552.031250 552.031250
1 warp_execution_efficiency Warp Execution Efficiency 98.72% 98.72% 98.72%
1 dram_read_throughput Device Memory Read Throughput 24.007GB/s 24.007GB/s 24.007GB/s
1 sm_efficiency Multiprocessor Activity 99.44% 99.44% 99.44%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 2.132739 2.132739 2.132739
This example shows that once there are enough blocks, the more memory operations are unrolled, the higher the device read throughput. I believe this is because there are then enough warps to hide the latency of the memory operations, and the unrolling additionally reduces the number of load and store transactions.
The book's argument about adding independent instructions still puzzles me a little.
Dynamic parallelism
A kernel can itself configure and launch grids, which makes recursion possible. A simple example follows.
A simple example
#include <stdlib.h>
#include <stdio.h>

__global__ void nestedHelloWorld(int current_depth, int max_depth)
{
    if (current_depth == max_depth) return;
    if (threadIdx.x == 0)
    {
        dim3 block(3);
        dim3 grid(2);
        nestedHelloWorld<<<grid, block>>>(current_depth + 1, max_depth);
        printf("current_depth=%d blockIdx.x=%d threadIdx.x=%d\n", current_depth, blockIdx.x, threadIdx.x);
    }
}

int main(int argc, char **argv)
{
    dim3 block(3);
    dim3 grid(2);
    nestedHelloWorld<<<grid, block>>>(0, 3);
    cudaDeviceSynchronize();
    return 0;
}
Compilation needs the extra option -rdc=true, which forces generation of relocatable device code, a requirement for dynamic parallelism.
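A plausible compile line (the source file name, output name and -arch value are my assumptions; dynamic parallelism needs compute capability 3.5 or higher and links against the device runtime):

nvcc -rdc=true -arch=sm_35 nestedHelloWorld.cu -o test -lcudadevrt

Running ./test then prints: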
current_depth=0 blockIdx.x=0 threadIdx.x=0
current_depth=0 blockIdx.x=1 threadIdx.x=0
current_depth=1 blockIdx.x=1 threadIdx.x=0
current_depth=1 blockIdx.x=0 threadIdx.x=0
current_depth=1 blockIdx.x=1 threadIdx.x=0
current_depth=1 blockIdx.x=0 threadIdx.x=0
current_depth=2 blockIdx.x=1 threadIdx.x=0
current_depth=2 blockIdx.x=0 threadIdx.x=0
current_depth=2 blockIdx.x=1 threadIdx.x=0
current_depth=2 blockIdx.x=0 threadIdx.x=0
current_depth=2 blockIdx.x=1 threadIdx.x=0
current_depth=2 blockIdx.x=0 threadIdx.x=0
current_depth=2 blockIdx.x=0 threadIdx.x=0
current_depth=2 blockIdx.x=1 threadIdx.x=0
Execution analysis
The host launches 2 blocks of 3 threads each, with a maximum recursion depth of 3. Inside the kernel, the thread with threadIdx.x == 0 of every block launches a child grid (again 2 blocks of 3 threads) and then prints a message. Each level therefore doubles the number of blocks: 2 blocks print at depth 0, 4 at depth 1 and 8 at depth 2, while the depth-3 launches hit max_depth and return immediately, giving the 2 + 4 + 8 = 14 lines shown above.