nvprof: metrics

UCAS_HMM


Article link: https://blog.csdn.net/ucas_hmm/article/details/127488054


Metrics

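nvprof exposes a large number of metrics, and these metrics are also the data foundation for later optimization tools such as nvvp and Nsight Compute.
Following nvvp's classification, all metrics fall into five categories: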

Memory
Instruction
Multiprocessor
Cache
Texture

The rest of this post tests these metrics in detail.

Memory

global_load_requests, global_store_requests

__global__ void kernel1(int* arr)
{
    int a = arr[threadIdx.x];
    int b = arr[(threadIdx.x + 2) % 32];
    arr[threadIdx.x] = a + b;
}
......
kernel1<<<1, 128>>>(d_A);

Invocations                               Metric Name                                          Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                      global_load_requests    Total number of global load requests from Multiprocessor           8           8           8
          1                     global_store_requests   Total number of global store requests from Multiprocessor           4           4           4

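Because of the SIMT design, all threads in a warp execute the same instruction, so their memory accesses for that instruction are counted as a single request. The launch above uses 128 threads, i.e. 4 warps, and each thread performs two loads and one store, which gives 4 × 2 = 8 load requests and 4 × 1 = 4 store requests.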
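The counting rule above can be sketched in plain host-side C++ (which also compiles under nvcc). This is only a model of the arithmetic, not of how nvprof collects the counter; the function names are hypothetical, and a warp size of 32 is assumed.

```cpp
// Number of warps needed to hold `threads_per_block` threads (rounded up).
// Assumption: warp size is 32, as on the V100 used in this post.
constexpr int warps_per_block(int threads_per_block, int warp_size = 32) {
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Expected request count under the rule "one request per warp per memory
// instruction". `instrs_per_thread` is the number of load (or store)
// instructions each thread executes.
constexpr int expected_requests(int blocks, int threads_per_block,
                                int instrs_per_thread) {
    return blocks * warps_per_block(threads_per_block) * instrs_per_thread;
}
```

For the launch `kernel1<<<1, 128>>>`, `expected_requests(1, 128, 2)` models the two loads per thread and `expected_requests(1, 128, 1)` the single store, matching the 8 and 4 reported by nvprof.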

dram_read_bytes, dram_write_bytes

__global__ void kernel1(int* arr)
{
    arr[threadIdx.x]++;
}
kernel1<<<1, 128>>>(d_A);

Invocations                               Metric Name                          Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                           dram_read_bytes      Total bytes read from DRAM to L2 cache        1888        1888        1888
          1                          dram_write_bytes   Total bytes written from L2 cache to DRAM        2560        2560        2560

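dram_read_bytes is the number of bytes read from DRAM into the L2 cache. DRAM is a physical concept and usually refers to device memory; in the programming model, local memory, global memory, constant memory and texture memory all reside in this device memory. This kernel reads 128 × 4 = 512 bytes from global memory in total, which is far less than the reported 1888 bytes.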
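To make the gap concrete, the following host-side C++ sketch compares the bytes the kernel logically touches with the values nvprof reported. It assumes a 4-byte `int`; the helper names are hypothetical, and the code does not explain where the extra traffic comes from.

```cpp
#include <cstddef>

// Bytes the kernel logically accesses: one element per thread.
constexpr std::size_t logical_bytes(std::size_t threads, std::size_t elem_size) {
    return threads * elem_size;
}

// Ratio of the DRAM bytes reported by nvprof to the logical byte count.
constexpr double dram_ratio(std::size_t measured, std::size_t logical) {
    return static_cast<double>(measured) / static_cast<double>(logical);
}
```

With 128 threads and 4-byte elements, `logical_bytes(128, 4)` is 512, so `dram_ratio(1888, 512)` and `dram_ratio(2560, 512)` express how much larger the measured read and write traffic is than the kernel's own accesses.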

Instruction

warp_execution_efficiency

$$\text{warp execution efficiency} = \frac{\text{active threads}}{\text{warpSize}}$$

    kernel1<<<1, 24>>>(d_A);
    kernel2<<<1, 16>>>(d_A);
    kernel3<<<1, 12>>>(d_A);

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      75.00%      75.00%      75.00%
    Kernel: kernel2(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      50.00%      50.00%      50.00%
    Kernel: kernel3(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      37.50%      37.50%      37.50%

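warpSize is normally 32. When a kernel is launched with a block that does not fill a whole warp, the warp is padded automatically, and the padded threads are marked inactive. The lower the share of inactive threads, the higher the efficiency:

kernel1: $\frac{24}{32} = 75\%$
kernel2: $\frac{16}{32} = 50\%$
kernel3: $\frac{12}{32} = 37.5\%$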

In general, as long as you keep warpSize in mind when choosing the launch configuration, this metric will not hurt performance.
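The three results above follow from padding alone, which can be sketched in host-side C++ as below. The function name is hypothetical and a warp size of 32 is assumed; the sketch models only the padding of a partially filled warp at launch, not inactive lanes caused by divergent branches inside a kernel.

```cpp
// Fraction of lanes that are active when a block of `threads_per_block`
// threads is padded up to a whole number of warps.
// Assumption: warp size is 32.
constexpr double launch_warp_efficiency(int threads_per_block,
                                        int warp_size = 32) {
    // Lanes actually allocated: number of warps (rounded up) times warp size.
    const int padded_lanes =
        (threads_per_block + warp_size - 1) / warp_size * warp_size;
    return static_cast<double>(threads_per_block) / padded_lanes;
}
```

Block sizes of 24, 16 and 12 give 0.75, 0.5 and 0.375, in line with the nvprof output, while any multiple of 32 gives 1.0.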

branch_efficiency

$$\text{branch efficiency} = \frac{\text{branches} - \text{divergent branches}}{\text{branches}}$$

__global__ void kernel1(int* arr)
{
    if(threadIdx.x % 2 == 0)
        arr[threadIdx.x] = 0;
    else
        arr[threadIdx.x] = 1;
}

[mmhe@k057 Test]$ nvcc -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57249== NVPROF is profiling process 57249, command: ./test
==57249== Profiling application: ./test
==57249== Profiling result:
==57249== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%
[mmhe@k057 Test]$ nvcc -g -G -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57328== NVPROF is profiling process 57328, command: ./test
==57328== Profiling application: ./test
==57328== Profiling result:
==57328== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency      83.33%      83.33%      83.33%

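As the output shows, nvcc automatically optimizes away this simple branch divergence, giving 100% branch efficiency. With the -g -G compile flags added, nvcc still retains some optimization, so the branch efficiency (83.33%) remains above 50%.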
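As a rough illustration of the formula and of what makes a branch divergent, here is a host-side C++ sketch. It is a simplified model with hypothetical names: a warp counts as divergent on a branch when its lanes disagree on the condition. It assumes a warp size of 32 and does not model compiler transformations such as predication, which is why it cannot reproduce the measured values above.

```cpp
// Branch efficiency as defined above; counts are supplied by the caller.
double branch_efficiency(long branches, long divergent_branches) {
    return static_cast<double>(branches - divergent_branches) /
           static_cast<double>(branches);
}

// Example conditions, evaluated per thread index.
bool even_thread(int tid) { return tid % 2 == 0; }       // as in kernel1
bool even_warp(int tid) { return (tid / 32) % 2 == 0; }  // warp-aligned variant

// Count the warps whose lanes disagree on `taken` (simplified model).
template <typename Pred>
int divergent_warps(int threads, Pred taken, int warp_size = 32) {
    int count = 0;
    for (int base = 0; base < threads; base += warp_size) {
        bool any_taken = false, any_not_taken = false;
        for (int t = base; t < threads && t < base + warp_size; ++t) {
            if (taken(t)) any_taken = true; else any_not_taken = true;
        }
        if (any_taken && any_not_taken) ++count;
    }
    return count;
}
```

In this model, `divergent_warps(128, even_thread)` reports every warp as divergent, whereas `divergent_warps(128, even_warp)` reports none, because all lanes of a warp then take the same path. The 128-thread block size is an assumption for illustration; the post does not give the launch configuration used for this kernel.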

Multiprocessor

sm_efficiency

NVIDIA TESLA V100 GPU Architecture whitepaper

achieved_occupancy

According to the Compute Capabilities table, the maximum number of resident threads per SM is 2048 and the maximum number of resident warps per SM is 64. Occupancy is defined as:
$$\text{occupancy} = \frac{\text{active warps}}{\text{maximum warps}}$$
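
The definition can be evaluated with a small host-side C++ sketch. The limit of 64 resident warps per SM is taken from the text above; the function names are hypothetical, and the result is a static ratio for a given number of active warps, whereas achieved_occupancy as reported by nvprof is measured while the kernel runs.

```cpp
// Warps needed to hold `threads` threads, rounded up.
// Assumption: warp size is 32.
constexpr int warps_for_threads(int threads, int warp_size = 32) {
    return (threads + warp_size - 1) / warp_size;
}

// Occupancy = active warps / maximum resident warps per SM.
// The default of 64 is the per-SM limit quoted above.
constexpr double occupancy(int active_warps, int max_warps_per_sm = 64) {
    return static_cast<double>(active_warps) / max_warps_per_sm;
}
```

For example, 2048 resident threads correspond to 64 warps and an occupancy of 1.0, while a single 128-thread block alone on an SM corresponds to 4 warps, or 0.0625.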