nvprof: metrics

Metrics

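nvprof exposes a large number of metrics, and these metrics are also the data foundation for later optimization tools such as nvvp and Nsight Compute.
Following the categories used by nvvp, all metrics fall into five groups: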

  • Memory
  • Instruction
  • Multiprocessor
  • Cache
  • Texture

The rest of this post tests these metrics in detail.

Memory

global_load_requests, global_store_requests

__global__ void kernel1(int* arr)
{
    int a = arr[threadIdx.x];
    int b = arr[(threadIdx.x + 2) % 32];
    arr[threadIdx.x] = a + b;
}
......
kernel1<<<1, 128>>>(d_A);
Invocations                               Metric Name                                          Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                      global_load_requests    Total number of global load requests from Multiprocessor           8           8           8
          1                     global_store_requests   Total number of global store requests from Multiprocessor           4           4           4

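Because of the SIMT design, all threads in a warp execute the same instruction, so a memory instruction executed by one warp counts as a single request. This launch creates 4 warps (128 threads / 32), and each thread performs two loads and one store, which gives 8 load requests and 4 store requests in total.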
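The counting rule above can be checked with a few lines of host-side arithmetic. This is only a sketch: the helper names `warps_for` and `expected_requests` are made up for illustration, and it assumes every load or store instruction executed by a warp is counted as exactly one request.

```cpp
// Host-side arithmetic only; no GPU is needed, and the code is also valid in a .cu file.
// Assumption: one memory instruction executed by one warp = one request.

constexpr int kWarpSize = 32;  // warp size assumed throughout this post

// Number of warps needed to hold `threads` threads (rounded up).
constexpr int warps_for(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Expected warp-level requests when every thread executes
// `instructions_per_thread` memory instructions of one kind.
constexpr int expected_requests(int blocks, int threads_per_block,
                                int instructions_per_thread) {
    return blocks * warps_for(threads_per_block) * instructions_per_thread;
}
```

For `kernel1<<<1, 128>>>`, `expected_requests(1, 128, 2)` gives the load count and `expected_requests(1, 128, 1)` the store count, which can be compared against the profiler table.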

dram_read_bytes, dram_write_bytes

__global__ void kernel1(int* arr)
{
    arr[threadIdx.x]++;
}
kernel1<<<1, 128>>>(d_A);
Invocations                               Metric Name                          Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                           dram_read_bytes      Total bytes read from DRAM to L2 cache        1888        1888        1888
          1                          dram_write_bytes   Total bytes written from L2 cache to DRAM        2560        2560        2560

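dram_read_bytes measures the number of bytes read from DRAM into the L2 cache. DRAM is a physical concept and usually refers to device memory; at the programming level, local memory, global memory, constant memory, and texture memory all reside in it. This kernel reads only 512 bytes from global memory (128 threads × 4 bytes), which is far less than the reported 1888 bytes.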
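To make the gap concrete, here is a small sketch that computes the bytes the kernel itself asks for and compares them with the counter value. The function names are hypothetical, and the comparison explains nothing about where the extra traffic comes from; it only quantifies the difference.

```cpp
#include <cstddef>

// Bytes the kernel requests when each thread touches one element.
constexpr std::size_t requested_bytes(std::size_t threads,
                                      std::size_t element_size) {
    return threads * element_size;
}

// How many times larger the measured DRAM traffic is than the request.
constexpr double traffic_ratio(std::size_t measured, std::size_t requested) {
    return static_cast<double>(measured) / static_cast<double>(requested);
}
```

With the numbers from this run, `requested_bytes(128, sizeof(int))` is 512 on a platform with a 4-byte `int`, and `traffic_ratio(1888, 512)` shows that the DRAM read counter is several times larger than what the kernel reads.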

Instruction

warp_execution_efficiency

\begin{equation} \text{warp execution efficiency} = \frac{\text{active threads}}{\text{warpSize}} \end{equation}

    kernel1<<<1, 24>>>(d_A);
    kernel2<<<1, 16>>>(d_A);
    kernel3<<<1, 12>>>(d_A);
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      75.00%      75.00%      75.00%
    Kernel: kernel2(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      50.00%      50.00%      50.00%
    Kernel: kernel3(int*)
          1                 warp_execution_efficiency                 Warp Execution Efficiency      37.50%      37.50%      37.50%

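warpSize is normally 32. When a kernel is launched and a warp has fewer than 32 threads, the warp is padded to full size automatically, but the padded threads are marked inactive. The lower the share of inactive threads, the higher the efficiency: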

  • kernel1: $\frac{24}{32} = 75\%$
  • kernel2: $\frac{16}{32} = 50\%$
  • kernel3: $\frac{12}{32} = 37.5\%$

In general, as long as you keep warpSize in mind when choosing the launch configuration, this metric will not hurt performance.
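The three percentages can be reproduced with a short host-side sketch. It models only the padding of a partially filled warp at launch and ignores any divergence inside the kernel; the names `launch_efficiency_percent` and `round_up_to_warp` are illustrative, not part of any CUDA API.

```cpp
constexpr int kWarpSize = 32;  // warp size assumed throughout this post

// Number of warps needed to hold `threads` threads (rounded up).
constexpr int warps_for(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Share of warp lanes occupied by real threads, in percent.
// Models launch padding only, not divergence inside the kernel.
constexpr double launch_efficiency_percent(int threads_per_block) {
    return 100.0 * threads_per_block /
           (warps_for(threads_per_block) * kWarpSize);
}

// Smallest multiple of the warp size that is >= `threads`.
constexpr int round_up_to_warp(int threads) {
    return warps_for(threads) * kWarpSize;
}
```

Calling `launch_efficiency_percent` with 24, 16, and 12 matches the three kernels above, and `round_up_to_warp` shows the block size that would avoid padded lanes, provided the kernel guards against out-of-range indices.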

branch_efficiency

\begin{equation} \text{branch efficiency} = \frac{\text{branches} - \text{divergent branches}}{\text{branches}} \end{equation}

__global__ void kernel1(int* arr)
{
    if(threadIdx.x % 2 == 0)
        arr[threadIdx.x] = 0;
    else
        arr[threadIdx.x] = 1;
}
[mmhe@k057 Test]$ nvcc -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57249== NVPROF is profiling process 57249, command: ./test
==57249== Profiling application: ./test
==57249== Profiling result:
==57249== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%
[mmhe@k057 Test]$ nvcc -g -G -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57328== NVPROF is profiling process 57328, command: ./test
==57328== Profiling application: ./test
==57328== Profiling result:
==57328== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1                         branch_efficiency                         Branch Efficiency      83.33%      83.33%      83.33%

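As the output shows, nvcc automatically optimizes away this kind of simple branch divergence. Even with the -g -G compile flags, nvcc still keeps some optimization, so the branch efficiency stays above 50%.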
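The formula itself is easy to evaluate by hand. The sketch below uses a hypothetical helper, and the counts in the usage note are not taken from the profiler: they are just one combination that happens to reproduce 83.33%. Returning 100% when there are no branches is also an assumption made here to avoid a division by zero.

```cpp
// Branch efficiency in percent: (branches - divergent) / branches.
// Assumption: a kernel without any branch is reported as 100% here.
constexpr double branch_efficiency_percent(long branches,
                                           long divergent_branches) {
    return branches == 0
               ? 100.0
               : 100.0 * (branches - divergent_branches) / branches;
}
```

For example, `branch_efficiency_percent(6, 1)` evaluates to about 83.33, and `branch_efficiency_percent(6, 0)` to 100. Because every executed branch counts toward the denominator, a single divergent branch among several uniform ones still leaves the ratio well above 50%.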

Multiprocessor

sm_efficiency

NVIDIA Tesla V100 GPU Architecture whitepaper

achieved_occupancy

According to the Compute Capabilities table, the maximum number of resident threads per SM is 2048 and the maximum number of resident warps per SM is 64. Occupancy is defined as:
\begin{equation} \text{occupancy} = \frac{\text{active warps}}{\text{maximum warps}} \end{equation}

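Achieved occupancy differs from theoretical occupancy. Theoretical occupancy is the ratio of the warps of the launched kernel to the maximum number of resident warps an SM allows, whereas achieved occupancy is the ratio of the warps that are actually active in each cycle to that same maximum.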
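As a rough illustration of the theoretical side, the sketch below divides a warp count by the 64-warp limit quoted above. It assumes all the warps in question are resident on one SM and that registers and shared memory are not the limiting factor; the names are made up for this example.

```cpp
constexpr int kWarpSize = 32;       // warp size assumed throughout this post
constexpr int kMaxWarpsPerSM = 64;  // limit quoted above for this GPU

// Number of warps needed to hold `threads` threads (rounded up).
constexpr int warps_for(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Occupancy in percent for `active_warps` resident warps on one SM.
// Assumption: register and shared memory usage do not limit residency.
constexpr double occupancy_percent(int active_warps) {
    return 100.0 * active_warps / kMaxWarpsPerSM;
}
```

A single 128-thread block gives `occupancy_percent(warps_for(128))`, that is 4 of 64 warps. In a real program, the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` can report how many blocks of a given kernel fit on one SM, which also accounts for the resource limits ignored here.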

Cache

Texture
