Metrics
nvprof exposes a large number of metrics, and they also form the data foundation for later optimization tools such as nvvp and Nsight Compute.
Following nvvp's classification, the metrics fall into five categories:
Memory
Instruction
Multiprocessor
Cache
Texture
The rest of this post tests these metrics one by one.
Memory
global_load_requests, global_store_requests
__global__ void kernel1(int* arr)
{
    int a = arr[threadIdx.x];             // first global load
    int b = arr[(threadIdx.x + 2) % 32];  // second global load
    arr[threadIdx.x] = a + b;             // one global store
}
......
kernel1<<<1, 128>>>(d_A);
Invocations  Metric Name            Metric Description                                         Min  Max  Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1  global_load_requests   Total number of global load requests from Multiprocessor     8    8    8
          1  global_store_requests  Total number of global store requests from Multiprocessor    4    4    4
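Because of the SIMT design, all threads in a warp execute the same instruction, so their memory accesses for that instruction are counted as a single request. This launch contains 4 warps (128 threads / 32), and each thread performs two loads and one store, which gives 4 × 2 = 8 load requests and 4 × 1 = 4 store requests.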
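The counting rule above can be sketched on the host side. This is only a model of the arithmetic, not anything nvprof provides: the helper names `warps_per_block` and `expected_requests` are hypothetical, and the sketch assumes a warp size of 32 and no divergence, so that every warp issues every memory instruction exactly once.

```cpp
#include <cassert>

// Assumed warp width (32 on the V100 used in this post).
constexpr int kWarpSize = 32;

// Number of warps needed to cover one block; a partial warp still counts as one.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Expected request count under the model "one request per warp per
// memory instruction". Hypothetical helper, not an nvprof API.
int expected_requests(int blocks, int threads_per_block,
                      int mem_instructions_per_thread) {
    return blocks * warps_per_block(threads_per_block) *
           mem_instructions_per_thread;
}
```

For the launch `kernel1<<<1, 128>>>`, `expected_requests(1, 128, 2)` gives the 8 load requests and `expected_requests(1, 128, 1)` gives the 4 store requests reported above.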
dram_read_bytes, dram_write_bytes
__global__ void kernel1(int* arr)
{
    arr[threadIdx.x]++;  // one 4-byte load and one 4-byte store per thread
}
kernel1<<<1, 128>>>(d_A);
Invocations  Metric Name       Metric Description                          Min   Max   Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1  dram_read_bytes   Total bytes read from DRAM to L2 cache     1888  1888  1888
          1  dram_write_bytes  Total bytes written from L2 cache to DRAM  2560  2560  2560
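dram_read_bytes measures the number of bytes read from DRAM into the L2 cache. DRAM is a physical concept and usually refers to device memory; at the programming level, local memory, global memory, constant memory, and texture memory all reside in it. The kernel itself reads only 128 × 4 = 512 bytes from global memory, far less than the reported 1888 bytes.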
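To make the gap concrete, the following host-side sketch compares the bytes the kernel logically touches with the measured counter values. The helper names `logical_bytes` and `traffic_ratio` are hypothetical, and the sketch only restates the arithmetic; it does not explain where the extra DRAM traffic comes from.

```cpp
#include <cassert>
#include <cstddef>

// Bytes the kernel logically reads (or writes): one element per thread.
std::size_t logical_bytes(std::size_t threads, std::size_t bytes_per_element) {
    return threads * bytes_per_element;
}

// Ratio of the measured DRAM traffic to the logical traffic.
double traffic_ratio(std::size_t measured_bytes, std::size_t logical) {
    assert(logical > 0);
    return static_cast<double>(measured_bytes) / static_cast<double>(logical);
}
```

With 128 threads and 4-byte integers, `logical_bytes(128, 4)` is 512, so the measured read traffic is 1888 / 512 ≈ 3.69 times the logical amount and the measured write traffic is 2560 / 512 = 5 times.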
Instruction
warp_execution_efficiency
\begin{equation} \text{warp execution efficiency} = \frac{\text{active threads}}{\text{warpSize}} \end{equation}
kernel1<<<1, 24>>>(d_A);
kernel2<<<1, 16>>>(d_A);
kernel3<<<1, 12>>>(d_A);
Invocations  Metric Name                Metric Description            Min     Max     Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1  warp_execution_efficiency  Warp Execution Efficiency  75.00%  75.00%  75.00%
    Kernel: kernel2(int*)
          1  warp_execution_efficiency  Warp Execution Efficiency  50.00%  50.00%  50.00%
    Kernel: kernel3(int*)
          1  warp_execution_efficiency  Warp Execution Efficiency  37.50%  37.50%  37.50%
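warpSize is normally 32. When a kernel launch leaves a warp with fewer than 32 threads, the warp is padded automatically, but the padded threads are marked inactive. The lower the share of inactive threads, the higher the efficiency: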
- kernel1: $\frac{24}{32} = 75\%$;
- kernel2: $\frac{16}{32} = 50\%$;
- kernel3: $\frac{12}{32} = 37.5\%$;
In general, as long as the block size is chosen with warpSize in mind when launching a kernel, this metric will not hurt performance.
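The three percentages can be reproduced with a small host-side sketch. It models only the launch-time padding effect for a one-dimensional block and assumes a warp size of 32; the helper name `launch_warp_efficiency` is hypothetical, and the real metric is also affected by what happens inside the kernel.

```cpp
#include <cassert>

// Assumed warp width.
constexpr int kWarpSize = 32;

// Share of active threads (in percent) across all warps of one block,
// counting the padded threads of the last, partially filled warp as inactive.
double launch_warp_efficiency(int threads_per_block) {
    assert(threads_per_block > 0);
    int warps = (threads_per_block + kWarpSize - 1) / kWarpSize;
    return 100.0 * threads_per_block / (warps * kWarpSize);
}
```

`launch_warp_efficiency(24)`, `launch_warp_efficiency(16)`, and `launch_warp_efficiency(12)` return 75.0, 50.0, and 37.5, matching the profiler output, while any multiple of 32 returns 100.0.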
branch_efficiency
\begin{equation} \text{branch efficiency} = \frac{\text{branches} - \text{divergent branches}}{\text{branches}} \end{equation}
__global__ void kernel1(int* arr)
{
    // Even and odd threads of the same warp take different paths.
    if (threadIdx.x % 2 == 0)
        arr[threadIdx.x] = 0;
    else
        arr[threadIdx.x] = 1;
}
[mmhe@k057 Test]$ nvcc -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57249== NVPROF is profiling process 57249, command: ./test
==57249== Profiling application: ./test
==57249== Profiling result:
==57249== Metric result:
Invocations  Metric Name        Metric Description      Min      Max      Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1  branch_efficiency  Branch Efficiency   100.00%  100.00%  100.00%
[mmhe@k057 Test]$ nvcc -g -G -arch=sm_70 test.cu -o test
[mmhe@k057 Test]$ nvprof --metrics branch_efficiency ./test
==57328== NVPROF is profiling process 57328, command: ./test
==57328== Profiling application: ./test
==57328== Profiling result:
==57328== Metric result:
Invocations  Metric Name        Metric Description     Min     Max     Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel1(int*)
          1  branch_efficiency  Branch Efficiency   83.33%  83.33%  83.33%
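As the output shows, nvcc automatically optimizes away this kind of simple branch divergence. Even with the debug flags -g -G, nvcc still appears to retain some optimization, since the branch efficiency stays above 50%.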
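The formula for branch efficiency is easy to evaluate by hand, as the sketch below shows. The raw counters were not reported in the run above, so the counts used in the usage note (6 branches, 1 of them divergent) are purely hypothetical; they are chosen only because they reproduce 83.33%. The helper name `branch_efficiency` is likewise an illustration, not an nvprof API.

```cpp
#include <cassert>

// Branch efficiency in percent: the share of branches that did not diverge.
double branch_efficiency(long branches, long divergent_branches) {
    assert(branches > 0);
    assert(divergent_branches >= 0 && divergent_branches <= branches);
    return 100.0 * (branches - divergent_branches) / branches;
}
```

Under that assumption, `branch_efficiency(6, 1)` evaluates to about 83.33, and any run without divergent branches, such as `branch_efficiency(4, 0)`, gives 100.0.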
Multiprocessor
sm_efficiency
achieved_occupancy
According to the Compute Capabilities table, the maximum number of resident threads per SM is 2048 and the maximum number of resident warps per SM is 64. Occupancy is defined as:
\begin{equation} \text{occupancy} = \frac{\text{active warps}}{\text{maximum warps}} \end{equation}
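Achieved occupancy differs from theoretical occupancy. Theoretical occupancy is the ratio of the warps of the launched kernel to the maximum number of resident warps an SM allows, whereas achieved occupancy is the ratio of the warps that are actually active in each running cycle to that same maximum.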
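The difference between the two definitions can be sketched on the host side. This is a simplified model under stated assumptions: the limit of 64 warps per SM is taken from the text, the theoretical figure ignores register and shared-memory limits, and the per-cycle samples passed to `achieved_occupancy` are made-up inputs rather than profiler data. Both helper names are hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 64;  // from the text: 2048 threads / 32

// Theoretical occupancy in percent for a given number of resident blocks
// per SM. Simplification: only the warp limit is considered.
double theoretical_occupancy(int resident_blocks_per_sm, int threads_per_block) {
    assert(resident_blocks_per_sm > 0 && threads_per_block > 0);
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    int resident_warps = resident_blocks_per_sm * warps_per_block;
    if (resident_warps > kMaxWarpsPerSM) resident_warps = kMaxWarpsPerSM;
    return 100.0 * resident_warps / kMaxWarpsPerSM;
}

// Achieved occupancy in percent: the average number of active warps per
// sampled cycle, divided by the maximum number of warps per SM.
double achieved_occupancy(const std::vector<int>& active_warps_per_cycle) {
    assert(!active_warps_per_cycle.empty());
    long total = 0;
    for (int w : active_warps_per_cycle) total += w;
    double avg = static_cast<double>(total) / active_warps_per_cycle.size();
    return 100.0 * avg / kMaxWarpsPerSM;
}
```

A single 128-thread block on an SM gives `theoretical_occupancy(1, 128)` = 6.25. If the four warps are not all active in every cycle, for example with the illustrative samples {4, 4, 2, 2}, the achieved value drops to 4.6875.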