I haven't touched CUDA code for quite a while, and I have mostly forgotten the profiling commands, so I am using some spare time today to write them down for easy reference later.
The nvprof command
It simply errors out: nvprof cannot be used on devices with compute capability 8.0 or higher:
nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
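A quick way to check which side of that cut-off a machine falls on is to print the device's compute capability with the CUDA runtime API. A minimal sketch (device 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // The RTX 3090 used below reports 8.6, so nvprof refuses to run
    // and nsys / ncu have to be used instead.
    printf("%s\n", prop.major >= 8 ? "use nsys / ncu" : "nvprof still works");
    return 0;
}
```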
The nsys command (Linux)
nsys profile --stats=true ./5-3reduceIntege
Output:
~/cudaC/unit5$ nsys profile --stats=true ./5-3reduceIntege
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
/home/zyn/cudaC/unit5/./5-3reduceIntege starting reduction atdevice 0: NVIDIA GeForce RTX 3090
with array size 16777216 grid 131072 block 128
cpu reduce : 2139353471
reduceNeighboredGmem: 2139353471 <<<grid 131072 block 128>>>
Generating '/tmp/nsys-report-09fe.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/zyn/cudaC/unit5/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------- ----------------------
63.8 1,537,834,150 92 16,715,588.6 10,119,573.0 1,146 262,165,057 33,433,226.8 poll
25.9 623,869,881 2 311,934,940.5 311,934,940.5 123,751,122 500,118,759 266,132,108.3 pthread_cond_timedwait
10.0 240,457,182 649 370,504.1 22,601.0 1,023 22,438,180 1,086,289.3 ioctl
0.1 2,832,914 48 59,019.0 8,945.5 5,014 951,482 187,238.0 mmap64
0.1 1,342,580 18 74,587.8 96,389.5 13,299 153,732 47,889.8 sem_timedwait
0.0 818,329 18 45,462.7 10,248.0 1,957 202,878 72,523.7 mmap
0.0 648,770 2 324,385.0 324,385.0 255,162 393,608 97,896.1 pthread_join
0.0 270,074 12 22,506.2 6,589.0 2,312 197,373 55,204.8 munmap
0.0 234,465 1 234,465.0 234,465.0 234,465 234,465 0.0 pthread_cond_wait
0.0 222,957 3 74,319.0 72,590.0 51,931 98,436 23,300.7 pthread_create
0.0 217,384 43 5,055.4 4,374.0 1,463 13,918 2,673.3 open64
0.0 161,246 34 4,742.5 2,862.0 1,125 27,307 5,351.4 fopen
0.0 50,583 20 2,529.2 2,806.5 1,026 3,555 868.4 read
0.0 42,498 20 2,124.9 1,703.0 1,105 5,659 1,155.0 write
0.0 36,204 1 36,204.0 36,204.0 36,204 36,204 0.0 fgets
0.0 32,465 20 1,623.3 1,234.5 1,005 4,845 945.2 fclose
0.0 22,702 2 11,351.0 11,351.0 5,941 16,761 7,650.9 fread
0.0 21,733 6 3,622.2 4,046.5 1,314 6,126 1,777.9 open
0.0 17,492 3 5,830.7 6,707.0 2,558 8,227 2,934.3 pipe2
0.0 17,054 2 8,527.0 8,527.0 5,753 11,301 3,923.0 socket
0.0 15,133 3 5,044.3 3,915.0 3,648 7,570 2,191.4 pthread_cond_broadcast
0.0 9,767 1 9,767.0 9,767.0 9,767 9,767 0.0 connect
0.0 6,405 1 6,405.0 6,405.0 6,405 6,405 0.0 pthread_kill
0.0 3,596 2 1,798.0 1,798.0 1,653 1,943 205.1 fwrite
0.0 3,048 1 3,048.0 3,048.0 3,048 3,048 0.0 bind
0.0 2,277 1 2,277.0 2,277.0 2,277 2,277 0.0 pthread_cond_signal
0.0 1,610 1 1,610.0 1,610.0 1,610 1,610 0.0 fcntl
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ------------ ---------------------------------
69.2 70,851,810 1 70,851,810.0 70,851,810.0 70,851,810 70,851,810 0.0 cudaDeviceReset
24.1 24,712,531 2 12,356,265.5 12,356,265.5 3,096,192 21,616,339 13,095,721.5 cudaMemcpy
3.3 3,399,421 1 3,399,421.0 3,399,421.0 3,399,421 3,399,421 0.0 cudaGetDeviceProperties_v2_v12000
1.8 1,842,364 1 1,842,364.0 1,842,364.0 1,842,364 1,842,364 0.0 cudaLaunchKernel
1.3 1,284,642 2 642,321.0 642,321.0 186,268 1,098,374 644,956.3 cudaFree
0.3 291,849 2 145,924.5 145,924.5 68,123 223,726 110,027.9 cudaMalloc
0.0 3,185 1 3,185.0 3,185.0 3,185 3,185 0.0 cuCtxSynchronize
0.0 1,603 1 1,603.0 1,603.0 1,603 1,603 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- --------------------------------------
100.0 231,714 1 231,714.0 231,714.0 231,714 231,714 0.0 reduceGmem(int *, int *, unsigned int)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ------------ ------------ ---------- ---------- ----------- ----------------------------
99.0 21,807,788 1 21,807,788.0 21,807,788.0 21,807,788 21,807,788 0.0 [CUDA memcpy Host-to-Device]
1.0 217,121 1 217,121.0 217,121.0 217,121 217,121 0.0 [CUDA memcpy Device-to-Host]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ----------------------------
67.109 1 67.109 67.109 67.109 67.109 0.000 [CUDA memcpy Host-to-Device]
0.524 1 0.524 0.524 0.524 0.524 0.000 [CUDA memcpy Device-to-Host]
Generated:
/home/zyn/cudaC/unit5/report2.nsys-rep
/home/zyn/cudaC/unit5/report2.sqlite
Observations:
① Every run of nsys profile produces two new files, here report2.nsys-rep and report2.sqlite, and the number in the file name is incremented automatically on each subsequent run (the commands after this list show how to pin the output name instead).
② Very little of the time is spent in the kernel itself; far more goes to cudaDeviceReset and the host-device memory copies.
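Two related commands help with the file-numbering issue from ①; the flag spellings below are from memory, so treat them as assumptions and confirm with nsys profile --help and nsys stats --help (reduce_report is just a placeholder name):
- Write to a fixed report name and overwrite it on every run: nsys profile -o reduce_report -f true --stats=true ./5-3reduceIntege
- Re-generate the stats tables later from an existing report, without re-running the program: nsys stats reduce_report.nsys-rep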
The ncu command (Linux)
Usage:
- Basic profile (no explicit metrics): ncu ./5-3reduceInteger
- Profile specific metrics (the metric list goes before the executable): ncu --metrics gld_efficiency,gst_efficiency ./5-3reduceInteger — but note that ncu does not accept these old nvprof metric names; the current names have to be looked up with ncu --query-metrics (see the extra commands after this list)
- Profile only a particular kernel: ncu --kernel-name reduceGmem ./5-3reduceInteger
- Collect the complete section/metric set (not recommended, the output is huge): ncu --set full ./5-3reduceInteger
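A few more invocations that go with the list above (again from memory, so verify against ncu --help; reduce_report is a placeholder name):
- List every metric name ncu knows about, e.g. to find the replacements for gld_efficiency/gst_efficiency: ncu --query-metrics
- Write a report file to open later in the Nsight Compute GUI instead of dumping text to the terminal: ncu -o reduce_report ./5-3reduceInteger
- Combine options, e.g. the full section set but only for the reduceGmem kernel: ncu --set full --kernel-name reduceGmem ./5-3reduceInteger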
Output:
~/cudaC/unit5$ ncu ./5-3reduceInteger
==PROF== Connected to process 2380393 (/home/zyn/cudaC/unit5/5-3reduceInteger)
/home/zyn/cudaC/unit5/./5-3reduceInteger starting reduction atdevice 0: NVIDIA GeForce RTX 3090
with array size 4194304 grid 32768 block 128
cpu reduce : 534907410
==PROF== Profiling "reduceGmem" - 0: 0%....50%....100% - 8 passes
reduceNeighboredGmem: 534907410 <<<grid 32768 block 128>>>
==PROF== Disconnected from process 2380393
[2380393] 5-3reduceInteger@127.0.0.1
reduceGmem(int *, int *, unsigned int) (32768, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 9.49
SM Frequency Ghz 1.39
Elapsed Cycles cycle 106,926
Memory Throughput % 66.85
DRAM Throughput % 36.09
Duration us 76.80
L1/TEX Cache Throughput % 35.68
L2 Cache Throughput % 66.85
SM Active Cycles cycle 105,584.51
Compute (SM) Throughput % 22.44
----------------------- ----------- ------------
OPT Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2
bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes
transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or
whether there are values you can (re)compute.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 128
Function Cache Configuration CachePreferNone
Grid Size 32,768
Registers Per Thread register/thread 18
Shared Memory Configuration Size Kbyte 16.38
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 82
Stack Size 1,024
Threads thread 4,194,304
# TPCs 41
Enabled TPC IDs all
Uses Green Context 0
Waves Per SM 33.30
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 21
Block Limit Shared Mem block 16
Block Limit Warps block 12
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 58.66
Achieved Active Warps Per SM warp 28.16
------------------------------- ----------- ------------
OPT Est. Local Speedup: 41.34%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (58.7%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles cycle 262,944
Total DRAM Elapsed Cycles cycle 8,743,936
Average L1 Active Cycles cycle 105,584.51
Total L1 Elapsed Cycles cycle 8,760,638
Average L2 Active Cycles cycle 102,636.31
Total L2 Elapsed Cycles cycle 5,031,936
Average SM Active Cycles cycle 105,584.51
Total SM Elapsed Cycles cycle 8,760,638
Average SMSP Active Cycles cycle 103,461.77
Total SMSP Elapsed Cycles cycle 35,042,552
-------------------------- ----------- ------------
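The Speed Of Light numbers above (memory throughput around 67% against compute throughput around 22%) are what you would expect from a reduction that works purely in global memory. For context, this is roughly what such a kernel looks like; it is my own sketch of the usual reduce example, not necessarily the exact source of 5-3reduceInteger:

```cpp
// Sketch of a global-memory block reduction in the style of reduceGmem.
// Assumes n is a multiple of blockDim.x, which holds for the runs above.
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int tid = threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;  // this block's slice of the input
    (void)n;  // kept only to match the signature reported by nsys/ncu

    // Pairwise in-place reduction: every step reads and writes global memory,
    // which is why the kernel shows up as memory-bound rather than compute-bound.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) idata[tid] += idata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = idata[0];    // one partial sum per block
}
```

The per-block partial sums in g_odata are then added up on the host, which also matches the memcpy sizes in the nsys tables earlier: 67.109 MB in is the 16,777,216 input ints, and 0.524 MB back is the 131,072 partial sums, one per block.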
The Nsight Compute GUI (Windows)
Set up a remote SSH connection to the Linux machine in the GUI, then just select the target executable and profile it there.
The Nsight Systems GUI (Windows)
I have not used it yet; to be filled in later.