I haven't touched CUDA code for quite a while, and I have mostly forgotten the profiling commands, so I am using some spare time today to write them down for easy reference later.
The nvprof command
It simply errors out: nvprof cannot be used on devices with compute capability 8.0 or higher:
nvprof is not supported on devices with compute capability 8.0 and higher.
Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
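A quick way to check which side of that cut-off a machine falls on is to print the device's compute capability with the CUDA runtime API. A minimal sketch (device 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // The RTX 3090 used below reports 8.6, so nvprof refuses to run
    // and nsys / ncu have to be used instead.
    printf("%s\n", prop.major >= 8 ? "use nsys / ncu" : "nvprof still works");
    return 0;
}
```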
The nsys command (Linux)
nsys profile --stats=true ./5-3reduceIntege
Output:
~/cudaC/unit5$ nsys profile --stats=true ./5-3reduceIntege
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.
Collecting data...
/home/zyn/cudaC/unit5/./5-3reduceIntege starting reduction atdevice 0: NVIDIA GeForce RTX 3090
with array size 16777216 grid 131072 block 128
cpu reduce : 2139353471
reduceNeighboredGmem: 2139353471 <<<grid 131072 block 128>>>
Generating '/tmp/nsys-report-09fe.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/zyn/cudaC/unit5/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------- ----------------------
63.8 1,537,834,150 92 16,715,588.6 10,119,573.0 1,146 262,165,057 33,433,226.8 poll
25.9 623,869,881 2 311,934,940.5 311,934,940.5 123,751,122 500,118,759 266,132,108.3 pthread_cond_timedwait
10.0 240,457,182 649 370,504.1 22,601.0 1,023 22,438,180 1,086,289.3 ioctl
0.1 2,832,914 48 59,019.0 8,945.5 5,014 951,482 187,238.0 mmap64
0.1 1,342,580 18 74,587.8 96,389.5 13,299 153,732 47,889.8 sem_timedwait
0.0 818,329 18 45,462.7 10,248.0 1,957 202,878 72,523.7 mmap
0.0 648,770 2 324,385.0 324,385.0 255,162 393,608 97,896.1 pthread_join
0.0 270,074 12 22,506.2 6,589.0 2,312 197,373 55,204.8 munmap
0.0 234,465 1 234,465.0 234,465.0 234,465 234,465 0.0 pthread_cond_wait
0.0 222,957 3 74,319.0 72,590.0 51,931 98,436 23,300.7 pthread_create
0.0 217,384 43 5,055.4 4,374.0 1,463 13,918 2,673.3 open64
0.0 161,246 34 4,742.5 2,862.0 1,125 27,307 5,351.4 fopen
0.0 50,583 20 2,529.2 2,806.5 1,026 3,555 868.4 read
0.0 42,498 20 2,124.9 1,703.0 1,105 5,659 1,155.0 write
0.0 36,204 1 36,204.0 36,204.0 36,204 36,204 0.0 fgets
0.0 32,465 20 1,623.3 1,234.5 1,005 4,845 945.2 fclose
0.0 22,702 2 11,351.0 11,351.0 5,941 16,761 7,650.9 fread
0.0 21,733 6 3,622.2 4,046.5 1,314 6,126 1,777.9 open
0.0 17,492 3 5,830.7 6,707.0 2,558 8,227 2,934.3 pipe2
0.0 17,054 2 8,527.0 8,527.0 5,753 11,301 3,923.0 socket
0.0 15,133 3 5,044.3 3,915.0 3,648 7,570 2,191.4 pthread_cond_broadcast
0.0 9,767 1 9,767.0 9,767.0 9,767 9,767 0.0 connect
0.0 6,405 1 6,405.0 6,405.0 6,405 6,405 0.0 pthread_kill
0.0 3,596 2 1,798.0 1,798.0 1,653 1,943 205.1 fwrite
0.0 3,048 1 3,048.0 3,048.0 3,048 3,048 0.0 bind
0.0 2,277 1 2,277.0 2,277.0 2,277 2,277 0.0 pthread_cond_signal
0.0 1,610 1 1,610.0 1,610.0 1,610 1,610 0.0 fcntl
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ------------ ---------------------------------
69.2 70,851,810 1 70,851,810.0 70,851,810.0 70,851,810 70,851,810 0.0 cudaDeviceReset
24.1 24,712,531 2 12,356,265.5 12,356,265.5 3,096,192 21,616,339 13,095,721.5 cudaMemcpy
3.3 3,399,421 1 3,399,421.0 3,399,421.0 3,399,421 3,399,421 0.0 cudaGetDeviceProperties_v2_v12000
1.8 1,842,364 1 1,842,364.0 1,842,364.0 1,842,364 1,842,364 0.0 cudaLaunchKernel
1.3 1,284,642 2 642,321.0 642,321.0 186,268 1,098,374 644,956.3 cudaFree
0.3 291,849 2 145,924.5 145,924.5 68,123 223,726 110,027.9 cudaMalloc
0.0 3,185 1 3,185.0 3,185.0 3,185 3,185 0.0 cuCtxSynchronize
0.0 1,603 1 1,603.0 1,603.0 1,603 1,603 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- --------------------------------------
100.0 231,714 1 231,714.0 231,714.0 231,714 231,714 0.0 reduceGmem(int *, int *, unsigned int)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ------------ ------------ ---------- ---------- ----------- ----------------------------
99.0 21,807,788 1 21,807,788.0 21,807,788.0 21,807,788 21,807,788 0.0 [CUDA memcpy Host-to-Device]
1.0 217,121 1 217,121.0 217,121.0 217,121 217,121 0.0 [CUDA memcpy Device-to-Host]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ----------------------------
67.109 1 67.109 67.109 67.109 67.109 0.000 [CUDA memcpy Host-to-Device]
0.524 1 0.524 0.524 0.524 0.524 0.000 [CUDA memcpy Device-to-Host]
Generated:
/home/zyn/cudaC/unit5/report2.nsys-rep
/home/zyn/cudaC/unit5/report2.sqlite
Observations:
① Every run of nsys profile produces two new files, here report2.nsys-rep and report2.sqlite, and the number in the file name is incremented automatically on each subsequent run (the commands after this list show how to pin the output name instead).
② Very little of the time is spent in the kernel itself; far more goes to cudaDeviceReset and the host-device memory copies.
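Two related commands help with the file-numbering issue from ①; the flag spellings below are from memory, so treat them as assumptions and confirm with nsys profile --help and nsys stats --help (reduce_report is just a placeholder name):
- Write to a fixed report name and overwrite it on every run: nsys profile -o reduce_report -f true --stats=true ./5-3reduceIntege
- Re-generate the stats tables later from an existing report, without re-running the program: nsys stats reduce_report.nsys-rep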
The ncu command (Linux)
Usage:
- Basic profile (no explicit metrics): ncu ./5-3reduceInteger
- Profile specific metrics (the metric list goes before the executable): ncu --metrics gld_efficiency,gst_efficiency ./5-3reduceInteger — but note that ncu does not accept these old nvprof metric names; the current names have to be looked up with ncu --query-metrics (see the extra commands after this list)
- Profile only a particular kernel: ncu --kernel-name reduceGmem ./5-3reduceInteger
- Collect the complete section/metric set (not recommended, the output is huge): ncu --set full ./5-3reduceInteger
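A few more invocations that go with the list above (again from memory, so verify against ncu --help; reduce_report is a placeholder name):
- List every metric name ncu knows about, e.g. to find the replacements for gld_efficiency/gst_efficiency: ncu --query-metrics
- Write a report file to open later in the Nsight Compute GUI instead of dumping text to the terminal: ncu -o reduce_report ./5-3reduceInteger
- Combine options, e.g. the full section set but only for the reduceGmem kernel: ncu --set full --kernel-name reduceGmem ./5-3reduceInteger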
Output:
~/cudaC/unit5$ ncu ./5-3reduceInteger
==PROF== Connected to process 2380393 (/home/zyn/cudaC/unit5/5-3reduceInteger)
/home/zyn/cudaC/unit5/./5-3reduceInteger starting reduction atdevice 0: NVIDIA GeForce RTX 3090
with array size 4194304 grid 32768 block 128
cpu reduce : 534907410
==PROF== Profiling "reduceGmem" - 0: 0%....50%....100% - 8 passes
reduceNeighboredGmem: 534907410 <<<grid 32768 block 128>>>
==PROF== Disconnected from process 2380393
[2380393] 5-3reduceInteger@127.0.0.1
reduceGmem(int *, int *, unsigned int) (32768, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 9.49
SM Frequency Ghz 1.39
Elapsed Cycles cycle 106,926
Memory Throughput % 66.85
DRAM Throughput % 36.09
Duration us 76.80
L1/TEX Cache Throughput % 35.68
L2 Cache Throughput % 66.85
SM Active Cycles cycle 105,584.51
Compute (SM) Throughput % 22.44
----------------------- ----------- ------------
OPT Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L2
bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes
transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or
whether there are values you can (re)compute.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 128
Function Cache Configuration CachePreferNone
Grid Size 32,768
Registers Per Thread register/thread 18
Shared Memory Configuration Size Kbyte 16.38
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 82
Stack Size 1,024
Threads thread 4,194,304
# TPCs 41
Enabled TPC IDs all
Uses Green Context 0
Waves Per SM 33.30
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 21
Block Limit Shared Mem block 16
Block Limit Warps block 12
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 58.66
Achieved Active Warps Per SM warp 28.16
------------------------------- ----------- ------------
OPT Est. Local Speedup: 41.34%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (58.7%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles cycle 262,944
Total DRAM Elapsed Cycles cycle 8,743,936
Average L1 Active Cycles cycle 105,584.51
Total L1 Elapsed Cycles cycle 8,760,638
Average L2 Active Cycles cycle 102,636.31
Total L2 Elapsed Cycles cycle 5,031,936
Average SM Active Cycles cycle 105,584.51
Total SM Elapsed Cycles cycle 8,760,638
Average SMSP Active Cycles cycle 103,461.77
Total SMSP Elapsed Cycles cycle 35,042,552
-------------------------- ----------- ------------
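The Speed Of Light numbers above (memory throughput around 67% against compute throughput around 22%) are what you would expect from a reduction that works purely in global memory. For context, this is roughly what such a kernel looks like; it is my own sketch of the usual reduce example, not necessarily the exact source of 5-3reduceInteger:

```cpp
// Sketch of a global-memory block reduction in the style of reduceGmem.
// Assumes n is a multiple of blockDim.x, which holds for the runs above.
__global__ void reduceGmem(int *g_idata, int *g_odata, unsigned int n) {
    unsigned int tid = threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;  // this block's slice of the input
    (void)n;  // kept only to match the signature reported by nsys/ncu

    // Pairwise in-place reduction: every step reads and writes global memory,
    // which is why the kernel shows up as memory-bound rather than compute-bound.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) idata[tid] += idata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = idata[0];    // one partial sum per block
}
```

The per-block partial sums in g_odata are then added up on the host, which also matches the memcpy sizes in the nsys tables earlier: 67.109 MB in is the 16,777,216 input ints, and 0.524 MB back is the 131,072 partial sums, one per block.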
The Nsight Compute GUI (Windows)
Set up a remote SSH connection to the Linux machine in the GUI, then just select the target executable and profile it there.
The Nsight Systems GUI (Windows)
I have not used it yet; to be filled in later.