CUDA Performance Profiling Tools


https://github.com/cwpearson/nvidia-performance-tools

NVIDIA nvprof / nvvp

  • A profiler that has been available since 2008; interactive and easy to use
  • Use the nvprof command to record a profiling log
  • Use the nvvp command (NVIDIA Visual Profiler) to visualize the log
  • Running under nvprof/nvvp adds considerable overhead and the collected statistics can be inaccurate, so Nsight is recommended instead
  1. Run nvvp in a terminal
  2. Click File -> New Session and select the executable in the File field
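
As a minimal sketch (the executable name ./my_app is a placeholder), recording a log from the command line and then opening it in the Visual Profiler could look like this:

    # record a profile of the application into my_app.nvvp
    nvprof -o my_app.nvvp ./my_app
    # open the recorded log in the Visual Profiler GUI
    nvvp my_app.nvvp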


https://blog.csdn.net/TracelessLe/article/details/110880135

The Nsight Family

The family includes Nsight Systems and Nsight Compute. Nsight Systems is the new-generation replacement for nvprof and is used to monitor code execution and analyze performance.

Nsight Systems

Local usage

Launch Nsight Systems with the nsight-sys command, set the command line and working directory, then click Start on the right.

Remote usage

On a remote machine, use the nsys command to generate a profile file, then download it and open it in Nsight Systems:

nsys profile -o first_attempt.qdrep ./first_attempt
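
A slightly richer invocation (a sketch; the trace options and names are illustrative) restricts tracing to the APIs of interest and prints summary tables without the GUI:

    # trace CUDA, NVTX and OS runtime calls, write the report to first_attempt.qdrep
    nsys profile -t cuda,nvtx,osrt -o first_attempt ./first_attempt
    # print kernel/API summary statistics directly in the terminal
    nsys stats first_attempt.qdrep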

Result analysis

  • The report consists of 5 parts:

    1. Analysis Summary (a comprehensive overview that contains detailed information about the target: process summary, module summary, thread summary, environment variables, CPU info, GPU info, and so on)
    2. Timeline View (shows the work timeline of each CPU core and the GPU; typically used to locate the bottleneck in model training or inference)
    3. Diagnostics Summary (as the name suggests, a summary of what the program did while running, collecting any warnings, errors, or messages)
    4. Symbol Resolution Logs (not yet clear what this is used for)
    5. Files (the stdout log of the run, pid_stdout.log, and the stderr log, pid_stderr.log)
  • In the Timeline View, focus mainly on three rows: CUDA HW (your own kernels), TensorRT, and CUDA API.
    Hovering the cursor over a kernel name brings up a record like the one below, including the launch configuration and memory usage (the occupancy figure is worked through after the output):

    gemmKernel
    Begins: 0.327224s
    Ends: 0.354951s (+27.727 ms)
    grid:  <<<32, 32, 1>>>
    block: <<<32, 32, 1>>>
    Launch Type: Regular
    Static Shared Memory: 0 bytes
    Dynamic Shared Memory: 0 bytes
    Registers Per Thread: 36
    Local Memory Per Thread: 0 bytes
    Local Memory Total: 26,542,080 bytes
    Shared Memory executed: 8,192 bytes
    Shared Memory Bank Size: 4 B
    Theoretical occupancy: 66.6667 %
    Launched from thread: 193828
    Latency: ←145.765 μs
    Correlation ID: 116
    Stream: Default stream 7
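
The theoretical occupancy above can be reproduced by hand, assuming the target GPU allows at most 1536 resident threads per SM (true for compute capability 8.6 devices, for example). The block is 32 x 32 = 1024 threads; both the thread cap (a second 1024-thread block would exceed 1536) and register pressure (36 registers x 1024 threads = 36,864 of the 65,536 registers per SM) allow only one resident block per SM, so occupancy = 1024 / 1536 ≈ 66.67 %, matching the reported value.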
    

Nsight Compute

  • An interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through both a user interface and a command-line tool. Its baseline feature lets users compare results directly within the tool. NVIDIA Nsight Compute offers a customizable, data-driven user interface and metric collection, and can be extended with analysis scripts that post-process the results.
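
To get an idea of what can be collected, the command-line tool can list the available report sections and metrics (a sketch; the output differs between versions):

    # list the report sections (SpeedOfLight, Occupancy, MemoryWorkloadAnalysis, ...)
    ncu --list-sections
    # list every metric that can be collected on the current device
    ncu --query-metrics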

Local usage

  1. You can set the executable path directly in Nsight Compute and launch it. This may fail with the error "The user does not have permission to profile on the target device.", in which case start the UI with sudo (a permanent fix is sketched after this list):

    sudo /usr/local/cuda/bin/ncu-ui
    


  2. Use the -> (step) button to navigate to the kernel of interest, then click Profile Kernel
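
If running the GUI as root is undesirable, NVIDIA documents a persistent alternative: grant non-admin users access to the GPU performance counters through a kernel-module option (a sketch; the file name is arbitrary, and a module reload or reboot is required afterwards):

    # allow non-root users to read GPU performance counters
    echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/profiling.conf
    # regenerate the initramfs if your distribution requires it, then reboot
    sudo update-initramfs -u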

Remote usage

  1. First run on the remote machine:
sudo /usr/local/cuda/bin/ncu -o profile --set full ./myapplication <arguments>
  2. Then download the generated report and open it in Nsight Compute via File -> Open
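
As a sketch (the kernel name, flags, and report name are illustrative), the collection can also be narrowed to a single kernel to reduce overhead, and the downloaded report opened straight from the shell:

    # profile only the first launch of gemmKernel and write the report to profile.ncu-rep
    sudo ncu --kernel-name gemmKernel --launch-count 1 -o profile ./myapplication <arguments>
    # open the report in the GUI (equivalent to File -> Open; the extension may differ on older versions)
    ncu-ui profile.ncu-rep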

Result analysis

Among the report pages, Session shows the device information, Details contains the kernel-level analysis, and Source shows the resource usage of every source line and its corresponding assembly instructions (see the note on -lineinfo below).
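
For the Source page to correlate assembly with source lines, the application generally has to be compiled with line information; a minimal sketch (file and binary names are placeholders):

    # -lineinfo embeds source line information without disabling optimizations
    nvcc -O3 -lineinfo -o myapplication myapplication.cu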


https://blog.csdn.net/yan31415/article/details/109491749
https://blog.csdn.net/TracelessLe/article/details/116945768
https://www.paddlepaddle.org.cn/inference/master/guides/performance_tuning/performance_analysis_profiler.html
