Using Nsight (1): nsys profile --cuda-graph-trace

伴生_0904

Last modified: 2024-08-08 02:17:52


First published: 2024-08-08 02:17:05

Article link: https://blog.csdn.net/weixin_45705773/article/details/141004775


--cuda-graph-trace=<granularity>[:<launch origin>] Set the granularity and launch origin for CUDA graph trace. Applicable only when CUDA tracing is enabled.

Possible values for <granularity> are 'graph' or 'node'. If 'graph' is selected, CUDA graphs will be traced as a whole and node activities will not be collected. This can reduce overhead to the minimal, but requires CUDA driver version 11.7 or higher. If 'node' is selected, node activities will be collected, but CUDA graphs will not be traced as a whole. This may cause significant runtime overhead. If CUDA driver version is 11.7 or higher, default is 'graph', otherwise default is 'node'.

Possible values for <launch origin> are 'host-only' or 'host-and-device'. If 'host-only' is selected, only CUDA graphs launched from host codes will be traced. If 'host-and-device' is selected, CUDA graphs launched from host codes and device codes will both be traced. This is only supported when the granularity is set to 'graph' and the CUDA driver is version 12.3 or higher. This may cause significant runtime overhead. If granularity is set to 'graph' and the CUDA driver version is 12.3 or higher, the default is 'host-and-device', otherwise default is 'host-only'. Application scope.

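The --cuda-graph-trace=<granularity>[:<launch origin>] option of nsys (the NVIDIA Nsight Systems command-line profiler) sets the granularity and launch origin used when tracing CUDA graphs. A CUDA graph describes a sequence of CUDA operations, such as kernel launches and memory copies, together with their dependencies, so that the whole workflow can be defined once and launched repeatedly with less launch overhead. The rest of this article explains the option in detail: how to use it, which scenarios it fits, and how to choose a value.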

Parameter details

Usage:
```
--cuda-graph-trace=<granularity>[:<launch origin>]
```
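- <granularity>: specifies the trace granularity. Possible values:
  - graph: traces each CUDA graph as a whole and does not collect the activity of individual nodes. This keeps overhead to a minimum, but requires CUDA driver version 11.7 or higher.
  - node: collects the activity of individual nodes, but does not trace the CUDA graph as a whole. This may cause significant runtime overhead.
- <launch origin>: specifies which graph launches are traced. Possible values:
  - host-only: traces only CUDA graphs launched from host code.
  - host-and-device: traces CUDA graphs launched from both host code and device code. This is supported only when the granularity is graph and the CUDA driver version is 12.3 or higher, and it may cause significant runtime overhead.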
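The accepted forms of the value can be captured in a small shell check. This is only a sketch: `is_valid_graph_trace` is a hypothetical helper, not part of nsys, and it assumes that `node:host-only` is accepted, since the help text only restricts `host-and-device` to graph granularity.

```bash
# Hypothetical helper (not part of nsys): returns 0 if the value has the form
# <granularity>[:<launch origin>] described in the help text, 1 otherwise.
is_valid_graph_trace() {
    case "$1" in
        graph|node) return 0 ;;                              # granularity only
        graph:host-only|graph:host-and-device) return 0 ;;   # graph with explicit origin
        node:host-only) return 0 ;;                          # assumption: accepted by nsys
        *) return 1 ;;                                       # e.g. node:host-and-device
    esac
}
```

A wrapper script could call this before invoking `nsys profile`, so that a typo in the value is caught before a long profiling run starts.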

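Default values

- If the CUDA driver version is 11.7 or higher and no granularity is specified, the default granularity is graph.
- With older drivers, the default granularity is node.
- If the granularity is graph and the CUDA driver version is 12.3 or higher, the default launch origin is host-and-device.
- In all other cases, the default launch origin is host-only.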
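The default rules can be written out as a small decision function. The sketch below only restates the help text; `default_graph_trace` is a hypothetical helper, and it takes the CUDA driver version as two integers (major and minor) rather than querying the driver.

```bash
# Hypothetical helper: prints the default --cuda-graph-trace value that the
# help text describes for a given CUDA driver version (major, minor).
default_graph_trace() {
    major=$1
    minor=$2
    # Encode the version as one integer so it can be compared, e.g. 11.7 -> 1107.
    ver=$((major * 100 + minor))
    if [ "$ver" -ge 1203 ]; then
        echo "graph:host-and-device"   # 12.3 or higher
    elif [ "$ver" -ge 1107 ]; then
        echo "graph:host-only"         # 11.7 up to, but not including, 12.3
    else
        echo "node:host-only"          # older than 11.7
    fi
}

default_graph_trace 12 4   # prints graph:host-and-device
default_graph_trace 11 8   # prints graph:host-only
default_graph_trace 11 4   # prints node:host-only
```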

Usage examples

Suppose you want to trace CUDA graphs at graph granularity and only care about graphs launched from the host. You can run:

nsys profile --cuda-graph-trace=graph:host-only ./my_cuda_application

If you also want to trace graphs launched from device code (assuming the CUDA driver version is 12.3 or higher), you can run:

nsys profile --cuda-graph-trace=graph:host-and-device ./my_cuda_application

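Applicable scenarios

Performance analysis and optimization:
- When you want to analyze how CUDA graphs perform as a whole, for example when each graph launch starts and how long it runs, graph granularity reduces profiling overhead and highlights the overall behavior of each graph. Individual node activities are not collected in this mode.
Single-node behavior analysis:
- If you need to look deeper into individual nodes, for example the execution time of each kernel inside a graph, node granularity is the appropriate choice. It causes larger runtime overhead, but provides more detail.
Device-launched graphs:
- When the application launches CUDA graphs from device code and you want to tune them, choose host-and-device so that device-side launches are traced as well. This captures more data at the cost of potentially significant runtime overhead.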
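One way to decide between the two granularities is to profile the same application once in each mode and compare the reports. The sketch below only prints the commands rather than running them; `print_profile_commands`, the report names, and `./my_cuda_application` are placeholders, while `--trace=cuda` and `-o` are standard `nsys profile` options.

```bash
# Hypothetical helper: prints (does not run) one nsys command per granularity,
# so the same application can be profiled in both modes and compared.
print_profile_commands() {
    app=$1
    for mode in graph node; do
        # -o names the report file so the two runs do not overwrite each other.
        echo "nsys profile --trace=cuda --cuda-graph-trace=${mode} -o report_${mode} ${app}"
    done
}

print_profile_commands ./my_cuda_application
```

Because node granularity can add significant overhead, timings from the two reports are not directly comparable; the node report is better read for relative cost between nodes.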

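Recommendations

Choose graph granularity:
- If your application has multiple CUDA graphs and your main goal is to understand their overall performance. This keeps profiling overhead to a minimum. It requires CUDA driver version 11.7 or higher.
Choose node granularity:
- If you are focused on specific CUDA kernels or operations inside a graph and need per-node timing to locate a bottleneck, and you can accept the significant runtime overhead this brings.
Mind the driver version:
- Always check that your CUDA driver version meets the requirements of the granularity and launch origin you choose, so that the setting actually takes effect.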
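A quick way to check the version is to read the "CUDA Version" field from the `nvidia-smi` banner. The sketch below assumes that banner format (a line containing `CUDA Version: <major>.<minor>`) and assumes this value corresponds to the CUDA driver version that nsys refers to; `cuda_version_from_smi` is a hypothetical helper name.

```bash
# Hypothetical helper: reads nvidia-smi banner text on stdin and prints the
# "CUDA Version" it reports (e.g. 12.2). Prints nothing if no match is found.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Typical use on a machine with an NVIDIA driver installed:
#   nvidia-smi | cuda_version_from_smi
```

The printed version can then be compared against 11.7 (for graph granularity) and 12.3 (for host-and-device) before choosing a value.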