NVIDIA GPU Profiling with Nsight Systems

Luchang-Li

Last modified: 2024-08-19 12:56:15

Tags: GPU profiling

First published: 2024-08-13 09:32:58

Article link: https://blog.csdn.net/u013701860/article/details/141154811

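Why profiling matters

Once you have found the cause of a problem and its bottleneck, the problem is more than half solved.

How to analyze: the two starting points are computation graph analysis and profiling analysis. This article is mainly about profiling.

Computation graph analysis: check whether the graph is optimized well enough, for example whether format conversion, operator fusion, and model quantization have been handled properly.

Operator profiling: analyze operator performance and bottlenecks, check whether convolution and matrix multiplication account for the vast majority of the time, and find the operators that are bottlenecks so they can be optimized specifically.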

Nsight Systems

User Guide — nsight-systems 2024.5 documentation

Profiling Deep Learning with Nsight Systems

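Nsight Systems replaces the older nvprof tool and provides more powerful profiling capabilities. You can still use nvprof-style functionality from within Nsight Systems (for example, nsys nvprof python resnet_test.py).

Reference command: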

nsys profile --trace=cuda,nvtx,osrt,cudnn,cublas --export sqlite --force-overwrite true -o analysis_test python resnet_test.py

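This generates a .nsys-rep report file (analysis_test.nsys-rep with the -o value above), which can then be opened and visualized in Nsight Systems installed on Windows.

Export the detailed profiling data as a JSON file: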

nsys export --type json --force-overwrite=true -o profiling.json analysis_test.nsys-rep

Or use this tool, which is more convenient:

https://github.com/chenyu-jiang/nsys2json

python nsys2json.py -f analysis_test.sqlite -o analysis_test_2_json.json
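
The analysis_test.sqlite file written by --export sqlite can also be queried directly with Python's built-in sqlite3 module. The following is only a minimal sketch: it assumes the export contains a CUPTI_ACTIVITY_KIND_KERNEL table (with start and end timestamps in nanoseconds and a shortName id) and a StringIds lookup table. These table and column names are assumptions that may differ between nsys versions, so check the schema of your own file first. top_kernels is a hypothetical helper name.

```python
import sqlite3


def top_kernels(conn, limit=10):
    """Return (kernel_name, launches, total_ms) rows, slowest kernels first.

    Assumed schema (verify against your own export):
      CUPTI_ACTIVITY_KIND_KERNEL(start, end, shortName, ...)
      StringIds(id, value)
    """
    query = """
        SELECT s.value AS name,
               COUNT(*) AS launches,
               SUM(k."end" - k."start") / 1e6 AS total_ms
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value
        ORDER BY total_ms DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

To use it, open the file with sqlite3.connect("analysis_test.sqlite") and pass the connection to top_kernels; each returned row gives a kernel name, its launch count, and its accumulated GPU time in milliseconds.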

Marking execution ranges with NVTX, e.g. in PyTorch

https://pytorch.org/docs/stable/cuda.html#nvidia-tools-extension-nvtx

torch.cuda.nvtx.mark
Describe an instantaneous event that occurred at some point.

torch.cuda.nvtx.range_push
Push a range onto a stack of nested range spans.

torch.cuda.nvtx.range_pop
Pop a range off of a stack of nested range spans.

torch.cuda.nvtx.range
Context manager / decorator that pushes an NVTX range at the beginning of its scope, and pops it at the end.

Usage example:

    with torch.cuda.nvtx.range(f"resnet_inference_iter{i}"):
        logits = model(pixel_values).logits

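With these markers, each marked range shows up clearly in the timeline and can be analyzed on its own. Without markers, the profile also contains model loading, pre-processing, post-processing and so on, which makes it hard to tell where each step starts.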
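
The range_push / range_pop pair listed above can also be wrapped by hand, which is useful when the same script has to run on machines without CUDA. The following is a minimal sketch under that assumption: nvtx_range and run_iterations are hypothetical helper names, and the fallback to a no-op when PyTorch or CUDA is unavailable is a design choice of this example, not something PyTorch does for you.

```python
import contextlib

try:
    import torch
    # Only use NVTX when a CUDA device is actually usable.
    _NVTX = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _NVTX = None


@contextlib.contextmanager
def nvtx_range(name):
    """Push an NVTX range on entry and pop it on exit, even if the body raises."""
    if _NVTX is not None:
        _NVTX.range_push(name)
    try:
        yield
    finally:
        if _NVTX is not None:
            _NVTX.range_pop()


def run_iterations(step, num_iters):
    """Call step(i) num_iters times, with nested NVTX ranges per iteration."""
    results = []
    for i in range(num_iters):
        with nvtx_range(f"iter{i}"):          # outer range: one per iteration
            with nvtx_range("forward"):       # inner range: nested inside it
                results.append(step(i))
    return results
```

Ranges nest as a stack, so every push needs exactly one matching pop; putting the pop in a finally block keeps the stack balanced if the model call raises.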

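How can you collect or extract only the operator profiling information inside each NVTX range?

After converting to JSON with the https://github.com/chenyu-jiang/nsys2json tool mentioned above, each event carries an NVTXRegions field, which makes it easy to get the operators that ran inside a marked range, for example: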

    {
        "name": "implicit_convolve_sgemm",
        "ph": "X",
        "cat": "cuda",
        "ts": 3530122.805,
        "dur": 51.968,
        "tid": "Stream 7",
        "pid": "Device 0",
        "args": {
            "NVTXRegions": [
                "resnet_inference_iter1"
            ]
        }
    },
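
Given events shaped like the one above, the kernels belonging to the marked ranges can be filtered and summed in a few lines. The following is a minimal sketch, assuming every kernel event has the name, ph, cat, dur and args.NVTXRegions fields shown in the excerpt and that dur is in microseconds (as in the Chrome trace format). kernel_time_by_region and load_events are hypothetical helper names, and whether the file is a bare list or wraps the events in a "traceEvents" key is not confirmed here, so the loader accepts both.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events from a JSON file (bare list or {"traceEvents": [...]})."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def kernel_time_by_region(events, region_prefix):
    """Sum kernel duration per kernel name for events inside matching NVTX ranges.

    An event matches when any of its NVTXRegions starts with region_prefix.
    Returns {kernel_name: total_dur}, largest total first.
    """
    totals = defaultdict(float)
    for ev in events:
        # Keep only complete ("X") CUDA events, i.e. kernel executions.
        if ev.get("cat") != "cuda" or ev.get("ph") != "X":
            continue
        regions = ev.get("args", {}).get("NVTXRegions", [])
        if any(r.startswith(region_prefix) for r in regions):
            totals[ev["name"]] += ev["dur"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

For example, kernel_time_by_region(load_events("analysis_test_2_json.json"), "resnet_inference_iter") aggregates over all inference iterations. Because matching is by prefix, "resnet_inference_iter1" would also match iter10, iter11 and so on; compare with == instead if a single iteration is wanted.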

Checking GPU utilization with nvidia-smi

https://zhuanlan.zhihu.com/p/667658845

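A Python 3 program for getting NVIDIA GPU information (CSDN blog)

Example of polling nvidia-smi periodically from Python (nvidia-smi -l 1 does the same, but -l only accepts whole seconds; the -lms option takes the interval in milliseconds):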

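import subprocess
import time
from datetime import datetime


def getcmdoutput(cmd):
    # Run the command in a shell and return its output as a list of lines.
    output = subprocess.getoutput(cmd)
    output = output.split('\n')
    return output


gap = 0.3  # sleep between samples, in seconds
period = 2 * 60  # nominal total sampling time, in seconds
loop_num = int(period / gap)

cmd = "nvidia-smi"

for i in range(loop_num):
    output = getcmdoutput(cmd)
    # Print a timestamp with each snapshot so samples can be lined up later.
    cur_date = datetime.now()
    print("current date", cur_date)
    for out in output:
        print(out)
    # The real interval is gap plus the time nvidia-smi itself takes to run.
    time.sleep(gap)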
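
The loop above prints the full nvidia-smi table on every sample, which is awkward to post-process. The following is a minimal sketch of a variant that asks nvidia-smi for machine-readable CSV via --query-gpu and --format=csv,noheader,nounits and parses it into dictionaries. The choice of fields and the names QUERY_FIELDS, parse_gpu_query and query_gpus are assumptions of this example.

```python
import subprocess

# Fields requested from nvidia-smi, in output column order.
QUERY_FIELDS = ["timestamp", "utilization.gpu", "memory.used"]


def _to_int(text):
    """Convert a numeric field to int; return None for values like [N/A]."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_query(text):
    """Parse CSV lines (one per GPU) produced by the query in query_gpus()."""
    samples = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        timestamp, util, mem = parts
        samples.append({
            "timestamp": timestamp,
            "gpu_util_pct": _to_int(util),   # percent, since nounits is set
            "mem_used_mib": _to_int(mem),    # MiB, since nounits is set
        })
    return samples


def query_gpus():
    """Run nvidia-smi once and return one parsed sample per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)
```

Calling query_gpus() inside the same time.sleep(gap) loop yields a list of per-GPU samples that can be appended to a list or written to a file for plotting.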