Installing NVIDIA Nsight Systems in Docker and using the nsys tool with CUDA

BugRecorder

Category: Memos. Tags: docker, debian

Article link: https://blog.csdn.net/weixin_45973213/article/details/128386604

The nsys tool cannot be found:

root@8274e2789343:/usr/local/cuda-12.0# nsys
bash: nsys: command not found
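
Before installing anything, it can help to confirm that the tool is really missing from PATH rather than mistyped. The sketch below is only an illustration: `have_tool` is a hypothetical helper name, not part of Nsight Systems, and it checks PATH only, so a copy installed in a directory that is not on PATH would still be reported as missing.

```shell
# have_tool NAME: report whether NAME resolves to a command on PATH.
# (have_tool is a hypothetical helper introduced for this example.)
have_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH"
    return 1
  fi
}

# Check for nsys; "|| true" keeps a script running under "set -e" alive.
have_tool nsys || true
```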

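Installing NVIDIA Nsight Systems in Docker
The image is Debian-based; the commands below pull from NVIDIA's ubuntu2004 devtools repository.
(If your image is not Debian-based, see the official documentation.)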

   $ apt-get update -y
   $ DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       apt-transport-https \
       ca-certificates \
       gnupg \
       wget
   $ rm -rf /var/lib/apt/lists/*
   $ wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add -
   $ echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list
   $ apt-get update -y
   $ DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       nsight-systems-2020.2.1
   $ rm -rf /var/lib/apt/lists/*

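The second-to-last command (the apt-get install of nsight-systems-2020.2.1) kept dropping the connection; retrying it a few times eventually worked.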
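
Retrying by hand can be wrapped in a small loop. The following is a minimal sketch, not something from the original setup: `retry` is a hypothetical helper, and the `RETRY_DELAY` variable and its 5-second default are assumptions chosen for illustration.

```shell
# retry N CMD...: run CMD up to N times, stopping at the first success.
# (retry and RETRY_DELAY are hypothetical names introduced for this example.)
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i of $attempts failed" >&2
    # Pause before the next attempt, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "${RETRY_DELAY:-5}"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage with the install step from above ("env" carries the variable through):
# retry 5 env DEBIAN_FRONTEND=noninteractive apt-get install -y \
#     --no-install-recommends nsight-systems-2020.2.1
```

If every attempt fails, `retry` returns a non-zero status, so a Dockerfile `RUN` step or a `set -e` script would still stop on a persistent failure.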
Installation complete:

root@8274e2789343:/# nsys --version
NVIDIA Nsight Systems version 2020.2.1.71-64a8f98

Trying it out
First, create a simple application and save it as add.cu:

#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <string.h>  // for strcmp
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
 
int main(void)
{
  int N = 1<<20;
  float *x, *y;
 
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
 
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
 
  // Prefetch the data to the GPU
  char *prefetch = getenv("__PREFETCH");
  if (prefetch == NULL || strcmp(prefetch, "off") != 0) {
    int device = -1;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);
  }
 
  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
 
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

Compile it first, then run it under nsys:

$ nvcc -o add_cuda add.cu
 
$ __PREFETCH=off nsys profile -o noprefetch --stats=true ./add_cuda

Output:

Collecting data...
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.

The target application terminated with signal 11 (SIGSEGV)
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-3dc6-1a3f-c700-ca20.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-3dc6-1a3f-c700-ca20.qdrep"
Exporting 122 events: [===================================================100%]

Exported successfully to
/tmp/nsys-report-3dc6-1a3f-c700-ca20.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

CUDA trace data was not collected.

Generating Operating System Runtime API Statistics...
Operating System Runtime API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   95.9        19404700          33        588021.2          512400          935000  read
    2.2          453300          36         12591.7            6300           92200  open
    0.9          174300           4         43575.0            4700           71500  ioctl
    0.6          129500           1        129500.0          129500          129500  pthread_create
    0.3           61500          12          5125.0            1400           22200  fopen
    0.1           17000           5          3400.0            1400            5000  fclose

Generating NVTX Push-Pop Range Statistics...
NVTX Push-Pop Range Statistics (nanoseconds)

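Done: nsys is installed and runs. Note, however, that in this run the target application terminated with signal 11 (SIGSEGV) and no CUDA trace data was collected, so the report contains only operating system runtime statistics. The sample code does not check the return values of its CUDA calls, so one possible cause (not verified here) is a failed cudaMallocManaged, for example when the container has no GPU access, which would leave x and y uninitialized.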
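
Because the nsys output reports that the target terminated with signal 11 (SIGSEGV), it may be worth running the binary on its own first and looking at its exit status. The sketch below assumes a bash-like shell, where a status above 128 conventionally means "killed by signal (status - 128)"; `describe_status` is a hypothetical helper introduced for this example.

```shell
# describe_status CODE: explain an exit status using the 128+N convention
# that bash and dash use for processes killed by signal N.
# (describe_status is a hypothetical helper introduced for this example.)
describe_status() {
  if [ "$1" -eq 0 ]; then
    echo "exited normally"
  elif [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "exited with status $1"
  fi
}

# Usage, assuming the add_cuda binary built earlier:
# ./add_cuda; describe_status $?
```

A status of 139 from `./add_cuda` would correspond to signal 11, meaning the crash happens without nsys involved and should be fixed before profiling.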