Installing NVIDIA Nsight Systems in Docker and profiling CUDA code with nsys

The nsys tool cannot be found:

root@8274e2789343:/usr/local/cuda-12.0# nsys
bash: nsys: command not found

Installing NVIDIA Nsight Systems in Docker
The image is Debian-based.
(If yours is not, see the official documentation.)

   $ apt-get update -y
   $ DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       apt-transport-https \
       ca-certificates \
       gnupg \
       wget
   $ rm -rf /var/lib/apt/lists/*
   $ wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add -
   $ echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list
   $ apt-get update -y
   $ DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       nsight-systems-2020.2.1
   $ rm -rf /var/lib/apt/lists/*

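The second-to-last command (the nsight-systems install) kept dropping its connection; retrying it a few times eventually worked.
Installation complete: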

root@8274e2789343:/# nsys --version
NVIDIA Nsight Systems version 2020.2.1.71-64a8f98

Trying it out
First, create a simple application, add.cu:

#include <iostream>
#include <math.h>
#include <stdlib.h>  // getenv
#include <string.h>  // strcmp
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
 
int main(void)
{
  int N = 1<<20;
  float *x, *y;
 
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
 
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
 
  // Prefetch the data to the GPU
  char *prefetch = getenv("__PREFETCH");
  if (prefetch == NULL || strcmp(prefetch, "off") != 0) {
    int device = -1;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);
  }
 
  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
 
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
 
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
 
  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

Compile it first, then profile it with nsys:

$ nvcc -o add_cuda add.cu
 
$ __PREFETCH=off nsys profile -o noprefetch --stats=true ./add_cuda

Output of the nsys command:

Collecting data...
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.

The target application terminated with signal 11 (SIGSEGV)
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-3dc6-1a3f-c700-ca20.qdstrm" file to disk...
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-3dc6-1a3f-c700-ca20.qdrep"
Exporting 122 events: [===================================================100%]

Exported successfully to
/tmp/nsys-report-3dc6-1a3f-c700-ca20.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

CUDA trace data was not collected.

Generating Operating System Runtime API Statistics...
Operating System Runtime API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   95.9        19404700          33        588021.2          512400          935000  read
    2.2          453300          36         12591.7            6300           92200  open
    0.9          174300           4         43575.0            4700           71500  ioctl
    0.6          129500           1        129500.0          129500          129500  pthread_create
    0.3           61500          12          5125.0            1400           22200  fopen
    0.1           17000           5          3400.0            1400            5000  fclose

Generating NVTX Push-Pop Range Statistics...
NVTX Push-Pop Range Statistics (nanoseconds)

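nsys now runs inside the container. Note, however, that in this run the target application terminated with signal 11 (SIGSEGV) and nsys reported "CUDA trace data was not collected", so the report contains only OS runtime statistics and no CUDA API or kernel statistics. One thing worth checking is version compatibility: the nsight-systems 2020.2.1 package installed above is much older than the CUDA 12.0 toolkit in this container, so a newer nsight-systems release may be required to trace CUDA activity.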
