CUDA Program Analysis

NVIDIA Visual Profiler and nvprof

usage

nvprof [options] [application] [application-arguments]

usage mode

  • summary mode
nvprof matrixMul
  • GPU trace mode
nvprof --print-gpu-trace matrixMul
  • API trace mode

    nvprof --print-api-trace matrixMul

    Note: API tracing can be turned off, if not needed, with --profile-api-trace none. This reduces some of the profiling overhead, especially when the kernels are short.

  • Event/metric Summary Mode

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • some useful events and metrics
  • events

    • gld_inst_32bit
    • gst_*
    • global_load
    • global_store
    • local_*
    • warps_launched
    • active_cycles
    • *_warps/ctas
    • tex0_cache_sector_queries
    • tex1_*
  • metrics

    • ipc
    • gld_transactions_per_request
    • gst_*
    • gld_efficiency
    • sm_efficiency
    • l2_read_transactions
    • l2_tex_transactions
    • l2_utilization
  • example

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • Event/metric Trace Mode
nvprof --aggregate-mode off --events local_load --print-gpu-trace matrixMul
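
To make metrics such as gld_efficiency and gld_transactions_per_request more concrete, here is a minimal sketch of a test program with one coalesced and one strided global load pattern. The kernel names, the stride of 32, and the file name are made up for illustration, and which events and metrics are available depends on the GPU's compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive floats.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read floats that are `stride` elements apart,
// so each warp touches many more memory segments per load request.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

int main() {
    const int n = 1 << 20;        // illustrative problem size
    const int stride = 32;        // illustrative stride
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(d_in, d_out, n);
    copy_strided<<<grid, block>>>(d_in, d_out, n, stride);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Assuming the file is saved as access.cu and built with "nvcc access.cu -o access", running "nvprof --metrics gld_efficiency,gld_transactions_per_request ./access" should report the two kernels separately. The strided kernel is expected to show lower load efficiency and more transactions per request, though the exact values vary by device.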

other important options

  • --dependency-analysis
  • Timeline
nvprof --export-profile timeline.prof <app> <app args>
  • Metrics And Events
    Use this mode to collect events or metrics for all kernels in an application for
    which you have already collected a timeline. Collecting events or metrics for all kernels significantly changes the overall
    performance characteristics of the application, because all kernel executions are serialized on the GPU.
    Even though overall application performance changes, the event or metric values
    for individual kernels are still correct, so you can merge them
    onto the previously collected timeline to get an accurate picture of the application's behavior.
  nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <app> <app args>
  • Analysis For Individual Kernel
nvprof --kernels <kernel specifier> --analysis-metrics -o analysis.prof <app> <app args>

metric reference

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

compiling a simple app

.cu -> .ptx -> .cubin -> executable

Use "nvcc -keep" to preserve the intermediate files.
Use "nvcc --ptxas-options=-v" to print verbose compilation output, including:

- number of registers used
- shared memory in bytes
- local memory in bytes
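
As a small illustration of the verbose ptxas output, the sketch below defines a kernel with a statically sized shared memory array. The kernel name, tile size, and file name are assumptions chosen for the example.

```cuda
#include <cuda_runtime.h>

// Reverses each 256-element tile of `data` through shared memory.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void reverse_tile(float *data) {
    __shared__ float tile[256];   // 256 * 4 bytes = 1024 bytes of static shared memory
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();              // make sure the whole tile is loaded before reading it back
    data[base + t] = tile[blockDim.x - 1 - t];
}

int main() {
    const int n = 1024;           // multiple of the 256-thread block size
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    reverse_tile<<<n / 256, 256>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Assuming the file is saved as reverse_tile.cu, compiling it with "nvcc --ptxas-options=-v reverse_tile.cu" should print a ptxas info line for reverse_tile that lists the register count and the shared memory used by the tile array. The register count depends on the target architecture and compiler version.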

cuobjdump

A disassembly tool. It shows the static instructions of a compiled binary, for example with "cuobjdump -sass <executable>".

cuda timing functions

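Example:

cudaEvent_t start, stop;
float elapsed;  // elapsed time in milliseconds

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
fool_kernel<<<grid, block>>>();
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);  // wait until the stop event has actually been recorded
cudaEventElapsedTime(&elapsed, start, stop);  // returns milliseconds, with a resolution of around 0.5 microseconds

printf("elapsed time %f (seconds)\n", elapsed / 1000);

cudaEventDestroy(start);
cudaEventDestroy(stop);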
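
A single launch can be dominated by one-time startup costs, so a common variation is to do one warm-up launch and then average over several timed launches. The following is a minimal, self-contained sketch of that idea. The kernel scale_kernel, the CUDA_CHECK macro, the problem size, and the repetition count are all hypothetical names and values picked for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel to time: multiplies every element by a constant.
__global__ void scale_kernel(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;    // illustrative problem size
    const int reps = 100;     // number of timed launches
    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up launch, excluded from the measurement.
    scale_kernel<<<grid, block>>>(d_x, n, 2.0f);
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < reps; ++r) {
        scale_kernel<<<grid, block>>>(d_x, n, 2.0f);
    }
    CUDA_CHECK(cudaGetLastError());          // catch launch errors
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));  // wait for all timed launches to finish

    float total_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&total_ms, start, stop));
    printf("average kernel time: %f ms\n", total_ms / reps);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```

Because both events are recorded in the same stream as the kernels, the measured interval covers only the GPU work between them, not the host-side launch code that runs before the start event.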
