Testing the relationship among srcnode_gpc, srcunit_l1, and srcunit_tex in NCU

1. References

2. Notes

lts__t_sectors_srcunit_tex_op_read.sum: total number of 32-byte sectors requested by src unit tex
(this is all l1tex including local, global, surface, and texture) and operation type is read from any aperture (device, sysmem, peer).

3. Reproduction steps

tee srcnode_srcunit_test.cu<<-'EOF'
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err = call;                                    \
        if (err != cudaSuccess) {                                  \
            std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__; \
            std::cerr << " code=" << err << " (" << cudaGetErrorString(err) << ")" << std::endl; \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Each thread copies one float: one global load and one global store per thread.
__global__ void kernel(float *input,float *output)
{
    unsigned int tid  = threadIdx.x + blockIdx.x * blockDim.x;
    output[tid]=input[tid];
}

int main(int argc,char *argv[])
{
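    int deviceid=0;
    CHECK_CUDA(cudaSetDevice(deviceid));
    int block_count=1;
    int block_size=32*4; // 4 warps, one load request each
    int thread_size=block_count*block_size;
    // The buffer contents are never inspected; only the memory traffic matters here.
    float *input;CHECK_CUDA(cudaMalloc(&input, thread_size*sizeof(float)));
    float *output;CHECK_CUDA(cudaMalloc(&output, thread_size*sizeof(float)));
    kernel<<<block_count, block_size>>>(input,output);
    CHECK_CUDA(cudaGetLastError());      // catch launch errors
    CHECK_CUDA(cudaDeviceSynchronize()); // catch execution errors
    CHECK_CUDA(cudaFree(input));
    CHECK_CUDA(cudaFree(output));
    return 0;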
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -arch=sm_86 -lineinfo  -o srcnode_srcunit_test srcnode_srcunit_test.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -lcuda

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "aperture_device " | awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "lts.*op_read " | grep -v "aperture" | awk '{print $1}' | egrep "op_read$" |awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test

  • Metric descriptions
lts__t_requests_aperture_device              Counter request # of LTS requests accessing device memory (vidmem)
lts__t_requests_srcnode_gpc_aperture_device  Counter request # of LTS requests from node GPC accessing device memory (vidmem)
lts__t_requests_srcunit_l1_aperture_device   Counter request # of LTS requests from unit L1 accessing device memory (vidmem)
lts__t_requests_srcunit_tex_aperture_device  Counter request # of LTS requests from unit TEX accessing device memory (vidmem)
															 
lts__t_sectors_aperture_device               Counter sector  # of LTS sectors accessing device memory (vidmem)
lts__t_sectors_srcnode_gpc_aperture_device   Counter sector  # of LTS sectors from node GPC accessing device memory (vidmem)
lts__t_sectors_srcunit_l1_aperture_device    Counter sector  # of LTS sectors from unit L1 accessing device memory (vidmem)
lts__t_sectors_srcunit_tex_aperture_device   Counter sector  # of LTS sectors from unit TEX accessing device memory (vidmem)
															 
lts__t_requests_op_read                      Counter request # of LTS requests for reads
lts__t_requests_srcnode_gpc_op_read          Counter request # of LTS requests from node GPC for reads
lts__t_requests_srcunit_l1_op_read           Counter request # of LTS requests from unit L1 for reads
lts__t_requests_srcunit_tex_op_read          Counter request # of LTS requests from unit TEX for reads
															 
lts__t_sectors_op_read                       Counter sector  # of LTS sectors for reads
lts__t_sectors_srcnode_gpc_op_read           Counter sector  # of LTS sectors from node GPC for reads
lts__t_sectors_srcunit_l1_op_read            Counter sector  # of LTS sectors from unit L1 for reads
lts__t_sectors_srcunit_tex_op_read           Counter sector  # of LTS sectors from unit TEX for reads
  • Output
  kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    Section: Command line profiler metrics
    ----------------------------------------------- ----------- ------------
    Metric Name                                     Metric Unit Metric Value
    ----------------------------------------------- ----------- ------------
    lts__t_requests_aperture_device.sum                 request           84
    lts__t_requests_srcnode_gpc_aperture_device.sum     request           74
    lts__t_requests_srcunit_l1_aperture_device.sum      request            0
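    lts__t_requests_srcunit_tex_aperture_device.sum     request            8 # the load issued 4 requests; this metric is not filtered by op, so the other 4 are presumably the stores to output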
    lts__t_sectors_aperture_device.sum                   sector          319
    lts__t_sectors_srcnode_gpc_aperture_device.sum       sector          290
    lts__t_sectors_srcunit_l1_aperture_device.sum        sector            0
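    lts__t_sectors_srcunit_tex_aperture_device.sum       sector           32 # the load needs 16 sectors; the other 16 are presumably the store's, since this counter is taken at the L2 tag stage and is not DRAM traffic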
    ----------------------------------------------- ----------- ------------
	
  kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    Section: Command line profiler metrics
    --------------------------------------- ----------- ------------
    Metric Name                             Metric Unit Metric Value
    --------------------------------------- ----------- ------------
    lts__t_requests_op_read.sum                 request           83
    lts__t_requests_srcnode_gpc_op_read.sum     request           68  
    lts__t_requests_srcunit_l1_op_read.sum      request            0
    lts__t_requests_srcunit_tex_op_read.sum     request            4  # 4 warps, one coalesced load request each
    lts__t_sectors_op_read.sum                   sector          299  # total read sectors received by LTS
    lts__t_sectors_srcnode_gpc_op_read.sum       sector          272  # sectors from the GPC = 68*4, so each GPC request covers 4 sectors on average
    lts__t_sectors_srcunit_l1_op_read.sum        sector            0
    lts__t_sectors_srcunit_tex_op_read.sum       sector           16  # one warp reads 128 bytes = 4 sectors; 4 warps * 4 = 16 sectors for the load
    --------------------------------------- ----------- ------------

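From these statistics, lts__t_requests_op_read is the total number of read requests received at the L2 tag stage (83 here). These requests come from the GPC (68) or from other sources, and the GPC requests in turn include L1 (0), TEX (4), and others.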
