Roofline-on-NVIDIA-GPUs Code Analysis

图波列夫

Last modified: 2023-08-12 09:42:43

Categories: NVIDIA, Roofline. Tags: Roofline, deep learning, GPU, performance analysis

First published: 2021-12-04 10:28:22

Article link: https://blog.csdn.net/yiran103/article/details/121711485

Current state of the Roofline code:

CS Roofline Toolkit is the implementation of Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis; uo-cdux/ert-mirror is a mirror of it on GitHub.
cyanguwa/nersc-roofline is the code for Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs; it contains GPP and the ERT kernel written in C.
NERSC/roofline-on-nvidia-gpus is the code for 8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks; its data collection method is improved, but it only includes GPP.
NERSC/timemory is the code for Timemory: Modular Performance Analysis for HPC; it is more systematic and better organized.

The rest of this article walks through NERSC/roofline-on-nvidia-gpus.

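NERSC/roofline-on-nvidia-gpus demonstrates Roofline analysis on NVIDIA GPUs, the V100 in particular. The repository is laid out as follows.

/example-codes contains a few toy kernels in kernel_abc.cu and a real HPC mini-app, GPP, extracted from the materials science code BerkeleyGW.
/ncu-section-files contains the default Speed of Light section files shipped with Nsight Compute in CUDA 11, plus several custom section files for hierarchical Roofline analysis covering double precision, single precision, half precision and Tensor Core operations. These section files let Nsight Compute (ncu) collect and visualize Roofline data automatically.
run.ncu shows how to run Nsight Compute in CUDA 11, and run.gpp.ncu is a Slurm job script that runs five versions of the GPP example on Cori GPU.
/custom-scripts provides a set of job launch, post-processing and visualization scripts for manual Roofline data collection and plotting, so that users can integrate Roofline analysis into their own workflows more easily.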

Customized ncu-based Roofline Workflow

To integrate with users' other workflows, /custom-scripts provides a set of scripts for manual metric collection and Roofline visualization.

run.gpp.customized
postprocess.py and roofline.py

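The custom script run.gpp.customized uses GPP as an example to show the list of Nsight Compute metrics needed for Roofline analysis. These metrics are collected with the Nsight Compute command-line utility ncu (or nv-nsight-cu-cli) and written to .csv files in /custom-scripts.

postprocess.py then post-processes the results with Pandas to compute the arithmetic intensity (AI) and FLOP/s throughput of each profiled kernel.
When processing is done, postprocess.py calls the Matplotlib-based roofline.py to draw the Roofline chart and saves it to a .png file.

The data collection method used in these scripts is detailed below. It is a new feature of Nsight Compute in CUDA 11.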

Time:
- sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
- DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
- SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
- HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
- Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
- DRAM: dram__bytes.sum
- L2: lts__t_bytes.sum
- L1: l1tex__t_bytes.sum
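
As a minimal sketch of how the formulas above combine for a single kernel invocation, the function below turns a dictionary of raw metric values into a time, a FLOP count and one arithmetic intensity per memory level. The function name `roofline_point` and the sample numbers are assumptions made up for illustration; only the metric names come from the list above.

```python
PREFIX = "sm__sass_thread_inst_executed_op_"


def roofline_point(m):
    """Derive time, FLOPs and arithmetic intensity from raw metric values."""
    # Time = elapsed cycles / clock rate
    time_s = m["sm__cycles_elapsed.avg"] / m["sm__cycles_elapsed.avg.per_second"]

    def flops(p):
        # p is 'd' (DP), 'f' (SP) or 'h' (HP); an FMA counts as two FLOPs
        return (m[PREFIX + p + "add_pred_on.sum"]
                + 2 * m[PREFIX + p + "fma_pred_on.sum"]
                + m[PREFIX + p + "mul_pred_on.sum"])

    cuda_core = flops("d") + flops("f") + flops("h")
    tensor_core = 512 * m["sm__inst_executed_pipe_tensor.sum"]
    total = cuda_core + tensor_core
    return {
        "time_s": time_s,
        "flops": total,
        "ai_dram": total / m["dram__bytes.sum"],
        "ai_l2": total / m["lts__t_bytes.sum"],
        "ai_l1": total / m["l1tex__t_bytes.sum"],
        "flops_per_s": total / time_s,
    }


# Made-up sample values (not measurements): a DP-only kernel
sample = {PREFIX + p + op + "_pred_on.sum": 0
          for p in "dfh" for op in ("add", "fma", "mul")}
sample.update({
    "sm__cycles_elapsed.avg": 1_000_000,
    "sm__cycles_elapsed.avg.per_second": 1e9,
    PREFIX + "dadd_pred_on.sum": 100,
    PREFIX + "dfma_pred_on.sum": 200,
    PREFIX + "dmul_pred_on.sum": 50,
    "sm__inst_executed_pipe_tensor.sum": 0,
    "dram__bytes.sum": 1100,
    "lts__t_bytes.sum": 2200,
    "l1tex__t_bytes.sum": 5500,
})
point = roofline_point(sample)
print(point)
```

With these sample values the FLOP count is 100 + 2 x 200 + 50 = 550, so the DRAM arithmetic intensity is 550 / 1100 = 0.5 FLOPs/Byte.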

run.gpp.customized

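Environment Modules is a tool commonly used on HPC clusters to manage environment configuration. It brings compilers, MPI libraries, math libraries and application software (computation packages, analysis packages and so on) under one framework as modules, so that users can switch environment variables dynamically.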

module load cuda/11.0.2
module load pgi/19.10

Set the metrics that the Nsight Compute CLI should collect.

# Time
metrics="sm__cycles_elapsed.avg,\
sm__cycles_elapsed.avg.per_second,"

# DP
metrics+="sm__sass_thread_inst_executed_op_dadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_dfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_dmul_pred_on.sum,"

# SP
metrics+="sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,"

# HP
metrics+="sm__sass_thread_inst_executed_op_hadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_hfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_hmul_pred_on.sum,"

# Tensor Core
metrics+="sm__inst_executed_pipe_tensor.sum,"

# DRAM, L2 and L1
metrics+="dram__bytes.sum,\
lts__t_bytes.sum,\
l1tex__t_bytes.sum"

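Slurm is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system for Linux clusters large and small.
srun submits a job for real-time execution or launches a job step.
The script changes to the GPP directory, then builds and runs the code.
The -k option filters kernels by matching their names against a regular expression.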

cd ../example-codes/GPP/

input=gpp214unformatted.dat
dir=../../custom-scripts/
 
# Baseline
output=output.csv
profilestr="ncu -k sigma_gpp_gpu --metrics $metrics --csv"
echo Baseline version
git checkout gpp.f90
make clean
make
srun -n1 $profilestr ./gpp.x $input  > $dir/$output 2>&1
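
For readers who drive the profiler from Python rather than a shell script, the same command line can be assembled as an argument list. This is only a sketch: `profile_command` and `METRICS` are hypothetical names, the flags (`-k`, `--metrics`, `--csv`) are the ones used in the script above, and the command is printed, not executed.

```python
import shlex

# The 15 metrics from run.gpp.customized: time, DP/SP/HP FLOPs, Tensor Core, bytes
METRICS = ["sm__cycles_elapsed.avg", "sm__cycles_elapsed.avg.per_second"]
METRICS += ["sm__sass_thread_inst_executed_op_{}{}_pred_on.sum".format(p, op)
            for p in "dfh" for op in ("add", "fma", "mul")]
METRICS += ["sm__inst_executed_pipe_tensor.sum"]
METRICS += ["dram__bytes.sum", "lts__t_bytes.sum", "l1tex__t_bytes.sum"]


def profile_command(kernel_filter, program, program_args):
    """Build the ncu argument list (hypothetical helper, nothing is run)."""
    return (["ncu", "-k", kernel_filter, "--metrics", ",".join(METRICS), "--csv",
             program] + list(program_args))


cmd = profile_command("sigma_gpp_gpu", "./gpp.x", ["gpp214unformatted.dat"])
print(shlex.join(cmd))
```

Building the list in code avoids the trailing-comma bookkeeping that the shell version needs when concatenating the metric groups.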

Switch to each of the four optimized implementations and run it.

# Four optimization steps 
for n in `seq 1 4`
do
	output=output$n.csv
	profilestr="ncu -k sigma_gpp_gpu --metrics $metrics --csv"
	echo Patch version: $n
	git checkout gpp.f90
	patch gpp.f90 step$n.patch
	make clean
	make
	srun -n1 $profilestr ./gpp.x $input   > $dir/$output 2>&1
done

Call postprocess.py to generate the Roofline plots.

module load python/3.7-anaconda-2019.10
cd $dir
srun -n1 python postprocess.py

postprocess.py

files is the list of CSV files in the current directory whose names start with "output".

datadir='.'
files=[x for x in os.listdir(datadir) if x.endswith('.csv') and x.startswith('output')]
files.sort()
files=[os.path.join(datadir,file) for file in files]

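Using file as a variable name is not ideal (it was a built-in name in Python 2).
The inner loop counts lines until it reaches the CSV header row, which is the line containing 'Host Name'.
pandas.read_csv then skips the cnt-1 lines that precede the header (skiprows=cnt-1), that is, any output written before the CSV data.
pandas.DataFrame.groupby groups a DataFrame using a mapper or a series of columns and returns a pandas.core.groupby.DataFrameGroupBy object.
pandas.pivot_table creates a spreadsheet-style pivot table as a DataFrame.
The data is grouped by the two columns 'Kernel Name' and 'Metric Name' and summed.
pandas.DataFrame.shape returns a tuple with the dimensions of the DataFrame. Count, the number of rows per kernel divided by the number of metric columns, is the number of times each kernel was invoked.
The results are stored in dfs[tag].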

dfs={}
for file in files:
    tag, ext = os.path.splitext(os.path.basename(file))
    dfs[tag]=pd.DataFrame()
    with open(file,'r') as f:
        cnt=0
        while True:
            ln=f.readline()
            if not ln:
                break
            cnt+=1
            if 'Host Name' in ln:
                break
        df = pd.read_csv(file, skiprows=cnt-1)
        dft=df.groupby(['Kernel Name','Metric Name']).sum()
        dfmetric=pd.pivot_table(dft, index='Kernel Name', columns='Metric Name', values='Metric Value')
        dfmetric['Count']=df.groupby(['Kernel Name']).count()['ID'].div(dfmetric.shape[1])

$\mathrm{time} = \frac{\mathrm{cycles}}{\mathrm{rate}}$

        dfmetric['Time']=dfmetric['sm__cycles_elapsed.avg'] \
                        / (dfmetric['sm__cycles_elapsed.avg.per_second'] /dfmetric['Count'] )

$\mathrm{add} + 2\times \mathrm{fma} + \mathrm{mul}$

        dfmetric['CC FLOPs']= 2 * dfmetric['sm__sass_thread_inst_executed_op_dfma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_dmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_dadd_pred_on.sum'] \
                            + 2 * dfmetric['sm__sass_thread_inst_executed_op_ffma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_fmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_fadd_pred_on.sum'] \
                            + 2 * dfmetric['sm__sass_thread_inst_executed_op_hfma_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_hmul_pred_on.sum'] \
                            + dfmetric['sm__sass_thread_inst_executed_op_hadd_pred_on.sum']

$\mathrm{FLOP_{tc}} = \mathrm{Inst_{tc}}\times 512$

        dfmetric['TC FLOPs']= 512 * dfmetric['sm__inst_executed_pipe_tensor.sum']
        dfmetric['all FLOPs']= dfmetric['CC FLOPs'] + dfmetric['TC FLOPs']

        dfmetric['AI HBM'] = dfmetric['all FLOPs'].div(dfmetric['dram__bytes.sum'])
        dfmetric['AI L2'] = dfmetric['all FLOPs'].div(dfmetric['lts__t_bytes.sum'])
        dfmetric['AI L1'] = dfmetric['all FLOPs'].div(dfmetric['l1tex__t_bytes.sum'])

        dfmetric['GFLOP/s'] = dfmetric['all FLOPs']/ dfmetric['Time'] /1024/1024/1024
        dfmetric['TC GFLOP/s'] = dfmetric['TC FLOPs']/ dfmetric['Time'] /1024/1024/1024
#         dfmetric.to_csv('pd_'+tag+'.csv')
        dfs[tag]=dfmetric
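
The header search and the Count-based time calculation can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the repository's code: `load_times` is a hypothetical helper, and `SAMPLE` is a made-up, simplified stand-in for ncu's CSV output (the real output has more columns) in which one kernel is invoked twice.

```python
import csv
import io
from collections import defaultdict

# Made-up sample: two preamble lines, a header row, then two invocations of kernel "k"
SAMPLE = "\n".join([
    "==PROF== hypothetical profiler message",
    "hypothetical program output",
    '"ID","Host Name","Kernel Name","Metric Name","Metric Value"',
    '"0","node01","k","sm__cycles_elapsed.avg","1000"',
    '"0","node01","k","sm__cycles_elapsed.avg.per_second","1000000000"',
    '"1","node01","k","sm__cycles_elapsed.avg","3000"',
    '"1","node01","k","sm__cycles_elapsed.avg.per_second","1000000000"',
])


def load_times(text):
    lines = text.splitlines()
    # Same idea as skiprows=cnt-1: start reading at the row containing 'Host Name'
    header = next(i for i, ln in enumerate(lines) if "Host Name" in ln)
    rows = list(csv.DictReader(io.StringIO("\n".join(lines[header:]))))

    # Sum every metric per kernel, like groupby(...).sum() followed by a pivot
    totals = defaultdict(lambda: defaultdict(float))
    n_rows = defaultdict(int)
    for r in rows:
        totals[r["Kernel Name"]][r["Metric Name"]] += float(r["Metric Value"])
        n_rows[r["Kernel Name"]] += 1

    result = {}
    for kernel, metrics in totals.items():
        count = n_rows[kernel] / len(metrics)  # number of invocations
        cycles = metrics["sm__cycles_elapsed.avg"]
        rate_sum = metrics["sm__cycles_elapsed.avg.per_second"]
        # The summed clock rate is divided by Count to recover the average rate
        result[kernel] = {"count": count, "time_s": cycles / (rate_sum / count)}
    return result


times = load_times(SAMPLE)
print(times)
```

Because the per-invocation values are summed, the cycle counts add up to a total, but the clock rate must be averaged again; this is why the script divides the summed rate by Count before computing Time.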

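For the results of each file:
pandas.Index.tolist returns the values as a list.
pandas.Series.tolist returns the values as a list.
This way the roofline function no longer needs to call any Pandas functions.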

tags=dfs.keys()
flags=['all'] #'HBM','L2','L1' or 'all'
for tag in tags:
    for flag in flags:
        dfm=dfs[tag]
        LABELS = dfm.index.tolist()
        AIL1   = dfm['AI L1'].tolist()
        AIL2   = dfm['AI L2'].tolist()
        AIHBM  = dfm['AI HBM'].tolist()
        FLOPS  = dfm['GFLOP/s'].tolist()

        roofline(tag, FLOPS, AIHBM, AIL2, AIL1, LABELS, flag)

roofline

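The function first validates its arguments: the inputs must not be empty, the lists must have matching lengths, and flag must be one of the known values.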

def roofline(filename, FLOPS, AIHBM, AIL2=None, AIL1=None, LABELS=None, flag='HBM'):

    if not FLOPS:
        print('FLOPS can not be empty!')
        return
    if max(FLOPS)==0:
        print('FLOPS are all 0s!')
        return
    if (not AIHBM) and (not AIL2) and (not AIL1):
        print('AIHBM, AIL2 and AIL1 can not all be empty!')
        return
    if (len(FLOPS) != len(AIHBM)) or (len(FLOPS) != len(AIL2)) or (len(FLOPS) != len(AIL1)):
        print('FLOPS needs to have the same length as AI!')
        return
    if (flag != 'HBM') and (flag != 'L2') and (flag != 'L1') and (flag != 'all'):
        print('flag needs to be one of HBM, L2, L1, and all!')
        return

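memRoofs and cmpRoofs are values determined in advance; as the plot labels further down show, memRoofs is in GB/s and cmpRoofs in TFLOP/s.
matplotlib.pyplot.figure creates a new figure or activates an existing one. figsize is the width and height in inches.
matplotlib.pyplot.clf clears the current figure.
matplotlib.figure.Figure.gca gets the current axes.
matplotlib.axes.Axes.set_xscale sets the x-axis scale.
matplotlib.axes.Axes.set_xlabel sets the x-axis label.
matplotlib.axes.Axes.set_xlim sets the x-axis view limits.
matplotlib.axes.Axes.get_xlim returns the x-axis view limits.
Both axes use a logarithmic scale, and the visible x range is $[10^{x_{min}}, 10^{x_{max}}]$.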

    LABELS = [x[:maxchar] for x in LABELS]

    memRoofs = [('L1', 54000.), ('L2', 2996.77),  ('HBM', 828.76)] 
    cmpRoofs = [('Tensor', 96.9),('DP', 7.8)]

    fig = plt.figure(1,figsize=(10.67,6.6))
    plt.clf()
    ax = fig.gca()
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('Arithmetic Intensity [FLOPs/Byte]')
    ax.set_ylabel('Performance [GFLOP/sec]')

    nx   = 10000
    xmin = -3 
    xmax = 3
    ymin = 1
    ymax = 200000

    ax.set_xlim(10**xmin, 10**xmax)
    ax.set_ylim(ymin, ymax)

    ixx = int(nx*0.02)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

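numpy.logspace returns nx numbers evenly spaced on a log scale over the closed interval [10**xmin, 10**xmax] (the endpoint is included by default).
Multiplying x by a memRoofs bandwidth gives the performance value on the y axis.
For each entry in cmpRoofs: if the L1 bandwidth line reaches that compute ceiling at the current point but is still below it at the previous point, the previous point is appended to scomp_x_elbow and scomp_ix_elbow.
For each entry in memRoofs: if that bandwidth line reaches the Tensor compute ceiling at the current point but is still below it at the previous point, the previous point is appended to smem_x_elbow and smem_ix_elbow.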

    scomp_x_elbow  = []
    scomp_ix_elbow = []
    smem_x_elbow   = []
    smem_ix_elbow  = []

    x = np.logspace(xmin,xmax,nx)
    for roof in cmpRoofs:
        for ix in range(1,nx):
            if float(memRoofs[0][1] * x[ix]) >= roof[1]*1024 and (memRoofs[0][1] * x[ix-1]) < roof[1]*1024:
                scomp_x_elbow.append(x[ix-1])
                scomp_ix_elbow.append(ix-1)
                break

    for roof in memRoofs:
        for ix in range(1,nx):
            if (cmpRoofs[0][1]*1024 <= roof[1] * x[ix] and cmpRoofs[0][1]*1024 > roof[1] * x[ix-1]):
                smem_x_elbow.append(x[ix-1])
                smem_ix_elbow.append(ix-1)
                break
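
The elbow that the scan locates can be cross-checked analytically: a bandwidth line y = B x meets a compute ceiling P at x = P / B. The sketch below compares the two without NumPy. The function names `elbow` and `scan_elbow` are hypothetical, and the factor 1024 simply follows the script's TFLOP/s to GFLOP/s convention.

```python
def elbow(peak_tflops, bandwidth_gbs):
    """Arithmetic intensity where the bandwidth line meets the compute ceiling."""
    return peak_tflops * 1024 / bandwidth_gbs


def scan_elbow(peak_tflops, bandwidth_gbs, xmin=-3, xmax=3, nx=10000):
    """Grid scan in the style of roofline.py; returns None if there is no crossing."""
    # Same grid as np.logspace(xmin, xmax, nx), built with plain Python
    x = [10 ** (xmin + (xmax - xmin) * i / (nx - 1)) for i in range(nx)]
    ceiling = peak_tflops * 1024
    for ix in range(1, nx):
        if bandwidth_gbs * x[ix] >= ceiling and bandwidth_gbs * x[ix - 1] < ceiling:
            return x[ix - 1]
    return None


# DP ceiling against the HBM bandwidth, using the values hard-coded in the script
exact = elbow(7.8, 828.76)
scanned = scan_elbow(7.8, 828.76)
print(exact, scanned)

# A crossing outside the plotted x range is never found
missing = scan_elbow(7.8, 1.0)
print(missing)
```

The scan returns the grid point just below the exact crossing, so the two values agree to within one grid step. When the crossing lies outside the x range, nothing is recorded for that roof and the elbow list ends up shorter than the roof list.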

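Draw the Roofline lines.
For each entry in cmpRoofs, draw the part after the elbow.
For each entry in memRoofs, draw the part before the elbow.
Looping over len(cmpRoofs) and len(memRoofs) here can raise an index error if an elbow was not found; len(scomp_ix_elbow) and len(smem_ix_elbow) would be more appropriate.
matplotlib.axes.Axes.plot draws points, lines or other markers in XY coordinates.
color is black, linestyle is solid, and linewidth is 2 points.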

    for i in range(len(cmpRoofs)):
        roof = cmpRoofs[i][1]*1024
        y = np.ones(len(x)) * roof
        ax.plot(x[scomp_ix_elbow[i]:],y[scomp_ix_elbow[i]:],c='k',ls='-',lw='2')

    for i in range(len(memRoofs)):
        roof = memRoofs[i][1]
        y = x * roof
        ax.plot(x[:smem_ix_elbow[i]+1],y[:smem_ix_elbow[i]+1],c='k',ls='-',lw='2')

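Plot the kernel performance data. L1 uses circles, L2 squares and HBM downward-pointing triangles.
The loop runs over the length of AIHBM, which assumes that AIHBM is always present and that the other lists match it in length.
flag decides which part of the results is plotted.
Colors are taken in turn from the colors list.
LABELS supplies the legend labels.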

    for i in range(len(AIHBM)):
        if flag == 'L1':
            ax.plot(float(AIL1[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[0],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'L2':
            ax.plot(float(AIL2[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[1],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'HBM':
            ax.plot(float(AIHBM[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[2],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
        elif flag == 'all':
            ax.plot(float(AIL1[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[0],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
            ax.plot(float(AIL2[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[1],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")
            ax.plot(float(AIHBM[i]),float(FLOPS[i]),c=colors[i%10],marker=styles[2],\
                    linestyle='None',ms=markersize,markerfacecolor='none',\
                    markeredgewidth=markerwidth,label=LABELS[i] if LABELS else "unknown")

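matplotlib.axes.Axes.plot returns a list of matplotlib.lines.Line2D objects. The empty plots here draw nothing; they only create black marker handles for the memory-level legend.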

    marker_handles = []  

    if flag == 'L1':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[0],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[0][0])[0])
    elif flag == 'L2':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[1],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[1][0])[0])
    elif flag == 'HBM':
        marker_handles.append(ax.plot([],[],c='k',marker=styles[2],linestyle='None',ms=markersize,\
                markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[2][0])[0])
    elif flag == 'all':
        for i in range(len(memRoofs)):
            marker_handles.append(ax.plot([],[],c='k',marker=styles[i],linestyle='None',ms=markersize,\
                                  markerfacecolor='none',markeredgewidth=markerwidth,label=memRoofs[i][0])[0])

matplotlib.axes.Axes.text adds text to the axes, here the labels for the compute peaks and the memory bandwidths.

    for roof in cmpRoofs:
        ax.text(x[-ixx],roof[1]*1024,
              roof[0] + ': ' + '{0:.1f}'.format(roof[1]) + ' TFLOP/s',
              horizontalalignment='right',
              verticalalignment='bottom')

    for roof in memRoofs:
        ang = np.arctan(np.log10(xlim[1]/xlim[0]) / np.log10(ylim[1]/ylim[0])
                                   * fig.get_size_inches()[1]/fig.get_size_inches()[0] )
        if x[ixx]*roof[1] >ymin:
            ax.text(x[ixx],x[ixx]*roof[1]*(1+0.25*np.sin(ang)**2),
              roof[0] + ': ' + '{0:.1f}'.format(float(roof[1])) + ' GB/s',
              horizontalalignment='left',
              verticalalignment='bottom',
              rotation=180/np.pi*ang)
        else:
            ymin_ix_elbow=list()
            ymin_x_elbow=list()
            for ix in range(1,nx):
                if (ymin <= roof[1] * x[ix] and ymin > roof[1] * x[ix-1]):
                    ymin_x_elbow.append(x[ix-1])
                    ymin_ix_elbow.append(ix-1)
                    break
            ax.text(x[ixx+ymin_ix_elbow[0]],x[ixx+ymin_ix_elbow[0]]*roof[1]*(1+0.25*np.sin(ang)**2),
              roof[0] + ': ' + '{0:.1f}'.format(float(roof[1])) + ' GB/s',
              horizontalalignment='left',
              verticalalignment='bottom',
              rotation=180/np.pi*ang)
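
The rotation angle used for the bandwidth labels follows from the axis geometry. On a log-log plot a line y = B x rises one decade in y per decade in x, so its on-screen slope is the height of one y decade divided by the width of one x decade. The sketch below repeats that calculation with the standard library; `label_angle_deg` is a hypothetical name, and, like the script, it uses the figure size as an approximation of the axes size.

```python
import math


def label_angle_deg(xlim, ylim, figsize):
    """On-screen angle of a slope-1 line on log-log axes, in degrees."""
    width, height = figsize
    decades_x = math.log10(xlim[1] / xlim[0])
    decades_y = math.log10(ylim[1] / ylim[0])
    # (height / decades_y) / (width / decades_x), rearranged
    return math.degrees(math.atan(decades_x / decades_y * height / width))


# Limits and figure size taken from roofline.py
angle = label_angle_deg((10 ** -3, 10 ** 3), (1, 200000), (10.67, 6.6))
print(angle)

# Sanity check: equal decades on a square figure give a 45 degree line
square = label_angle_deg((1, 100), (1, 100), (5, 5))
print(square)
```

Since every bandwidth roof has the same slope on log-log axes, the angle is the same for L1, L2 and HBM; only the label position changes.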

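matplotlib.pyplot.legend places a legend for the memory levels, built from marker_handles, at the lower right.
matplotlib.axes.Axes.add_artist adds an Artist to the axes; it is needed here because the second legend call would otherwise replace the first legend.
matplotlib.patches.Patch is a 2D Artist with a face color and an edge color.
The loc=4 used for leg2 is hard to read; it is the numeric code for 'lower right'.
matplotlib.pyplot.savefig saves the current figure.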

        
    leg1 = plt.legend(handles = marker_handles,loc='lower right', ncol=len(flag[0]) if 'all' not in flag else 3,bbox_to_anchor = (1,0))
    ax.add_artist(leg1)

    patch_handles = list()
    for i in range(0,len(AIHBM)):
        if FLOPS[i] > 0:
            patch_handles.append(mpatches.Patch(color=colors[i%10],label = LABELS[i] if LABELS else "unknown"))

    leg2 = plt.legend(handles = patch_handles,loc=4,ncol=1,bbox_to_anchor = (1,0.1),scatterpoints = 1)

    ax.text(xlim[0]*1.1,ylim[1]/1.1, '-'.join([filename,flag]), horizontalalignment='left',verticalalignment='top')
#     plt.title('-'.join([filename,flag]))

    plt.savefig('_'.join([filename,flag])+'.png')
#     plt.savefig('_'.join([filename,flag])+'.eps')

#    plt.show()