Running the executable
Taking the build of BT-MZ as an example:
cd config
Edit make.def to select the compiler and compile options you want to use:
#---------------------------------------------------------------------------
#
# SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS.
#
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# Items in this file will need to be changed for each platform.
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# Parallel Fortran:
#
# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must
# be defined:
#
# FC - Fortran compiler
# FFLAGS - Fortran compilation arguments
# F_INC - any -I arguments required for compiling Fortran
# FLINK - Fortran linker
# FLINKFLAGS - Fortran linker arguments
# F_LIB - any -L and -l arguments required for linking Fortran
#
# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
# $(FC) $(FFLAGS)
# linking is done with $(FLINK) $(F_LIB) $(FLINKFLAGS)
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# This is the fortran compiler used for fortran programs
#---------------------------------------------------------------------------
FC = mpiifort
# This links fortran programs; usually the same as ${FC}
FLINK = $(FC)
#---------------------------------------------------------------------------
# These macros are passed to the linker
#---------------------------------------------------------------------------
F_LIB =
#---------------------------------------------------------------------------
# These macros are passed to the compiler
#---------------------------------------------------------------------------
F_INC =
#---------------------------------------------------------------------------
# Global *compile time* flags for Fortran programs
#---------------------------------------------------------------------------
FFLAGS = -O3 -fopenmp
#---------------------------------------------------------------------------
# Global *link time* flags. Flags for increasing maximum executable
# size usually go here.
#---------------------------------------------------------------------------
FLINKFLAGS = $(FFLAGS)
#---------------------------------------------------------------------------
# Parallel C:
#
# For IS, which is in C, the following must be defined:
#
# CC - C compiler
# CFLAGS - C compilation arguments
# C_INC - any -I arguments required for compiling C
# CLINK - C linker
# CLINKFLAGS - C linker flags
# C_LIB - any -L and -l arguments required for linking C
#
# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
# $(CC) $(CFLAGS)
# linking is done with $(CLINK) $(C_LIB) $(CLINKFLAGS)
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# This is the C compiler used for C programs
#---------------------------------------------------------------------------
CC = mpicc
# This links C programs; usually the same as ${CC}
CLINK = $(CC)
#---------------------------------------------------------------------------
# These macros are passed to the linker
#---------------------------------------------------------------------------
C_LIB =
#---------------------------------------------------------------------------
# These macros are passed to the compiler
#---------------------------------------------------------------------------
C_INC =
#---------------------------------------------------------------------------
# Global *compile time* flags for C programs
#---------------------------------------------------------------------------
CFLAGS = -O3 -fopenmp
#---------------------------------------------------------------------------
# Global *link time* flags. Flags for increasing maximum executable
# size usually go here.
#---------------------------------------------------------------------------
CLINKFLAGS = $(CFLAGS)
#---------------------------------------------------------------------------
# MPI dummy library:
#
# Uncomment if you want to use the MPI dummy library supplied by NAS instead
# of the true message-passing library. The include file redefines several of
# the above macros. It also invokes make in subdirectory MPI_dummy. Make
# sure that no spaces or tabs precede include.
#---------------------------------------------------------------------------
#include ../config/make.dummy
#---------------------------------------------------------------------------
# Utilities C:
#
# This is the C compiler used to compile C utilities. Flags required by
# this compiler go here also; typically there are few flags required; hence
# there are no separate macros provided for such flags.
#---------------------------------------------------------------------------
UCC = gcc
#---------------------------------------------------------------------------
# Destination of executables, relative to subdirs of the main directory. .
#---------------------------------------------------------------------------
BINDIR = ../bin
#---------------------------------------------------------------------------
# The variable RAND controls which random number generator
# is used. It is described in detail in README.install.
# Use "randi8" unless there is a reason to use another one.
# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
#---------------------------------------------------------------------------
RAND = randi8
# The following is highly reliable but may be slow:
# RAND = randdp
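NPB consists of 8 programs: 5 kernel benchmarks and 3 pseudo-application benchmarks, which reflect different aspects of computational fluid dynamics workloads. Each benchmark comes in 7 problem classes: S, W, A, B, C, D, and E. S (Sample) is a small sample class, W (Workstation) is typically used on workstations, and among the standard classes A through E, A is the smallest, with the problem size growing from each class to the next. Each NPB program has its own characteristics: IS is written in C, while the other 7 are written in Fortran and are floating-point intensive.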
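export OMP_NUM_THREADS=1    # change the number of OpenMP threads here
make BT-MZ CLASS=A NPROCS=1
(NPROCS sets the number of processes. Adjust the thread count and the process count through these two commands as needed. Run make from the top-level NPB directory, one level above config.)
cd bin
This enters the bin directory, where the executable is placed.
Run the executable: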
./bt-mz.A.x
This produces the program's output.
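The build and run steps above can be sketched as a small helper for sweeping over classes and thread counts. This is only an illustrative sketch: the function names `executable_name` and `run_config` are hypothetical, and the `mpirun -np N` launcher used for more than one process is an assumption that does not appear in the commands above.

```python
import os


def executable_name(benchmark: str, cls: str) -> str:
    """Return the binary name in the <benchmark>.<CLASS>.x form, e.g. bt-mz.A.x."""
    return f"{benchmark.lower()}.{cls.upper()}.x"


def run_config(benchmark, cls, threads, procs=1, base_env=None):
    """Return (env, argv) for one run; nothing is executed here."""
    env = dict(os.environ if base_env is None else base_env)
    # Same effect as `export OMP_NUM_THREADS=<threads>` in the shell.
    env["OMP_NUM_THREADS"] = str(threads)
    target = "./" + executable_name(benchmark, cls)
    # Assumption: multi-process runs are started through `mpirun -np N`.
    argv = [target] if procs == 1 else ["mpirun", "-np", str(procs), target]
    return env, argv


if __name__ == "__main__":
    env, argv = run_config("BT-MZ", "A", threads=1)
    print(env["OMP_NUM_THREADS"], argv)
```

The returned pair can be handed to `subprocess.run(argv, env=env, cwd="bin")` to launch the run from the bin directory.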
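Sometimes the following error may appear:
Solution:
See the description on Intel's official website.
I selected Sockets here; other types can be selected as well.
After this change the program runs successfully.
The run output is: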
NAS Parallel Benchmarks (NPB3.4-MZ MPI+OpenMP) - BT-MZ Benchmark
Number of zones: 4 x 4
Total mesh size: 128 x 128 x 16
Iterations: 200 dt: 0.000800
Number of active processes: 1
Use the default load factors
Total number of threads: 1 ( 1.0 threads/process)
Calculated speedup = 1.00
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class A
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.5536703889522E+05 0.5536703889522E+05 0.5887309627164E-13
2 0.5077835038405E+04 0.5077835038405E+04 0.6143497777078E-13
3 0.1067391361067E+05 0.1067391361067E+05 0.4495533898417E-12
4 0.6441179694972E+04 0.6441179694972E+04 0.5337360756113E-13
5 0.4371926324069E+05 0.4371926324069E+05 0.9286488489897E-13
Comparison of RMS-norms of solution error
1 0.6716797714343E+04 0.6716797714343E+04 0.6634893930438E-13
2 0.6512687902160E+03 0.6512687902160E+03 0.4887738371663E-13
3 0.1332930740128E+04 0.1332930740128E+04 0.2519492637685E-12
4 0.7848302089180E+03 0.7848302089180E+03 0.4925081168657E-13
5 0.5429053878818E+04 0.5429053878818E+04 0.8811741857989E-13
Verification Successful
BT-MZ Benchmark Completed.
Class = A
Size = 128x 128x 16
Iterations = 200 (number of iterations)
Time in seconds = 58.96 (run time)
Total processes = 1 (number of processes)
Total threads = 1 (number of threads)
Mop/s total = 2479.82 (floating-point performance)
Mop/s/thread = 2479.82
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.4.1
Compile date = 26 Aug 2020
Compile options:
FC = mpiifort
FLINK = $(FC)
F_LIB = (none)
F_INC = (none)
FFLAGS = -O3 -fopenmp
FLINKFLAGS = $(FFLAGS)
RAND = (none)
Please send all errors/feedbacks to:
NPB Development Team
npb@nas.nasa.gov
Then repeat the test several times.
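When the test is repeated, the summary lines of each run can be collected and averaged. The following is a minimal sketch, assuming each run's standard output has been captured as a string; the function names `parse_summary` and `summarize` are hypothetical, and only the `Time in seconds` and `Mop/s total` fields shown in the output above are read.

```python
import re
import statistics

# Matches summary lines such as " Time in seconds = 58.96".
_FIELD = re.compile(
    r"^\s*(Time in seconds|Mop/s total)\s*=\s*([0-9.]+)", re.MULTILINE
)


def parse_summary(output: str) -> dict:
    """Extract the run time and total Mop/s from one run's output."""
    fields = {name: float(value) for name, value in _FIELD.findall(output)}
    return {"seconds": fields["Time in seconds"], "mops": fields["Mop/s total"]}


def summarize(outputs) -> dict:
    """Aggregate several captured outputs into mean and best-case figures."""
    runs = [parse_summary(text) for text in outputs]
    times = [run["seconds"] for run in runs]
    return {
        "runs": len(runs),
        "mean_seconds": statistics.mean(times),
        "min_seconds": min(times),
        "mean_mops": statistics.mean(run["mops"] for run in runs),
    }


if __name__ == "__main__":
    # Fragment copied from the run output shown above.
    sample = " Time in seconds = 58.96\n Mop/s total = 2479.82\n"
    print(summarize([sample]))
```

Reporting the minimum alongside the mean is a common choice for benchmarks, since the fastest run is the one least disturbed by other activity on the machine.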
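Performance testing with VTune:
amplxe-cl -collect <analysis-type> -target-duration-type medium -result-dir XXX/path_to_result <executable>
Explanation of the parameters:
-target-duration-type: the approximate run time of the executable being profiled. There are three levels, short/medium/long, corresponding to roughly 0-15 min, 15 min-3 h, and 3 h-24 h; see the official VTune documentation for details. The default is short. Note: if you keep the default but the executable runs for more than 15 min, things go wrong. VTune automatically collects one round of data at 15 min, and when you later run the report command on the raw data, VTune reports that the raw data cannot be found.
-result-dir: the path where the raw data is stored. Give a directory, not a file.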
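Command to process the raw data and generate a report:
amplxe-cl -report summary -result-dir XXX/path_to_result > XXX/path_to_report
Explanation of the parameters:
(1) -report: the type of report to generate; summary produces an overall summary report. For other values, see the official documentation.
(2) -result-dir: same as above.
(3) Redirection with >: the report is written to standard output by default. To save the result, redirect it to a file. Note that this must be a file, not a directory.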
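The collect and report steps can be scripted so that the duration level is chosen from the expected run time. This is a sketch under stated assumptions: the helper names are hypothetical, the level names `short`, `medium`, and `long` are assumed to be the values accepted by `-target-duration-type`, the thresholds follow the ranges given above, and `hotspots` is used only as an example analysis type.

```python
import subprocess


def duration_type(expected_minutes: float) -> str:
    """Map an expected run time to a -target-duration-type level."""
    if expected_minutes <= 15:
        return "short"
    if expected_minutes <= 180:
        return "medium"
    return "long"


def collect_cmd(analysis, result_dir, target, expected_minutes):
    """Build the collection command as an argument list."""
    return [
        "amplxe-cl", "-collect", analysis,
        "-target-duration-type", duration_type(expected_minutes),
        "-result-dir", result_dir,
        target,
    ]


def report_cmd(result_dir, report="summary"):
    """Build the report command; its output goes to standard output."""
    return ["amplxe-cl", "-report", report, "-result-dir", result_dir]


def write_report(result_dir, report_file, report="summary"):
    """Run the report command and save its output, like `> report_file`."""
    with open(report_file, "w") as handle:
        subprocess.run(report_cmd(result_dir, report), stdout=handle, check=True)


if __name__ == "__main__":
    print(" ".join(collect_cmd("hotspots", "res", "./bt-mz.A.x", 30)))
```

Building argument lists rather than shell strings avoids quoting problems when result paths contain spaces.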
Some commonly used VTune command-line commands:
For example:
amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.CORE,CPU_CLK_UNHALTED.REF,INST_RETIRED.ANY home/test/sample
Architecture-related events:
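Event | Meaning |
---|---|
INST_RETIRED.ANY | Number of instructions retired |
CPU_CLK_UNHALTED.THREAD | Number of cycles spent in the non-halted state |
CYCLE_ACTIVITY.STALLS_L1D_PENDING | Number of cycles in which execution is stalled by an L1 data cache miss |
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | Number of all cycles in which execution is stalled |
CYCLE_ACTIVITY.STALLS_L2_PENDING | Number of cycles in which execution is stalled by an L2 cache miss |
CYCLE_ACTIVITY.STALLS_LDM_PENDING | Number of all cycles in which execution is stalled for memory-related reasons |
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 | Number of memory loads whose latency exceeds 4 clock cycles |
MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS | Number of L3 cache misses during memory loads |
MEM_UOPS_RETIRED.ALL_LOADS_PS | Number of memory load micro-ops |
MEM_UOPS_RETIRED.ALL_STORES_PS | Number of memory store micro-ops |
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS | Number of micro-ops that hit the L3 cache |
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM | Number of load micro-ops that missed the L3 cache and got their data from local memory |
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM | Number of load micro-ops that missed the L3 cache and got their data from remote memory |
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM | Number of load micro-ops that missed the L3 cache and got their data from a remote cache |
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_FWD | Number of load micro-ops that missed the L3 cache and got their data forwarded from a remote cache |
Total_Latency_MEM_UOPS_RETIRED.ALL_LOADS_PS | Total latency of load micro-ops |
Total_Latency_MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 | Total latency of load micro-ops whose latency exceeds 4 clock cycles |
Total_Latency_MEM_UOPS_RETIRED.ALL_STORES_PS | Total latency of store micro-ops |
Total_Latency_MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS | Total latency of load micro-ops that hit the L3 cache |
Total_Latency_MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS | Total latency of load micro-ops that missed the L3 cache |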
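Raw event counts are usually easier to interpret as ratios. The sketch below derives a few common ones from the events in the table; the function name `derived_metrics` is hypothetical, the formulas are the usual textbook definitions rather than anything VTune-specific, and the counts in the demo are made-up numbers for illustration only.

```python
def derived_metrics(counts: dict) -> dict:
    """Compute ratio metrics from a mapping of event name to event count."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    instructions = counts["INST_RETIRED.ANY"]
    metrics = {
        "cpi": cycles / instructions,  # cycles per instruction
        "ipc": instructions / cycles,  # instructions per cycle
    }
    stalled = counts.get("CYCLE_ACTIVITY.CYCLES_NO_EXECUTE")
    if stalled is not None:
        # Share of unhalted cycles in which execution was stalled.
        metrics["stall_fraction"] = stalled / cycles
    hits = counts.get("MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS")
    misses = counts.get("MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS")
    if hits is not None and misses is not None and hits + misses > 0:
        # Share of L3 lookups that missed.
        metrics["llc_miss_ratio"] = misses / (hits + misses)
    return metrics


if __name__ == "__main__":
    # Hypothetical counts, not measured data.
    demo = {"CPU_CLK_UNHALTED.THREAD": 2000, "INST_RETIRED.ANY": 1000}
    print(derived_metrics(demo))
```

Optional events are looked up with `get`, so the same function works on machines where only some of the events are available.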
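Hands-on: using VTune to measure architectural events:
To set up the compiler and MPI environment, source these two scripts (adjust the paths to where the scripts are installed on your system):
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /opt/intel/impi/2019.1.144/intel64/bin/mpivars.sh
To use the VTune command line, source the amplxe-vars.sh script:
source /opt/intel/vtune_amplifier_2019/amplxe-vars.sh
Different CPU architectures support different performance events, so it is advisable to find out which events are available on your machine before testing.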
amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.REF_TSC,CPU_CLK_UNHALTED.THREAD,INST_RETIRED.ANY -result-dir resHWC ./bt-mz.B.x
amplxe-cl -report hw-events -r resHWC -format=csv -group-by module > testb.txt
Storing the generated data in a txt file makes it easier to review.
amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD,INST_RETIRED.ANY,MEM_INST_RETIRED.ALL_STORES,MEM_INST_RETIRED.ALL_LOADS,CYCLE_ACTIVITY.STALLS_L3_MISS -result-dir resHWC2 ./bt-mz.B.x
For BT-MZ at problem classes A, B, and C, the following performance events were measured for analysis:
CPU_CLK_UNHALTED.THREAD,
INST_RETIRED.ANY,
MEM_INST_RETIRED.ALL_STORES,
MEM_INST_RETIRED.ALL_LOADS,
CYCLE_ACTIVITY.STALLS_L3_MISS
CLASS=B:
CLASS=A:
CLASS=C:
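After tidying up:
Hardware Event Count:Self is the performance data, i.e. the number of times the performance event occurred.
Hardware Event Sample Count:Self is the number of samples collected during the test.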
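The per-module report saved to the txt file can be totalled per event with a short script. This is a rough sketch under several assumptions that should be checked against the actual file: the column headers are assumed to start with `Hardware Event Count:` followed by the event name and a `Self` marker, the field delimiter is assumed to be a tab (pass another one if your report differs), and the sample text in the demo is invented for illustration. The function names are hypothetical.

```python
import csv
import io

COUNT_PREFIX = "Hardware Event Count:"


def event_name(header: str):
    """Return the event name from a count column header, or None otherwise."""
    if not header.startswith(COUNT_PREFIX):
        return None  # e.g. Module or Hardware Event Sample Count columns
    parts = header[len(COUNT_PREFIX):].split(":")
    kept = [part.strip() for part in parts if part.strip().lower() != "self"]
    return ":".join(kept) or None


def totals_by_event(report_text: str, delimiter: str = "\t") -> dict:
    """Sum the event counts over all rows (modules) of the report."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    totals = {}
    for row in reader:
        for header, cell in row.items():
            name = event_name(header) if header else None
            if name and cell:
                totals[name] = totals.get(name, 0) + int(float(cell))
    return totals


if __name__ == "__main__":
    # Invented sample in the assumed layout, not real measurement data.
    sample = (
        "Module\tHardware Event Count:INST_RETIRED.ANY:Self\n"
        "bt-mz.B.x\t1000\n"
    )
    print(totals_by_event(sample))
```

Sample-count columns are skipped on purpose, since they record how many samples were taken rather than how often the event occurred.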