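Contents
1. Introduction to Chroma+QUDA
Chroma is an open-source, general-purpose software package for lattice quantum chromodynamics (lattice QCD), developed by Jefferson Lab in the United States together with lattice QCD developers from many countries. It supports numerical simulations with a wide range of gluon actions and with fermion actions other than chiral fermions. It supports MPI and OpenMP, as well as CUDA acceleration (provided by QUDA).
QUDA is an open-source lattice QCD simulation library developed by NVIDIA together with lattice QCD developers from many countries. It is used mainly through Chroma or MILC, supports all lattice QCD actions except chiral fermions, and supports MPI and CUDA acceleration (OpenMP support is provided by Chroma). After our port to 419, it also supports HIP acceleration.
Subproject member Yi-Bo Yang (杨一玻) has contributed to the latest (develop) versions of both packages and led the writing of the Chroma add-on package used in this subproject.
Papers describing the basic algorithm implementations: arXiv:0911.3191, arXiv:1109.2935, arXiv:1612.07873, arXiv:1710.09745.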
2. Software version
1.0, rocm-devel branch.
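3. Build and installation
1) Chroma can be obtained from GitHub: git clone git://github.com/JeffersonLab/chroma.git, then check out the devel branch. Chroma depends on packages such as QMP, QIO, and QDP++; see the file Chroma_make.sh for installation instructions.
2) Unpack Chroma+QUDA.tgz (QUDA can also be obtained from GitHub: git clone https://github.com/lattice/quda.git, then check out the rocm-devel branch).
3) Enter the top-level directory of the software and edit the chroma path in the Makefile.
4) make cmake: set the basic cmake parameters.
5) make hack: modify the build rules for files that are incompatible with the HIP compiler.
6) make build: build the entire package.
7) make -j: build the Chroma add-on package.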
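Steps 3) to 7) can be sketched as a small shell wrapper. This is only a sketch under stated assumptions: the names run and build_steps, the DRY_RUN switch, and the directory argument are illustrative and not part of the package. It also assumes that the chroma path in the Makefile has already been edited by hand, and that step 7) is run from the same directory as steps 4) to 6), which the text does not state explicitly.

```shell
#!/bin/bash
# Hypothetical helper, not shipped with Chroma+QUDA.
# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 executes it.

run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# build_steps TOP_DIR
# TOP_DIR is the directory created by unpacking Chroma+QUDA.tgz
# (its real name is not given in the text, so it is passed in).
build_steps() {
    local top_dir=$1
    run cd "$top_dir" &&   # step 3 (Makefile assumed already edited)
    run make cmake &&      # step 4: set the basic cmake parameters
    run make hack &&       # step 5: adjust build rules for the HIP compiler
    run make build &&      # step 6: build the whole package
    run make -j            # step 7: build the Chroma add-on package
}

build_steps "${1:-.}"
```

With the default DRY_RUN=1 the script only lists the commands, which makes it easy to review the sequence before running it with DRY_RUN=0.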
4. Test case and Slurm script
Test case path: /public/home/ybyang/scratch/scaling_test
Slurm script: scaling.sh
Script contents:
#!/bin/bash
#SBATCH --job-name=quda_test
#SBATCH --partition=normal
#SBATCH --output=test.sh.out
#SBATCH --error=test.sh.err
#SBATCH --array=0-1
#SBATCH --nodes=1024
#SBATCH -n 4096
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --exclusive
export OMP_NUM_THREADS=7
cd $SLURM_SUBMIT_DIR
tag=scaling
file_list="list.$tag"
directory=.
Dlog=${directory}/log
prefix_log=log
mass=0.0
clover=1.05088
#########input
eo_level=3
n_src=2
it0=0
sleep $((1+${SLURM_ARRAY_TASK_ID}*1))
do_job(){
./milc_3pt.pl "./l96192f211b672m0008m022m260a.scidac.1002.hyp WEAK_FIELD 0 $1 $2 $3 $4 ${mass} ${clover}" \
"${n_src} ${eo_level} 0.0 0" \
"source_position ${it0} 10000" \
"6 4 1 2 2 1 UNPOL" \
"4 4 4 4" \
"-1 0 0 1 $7 1 ${directory}/data" >${directory}/ini.$5.xml 2>&1
output=${directory}/log.$5.$7.${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID};
echo "SLURM_NODEs: $SLURM_JOB_NODELIST" > ${output}
#HIP_DB=0 QUDA_ENABLE_P2P=0 \
HIP_DB=0 QUDA_ENABLE_P2P=0 QUDA_ENABLE_TUNING=0 \
mpirun -n $5 ./bind_gpu.sh_quda ./chroma_gpu -i ${directory}/ini.$5.xml -o ${directory}/out.$5.xml \
-geom $6 >>${output} 2>&1
}
do_job 128 128 256 256 4096 "8 8 8 8" 1
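The numbers in the script have to agree with each other: 1024 nodes with 4 tasks per node give the 4096 ranks requested with -n, and the -geom argument "8 8 8 8" also multiplies to 4096. A minimal sketch of such a consistency check follows. It assumes that the first four arguments of do_job are the global lattice extents, which the script suggests but does not state; the function name check_geometry is hypothetical.

```shell
#!/bin/bash
# check_geometry NX NY NZ NT NRANKS "GX GY GZ GT"
# Assumption: NX..NT are the global lattice extents passed to do_job as $1..$4,
# NRANKS is the MPI rank count ($5), and the last argument is the -geom grid ($6).
check_geometry() {
    local dims=("$1" "$2" "$3" "$4")
    local nranks=$5
    local geom=() local_dims=() prod=1 i
    read -r -a geom <<< "$6"
    if (( ${#geom[@]} != 4 )); then
        echo "error: geometry must have 4 entries" >&2
        return 1
    fi
    for i in 0 1 2 3; do
        if (( geom[i] <= 0 || dims[i] % geom[i] != 0 )); then
            echo "error: extent ${dims[i]} not divisible by ${geom[i]}" >&2
            return 1
        fi
        prod=$(( prod * geom[i] ))
        local_dims+=( $(( dims[i] / geom[i] )) )
    done
    if (( prod != nranks )); then
        echo "error: geometry product $prod != ranks $nranks" >&2
        return 1
    fi
    echo "local lattice per rank: ${local_dims[*]}"
}

# Values taken from the do_job call in the script.
check_geometry 128 128 256 256 4096 "8 8 8 8"
```

Calling this before mpirun would catch a mismatched rank count or a non-divisible extent before the allocation is spent on a failing start-up.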
Submit the test case:
sbatch scaling.sh
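Because the script runs as a job array (--array=0-1), each array task writes its own log.<ranks>.<index>.<jobid>_<taskid> file. The following sketch lists which of those logs contain the final "CHROMA: ran successfully" line that appears in the sample output. The function name report_runs is hypothetical, and treating that line as the success marker is an assumption based on the sample output only.

```shell
#!/bin/bash
# report_runs DIR
# Prints "OK <file>" or "INCOMPLETE <file>" for every log.* file in DIR.
# Assumption: a finished run ends with the line "CHROMA: ran successfully".
report_runs() {
    local dir=${1:-.} f status
    for f in "$dir"/log.*; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        if grep -q "CHROMA: ran successfully" "$f"; then
            status=OK
        else
            status=INCOMPLETE
        fi
        echo "$status $(basename "$f")"
    done
}

report_runs "${1:-.}"
```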
Sample output of the test case (an example of the output format only, not actual output):
Start time: 2020年 03月 02日 星期一 21:30:45 CST
SLURM_JOB_ID: 1539701-0
SLURM_NODEs: h02r3n[18-19],h02r4n[01-02,04-06,08-09,14-17], ... (remainder omitted)
QDP use OpenMP threading. We have 8 threads
Initializing QUDA
Affinity reporting not implemented for this architecture
Initialize done
Initializing QUDA
Initializing QUDA
(N lines in total, where N is the number of DCUs)
Disabling GPU-Direct RDMA access
Disabling peer-to-peer access
QUDA 0.9.1 (git rocm)
CUDA Driver version = 4
CUDA Runtime version = 19361
Found device 0: Device 66a1
Found device 1: Device 66a1
Found device 2: Device 66a1
Found device 3: Device 66a1
Using device 0: Device 66a1
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
WARNING: Autotuning disabled
Linkage = bool Chroma::MapObjectDiskEnv::registerAll(): registering map obj key colorvec
0
InlineMeasurements are:
<InlineMeasurements>
(tens of thousands of lines omitted here)
CHROMA measurements: time= 704.720384 secs
CHROMA: total time = 718.894683 secs
CHROMA: ran successfully
(the following is only an example of the QUDA timing statistics)
initQuda Total time = 0.049556 secs
init = 0.049535 secs ( 100%), with 2 calls at 2.476750e+04 us per call
total accounted = 0.049535 secs ( 100%)
total missing = 0.000021 secs (0.0424%)
loadGaugeQuda Total time = 5.82817 secs
download = 0.484106 secs ( 8.31%), with 4 calls at 1.210265e+05 us per call
init = 5.341287 secs ( 91.6%), with 4 calls at 1.335322e+06 us per call
compute = 0.002536 secs (0.0435%), with 4 calls at 6.340000e+02 us per call
free = 0.000087 secs (0.00149%), with 4 calls at 2.175000e+01 us per call
total accounted = 5.828016 secs ( 100%)
total missing = 0.000157 secs (0.00269%)
loadCloverQuda Total time = 0.037905 secs
download = 0.037063 secs ( 97.8%), with 2 calls at 1.853150e+04 us per call
init = 0.000648 secs ( 1.71%), with 4 calls at 1.620000e+02 us per call
free = 0.000013 secs (0.0343%), with 2 calls at 6.500000e+00 us per call
total accounted = 0.037724 secs ( 99.5%)
total missing = 0.000181 secs ( 0.478%)
invertQuda Total time = 552.425 secs
download = 0.060770 secs ( 0.011%), with 24 calls at 2.532083e+03 us per call
upload = 0.059620 secs (0.0108%), with 24 calls at 2.484167e+03 us per call
init = 311.603995 secs ( 56.4%), with 25 calls at 1.246416e+07 us per call
preamble = 9.814832 secs ( 1.78%), with 49 calls at 2.003027e+05 us per call
compute = 229.550398 secs ( 41.6%), with 24 calls at 9.564600e+06 us per call
epilogue = 1.068579 secs ( 0.193%), with 72 calls at 1.484137e+04 us per call
free = 0.251250 secs (0.0455%), with 73 calls at 3.441781e+03 us per call
total accounted = 552.409444 secs ( 100%)
total missing = 0.015977 secs (0.00289%)
endQuda Total time = 0.500053 secs
initQuda-endQuda Total time = 719.455 secs
QUDA Total time = 558.814 secs
download = 0.581942 secs ( 0.104%), with 30 calls at 1.939807e+04 us per call
upload = 0.059622 secs (0.0107%), with 24 calls at 2.484250e+03 us per call
init = 316.995466 secs ( 56.7%), with 35 calls at 9.057013e+06 us per call
preamble = 9.814777 secs ( 1.76%), with 49 calls at 2.003016e+05 us per call
compute = 229.552940 secs ( 41.1%), with 28 calls at 8.198319e+06 us per call
epilogue = 1.068588 secs ( 0.191%), with 72 calls at 1.484150e+04 us per call
free = 0.251351 secs ( 0.045%), with 79 calls at 3.181658e+03 us per call
total accounted = 558.324686 secs ( 99.9%)
total missing = 0.488892 secs (0.0875%)
Device memory used = 5126.5 MB
Pinned device memory used = 0.0 MB
Managed memory used = 0.0 MB
Page-locked host memory used = 74.0 MB
Total host memory used >= 103.7 MB
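For scaling comparisons it can help to reduce the QUDA statistics to one line per phase. The sketch below extracts every "<name> Total time = <seconds> secs" line and, when an "initQuda-endQuda" line is present, reports each entry as a fraction of it. The function name summarize_quda is hypothetical, and the parsing assumes the line layout shown in the sample above.

```shell
#!/bin/bash
# summarize_quda [FILE...]
# Reads log files (or stdin) and prints name, seconds, and the share of the
# initQuda-endQuda time for every "<name> Total time = <secs> secs" line.
summarize_quda() {
    awk '
        BEGIN { n = 0; wall = 0 }
        $2 == "Total" && $3 == "time" && $4 == "=" {
            name[n] = $1
            secs[n] = $5 + 0
            if ($1 == "initQuda-endQuda") wall = $5 + 0
            n++
        }
        END {
            for (i = 0; i < n; i++) {
                if (wall > 0)
                    printf "%-18s %10.3f s %6.2f%%\n", name[i], secs[i], 100 * secs[i] / wall
                else
                    printf "%-18s %10.3f s\n", name[i], secs[i]
            }
        }
    ' "$@"
}

# Demo on three lines copied from the sample statistics above.
summarize_quda <<'EOF'
invertQuda Total time = 552.425 secs
init = 311.603995 secs ( 56.4%), with 25 calls at 1.246416e+07 us per call
initQuda-endQuda Total time = 719.455 secs
EOF
```

Per-phase lines such as "init = ..." are skipped on purpose, so the summary stays at the level of the QUDA interface calls.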