HPC Applications & Physics Software: A Detailed Chroma+QUDA Installation and Usage Tutorial

技术瘾君子1573

Published 2024-08-30 00:00:00


1. Introduction to Chroma+QUDA

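Chroma is an open-source, general-purpose lattice quantum chromodynamics (lattice QCD) software package developed by Jefferson Lab (USA) together with lattice QCD developers from many countries. It supports numerical simulations with a wide range of gluon actions and with fermion actions other than chiral fermions. It supports MPI, OpenMP, and CUDA acceleration (the latter provided by QUDA).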

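QUDA is an open-source lattice QCD simulation library developed by NVIDIA together with lattice QCD developers from many countries. It is mainly used through Chroma or MILC. It supports lattice QCD actions other than chiral fermions, and supports MPI and CUDA acceleration (OpenMP support comes from Chroma). After we ported it to 419, it also supports HIP acceleration.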

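Subproject member 杨一玻 (Yibo Yang) has contributed to the latest (develop) versions of both packages, and led the writing of the Chroma add-on package used in this subproject.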

Papers describing the implementation of the basic algorithms: arXiv 0911.3191, 1109.2935, 1612.07873, 1710.09745.

2. Software version

1.0, rocm-devel branch.

3. Build and installation

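1) Chroma can be obtained from GitHub: git clone https://github.com/JeffersonLab/chroma.git, then check out the devel branch. (GitHub no longer serves the unauthenticated git:// protocol, so the https URL is used here.) Chroma depends on the QMP, QIO, and QDP++ packages; see the Chroma_make.sh file for how to install them.

2) Unpack Chroma+QUDA.tgz (QUDA can also be obtained from GitHub: git clone https://github.com/lattice/quda.git, then check out the rocm-devel branch).

3) Enter the top-level directory of the software and edit the chroma path in the Makefile.

4) make cmake: set the basic cmake parameters.

5) make hack: adjust the build rules for the files that the HIP compiler cannot handle as they are.

6) make build: build the whole package.

7) make -j: build the Chroma add-on package.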
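
The steps above can be collected into one script. What follows is only a minimal sketch: the helper names `run` and `build_steps`, the `DRY_RUN` switch, and the `CHROMA_QUDA_DIR` variable are illustrative and not part of the package, and the name of the directory produced by unpacking Chroma+QUDA.tgz is an assumption. By default the script only prints the commands, so the sequence can be checked before anything is run. Step 3 (editing the chroma path in the Makefile) stays manual, and the directory in which step 7 should be run is not stated above, so it is left in the same directory here.

```shell
#!/bin/bash
# Sketch of build steps 1)-7). With DRY_RUN=1 (the default) each command is
# only printed; set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"
# Assumed name of the directory unpacked from Chroma+QUDA.tgz (hypothetical).
CHROMA_QUDA_DIR="${CHROMA_QUDA_DIR:-Chroma+QUDA}"

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

build_steps() {
    # Step 1: fetch Chroma and switch to the devel branch
    # (https is used because GitHub no longer serves plain git://).
    run git clone https://github.com/JeffersonLab/chroma.git &&
    run git -C chroma checkout devel &&
    # Step 2: unpack the Chroma+QUDA bundle.
    run tar -xzf Chroma+QUDA.tgz &&
    # Step 3: enter the main directory; edit the chroma path in the
    # Makefile by hand before running for real.
    run cd "$CHROMA_QUDA_DIR" &&
    # Steps 4-6: configure, patch build rules for HIP, build.
    run make cmake &&
    run make hack &&
    run make build &&
    # Step 7: build the Chroma add-on package (directory is an assumption).
    run make -j
}

build_steps
```

Running the script unchanged prints one `+ ...` line per command. Chaining the steps with `&&` means a real run stops at the first command that fails.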

4. Test case and Slurm script

Test case path: /public/home/ybyang/scratch/scaling_test

Slurm script: scaling.sh

Script contents:

#!/bin/bash
#SBATCH --job-name=quda_test
#SBATCH --partition=normal
#SBATCH --output=test.sh.out
#SBATCH --error=test.sh.err
#SBATCH --array=0-1
#SBATCH --nodes=1024
#SBATCH -n 4096
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --exclusive

# OpenMP threads per MPI rank
export OMP_NUM_THREADS=7
# Run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

tag=scaling
file_list="list.$tag"

directory=.
Dlog=${directory}/log
prefix_log=log


mass=0.0
clover=1.05088

######### input parameters

eo_level=3
n_src=2
it0=0

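# Stagger the start of the array tasks by one second each
sleep $((1+${SLURM_ARRAY_TASK_ID}*1))

# do_job $1 $2 $3 $4 $5 $6 $7
#   $1-$4: passed through to milc_3pt.pl (128 128 256 256 in the call below)
#   $5:    number of MPI ranks; also used in the ini/out/log file names
#   $6:    process grid handed to chroma_gpu via -geom
#   $7:    passed through to milc_3pt.pl; also used in the log file name
do_job(){

# Generate the Chroma input XML for this run
./milc_3pt.pl "./l96192f211b672m0008m022m260a.scidac.1002.hyp WEAK_FIELD 0 $1 $2 $3 $4 ${mass} ${clover}"\
                 "${n_src} ${eo_level} 0.0 0" \
                 "source_position ${it0} 10000" \
                 "6 4 1 2 2 1 UNPOL" \
                 "4 4 4 4" \
                 "-1 0 0 1 $7 1 ${directory}/data"  >${directory}/ini.$5.xml 2>&1

output=${directory}/log.$5.$7.${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID};

echo "SLURM_NODEs: $SLURM_JOB_NODELIST" > ${output}

# Launch Chroma on $5 ranks; $6 is left unquoted on purpose so that it
# expands to the four separate -geom values
#HIP_DB=0 QUDA_ENABLE_P2P=0 \
HIP_DB=0 QUDA_ENABLE_P2P=0 QUDA_ENABLE_TUNING=0 \
        mpirun -n $5 ./bind_gpu.sh_quda ./chroma_gpu -i ${directory}/ini.$5.xml  -o ${directory}/out.$5.xml \
        -geom $6   >>${output} 2>&1
}

# 4096 ranks on an 8x8x8x8 process grid
do_job 128 128 256 256 4096 "8 8 8 8" 1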
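
The numbers in the script have to agree with each other: 1024 nodes with 4 tasks per node give the 4096 ranks requested by -n 4096, and the product of the -geom entries, 8 x 8 x 8 x 8, is also 4096. The sketch below checks this before a job is submitted. It assumes that the first four arguments of do_job are the lattice extents and that each extent must divide evenly by the matching -geom entry; neither is stated in the script itself. The function name `check_geom` is hypothetical.

```shell
#!/bin/bash
# check_geom "<lattice extents>" "<geom>" <ranks>
# Verifies that the process grid multiplies up to the rank count and that
# every lattice extent is divisible by its grid entry (assumed requirement),
# then prints the per-rank local lattice.
check_geom() {
    local -a lat=($1) geom=($2) loc=()
    local ranks=$3 prod=1 i

    if [ "${#lat[@]}" -ne "${#geom[@]}" ]; then
        echo "dimension count mismatch" >&2
        return 1
    fi
    for i in "${!geom[@]}"; do
        if (( lat[i] % geom[i] != 0 )); then
            echo "lattice extent ${lat[i]} is not divisible by ${geom[i]}" >&2
            return 1
        fi
        prod=$(( prod * geom[i] ))
        loc+=( $(( lat[i] / geom[i] )) )
    done
    if (( prod != ranks )); then
        echo "geom product $prod does not match $ranks ranks" >&2
        return 1
    fi
    echo "ranks=$prod local_lattice=${loc[*]}"
}

# Values taken from the do_job call in scaling.sh
check_geom "128 128 256 256" "8 8 8 8" 4096
```

For the values in scaling.sh this prints `ranks=4096 local_lattice=16 16 32 32`. A mismatch, for example a -geom of "8 8 8 4" with 4096 ranks, makes the function return a non-zero status.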

Submit the test case:

sbatch scaling.sh

Test case output (an example of the output format only, not actual output):

Start time: 2020年 03月 02日 星期一 21:30:45 CST
SLURM_JOB_ID: 1539701-0
SLURM_NODEs: h02r3n[18-19],h02r4n[01-02,04-06,08-09,14-17],... (remainder omitted)
QDP use OpenMP threading. We have 8 threads
Initializing QUDA
Affinity reporting not implemented for this architecture
Initialize done
Initializing QUDA
Initializing QUDA 
(N lines in total, N = number of DCUs)
Disabling GPU-Direct RDMA access
Disabling peer-to-peer access
QUDA 0.9.1 (git rocm)
CUDA Driver version = 4
CUDA Runtime version = 19361
Found device 0: Device 66a1
Found device 1: Device 66a1
Found device 2: Device 66a1
Found device 3: Device 66a1
Using device 0: Device 66a1
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
WARNING: Autotuning disabled
Linkage = bool Chroma::MapObjectDiskEnv::registerAll(): registering map obj key colorvec
0
InlineMeasurements are: 
<InlineMeasurements> 
(tens of thousands of lines omitted here)
CHROMA measurements: time= 704.720384 secs
CHROMA: total time = 718.894683 secs
CHROMA: ran successfully
(the following is only an example of the QUDA statistics)

               initQuda Total time = 0.049556 secs
                     init     = 0.049535 secs (   100%), with        2 calls at 2.476750e+04 us per call
        total accounted       = 0.049535 secs (   100%)
        total missing         = 0.000021 secs (0.0424%)

          loadGaugeQuda Total time = 5.82817 secs
                 download     = 0.484106 secs (  8.31%), with        4 calls at 1.210265e+05 us per call
                     init     = 5.341287 secs (  91.6%), with        4 calls at 1.335322e+06 us per call
                  compute     = 0.002536 secs (0.0435%), with        4 calls at 6.340000e+02 us per call
                     free     = 0.000087 secs (0.00149%), with        4 calls at 2.175000e+01 us per call
        total accounted       = 5.828016 secs (   100%)
        total missing         = 0.000157 secs (0.00269%)

         loadCloverQuda Total time = 0.037905 secs
                 download     = 0.037063 secs (  97.8%), with        2 calls at 1.853150e+04 us per call
                     init     = 0.000648 secs (  1.71%), with        4 calls at 1.620000e+02 us per call
                     free     = 0.000013 secs (0.0343%), with        2 calls at 6.500000e+00 us per call
        total accounted       = 0.037724 secs (  99.5%)
        total missing         = 0.000181 secs ( 0.478%)

             invertQuda Total time = 552.425 secs
                 download     = 0.060770 secs ( 0.011%), with       24 calls at 2.532083e+03 us per call
                   upload     = 0.059620 secs (0.0108%), with       24 calls at 2.484167e+03 us per call
                     init     = 311.603995 secs (  56.4%), with       25 calls at 1.246416e+07 us per call
                 preamble     = 9.814832 secs (  1.78%), with       49 calls at 2.003027e+05 us per call
                  compute     = 229.550398 secs (  41.6%), with       24 calls at 9.564600e+06 us per call
                 epilogue     = 1.068579 secs ( 0.193%), with       72 calls at 1.484137e+04 us per call
                     free     = 0.251250 secs (0.0455%), with       73 calls at 3.441781e+03 us per call
        total accounted       = 552.409444 secs (   100%)
        total missing         = 0.015977 secs (0.00289%)

                endQuda Total time = 0.500053 secs

       initQuda-endQuda Total time = 719.455 secs

                   QUDA Total time = 558.814 secs
                 download     = 0.581942 secs ( 0.104%), with       30 calls at 1.939807e+04 us per call
                   upload     = 0.059622 secs (0.0107%), with       24 calls at 2.484250e+03 us per call
                     init     = 316.995466 secs (  56.7%), with       35 calls at 9.057013e+06 us per call
                 preamble     = 9.814777 secs (  1.76%), with       49 calls at 2.003016e+05 us per call
                  compute     = 229.552940 secs (  41.1%), with       28 calls at 8.198319e+06 us per call
                 epilogue     = 1.068588 secs ( 0.191%), with       72 calls at 1.484150e+04 us per call
                     free     = 0.251351 secs ( 0.045%), with       79 calls at 3.181658e+03 us per call
        total accounted       = 558.324686 secs (  99.9%)
        total missing         = 0.488892 secs (0.0875%)

Device memory used = 5126.5 MB
Pinned device memory used = 0.0 MB
Managed memory used = 0.0 MB
Page-locked host memory used = 74.0 MB
Total host memory used >= 103.7 MB
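
With tens of thousands of lines per log, it helps to pull out just the headline timings. The sketch below assumes the log contains the "CHROMA: total time" and "invertQuda Total time" lines in the format shown above; the function name `summarize_log` is hypothetical, and the sample file is built from the example values in this section, not from a real run.

```shell
#!/bin/bash
# summarize_log FILE
# Prints the Chroma total time, the invertQuda total time, and the share of
# the total spent in invertQuda. Fails if either timing line is missing.
summarize_log() {
    awk '
        /^CHROMA: total time/   { total  = $(NF-1) }   # value before "secs"
        /invertQuda Total time/ { invert = $(NF-1) }
        END {
            if (total == "" || invert == "") {
                print "timing lines not found"
                exit 1
            }
            printf "total=%.1f s invertQuda=%.1f s share=%.1f%%\n",
                   total, invert, 100 * invert / total
        }
    ' "$1"
}

# Demo on a small sample built from the example output above
sample="$(mktemp)"
cat > "$sample" <<'EOF'
CHROMA measurements: time= 704.720384 secs
CHROMA: total time = 718.894683 secs
             invertQuda Total time = 552.425 secs
EOF
summarize_log "$sample"
rm -f "$sample"
```

On the sample this prints `total=718.9 s invertQuda=552.4 s share=76.8%`. For a real run, pass one of the log.* files written by scaling.sh instead of the sample.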
