Software environment: CentOS 5.8 (Final)
CUDA 5 driver and Toolkit
LAMMPS 16Nov12
fftw-2.1.5
openmpi-1.4.5
Hardware environment: Intel Xeon E5-2640
128 GB DDR3-1600 ECC
WD 1 TB HDD
Nvidia Tesla C2050
Nvidia Tesla K10 (Kepler)
LAMMPS, the Large-scale Atomic/Molecular Massively Parallel Simulator, is a code for molecular dynamics calculations and simulations; broadly speaking, wherever molecular dynamics is applied, LAMMPS covers it. It is developed at Sandia National Laboratories in the US and released under the GPL, i.e. it is open source and free to obtain and use, which means users can modify the source to suit their own needs. LAMMPS handles systems of millions of atoms or molecules in gas, liquid, or solid phases and under a variety of ensembles, supports many potential functions, and has good parallel scalability.
Compilation
Running LAMMPS in parallel requires passwordless ssh access to the local machine, so configure ssh first:
ssh-keygen -t rsa
Press Enter through all the prompts to generate .ssh/id_rsa and .ssh/id_rsa.pub, then:
cd ~/.ssh
cp id_rsa.pub authorized_keys
The local machine can now be reached without a password.
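To sanity-check the key layout without touching a real account, the same steps can be rehearsed in a throwaway directory first (a sketch; it assumes ssh-keygen is on the PATH and uses an empty passphrase, and the mktemp path is illustrative):

```shell
# Generate a key pair and an authorized_keys file in a temp dir,
# mirroring what the steps above create in ~/.ssh.
tmpssh=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$tmpssh/id_rsa"
cp "$tmpssh/id_rsa.pub" "$tmpssh/authorized_keys"
chmod 700 "$tmpssh"
chmod 600 "$tmpssh/authorized_keys"
ls "$tmpssh"   # authorized_keys  id_rsa  id_rsa.pub
```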
Install FFTW2
tar zxvf fftw-2.1.5.tar.gz
cd fftw-2.1.5
./configure --prefix=/opt/fftw2 --enable-float --enable-shared
make
make install
Install and configure OpenMPI
tar -zxvf openmpi-1.4.5.tar.gz
cd openmpi-1.4.5
./configure --prefix=/opt/openmpi
make
make install
Set the environment variables:
gedit ~/.bashrc
export PATH=/opt/cuda5/bin:/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda5/lib64:/opt/openmpi/lib:/opt/fftw2/lib:$LD_LIBRARY_PATH
Finally, run source ~/.bashrc.
Check that OpenMPI was installed successfully:
which mpicc
which mpiexec
which mpirun
Configuring LAMMPS
Download the source from http://lammps.sandia.gov/tars/lammps.tar.gz, then unpack it:
tar xvf lammps.tar.gz
First, build the gpu package:
cd lammps/lib/gpu
Edit Makefile.linux:
CUDA_HOME = /opt/cuda5
# Kepler CUDA
CUDA_ARCH = -arch=sm_30
(comment out the other CUDA_ARCH lines)
Finally run make -f Makefile.linux
This produces nvc_get_devices; run it to print information about the GPUs.
Edit Makefile.lammps:
gpu_SYSINC = -I/opt/cuda5/include
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/opt/cuda5/lib64
Next, build the USER-CUDA add-on package:
cd ../cuda
Edit Makefile.common:
CUDA_INSTALL_PATH = /opt/cuda5
Then run make:
make CUDA_INSTALL_PATH=/opt/cuda5 cufft=2 precision=2 arch=30
This produces liblammpscuda.a.
Then, from lammps/src, install the packages you need:
make yes-asphere
make yes-class2
make yes-colloid
make yes-dipole
make yes-granular
make yes-user-misc
make yes-user-cg-cmm
Install the GPU and USER-CUDA packages:
make yes-gpu
make yes-user-cuda
Compiling LAMMPS
Use src/MAKE/Makefile.openmpi as a template:
cp Makefile.openmpi Makefile.gpu
vi Makefile.gpu
MPI_INC = -I/opt/openmpi/include
MPI_PATH =
MPI_LIB = -L/opt/openmpi/lib -lmpi
FFT_INC = -I/opt/fftw2/include -DFFT_FFTW
FFT_PATH =
FFT_LIB = -L/opt/fftw2/lib -lfftw
Then go back to lammps/src:
make gpu
This produces the parallel executable lmp_gpu.
Testing (with the plain CPU run, the GPU package, and the USER-CUDA package, respectively)
cd lammps/bench/GPU
Nvidia Tesla K10 (Kepler)
4194304 atoms
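The atom count follows from the -v x/y/z variables: the bench input builds an fcc lattice, which has 4 atoms per unit cell, over an x by y by z block of cells. A one-line check:

```shell
# 4 atoms per fcc unit cell * 64 * 128 * 128 cells
awk 'BEGIN { print 4 * 64 * 128 * 128 }'   # prints 4194304
```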
CPU
time mpirun -np 12 ../../src/lmp_gpu -c off -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
1 by 3 by 4 MPI processor grid
Created 4194304 atoms
Setting up run ...
Memory usage per processor = 115.99 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133686 -5.0196696
1000 0.70371346 -5.6760464 0 -4.6204765 0.70456724
Loop time of 445.893 on 12 procs for 1000 steps with 4194304 atoms
Pair time (%) = 344.521 (77.2653)
Neigh time (%) = 37.3499 (8.37643)
Comm time (%) = 34.5695 (7.75287)
Outpt time (%) = 0.00629385 (0.00141152)
Other time (%) = 29.4467 (6.60397)
Nlocal: 349525 ave 349810 max 349270 min
Histogram: 3 0 0 3 0 1 2 0 2 1
Nghost: 88501 ave 88753 max 88106 min
Histogram: 1 0 0 1 0 1 6 2 0 1
Neighs: 1.31018e+07 ave 1.313e+07 max 1.30777e+07 min
Histogram: 1 0 4 1 0 2 2 1 0 1
Total # of neighbors = 157221517
Ave neighs/atom = 37.4845
Neighbor list builds = 50
Dangerous builds = 0
real 7m28.357s
user 88m58.623s
sys 0m6.306s
GPU
time mpirun -np 2 ../../src/lmp_gpu -sf gpu -c off -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
1 by 1 by 2 MPI processor grid
Created 4194304 atoms
--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
- with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla K10.G1.8GB, 1536 cores, 3.4/3.5 GB, 0.74 GHZ (Single Precision)
GPU 1: Tesla K10.G1.8GB, 1536 cores, 3.4/0.74 GHZ (Single Precision)
--------------------------------------------------------------------------
Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Setting up run ...
Memory usage per processor = 336.665 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733679 0 -4.6133684 -5.01967
1000 0.70407139 -5.6765788 0 -4.620472 0.70226909
Loop time of 163.778 on 2 procs for 1000 steps with 4194304 atoms
Pair time (%) = 102.784 (62.7581)
Neigh time (%) = 5.78165e-05 (3.53018e-05)
Comm time (%) = 10.0776 (6.15322)
Outpt time (%) = 0.0124401 (0.0075957)
Other time (%) = 50.9039 (31.081)
Nlocal: 2.09715e+06 ave 2.09736e+06 max 2.09695e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 285880 ave 286182 max 285579 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0
---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer: 11.9795 s.
Data Cast/Pack: 29.8788 s.
Neighbor copy: 0.0003 s.
Neighbor build: 27.8765 s.
Force calc: 33.9176 s.
GPU Overhead: 0.0555 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 2850.45 MB.
CPU Driver_Time: 0.0564 s.
CPU Idle_Time: 45.4944 s.
---------------------------------------------------------------------
real 3m9.960s
user 5m33.879s
sys 0m21.650s
CUDA
time mpirun -np 2 ../../src/lmp_gpu -sf cuda -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla K10.G1.8GB
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
# Using device 1: Tesla K10.G1.8GB
1 by 1 by 2 MPI processor grid
Created 4194304 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 2100000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
7.604725 1.637228
# CUDA: Total Device Memory useage post setup: 1363.265625 MB
Memory usage per processor = 329.441 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133686 -5.0196696
1000 0.7037135 -5.6760465 0 -4.6204766 0.70456647
Loop time of 171.094 on 2 procs for 1000 steps with 4194304 atoms
Pair time (%) = 119.582 (69.8926)
Neigh time (%) = 34.4807 (20.153)
Comm time (%) = 12.0482 (7.04183)
Outpt time (%) = 0.00174761 (0.00102143)
Other time (%) = 4.98143 (2.91151)
Nlocal: 2.09715e+06 ave 2.09761e+06 max 2.09669e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 285910 ave 286389 max 285431 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs: 1.57222e+08 ave 1.57269e+08 max 1.57174e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 314443080
Ave neighs/atom = 74.9691
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...
real 3m31.330s
user 6m17.069s
sys 0m21.483s
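Comparing the three K10 runs: a small awk sketch of the speedups, with the loop times copied from the output above (12 CPU cores vs. 2 MPI ranks driving 2 GPUs):

```shell
# Loop times from the runs above: CPU 445.893 s, GPU package 163.778 s,
# USER-CUDA 171.094 s, each for 1000 steps on 4194304 atoms.
awk 'BEGIN {
    cpu = 445.893; gpu = 163.778; cuda = 171.094
    printf "GPU package speedup over CPU: %.2fx\n", cpu / gpu
    printf "USER-CUDA speedup over CPU:   %.2fx\n", cpu / cuda
}'
```

On this benchmark the GPU package comes out slightly ahead of USER-CUDA (about 2.7x vs. 2.6x over the 12-core CPU run).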
Nvidia Tesla C2050
2097152 atoms
CPU
time mpirun -np 12 ../../src/lmp_g++ -c off -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
2 by 2 by 3 MPI processor grid
Created 2097152 atoms
Setting up run ...
Memory usage per processor = 59.9782 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133691 -5.0196698
1000 0.70398846 -5.6764793 0 -4.6204971 0.7035921
Loop time of 255.275 on 12 procs for 1000 steps with 2097152 atoms
Pair time (%) = 189.553 (74.2546)
Neigh time (%) = 19.7922 (7.75329)
Comm time (%) = 31.5617 (12.3638)
Outpt time (%) = 0.00327303 (0.00128216)
Other time (%) = 14.3645 (5.62708)
Nlocal: 174763 ave 175050 max 174540 min
Histogram: 1 2 0 2 3 1 2 0 0 1
Nghost: 55156.6 ave 55337 max 55013 min
Histogram: 2 0 3 1 2 0 1 1 1 1
Neighs: 6.55081e+06 ave 6.56937e+06 max 6.53648e+06 min
Histogram: 2 0 0 2 4 2 1 0 0 1
Total # of neighbors = 78609680
Ave neighs/atom = 37.484
Neighbor list builds = 50
Dangerous builds = 0
real 4m16.362s
user 0m0.067s
sys 0m0.018s
GPU
time mpirun -np 2 ../../src/lmp_g++ -sf gpu -c off -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
1 by 1 by 2 MPI processor grid
Created 2097152 atoms
--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
- with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla C2050, 448 cores, 2.6/2.6 GB, 1.1 GHZ (Single Precision)
GPU 1: Tesla C2050, 448 cores, 2.6/1.1 GHZ (Single Precision)
--------------------------------------------------------------------------
Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Setting up run ...
Memory usage per processor = 173.566 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733679 0 -4.6133689 -5.0196703
1000 0.70365628 -5.6759221 0 -4.6204382 0.70516901
Loop time of 82.1602 on 2 procs for 1000 steps with 2097152 atoms
Pair time (%) = 49.8815 (60.7125)
Neigh time (%) = 6.53267e-05 (7.95114e-05)
Comm time (%) = 5.40412 (6.57754)
Outpt time (%) = 0.00573647 (0.00698206)
Other time (%) = 26.8688 (32.7029)
Nlocal: 1.04858e+06 ave 1.04859e+06 max 1.04856e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 173222 ave 173223 max 173220 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0
---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer: 5.8268 s.
Data Cast/Pack: 15.0256 s.
Neighbor copy: 0.0002 s.
Neighbor build: 13.8191 s.
Force calc: 15.7533 s.
GPU Overhead: 0.0495 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 1426.04 MB.
CPU Driver_Time: 0.0497 s.
CPU Idle_Time: 21.3674 s.
---------------------------------------------------------------------
real 1m29.050s
user 0m0.065s
sys 0m0.028s
CUDA
time mpirun -np 2 ../../src/lmp_g++ -sf cuda -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla C2050
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
# Using device 1: Tesla C2050
1 by 1 by 2 MPI processor grid
Created 2097152 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 1050000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
2.088803 0.418611
# CUDA: Total Device Memory useage post setup: 726.984375 MB
Memory usage per processor = 169.36 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133691 -5.0196698
1000 0.70398844 -5.6764793 0 -4.6204971 0.70359222
Loop time of 49.6546 on 2 procs for 1000 steps with 2097152 atoms
Pair time (%) = 31.7106 (63.8622)
Neigh time (%) = 9.56514 (19.2634)
Comm time (%) = 5.88421 (11.8503)
Outpt time (%) = 0.00104213 (0.00209875)
Other time (%) = 2.49368 (5.02204)
Nlocal: 1.04858e+06 ave 1.04861e+06 max 1.04854e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 173368 ave 173410 max 173325 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs: 7.86097e+07 ave 7.86114e+07 max 7.8608e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 157219330
Ave neighs/atom = 74.968
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...
real 0m59.271s
user 0m0.071s
sys 0m0.023s
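The same speedup comparison for the C2050 runs, with the loop times copied from the output above:

```shell
# Loop times from the runs above: CPU 255.275 s, GPU package 82.1602 s,
# USER-CUDA 49.6546 s, each for 1000 steps on 2097152 atoms.
awk 'BEGIN {
    cpu = 255.275; gpu = 82.1602; cuda = 49.6546
    printf "GPU package speedup over CPU: %.2fx\n", cpu / gpu
    printf "USER-CUDA speedup over CPU:   %.2fx\n", cpu / cuda
}'
```

Here USER-CUDA is clearly faster (about 5.1x vs. 3.1x over the 12-core CPU run), the opposite ordering from the K10 result.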