Building and Testing Parallel LAMMPS with CUDA


Software environment: CentOS 5.8 Final
            CUDA 5 driver and Toolkit
            lammps-16Nov12
            fftw-2.1.5
            openmpi-1.4.5

Hardware environment: Intel Xeon E5-2640
           128GB DDR3 1600 ECC
           WD 1TB HDD
           Nvidia Tesla C2050
           Nvidia Kepler K10
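LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is used mainly for molecular dynamics calculations and simulations; broadly speaking, it covers the areas that molecular dynamics touches. LAMMPS is developed at Sandia National Laboratories in the United States and released under the GPL license, so the source code is open, free to obtain and use, and can be modified to suit your own needs. LAMMPS supports systems of millions of atoms or molecules in gas, liquid, or solid phases under a variety of ensembles, provides many potential functions, and scales well in parallel.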


Build
Running LAMMPS in parallel requires passwordless ssh access to the local machine, so configure ssh first:
ssh-keygen -t rsa
Press Enter at every prompt; this generates .ssh/id_rsa and .ssh/id_rsa.pub
cd ~/.ssh
cp id_rsa.pub authorized_keys
The local machine can now be reached over ssh without a password.
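One caveat with the `cp` step: it overwrites any keys already stored in authorized_keys, and with the default StrictModes setting sshd rejects keys when ~/.ssh or authorized_keys is writable by other users. As a minimal sketch, the helper below (install_pubkey is a hypothetical name, not part of OpenSSH) appends the key only if it is missing and then restricts permissions:

```shell
# Hypothetical helper: add id_rsa.pub from the given directory to its
# authorized_keys file without overwriting existing entries.
install_pubkey() {
    ssh_dir="$1"
    key=$(cat "$ssh_dir/id_rsa.pub") || return 1
    # Append only if this exact key line is not already present.
    grep -qxF -- "$key" "$ssh_dir/authorized_keys" 2>/dev/null ||
        printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
    # Restrict access to the owner, as sshd expects.
    chmod 700 "$ssh_dir"
    chmod 600 "$ssh_dir/authorized_keys"
}
```

It would be called as `install_pubkey ~/.ssh`. Once the host key for localhost has been accepted, `ssh -o BatchMode=yes localhost true` is a quick check: BatchMode makes ssh fail instead of asking for a password.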

Install FFTW2
tar zxvf fftw-2.1.5.tar.gz
cd fftw-2.1.5
./configure --prefix=/opt/fftw2 --enable-float --enable-shared
make
make install


Install and configure OpenMPI
tar -zxvf openmpi-1.4.5.tar.gz
cd openmpi-1.4.5
./configure --prefix=/opt/openmpi
make
make install

Set environment variables: open ~/.bashrc and add the two export lines below
gedit ~/.bashrc
export PATH=/opt/cuda5/bin:/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda5/lib64:/opt/openmpi/lib:/opt/fftw2/lib:$LD_LIBRARY_PATH

Finally, run source ~/.bashrc
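Two side effects of plain prepending are worth knowing: every additional `source ~/.bashrc` adds the same directories again, and when LD_LIBRARY_PATH starts out empty the result ends in a stray colon, which the dynamic loader treats as the current directory. A minimal sketch of a guard, using a hypothetical helper named prepend_path:

```shell
# Hypothetical helper: print the colon-separated list $1 with directory $2
# prepended, unless $2 is already an entry. An empty list yields just $2,
# with no trailing colon.
prepend_path() {
    case ":$1:" in
        *":$2:"*) echo "$1" ;;
        *)        echo "$2${1:+:$1}" ;;
    esac
}
```

In ~/.bashrc it could be used as `export LD_LIBRARY_PATH=$(prepend_path "$LD_LIBRARY_PATH" /opt/fftw2/lib)`, repeated for each directory.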

Check that OpenMPI was installed correctly

which mpicc
which mpiexec
which mpirun
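`which` only shows that some binary is on PATH; on a machine with a distribution-provided MPI it may not be the one just built. As a sketch, the hypothetical helper check_prefix below also verifies that a tool resolves under the expected installation prefix:

```shell
# Hypothetical helper: report where a tool resolves and whether that
# path lies under the expected installation prefix.
# Returns 0 if it does, 1 if the tool is not found, 2 if it is elsewhere.
check_prefix() {
    tool="$1"
    prefix="$2"
    path=$(command -v "$tool") || { echo "$tool: not found"; return 1; }
    case "$path" in
        "$prefix"/*) echo "$tool: $path" ;;
        *) echo "$tool: $path (not under $prefix)"; return 2 ;;
    esac
}
```

For the installation described here it would be called as `check_prefix mpicc /opt/openmpi`, and likewise for mpiexec and mpirun.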


Configure LAMMPS
http://lammps.sandia.gov/tars/lammps.tar.gz
tar xvf lammps.tar.gz

First, build the gpu package library

cd lammps/lib/gpu

Edit Makefile.linux

CUDA_HOME = /opt/cuda5
# Kepler CUDA
CUDA_ARCH = -arch=sm_30
(comment out the other CUDA_ARCH lines)
Finally, run make -f Makefile.linux
This builds nvc_get_devices, which you can run to print information about the GPUs
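The sm_30 setting matches the Kepler K10. The Tesla C2050 is a Fermi card with compute capability 2.0, so a build for it would presumably use sm_20 instead; the article does not state which value was used for the C2050 runs, so treat that as an assumption. A small sketch of the mapping for the two cards, using a hypothetical helper name:

```shell
# Hypothetical helper: map the two GPUs used in this article to the
# value passed to nvcc -arch.
# Tesla C2050 is Fermi (compute capability 2.0); Tesla K10 is Kepler (3.0).
cuda_arch_for() {
    case "$1" in
        *C2050*) echo "sm_20" ;;
        *K10*)   echo "sm_30" ;;
        *)       echo "unknown GPU: $1" >&2; return 1 ;;
    esac
}
```

The argument can be the device name exactly as nvc_get_devices or the LAMMPS log prints it, for example `cuda_arch_for "Tesla K10.G1.8GB"`.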

Edit Makefile.lammps

gpu_SYSINC = -I/opt/cuda5/include
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/opt/cuda5/lib64

Next, build the user package library; we need user-cuda
cd ../cuda

Edit Makefile.common
CUDA_INSTALL_PATH = /opt/cuda5

Then run make:
make CUDA_INSTALL_PATH=/opt/cuda5 cufft=2 precision=2 arch=30
This produces liblammpscuda.a


Then install the required packages (run these from lammps/src):
make yes-asphere
make yes-class2
make yes-colloid
make yes-dipole
make yes-granular
make yes-user-misc
make yes-user-cg-cmm

Install the GPU and USER-CUDA packages
make yes-gpu
make yes-user-cuda


Build LAMMPS

Use src/MAKE/Makefile.openmpi as a template (run the following in lammps/src/MAKE)
cp Makefile.openmpi Makefile.gpu
vi Makefile.gpu

MPI_INC =  -I/opt/openmpi/include    
MPI_PATH =
MPI_LIB = -L/opt/openmpi/lib -lmpi

FFT_INC =       -I/opt/fftw2/include -DFFT_FFTW
FFT_PATH =
FFT_LIB = -L/opt/fftw2/lib -lfftw


Then return to lammps/src

make gpu

The build produces the parallel executable lmp_gpu

Testing (CPU only, the GPU package, and the USER-CUDA package in turn)
cd lammps/bench/GPU

Nvidia Kepler K10
4194304 atoms
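The atom count follows from the -v x, -v y and -v z values passed on the command line. Judging from the logs (a box edge of 107.494 is 64 lattice spacings of 1.6796), the input builds an fcc lattice of x by y by z unit cells with 4 atoms per cell; that reading is inferred from the output, not taken from the input file. A quick check with a hypothetical helper:

```shell
# Assumption inferred from the logs: an fcc lattice of x*y*z unit cells
# with 4 atoms per cell.
atom_count() {
    echo $(( 4 * $1 * $2 * $3 ))
}

atom_count 64 128 128   # K10 runs
atom_count 64 64 128    # C2050 runs
```

The two calls print 4194304 and 2097152, matching the "Created ... atoms" lines in the logs.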
CPU
time mpirun -np 12 ../../src/lmp_gpu  -c off  -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
  1 by 3 by 4 MPI processor grid
Created 4194304 atoms
Setting up run ...
Memory usage per processor = 115.99 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133686   -5.0196696
    1000   0.70371346   -5.6760464            0   -4.6204765   0.70456724
Loop time of 445.893 on 12 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 344.521 (77.2653)
Neigh time (%) = 37.3499 (8.37643)
Comm  time (%) = 34.5695 (7.75287)
Outpt time (%) = 0.00629385 (0.00141152)
Other time (%) = 29.4467 (6.60397)

Nlocal:    349525 ave 349810 max 349270 min
Histogram: 3 0 0 3 0 1 2 0 2 1
Nghost:    88501 ave 88753 max 88106 min
Histogram: 1 0 0 1 0 1 6 2 0 1
Neighs:    1.31018e+07 ave 1.313e+07 max 1.30777e+07 min
Histogram: 1 0 4 1 0 2 2 1 0 1

Total # of neighbors = 157221517
Ave neighs/atom = 37.4845
Neighbor list builds = 50
Dangerous builds = 0

real    7m28.357s
user    88m58.623s
sys     0m6.306s 
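For comparing runs, the "Loop time" line is the most useful figure, since the wall-clock time reported by `time` also includes startup and setup. A minimal sketch for pulling it out of saved output, with loop_time as a hypothetical helper name:

```shell
# Print the loop time in seconds from LAMMPS output read on stdin.
# Relies on the "Loop time of <seconds> on <N> procs ..." line format
# seen in the logs of this article.
loop_time() {
    awk '/^Loop time of/ { print $4 }'
}
```

Usage would be `loop_time < log.lammps`, since LAMMPS writes its screen output to log.lammps in the working directory by default.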

GPU
time mpirun -np 2 ../../src/lmp_gpu -sf gpu -c off   -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
  1 by 1 by 2 MPI processor grid
Created 4194304 atoms

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla K10.G1.8GB, 1536 cores, 3.4/3.5 GB, 0.74 GHZ (Single Precision)
GPU 1: Tesla K10.G1.8GB, 1536 cores, 3.4/0.74 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.

Setting up run ...
Memory usage per processor = 336.665 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733679            0   -4.6133684     -5.01967
    1000   0.70407139   -5.6765788            0    -4.620472   0.70226909
Loop time of 163.778 on 2 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 102.784 (62.7581)
Neigh time (%) = 5.78165e-05 (3.53018e-05)
Comm  time (%) = 10.0776 (6.15322)
Outpt time (%) = 0.0124401 (0.0075957)
Other time (%) = 50.9039 (31.081)

Nlocal:    2.09715e+06 ave 2.09736e+06 max 2.09695e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    285880 ave 286182 max 285579 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0


---------------------------------------------------------------------
      GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer:   11.9795 s.
Data Cast/Pack:  29.8788 s.
Neighbor copy:   0.0003 s.
Neighbor build:  27.8765 s.
Force calc:      33.9176 s.
GPU Overhead:    0.0555 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  2850.45 MB.
CPU Driver_Time: 0.0564 s.
CPU Idle_Time:   45.4944 s.
---------------------------------------------------------------------


real    3m9.960s
user    5m33.879s
sys     0m21.650s

CUDA
time mpirun -np 2 ../../src/lmp_gpu -sf cuda  -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla K10.G1.8GB
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
# Using device 1: Tesla K10.G1.8GB
  1 by 1 by 2 MPI processor grid
Created 4194304 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 2100000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA

# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
 7.604725 1.637228
# CUDA: Total Device Memory useage post setup: 1363.265625 MB
Memory usage per processor = 329.441 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133686   -5.0196696
    1000    0.7037135   -5.6760465            0   -4.6204766   0.70456647
Loop time of 171.094 on 2 procs for 1000 steps with 4194304 atoms

Pair  time (%) = 119.582 (69.8926)
Neigh time (%) = 34.4807 (20.153)
Comm  time (%) = 12.0482 (7.04183)
Outpt time (%) = 0.00174761 (0.00102143)
Other time (%) = 4.98143 (2.91151)

Nlocal:    2.09715e+06 ave 2.09761e+06 max 2.09669e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    285910 ave 286389 max 285431 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:  1.57222e+08 ave 1.57269e+08 max 1.57174e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 314443080
Ave neighs/atom = 74.9691
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...

real    3m31.330s
user    6m17.069s
sys     0m21.483s

 


Nvidia Tesla C2050
2097152 atoms
CPU
time mpirun -np 12 ../../src/lmp_g++ -c off  -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
  2 by 2 by 3 MPI processor grid
Created 2097152 atoms
Setting up run ...
Memory usage per processor = 59.9782 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133691   -5.0196698
    1000   0.70398846   -5.6764793            0   -4.6204971    0.7035921
Loop time of 255.275 on 12 procs for 1000 steps with 2097152 atoms

Pair  time (%) = 189.553 (74.2546)
Neigh time (%) = 19.7922 (7.75329)
Comm  time (%) = 31.5617 (12.3638)
Outpt time (%) = 0.00327303 (0.00128216)
Other time (%) = 14.3645 (5.62708)

Nlocal:    174763 ave 175050 max 174540 min
Histogram: 1 2 0 2 3 1 2 0 0 1
Nghost:    55156.6 ave 55337 max 55013 min
Histogram: 2 0 3 1 2 0 1 1 1 1
Neighs:    6.55081e+06 ave 6.56937e+06 max 6.53648e+06 min
Histogram: 2 0 0 2 4 2 1 0 0 1

Total # of neighbors = 78609680
Ave neighs/atom = 37.484
Neighbor list builds = 50
Dangerous builds = 0

real    4m16.362s
user    0m0.067s
sys     0m0.018s

GPU
time mpirun -np 2 ../../src/lmp_g++ -sf gpu -c off  -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
  1 by 1 by 2 MPI processor grid
Created 2097152 atoms

--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla C2050, 448 cores, 2.6/2.6 GB, 1.1 GHZ (Single Precision)
GPU 1: Tesla C2050, 448 cores, 2.6/1.1 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.

Setting up run ...
Memory usage per processor = 173.566 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733679            0   -4.6133689   -5.0196703
    1000   0.70365628   -5.6759221            0   -4.6204382   0.70516901
Loop time of 82.1602 on 2 procs for 1000 steps with 2097152 atoms

Pair  time (%) = 49.8815 (60.7125)
Neigh time (%) = 6.53267e-05 (7.95114e-05)
Comm  time (%) = 5.40412 (6.57754)
Outpt time (%) = 0.00573647 (0.00698206)
Other time (%) = 26.8688 (32.7029)

Nlocal:    1.04858e+06 ave 1.04859e+06 max 1.04856e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    173222 ave 173223 max 173220 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0


---------------------------------------------------------------------
      GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer:   5.8268 s.
Data Cast/Pack:  15.0256 s.
Neighbor copy:   0.0002 s.
Neighbor build:  13.8191 s.
Force calc:      15.7533 s.
GPU Overhead:    0.0495 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  1426.04 MB.
CPU Driver_Time: 0.0497 s.
CPU Idle_Time:   21.3674 s.
---------------------------------------------------------------------


real    1m29.050s
user    0m0.065s
sys     0m0.028s

CUDA
time mpirun -np 2 ../../src/lmp_g++ -sf cuda  -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla C2050
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
# Using device 1: Tesla C2050
  1 by 1 by 2 MPI processor grid
Created 2097152 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 1050000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA

# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
 2.088803 0.418611
# CUDA: Total Device Memory useage post setup: 726.984375 MB
Memory usage per processor = 169.36 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733681            0   -4.6133691   -5.0196698
    1000   0.70398844   -5.6764793            0   -4.6204971   0.70359222
Loop time of 49.6546 on 2 procs for 1000 steps with 2097152 atoms

Pair  time (%) = 31.7106 (63.8622)
Neigh time (%) = 9.56514 (19.2634)
Comm  time (%) = 5.88421 (11.8503)
Outpt time (%) = 0.00104213 (0.00209875)
Other time (%) = 2.49368 (5.02204)

Nlocal:    1.04858e+06 ave 1.04861e+06 max 1.04854e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost:    173368 ave 173410 max 173325 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs:    0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs:  7.86097e+07 ave 7.86114e+07 max 7.8608e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 1

Total # of neighbors = 157219330
Ave neighs/atom = 74.968
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...

real    0m59.271s
user    0m0.071s
sys     0m0.023s
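The loop times in the logs above can be turned into speedup ratios relative to the 12-process CPU runs. The sketch below only does that arithmetic; speedup is a hypothetical helper. Two caveats from the logs themselves: the GPU package runs report single precision while the USER-CUDA runs report 8-byte (double) precision, and the GPU runs use 2 MPI processes against 12 for the CPU runs, so the ratios are not a like-for-like comparison.

```shell
# Ratio of a baseline loop time to another loop time, two decimals.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Loop times in seconds, copied from the logs above.
echo "K10   GPU package vs CPU: $(speedup 445.893 163.778)x"
echo "K10   USER-CUDA   vs CPU: $(speedup 445.893 171.094)x"
echo "C2050 GPU package vs CPU: $(speedup 255.275 82.1602)x"
echo "C2050 USER-CUDA   vs CPU: $(speedup 255.275 49.6546)x"
```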

 


 
 

 