Software environment: CentOS 5.8 (Final)
CUDA 5 driver and Toolkit
LAMMPS 16Nov12
fftw-2.1.5
openmpi-1.4.5
Hardware environment: Intel Xeon E5-2640
128 GB DDR3-1600 ECC
WD 1 TB HDD
Nvidia Tesla C2050
Nvidia Tesla K10 (Kepler)
LAMMPS, the Large-scale Atomic/Molecular Massively Parallel Simulator, is a code for molecular dynamics calculations and simulations; broadly speaking, wherever molecular dynamics is applied, LAMMPS covers it. It is developed at Sandia National Laboratories in the US and released under the GPL, i.e. it is open source and free to obtain and use, which means users can modify the source to suit their own needs. LAMMPS handles systems of millions of atoms or molecules in gas, liquid, or solid phases and under a variety of ensembles, supports many potential functions, and has good parallel scalability.
Compilation
Running LAMMPS in parallel requires passwordless ssh access to the local machine, so configure ssh first:
ssh-keygen -t rsa
Press Enter through all the prompts to generate .ssh/id_rsa and .ssh/id_rsa.pub, then:
cd ~/.ssh
cp id_rsa.pub authorized_keys
The local machine can now be reached without a password.
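To sanity-check the key layout without touching a real account, the same steps can be rehearsed in a throwaway directory first (a sketch; it assumes ssh-keygen is on the PATH and uses an empty passphrase, and the mktemp path is illustrative):

```shell
# Generate a key pair and an authorized_keys file in a temp dir,
# mirroring what the steps above create in ~/.ssh.
tmpssh=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$tmpssh/id_rsa"
cp "$tmpssh/id_rsa.pub" "$tmpssh/authorized_keys"
chmod 700 "$tmpssh"
chmod 600 "$tmpssh/authorized_keys"
ls "$tmpssh"   # authorized_keys  id_rsa  id_rsa.pub
```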
Install FFTW2
tar zxvf fftw-2.1.5.tar.gz
cd fftw-2.1.5
./configure --prefix=/opt/fftw2 --enable-float --enable-shared
make
make install
Install and configure OpenMPI
tar -zxvf openmpi-1.4.5.tar.gz
cd openmpi-1.4.5
./configure --prefix=/opt/openmpi
make
make install
Set the environment variables:
gedit ~/.bashrc
export PATH=/opt/cuda5/bin:/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda5/lib64:/opt/openmpi/lib:/opt/fftw2/lib:$LD_LIBRARY_PATH
Finally, run source ~/.bashrc.
Check that OpenMPI was installed successfully:
which mpicc
which mpiexec
which mpirun
Configuring LAMMPS
Download the source from http://lammps.sandia.gov/tars/lammps.tar.gz, then unpack it:
tar xvf lammps.tar.gz
First, build the gpu package:
cd lammps/lib/gpu
Edit Makefile.linux:
CUDA_HOME = /opt/cuda5
# Kepler CUDA
CUDA_ARCH = -arch=sm_30
(comment out the other CUDA_ARCH lines)
Finally run make -f Makefile.linux
This produces nvc_get_devices; run it to print information about the GPUs.
Edit Makefile.lammps:
gpu_SYSINC = -I/opt/cuda5/include
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/opt/cuda5/lib64
Next, build the USER-CUDA add-on package:
cd ../cuda
Edit Makefile.common:
CUDA_INSTALL_PATH = /opt/cuda5
Then run make:
make CUDA_INSTALL_PATH=/opt/cuda5 cufft=2 precision=2 arch=30
This produces liblammpscuda.a.
Then, from lammps/src, install the packages you need:
make yes-asphere
make yes-class2
make yes-colloid
make yes-dipole
make yes-granular
make yes-user-misc
make yes-user-cg-cmm
Install the GPU and USER-CUDA packages:
make yes-gpu
make yes-user-cuda
Compiling LAMMPS
Use src/MAKE/Makefile.openmpi as a template:
cp Makefile.openmpi Makefile.gpu
vi Makefile.gpu
MPI_INC = -I/opt/openmpi/include
MPI_PATH =
MPI_LIB = -L/opt/openmpi/lib -lmpi
FFT_INC = -I/opt/fftw2/include -DFFT_FFTW
FFT_PATH =
FFT_LIB = -L/opt/fftw2/lib -lfftw
Then go back to lammps/src:
make gpu
This produces the parallel executable lmp_gpu.
Testing (with the plain CPU run, the GPU package, and the USER-CUDA package, respectively)
cd lammps/bench/GPU
Nvidia Tesla K10 (Kepler)
4194304 atoms
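The atom count follows from the -v x/y/z variables: the bench input builds an fcc lattice, which has 4 atoms per unit cell, over an x by y by z block of cells. A one-line check:

```shell
# 4 atoms per fcc unit cell * 64 * 128 * 128 cells
awk 'BEGIN { print 4 * 64 * 128 * 128 }'   # prints 4194304
```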
CPU
time mpirun -np 12 ../../src/lmp_gpu -c off -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
1 by 3 by 4 MPI processor grid
Created 4194304 atoms
Setting up run ...
Memory usage per processor = 115.99 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133686 -5.0196696
1000 0.70371346 -5.6760464 0 -4.6204765 0.70456724
Loop time of 445.893 on 12 procs for 1000 steps with 4194304 atoms
Pair time (%) = 344.521 (77.2653)
Neigh time (%) = 37.3499 (8.37643)
Comm time (%) = 34.5695 (7.75287)
Outpt time (%) = 0.00629385 (0.00141152)
Other time (%) = 29.4467 (6.60397)
Nlocal: 349525 ave 349810 max 349270 min
Histogram: 3 0 0 3 0 1 2 0 2 1
Nghost: 88501 ave 88753 max 88106 min
Histogram: 1 0 0 1 0 1 6 2 0 1
Neighs: 1.31018e+07 ave 1.313e+07 max 1.30777e+07 min
Histogram: 1 0 4 1 0 2 2 1 0 1
Total # of neighbors = 157221517
Ave neighs/atom = 37.4845
Neighbor list builds = 50
Dangerous builds = 0
real 7m28.357s
user 88m58.623s
sys 0m6.306s
GPU
time mpirun -np 2 ../../src/lmp_gpu -sf gpu -c off -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
1 by 1 by 2 MPI processor grid
Created 4194304 atoms
--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
- with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla K10.G1.8GB, 1536 cores, 3.4/3.5 GB, 0.74 GHZ (Single Precision)
GPU 1: Tesla K10.G1.8GB, 1536 cores, 3.4/0.74 GHZ (Single Precision)
--------------------------------------------------------------------------
Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Setting up run ...
Memory usage per processor = 336.665 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733679 0 -4.6133684 -5.01967
1000 0.70407139 -5.6765788 0 -4.620472 0.70226909
Loop time of 163.778 on 2 procs for 1000 steps with 4194304 atoms
Pair time (%) = 102.784 (62.7581)
Neigh time (%) = 5.78165e-05 (3.53018e-05)
Comm time (%) = 10.0776 (6.15322)
Outpt time (%) = 0.0124401 (0.0075957)
Other time (%) = 50.9039 (31.081)
Nlocal: 2.09715e+06 ave 2.09736e+06 max 2.09695e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 285880 ave 286182 max 285579 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0
---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer: 11.9795 s.
Data Cast/Pack: 29.8788 s.
Neighbor copy: 0.0003 s.
Neighbor build: 27.8765 s.
Force calc: 33.9176 s.
GPU Overhead: 0.0555 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 2850.45 MB.
CPU Driver_Time: 0.0564 s.
CPU Idle_Time: 45.4944 s.
---------------------------------------------------------------------
real 3m9.960s
user 5m33.879s
sys 0m21.650s
CUDA
time mpirun -np 2 ../../src/lmp_gpu -sf cuda -v g 2 -v x 64 -v y 128 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla K10.G1.8GB
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 214.988 214.988)
# Using device 1: Tesla K10.G1.8GB
1 by 1 by 2 MPI processor grid
Created 4194304 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 2100000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
7.604725 1.637228
# CUDA: Total Device Memory useage post setup: 1363.265625 MB
Memory usage per processor = 329.441 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133686 -5.0196696
1000 0.7037135 -5.6760465 0 -4.6204766 0.70456647
Loop time of 171.094 on 2 procs for 1000 steps with 4194304 atoms
Pair time (%) = 119.582 (69.8926)
Neigh time (%) = 34.4807 (20.153)
Comm time (%) = 12.0482 (7.04183)
Outpt time (%) = 0.00174761 (0.00102143)
Other time (%) = 4.98143 (2.91151)
Nlocal: 2.09715e+06 ave 2.09761e+06 max 2.09669e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 285910 ave 286389 max 285431 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs: 1.57222e+08 ave 1.57269e+08 max 1.57174e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 314443080
Ave neighs/atom = 74.9691
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...
real 3m31.330s
user 6m17.069s
sys 0m21.483s
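Comparing the three K10 runs: a small awk sketch of the speedups, with the loop times copied from the output above (12 CPU cores vs. 2 MPI ranks driving 2 GPUs):

```shell
# Loop times from the runs above: CPU 445.893 s, GPU package 163.778 s,
# USER-CUDA 171.094 s, each for 1000 steps on 4194304 atoms.
awk 'BEGIN {
    cpu = 445.893; gpu = 163.778; cuda = 171.094
    printf "GPU package speedup over CPU: %.2fx\n", cpu / gpu
    printf "USER-CUDA speedup over CPU:   %.2fx\n", cpu / cuda
}'
```

On this benchmark the GPU package comes out slightly ahead of USER-CUDA (about 2.7x vs. 2.6x over the 12-core CPU run).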
Nvidia Tesla C2050
2097152 atoms
CPU
time mpirun -np 12 ../../src/lmp_g++ -c off -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
2 by 2 by 3 MPI processor grid
Created 2097152 atoms
Setting up run ...
Memory usage per processor = 59.9782 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133691 -5.0196698
1000 0.70398846 -5.6764793 0 -4.6204971 0.7035921
Loop time of 255.275 on 12 procs for 1000 steps with 2097152 atoms
Pair time (%) = 189.553 (74.2546)
Neigh time (%) = 19.7922 (7.75329)
Comm time (%) = 31.5617 (12.3638)
Outpt time (%) = 0.00327303 (0.00128216)
Other time (%) = 14.3645 (5.62708)
Nlocal: 174763 ave 175050 max 174540 min
Histogram: 1 2 0 2 3 1 2 0 0 1
Nghost: 55156.6 ave 55337 max 55013 min
Histogram: 2 0 3 1 2 0 1 1 1 1
Neighs: 6.55081e+06 ave 6.56937e+06 max 6.53648e+06 min
Histogram: 2 0 0 2 4 2 1 0 0 1
Total # of neighbors = 78609680
Ave neighs/atom = 37.484
Neighbor list builds = 50
Dangerous builds = 0
real 4m16.362s
user 0m0.067s
sys 0m0.018s
GPU
time mpirun -np 2 ../../src/lmp_g++ -sf gpu -c off -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.gpu
LAMMPS (16 Nov 2012)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
1 by 1 by 2 MPI processor grid
Created 2097152 atoms
--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/cut:
- with 1 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: Tesla C2050, 448 cores, 2.6/2.6 GB, 1.1 GHZ (Single Precision)
GPU 1: Tesla C2050, 448 cores, 2.6/1.1 GHZ (Single Precision)
--------------------------------------------------------------------------
Initializing GPU and compiling on process 0...Done.
Initializing GPUs 0-1 on core 0...Done.
Setting up run ...
Memory usage per processor = 173.566 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733679 0 -4.6133689 -5.0196703
1000 0.70365628 -5.6759221 0 -4.6204382 0.70516901
Loop time of 82.1602 on 2 procs for 1000 steps with 2097152 atoms
Pair time (%) = 49.8815 (60.7125)
Neigh time (%) = 6.53267e-05 (7.95114e-05)
Comm time (%) = 5.40412 (6.57754)
Outpt time (%) = 0.00573647 (0.00698206)
Other time (%) = 26.8688 (32.7029)
Nlocal: 1.04858e+06 ave 1.04859e+06 max 1.04856e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 173222 ave 173223 max 173220 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 50
Dangerous builds = 0
---------------------------------------------------------------------
GPU Time Info (average):
---------------------------------------------------------------------
Data Transfer: 5.8268 s.
Data Cast/Pack: 15.0256 s.
Neighbor copy: 0.0002 s.
Neighbor build: 13.8191 s.
Force calc: 15.7533 s.
GPU Overhead: 0.0495 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 1426.04 MB.
CPU Driver_Time: 0.0497 s.
CPU Idle_Time: 21.3674 s.
---------------------------------------------------------------------
real 1m29.050s
user 0m0.065s
sys 0m0.028s
CUDA
time mpirun -np 2 ../../src/lmp_g++ -sf cuda -v g 2 -v x 64 -v y 64 -v z 128 -v t 1000 < in.lj.cuda
LAMMPS (16 Nov 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
# Using device 0: Tesla C2050
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (107.494 107.494 214.988)
# Using device 1: Tesla C2050
1 by 1 by 2 MPI processor grid
Created 2097152 atoms
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 1050000 atoms...
# CUDA: Using precision: Global: 8 X: 8 V: 8 F: 8 PPPM: 8
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
# CUDA: Timing of parallelisation layout with 10 loops:
# CUDA: BpA TpA
2.088803 0.418611
# CUDA: Total Device Memory useage post setup: 726.984375 MB
Memory usage per processor = 169.36 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133691 -5.0196698
1000 0.70398844 -5.6764793 0 -4.6204971 0.70359222
Loop time of 49.6546 on 2 procs for 1000 steps with 2097152 atoms
Pair time (%) = 31.7106 (63.8622)
Neigh time (%) = 9.56514 (19.2634)
Comm time (%) = 5.88421 (11.8503)
Outpt time (%) = 0.00104213 (0.00209875)
Other time (%) = 2.49368 (5.02204)
Nlocal: 1.04858e+06 ave 1.04861e+06 max 1.04854e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Nghost: 173368 ave 173410 max 173325 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 0 ave 0 max 0 min
Histogram: 2 0 0 0 0 0 0 0 0 0
FullNghs: 7.86097e+07 ave 7.86114e+07 max 7.8608e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 157219330
Ave neighs/atom = 74.968
Neighbor list builds = 50
Dangerous builds = 0
# CUDA: Free memory...
real 0m59.271s
user 0m0.071s
sys 0m0.023s
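The same speedup comparison for the C2050 runs, with the loop times copied from the output above:

```shell
# Loop times from the runs above: CPU 255.275 s, GPU package 82.1602 s,
# USER-CUDA 49.6546 s, each for 1000 steps on 2097152 atoms.
awk 'BEGIN {
    cpu = 255.275; gpu = 82.1602; cuda = 49.6546
    printf "GPU package speedup over CPU: %.2fx\n", cpu / gpu
    printf "USER-CUDA speedup over CPU:   %.2fx\n", cpu / cuda
}'
```

Here USER-CUDA is clearly faster (about 5.1x vs. 3.1x over the 12-core CPU run), the opposite ordering from the K10 result.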