1. Input file
The input file comes from the official NWChem site. The link is:
https://raw.githubusercontent.com/wiki/nwchemgit/nwchem/pentacene_ccsdt.nw
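The test molecule is pentacene. The calculation is CCSD(T); its bottleneck is the perturbative (T) part, which is well suited to acceleration on heterogeneous devices such as GPUs or MIC cards.
However, the basis set specified in this input file is very large and demands a lot of memory, so a smaller basis set can be substituted. I tried 6-31g and sto-3g. The sto-3g workload is too small: as the number of CPU cores or GPU cards grows, the run time barely drops, so all tests used 6-31g.
The input file format has one notable feature:
- Several inputs can be written into a single file; they are executed in order.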
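As a minimal sketch of that feature, the script below writes a hypothetical input file, `multi_task.nw`, in which two SCF energy tasks share one geometry and differ only in basis set; NWChem would run them one after the other. The water geometry, the `water_demo` prefix and the file name are illustrative assumptions, not taken from the pentacene input.

```shell
#!/bin/bash
# Write a small NWChem input containing two tasks.
# Hypothetical example: the water molecule and file name are placeholders.
cat > multi_task.nw <<'EOF'
start water_demo
title "two tasks in one input file"

geometry units angstrom
  O  0.000  0.000  0.000
  H  0.757  0.586  0.000
  H -0.757  0.586  0.000
end

# first task: minimal basis
basis
  * library sto-3g
end
task scf energy

# second task: runs after the first one finishes, with a larger basis
basis
  * library 6-31g
end
task scf energy
EOF

# Count the task directives that were written
grep -c '^task' multi_task.nw
```

Redefining the basis block between the two `task` lines is what makes the second task differ from the first; everything else is reused.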
2. Running
In the ordinary (serial) case the command is
nwchem test.nw
.nw is the default extension. When running under MPI, use
mpirun -np 2 nwchem test.nw
The launch script, run.sh:
#!/bin/bash
export OMP_NUM_THREADS=1
nohup mpirun -np $1 nwchem simple_cuda.nw |tee $2 &
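- nohup: keeps the job running after the terminal is closed.
- &: runs the job in the background. With nohup alone and no &, the program output keeps printing in the current terminal.
- $1 is the script's first argument and $2 its second. For example, with ./run.sh 12 89, $1 is 12 and $2 is 89.
- tee: writes whatever arrives through the pipe to both the screen and a file.
- OMP_NUM_THREADS: sets the number of OpenMP threads. This was tested with gcc 9 (other gcc versions probably work too). The OpenMP runtime library now ships with the compiler; mainstream compilers such as gcc (GNU) and icc (Intel) include it, so nothing has to be installed separately.
Note
- Using too many OpenMP threads lowers the parallel efficiency.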
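In run.sh, nohup only wraps mpirun, while tee still writes to the terminal, so the pipeline may not fully survive a closed session. A hedged alternative is sketched below: it redirects output straight to the log file and leaves viewing to `tail -f`. The function name `launch_nwchem` and the `MPIRUN` override variable are hypothetical names introduced here for illustration; they are not part of NWChem or MPI.

```shell
#!/bin/bash
# Hypothetical variant of run.sh: launch NWChem in the background and log
# to a file without depending on the terminal.
# Usage: launch_nwchem <nprocs> <input.nw> <logfile>
launch_nwchem() {
    local nprocs="$1" input="$2" log="$3"

    # Keep one OpenMP thread per MPI rank unless the caller set another value
    export OMP_NUM_THREADS="${OMP_NUM_THREADS:-1}"

    # MPIRUN is an assumed override hook; it defaults to plain mpirun.
    # stdout and stderr both go to the log, so no tee process is needed.
    nohup "${MPIRUN:-mpirun}" -np "$nprocs" nwchem "$input" > "$log" 2>&1 &

    echo "started PID $!; follow the output with: tail -f $log"
}

# Example call (commented out because it needs mpirun and nwchem installed):
# launch_nwchem 12 simple_cuda.nw run12.log
```

The trade-off is that nothing appears on screen by default; running `tail -f` on the log gives the same live view that tee provided.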
3. Problems encountered
- Insufficient GA memory
available GA memory 2199852640 bytes
------------------------------------------------------------------------
createfile: failed ga_create size/nproc bytes 298911041856
A forum post discusses this problem: the memory sizes have to be set in advance, using the following directive
memory stack 2000 mb heap 1000 mb global 16000 mb verify
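These values follow the email referenced below, which asks why the memory is set this way:
Reference email
GA forum
Still unresolved: after setting these large memory values, the run fails with a bus error (nonexistent physical address):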
Superposition of Atomic Density Guess
-------------------------------------
Sum of atomic energies: -835.64512040
[cu06:97841:0:97841] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 97841) ====
0 /home/jrf/tools/ucx/lib/libucs.so.0(ucs_handle_error+0x19c) [0x7fc9f0581f5c]
1 /home/jrf/tools/ucx/lib/libucs.so.0(+0x2628c) [0x7fc9f058228c]
2 /home/jrf/tools/ucx/lib/libucs.so.0(+0x26453) [0x7fc9f0582453]
3 /lib64/libpthread.so.0(+0xf5d0) [0x7fc9fa87d5d0]
4 /lib64/libc.so.6(+0x8efe0) [0x7fc9f9ab9fe0]
5 nwchem() [0x2aabf87]
6 nwchem() [0x2a77006]
7 nwchem() [0x8c42d2]
8 nwchem() [0x8c5243]
9 nwchem() [0x8c59e9]
10 nwchem() [0x8bc75c]
11 nwchem() [0x8d3fce]
12 nwchem() [0x8bfb51]
13 nwchem() [0x163dccd]
14 nwchem() [0x163ae8e]
15 nwchem() [0x41e869]
16 nwchem() [0x41fddb]
17 nwchem() [0x412574]
18 nwchem() [0x40ae76]
19 nwchem() [0x40b3c1]
20 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fc9f9a4d3d5]
21 nwchem() [0x4092eb]
=================================
Program received signal SIGBUS: Access to an undefined portion of a memory object.
Backtrace for this error:
#0 0x7FC9FA565697
#1 0x7FC9FA565CDE
#2 0x7FC9FA87D5CF
#3 0x7FC9F9AB9FE0
#4 0x2AABF86 in wnga_zero
#5 0x2A77005 in ga_zero_
#6 0x8C42D1 in int_1e_oldga0_ at int_1e_ga.F:690
#7 0x8C5242 in int_1e_oldga_ at int_1e_ga.F:554
#8 0x8C59E8 in int_1e_ga_ at int_1e_ga.F:154
#9 0x8BC75B in rhf_dens_to_mo_ at rhf_dens_mo.F:61
#10 0x8D3FCD in scf_vectors_guess_ at scf_vec_guess.F:199
#11 0x8BFB50 in scf_ at scf.F:263
#12 0x163DCCC in tce_energy_ at tce_energy.F:569
#13 0x163AE8D in tce_energy_fragment_ at tce_energy_fragment.F:88
#14 0x41E868 in task_energy_doit_ at task_energy.F:299
#15 0x41FDDA in task_energy_ at task_energy.F:122
#16 0x412573 in task_ at task.F:371
#17 0x40AE75 in MAIN__ at nwchem.F:313
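The crash happens in ga_zero, that is, while a freshly created global array is first written. One thing worth checking (an assumption, not a confirmed diagnosis of this run) is whether the memory requested per node exceeds physical RAM or the size of /dev/shm. The sketch below assumes the stack, heap and global limits of the memory directive apply to each MPI process; the helper name `nwchem_node_demand_mb` and the rank count of 6 are illustrative.

```shell
#!/bin/bash
# Rough per-node memory estimate for an NWChem memory directive.
# Assumption: the stack/heap/global limits apply to each MPI process.
nwchem_node_demand_mb() {
    local stack_mb="$1" heap_mb="$2" global_mb="$3" ranks_per_node="$4"
    echo $(( (stack_mb + heap_mb + global_mb) * ranks_per_node ))
}

# Values from "memory stack 2000 mb heap 1000 mb global 16000 mb",
# with an illustrative 6 ranks on one node
demand_mb=$(nwchem_node_demand_mb 2000 1000 16000 6)
echo "estimated demand: ${demand_mb} MB"

# Compare with physical RAM (Linux only)
if [ -r /proc/meminfo ]; then
    ram_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
    echo "physical RAM:     ${ram_mb} MB"
fi

# Compare with the size of the shared-memory filesystem, if present
if [ -d /dev/shm ]; then
    df -Pm /dev/shm | awk 'NR==2 {print "/dev/shm size:    " $2 " MB"}'
fi
```

If the estimate is larger than either figure, lowering the global value or the number of ranks per node is a reasonable first experiment.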
This forum post explains what the NWChem memory parameters mean and how to set them:
http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id648/Is_there_a_systematic_way_of_fin…html
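The official documentation also explains how to set each memory parameter, with examples (direct link here).
For reference, the memory units convert as follows: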
$$1\ \text{mb} = 1024\ \text{kb} = 1024^{2}\ \text{bytes} = \frac{1}{8} \times 1024^{2}\ \text{doubles}$$
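Applying that conversion to the byte counts in the GA error message makes the shortfall easier to read. The following is a small sketch in shell arithmetic; the helper names `bytes_to_mb` and `bytes_to_doubles` are made up for this example.

```shell
#!/bin/bash
# Convert byte counts from the GA error message into mb and doubles,
# using 1 mb = 1024^2 bytes and 1 double = 8 bytes (integer division).
bytes_to_mb()      { echo $(( $1 / 1024 / 1024 )); }
bytes_to_doubles() { echo $(( $1 / 8 )); }

requested=298911041856   # from "failed ga_create size/nproc bytes"
available=2199852640     # from "available GA memory"

echo "requested: $(bytes_to_mb "$requested") mb, $(bytes_to_doubles "$requested") doubles"
echo "available: $(bytes_to_mb "$available") mb"
```

The requested array works out to 285063 mb (roughly 278 GB) against 2097 mb available, which is why the original basis set does not fit.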
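Judging from the error message, the basis set in the original input needs well over 100 GB of memory, so I switched to the sto-3g basis set and the run completed successfully.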
- OpenBLAS library not found
(base) [jrf@cu06 test]$ mpirun -np 6 nwchem simple.nw
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
nwchem: error while loading shared libraries: libopenblas.so.0: cannot open shared object file: No such file or directory
Set the OpenBLAS library path in .bashrc by putting it in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/home/jrf/tools/OpenBLAS/lib/
Then make the change take effect:
source ~/.bashrc
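Assigning LD_LIBRARY_PATH outright replaces any entries that were already there (the backtrace earlier shows UCX libraries loaded from another directory under /home/jrf/tools). A hedged sketch that prepends instead, and then asks `ldd` which libraries are still unresolved, could look like this; `prepend_lib_path` and `missing_libs` are hypothetical helper names.

```shell
#!/bin/bash
# Prepend a directory to LD_LIBRARY_PATH without discarding existing entries.
prepend_lib_path() {
    local dir="$1"
    export LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Print the shared libraries that the dynamic loader cannot resolve
# for the given binary; prints nothing when everything is found.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

prepend_lib_path /home/jrf/tools/OpenBLAS/lib

# Check the nwchem binary only if it is on PATH
if command -v nwchem >/dev/null 2>&1; then
    missing_libs "$(command -v nwchem)" || echo "all shared libraries resolved"
fi
```

Running the check before `mpirun` avoids getting the same loader error once per rank.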
- mpirun exits with signal 9
The error message is shown below. It indicates that memory ran out; try again with more nodes allocated.
mpirun noticed that process rank 16 with PID 1524 on node cvb-10 exited on signal 9 (killed).
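4. Additional notes
- Every failed run leaves many large intermediate files under /dev/shm. They occupy memory and have to be deleted by hand. This directory is a region carved out of RAM, which is why it is fast to use.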
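A cautious way to do that cleanup is sketched below: list the current user's files first and delete only on request. The function name `clean_shm` and its `--delete` flag are inventions for this example, and it should only be run when no NWChem job of yours is still active on the node, since a live job may be using those files.

```shell
#!/bin/bash
# List regular files owned by the current user in a shared-memory directory.
# Nothing is deleted unless the second argument is "--delete".
clean_shm() {
    local dir="${1:-/dev/shm}" mode="${2:-}"
    if [ "$mode" = "--delete" ]; then
        find "$dir" -maxdepth 1 -type f -user "$(id -un)" -exec rm -f {} +
    else
        find "$dir" -maxdepth 1 -type f -user "$(id -un)" -print
    fi
}

# Show how full /dev/shm is, then do a dry run that only lists candidates
if [ -d /dev/shm ]; then
    df -h /dev/shm
    clean_shm /dev/shm
fi

# After checking the list, delete with:
# clean_shm /dev/shm --delete
```

Restricting the search to files owned by the current user keeps the script from touching other users' shared-memory segments on a shared node.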