I ran a series of coupled-cluster test calculations on FH2 using two nodes connected by InfiniBand, each with 12 cores and 48 GB of memory. For fairly small basis sets and tasks, such as UCCSDT/avdz and UCCSD/vtz, everything works, but tasks that require more than about 500 MB of memory fail.
The input file is:
start fh2
scratch_dir ./tmp
memory heap 300 mb stack 300 mb global 3000 mb
geometry units au
H -0.466571969 0.000000000 -3.498280516
H 0.624505061 0.000000000 -2.532671944
F -0.008378972 0.000000000 0.319965748
end
basis noprint
* library cc-pvdz # or aug-cc-pvdz or others
end
SCF
semidirect
DOUBLET
UHF
THRESH 1.0e-10
TOL2E 1.0e-10
END
TCE
SCF
CCSD # or CCSDT or CCSDTQ
END
TASK TCE ENERGY
When I run UCCSD/avtz calculations, the Hartree-Fock part completes, but the job terminates in the CC step as shown below:
Memory Information
------------------
Available GA space size is 9437161950 doubles
Available MA space size is 78639421 doubles
Maximum block size 36 doubles
tile_dim = 35
Block Spin Irrep Size Offset Alpha
-------------------------------------------------
1 alpha a' 5 doubles 0 1
2 alpha a" 1 doubles 5 2
3 beta a' 4 doubles 6 3
4 beta a" 1 doubles 10 4
5 alpha a' 34 doubles 11 5
6 alpha a' 34 doubles 45 6
7 alpha a" 31 doubles 79 7
8 beta a' 34 doubles 110 8
9 beta a' 35 doubles 144 9
10 beta a" 31 doubles 179 10
Global array virtual files algorithm will be used
Parallel file system coherency ......... OK
Integral file = ./tmp/fh2.aoints.00
Record size in doubles = 65536 No. of integs per rec = 43688
Max. records in memory = 15 Max. records in file = 2287
No. of bits per label = 8 No. of bits per value = 64
#quartets = 1.396D+05 #integrals = 8.008D+06 #direct = 0.0% #cached =100.0%
File balance: exchanges= 12 moved= 15 time= 0.0
Fock matrix recomputed
1-e file size = 12706
1-e file name = ./tmp/fh2.f1
Cpu & wall time / sec 0.2 1.1
tce_ao2e: fast2e=1
half-transformed integrals in memory
2-e (intermediate) file size = 279803475
2-e (intermediate) file name = ./tmp/fh2.v2i
Cpu & wall time / sec 1.8 2.3
tce_mo2e: fast2e=1
2-e integrals stored in memory
2-e file size = 119972997
2-e file name = ./tmp/fh2.v2
Cpu & wall time / sec 10.0 10.5
do_pt = F
do_lam_pt = F
do_cr_pt = F
do_lcr_pt = F
do_2t_pt = F
T1-number-of-tasks 6
t1 file size = 678
t1 file name = ./tmp/fh2.t1
t1 file handle = -998
T2-number-of-boxes 38
t2 file size = 368230
t2 file name = ./tmp/fh2.t2
t2 file handle = -995
CCSD iterations
-----------------------------------------------------------------
Iter Residuum Correlation Cpu Wall V2*C2
-----------------------------------------------------------------
0: error ival=4
(rank:0 hostname:compute-10-15.local pid:19142):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
12: error ival=4
(rank:12 hostname:compute-10-1.local pid:9867):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
rank 0 in job 8 i10-15_52820 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
For UCCSDTQ/vdz calculations, the job terminated at the third CC iteration:
2-e file size = 386290
2-e file name = ./tmp/fh2.v2
Cpu & wall time / sec 0.4 0.4
do_pt = F
do_lam_pt = F
do_cr_pt = F
do_lcr_pt = F
do_2t_pt = F
T1-number-of-tasks 6
t1 file size = 140
t1 file name = ./tmp/fh2.t1
t1 file handle = -998
T2-number-of-boxes 38
t2 file size = 14660
t2 file name = ./tmp/fh2.t2
t2 file handle = -995
t3 file size = 1160539
t3 file name = ./tmp/fh2.t3
2: WARNING:armci_set_mem_offset: offset changed 794624 to 9244672
3: WARNING:armci_set_mem_offset: offset changed 0 to 8450048
6: WARNING:armci_set_mem_offset: offset changed 794624 to 8450048
8: WARNING:armci_set_mem_offset: offset changed 794624 to 8450048
13: WARNING:armci_set_mem_offset: offset changed 0 to -620834816
t4 file size = 78188214
t4 file name = ./tmp/fh2.t4
CCSDTQ iterations
--------------------------------------------------------
Iter Residuum Correlation Cpu Wall
--------------------------------------------------------
1 0.2682660632262 -0.1813353786615 86.8 89.1
2 0.0920127385001 -0.1943555090903 87.5 89.9
0: error ival=4
(rank:0 hostname:compute-10-15.local pid:19656):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
12: error ival=4
(rank:12 hostname:compute-10-1.local pid:10202):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
application called MPI_Abort(comm=0x84000003, 1) - process 0
rank 0 in job 13 i10-15_52820 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
If only one node is used, everything also works. It seems that the memory actually available becomes limited when running in parallel across multiple nodes. I also checked the maximum shared memory, which is nearly 36 GB:
Psd,
Your calculations are likely crashing while creating shared memory segments.
If you set the environment variable ARMCI_DEFAULT_SHMMAX to a value of 2048 (or larger),
you should be able to overcome this problem.
Please keep in mind that ARMCI_DEFAULT_SHMMAX (given in megabytes) must be
less than or equal to the kernel parameter kernel.shmmax (given in bytes).
Only root can change kernel.shmmax, so you might have to ask the system
administrator to do it.
For example, if the value of kernel.shmmax is 4294967296,
ARMCI_DEFAULT_SHMMAX can be at most 4096 (4294967296 = 4096*1024*1024).
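The unit conversion in the example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is made up here, and the assumption that ARMCI_DEFAULT_SHMMAX is counted in units of 1024*1024 bytes is taken from the example in the reply, not from the ARMCI documentation.

```python
# Sketch: largest ARMCI_DEFAULT_SHMMAX (in MB) allowed by a given kernel.shmmax (in bytes).
# Assumption: 1 MB = 1024 * 1024 bytes, as in the worked example in the reply.

BYTES_PER_MB = 1024 * 1024


def max_armci_shmmax_mb(kernel_shmmax_bytes: int) -> int:
    """Return the largest whole number of MB that fits into kernel.shmmax."""
    if kernel_shmmax_bytes < 0:
        raise ValueError("kernel.shmmax cannot be negative")
    return kernel_shmmax_bytes // BYTES_PER_MB


if __name__ == "__main__":
    shmmax = 4294967296  # value quoted in the reply
    print(f"kernel.shmmax = {shmmax} bytes")
    print(f"ARMCI_DEFAULT_SHMMAX can be at most {max_armci_shmmax_mb(shmmax)}")
```

On a Linux node the current kernel value can usually be read from /proc/sys/kernel/shmmax and passed to the function.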
argument 1 = fh2.nw
(rank:12 hostname:compute-11-3.local pid:1523):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_register_region():1124 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 12:: Cannot allocate memory
application called MPI_Abort(comm=0x84000003, 1) - process 12
(rank:0 hostname:compute-11-32.local pid:4764):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_register_region():1124 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 0:: Cannot allocate memory
rank 12 in job 2 i11-32_41208 caused collective abort of all ranks
exit status of rank 12: killed by signal 9
[5:i11-32] unexpected disconnect completion event from [12:i11-3]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 5
There is 48 GB of physical memory on each node, and 12 processors are used per node.
It is really strange that a medium-sized calculation that runs on a single node fails on two or more nodes. I don't think this breakdown is caused by a lack of memory; perhaps some tools such as GA are not installed correctly, or some system services are unavailable. It seems that the host machine can only use up to 1 GB of remote memory, and GA does not add up the memory across nodes.
Thanks for your help. We have found the bottleneck and fixed it, and it works well now. The problem was caused by the amount of memory that InfiniBand can register, which is limited to 4 GB by default.
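As a rough check of the numbers in this thread, the sketch below multiplies the per-process memory settings from the first input file (heap 300 mb, stack 300 mb, global 3000 mb) by the 12 processes per node. It assumes that the memory directive applies per process and that "mb" means 1024*1024 bytes; both assumptions are consistent with the "Available GA space size" line in the first log (about 24 x 3000 mb), but they are assumptions, and the function name is invented for illustration.

```python
# Sketch: per-node memory requested by an NWChem "memory" directive.
# Assumptions: the directive is per process, and 1 mb = 1024 * 1024 bytes.

MB_PER_GIB = 1024


def per_node_demand_mb(heap_mb, stack_mb, global_mb, procs_per_node):
    """Return (local_mb, global_mb_total) requested on one node."""
    local_mb = (heap_mb + stack_mb) * procs_per_node
    global_total_mb = global_mb * procs_per_node
    return local_mb, global_total_mb


local_mb, global_total_mb = per_node_demand_mb(300, 300, 3000, 12)
total_mb = local_mb + global_total_mb

if __name__ == "__main__":
    print(f"local (heap + stack) per node: {local_mb} MB")
    print(f"global per node:               {global_total_mb} MB "
          f"({global_total_mb / MB_PER_GIB:.1f} GiB)")
    print(f"total per node:                {total_mb} MB "
          f"({total_mb / MB_PER_GIB:.1f} GiB)")
    # 4 GB registration limit quoted in the post, expressed in MB
    print(f"global exceeds a 4096 MB limit: {global_total_mb > 4096}")
```

Under these assumptions the global memory requested per node is about 35 GiB, which fits into 48 GB of physical memory but is far more than 4 GB, so a job could run on one node and still fail once the memory has to be registered for remote access.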
In addition, I ran some CC calculations and found the parallel efficiency unsatisfactory. For the task below, one iteration takes 759.6 s of CPU time on 16 cores (1 node), but 1326 s on 128 cores (8 nodes). Is this normal?
Thanks!
Jun Chen 2012/11/15
The input file is:
start fh2
permanent_dir .
scratch_dir ./tmp
memory heap 500 mb stack 500 mb global 9000 mb
geometry units au
H -0.466571969 0.000000000 -3.498280516
H 0.624505061 0.000000000 -2.532671944
F -0.008378972 0.000000000 0.319965748
# symmetry c1
end
basis noprint
* library aug-cc-pvqz
end
SCF
semidirect
DOUBLET
RHF
THRESH 1.0d-8
TOL2E 1.0d-8
END
TCE
SCF
CCSDT
THRESH 1.0d-5
FREEZE atomic
DIIS 5
END
TASK TCE ENERGY
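The two timings quoted in the post above (759.6 s per iteration on 16 cores, 1326 s on 128 cores) can be turned into a speedup and a parallel efficiency. The sketch below is only arithmetic on those two numbers; note that the post reports CPU time, so treating it as time to solution is an assumption, and wall-clock times would be the better input.

```python
# Sketch: speedup and parallel efficiency relative to a reference run.
# Inputs are the per-iteration times quoted in the post (CPU seconds).


def speedup(t_ref: float, t_new: float) -> float:
    """Speedup of the new run relative to the reference run."""
    return t_ref / t_new


def parallel_efficiency(t_ref, cores_ref, t_new, cores_new):
    """Speedup divided by the factor by which the core count grew."""
    return speedup(t_ref, t_new) / (cores_new / cores_ref)


s = speedup(759.6, 1326.0)
e = parallel_efficiency(759.6, 16, 1326.0, 128)

if __name__ == "__main__":
    print(f"speedup going from 16 to 128 cores: {s:.3f}")
    print(f"parallel efficiency:                {e:.1%}")
```

A speedup below 1 means that the 128-core run is slower per iteration than the 16-core run, not merely less efficient.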
Hi, I am not surprised that your CCSDT/CCSDTQ jobs are not running (or perhaps not scaling properly). Please look at the tile sizes you are using. For unoccupied orbitals the maximum tile size is 35, which places a huge demand on local memory and also gives really poor granularity.
For the CCSDT part the local memory demand is proportional to tilesize^6, so please set the tilesize parameter to 15 for CCSDT.
For the CCSDTQ part the local memory demand is proportional to tilesize^8, so be even more conservative with the tilesize in these runs. I guess tilesize 8 should be fine.
Please also modify the memory settings. Something like this should work:
memory heap 100 mb stack 1200 mb global 2500 mb
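To put the tilesize advice above in numbers, the sketch below evaluates tilesize^6 and tilesize^8 for the values mentioned in this thread. It assumes a proportionality constant of 1 (one block of double-precision numbers with tilesize^n elements), which is a simplification made here for illustration; the real local memory demand of the TCE module may differ by a constant factor.

```python
# Sketch: size of one tilesize^n block of doubles.
# Assumption: proportionality constant of 1, 8 bytes per double.

BYTES_PER_DOUBLE = 8
MIB = 1024 * 1024


def block_doubles(tilesize: int, order: int) -> int:
    """Number of doubles in a block with `order` indices of length `tilesize`."""
    return tilesize ** order


def block_bytes(tilesize: int, order: int) -> int:
    """Size in bytes of the same block."""
    return block_doubles(tilesize, order) * BYTES_PER_DOUBLE


if __name__ == "__main__":
    cases = [
        ("CCSDT", 6, 35),   # tile size reported in the log
        ("CCSDT", 6, 15),   # suggested in the reply
        ("CCSDTQ", 8, 35),
        ("CCSDTQ", 8, 8),   # suggested in the reply
    ]
    for method, order, tilesize in cases:
        size_mib = block_bytes(tilesize, order) / MIB
        print(f"{method:7s} tilesize {tilesize:2d}: {size_mib:14.1f} MiB per block")
```

With these assumptions a CCSDT block at tilesize 35 needs roughly 14 GB, while tilesize 15 brings it down to under 100 MB, which fits comfortably into the suggested 1200 mb stack.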