Reposted from:
http://hi.baidu.com/linzch/blog/item/7e7d750e18329ec07acbe14f.html
1. p1_xxxxx: p4_error: interrupt SIGSEGV: 11
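This error is most likely caused by a segmentation fault in one of the processes. Mistakes in my own code that have triggered it:
a. Allocating memory for a pointer in only one process and not in the others, so the broadcast fails.
b. Accessing an array out of bounds.
Someone online explained it well: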
"There are 2 things to check.
** Run one of the test programs like pi3.f or cpi.c to see whether your cluster's OK.
** if it is, the fault is in your code. See if you're exceeding array bounds or accessing memory which you haven't allocated, There's a SIGSEGV error - that's a segmentation violation. That might explain stuff like
bm_list_21829: p4_error: interrupt SIGINT: 2
Once you have a seg. violation, all the 4 processors are sent a signal to interrupt the process (SIGINT). Signals are defined in /usr/include/sys/signal.h (at least on the SGIs; might be different on other systems). "
2. p1_10401: p4_error: : 14
1 - MPI_BCAST : Message truncated
[1] Aborting program !
[1] Aborting program!
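This one is also a buffer problem: the receive buffer passed to MPI_Bcast is smaller than the message sent by the root. Allocate a large enough buffer before calling MPI_Bcast, and use the same count on every process, and the message will no longer be truncated.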
3. p4_error: alloc_p4_msg failed:
p0_6773: (7.828703) xx_shmalloc: returning NULL; requested 1048616 bytes
p0_6773: (7.828762) p4_shmalloc returning NULL; request = 1048616 bytes
Not enough memory was allocated. Increase the memory available to the program by setting the environment variable P4_GLOBMEMSIZE (in bytes):
export P4_GLOBMEMSIZE=32000000 (for bash users)
setenv P4_GLOBMEMSIZE 32000000 (for csh or tcsh users)
4. libcprts.so.5: cannot open shared object file: No such file or directory
/home/jbrandt/tests/test.exe: error while loading shared libraries: libcprts.so.5: cannot open shared object file: No such file or directory
p0_792: p4_error: Child process exited while making connection to remote process on compute-0-0.local: 0
/opt/mpich/intel/bin/mpirun: line 1: 792 Broken pipe /home/jbrandt/tests/test.exe -p4pg /home/jbrandt/tests/PI646 -p4wd /home/jbrandt/tes
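The program was not linked statically, so the shared library libcprts.so.5 could not be found when the executable started on the compute node. Recompiling with -static fixes it.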