The problem
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(230).................: MPI_Sendrecv(sbuf=0x39837f0, scount=5776, MPI_BYTE, dest=4, stag=4, rbuf=0x3984e90, rcount=5776, MPI_BYTE, src=2, rtag=4, MPI_COMM_WORLD, status=0x715430) failed
MPIDI_CH3U_Receive_data_found(129): Message from rank 2 and tag 4 truncated; 55696 bytes received but buffer size is 5776
Why does this happen? I have no idea what to do.
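The message itself pins down the failure: the receive posted a 5776-byte buffer, but the matching message from rank 2 carried 55696 bytes, and MPI raises an error rather than silently clipping an over-long message. A toy model of that rule (not real MPI, just the size check that MPIDI_CH3U_Receive_data_found performs):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Toy model of MPI's matching rule (NOT real MPI): a receive posts a
// buffer of `rcount` bytes; if the matched incoming message is larger,
// MPI reports "Message truncated" instead of silently clipping the data.
struct Message { std::vector<char> payload; };

void mockRecv(const Message& incoming, char* rbuf, std::size_t rcount) {
    if (incoming.payload.size() > rcount) {
        // The situation in the log above:
        // "55696 bytes received but buffer size is 5776"
        throw std::runtime_error("Message truncated");
    }
    std::memcpy(rbuf, incoming.payload.data(), incoming.payload.size());
}
```

In other words, sender and receiver disagree about how big this exchange is; the usual cure is to agree on the size first (send it in a separate message, or use MPI_Probe with MPI_Get_count) before posting the payload receive.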
Searching for information
Below are some links I found, but so far they haven't solved the problem.
MSMPI Buffer size issue ( Message truncated ) (microsoft.com)
PMPI_Bcast: Message truncated, - Intel Community
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(230).................: MPI_Sendrecv(sbuf=0x2d1d880, scount=5776, MPI_BYTE, dest=4, stag=4, rbuf=0x2d1ef20, rcount=5776, MPI_BYTE, src=2, rtag=4, MPI_COMM_WORLD, status=0x715430) failed
MPIDI_CH3U_Receive_data_found(129): Message from rank 2 and tag 4 truncated; 55696 bytes received but buffer size is 5776
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(230): MPI_Sendrecv(sbuf=0x10484160, scount=55696, MPI_BYTE, dest=3, stag=4, rbuf=0x10491b00, rcount=55696, MPI_BYTE, src=1, rtag=4, MPI_COMM_WORLD, status=0x715430) failed
do_cts(629)......: Message truncated; 88736 bytes received but buffer size is 55696
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(230): MPI_Sendrecv(sbuf=0x361bcc0, scount=6400, MPI_BYTE, dest=MPI_PROC_NULL, stag=5, rbuf=0x36207f0, rcount=6400, MPI_BYTE, src=1, rtag=5, MPI_COMM_WORLD, status=0x715430) failed
do_cts(629)......: Message truncated; 88736 bytes received but buffer size is 6400
Testing
I seem to have found a very similar issue:
I ran the Test-parallel utility it recommends; the whole thing ran fine, so I suspect the problem is in my setup.
I copied the parallel test case to the corresponding path and compiled it; compilation succeeded.
Then I went to my own case directory to test.
It produced the following output (a log file is also written in the case directory):
/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 2.3.x |
| \\ / A nd | Web: www.OpenFOAM.org |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 2.3.x
Exec : Test-parallel -parallel
Date : Nov 18 2021
Time : 17:40:20
Host : "K228"
PID : 1250
Case : /ncsfs02/yyxie/parallel350
nProcs : 5
Slaves :
4
(
"K228.1251"
"K228.1252"
"K228.1253"
"K228.1254"
)
Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
polling iterations : 0
sigFpe : Floating point exception trapping - not supported on this platform
fileModificationChecking : Monitoring run-time modified files using timeStampMaster
allowSystemOperations : Allowing user-supplied system call operations
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time
End
Finalising parallel run
Preliminary solution:
Based on the error from my case:
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(230).................: MPI_Sendrecv(sbuf=0x60d94d0, scount=4624, MPI_BYTE, dest=MPI_PROC_NULL, stag=4, rbuf=0x60da6f0, rcount=4624, MPI_BYTE, src=3, rtag=4, MPI_COMM_WORLD, status=0x715430) failed
MPIDI_CH3U_Receive_data_found(129): Message from rank 3 and tag 4 truncated; 36864 bytes received but buffer size is 4624
It means that ranks 3 and 4 can't pass data between them. Going by the intuition built up from all that time surfing the web, I guessed that decomposing into 2 or 3 subdomains might not trigger the error. I tried 2, and it actually ran. I could have cried on the spot.
Also, the earlier MPI test suggests the system configuration is fine. So I suspect there is a logic problem somewhere in the code; I'll get more familiar with the code and see whether I can fix it.
I have one more idea: decomposeParDict actually has some rules. You can't just cut the domain any way you like, since the subdomains have to exchange information, and the region around the distributor plate is different from the rest. So I'll also try different ways of splitting the domain.
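For reference, the 2-subdomain run that worked corresponds to a decomposeParDict along these lines (a sketch only; the subdomain count, method, and split direction are assumptions to adapt to the actual case):

```
// system/decomposeParDict (sketch)
numberOfSubdomains 2;

method          simple;

simpleCoeffs
{
    n               (2 1 1);   // split along x only, through regular faces
    delta           0.001;
}
```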
I also posted a question on cfd-online; hoping someone replies.
Found another possible cause:
MPI job won't work on multiple hosts - Bug Reports - Palabos Forum
A similar problem is described there; apparently there was a bug in the code:
The temporary variable (shown in red in the post), which is sent to the other process as the size of some dynamic data, can be overwritten by other code before it is actually sent. This causes the size the other process receives to be a wrong, modified number.
The suggested fix is:
So if we replace the temporary variable above (the red one) with a global variable, or anything that lives longer than the send operation, the problem is fixed.
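The lifetime bug described above can be reproduced without MPI at all. A nonblocking send records only a pointer to the buffer and reads it when the transfer actually happens, so reusing the variable in between corrupts the size. A minimal sketch (the Sender type here is hypothetical, standing in for MPI_Isend/MPI_Wait):

```cpp
#include <cassert>

// Hypothetical stand-in for a nonblocking send: like MPI_Isend, it only
// remembers WHERE the data lives, and reads it when the transfer completes.
struct Sender {
    const int* buf = nullptr;
    void isend(const int& size) { buf = &size; }   // no copy is made
    int complete() const { return *buf; }          // like MPI_Wait + delivery
};

int buggy() {
    Sender s;
    int tmp = 55696;      // intended message size
    s.isend(tmp);
    tmp = 4624;           // variable reused before the send completes...
    return s.complete();  // ...so the receiver sees the modified value
}

int fixed() {
    Sender s;
    static int persistentSize = 55696;  // outlives the send operation
    s.isend(persistentSize);
    return s.complete();                // receiver sees the right size
}
```

The quoted fix is exactly this: give the size a storage duration that outlasts the send (a global, a member, or simply waiting on the request before reusing the variable).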
fortran - MPI_Cart_create error - Stack Overflow
I also came across a piece on the relationship between MPI and MPICH.
Another problem:
The parallel run converges much worse than the serial one. No idea how to handle that yet.
With a simpler case, a 5-way decomposition runs fine. So it may really be related to the mesh decomposition; I'm wondering whether the cuts should go through regular faces.
Solved:
I reasoned that if MPI_Sendrecv itself wasn't at fault, then the data being communicated must be. That turned out to be the case: reading backwards through the code to trace where the error came from, the mapping looked wrong, and then it turned out the bounding box was already wrong before that.
The values read from checkMesh describe the extent of the fluid mesh, but what should really be used is the extent of the particle mesh. So at this step the extent has to be changed by hand to the particle-mesh range, and each processor's region has to butt up exactly against its neighbours, with no overlap.
In other words, things had already gone wrong when the domain was decomposed.
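The constraint in the fix, that the per-processor regions must cover the particle-mesh extent edge to edge with no overlap, can be checked mechanically. A sketch for a split along one axis (the names are hypothetical, and a real mesh would want a tolerance instead of exact floating-point comparison):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One processor's extent along the split axis.
struct Interval { double lo, hi; };

// True if the intervals exactly tile [globalLo, globalHi]:
// after sorting, each box must start where the previous one ended
// (no gap, no overlap), and the last must end at the global bound.
bool tilesExactly(std::vector<Interval> boxes, double globalLo, double globalHi) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Interval& a, const Interval& b) { return a.lo < b.lo; });
    double cursor = globalLo;
    for (const Interval& b : boxes) {
        if (b.lo != cursor) return false;  // gap or overlap at the seam
        cursor = b.hi;
    }
    return cursor == globalHi;
}
```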