srun: job 577909 queued and waiting for resources

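While running a high-performance computing (HPC) job, an MPI_Wait error occurred and inter-process communication aborted with MPI_ERR_TRUNCATE, which means a received message was larger than the receive buffer posted for it. Several jobs ended up in the CANCELLED state, and attempts to restart the job also failed. Checking job status with `sacct` shows the jobs were forcibly terminated, possibly because node resources were unavailable or because of a scheduling problem. Work is ongoing to resolve this so that the HPL test can run properly again.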


 Prog= 85.94%   N_left= 24960   Time= 10.25     Time_left= 1.68 iGF=  5648.53   GF=  6181.10    iGF_per= 1412.13        GF_per= 1545.27 
 Prog= 86.42%   N_left= 24672   Time= 10.31     Time_left= 1.62 iGF=  5904.47   GF=  6179.48    iGF_per= 1476.12        GF_per= 1544.87 
[g0151:33153] *** An error occurred in MPI_Wait
[g0151:33153] *** reported by process [3514040320,2]
[g0151:33153] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[g0151:33153] *** MPI_ERR_TRUNCATE: message truncated
[g0151:33153] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g0151:33153] ***    and potentially your MPI job)
In: PMI_Abort(15, N/A)
[g0151:33154] *** An error occurred in MPI_Wait
[g0151:33154] *** reported by process [3514040320,3]
[g0151:33154] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[g0151:33154] *** MPI_ERR_TRUNCATE: message truncated
[g0151:33154] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g0151:33154] ***    and potentially your MPI job)
In: PMI_Abort(15, N/A)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 577908.0 ON g0151 CANCELLED AT 2022-07-07T20:20:15 ***
 Prog= 86.89%   N_left= 24384   Time= 10.37     Time_left= 1.57 iGF=  5459.12   GF=  617srun: error: g0151: task 0: Killed
srun: error: Timed out waiting for job step to complete
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0151 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat
srun: job 577909 queued and waiting for resources
^Csrun: Job allocation 577909 has been revoked
srun: Force Terminated job 577909
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED|grep -v CANCELLED
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
577604             xhpl  priv_test   hpl_test          3 CANCELLED+      0:0 
577604.0           xhpl              hpl_test          3 CANCELLED+      0:2 
577703.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577844.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577847.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577848             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577849             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577850             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577854             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896.0           xhpl              hpl_test          4 CANCELLED+     0:11 
577904             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577905             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577906             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577909             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
[hpl_test@swarm02 project]$ scancel -M priv 577604,577848,577849,577850,577854,577896,577904,577905,577906,577909
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
577604             xhpl  priv_test   hpl_test          3 CANCELLED+      0:0 
577604.0           xhpl              hpl_test          3 CANCELLED+      0:2 
577703.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577844.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577847.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577848             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577849             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577850             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577854             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896.0           xhpl              hpl_test          4 CANCELLED+     0:11 
577904             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577905             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577906             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577908.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577909             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0151 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat

================================================================================
HPL-NVIDIA 1.0.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   47000 
NB     :     288 
PMAP   : Row-major process mapping
P      :       4 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  2ringM 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : no
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0150 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat
srun: Required node not available (down, drained or reserved)
srun: job 577911 queued and waiting for resources
^Csrun: Job allocation 577911 has been revoked
srun: Force Terminated job 577911
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
577604             xhpl  priv_test   hpl_test          3 CANCELLED+      0:0 
577604.0           xhpl              hpl_test          3 CANCELLED+      0:2 
577703.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577844.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577847.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577848             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577849             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577850             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577854             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577896.0           xhpl              hpl_test          4 CANCELLED+     0:11 
577904             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577905             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577906             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577908.0           xhpl              hpl_test          4 CANCELLED+      0:9 
577909             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
577911             xhpl  priv_test   hpl_test          4 CANCELLED+      0:0 
[hpl_test@swarm02 project]$ scancel -M priv 577604,577848,577849,577850,577854,577896,577904,577905,577906,577909,577911
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$ 
[hpl_test@swarm02 project]$  
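
The `ExitCode` column in the `sacct` listings above has the form `exit_code:signal`, so `0:9` on a job step means the step was terminated by signal 9 rather than exiting on its own. The following is a minimal sketch for decoding those fields; the helper name `decode_exit_code` is hypothetical, and the signal names assume a Linux host with standard POSIX signal numbers.

```python
import signal


def decode_exit_code(field):
    """Split a sacct ExitCode field 'code:signal' into (code, signal_name).

    signal_name is None when the signal part is 0 (no terminating signal).
    Hypothetical helper for illustration; not part of Slurm.
    """
    code_str, sig_str = field.strip().split(":")
    code, sig = int(code_str), int(sig_str)
    if sig == 0:
        return code, None
    try:
        name = signal.Signals(sig).name
    except ValueError:
        # Signal number not known on this platform.
        name = "signal %d" % sig
    return code, name


# ExitCode values that appear in the sacct output above.
for field in ("0:0", "0:2", "0:9", "0:11"):
    print(field, decode_exit_code(field))
```

Read this way, the steps showing `0:9` were killed with SIGKILL, which fits the `task 0: Killed` message in the log, while `577896.0` (`0:11`) ended with SIGSEGV and `577604.0` (`0:2`) with SIGINT. This only reports which signal ended a step, not why it was sent.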
