Prog= 85.94% N_left= 24960 Time= 10.25 Time_left= 1.68 iGF= 5648.53 GF= 6181.10 iGF_per= 1412.13 GF_per= 1545.27
Prog= 86.42% N_left= 24672 Time= 10.31 Time_left= 1.62 iGF= 5904.47 GF= 6179.48 iGF_per= 1476.12 GF_per= 1544.87
[g0151:33153] *** An error occurred in MPI_Wait
[g0151:33153] *** reported by process [3514040320,2]
[g0151:33153] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[g0151:33153] *** MPI_ERR_TRUNCATE: message truncated
[g0151:33153] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g0151:33153] *** and potentially your MPI job)
In: PMI_Abort(15, N/A)
[g0151:33154] *** An error occurred in MPI_Wait
[g0151:33154] *** reported by process [3514040320,3]
[g0151:33154] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[g0151:33154] *** MPI_ERR_TRUNCATE: message truncated
[g0151:33154] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g0151:33154] *** and potentially your MPI job)
In: PMI_Abort(15, N/A)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 577908.0 ON g0151 CANCELLED AT 2022-07-07T20:20:15 ***
Prog= 86.89% N_left= 24384 Time= 10.37 Time_left= 1.57 iGF= 5459.12 GF= 617
srun: error: g0151: task 0: Killed
srun: error: Timed out waiting for job step to complete
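MPI_ERR_TRUNCATE generally means a receive was posted with a buffer smaller than the message that arrived, and under MPI_ERRORS_ARE_FATAL it aborts the whole step, as seen above for step 577908.0. A small sketch for pulling the first MPI error out of a captured run log; the log contents below are a sample pasted from the session above, not a new capture:

```shell
# Sample of the abort messages printed by the failed step above.
log='[g0151:33153] *** An error occurred in MPI_Wait
[g0151:33153] *** MPI_ERR_TRUNCATE: message truncated'
# grep -m1 stops at the first matching line (GNU grep).
first=$(printf '%s\n' "$log" | grep -m1 'MPI_ERR')
echo "$first"
```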
[hpl_test@swarm02 project]$
[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0151 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat
srun: job 577909 queued and waiting for resources
^Csrun: Job allocation 577909 has been revoked
srun: Force Terminated job 577909
[hpl_test@swarm02 project]$
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED|grep -v CANCELLED
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
577604 xhpl priv_test hpl_test 3 CANCELLED+ 0:0
577604.0 xhpl hpl_test 3 CANCELLED+ 0:2
577703.0 xhpl hpl_test 4 CANCELLED+ 0:9
577844.0 xhpl hpl_test 4 CANCELLED+ 0:9
577847.0 xhpl hpl_test 4 CANCELLED+ 0:9
577848 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577849 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577850 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577854 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896.0 xhpl hpl_test 4 CANCELLED+ 0:11
577904 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577905 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577906 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577909 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
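The grep chain used above drops finished jobs from sacct's output. The same chain, shown here on a small inline sample (the job IDs and states are illustrative, not taken from the real cluster):

```shell
# Inline stand-in for sacct's tabular output.
sample='577604  xhpl  CANCELLED+  0:0
577605  xhpl  COMPLETED   0:0
577606  xhpl  FAILED      1:0'
# Filter out finished jobs exactly as in the session above.
left=$(printf '%s\n' "$sample" | grep -v COMPLETED | grep -v FAILED)
echo "$left"
```

On a reasonably recent Slurm, the same filtering can also be requested natively, e.g. `sacct -M priv -s PENDING,RUNNING` (an assumption about the installed Slurm version, not something shown in this session).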
[hpl_test@swarm02 project]$ scancel -M priv 577604,577848,577849,577850,577854,577896,577904,577905,577906,577909
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
577604 xhpl priv_test hpl_test 3 CANCELLED+ 0:0
577604.0 xhpl hpl_test 3 CANCELLED+ 0:2
577703.0 xhpl hpl_test 4 CANCELLED+ 0:9
577844.0 xhpl hpl_test 4 CANCELLED+ 0:9
577847.0 xhpl hpl_test 4 CANCELLED+ 0:9
577848 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577849 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577850 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577854 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896.0 xhpl hpl_test 4 CANCELLED+ 0:11
577904 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577905 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577906 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577908.0 xhpl hpl_test 4 CANCELLED+ 0:9
577909 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0151 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat
================================================================================
HPL-NVIDIA 1.0.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 47000
NB : 288
PMAP : Row-major process mapping
P : 4
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
[hpl_test@swarm02 project]$ srun -M priv -p priv_test --nodelist=g0150 -n 4 --gres=gpu:4 ./xhpl 4-47000.dat
srun: Required node not available (down, drained or reserved)
srun: job 577911 queued and waiting for resources
^Csrun: Job allocation 577911 has been revoked
srun: Force Terminated job 577911
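The "Required node not available (down, drained or reserved)" message above means g0150 cannot take the step. Before resubmitting to a specific node, its state could be checked with something like `sinfo -M priv -N -n g0150 -o '%N %t %E'`; a sketch of acting on that state, using a sample line in place of real sinfo output (the node state here is illustrative):

```shell
# Stand-in for one line of `sinfo -N -o '%N %t'` output.
state='g0150 drain'
# Treat drained/down/reserved nodes as unusable before resubmitting.
msg=$(case $state in
  *' drain'|*' down'|*' resv') echo "node unavailable: $state" ;;
  *) echo "node looks usable: $state" ;;
esac)
echo "$msg"
```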
[hpl_test@swarm02 project]$ sacct -M priv|grep -v COMPLETED|grep -v FAILED
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
577604 xhpl priv_test hpl_test 3 CANCELLED+ 0:0
577604.0 xhpl hpl_test 3 CANCELLED+ 0:2
577703.0 xhpl hpl_test 4 CANCELLED+ 0:9
577844.0 xhpl hpl_test 4 CANCELLED+ 0:9
577847.0 xhpl hpl_test 4 CANCELLED+ 0:9
577848 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577849 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577850 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577854 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577896.0 xhpl hpl_test 4 CANCELLED+ 0:11
577904 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577905 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577906 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577908.0 xhpl hpl_test 4 CANCELLED+ 0:9
577909 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
577911 xhpl priv_test hpl_test 4 CANCELLED+ 0:0
[hpl_test@swarm02 project]$ scancel -M priv 577604,577848,577849,577850,577854,577896,577904,577905,577906,577909,577911
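The comma-separated job list passed to scancel above was assembled by hand. A sketch of building it from sacct's parsable output instead; the sample below stands in for what `sacct -M priv -P -n --format=JobID,State` might print (the flags are standard sacct options, but the data here is illustrative):

```shell
# Stand-in for parsable sacct output: JobID|State per line.
sample='577848|CANCELLED by 0
577849|PENDING
577850|PENDING'
# Keep PENDING job IDs and join them with commas.
jobs=$(printf '%s\n' "$sample" | awk -F'|' '$2 == "PENDING" {print $1}' | paste -sd, -)
echo "$jobs"
# which could then feed: scancel -M priv "$jobs"
```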
[hpl_test@swarm02 project]$