Bug #2689: GPU timer state not cleaned up - GROMACS - GROMACS development

In the API MPI tests, on the CUDA+OpenCL config on bs_1310, the following intermittent problem can be seen when run with multiple ranks:

00:09:11.740 [==========] Running 7 tests from 4 test cases.

00:09:11.740 [----------] Global test environment set-up.

00:09:11.740 [----------] 3 tests from ApiRunner

00:09:11.740 [ RUN ] ApiRunner.BasicMD

00:09:11.740 Checkpoint file is from part 5, new output files will be suffixed '.part0006'.

00:09:11.740 The current CPU can measure timings more accurately than the code in

00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation

00:09:11.740 speed as accurate timings are needed for load-balancing.

00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.

00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

00:09:11.740

00:09:11.740 Overriding nsteps with value passed on the command line: 10 steps, 0.01 ps

00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1

00:09:11.740

00:09:11.740

00:09:11.740 Using 2 MPI processes

00:09:11.740 Using 4 OpenMP threads per MPI process

00:09:11.740

00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.

00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:

00:09:11.740 PP:0,PP:1

00:09:11.740 starting mdrun 'spc-and-methanol'

00:09:11.740 80 steps, 0.1 ps (continuing from step 70, 0.1 ps).

00:09:11.740

00:09:11.740 Writing final coordinates.

00:09:11.740

00:09:11.740 NOTE: 45 % of the run time was spent communicating energies,

00:09:11.740 you might want to use the -gcom option of mdrun

00:09:11.740

00:09:11.740 Core t (s) Wall t (s) (%)

00:09:11.740 Time: 0.175 0.022 795.8

00:09:11.740 (ns/day) (hour/ns)

00:09:11.740 Performance: 43.232 0.555

00:09:11.740 [ OK ] ApiRunner.BasicMD (1319 ms)

00:09:11.740 [ RUN ] ApiRunner.Reinitialize

00:09:11.740 Checkpoint file is from part 6, new output files will be suffixed '.part0007'.

00:09:11.740 The current CPU can measure timings more accurately than the code in

00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation

00:09:11.740 speed as accurate timings are needed for load-balancing.

00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.

00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

00:09:11.740

00:09:11.740 Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps

00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1

00:09:11.740

00:09:11.740

00:09:11.740 Using 2 MPI processes

00:09:11.740 Using 4 OpenMP threads per MPI process

00:09:11.740

00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.

00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:

00:09:11.740 PP:0,PP:1

00:09:11.740

00:09:11.740 Non-default thread affinity set probably by the OpenMP library,

00:09:11.740 disabling internal thread affinity

00:09:11.740 starting mdrun 'spc-and-methanol'

00:09:11.740 100 steps, 0.1 ps (continuing from step 80, 0.1 ps).

00:09:11.740

00:09:11.740

00:09:11.740 Received the remote INT/TERM signal, stopping within 100 steps

00:09:11.740

00:09:11.740

00:09:11.740

00:09:11.740 Received the remote INT/TERM signal, stopping within 100 steps

00:09:11.740

00:09:11.740

00:09:11.740 NOTE: 47 % of the run time was spent communicating energies,

00:09:11.740 you might want to use the -gcom option of mdrun

00:09:11.740

00:09:11.740 Core t (s) Wall t (s) (%)

00:09:11.740 Time: 0.144 0.018 796.5

00:09:11.740 (ns/day) (hour/ns)

00:09:11.740 Performance: 9.589 2.503

00:09:11.740 Checkpoint file is from part 7, new output files will be suffixed '.part0008'.

00:09:11.740 The current CPU can measure timings more accurately than the code in

00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation

00:09:11.740 speed as accurate timings are needed for load-balancing.

00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.

00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

00:09:11.740

00:09:11.740 Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps

00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1

00:09:11.740

00:09:11.740

00:09:11.740 Using 2 MPI processes

00:09:11.740 Using 4 OpenMP threads per MPI process

00:09:11.740

00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.

00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:

00:09:11.740 PP:0,PP:1

00:09:11.740

00:09:11.740 Non-default thread affinity set probably by the OpenMP library,

00:09:11.740 disabling internal thread affinity

00:09:11.740 starting mdrun 'spc-and-methanol'

00:09:11.740 101 steps, 0.1 ps (continuing from step 81, 0.1 ps).

00:09:11.740

00:09:11.740 -------------------------------------------------------

00:09:11.740 Program: gmxapi-mpi-test, version 2019-dev-20181010-8a92135

00:09:11.740 Source file: src/gromacs/gpu_utils/gpuregiontimer.h (line 94)

00:09:11.740 Function: GpuRegionTimerWrapper<GpuRegionTimerImpl>::openTimingRegion(CommandStream) [with GpuRegionTimerImpl = GpuRegionTimerImpl]

00:09:11.740 MPI rank: 1 (out of 2)

00:09:11.740

00:09:11.740 Assertion failed:

00:09:11.740 Condition: debugState_ == TimerState::Idle

00:09:11.740 GPU timer should be idle, but is stopped.

00:09:11.740

00:09:11.740 For more information and tips for troubleshooting, please check the GROMACS

00:09:11.740 website at http://www.gromacs.org/Documentation/Errors

00:09:11.740 -------------------------------------------------------

00:09:11.740 --------------------------------------------------------------------------

00:09:11.740 MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD

00:09:11.740 with errorcode 1.

00:09:11.740

00:09:11.740 NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

00:09:11.740 You may or may not see output from other processes, depending on

00:09:11.740 exactly when Open MPI kills them.

00:09:11.740 --------------------------------------------------------------------------

00:09:11.740 --------------------------------------------------------------------------

00:09:11.740 mpiexec has exited due to process rank 1 with PID 19321 on

00:09:11.740 node bs-nix1310 exiting improperly. There are two reasons this could occur:

00:09:11.740

00:09:11.740 1. this process did not call "init" before exiting, but others in

00:09:11.740 the job did. This can cause a job to hang indefinitely while it waits

00:09:11.740 for all processes to call "init". By rule, if one process calls "init",

00:09:11.740 then ALL processes must call "init" prior to termination.

00:09:11.740

00:09:11.740 2. this process called "init", but exited without calling "finalize".

00:09:11.740 By rule, all processes that call "init" MUST call "finalize" prior to

00:09:11.740 exiting or it will be considered an "abnormal termination"

00:09:11.740

00:09:11.740 This may have caused other processes in the application to be

00:09:11.740 terminated by signals sent by mpiexec (as reported here).

00:09:11.740 --------------------------------------------------------------------------

I have reproduced this with a manual build on that host. Sometimes BasicMD also fails.

This looks like a GPU infrastructure issue with running multiple simulation parts within the same process; it is exposed by the API use case but probably not caused by it. I'm not sure why our existing continuation test cases don't trigger it; perhaps they are not run with multiple ranks per simulation.
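To make the suspected failure mode concrete, below is a minimal, hypothetical sketch of a three-state region timer (Idle -> Recording -> Stopped) where the Stopped -> Idle reset only happens when the timings are read back. If that readout is skipped at the end of one simulation part, the next part's first openTimingRegion() hits exactly the "should be idle, but is stopped" assertion in the log above. The class and method names here are illustrative assumptions only, not the real GpuRegionTimerWrapper from src/gromacs/gpu_utils/gpuregiontimer.h.

// Sketch only: a toy state machine mimicking the assertion reported above.
// Nothing here is copied from the GROMACS sources.
#include <cassert>
#include <cstdio>

class RegionTimerSketch
{
public:
    enum class TimerState { Idle, Recording, Stopped };

    void openTimingRegion()
    {
        // Mirrors the failing assertion: a region may only be opened from Idle.
        assert(state_ == TimerState::Idle && "GPU timer should be idle");
        state_ = TimerState::Recording;
    }
    void closeTimingRegion() { state_ = TimerState::Stopped; }
    // In this sketch, reading the accumulated timings is what returns the
    // timer to Idle; the real reset hook may be elsewhere.
    void readTimingsAndReset() { state_ = TimerState::Idle; }

private:
    TimerState state_ = TimerState::Idle;
};

int main()
{
    RegionTimerSketch timer;

    // Simulation part N: open and close a timing region, but never read the
    // timings back, so the timer stays in the Stopped state.
    timer.openTimingRegion();
    timer.closeTimingRegion();

    std::printf("starting next simulation part in the same process\n");

    // Simulation part N+1 reuses the same process and (in this hypothesis)
    // the same timer object: the Idle assertion fires here, as in the log.
    timer.openTimingRegion();
    return 0;
}

If that hypothesis holds, the fix direction would be to make sure GPU region timers are reset (or freshly constructed) whenever a new simulation part starts within the same process.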
