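Problem description:
On a Slurm cluster, compute nodes kept switching to the DOWN state on their own, even though the status output of the slurmd service looked normal. Restarting slurmd helped for a while, but the nodes went DOWN again some time later.
Troubleshooting
1. Check the slurmctld service log. It contained the following errors: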
[2023-05-09T21:30:59.650] error: Orphan StepId=6639.extern reported on node node31
[2023-05-09T21:30:59.650] error: Orphan StepId=6639.0 reported on node node31
[2023-05-09T21:30:59.650] error: Orphan StepId=6652.extern reported on node node31
[2023-05-09T21:30:59.651] error: Orphan StepId=6652.0 reported on node node31
[2023-05-09T21:30:59.651] error: Orphan StepId=6653.extern reported on node node31
[2023-05-09T21:30:59.651] error: Orphan StepId=6653.0 reported on node node31
[2023-05-09T21:30:59.651] error: Orphan StepId=6651.extern reported on node node32
[2023-05-09T21:30:59.651] error: Orphan StepId=6649.extern reported on node node32
[2023-05-09T21:30:59.651] error: Orphan StepId=6650.extern reported on node node32
[2023-05-09T21:30:59.651] error: Orphan StepId=6651.0 reported on node node32
[2023-05-09T21:30:59.651] error: Orphan StepId=6650.0 reported on node node32
[2023-05-09T21:30:59.651] error: Orphan StepId=6649.0 reported on node node32
[2023-05-09T21:31:09.662] error: slurm_receive_msgs: [[node31]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:31:09.662] error: slurm_receive_msgs: [[node31]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:31:09.662] error: slurm_receive_msgs: [[node32]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:31:09.667] error: slurm_receive_msgs: [[node32]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:31:49.707] error: slurm_receive_msgs: [[node31]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:31:49.712] error: slurm_receive_msgs: [[node32]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:32:09.736] error: slurm_receive_msgs: [[node31]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:32:09.742] error: slurm_receive_msgs: [[node32]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:32:29.765] error: slurm_receive_msgs: [[node31]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:32:29.771] error: slurm_receive_msgs: [[node32]:6818] failed: Socket timed out on send/recv operation
[2023-05-09T21:32:39.789] error: Nodes node[31-32] not responding, setting DOWN
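This explains the DOWN state: the controller (slurmctld) itself set the nodes to DOWN after they stopped responding, and the underlying cause is the orphan Slurm steps reported just before the timeouts.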
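To see at a glance which steps are orphaned on which node, the log above can be summarised with a few lines of Python. This is only a minimal sketch: the regular expression is derived from the log lines shown here, and the function name `orphan_steps_by_node` is a hypothetical helper, not part of Slurm.

```python
import re
from collections import defaultdict

# Pattern taken from the slurmctld log lines shown above (an assumption:
# other Slurm versions may word this message differently).
ORPHAN_RE = re.compile(r"error: Orphan StepId=(\S+) reported on node (\S+)")

def orphan_steps_by_node(log_lines):
    """Group orphan step IDs by the node that reported them."""
    steps = defaultdict(set)
    for line in log_lines:
        match = ORPHAN_RE.search(line)
        if match:
            step_id, node = match.groups()
            steps[node].add(step_id)
    # Sort the step IDs so the result is stable and easy to read.
    return {node: sorted(ids) for node, ids in steps.items()}

if __name__ == "__main__":
    sample = [
        "[2023-05-09T21:30:59.650] error: Orphan StepId=6639.extern reported on node node31",
        "[2023-05-09T21:30:59.650] error: Orphan StepId=6639.0 reported on node node31",
        "[2023-05-09T21:30:59.651] error: Orphan StepId=6651.extern reported on node node32",
    ]
    for node, ids in orphan_steps_by_node(sample).items():
        print(node, ids)
```

In practice the lines would be read from the slurmctld log file, whose location depends on the SlurmctldLogFile setting of the site.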
2. Check the Slurm step processes on the compute nodes
The compute nodes did have abnormal Slurm step processes still running.
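One way to list these processes on a compute node is to filter the output of `ps`. The sketch below assumes that the step daemons appear with a process title of the form `slurmstepd: [6639.extern]`; that format and the helper names `parse_stepd` and `list_stepd` are assumptions for illustration and should be checked against the actual `ps` output on the node.

```python
import re
import subprocess

# Assumed process title format: "slurmstepd: [<jobid>.<step>]".
STEPD_RE = re.compile(r"^\s*(\d+)\s+slurmstepd: \[([^\]]+)\]")

def parse_stepd(ps_output):
    """Return (pid, step_id) pairs for slurmstepd lines in `ps -eo pid,args` output."""
    found = []
    for line in ps_output.splitlines():
        match = STEPD_RE.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found

def list_stepd():
    """Run ps on the local node and return the slurmstepd processes it shows."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_stepd(result.stdout)

if __name__ == "__main__":
    for pid, step_id in list_stepd():
        print(pid, step_id)
```

Comparing the step IDs found this way with the orphan StepIds in the slurmctld log shows which processes the controller no longer knows about.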
3. Kill the corresponding step processes
After killing them, keep monitoring the nodes to see whether they return to normal.
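The kill itself can be scripted once the PIDs are known. The following is a cautious sketch, not the exact procedure used here: `terminate_steps` is a hypothetical helper, it defaults to a dry run, and it sends SIGTERM unless another signal is passed in.

```python
import os
import signal

def terminate_steps(pids, sig=signal.SIGTERM, dry_run=True):
    """Send `sig` to each PID and return the PIDs that were (or would be) signalled."""
    signalled = []
    for pid in pids:
        if not dry_run:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                # The process exited before we got to it; nothing to do.
                continue
        signalled.append(pid)
    return signalled

if __name__ == "__main__":
    # Example PIDs are placeholders; with dry_run=True nothing is killed.
    print(terminate_steps([1234, 1240], dry_run=True))
```

A process that ignores SIGTERM may need `signal.SIGKILL`. Depending on the ReturnToService setting in slurm.conf, a node that was marked DOWN may also have to be resumed manually, for example with `scontrol update NodeName=node31 State=RESUME`.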
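4. Result
After a week of monitoring, the nodes did not go DOWN again, so the issue is considered resolved.
Solution:
Kill the corresponding step processes directly on the affected compute nodes.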