Multi-node, multi-card distributed training and validation on ModelArts
[Steps & Symptoms]
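After training for a bit more than one epoch and running validation a dozen or so times, a problem occurred in EvalCallback when the model switched from the training network to the validation network and ran validation.
So far it looks like node 0 (8 Ascend 910 cards) completed model validation and reported the validation accuracy. Node 1 (8 Ascend 910 cards), however, reports errors and never prints the validation accuracy. The errors are as follows: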
[ERROR] HCCL_ADPT(113,fffdfd7fa160,python):2021-12-22-02:04:11.583.220 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(119,fffde1ffb160,python):2021-12-22-02:04:11.599.194 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(111,fffdfbfff160,python):2021-12-22-02:04:11.602.991 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(121,fffdecff9160,python):2021-12-22-02:04:11.616.258 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(117,fffe18ff9160,python):2021-12-22-02:04:11.629.527 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(115,fffde27fc160,python):2021-12-22-02:04:11.735.578 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(109,fffded7fa160,python):2021-12-22-02:04:11.760.305 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(107,fffdde7fc160,python):2021-12-22-02:04:11.933.854 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[Modelarts Service Log]2021-12-22 02:04:17,442 - ERROR - proc-rank-9-device-1 (pid: 109) has exited with non-zero code: -11
[Modelarts Service Log]2021-12-22 02:04:17,443 - INFO - Begin destroy training processes
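The exit code -11 in the service log above may be worth decoding. The sketch below assumes the launcher follows the convention used by Python's `subprocess` module on POSIX, where a negative return code -N means the child was terminated by signal N. The log itself does not confirm this, and `describe_exit_code` is a hypothetical helper name. Under that assumption, -11 would mean the process on rank 9 received SIGSEGV (a segmentation fault), rather than exiting through a Python exception.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process return code, assuming the subprocess convention
    that a negative value -N means 'terminated by signal N'."""
    if code == 0:
        return "exited normally"
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


# Exit code reported for proc-rank-9-device-1 in the log above.
print(describe_exit_code(-11))  # on Linux: killed by signal 11 (SIGSEGV)
```

If that reading is right, the HCCL_ADPT messages would be a consequence of the crash, and the process that segfaulted first is the one to investigate.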
These plog errors are reported because another card exited abnormally, and the error is identical on every card.
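To check that the errors really are identical and to find which rank exited, the log can be summarized with a short script. This is a minimal sketch: the regular expressions are written against the log lines shown above only, the helper name `summarize` and the 8-cards-per-node figure used to derive the node index are illustrative assumptions, and other log formats would need different patterns.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this post (not an official format).
HCCL_RE = re.compile(r"HCCL_ADPT\((\d+),.*ret = (\d+)")
EXIT_RE = re.compile(
    r"proc-rank-(\d+)-device-(\d+) \(pid: (\d+)\) "
    r"has exited with non-zero code: (-?\d+)"
)

CARDS_PER_NODE = 8  # assumption: 8 Ascend 910 cards per node, as in this job


def summarize(log_text: str) -> dict:
    """Collect HCCL_ADPT error pids/ret codes and reported process exits."""
    pids = set()
    ret_codes = Counter()
    exits = []
    for line in log_text.splitlines():
        m = HCCL_RE.search(line)
        if m:
            pids.add(int(m.group(1)))
            ret_codes[int(m.group(2))] += 1
            continue
        m = EXIT_RE.search(line)
        if m:
            rank, device, pid, code = map(int, m.groups())
            exits.append({
                "rank": rank,
                "node": rank // CARDS_PER_NODE,
                "device": device,
                "pid": pid,
                "code": code,
            })
    return {"hccl_pids": sorted(pids), "ret_codes": dict(ret_codes), "exits": exits}


SAMPLE = """\
[ERROR] HCCL_ADPT(113,fffdfd7fa160,python):2021-12-22-02:04:11.583.220 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[ERROR] HCCL_ADPT(109,fffded7fa160,python):2021-12-22-02:04:11.760.305 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860
[Modelarts Service Log]2021-12-22 02:04:17,442 - ERROR - proc-rank-9-device-1 (pid: 109) has exited with non-zero code: -11
"""

summary = summarize(SAMPLE)
print(summary)
```

A single distinct `ret` value across all pids supports the observation that every card reports the same error, and the `exits` entry identifies rank 9 (node 1, device 1, pid 109) as the process whose own log deserves a closer look.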