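Problem:
[Functional module]
GPU configuration: 4 × NVIDIA P100
[Steps and symptoms]
1. Ran distributed training with: mpirun --allow-run-as-root -n 4 python resnet50_distributed_training_gpu.py
[Main function code of resnet50_distributed_training_gpu.py]
This failed with the same error as in this thread: https://bbs.huaweicloud.com/forum/thread-111788-1-1.html
2. Resolved that error by starting the container with "docker run --shm-size=2gb".
3. Then hit a follow-up error: RuntimeError: mindspore/ccsrc/backend/session/kernel_build_client.h:141 GetScriptFilePath] popen failed, errno: 12
[Screenshots]
[Log information] (optional; paste the log content or upload an attachment)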
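The errno value in the message above can be decoded with Python's standard library. The snippet below is a minimal sketch; the helper name describe_errno is hypothetical and is not part of MindSpore. On Linux, errno 12 is ENOMEM ("Cannot allocate memory"), which suggests that the popen call could not obtain the memory it needed to start a child process inside the container.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return a readable description of an OS error number.

    Hypothetical helper for illustration; not a MindSpore API.
    """
    # errno.errorcode maps numeric codes to symbolic names such as "ENOMEM".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return f"errno {code}: {name} ({os.strerror(code)})"


# Decode the value reported by "popen failed, errno: 12".
print(describe_errno(12))
```

If this reports ENOMEM on your machine as well, checking the memory limits of the container and the host is a reasonable next step, though the log requested in the solution is still needed to confirm the cause.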
Traceback (most recent call last):
File "resnet50_distributed_training_gpu.py", line 139, in <module>
model.train(epoch_size, dataset, callbacks=[loss_cb,ckpoint_cb], dataset_sink_mode=False)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
sink_size=sink_size)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 385, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 513, in _train_process
outputs = self._train_network(*next_element)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 322, in __call__
out = self.compile_and_run(*inputs)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 578, in compile_and_run
self.compile(*inputs)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 565, in compile
_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 505, in compile
result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/backend/session/kernel_build_client.h:141 GetScriptFilePath] popen failed, errno: 12
#
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[18168,1],0]
Exit code: 1
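Solution:
Run export GLOG_v=1 to raise the log verbosity, rerun the training job, and save the resulting log file so the failure can be analyzed.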
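The rerun can also be scripted so that the log is captured in one place. This is a sketch under assumptions: the helper name run_with_glog and the log file name are hypothetical, the ranks are assumed to inherit the environment from mpirun on a single node, and MindSpore is assumed to write its log to the console (stdout/stderr) rather than to files.

```python
import os
import subprocess
import sys

# The launch command from the original post.
TRAIN_CMD = [
    "mpirun", "--allow-run-as-root", "-n", "4",
    "python", "resnet50_distributed_training_gpu.py",
]


def run_with_glog(cmd, log_path, level="1"):
    """Run cmd with GLOG_v set and write stdout and stderr to log_path.

    Hypothetical helper for illustration; returns the exit code of cmd.
    """
    # Copy the current environment and add GLOG_v for the child process.
    env = dict(os.environ, GLOG_v=level)
    with open(log_path, "w") as log:
        # Merge stderr into stdout so the whole console output lands in one file.
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

Calling run_with_glog(TRAIN_CMD, "train_glog_v1.log") inside the container should produce a single file that can be attached to the thread. For a multi-node launch, the variable would likely need to be forwarded explicitly by the MPI launcher.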