[MindSpore] [GPU parallel training] Works with 2 GPUs but fails with 4 or 8, V2.0: follow-up error

Problem:

[Functional module]

GPU configuration: 4 × NVIDIA P100

[Steps & symptoms]

1. Ran distributed training with: mpirun --allow-run-as-root -n 4 python resnet50_distributed_training_gpu.py

[Main function code of resnet50_distributed_training_gpu.py]

This produced an error (the same one as in https://bbs.huaweicloud.com/forum/thread-111788-1-1.html)

2. Worked around that error by starting the container with "docker run --shm-size=2gb"

3. A follow-up error then appeared: RuntimeError: mindspore/ccsrc/backend/session/kernel_build_client.h:141 GetScriptFilePath] popen failed, errno: 12
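
Step 2 depends on the container actually receiving the larger shared-memory allocation. As a minimal sketch (the helper name shm_report is hypothetical, and it assumes a Linux container where shared memory is mounted at /dev/shm), the size can be checked from inside the container before launching training:

```python
import shutil


def shm_report(path="/dev/shm"):
    """Return (total, free) capacity of the mount at `path`, in GiB.

    The default path assumes a Linux container; pass another path to
    inspect a different mount.
    """
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


# Example call inside the container:
# total, free = shm_report()
# print(f"/dev/shm total={total:.2f} GiB, free={free:.2f} GiB")
```

If the reported total is well below the value passed to --shm-size, the option probably did not take effect for that container.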

[Screenshots]

[Log information] (optional; upload log content or an attachment)

Traceback (most recent call last):
  File "resnet50_distributed_training_gpu.py", line 139, in <module>
    model.train(epoch_size, dataset, callbacks=[loss_cb,ckpoint_cb], dataset_sink_mode=False)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
    sink_size=sink_size)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 385, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 513, in _train_process
    outputs = self._train_network(*next_element)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 322, in __call__
    out = self.compile_and_run(*inputs)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 578, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 565, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 505, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/backend/session/kernel_build_client.h:141 GetScriptFilePath] popen failed, errno: 12

#

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18168,1],0]
  Exit code:    1
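
The "errno: 12" in the RuntimeError above can be decoded with Python's standard library. On Linux, 12 is ENOMEM ("Cannot allocate memory"), which suggests the popen call could not obtain memory to spawn its child process; this is only a reading of the error code, not a confirmed root cause:

```python
import errno
import os

code = 12  # value taken from "popen failed, errno: 12" in the log

# Map the numeric code to its symbolic name and the OS error message.
name = errno.errorcode.get(code, "unknown")
message = os.strerror(code)

print(f"errno {code}: {name} ({message})")
```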

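Solution:

Set export GLOG_v=1 (INFO-level logging), run the job again, and save the resulting log file so the failure can be analyzed further.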
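
One way to follow this advice is sketched below: it runs a command with GLOG_v set in the environment and writes combined stdout and stderr to a file. The helper name run_with_glog and the log file name are assumptions made for illustration; the mpirun command is the one from the post.

```python
import os
import subprocess


def run_with_glog(cmd, log_path, level="1"):
    """Run `cmd` with GLOG_v=`level` and save stdout+stderr to `log_path`.

    Returns the exit code of the command.
    """
    # Copy the current environment and add/override GLOG_v for the child.
    env = dict(os.environ, GLOG_v=level)
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


# Example call (log file name is hypothetical):
# run_with_glog(
#     ["mpirun", "--allow-run-as-root", "-n", "4",
#      "python", "resnet50_distributed_training_gpu.py"],
#     "train_glog_v1.log",
# )
```

Because mpirun normally passes its own environment on to locally launched ranks, setting the variable on the mpirun process should reach each training process on a single machine; a multi-node launch may need the variable exported explicitly.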
