MindSpore model training: GPU initialization error with ModelZoo SEResNext50_32*4d

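An error occurred during GPU training with MindSpore 1.5.0-rc1, and it centers on the Split operator. The error message indicates that the attribute output_num exceeds the maximum number of pieces the input data can be split into. After adjusting the batch size and the group size, training ran normally only when group was set to 7. The problem reproduces both in a local Ubuntu 18.04 environment and on ModelArts. Possible solutions involve adjusting the network structure or changing the Split operator's parameter settings.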

MindSpore version: 1.5.0-rc1

Ubuntu 18.04

Python 3.7.5

GPU, CUDA 10.1

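[Steps & symptoms]

1. After changing the batch size to 32 and updating the dataset path, running directly fails with "Attr output_num 32must less than28". After changing group to 16, it fails with "Attr output_num 16must less than14". Training only runs normally after changing group to 7.

2. Running on ModelArts produces the same error as on my own machine, and likewise only works after changing group to 7. Configuration: GPU: 1*NVIDIA-V100(32GB) | CPU: 8 cores, 64GB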

[ERROR] KERNEL(3516,7f24a92a2740,python):2021-10-23-20:03:05.062.308 [mindspore/ccsrc/backend/kernel_compiler/gpu/arrays/split_gpu_kernel.h:144] CheckParam] Attr output_num 32must less than28
[EXCEPTION] DEVICE(3516,7f24a92a2740,python):2021-10-23-20:03:05.062.651 [mindspore/ccsrc/runtime/device/gpu/gpu_kernel_build.cc:63] CreateGPUKernel] Initialize gpu kernel op[Default/network-TrainOneStepCell/network-WithLossCell/_backbone-SENet/layer2-SequentialCell/1-SEResNeXtBottleneck/conv2-GroupConv/Split-op137405] failed.
Traceback (most recent call last):
  File "/home/zxm/PycharmProjects/pythonProject3/train.py", line 288, in
    model.train(cfg.epoch_size, dataset, callbacks=cbs)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
    sink_size=sink_size)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 404, in __call__
    out = self.compile_and_run(*inputs)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 682, in compile_and_run
    self.compile(*inputs)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 669, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/home/zxm/.local/lib/python3.7/site-packages/mindspore/common/api.py", line 542, in compile
    result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
RuntimeError: mindspore/ccsrc/runtime/device/gpu/gpu_kernel_build.cc:63 CreateGPUKernel] Initialize gpu kernel op[Default/network-TrainOneStepCell/network-WithLossCell/_backbone-SENet/layer2-SequentialCell/1-SEResNeXtBottleneck/conv2-GroupConv/Split-op137405] failed.


Answer:

The key error messages are:

_backbone-SENet/layer2-SequentialCell/1-SEResNeXtBottleneck/conv2-GroupConv/Split

split_gpu_kernel.h:144] CheckParam] Attr output_num 32 must less than28

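The error means that your network uses the Split operator, and input_x.shape()[axis] for that operator is 28, but output_num is set to 32. That exceeds the maximum number of pieces the input can be split into along the axis dimension, hence the error.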
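The check described above can be sketched in plain Python. This only illustrates the condition the log reports; it is not MindSpore's kernel code. The helper name check_split_param and the example shape are assumptions made for the example, and treating equality as allowed is also an assumption.

```python
def check_split_param(input_shape, axis, output_num):
    """Return (ok, message) for the size check described by the error.

    Hypothetical helper: it mirrors only the condition in the log
    (output_num must not exceed input_shape[axis]).
    """
    dim = input_shape[axis]
    if output_num > dim:
        return False, f"Attr output_num {output_num} exceeds {dim}"
    return True, ""


# Illustrative shape whose split axis has size 28, as in the reported error.
shape = (1, 28)
print(check_split_param(shape, axis=1, output_num=32))  # rejected: 32 > 28
print(check_split_param(shape, axis=1, output_num=7))   # accepted: 7 <= 28
```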

Suggestion: debug the network structure, or modify the network configuration parameters.
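When adjusting the configuration, one way to narrow down usable values is to compare candidate group counts against the limits that appeared in the two error messages (28 and 14). A minimal sketch, assuming those two limits are the only ones that matter, which the logs do not confirm; REPORTED_LIMITS and groups_passing are hypothetical names.

```python
# Limits taken from the two error messages in the report.
REPORTED_LIMITS = [28, 14]


def groups_passing(candidates, limits=REPORTED_LIMITS):
    """Keep the group counts that do not exceed any reported limit."""
    return [g for g in candidates if all(g <= limit for limit in limits)]


print(groups_passing([32, 16, 7]))  # prints [7]
```

This agrees with the report: 32 fails against 28, 16 passes 28 but fails against 14, and 7 passes both. Other layers in the network may impose further limits that this sketch does not model.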

The Split operator's interface is documented here:

https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.Split.html#mindspore.ops.Split
