MindSpore Model Training Troubleshooting Roundup (Work in Progress)

熊熊dsh

Last modified: 2022-11-26 17:51:59

First published: 2022-11-14 11:06:18

Article link: https://blog.csdn.net/qq_22762933/article/details/127771498

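This document summarizes various problems encountered in MindSpore and their solutions, covering switching between dynamic and static graph modes, handling network models, using the AI explainers, running GradCAM, a deepcopy error, runtime errors, and data-handling tips. These examples show the process of debugging, optimizing, and troubleshooting MindSpore models.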


Q1: Breakpoints cannot be set inside def construct() for debugging.

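MindSpore can switch between dynamic and static graphs, so it has two modes: Graph mode (static graph) and PyNative mode (dynamic graph). Plenty of blog posts cover the basics of static versus dynamic graphs, so I won't go into them here. At the time of writing, MindSpore uses static graph mode by default, which makes debugging awkward; to debug comfortably, switch to PyNative mode.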

Reference: "Network Debugging" in the MindSpore master documentation

import mindspore
mindspore.set_context(mode=mindspore.PYNATIVE_MODE)

Q2: The network model is an odict_values

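odict_values is the values view of an ordered dictionary, and it shows up mainly in network definitions. It can be converted to a list, but the result differs from PyTorch: you cannot strip layers by simply dropping the last three entries, because here the whole ResNet is one entry, the average pooling layer is one entry, and the Dense layer is one entry.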
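As a small illustration of why the values view is awkward to work with, here is a sketch that uses only the standard library. The OrderedDict and its three entry names (backbone, pool, fc) are made-up stand-ins for a network's child cells, not MindSpore objects; the only point is that an odict_values view cannot be indexed or sliced until it is turned into a list.

```python
from collections import OrderedDict

# Hypothetical stand-in for a network's children: three coarse entries,
# mirroring the "ResNet / average pooling / Dense" layout described above.
children = OrderedDict([
    ("backbone", "ResNet"),
    ("pool", "AvgPool"),
    ("fc", "Dense"),
])

values = children.values()
view_type = type(values).__name__   # the view class is named "odict_values"

# The view supports iteration but not indexing or slicing.
try:
    values[0]
    indexable = True
except TypeError:
    indexable = False

# Converting to a list makes positional access possible.
entries = list(values)
without_head = entries[:-1]         # drop only the final (Dense) entry

print(view_type, indexable, without_head)
```

With entries this coarse, slicing the list removes whole sub-networks at a time, which is why the layer-by-layer trimming habit from PyTorch does not carry over directly.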

Q3: MindSpore XAI

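MindSpore XAI is MindSpore's AI explainer package; once installed it can be used directly. See the official instructions for how to install it. It includes the common explainers such as Gradient, GradCAM, Occlusion, and RISE.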

Reference: "Using CV Explainers" in the MindSpore master documentation

Q4: GradCAM raises AttributeError: 'tuple' object has no attribute 'shape'

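This one cost me half a day. Following the official XAI instructions, I installed the package and expected that calling GradCAM would be all it took. The official example ran fine, but it failed as soon as I switched to my own data and network. Based on advice found online I changed the data types several times and even modified the network's intermediate layers, yet the error stayed the same and occurred at the same place.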

File "/root/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore_xai/explainer/backprop/gradcam.py", line 147, in __call__
    weights = self._get_bp_weights(inputs, targets)
File "/root/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore_xai/explainer/backprop/gradient.py", line 112, in _get_bp_weights
    self._num_classes = output.shape[-1]
AttributeError: 'tuple' object has no attribute 'shape'

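Because the interpreter runs on a remote server, I could not set a breakpoint in that file; I only found the problem after adding print statements to it. My network is a multi-branch model that returns three results (three N*1 classification tensors), so the output is a tuple and the call to .shape fails. The fix was to change the model so that it returns a single output. It may need further refinement later, but for visualization alone multiple outputs are not needed anyway.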
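The single-output fix can also be sketched as a wrapper that leaves the network itself untouched. The sketch below is framework-agnostic and uses plain Python callables so that it runs anywhere; SingleOutput, fake_multi_branch_model, and the index argument are hypothetical names for illustration. In MindSpore the wrapper would presumably subclass nn.Cell and do the same selection inside construct.

```python
class SingleOutput:
    """Wrap a model that returns several outputs and expose just one of them."""

    def __init__(self, model, index=0):
        self.model = model
        self.index = index

    def __call__(self, *inputs):
        outputs = self.model(*inputs)
        # Multi-branch models return a tuple/list; pick one branch so that
        # callers expecting a single tensor (with .shape) get one.
        if isinstance(outputs, (tuple, list)):
            return outputs[self.index]
        return outputs


def fake_multi_branch_model(x):
    # Hypothetical stand-in for a three-branch network.
    return (x + 1, x + 2, x + 3)


wrapped = SingleOutput(fake_multi_branch_model, index=1)
result = wrapped(10)
print(result)
```

Selecting the branch by index keeps the training code free to use all three outputs while the explainer only ever sees one.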

Q5: copy.deepcopy raises an error

RuntimeError: Unable to cast Python instance of type <class 'models.base.BaseNet'> to C++ type 'std::shared_ptr<mindspore::tensor::Tensor>'

If I find out why, I will add the explanation here.

Q6: GradCAM raises an error

RuntimeError: Unsupported op [CellBackwardHook] on GPU, Please confirm whether the device target setting is correct, or refer to 'mindspore.ops' at https://www.mindspore.cn to query the operator support list.

This happened because I put the GradCAM call inside construct. If I find the underlying reason, I will share it.

Q7: RuntimeError: Get infer shape function failed, the operator is not support dynamic shape yet, primitive name:ResizeBilinear primitive type:Primitive

Even a simple tensor upsampling fails, which left me speechless. For a comparison with PyTorch, see the official site.

Reference: "Function Differences with torch.nn.Upsample" in the MindSpore master documentation

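Supporting only linear upsampling would be one thing, but it also raises an error saying dynamic shapes are not supported. What F.interpolate and similar functions handle directly in PyTorch may need a different approach in MindSpore.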

Q8: tolist() cannot be used on a Tensor, and item() still returns a tensor. All I wanted was the value inside the tensor, and not being able to read it in the middle of a computation was painful.

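The workaround I ended up with was to convert the tensor to NumPy first. The conversion method is not quite the same as in PyTorch either, but at least once it is a NumPy array I can get the value out.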

tensor.asnumpy()
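Once the tensor has been converted, the usual NumPy accessors work. The sketch below starts from a NumPy array standing in for the result of tensor.asnumpy() (the array contents and variable names are made up), so it runs without MindSpore installed.

```python
import numpy as np

# Stand-in for `tensor.asnumpy()`, which returns a numpy.ndarray.
scores = np.array([[0.1, 0.7, 0.2]], dtype=np.float32)

best_class = int(scores.argmax(axis=1)[0])   # plain Python int
best_score = scores[0, best_class].item()    # plain Python float
as_list = scores.tolist()                    # nested Python list

print(best_class, type(best_score).__name__, type(as_list).__name__)
```

The values are copied to host memory at the asnumpy() step, so this is best kept out of hot loops.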

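Q9: Calling GradCAM from the XAI package repeatedly in a loop leads to memory overflow, and each call takes longer and longer... I filed an issue on Gitee; I don't know whether it will get fixed.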

Q10: Sign, ArgMaxWithValue, ArgMinWithValue. The full story of solving one problem; exhausting.

At first I suspected the Sign op, which I was calling as follows, so I wrote my own replacement.

from mindspore import ops

sign = ops.Sign()
mask = sign(sign(mask-rate)+1)  # 1 where mask >= rate, otherwise 0

[WARNING] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.830.486 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'Cast', the shape of input cannot contain zero, but got (0, 3)
[ERROR] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.830.693 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[WARNING] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.830.948 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'Cast', the shape of input cannot contain zero, but got (0, 3)
[ERROR] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.831.140 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[WARNING] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.870.423 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'ExpandDims', the shape of input cannot contain zero, but got (3, 0, 0)
[WARNING] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.871.143 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'ResizeBilinear', the shape of input cannot contain zero, but got (1, 3, 0, 0)
[ERROR] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.871.313 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[ERROR] KERNEL(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.871.326 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/memcpy_gpu_kernel.cc:62] Launch] cudaMemcpyAsync error in MemcpyGpuKernelMod::Launch, error code is 1
[ERROR] DEVICE(3802945,7f7cd7bb8340,python):2022-11-19-03:15:00.871.336 [mindspore/ccsrc/plugin/device/gpu/hal/hardware/gpu_device_context.cc:601] LaunchKernel] Launch kernel failed, kernel full name: Default/Squeeze-op31444

My own version, shown below, still produced the same errors, so Sign was apparently not the problem. Then where was it going wrong?

import numpy as np
import mindspore as ms

def sign(tensor):
    # Round-trip through NumPy instead of calling ops.Sign
    output = np.sign(tensor.asnumpy())
    return ms.Tensor(output)

[WARNING] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.013.948 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'Cast', the shape of input cannot contain zero, but got (0, 3)
[ERROR] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.014.188 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[WARNING] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.014.458 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'Cast', the shape of input cannot contain zero, but got (0, 3)
[ERROR] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.014.663 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[WARNING] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.047.806 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'ExpandDims', the shape of input cannot contain zero, but got (3, 0, 0)
[WARNING] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.048.540 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_common.h:274] CheckShapeNull] For 'ResizeBilinear', the shape of input cannot contain zero, but got (1, 3, 0, 0)
[ERROR] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.048.722 [mindspore/ccsrc/kernel/kernel.h:341] GetDeviceAddress] The size of device address is zero, address index: 0, and the length of 'addr_list' is 1
[ERROR] KERNEL(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.048.736 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/memcpy_gpu_kernel.cc:62] Launch] cudaMemcpyAsync error in MemcpyGpuKernelMod::Launch, error code is 1
[ERROR] DEVICE(3808186,7f6c2e7e1340,python):2022-11-19-03:57:30.048.746 [mindspore/ccsrc/plugin/device/gpu/hal/hardware/gpu_device_context.cc:601] LaunchKernel] Launch kernel failed, kernel full name: Default/Squeeze-op22612

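It later turned out that the failure was not in sign at all but in the argmax and argmin I called afterwards: if I skipped them and set the values by hand, nothing failed. (At this point I still had not realized that the real cause was that there was no data to read.)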

argmax_0 = ops.ArgMaxWithValue(axis=0)  # argmax along axis 0
argmin_0 = ops.ArgMinWithValue(axis=0)  # argmin along axis 0
reshape = ops.Reshape()  # reshape op

mask = sign(sign(cammap-rate)+1)  # double-sign trick so that mask contains only 0/1
mask = reshape(mask, (mask.shape[0], 1, 448, 448))  # reshape to (N, 1, 448, 448)

for k in range(data_shape):  # process each element of the batch in turn
    indices = mask[k].nonzero()
    _, indices_min = argmin_0(indices.astype(ms.float16))  # fails here
    _, indices_max = argmax_0(indices.astype(ms.float16))  # fails here

ArgMinWithValue and ArgMaxWithValue look alike, but when I called them, ArgMinWithValue required the tensor dtype to be mindspore.float16 or float32 and rejected integer types, whereas ArgMaxWithValue accepted every type.

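Later I tried taking the max and min with NumPy, which was clearly more comfortable to use, and with its error message I finally understood what was going on: the array was simply empty when min or max was taken.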

min_numpy = indices_numpy.min(axis=0)
max_numpy = indices_numpy.max(axis=0)

ValueError: zero-size array to reduction operation minimum which has no identity

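A guard should fix that, so I added a check that runs the reduction only when the array is not empty. That raised yet another error, because the array holds several elements and its truth value is ambiguous:

ValueError: The truth value of an array with several elements is ambiguous.

if indices_numpy: # ambiguous: indices_numpy has several elements, so NumPy asks for .any() or .all()
    ...
else:
    ...

Adding .any() or .all(), as the message suggests, finally got rid of the error:

if indices_numpy.any():
    ...
else:
    ...

One caveat: .any() and .all() test the element values rather than emptiness (.all() even returns True for an empty array), so indices_numpy.size > 0 is the more reliable emptiness check.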
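Putting the pieces of Q10 together, the whole computation can be sketched in NumPy: threshold the map with the double-sign trick, collect the nonzero coordinates, and reduce only when there is at least one. The function name bounding_box, the toy 4x4 map, and the choice to return None for an empty mask are assumptions for illustration, not the original code. np.argwhere is used because it returns one row of coordinates per nonzero element.

```python
import numpy as np


def bounding_box(cam, rate):
    """Return (min_coords, max_coords) of the region where cam >= rate, or None."""
    # sign(cam - rate) is -1/0/+1; adding 1 gives 0/1/2; a second sign gives 0/1/1.
    mask = np.sign(np.sign(cam - rate) + 1)
    indices = np.argwhere(mask)          # shape (num_nonzero, cam.ndim)
    if indices.size == 0:                # empty: min()/max() would raise ValueError
        return None
    return indices.min(axis=0), indices.max(axis=0)


cam = np.zeros((4, 4), dtype=np.float32)
cam[1:3, 2:4] = 0.9

box = bounding_box(cam, rate=0.5)
empty = bounding_box(cam, rate=2.0)      # nothing reaches the threshold
print(box, empty)
```

Checking the size before reducing turns the zero-shape case into an explicit branch instead of a kernel launch failure further down the pipeline.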

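With that, the problem was solved. After going round in circles, it turned out that the ops in MindSpore were not at fault, but the error messages were so hard to interpret that I spent a great deal of time on things that led nowhere.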

Q11: Concat raises an error saying the input must be a tensor, not an AbstractTensor

TypeError: For 'Concat', the input must be a list or tuple of tensors. But got：AbstractTensor(shape: (16, 1176), element: AbstractScalar(Type: Float32, Value: AnyValue, Shape: NoShape), value_ptr: 0x55889f633670, value: AnyValue).

Q12: RuntimeError: cuDNN Error: cudnnConvolutionForward failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED

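This error seems to be common. In my case the cause was that another program was already running and had filled the GPU memory. I am recording it here so that I am not left puzzled the next time something similar comes up.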
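One way to catch this before launching a job is to look at GPU memory usage first. The sketch below parses the CSV that nvidia-smi prints for --query-gpu=memory.used,memory.total; the helper names, the 90% threshold, and the sample text are assumptions for illustration, and the parser is exercised on a hard-coded sample so that it runs on machines without a GPU.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage in MiB (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


def busy_gpus(usage, threshold=0.9):
    """Indices of GPUs whose memory is at least `threshold` full."""
    return [i for i, (used, total) in enumerate(usage) if used / total >= threshold]


# Hypothetical sample output for two GPUs; the first one is nearly full.
sample = "23950, 24576\n512, 24576\n"
usage = parse_gpu_memory(sample)
print(usage, busy_gpus(usage))
```

On a real machine, busy_gpus(query_gpu_memory()) would list the devices to avoid before starting a second training run.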