1 Error description
1.1 System environment
Hardware Environment (Ascend/GPU/CPU): Ascend
Software Environment:
-- MindSpore version (source or binary): 1.6.0
-- Python version (e.g., Python 3.7.5): 3.7.6
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
-- GCC/Compiler version (if compiled from source):
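1.2 Basic information
1.2.1 Script
The training script builds a small network around the Abs operator: it applies Sub to two input tensors and then computes Abs of the result. The script is as follows: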
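import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.abs = ops.Abs()

    def construct(self, x1, x2):
        # Sub runs first (x1 - x2), then Abs is applied to the difference
        output = self.abs(x1 - x2)
        return output

net = Net()
# The two inputs differ in their first dimension: 2 vs. 3
x1 = Tensor(np.ones((2, 5), dtype=np.float32), mindspore.float32)
x2 = Tensor(np.ones((3, 5), dtype=np.float32), mindspore.float32)
out = net(x1, x2)
print('out:', out.shape)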
2 Error message
The error message is as follows:
The function call stack (See file '/demo/rank_0/om/analyze_fail.dat' for more details):
# 0 In file demo.py(7)
output = self.abs(x1 - x2)
^
Traceback (most recent call last):
File "demo.py", line 13, in <module>
out = net(x1,x2)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 576, in __call__
out = self.compile_and_run(*args)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 942, in compile_and_run
self.compile(*inputs)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 915, in compile
_cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "/lib/python3.7/site-packages/mindspore/common/api.py", line 791, in compile
result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
File "/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 575, in __infer__
out[track] = fn(*(x[track] for x in args))
File "/lib/python3.7/site-packages/mindspore/ops/operations/math_ops.py", line 78, in infer_shape
return get_broadcast_shape(x_shape, y_shape, self.name)
File "/lib/python3.7/site-packages/mindspore/ops/_utils/utils.py", line 70, in get_broadcast_shape
raise ValueError(f"For '{prim_name}', {arg_name1}.shape and {arg_name2}.shape are supposed "
ValueError: For 'Sub', x.shape and y.shape are supposed to broadcast, where broadcast means that x.shape[i] = 1 or -1 or y.shape[i] = 1 or -1 or x.shape[i] = y.shape[i], but now x.shape and y.shape can not broadcast, got i: -2, x.shape: [2, 5], y.shape: [3, 5].
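Cause analysis
The key line is the ValueError: For 'Sub', x.shape and y.shape are supposed to broadcast, where broadcast means that x.shape[i] = 1 or -1 or y.shape[i] = 1 or -1 or x.shape[i] = y.shape[i]. Although the failing line of user code calls Abs, the operator that raises is Sub: its two operands, x1 and x2, cannot be broadcast against each other. Broadcasting requires that, for every dimension i, either x.shape[i] equals y.shape[i] or one of the two is 1 (or -1).
The message then gives the specifics: but now x.shape and y.shape can not broadcast, got i: -2, x.shape: [2, 5], y.shape: [3, 5]. At i = -2, the second-to-last dimension, the sizes are 2 and 3. They are not equal and neither is 1, so the shapes cannot be broadcast. This is the cause of the error.
The official MindSpore documentation states this input constraint for Sub: when both inputs are tensors, their shapes must be broadcastable. Identical shapes are only the simplest case that satisfies it. Many other two-input operators also broadcast their inputs, so the same constraint applies to them.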
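The rule quoted in the error message can be sketched in plain Python. The helper below, find_broadcast_conflict, is a hypothetical name introduced here for illustration and is not part of MindSpore. It walks both shapes from the trailing dimension and returns the first negative index where the sizes conflict. It ignores the -1 case mentioned in the message and assumes fully known shapes.

```python
def find_broadcast_conflict(x_shape, y_shape):
    """Return the negative index of the first non-broadcastable dimension, or None."""
    # Compare dimensions from the trailing end, over the dimensions both shapes have.
    for i in range(-1, -min(len(x_shape), len(y_shape)) - 1, -1):
        x_dim, y_dim = x_shape[i], y_shape[i]
        # A pair is compatible when the sizes match or either size is 1.
        if x_dim != y_dim and x_dim != 1 and y_dim != 1:
            return i
    return None


print(find_broadcast_conflict((2, 5), (3, 5)))  # -2
print(find_broadcast_conflict((5,), (3, 5)))    # None
```

For the shapes in this report the helper returns -2, which matches the "got i: -2" part of the error message.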
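3 Solution
Based on the cause identified above, the script can be modified as follows.
Example 1:
Execution now succeeds, and the output is as follows: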
out: (3, 5)
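One modification that produces a result of this shape is to give both operands the same shape. The snippet below is only a sketch of that idea, under the assumption that x1 is changed to shape (3, 5). It uses NumPy instead of MindSpore so that it runs without an Ascend device; NumPy applies the same per-dimension broadcasting rule.

```python
import numpy as np

# Assumption: x1 is given the same shape as x2, so every dimension matches.
x1 = np.ones((3, 5), dtype=np.float32)
x2 = np.ones((3, 5), dtype=np.float32)

# Same computation as the network: subtract, then take the absolute value.
out = np.abs(x1 - x2)
print('out:', out.shape)  # out: (3, 5)
```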
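Example 2:
import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.abs = ops.Abs()

    def construct(self, x1, x2):
        output = self.abs(x1 - x2)
        return output

net = Net()
# x1 has shape (5,), which broadcasts against (3, 5)
x1 = Tensor(np.ones((5,), dtype=np.float32), mindspore.float32)
x2 = Tensor(np.ones((3, 5), dtype=np.float32), mindspore.float32)
out = net(x1, x2)
print('out:', out.shape)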
Execution now succeeds, and the output is as follows:
out: (3, 5)
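Example 2 works because a shape with fewer dimensions is treated as if it were padded with leading 1s, so (5,) behaves like (1, 5). The sketch below computes the broadcast result shape under that rule. The function broadcast_shape is a hypothetical helper written for this illustration, not a MindSpore API, and it assumes fully known (non-dynamic) shapes.

```python
from itertools import zip_longest


def broadcast_shape(x_shape, y_shape):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Pair dimensions from the trailing end; a missing leading dimension counts as 1.
    pairs = zip_longest(reversed(x_shape), reversed(y_shape), fillvalue=1)
    for x_dim, y_dim in pairs:
        if x_dim != y_dim and x_dim != 1 and y_dim != 1:
            raise ValueError(
                f"cannot broadcast {tuple(x_shape)} with {tuple(y_shape)}"
            )
        # A size-1 dimension stretches to the size of the other operand.
        result.append(y_dim if x_dim == 1 else x_dim)
    return tuple(reversed(result))


print(broadcast_shape((5,), (3, 5)))    # (3, 5)
print(broadcast_shape((1, 5), (3, 5)))  # (3, 5)
```

Calling broadcast_shape((2, 5), (3, 5)) raises ValueError, which mirrors the failure in the original script.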
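4 Summary
Steps for locating this kind of error:
1. Find the line of user code that triggers the error: output = self.abs(x1 - x2);
2. Use the keywords in the error log to narrow down the problem: x.shape: [2, 5], y.shape: [3, 5];
3. Pay particular attention to whether the variables involved are defined and initialized correctly, here the shapes of x1 and x2.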