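Problem description:
I am running graph kernel fusion on an Ascend 910A with MindSpore 1.1.2; the network is ResNet-50. With identical parameters the program trains normally without any problem, but once (enable_graph_kernel=True) is added, the training loss sometimes goes to negative infinity and sometimes becomes NaN, and in both cases the run then fails with an error.
The full error output is as follows: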
amp_level: O2
WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version, use 'Depend' instead.
epoch: 1 step: 1, loss is 2.2937918
WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version, use 'Depend' instead.
epoch: 1 step: 71, loss is 4803.1064
epoch: 1 step: 72, loss is 4034.7324
epoch: 1 step: 142, loss is -5.3169115e+37
epoch: 1 step: 143, loss is 2.3025851
epoch: 1 step: 213, loss is -3.4028235e+38
epoch: 1 step: 214, loss is 2.3025851
[ERROR] RUNTIME(75982)model execute error, retCode=0x91, [the model stream execute failed].
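To make it easier to see at which step the loss first goes wrong, the log above can be scanned with a small script. This is only a rough sketch, not part of MindSpore: the function name `find_suspect_losses` and the `limit` threshold of 100 are my own assumptions, and the regular expression simply assumes the `epoch: N step: N, loss is X` format shown in the log.

```python
import math
import re

# Largest finite float32 magnitude; the log's -3.4028235e+38 is its negative.
FLOAT32_MAX = 3.4028235e38

# Assumed log format: "epoch: 1 step: 71, loss is 4803.1064"
LOSS_LINE = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*(\S+)")


def find_suspect_losses(log_text, limit=100.0):
    """Return (epoch, step, loss, reason) for every suspicious loss line.

    `limit` is a hypothetical threshold for an unreasonably large loss;
    tune it for your own network.
    """
    suspects = []
    for line in log_text.splitlines():
        match = LOSS_LINE.search(line)
        if match is None:
            continue  # warnings, errors and other non-loss lines
        epoch, step = int(match.group(1)), int(match.group(2))
        try:
            loss = float(match.group(3))  # float() also parses "nan" and "-inf"
        except ValueError:
            continue
        if math.isnan(loss):
            reason = "nan"
        elif math.isinf(loss) or abs(loss) >= FLOAT32_MAX * 0.999:
            reason = "overflow"  # at or beyond the float32 range
        elif loss < 0:
            reason = "negative"
        elif loss > limit:
            reason = "large"
        else:
            continue
        suspects.append((epoch, step, loss, reason))
    return suspects


# Loss lines copied from the log in this post.
SAMPLE_LOG = """\
epoch: 1 step: 1, loss is 2.2937918
epoch: 1 step: 71, loss is 4803.1064
epoch: 1 step: 72, loss is 4034.7324
epoch: 1 step: 142, loss is -5.3169115e+37
epoch: 1 step: 143, loss is 2.3025851
epoch: 1 step: 213, loss is -3.4028235e+38
epoch: 1 step: 214, loss is 2.3025851
"""

if __name__ == "__main__":
    for epoch, step, loss, reason in find_suspect_losses(SAMPLE_LOG):
        print(f"epoch {epoch} step {step}: loss={loss} ({reason})")
```

On the log above, this flags steps 71, 72, 142 and 213, so the loss is already abnormal well before the runtime error appears.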