Error message
A typical error looks like the following:
xxx.py:xxx: RuntimeWarning: overflow encountered in reduce
xxx.py:xxx: RuntimeWarning: invalid value encountered in true_divide
...
xxx.cu:xxx: block: [xxx,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
xxx.cu:xxx: block: [xxx,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
xxx.cu:xxx: block: [xxx,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
xxx.cu:xxx: block: [xxx,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
xxx.cu:xxx: block: [xxx,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
xxx.cu:xxx: block: [xxx,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= xxx` failed.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
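As the message says, CUDA kernel errors are reported asynchronously, so the Python stack trace often points at an unrelated call. To get an accurate trace, force synchronous kernel launches while debugging (this slows execution, so don't leave it on). A minimal sketch; the script name train.py here is just a placeholder:
# Run with synchronous CUDA launches so the trace points at the failing kernel:
#   CUDA_LAUNCH_BLOCKING=1 python train.py
# Or set it from Python *before* CUDA is initialized:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"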
Cause
Whenever you see Assertion 'input_val >= zero && input_val <= xxx' failed., a numerical overflow has occurred. The overflow happens at the location where the RuntimeWarning: overflow encountered in true_divide warning appears.
A common case is passing values outside [0, 1] to binary_cross_entropy:
import torch
import torch.nn.functional as F

input = torch.Tensor([[1, 2], [2, 1]])   # values outside [0, 1]
target = torch.Tensor([[0, 1], [1, 0]])
loss = F.binary_cross_entropy(input, target)
If you run this on the CPU, the error is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-32fd2e5a82d3> in <module>
1 input=torch.Tensor([[1, 2], [2, 1]])
2 target=torch.Tensor([[0, 1], [1, 0]])
----> 3 loss = F.binary_cross_entropy(input, target)
D:\Miniconda3\envs\dl\lib\site-packages\torch\nn\functional.py in binary_cross_entropy(input, target, weight, size_average, reduce, reduction)
2913 weight = weight.expand(new_size)
2914
-> 2915 return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
2916
2917
RuntimeError: all elements of input should be between 0 and 1
If you run it on the GPU, you see:
In [6]: input=torch.Tensor([[1, 2], [2, 1]]).cuda()
...: target=torch.Tensor([[0, 1], [1, 0]]).cuda()
...: loss = F.binary_cross_entropy(input, target)
In [7]: C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:115: block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:115: block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Specifically, numerical overflow has the following causes:
1. The range of the model's outputs does not match the range of the targets
As in the example above, where the inputs to binary_cross_entropy fall outside the required [0, 1] range.
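The usual fix is to squash the raw outputs into [0, 1] with a sigmoid, or to use the logits variant of the loss. A minimal sketch showing both options; the with_logits form is the more numerically stable one:
import torch
import torch.nn.functional as F

logits = torch.Tensor([[1, 2], [2, 1]])    # raw outputs, unbounded
target = torch.Tensor([[0, 1], [1, 0]])

# Option 1: squash into [0, 1] first
loss = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Option 2: let the loss apply the sigmoid internally (more stable)
loss = F.binary_cross_entropy_with_logits(logits, target)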
2. Division by zero
Common in normalization code:
# Feature normalization
mean = feature.mean((0, 2, 3), keepdim=True)
std = feature.std((0, 2, 3), keepdim=True)
feature = (feature - mean) / (std + 1e-8)
Remember to add 1e-8 (a tiny epsilon) to the denominator.
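Without the epsilon, any channel with constant values has a standard deviation of exactly 0, and 0 / 0 yields nan. A quick demonstration:
import torch

feature = torch.ones(2, 3, 4, 4)                # constant feature map
mean = feature.mean((0, 2, 3), keepdim=True)
std = feature.std((0, 2, 3), keepdim=True)      # exactly 0
print((feature - mean) / std)                   # all nan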
3. sqrt(0)
Also common in normalization code and when constructing Normal distributions:
std = torch.var(x, dim=1, unbiased=False, keepdim=True).sqrt()
Again, add a tiny epsilon inside the square root:
std = torch.sqrt(torch.var(x, dim=1, unbiased=False, keepdim=True) + self.eps)
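Note that the forward pass of sqrt(0) is harmless (it returns 0); the real danger is the backward pass, because the derivative 1 / (2 * sqrt(x)) is infinite at x = 0, so inf gradients flow back and quickly turn into nan. A quick demonstration:
import torch

x = torch.zeros(3, requires_grad=True)
x.sqrt().sum().backward()
print(x.grad)                          # tensor([inf, inf, inf])

y = torch.zeros(3, requires_grad=True)
torch.sqrt(y + 1e-8).sum().backward()
print(y.grad)                          # large but finite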
4. Overflow caused by masked_array operations
If you use masked_array (numpy.ma.core.MaskedArray), check whether any plain numpy operations were applied to it. The session below shows np.concatenate applied to a masked_array, which silently drops the mask:
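For reference, the array in the session can be reproduced as follows (a minimal sketch; the underlying values of the masked entries are assumed to be 0, which matches the concatenate output below):
import numpy as np

data = np.ma.masked_array(
    data=[[0.0, 1.0, 2.0], [3.0, 4.0, 0.0], [0.0, 0.0, 0.0]],
    mask=[[False, False, False], [False, False, True], [True, True, True]],
)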
In [31]: data
Out[31]:
masked_array(
data=[[0.0, 1.0, 2.0],
[3.0, 4.0, --],
[--, --, --]],
mask=[[False, False, False],
[False, False, True],
[ True, True, True]],
fill_value=1e+20)
In [32]: np.concatenate([data, data])
Out[32]:
masked_array(
data=[[0., 1., 2.],
[3., 4., 0.],
[0., 0., 0.],
[0., 1., 2.],
[3., 4., 0.],
[0., 0., 0.]],
mask=False,
fill_value=1e+20)
At first glance this doesn't look like it could cause an overflow. But the masked entries may hold huge values under the hood, e.g.:
In [52]: arr
Out[52]:
masked_array(
data=[[--, --, --, --, --, --, --, --, --, --],
[--, --, --, --, --, --, --, --, --, --],
[--, --, --, --, --, --, --, --, --, --],
[--, --, --, --, --, --, --, --, --, --],
[--, --, --, --, --, --, --, --, --, --],
[--, --, --, --, --, --, --, --, --, --],
[-1.1238000392913818, -1.1208000183105469, -1.1162999868392944,
-1.1130000352859497, -1.1100000143051147, -1.1064000129699707,
-1.1026999950408936, -1.0999000072479248, -1.097599983215332,
-1.094099998474121],
[-1.124500036239624, -1.1211999654769897, -1.1164000034332275,
-1.1128000020980835, -1.1095999479293823, -1.1057000160217285,
-1.1019999980926514, -1.0995999574661255, -1.0972000360488892,
-1.093500018119812],
[-1.1246000528335571, -1.1208000183105469, -1.1155999898910522,
-1.111799955368042, -1.108299970626831, -1.1043000221252441,
-1.1003999710083008, -1.0978000164031982, -1.0953999757766724,
-1.0923000574111938],
[-1.1224000453948975, -1.118499994277954, -1.1129000186920166,
-1.1088999509811401, -1.1053999662399292, -1.1014000177383423,
-1.097499966621399, -1.0946999788284302, -1.0922000408172607,
-1.0896999835968018]],
mask=[[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
[ True, True, True, True, True, True, True, True, True,
True],
[False, False, False, False, False, False, False, False, False,
False],
[False, False, False, False, False, False, False, False, False,
False],
[False, False, False, False, False, False, False, False, False,
False],
[False, False, False, False, False, False, False, False, False,
False]],
fill_value=9.96921e+36,
dtype=float32)
In [53]: arr.data
Out[53]:
array([[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[ 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36,
9.96921e+36, 9.96921e+36],
[-1.12380e+00, -1.12080e+00, -1.11630e+00, -1.11300e+00,
-1.11000e+00, -1.10640e+00, -1.10270e+00, -1.09990e+00,
-1.09760e+00, -1.09410e+00],
[-1.12450e+00, -1.12120e+00, -1.11640e+00, -1.11280e+00,
-1.10960e+00, -1.10570e+00, -1.10200e+00, -1.09960e+00,
-1.09720e+00, -1.09350e+00],
[-1.12460e+00, -1.12080e+00, -1.11560e+00, -1.11180e+00,
-1.10830e+00, -1.10430e+00, -1.10040e+00, -1.09780e+00,
-1.09540e+00, -1.09230e+00],
[-1.12240e+00, -1.11850e+00, -1.11290e+00, -1.10890e+00,
-1.10540e+00, -1.10140e+00, -1.09750e+00, -1.09470e+00,
-1.09220e+00, -1.08970e+00]], dtype=float32)
In [54]: arr.sum()
Out[54]: -44.281998
In [55]: arr.data.sum()
D:\Miniconda3\envs\dl\lib\site-packages\numpy\core\_methods.py:47: RuntimeWarning: overflow encountered in reduce
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
Out[55]: inf
When a large batch of 9.96921e+36 fill values leaks past the mask like this, an operation such as sum() over them overflows. The fix is to audit every operation applied to the masked_array and use the np.ma counterparts instead of the plain np functions:
arr = np.ma.concatenate([arr, arr])
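With the np.ma version the mask is carried through, so reductions keep ignoring the masked entries:
merged = np.ma.concatenate([arr, arr])
print(merged.mask.any())   # True: the mask survived the concatenation
print(merged.sum())        # finite: fill values are still excluded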