Reference code: https://github.com/eriklindernoren/PyTorch-YOLOv3
1. Loss becomes NaN during training
Troubleshooting steps:
- Lowered the learning rate in yolov3.cfg; the problem persisted.
- Since the loss had been modified, suspected the modified loss itself; weighting the loss terms did not solve it either:
total_loss = 0.8 * (loss_x + loss_y + loss_w + loss_h) + loss_conf + loss_cls \
+ 0.2 * (vis_loss_x + vis_loss_y + vis_loss_w + vis_loss_h)
- Switched the optimizer from Adam to SGD; still not solved:
# optimizer = torch.optim.Adam(model.parameters())
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
- Trained with only the newly added loss and found that it also produces NaN values.
- Current conclusion: training with only the original loss is fine, but adding the new loss, or training with the new loss alone, produces NaN, so the problem lies in the new loss. The reference blog 训练网络loss出现Nan解决办法 lists, among other causes:
3. possibly a division by zero;
4. possibly taking the natural log of zero or a negative number
So the suspicion was that one of the two situations above occurred before computing the new loss, since the code does use a log:
tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
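A minimal, self-contained demo (not the repository's code) of why this line can go wrong: the + 1e-16 only protects against a ratio that is exactly zero, while a negative ratio still makes torch.log return nan, which then propagates into the total loss.
import torch
# Hypothetical gw / anchor_w ratios: positive, zero, and negative.
ratios = torch.tensor([1.07, 0.54, 0.0, -0.2])
print(torch.log(ratios + 1e-16))
# tensor([  0.0677,  -0.6162, -36.8414,      nan])
# Only the exact-zero case is rescued by 1e-16; the negative ratio yields nan.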
Added debug prints:
print("gw= ", gw)
print("gh= ", gh)
In the normal case, the widths and heights fed to log are all positive, so the loss is computed correctly:
gw= tensor([1.0731, 1.0153, 2.6307, 1.2577, 1.0847, 2.1346, 1.8000, 2.4116, 1.5115,
2.7115, 1.8462, 0.7731, 0.6462, 0.6000, 0.7847, 1.4538, 1.0385, 0.5885,
1.3154, 1.1538, 0.8885, 0.4962, 0.9347, 1.8346, 0.9693, 0.6798, 0.4425,
0.5504, 0.3453, 0.6259, 0.7015, 0.3884, 0.6151, 0.3992, 0.4748, 0.1403,
0.5288, 0.1511, 0.1835, 0.1403, 0.2266, 0.1403, 0.1619, 0.0539, 0.3884,
0.0863, 0.3777, 0.1079, 0.0863, 0.2482, 0.0755, 0.0647, 0.0863, 0.0431,
0.0755, 0.4333, 2.2632, 3.3708, 1.3484, 1.5168, 2.1429, 1.9503, 2.0706,
1.3484, 2.2632, 3.0338, 0.4815, 2.2929, 2.1215, 1.2643, 2.2285, 1.2000,
3.7285, 1.6286, 3.0644, 2.1215, 1.7143, 1.2429, 2.4429, 2.0357, 1.9071,
0.4875, 0.5175, 0.7163, 0.7612, 1.4288, 0.6600, 0.3750, 0.7575, 1.2300,
0.9412, 0.7912, 1.1737, 1.3688, 0.9188, 0.3713, 0.5437, 1.3425, 1.8713,
0.6450, 0.6525, 0.8663, 1.1250, 0.3862, 0.1950, 0.1312, 0.7463, 0.5813,
3.6753, 3.5346, 5.5041, 2.9543, 4.5194, 4.0093, 2.6202, 2.3212, 4.7097,
5.4033, 5.8871, 4.7097, 1.6000, 4.2400, 3.5600, 0.6000, 0.5600, 0.7200,
0.3400, 0.4800, 0.4000, 0.4200, 0.4600, 0.3400], device='cuda:0')
gh= tensor([ 4.9385, 2.4808, 6.1500, 1.8577, 4.7423, 4.4539, 5.6193, 4.8115,
4.8923, 6.2770, 5.9538, 1.2116, 0.8077, 0.8769, 1.5000, 4.0961,
4.0038, 0.6346, 0.8538, 0.4270, 0.7615, 0.3461, 3.7269, 6.0808,
1.0385, 1.8561, 1.7591, 1.9425, 1.6295, 1.6726, 1.4784, 0.9604,
1.5971, 1.4244, 1.4352, 0.3453, 1.4352, 0.3993, 0.6260, 0.4533,
0.6798, 0.2158, 0.6260, 0.1835, 0.2375, 0.2158, 0.1079, 0.0756,
0.3237, 0.2699, 0.1618, 0.1835, 0.0970, 0.0970, 0.2266, 1.5168,
7.0786, 6.8620, 6.5971, 6.6453, 6.1156, 2.6485, 7.1027, 6.8860,
6.9824, 6.7656, 0.4333, 7.1785, 6.8786, 6.6214, 6.9429, 2.6357,
4.9714, 6.6643, 4.9071, 6.8571, 6.7500, 3.9643, 7.2643, 6.8143,
7.0071, 1.5938, 1.5788, 1.8112, 1.9087, 2.8575, 2.3025, 1.4738,
2.2350, 2.0400, 2.1000, 2.0813, 2.6362, 2.5950, 1.8300, 1.4850,
2.1150, 3.4162, 3.4200, 2.0400, 2.2913, 1.0462, 2.4675, 2.0775,
1.0312, 0.1687, 1.0762, 1.1887, 11.3775, 11.1313, 6.5241, 7.8782,
7.7550, 7.0339, 12.7843, 12.5557, 6.5484, 6.6290, 6.7903, 5.3710,
5.9400, 11.9200, 10.1600, 1.6000, 2.0000, 2.0800, 0.4800, 1.2800,
1.1600, 1.1201, 0.3600, 1.4599], device='cuda:0')
The corresponding w/h loss terms in the training log (one column per YOLO detection layer) also look normal:
w_loss | 0.089969 | 0.078275 | 0.105107 |
h_loss | 0.103389 | 0.090746 | 0.095267 |
When the loss went to NaN, gw/gh did indeed contain negative values; taking the log of a negative number then produces NaN, and the subsequent loss can no longer be computed.
Conclusion: a negative argument passed to log caused the NaN loss.
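A possible fix, sketched here as an assumption rather than the repository's official solution: floor the ratio at a small positive value before taking the log (the helper name safe_log_ratio below is hypothetical), or alternatively filter out targets with non-positive width/height when building them.
import torch

# Hedged sketch: clamping keeps the log argument strictly positive, so a stray
# zero or negative gw/gh can no longer turn the loss into nan.
def safe_log_ratio(size, anchor_size, eps=1e-16):
    return torch.log(torch.clamp(size / anchor_size, min=eps))

# Applied to the two lines from build_targets above:
# tw[b, best_n, gj, gi] = safe_log_ratio(gw, anchors[best_n][:, 0])
# th[b, best_n, gj, gi] = safe_log_ratio(gh, anchors[best_n][:, 1])
Note that clamping only stops the NaN from propagating; genuinely negative gw/gh values usually mean the annotations feeding the new loss still need to be checked.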
Reference blogs:
网络训练时出现loss为nan的情况(已解决)
神经网络训练时,出现NaN loss
训练网络loss出现Nan解决办法
2. CUDA error
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorMathCompare.cuh:82
RuntimeError: CUDA error: device-side assert triggered
Add CUDA_LAUNCH_BLOCKING=1 when running to see the detailed error; CUDA kernels launch asynchronously, so without this flag the reported stack trace usually points somewhere after the operation that actually failed:
CUDA_LAUNCH_BLOCKING=1 python train.py
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [108,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [30,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
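For context, this assertion is CUDA's generic index-out-of-bounds check: some indexing operation received an index outside the tensor's size, but the asynchronous launch makes the Python traceback unhelpful. A minimal, hypothetical reproduction (unrelated to this repository's code) shows the difference; moving the failing tensors to the CPU is often the quickest way to get a readable error.
import torch

x = torch.zeros(10)
idx = torch.tensor([12])   # 12 is out of bounds for a size-10 tensor

# On a CUDA tensor this indexing only triggers the asynchronous device-side
# assert shown above; on the CPU it raises a readable IndexError immediately.
try:
    y = x[idx]
except IndexError as e:
    print(e)   # index 12 is out of bounds for dimension 0 with size 10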
Reference blogs:
pytorch报错:RuntimeError: CUDA error: device-side assert triggered
RuntimeError: cuda runtime error (59) : device-side assert triggered