Faster RCNN PyTorch training problem: Warning: NaN or Inf found in input tensor.

orzchenyuming

Article link: https://blog.csdn.net/qq_29936933/article/details/111378275

problem

Training Faster RCNN (https://github.com/jwyang/faster-rcnn.pytorch) on my own data (VOC format) produces loss=nan.
Training Faster RCNN on Pascal VOC and COCO works normally.

reason

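The learning rate may be too large; try lowering it. The most effective check is to set the learning rate to 0 and see whether nan still appears.
More likely, the data itself is the problem (mine is in VOC format). The VOC loader subtracts 1 from each coordinate after reading it, so if your box coordinates already start at 0, subtracting 1 pushes them outside the image boundary.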
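The learning-rate check above can be illustrated with a toy example. This is only a minimal sketch in plain Python, not the Faster RCNN training loop: the function `train_toy` and the loss `w * w` are hypothetical stand-ins. It shows that an oversized learning rate drives the loss to a non-finite value, while lr=0 leaves the weights untouched, so any nan that survives lr=0 has to come from the input data.

```python
import math

def train_toy(lr, steps=2000, w0=1.0):
    """Run gradient descent on loss(w) = w * w.

    Returns the index of the first step whose loss is inf or nan,
    or None if the loss stays finite for all steps.
    """
    w = w0
    for step in range(steps):
        loss = w * w
        if not math.isfinite(loss):
            return step
        grad = 2.0 * w          # d(loss)/dw
        w = w - lr * grad       # with lr=0 the weight never changes
    return None

print("lr=0.0 :", train_toy(lr=0.0))   # loss stays finite
print("lr=0.1 :", train_toy(lr=0.1))   # converges, loss stays finite
print("lr=1.5 :", train_toy(lr=1.5))   # diverges to a non-finite loss
```

In the real training script the same reasoning applies: run a few iterations with lr=0 and watch whether the warning still appears.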

solution

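Set lr=0. If loss=nan no longer appears, the learning rate was too large and the gradients were exploding; adjust the learning rate and weight decay.
If loss=nan persists with lr=0, modify the code in pascal_voc.py that reads the box coordinates: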

Original code:
x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1
Modified:
x1 = float(bbox.find('xmin').text)
y1 = float(bbox.find('ymin').text)
x2 = float(bbox.find('xmax').text)
y2 = float(bbox.find('ymax').text)
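To see the effect of the `- 1` offset on 0-based annotations, here is a minimal, self-contained sketch. The annotation string and the helper `read_boxes` are made up for illustration and are not part of the repository; only the `bbox.find(...).text` parsing pattern mirrors the snippet above. A coordinate of -1 lies outside the image, and if the loader stores boxes in an unsigned integer array (an assumption here, worth checking in your copy of pascal_voc.py), it would wrap around to a large positive value.

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC-style annotation whose coordinates already start at 0.
ANNOTATION = """
<annotation>
  <size><width>100</width><height>80</height></size>
  <object>
    <name>cat</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>99</xmax><ymax>79</ymax></bndbox>
  </object>
</annotation>
"""

def read_boxes(xml_text, one_based):
    """Parse all boxes; subtract 1 only if the file uses 1-based pixels."""
    root = ET.fromstring(xml_text)
    offset = 1 if one_based else 0
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        boxes.append(tuple(
            float(bbox.find(tag).text) - offset
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ))
    return boxes

print(read_boxes(ANNOTATION, one_based=True))   # [(-1.0, -1.0, 98.0, 78.0)]
print(read_boxes(ANNOTATION, one_based=False))  # [(0.0, 0.0, 99.0, 79.0)]
```

A quick way to decide which setting applies is to scan the whole dataset for any xmin or ymin equal to 0: if one exists, the annotations are not 1-based.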

If flipping is enabled (cfg.TRAIN.USE_FLIPPED = True), also modify the append_flipped_images(self) method in imdb.py:

Original code:
boxes[:, 0] = widths[i] - oldx2 - 1
boxes[:, 2] = widths[i] - oldx1 - 1
Modified:
boxes[:, 0] = widths[i] - oldx2
boxes[:, 2] = widths[i] - oldx1
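The flip arithmetic above can be checked on a single box. This is a sketch under one stated assumption: the dataset allows xmax to equal the image width (a box touching the right edge). The helper `flip_box_x` is hypothetical and works on scalars rather than on the `boxes` array used in imdb.py.

```python
def flip_box_x(x1, x2, width, subtract_one):
    """Mirror a box horizontally; returns the new (x1, x2)."""
    offset = 1 if subtract_one else 0
    return width - x2 - offset, width - x1 - offset

width = 100
# A box that spans the full image, with xmax equal to the width.
print(flip_box_x(0, 100, width, subtract_one=True))   # (-1, 99): x1 is out of bounds
print(flip_box_x(0, 100, width, subtract_one=False))  # (0, 100): stays inside the image
```

If your coordinates are pixel indices in the range 0 to width - 1 instead, the version with `- 1` is the one that stays in range, so check which convention your annotations follow before editing.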

Summary (conditions that can cause loss=nan) [2]

Coordinates outside the image resolution -> NaN loss
xmin == xmax -> NaN loss
ymin == ymax -> NaN loss
Bounding box size is very small -> NaN loss

For the 4th case, we added the condition that |xmax - xmin| >= 20 and, similarly, |ymax - ymin| >= 20.
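The four conditions above can be turned into a pre-training check. The following is a minimal sketch, not code from the repository: `find_box_problems` is a hypothetical helper, the bounds test assumes coordinates may range from 0 up to and including the image width and height, and the default `min_size=20` is the threshold quoted above.

```python
def find_box_problems(box, width, height, min_size=20):
    """Return a list of reasons why a box could lead to a NaN loss."""
    x1, y1, x2, y2 = box
    problems = []
    if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
        problems.append("out of image bounds")
    if x1 == x2:
        problems.append("xmin == xmax")
    if y1 == y2:
        problems.append("ymin == ymax")
    if abs(x2 - x1) < min_size or abs(y2 - y1) < min_size:
        problems.append("box too small")
    return problems

print(find_box_problems((10, 10, 60, 50), width=100, height=80))   # []
print(find_box_problems((-1, 10, 60, 50), width=100, height=80))   # ['out of image bounds']
print(find_box_problems((10, 10, 10, 50), width=100, height=80))   # ['xmin == xmax', 'box too small']
```

Running a check like this over every annotation before training makes it easier to find the offending files than waiting for the warning to show up mid-epoch.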

[1] https://github.com/VisionLearningGroup/DA_Detection/issues/11
[2] https://github.com/jwyang/faster-rcnn.pytorch/issues/136
