1. non-finite loss, ending training tensor(nan, device='cuda:0'); 2. Function 'LogSoftmaxBackward' returned nan values; 3. Function 'MulBackward0' returned nan values

This post records a non-finite loss (nan) problem encountered during deep learning training with PyTorch, and how it was resolved. The author tried switching datasets, adjusting the loss function, and reviewing the code, and eventually found that the problem was caused by some images in the dataset being close to pure white. Adding torch.autograd.set_detect_anomaly(True) produced a more detailed error message that pointed to a ReLU layer; deleting the corresponding module restored normal training.



Error 1: WARNING: non-finite loss, ending training tensor(nan, device='cuda:0', grad_

Error 2: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Error 3: Function 'MulBackward0' returned nan values in its 0th output.
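
All three messages are symptoms of the same thing: a nan appears somewhere in the forward pass or in the loss, and log-based losses are a common entry point because log(0) is -inf. A minimal sketch (not the author's code; the tensors are made up) of how one bad logit poisons a cross-entropy loss, plus the kind of finiteness guard that prints a warning like error 1:

import torch
import torch.nn.functional as F

# One non-finite logit is enough: log_softmax([inf, 0, 0]) is nan everywhere,
# so the cross-entropy loss itself becomes nan.
logits = torch.tensor([[float("inf"), 0.0, 0.0]], requires_grad=True)
target = torch.tensor([0])
loss = F.cross_entropy(logits, target)
print(loss)  # tensor(nan, grad_fn=...)

# A guard of the kind that emits error 1: check the loss before backward().
if not torch.isfinite(loss):
    print(f"WARNING: non-finite loss, ending training {loss}")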

Reference 1: When this happens, try a different dataset. I struggled with this for two days: I swapped PyTorch versions and tried everything, always suspecting the network architecture and changing it back and forth, with no luck. My loss function was cross-entropy, and I also tried focal loss; both use log, which blows up when a value of 0 appears. Switching to a different dataset worked. Looking back at the images in the original dataset, some were close to pure white, with values very close to 0, which was probably the cause.
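
Before blaming the architecture, it can be worth scanning for such degenerate images directly. A hypothetical pre-training check (not from the original post; the directory, file extension, and thresholds are assumptions):

import numpy as np
from PIL import Image
from pathlib import Path

# Flag images that are nearly pure white or nearly constant, since
# degenerate inputs can push log-based losses (cross-entropy, focal loss)
# toward log(0) = -inf.
dataset_dir = Path("data/train")  # assumed dataset location
for img_path in dataset_dir.rglob("*.jpg"):
    arr = np.asarray(Image.open(img_path).convert("L"), dtype=np.float32) / 255.0
    if arr.mean() > 0.98 or arr.std() < 1e-3:
        print(f"suspicious image: {img_path} mean={arr.mean():.3f} std={arr.std():.4f}")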

Reference 2: torch.autograd.set_detect_anomaly(True). Try adding this line at the very beginning of your training script (train.py); when the error occurs, the message is then much more detailed. In my case the problem was one specific module: with the module removed the code ran normally, and adding it back brought the error back, which means some value was being modified in place. I tried every fix suggested online, including setting inplace=False, without success. The full error even wishes you good luck at the end: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
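
For reference, the switch goes at the very top of the training script, before any forward or backward pass; a minimal sketch:

import torch

# Enable anomaly detection once, before training starts. Every iteration
# becomes slower, but a nan detected during backward is reported together
# with the traceback of the forward call that produced it, so turn it off
# again for normal runs.
torch.autograd.set_detect_anomaly(True)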

After adding torch.autograd.set_detect_anomaly(True), the more detailed error below is printed, and you can go to the reported location to delete or modify the offending code. You can see my specific error location in the trace below, at the forward call in sk_model.py; after deleting that module, the code ran normally.

C:\Users\dyh\.conda\envs\dyh_torch2\python.exe M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py
5232 images were found in the dataset.
4187 images for training.
1045 images for validation.
Using 12 dataloader workers every process
0%| | 0/262 [00:00<?, ?it/s]loss= tensor(0.6223, device='cuda:0', grad_fn=)
C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py:173: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
main(opt)
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
mean_loss = train_one_epoch(model=model,
File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 129, in train_one_epoch
pred = model(images.to(device))
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "M:\第三个分类\第三个分类\Test8_densenet\model_three_path.py", line 308, in forward
features_sk3 = self.sk3(left2)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "M:\第三个分类\第三个分类\Test8_densenet\sk_model.py", line 41, in forward
fea = conv(x).unsqueeze(dim=1)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\nn\functional.py", line 1457, in relu
result = torch.relu(input)
(Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
0%| | 0/262 [00:54<?, ?it/s]
Traceback (most recent call last):
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
main(opt)
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
mean_loss = train_one_epoch(model=model,
File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 133, in train_one_epoch
loss.backward()
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "C:\Users\dyh\.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Process finished with exit code 1
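
In this case the fix was deleting the offending module, but the error class is easy to reproduce in isolation. A hypothetical minimal example (not the author's model) showing how modifying ReLU's output in place triggers exactly this RuntimeError, and the out-of-place rewrite that avoids it:

import torch
import torch.nn as nn

# ReLU saves its output for the backward pass, so changing that output
# in place bumps its version counter and backward() fails.
relu = nn.ReLU()
x = torch.randn(4, requires_grad=True)

y = relu(x)
y += 1  # in-place add on the saved output
try:
    y.sum().backward()
except RuntimeError as e:
    print(e)  # "... output 0 of ReluBackward0, is at version 1; expected version 0 ..."

y = relu(x)
y = y + 1  # out-of-place add: a new tensor, the saved output stays intact
y.sum().backward()
print(x.grad)  # gradients flow normally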
