Project context:
While training a neural network for a deep-learning task, my first, self-made dataset ran perfectly, but switching to a second, open-source dataset triggered an error.
Problem description:
The error output is pasted below:
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [34,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [35,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [605,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "D:/BIT_CD-master/main_cd.py", line 76, in <module>
    train(args)
  File "D:/BIT_CD-master/main_cd.py", line 15, in train
    model.train_models()
  File "D:\BIT_CD-master\models\trainer.py", line 294, in train_models
    self._backward_G()
  File "D:\BIT_CD-master\models\trainer.py", line 274, in _backward_G
    self.G_loss.backward()
  File "C:\Users\czc\anaconda3\envs\MobileVIT\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\czc\anaconda3\envs\MobileVIT\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
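As the log itself notes, CUDA kernel errors are reported asynchronously, so the Python stack trace may not point at the real failure site. A quick first step (a sketch; `main_cd.py` is the entry script from the traceback above) is to rerun with synchronous kernel launches so the assert fires at the exact call that caused it:

```shell
# Force synchronous CUDA kernel launches so the device-side assert
# surfaces at the true call site rather than a later API call.
# Linux/macOS:
CUDA_LAUNCH_BLOCKING=1 python main_cd.py
# Windows (cmd):
set CUDA_LAUNCH_BLOCKING=1 && python main_cd.py
```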
Cause analysis:
At first I suspected the loss function, or the modifications I had made to the model, but verification ruled both out, which pointed back at the dataset. My dataset's labels are binarized (0/255). I printed the label values before augmentation, as they entered the model, and as they entered the loss, and indeed found the error:
Before transform: train_248_3_2.png [ 0 156 255]    After transform: [ 0 1 156]
I then pulled up this image to inspect it.
Sure enough, it contained pixel values other than 0 or 255, which made the fix simple: the stray value 156 was passed through as a class index, falling outside the valid range [0, n_classes) and triggering the device-side assert.
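Rather than spotting such pixels by eye, the label masks can be scanned for stray values before training. A minimal sketch with NumPy (the helper name `invalid_label_values` and the sample array are my own, not from the original code):

```python
import numpy as np

def invalid_label_values(label, valid=(0, 255)):
    """Return the sorted pixel values in `label` that are outside the valid set."""
    return [int(v) for v in np.unique(label) if v not in valid]

# Example: a label patch with a stray anti-aliased pixel value of 156,
# like the one found in train_248_3_2.png
patch = np.array([[0, 255, 156],
                  [0,   0, 255]], dtype=np.uint8)
print(invalid_label_values(patch))  # → [156]
```

Running this over every mask in the dataset (e.g. after loading each file with PIL or OpenCV) flags bad labels up front, instead of letting them surface later as a CUDA assert.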
Solution:
Binarize the label directly when the data is loaded by adding:
if self.label_transform == 'norm':
    label = label // 255
It now runs successfully.
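Why this works: integer floor division maps every value below 255 to 0 and 255 itself to 1, so stray artifacts like the 156 above are folded into class 0 and the targets stay inside [0, n_classes). A quick check, using the values from the print-out above:

```python
import numpy as np

# Label pixels as read from the problematic mask
label = np.array([0, 156, 255], dtype=np.uint8)
print(label // 255)  # → [0 0 1]
```

Note this silently treats 156 as background; if such pixels should instead count as foreground, thresholding (e.g. `(label > 127).astype(np.uint8)`) would map them to 1 instead.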