Project context:
While training a neural network for a deep-learning task, my first, self-made dataset ran perfectly, but switching to a second, open-source dataset triggered an error.
Problem description:
The error output is pasted below:
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [34,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [35,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: block: [0,0,0], thread: [605,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "D:/BIT_CD-master/main_cd.py", line 76, in <module>
    train(args)
  File "D:/BIT_CD-master/main_cd.py", line 15, in train
    model.train_models()
  File "D:\BIT_CD-master\models\trainer.py", line 294, in train_models
    self._backward_G()
  File "D:\BIT_CD-master\models\trainer.py", line 274, in _backward_G
    self.G_loss.backward()
  File "C:\Users\czc\anaconda3\envs\MobileVIT\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\czc\anaconda3\envs\MobileVIT\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
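As the log itself notes, CUDA kernel errors are reported asynchronously, so the Python stack trace may not point at the real failure site. A quick first step (a sketch; `main_cd.py` is the entry script from the traceback above) is to rerun with synchronous kernel launches so the assert fires at the exact call that caused it:

```shell
# Force synchronous CUDA kernel launches so the device-side assert
# surfaces at the true call site rather than a later API call.
# Linux/macOS:
CUDA_LAUNCH_BLOCKING=1 python main_cd.py
# Windows (cmd):
set CUDA_LAUNCH_BLOCKING=1 && python main_cd.py
```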
Cause analysis:
At first I suspected the loss function, or the modifications I had made to the model, but verification ruled both out, which pointed back at the dataset. My dataset's labels are binarized (0/255). I printed the label values before augmentation, as they entered the model, and as they entered the loss, and indeed found the error:
Before transform: train_248_3_2.png [ 0 156 255]    After transform: [ 0 1 156]
I then pulled up this image to inspect it.
Sure enough, it contained pixel values other than 0 or 255, which made the fix simple: the stray value 156 was passed through as a class index, falling outside the valid range [0, n_classes) and triggering the device-side assert.
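Rather than spotting such pixels by eye, the label masks can be scanned for stray values before training. A minimal sketch with NumPy (the helper name `invalid_label_values` and the sample array are my own, not from the original code):

```python
import numpy as np

def invalid_label_values(label, valid=(0, 255)):
    """Return the sorted pixel values in `label` that are outside the valid set."""
    return [int(v) for v in np.unique(label) if v not in valid]

# Example: a label patch with a stray anti-aliased pixel value of 156,
# like the one found in train_248_3_2.png
patch = np.array([[0, 255, 156],
                  [0,   0, 255]], dtype=np.uint8)
print(invalid_label_values(patch))  # → [156]
```

Running this over every mask in the dataset (e.g. after loading each file with PIL or OpenCV) flags bad labels up front, instead of letting them surface later as a CUDA assert.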
Solution:
Binarize the label directly when the data is loaded by adding:
if self.label_transform == 'norm':
    label = label // 255
It now runs successfully.
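Why this works: integer floor division maps every value below 255 to 0 and 255 itself to 1, so stray artifacts like the 156 above are folded into class 0 and the targets stay inside [0, n_classes). A quick check, using the values from the print-out above:

```python
import numpy as np

# Label pixels as read from the problematic mask
label = np.array([0, 156, 255], dtype=np.uint8)
print(label // 255)  # → [0 0 1]
```

Note this silently treats 156 as background; if such pixels should instead count as foreground, thresholding (e.g. `(label > 127).astype(np.uint8)`) would map them to 1 instead.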