Fixing RuntimeError: cuda runtime error (59) : device-side assert triggered

Latest recommended article published on 2022-10-27 19:24:29

程序员对白

Category: Deep Learning

Original link: https://blog.csdn.net/qq_22821801/article/details/90212788

I ran into the following error while running my training script:

Traceback (most recent call last):
  File "train_pytorch1.py", line 217, in <module>
    loss = F.cross_entropy(output, target)
  File "/usr/local/python3/lib/python3.5/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/python3/lib/python3.5/site-packages/torch/nn/functional.py", line 1790, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111

The traceback suggests that the failure is related to computing the loss value.

While searching for a fix, I found that many people have hit this cuda runtime error (59), and in most cases the cause is an out-of-range index.

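According to a PyTorch dev in this thread, because CUDA executes asynchronously, the assert may not come with a complete and correct stack trace pointing to where it was actually triggered.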

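Adding the following statements at the top of the program, before the other modules are imported, makes CUDA kernel launches synchronous so that more detail is printed: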

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

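If editing the training script is inconvenient, one alternative is to set the variable for a child process instead, which also guarantees it is in place before CUDA is initialized. The following is only a sketch: the helper name `run_with_blocking_launches` is hypothetical, and the probe command is a stand-in that merely echoes the variable rather than running any training code.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run `cmd` in a child process with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ)             # copy, so the parent environment is untouched
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # same variable the article sets via os.environ
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only reports the variable; in practice cmd could be
    # something like [sys.executable, "train_pytorch1.py"].
    probe = [sys.executable, "-c",
             "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))"]
    print(run_with_blocking_launches(probe).stdout.strip())
```

Because the variable is passed through the child's environment, it does not matter in which order the child script imports its modules.
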
With that setting, the error output is as follows:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train_pytorch1.py", line 217, in <module>
    loss = F.cross_entropy(output, target)
  File "/usr/local/python3/lib/python3.5/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/python3/lib/python3.5/site-packages/torch/nn/functional.py", line 1790, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111

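The failed assertion, `t >= 0 && t < n_classes`, requires every target label t to lie in the range 0 to n_classes - 1, so I guessed that the label indices were the problem.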

So I printed the labels of the data being loaded and, sure enough, found the problem.
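
A check of this kind can be sketched in plain Python. The helper name `summarize_labels` and the sample label list are made up for illustration and are not the author's code; with PyTorch tensors, the same information could be read from `target.min()` and `target.max()`.

```python
def summarize_labels(labels, n_classes):
    """Return (min, max, out_of_range) for an iterable of integer labels."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to summarize")
    # Same condition as the kernel assertion: t >= 0 && t < n_classes
    out_of_range = sorted({t for t in labels if not (0 <= t < n_classes)})
    return min(labels), max(labels), out_of_range


if __name__ == "__main__":
    n_classes = 45
    labels = list(range(1, 46))  # made-up stand-in for labels read in as 1~45
    lowest, highest, bad = summarize_labels(labels, n_classes)
    print("min:", lowest, "max:", highest, "out of range:", bad)
```

Note that with 45 classes and labels 1~45, only the label 45 actually violates the assertion; labels 1~44 are silently accepted but point at the wrong class.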

The correct indices should be 0~44, but the program was reading in 1~45. After correcting the labels, the program ran normally.
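
The article does not show how the labels were corrected. One possible correction, assuming every label is uniformly off by one, is to shift them down while validating the input range. The function name `to_zero_based` is hypothetical.

```python
def to_zero_based(labels, n_classes):
    """Shift 1-based labels (1..n_classes) to 0-based (0..n_classes - 1)."""
    shifted = []
    for t in labels:
        # Refuse anything outside the expected 1-based range rather than
        # silently producing another invalid index.
        if not (1 <= t <= n_classes):
            raise ValueError(f"label {t} is outside 1..{n_classes}")
        shifted.append(t - 1)
    return shifted


if __name__ == "__main__":
    print(to_zero_based([1, 2, 45], n_classes=45))
```

Doing the shift once, where the data is loaded, keeps the loss call itself unchanged.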

For more on debugging CUDA asserts, I recommend a blog post that I think explains it quite clearly: Debugging CUDA device-side assert in PyTorch