RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

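Summary: while trying to train a ResNet18 model on two different servers, the author hit a CUDA out-of-memory error and then a cuDNN initialization failure. The first error was CUDA out of memory; after switching servers and using more GPUs, the error became CUDNN_STATUS_NOT_INITIALIZED. The author doubted that the cause was the data labels or a PyTorch/CUDA version mismatch, because the same environment had run other code before. In the end, downgrading PyTorch from 1.8 to 1.7 solved the problem.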

I didn't think ResNet18 should be too much even for three GPUs. The first error, on machine A, was:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
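
The log above suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of one way to do that from inside the training script, assuming the variable is set before CUDA is initialized (safest is before `import torch`). The helper name `launch_blocking_enabled` is hypothetical and only there for illustration:

```python
import os

# Assumption: this runs at the very top of the script, before torch is imported,
# so that CUDA picks the setting up when it is first initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python pretrain_resnet18.py`. Synchronous launches slow training down, so this is for debugging only.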

So I switched to another server, one with free GPUs. Five of them, no way that wouldn't be enough! But I got this error:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

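OK, so there was still a problem.
I first searched Baidu for this error. Most answers blamed either mismatched data labels that break backpropagation, or a mismatch between the PyTorch and CUDA versions. An environment mismatch seemed unlikely to me, because this was not a new environment and I had run other code in it before.
Besides, the run failed right where the input is fed into the model, before anything from the network was printed, so I had no way to check whether the network output matched the labels. On a whim I switched to an environment with PyTorch 1.7, and it worked! I had been using a PyTorch 1.8 environment. So sometimes, when the code is not the problem, the environment is!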
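
When comparing two environments like the PyTorch 1.7 and 1.8 ones above, it can help to print the versions each one actually loads. A minimal sketch, assuming only standard PyTorch attributes; the function name `describe_torch_env` is hypothetical, and the function returns early if PyTorch is not installed:

```python
def describe_torch_env():
    """Collect the version info that matters for CUDA/cuDNN mismatches."""
    info = {}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["cudnn_available"] = torch.backends.cudnn.is_available()
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn_version"] = torch.backends.cudnn.version()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Running this in both conda environments and diffing the output shows which of PyTorch, CUDA, or cuDNN differs between them. It does not by itself explain why 1.8 failed here.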


The full traceback is as follows:

Traceback (most recent call last):
  File "pretrain_resnet18.py", line 242, in <module>
    train_resnet(args)
  File "pretrain_resnet18.py", line 163, in train_resnet
    outputs = net(inputs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 5 on device 5.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/lxd-workplace/landf/new_new/FACE/FER-2013/resnet3_18.py", line 206, in forward
    return self._forward_impl(x)
  File "/home/ubuntu/lxd-workplace/landf/new_new/FACE/FER-2013/resnet3_18.py", line 189, in _forward_impl
    x = self.conv1(x)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
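
The traceback above fails inside DataParallel, in the first convolution of one replica. One way to narrow this down is to run a tiny convolution on each visible GPU in turn and record which ones fail. This is a hedged sketch, not the fix used in this post (that was downgrading PyTorch); the function name `probe_devices` and the layer and tensor sizes are arbitrary assumptions:

```python
def probe_devices():
    """Run a small Conv2d on every visible GPU; map device index to 'ok' or an error line."""
    try:
        import torch
    except ImportError:
        return {}  # PyTorch is not installed, nothing to probe.

    results = {}
    # device_count() is 0 when no CUDA device is usable, so the loop is skipped.
    for idx in range(torch.cuda.device_count()):
        device = torch.device("cuda", idx)
        try:
            # Arbitrary small layer and input, just enough to touch cuDNN.
            conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
            x = torch.randn(1, 3, 32, 32, device=device)
            conv(x)
            torch.cuda.synchronize(device)  # surface asynchronous errors here
            results[idx] = "ok"
        except RuntimeError as err:
            results[idx] = str(err).split("\n")[0]
    return results


if __name__ == "__main__":
    for idx, status in probe_devices().items():
        print(f"cuda:{idx} -> {status}")
```

If only some devices fail, they may be busy or out of memory because of other processes, which `nvidia-smi` can confirm. If every device fails, the environment itself is the more likely suspect.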
