I really didn't think three GPUs would fail to train a resnet18. The first error, on machine A, was:
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
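The hint in that message can be followed from inside the script itself. This is only a minimal sketch, assuming the line sits at the very top of the training script, before torch is imported, so that it is in place when CUDA gets initialized:

```python
import os

# Assumption: this runs at the very top of the training script, before
# `import torch`, so the setting is in place when CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# With synchronous kernel launches, the stack trace should point at the
# call that actually failed rather than at some later API call.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

This does not fix an out-of-memory error; it only makes the reported location more trustworthy.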
So I switched to another server, one with free GPUs! Five of them, no way that's not enough! The result:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
OK, still broken.
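First I searched Baidu for this error. Almost every answer blames one of two things: the data labels don't match the network output, so backpropagation fails, or the PyTorch version doesn't match the CUDA version. I never thought an environment mismatch was likely, because this wasn't a new environment; I had run other code in it before.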
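The label explanation can at least be ruled out without touching the GPU. Here is a rough sketch of that check; `check_labels` is a hypothetical helper I made up for illustration, and the 7 classes are an assumption based on the FER-2013 dataset:

```python
def check_labels(labels, num_classes):
    """Return the sorted label values that fall outside [0, num_classes)."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < num_classes})


# Assumption: 7 emotion classes, as in FER-2013. The label list here is
# made-up sample data, not taken from the real dataset.
bad = check_labels([0, 3, 6, 7, -1], num_classes=7)
print(bad)  # [-1, 7]
```

An empty list means every label is a valid class index, so the label theory would not explain the error.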
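Besides, it failed right where the input is fed into the model, before the network printed anything, so how was I supposed to check whether the network output matched the labels? On a whim I switched to a different environment with pytorch1.7, and it worked! I had been using the pytorch1.8 environment before. Well, sometimes when it isn't the code, it's the environment!!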
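When comparing two environments like this, it helps to print what each one actually contains. A small sketch, assuming nothing beyond torch itself; `describe_environment` is a hypothetical helper name, and it falls back to empty values if torch is not installed:

```python
def describe_environment():
    """Collect the PyTorch, CUDA and cuDNN versions visible to this interpreter."""
    info = {
        "torch": None,
        "cuda_build": None,
        "cudnn": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing more to report.
        return info

    info["torch"] = torch.__version__
    # CUDA version this torch build was compiled against (None for CPU builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["cudnn"] = torch.backends.cudnn.version()
        info["gpu_count"] = torch.cuda.device_count()
    return info


print(describe_environment())
```

Running this in both the pytorch1.7 and pytorch1.8 environments shows at a glance whether they differ in the CUDA or cuDNN build.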
The full error output:
Traceback (most recent call last):
  File "pretrain_resnet18.py", line 242, in <module>
    train_resnet(args)
  File "pretrain_resnet18.py", line 163, in train_resnet
    outputs = net(inputs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 5 on device 5.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/lxd-workplace/landf/new_new/FACE/FER-2013/resnet3_18.py", line 206, in forward
    return self._forward_impl(x)
  File "/home/ubuntu/lxd-workplace/landf/new_new/FACE/FER-2013/resnet3_18.py", line 189, in _forward_impl
    x = self.conv1(x)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/pt1.8/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED