Problems Encountered in PyTorch and Their Solutions (continuously updated)

Tsingzao-于廷照

Last modified: 2023-11-28 16:37:18

First published: 2018-01-17 13:58:49

Article link: https://blog.csdn.net/yutingzhaomeng/article/details/79084405

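1. After installation, import torch fails with ImportError: dlopen: cannot load any more object with static TLS

Solution: Many answers suggest placing import torch before import cv2, but that did not help in my case. What finally worked was running inside Jupyter Notebook, where import torch succeeds directly. I connect to the lab server through MobaXterm; import torch fails both in the console and in Spyder, and works only in Jupyter.

Update: the problem can also be solved by changing the backend.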

2. Concatenating two Variables should work with c = torch.cat([a, b], dim=0), but it raises the following error:

TypeError: cat received an invalid combination of arguments - got (tuple, int), but expected one of:

(sequence[torch.cuda.FloatTensor] tensors)
(sequence[torch.cuda.FloatTensor] tensors, int dim)
didn’t match because some of the arguments have invalid types: (tuple, int)

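Solution: From the message I first assumed that cat does not accept a tuple as input. The real problem is that a and b have different types, for example a is a torch.cuda.DoubleTensor while b is a torch.cuda.FloatTensor. Converting a and b to the same type fixes it.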
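The type conversion described in item 2 could look roughly like the sketch below. The tensors a and b are made-up CPU examples, and casting b to the dtype of a is only one reasonable choice; recent PyTorch releases may promote mixed dtypes in torch.cat on their own, so the explicit cast mainly matters on older versions.

```python
import torch

# Hypothetical inputs with mismatched dtypes (double vs. float).
a = torch.zeros(2, 3, dtype=torch.float64)
b = torch.ones(2, 3, dtype=torch.float32)

# Cast b to the dtype of a so both tensors share one type before concatenating.
b = b.to(a.dtype)

# Concatenate along dimension 0.
c = torch.cat([a, b], dim=0)
```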

3. Training fails with RuntimeError: tensors are on different GPUs

This happens when only one of the training data and the model has been moved with .cuda() and the other has not. Change both to data.cuda() and model.cuda().

Solution: data = data.cuda()

model = model.cuda()
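As a device-agnostic variant of the fix in item 3, the sketch below uses .to(device) instead of .cuda(), which is a substitution on my part so that the example also runs on a CPU-only machine. The model and data here are placeholders.

```python
import torch
from torch import nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and batch; both are moved to the same device.
model = nn.Linear(4, 2).to(device)
data = torch.randn(8, 4).to(device)

# The forward pass works because parameters and inputs share a device.
output = model(data)
```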

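4. Training fails with TypeError: argument 0 is not a Variable

The input data is not a Variable and has to be converted to one. (Since PyTorch 0.4, Tensor and Variable are merged, so this wrapping only matters on older versions.)

Solution: from torch.autograd import Variable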

data = Variable(data).cuda()

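5. Training with a custom loss fails with AttributeError: 'MyLoss' object has no attribute '_forward_pre_hooks'

Judging from the message, the loss fails before forward is even called. The attribute is created by nn.Module.__init__, which the custom class never invoked. For how to define a custom loss in PyTorch, see here.

Solution: call super(MyLoss, self).__init__() in the loss class's __init__ method.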
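A minimal sketch of a custom loss with the required super call is shown below. The class name MyLoss comes from the error message; the mean-absolute-error body is only an illustrative assumption, not the loss used in the original project.

```python
import torch
from torch import nn


class MyLoss(nn.Module):
    """Illustrative custom loss (mean absolute error)."""

    def __init__(self):
        # Initialise nn.Module first; this creates _forward_pre_hooks and
        # the other internal attributes that calling the module relies on.
        super(MyLoss, self).__init__()

    def forward(self, output, target):
        return torch.mean(torch.abs(output - target))


criterion = MyLoss()
loss = criterion(torch.tensor([1.0, 2.0]), torch.tensor([0.0, 4.0]))
```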

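6. Training runs fine, but validation fails with CUDA Error: Out of Memory

Since the message points at memory, my first reaction was to reduce the batch size, which reportedly helps. Even with a batch size of 1, however, the error persisted. Looking further, I found that the Variables were defined without disabling gradient tracking (the input and target, for instance, need no gradients). A search turned up two options: set requires_grad=False, or set volatile=True, with the second generally recommended. I was on PyTorch 0.4, though, where volatile is no longer supported.

Solution: replace volatile with torch.no_grad(). That is, if the original code is

target_var = torch.autograd.Variable(target.cuda(async=True))

then on versions before 0.4 you can use

target_var = torch.autograd.Variable(target.cuda(async=True),volatile=True)

and on 0.4 and later you can use

with torch.no_grad():
    # volatile is dropped here; async was renamed to non_blocking in 0.4
    target_var = torch.autograd.Variable(target.cuda(non_blocking=True))

This mostly solves the problem. If it persists, the code may be accumulating tensors repeatedly, for example summing acc or loss across iterations. Extract the data from the loss before accumulating it, and remember to del it after use.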
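Both points from item 6, disabling gradient tracking and accumulating plain numbers rather than tensors, can be sketched in a small validation loop. The model, criterion, and the in-memory list of batches are stand-ins chosen so the example runs on the CPU; loss.item() is used here as the modern way to extract the value from the loss.

```python
import torch
from torch import nn

# Placeholder model, loss, and validation batches.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]

model.eval()
total_loss = 0.0
with torch.no_grad():  # no computation graph is built inside this block
    for inputs, targets in batches:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        # .item() yields a Python float, so no tensor (or graph) is retained.
        total_loss += loss.item()

mean_loss = total_loss / len(batches)
```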

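7. Error: 'BatchNorm2d' object has no attribute 'track_running_stats'

PyTorch 0.4 does not support this; the error comes from a version mismatch.

Solution: switch to a different PyTorch version, for example downgrade to PyTorch 0.3.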

8. Error: Expected object of type torch.DoubleTensor but found type torch.FloatTensor for argument #2 'weight'

Solution: add model.double().
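The mismatch in item 8 arises when the input is double precision while the model weights are single precision. A small sketch with a placeholder linear layer:

```python
import torch
from torch import nn

model = nn.Linear(3, 1)                      # parameters default to float32
x = torch.randn(5, 3, dtype=torch.float64)   # double-precision input

# Convert all parameters and buffers to float64 so they match the input.
model = model.double()
y = model(x)
```

Converting the input with x.float() instead is the other way to make the two sides agree, at the cost of precision.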

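9. Error: Expected object of type torch.DoubleTensor but found type torch.cuda.DoubleTensor for argument #2 'weight'

The original code called inputs.cuda(), outputs.cuda() without using the return values.

Solution: assign the results back, i.e. inputs = inputs.cuda() and outputs = outputs.cuda(). Calling .cuda() on a tensor returns a GPU copy and leaves the original tensor on the CPU.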

10. When debugging, execution hangs in the first epoch, although a normal run works fine.

Solution: set the DataLoader's num_workers to 1.

11. RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

The cause is that the training code calls loss.backward() at least twice on the same graph.

Solution: recompute the output (and the loss derived from it) before the second loss.backward(), i.e. add output = model(input) before it.
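The fix in item 11 can be sketched as follows: each backward call gets its own forward pass, so no graph is traversed twice. The model, data, and loss are placeholders.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
criterion = nn.MSELoss()
x = torch.randn(8, 4)
target = torch.randn(8, 1)

# First forward/backward pass.
output = model(x)
loss = criterion(output, target)
loss.backward()

# Rebuild the graph with a fresh forward pass before calling backward again.
output = model(x)
loss = criterion(output, target)
loss.backward()
```

Note that gradients accumulate across the two backward calls unless they are zeroed in between.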

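12. Loading a saved model reports Unexpected key(s) in state_dict: "module.aaa. ...". Expected ".aaa....".

The cause is that the model was wrapped for data parallelism when it was trained and saved.

Solution: either read the state_dict and rewrite its keys to strip the module prefix, or use

model = nn.DataParallel(model)

to wrap the model in the same parallel form again, after which the checkpoint loads.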
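The first option in item 12, rewriting the keys, might look like the sketch below. To keep it runnable without GPUs, the DataParallel checkpoint is simulated by prefixing the keys by hand; that simulation and the variable names are assumptions for illustration.

```python
from collections import OrderedDict

import torch
from torch import nn

model = nn.Linear(4, 2)

# Simulate a checkpoint saved from a DataParallel-wrapped model:
# every key carries a leading "module." prefix.
parallel_state = OrderedDict(
    ("module." + key, value) for key, value in model.state_dict().items()
)

# Strip the prefix so the keys match a plain (non-parallel) model.
prefix = "module."
stripped = OrderedDict(
    (key[len(prefix):] if key.startswith(prefix) else key, value)
    for key, value in parallel_state.items()
)

fresh = nn.Linear(4, 2)
fresh.load_state_dict(stripped)
```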

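13. In a custom data-loading module, flipping data with data = data[:,::-1,:] raises ValueError: some of the strides of a given numpy array are negative. This is currently not supported, but will be added in future releases.

Solution: The message is direct: PyTorch does not support arrays reversed via negative-step indexing. There are two fixes. The first is to store the flipped data in advance, which is cumbersome; the second is to return data.copy().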

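from torch.utils.data import Dataset

class Loader(Dataset):
    def __init__(self):
        pass
    def __getitem__(self, index):
        pass
    def flip(self, data):
        # ::-1 produces a view with negative strides
        data = data[:, ::-1, :]
        # copy() returns a new array with ordinary (positive) strides
        return data.copy()
    def __len__(self):
        pass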
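To see the copy() fix from item 13 in isolation, here is a small sketch on a made-up array. The shape and values are arbitrary.

```python
import numpy as np
import torch

data = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Reversing the second axis gives a view whose stride on that axis is negative.
flipped = data[:, ::-1, :]

# copy() materialises the view into a regular array that torch can wrap.
tensor = torch.from_numpy(flipped.copy())
```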

14. During model testing:

TypeError: Broadcast function not implemented for CPU tensors

Solution: Older PyTorch versions do not support data parallelism on the CPU. The latest versions do, so upgrading PyTorch fixes it.

15. Loading a model with

torch.load('model.pth')

fails with RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Solution: By default the model is loaded onto CUDA, but this machine has no CUDA. The fix is direct; following the hint, change the loading code to

torch.load('model.pth',map_location=torch.device('cpu'))

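16. Computing the cross-entropy loss with CrossEntropyLoss

fails with RuntimeError: 1only batches of spatial targets supported (non-empty 3D tensors) but got target of size ...

The cause is that PyTorch's CrossEntropyLoss expects the target to be a 3D tensor here, i.e. shape (N, H, W) for an input of shape (N, C, H, W).

Solution: squeeze out the extra dimension.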
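A sketch of the squeeze fix from item 16, assuming a segmentation-style setup in which the target carries a redundant channel dimension of size 1. The shapes are invented for illustration.

```python
import torch
from torch import nn

logits = torch.randn(2, 5, 4, 4)             # (N, C, H, W) class scores
target = torch.randint(0, 5, (2, 1, 4, 4))   # (N, 1, H, W) class indices

criterion = nn.CrossEntropyLoss()

# Drop the singleton channel dimension so the target becomes (N, H, W).
loss = criterion(logits, target.squeeze(1))
```

Passing the dimension explicitly to squeeze avoids accidentally removing the batch dimension when the batch size is 1.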

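17. Computing the loss

fails with RuntimeError: bool value of Tensor with more than one value is ambiguous

This one is a bit embarrassing. I hit it while quickly checking whether the network would run end to end, and had mistakenly written nn.*Loss(output, target), which passes the tensors to the loss constructor instead of to the loss itself.

Solution: nn.*Loss()(output, target)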
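Spelled out with a concrete loss, the pattern from item 17 is: construct the loss module first, then call the instance on the tensors. nn.MSELoss and the values below are chosen only for illustration.

```python
import torch
from torch import nn

output = torch.tensor([[0.5, 1.5], [2.0, 4.0]])
target = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

# Step 1: build the loss module (constructor takes options, not data).
criterion = nn.MSELoss()

# Step 2: call the instance with the prediction and the target.
loss = criterion(output, target)
```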

18. Updating model parameters from pretrained weights with update has no effect

The problem looks like this:

In [1]: import torch

In [2]: from collections import OrderedDict

In [3]: class test(torch.nn.Module):
   ...:      def __init__(self):
   ...:          super(test,self).__init__()
   ...:          self.conv = torch.nn.Conv2d(in_channels=1,out_channels=1,kernel_size=3)
   ...:      def forward(self,input):
   ...:          return self.conv(input)
   ...:

In [4]: temp = test()

In [5]: dic = OrderedDict()

In [6]:  dic['conv.weight'] = torch.rand((1,1,3,3))

In [7]: temp.state_dict()
Out[7]:
OrderedDict([('conv.weight', tensor([[[[-0.1748,  0.0271, -0.3102],
                        [-0.1261, -0.2181,  0.0350],
                        [ 0.0762, -0.0180, -0.1770]]]])),
             ('conv.bias', tensor([0.2851]))])

In [8]: temp.state_dict().update(dic)

In [9]: temp.state_dict()
Out[9]:
OrderedDict([('conv.weight', tensor([[[[-0.1748,  0.0271, -0.3102],
                        [-0.1261, -0.2181,  0.0350],
                        [ 0.0762, -0.0180, -0.1770]]]])),
             ('conv.bias', tensor([0.2851]))])
In [16]: dic
Out[16]:
OrderedDict([('conv.weight', tensor([[[[0.2074, 0.9585, 0.9153],
                        [0.0786, 0.8215, 0.8277],
                        [0.3613, 0.6411, 0.4371]]]]))])

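As shown above, we expected the custom dic to update the model parameters, but the result differs from the expectation: the model's state_dict is unchanged.

Solution: see item 19.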

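19. Partially loading a pretrained model

The correct way to load part of a pretrained model's parameters is as follows (continuing the IPython session from the previous item):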

In [10]: model_state = temp.state_dict()

In [12]: model_state.update(dic)

In [13]: temp.load_state_dict(model_state)
Out[13]: IncompatibleKeys(missing_keys=[], unexpected_keys=[])

In [15]: temp.state_dict()
Out[15]:
OrderedDict([('conv.weight', tensor([[[[0.2074, 0.9585, 0.9153],
                        [0.0786, 0.8215, 0.8277],
                        [0.3613, 0.6411, 0.4371]]]])),
             ('conv.bias', tensor([0.2851]))])

In [16]: dic
Out[16]:
OrderedDict([('conv.weight', tensor([[[[0.2074, 0.9585, 0.9153],
                        [0.0786, 0.8215, 0.8277],
                        [0.3613, 0.6411, 0.4371]]]]))])

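The model parameters are now updated. The issue is that update alone is not enough: temp.state_dict() returns a new dictionary on each call, so after update the modified dictionary must be passed back through load_state_dict to complete the partial load.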
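The same partial-loading steps can be written as a standalone script. The class and variable names below are renamed from the IPython session for readability and are otherwise arbitrary.

```python
from collections import OrderedDict

import torch


class Test(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

    def forward(self, x):
        return self.conv(x)


model = Test()

# Partial "pretrained" weights: only conv.weight, no conv.bias.
partial = OrderedDict()
partial["conv.weight"] = torch.rand((1, 1, 3, 3))
old_bias = model.conv.bias.detach().clone()

# Take one copy of the state dict, overwrite the matching entries,
# then load the merged dict back into the model.
state = model.state_dict()
state.update(partial)
model.load_state_dict(state)
```

Keys missing from partial (here conv.bias) keep their current values because they come from the model's own state dict.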

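20. While running, the program prints WARNING:root:NaN or Inf found in input tensor.

Because the message says "found in input tensor", my first thought was that the data had not been filtered and contained NaN or Inf. Unexpectedly, the actual cause turned out to be vanishing gradients.

Solution: adjust the learning rate or switch to a different optimization method.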

21. import torchvision fails with AttributeError: module 'torch.jit' has no attribute 'unused'

The cause is a torchvision version problem: the installed torchvision is probably too new to be supported.

Solution: downgrade torchvision to 0.4 or below.

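22. The DataLoader fails while reading data with RuntimeError: invalid argument 0: Sizes of tensors must match except in dime

The cause is that samples in the same batch have different numbers of channels. With images, for example, if some are read as grayscale and others as RGB, this error appears.

Solution: check how the data is read and make the reading mode consistent.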

23. Using torch.index_select to pick data along a given dimension of a Tensor fails with

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #3 'index' in call to _th_index_select

The fix is straightforward: index must be a Tensor of type Long. To select many indices, a range can be used, for example:

data_index, label_index = torch.Tensor([0, 12, 18, 21, 22, 23]).long(), torch.Tensor(range(24, 48)).long()
data, label = torch.index_select(input, dim=2, index=data_index), torch.index_select(input, dim=2, index=label_index)
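A self-contained variant of the snippet in item 23 is sketched below. The input tensor x and its shape are assumptions, and torch.tensor with dtype=torch.long plus torch.arange are used here as an alternative way to build Long indices.

```python
import torch

# Hypothetical input with 48 entries along dimension 2.
x = torch.arange(2 * 3 * 48, dtype=torch.float32).reshape(2, 3, 48)

# Index tensors must be integer (Long) tensors.
data_index = torch.tensor([0, 12, 18, 21, 22, 23], dtype=torch.long)
label_index = torch.arange(24, 48)  # integer arguments give an int64 tensor

data = torch.index_select(x, dim=2, index=data_index)
label = torch.index_select(x, dim=2, index=label_index)
```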

24. Error: ImportError: cannot import name 'amp' from 'torch.cuda'

Solution: install apex

git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cpp_ext

Then replace from torch.cuda import amp with from apex import amp.

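25. Error: TypeError: only integer tensors of a single element can be converted to an index

Solution: debug and look at where the error is raised. Usually the data has the wrong type, or a function's multiple return values were mistakenly treated as a single value in a computation.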

26. In pytorch-lightning, self.hparams=hparams raises AttributeError: can't set attribute

Solution: replace self.hparams=hparams with self.save_hyperparameters(hparams).

27. Exporting a PyTorch model to ONNX for deployment fails with torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of operator upsample_bilinear2d, align_corners == True.

Solution: set the opset_version argument of the ONNX export to 11.

28. cartopy or cinrad cannot be installed with pip; the install fails with GEOS or PROJ errors

Solution: install directly with conda

conda install cartopy
conda install cinrad

29. Error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Solution: setting device to cpu usually reveals where the actual problem is. Another approach is to check whether there is enough GPU memory.

30. Defining a custom metric with torchmetrics fails with RuntimeError: result type Double can't be cast to the desired output type Long

Solution: sometimes the problem does not come from update or compute. Consider the following custom RMSE:

import torch
import torchmetrics

class RMSE(torchmetrics.Metric):
    def __init__(self):
        super(RMSE, self).__init__()
        # torch.tensor(0) is a Long tensor; adding float values to it in place raises the error
        self.add_state('sum_squared_errors', torch.tensor(0), dist_reduce_fx='sum')
        self.add_state('n_observations', torch.tensor(0), dist_reduce_fx='sum')

    def update(self, preds, target):
        print(preds.dtype, target.dtype)
        self.sum_squared_errors += torch.sum((preds-target)**2)
        self.n_observations += preds.numel()

    def compute(self):
        return torch.sqrt(self.sum_squared_errors/self.n_observations)

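This raises the error above. Checking preds and target confirms that they are float or double, not Long. So the problem lies in the registration of sum_squared_errors: changing torch.tensor(0) to torch.tensor(0.0) fixes it.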
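The dtype behaviour behind item 30 can be reproduced with plain tensors, without torchmetrics. This is a sketch with made-up values: an integer accumulator rejects an in-place add of a floating-point result, while a float accumulator accepts it.

```python
import torch

errors = torch.tensor([0.5, -1.5], dtype=torch.float64)

# torch.tensor(0) is a Long tensor; the in-place add cannot store a Double result.
int_acc = torch.tensor(0)
try:
    int_acc += torch.sum(errors ** 2)
    failed = False
except RuntimeError:
    failed = True

# torch.tensor(0.0) is a floating-point tensor, so the same accumulation works.
float_acc = torch.tensor(0.0)
float_acc += torch.sum(errors ** 2)
```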