PyTorch pitfalls log

_Cade_

Article link: https://blog.csdn.net/u010510549/article/details/91390953


For other pitfalls that people have run into, see the answers on Zhihu.

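1. Broadcast function not implemented for CPU tensors

This happens because the model is not on the GPU; move it there with model.to(device). DataParallel checks which GPU the model's parameters are on (see the source code).

DataParallel broadcasts the model on every forward pass, so the error appears when the model is not on the first GPU.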

https://github.com/pytorch/pytorch/issues/17065

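2. all tensors must be on devices[0]

This happens because the model is not on the first device in the ids passed to DataParallel. The input tensors can sit on any GPU, but the model must be on the first device in DataParallel's device_ids.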
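Items 1 and 2 come down to the same rule: the module must already be on device_ids[0] when DataParallel wraps it. A minimal sketch of that order of operations follows. The helper name wrap_data_parallel is hypothetical, and defaulting to every visible GPU is an assumption, not something the original post prescribes.

```python
import torch
import torch.nn as nn


def wrap_data_parallel(model, device_ids=None):
    """Hypothetical helper: move the model to device_ids[0], then wrap it."""
    if device_ids is None:
        # Assumption: use every visible GPU when no ids are given.
        device_ids = list(range(torch.cuda.device_count()))
    if not device_ids:
        # No GPU requested or available: return the plain model unchanged.
        return model
    # Parameters and buffers must live on the first device in device_ids.
    model = model.to(torch.device("cuda:{}".format(device_ids[0])))
    return nn.DataParallel(model, device_ids=device_ids)


model = wrap_data_parallel(nn.Linear(4, 2))
```

On a machine without CUDA the helper simply returns the unwrapped model, so the same script can run on both CPU and GPU hosts.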

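3. Converting a multi-GPU model to the CPU

A model wrapped in DataParallel gains an extra module layer, so the keys of its state_dict carry an extra module prefix. Suppose net1 is an instance of the model Net wrapped in DataParallel and we want to convert it to the CPU. The approach is to build a fresh object and copy the parameters over:

    state_dict = net1.module.state_dict()  # keys without the "module." prefix
    net = Net()                            # plain model, created on the CPU
    net.load_state_dict(state_dict)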
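If the checkpoint was saved from the wrapped model itself (net1.state_dict() rather than net1.module.state_dict()), every key still carries the module. prefix. A small sketch of stripping it before loading into a plain Net; the helper name strip_module_prefix is hypothetical, and the dictionary below is only a stand-in for a real state_dict.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from keys."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Stand-in for a real state_dict; the values would normally be tensors.
wrapped = OrderedDict([("module.fc.weight", 1), ("module.fc.bias", 2)])
plain = strip_module_prefix(wrapped)
```

With a real checkpoint the result can be passed straight to net.load_state_dict(...).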

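4. When a model is wrapped in DataParallel, with more than one GPU and a multi-output model, the gradients come out as None

The parameter gradients are always None. This is a bug in PyTorch 1.0; see also the FloWaveNet issues and PyTorch issue 15716.

Reportedly the bug is caused by a reference-counting problem. One workaround is the approach given in the links above, which I modified to support multiple forward passes. It keeps the replicas and outputs produced during forward and clears them (drops the references) only after backward. They are stored in lists because one model may process several inputs before the loss is computed; with a single input, only one entry is stored. Remember to call reset afterwards to clear the saved references, otherwise everything stays alive and GPU memory balloons.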

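from itertools import chain

import torch
import torch.nn as nn


class DataParallelFix(nn.DataParallel):
    """
    Temporary workaround for https://github.com/pytorch/pytorch/issues/15716.

    Keeps references to the replicas and outputs of each forward pass until
    reset() is called, so they are not freed before backward runs.
    """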

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self._replicas = []
        self._outputs = []
        self.src_device_obj = torch.device("cuda:{}".format(self.device_ids[0]))

    def reset(self):
        self._replicas = []
        self._outputs = []

    def forward(self, *inputs, **kwargs):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)

        for t in chain(self.module.parameters(), self.module.buffers()):
            if t.device != self.src_device_obj:
                raise RuntimeError(
                    "module must have its parameters and buffers "
                    "on device {} (device_ids[0]) but found one of "
                    "them on device: {}".format(self.src_device_obj,
                                                t.device))

        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])

        # Hold on to the replicas and outputs of this pass; the caller must
        # call reset() after backward to release them.
        _replicas = self.replicate(self.module,
                                   self.device_ids[:len(inputs)])
        _outputs = self.parallel_apply(_replicas, inputs, kwargs)
        self._replicas.append(_replicas)
        self._outputs.append(_outputs)
        return self.gather(_outputs, self.output_device)

5. Training loss swings up and down or does not decrease, or the loss becomes nan

One cause is a learning rate that is too high.
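Why a large learning rate makes the loss blow up can be seen on a toy problem. The sketch below is an illustration in plain Python, not taken from the original post: it runs gradient descent on f(x) = x**2, where each step multiplies x by (1 - 2 * lr), so the loss shrinks only while that factor stays below 1 in absolute value.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent; return the losses."""
    x = x0
    losses = []
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # each step multiplies x by (1 - 2 * lr)
        losses.append(x * x)
    return losses


small_lr = gradient_descent(lr=0.1)  # |1 - 2 * 0.1| = 0.8 < 1: loss shrinks
large_lr = gradient_descent(lr=1.1)  # |1 - 2 * 1.1| = 1.2 > 1: loss grows
```

In a real network the threshold is not known in advance, so a common first check is to lower the learning rate by a factor of ten and see whether the loss curve settles.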

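6. ByteTensor and LongTensor are not automatically converted to FloatTensor

So dividing a LongTensor by a number keeps only the integer part (this is the behavior of the PyTorch versions current when these notes were written, around 1.0).

For example:

((out == label).sum() / float(batch_size)).item()
# the result is 0
# it should be changed to
((out == label).sum().item() / batch_size)
# or
((out == label).sum().float() / batch_size).item()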

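7. Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

This is caused by calling backward twice through the same module. Sometimes retain_graph=True is not what you want: the real cause is that the second of two forward passes reused intermediate results from the first, so backward automatically propagates back into the modules of the first forward pass, which triggers the same error.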
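When the second forward pass only needs the values of the earlier intermediate result, not its gradient path, detaching it avoids the error without retain_graph=True. A minimal CPU sketch; the two-layer setup and the names encoder and head are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(3, 3)
head = nn.Linear(3, 1)
x = torch.randn(2, 3)

# First forward/backward. After backward the saved buffers of this graph
# are freed, so nothing may backpropagate through `hidden` again.
hidden = torch.tanh(encoder(x))
hidden.sum().backward()

# Second forward reuses `hidden`. detach() cuts the link to the first graph,
# so this backward stops at `head` instead of re-entering `encoder`.
out = head(hidden.detach())
out.sum().backward()
```

If the gradient really must flow through both passes, retain_graph=True on the first backward is the right tool; detach is only for the case where the reuse was incidental.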

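8. RuntimeError start () + length () exceeds dimension size ()

When using an LSTM, the input and lengths passed to pack_padded_sequence did not match: when the LSTM unpacks the data, lengths is larger than the actual data. In my case I processed input again after computing lengths, which made input shorter so that it no longer matched lengths. Make sure input and lengths are consistent.

It can also happen in other operations such as split, when the shape of some image is wrong; see other causes.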
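A cheap guard is to validate lengths against the padded input right before packing, so a mismatch fails with a readable message instead of deep inside the LSTM. The helper below is a hypothetical sketch in plain Python; it takes the sizes as integers so it does not depend on the tensor layout.

```python
def check_lengths(num_steps, batch_size, lengths):
    """Raise ValueError if `lengths` cannot describe a padded batch.

    num_steps:  size of the time dimension of the padded input
    batch_size: number of sequences in the batch
    lengths:    per-sequence lengths that will be passed with the input
    """
    lengths = list(lengths)
    if len(lengths) != batch_size:
        raise ValueError(
            "expected {} lengths, got {}".format(batch_size, len(lengths)))
    if min(lengths) < 1:
        raise ValueError("every length must be at least 1")
    if max(lengths) > num_steps:
        raise ValueError(
            "longest length {} exceeds padded input size {}".format(
                max(lengths), num_steps))
    return lengths
```

For a batch_first tensor x this would be called as check_lengths(x.size(1), x.size(0), lengths) immediately before pack_padded_sequence, after any processing that might shorten x.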

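9. RuntimeError: cudaEventSynchronize in future::wait: device-side assert triggered

The values fed to the loss are wrong: either the labels are out of range or the logits are. For example, labels must start at 0, and yours do not, or some of them are negative.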
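Because the device-side assert says little about which value was wrong, it can help to check the label range on the host before computing the loss. A sketch with a hypothetical helper, check_labels, that works on any iterable of integers:

```python
def check_labels(labels, num_classes):
    """Raise ValueError unless every label is an index in [0, num_classes)."""
    labels = list(labels)
    bad = [y for y in labels if not 0 <= y < num_classes]
    if bad:
        raise ValueError(
            "labels out of range [0, {}): {}".format(
                num_classes, sorted(set(bad))))
    return labels
```

For a 1-D target tensor this could be called as check_labels(target.tolist(), num_classes). Re-running the failing step on the CPU is another way to get a more specific error message.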

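10. Be extremely careful with division. Besides division by zero, watch out for all kinds of numerical error: a tiny error can cause failures later on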
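Two habits cover most of these cases: keep denominators away from zero, and compare floating point results with a tolerance. The sketch below uses plain Python; the helper safe_divide and the value of eps are assumptions for illustration, and clamping the denominator changes the result, so whether it is acceptable depends on the computation.

```python
import math


def safe_divide(numerator, denominator, eps=1e-8):
    """Divide, but keep the denominator at least `eps` away from zero."""
    if abs(denominator) < eps:
        denominator = eps if denominator >= 0 else -eps
    return numerator / denominator


# Floating point results should be compared with a tolerance, not with ==.
exact_match = (0.1 + 0.2 == 0.3)
close_match = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)
```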

11. RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

The inputs argument of torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False) contains a tensor that was not used to compute outputs, so it is not part of that computation graph. This may indicate a mistake. If it is the intended behavior, set allow_unused to True.
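A small CPU example of the situation: one of the tensors passed as inputs never contributes to the output. The tensor names are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
unused = torch.tensor([3.0], requires_grad=True)  # never enters the graph
y = (x * x).sum()

# Without allow_unused=True this call raises the RuntimeError quoted above,
# because `unused` did not contribute to `y`.
grad_x, grad_unused = torch.autograd.grad(y, [x, unused], allow_unused=True)
```

With allow_unused=True the gradient for the unused tensor comes back as None, so downstream code has to handle that case.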

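11. GPU memory usage keeps growing from batch to batch

Most likely you are holding on to the computation graph, so it keeps accumulating. For example, you stored results without calling detach or item.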
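A sketch of bookkeeping that does not keep graphs alive, shown on the CPU with a toy model and random data (both are placeholders, not from the original post). The running loss is a Python float, and stored outputs are detached.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()

running_loss = 0.0   # plain Python float, holds no graph
saved_outputs = []   # detached tensors, safe to keep across batches
for _ in range(3):
    inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
    model.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    # `running_loss += loss` would keep every batch's graph reachable.
    running_loss += loss.item()
    saved_outputs.append(outputs.detach())
```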

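12. With the same batch size, training runs fine but eval runs out of memory

model.eval() only changes the behavior of layers such as batchnorm and dropout; nothing else changes. The forward pass still keeps intermediate results (feature maps, activation outputs) and still builds the computation graph. Because you never call backward, the graph is not released and stays in memory unless you explicitly del the variables or call backward. Wrap the model computation in with torch.no_grad(): and no intermediate results are kept and no graph is built.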
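The two switches do different jobs, so evaluation code usually needs both. A minimal CPU sketch with a placeholder model and random input:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 2))
batch = torch.randn(16, 4)

model.eval()            # switches dropout/batchnorm to inference behavior
with torch.no_grad():   # stops autograd from recording the forward pass
    predictions = model(batch)
```

Remember to call model.train() again before the next training epoch.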

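13. Parameters registered in an nn.Module subclass still have is_leaf=True after model.to('cuda:0') moves them to CUDA, whereas a tensor you define yourself (requires_grad=True) is no longer a leaf once it is moved to CUDA. This is because torch guarantees that only tensors created directly by the user, or tensors with requires_grad=False, are leaves. As for what nn.Module.to does internally, I do not know (I have not read the C source; if you know, please tell me, thanks). To keep a custom tensor a leaf when moving it to CUDA, create a new tensor on CUDA with torch.tensor, as follows:

import torch

a = torch.Tensor([1., 2., 3.])  # is_leaf=True
a.requires_grad = True
a_cuda = a.to('cuda:0')  # is_leaf=False
# copy_ and clone do not help either: the result stays attached to the original node in the graph
b = torch.tensor(a, device=torch.device('cuda:0'))  # is_leaf=True
# an alternative that avoids the copy-construct warning and keeps requires_grad=True
c = a.detach().to('cuda:0').requires_grad_()  # is_leaf=True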

14. RuntimeError: grad_columns needs to be contiguous

I hit this when calling autograd several times on the CPU; with CUDA it does not happen, and I do not know why. There is a similar issue: https://github.com/pytorch/pytorch/issues/33168

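15. Errors when installing pytorch with conda

UnsatisfiableError: The following specifications were found to be incompatible with each other:

or

UnsatisfiableError: The following specifications were found to be incompatible with each other

The cause is that conda cannot find a conflict-free installation plan. The conda version may be unsuitable, in which case install a suitable one. It may also be a version conflict between pytorch and its dependencies in the configured sources, which can be resolved by adding suitable channels, for example (some packages only have a suitable version in the default channel):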

conda config --add channels defaults
conda config --add channels conda-forge

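16. transform() got an unexpected keyword argument 'fillcolor'

The pillow version is too old, and some transforms such as rotation and translation do not support the fillcolor argument. Upgrade pillow to 5.0 or later.

To be continued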
