PyTorch Pitfall Guide

mingo_敏

Last modified: 2023-03-23 13:17:18

First published: 2020-04-14 19:41:14

Article link: https://blog.csdn.net/shanglianlm/article/details/105519359

1 RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 337 and 336 in dimension 3 at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/TH/generic/THTensor.cpp:689
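Solution:
The input sizes do not match, so the samples cannot be stacked into one batch. This depends on the dataset and on the following call in the data preprocessing: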

transforms.Resize(input_size)

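Given a single int, this call scales the shorter edge to input_size and keeps the aspect ratio. Because the images in the dataset have different resolutions, the results are not all (224, 224). The fix is to pass the target size as a tuple: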

transforms.Resize((input_size, input_size))

References:
1 Fixing a PyTorch data preprocessing error

2 RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [16, 256, 20, 20]], which is output 0 of ReluBackward1, is at version 2; expected version 1 instead.
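
The message says that a tensor autograd saved for the backward pass (here the output of a ReLU) was overwritten in place before backward() ran. The sketch below reproduces this on a small CPU tensor and shows the out-of-place alternative; the function name and tensor shapes are illustrative only.

```python
import torch


def backward_through_relu(modify_in_place: bool) -> torch.Tensor:
    x = torch.randn(4, 3, requires_grad=True)
    y = torch.relu(x)  # autograd keeps y to compute the ReLU gradient
    if modify_in_place:
        y += 1         # overwrites the saved tensor in place
    else:
        y = y + 1      # creates a new tensor; the saved y is untouched
    y.sum().backward()
    return x.grad


try:
    backward_through_relu(modify_in_place=True)
    in_place_error = None
except RuntimeError as err:
    in_place_error = str(err)

grad = backward_through_relu(modify_in_place=False)
print(in_place_error is not None, tuple(grad.shape))
```

In a real network the same pattern often comes from nn.ReLU(inplace=True) followed by another in-place update of the same activation.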

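Solution:
Find the in-place operation on the tensor named in the message (here a ReLU output) and replace it with an out-of-place one, for example nn.ReLU(inplace=False), or x = x + y instead of x += y.
The notes below cover a different but common problem, NaN values during training. Sometimes Adam produces no NaN while SGD does; this usually means the loss, and therefore the update step, is too large, so try lowering the learning rate.
Other possible sources of NaN are worth checking:
1. Dirty data: the input contains NaN
2. Set gradient clipping
3. Switch to a different parameter initialization method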
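
The first two NaN checks above can be sketched as a single training step. This is a minimal example under assumed settings: the model, optimizer, learning rate, and max_norm value are placeholders, not values from the original post.

```python
import torch
import torch.nn as nn

# Hypothetical model and data, used only to make the sketch runnable.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 2)


def train_step(inputs, targets, max_norm=1.0):
    # Check 1: refuse dirty batches that already contain NaN.
    if torch.isnan(inputs).any():
        raise ValueError("input batch contains NaN")
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Check 2: rescale gradients so their total norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()


loss_value = train_step(inputs, targets)
```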

References:
1 Things you need to know about PyTorch in-place operations
2 A roundup of common PyTorch problems

3 Expected more than 1 value per channel when training

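One possible cause is that a batch with batch_size = 1 reached the network, typically the last, incomplete batch of an epoch.

Solution:

Consider passing drop_last=True to the DataLoader, which discards the final batch when it is smaller than the batch size.
If you need to train with batch_size = 1, consider replacing BatchNorm in the network with InstanceNorm, or simply removing the BatchNorm layer at the location of the error.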
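
A minimal sketch of the drop_last fix, assuming a toy dataset of 9 samples and a batch size of 4, so that the final batch would otherwise hold a single sample:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 9 samples with batch_size=4 leaves a final batch of size 1.
dataset = TensorDataset(torch.randn(9, 4))

loader_keep = DataLoader(dataset, batch_size=4)
loader_drop = DataLoader(dataset, batch_size=4, drop_last=True)

sizes_keep = [batch.shape[0] for (batch,) in loader_keep]
sizes_drop = [batch.shape[0] for (batch,) in loader_drop]
print(sizes_keep)  # [4, 4, 1]
print(sizes_drop)  # [4, 4]

# BatchNorm in training mode cannot compute statistics from one sample.
bn = nn.BatchNorm1d(4)
bn.train()
try:
    bn(torch.randn(1, 4))
    single_sample_ok = True
except ValueError:
    single_sample_ok = False
```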

References:
1 A roundup of common PyTorch problems

4 Can’t call numpy() on Variable that requires grad. Use var.detach().numpy() instead

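Solution:
There are usually two cases:
1) The tensor depends on trainable parameters and needs backpropagation: use var.detach().numpy() to read its value.
2) The tensor is not trained and needs no backpropagation: wrap the relevant code in with torch.no_grad():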
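
Both cases can be sketched on a small CPU tensor (a CUDA tensor would additionally need .cpu() before .numpy()). The variable names are illustrative.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Case 1: y is part of the graph, so detach it before converting.
y = w * 2
try:
    y.numpy()
    direct_call_failed = False
except RuntimeError:
    direct_call_failed = True
arr_detached = y.detach().numpy()

# Case 2: nothing computed under no_grad is tracked, so numpy() works directly.
with torch.no_grad():
    z = w * 2
arr_no_grad = z.numpy()
```

detach() returns a tensor that shares storage with the original, so the NumPy array is a view of the same memory rather than a copy.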

参考资料:
1 PyTorch 常见问题整理

5 TypeError: ToTensor() takes no arguments

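The class has to be instantiated. When v is None the code returns the class itself rather than an instance, so calling the transform later passes the image to the constructor.
Change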

transform = getattr(ctf, t)(**v) if v is not None else getattr(ctf, t)

to

transform = getattr(ctf, t)(**v) if v is not None else getattr(ctf, t)()
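
The effect of the missing parentheses can be reproduced without torchvision. In the sketch below, ToTensor is a hypothetical stand-in class and ctf is a stand-in namespace, both invented for illustration; only the getattr pattern comes from the code above.

```python
import types


class ToTensor:
    """Hypothetical stand-in for a transform class with no __init__ arguments."""

    def __call__(self, x):
        return [x]


# Stand-in for the module that the original code refers to as ctf.
ctf = types.SimpleNamespace(ToTensor=ToTensor)


def build(t, v, instantiate):
    cls = getattr(ctf, t)
    if v is not None:
        return cls(**v)
    # instantiate=False mimics the bug: the class itself is returned.
    return cls() if instantiate else cls


fixed = build("ToTensor", None, instantiate=True)
result = fixed(5)  # calls ToTensor.__call__

buggy = build("ToTensor", None, instantiate=False)
try:
    buggy(5)  # calls the constructor with an argument
    error_message = None
except TypeError as err:
    error_message = str(err)
```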

References:
1 TypeError: ToTensor() takes no arguments

6 Assertion `t >= 0 && t < n_classes` failed.

void cunn_SpatialClassNLLCriterion_updateOutput_kernel
(T *, T *, T *, long *, T *, int, int, int, int, int, long)
[with T = float, AccumT = float]: 
block: [0,0,0], thread: [277,0,0]
Assertion `t >= 0 && t < n_classes` failed.

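Check three things:
a. the number of classes in the labels;
b. how the classes are encoded in the labels: do they run from 0 to n_classes - 1? (the most likely culprit)
c. the number of classes the network outputs.
If the data is confirmed to be fine, the problem is usually the network's output.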
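
The three checks above can be run on the CPU before the loss is computed, which gives a readable error instead of a device-side assertion. This is a sketch with a hypothetical helper, check_targets; it assumes logits in (N, C, H, W) layout and does not account for an ignore_index value.

```python
import torch


def check_targets(logits: torch.Tensor, target: torch.Tensor) -> None:
    """Raise if any label falls outside [0, n_classes - 1]."""
    n_classes = logits.shape[1]  # classes the network outputs
    lo, hi = int(target.min()), int(target.max())
    if lo < 0 or hi >= n_classes:
        raise ValueError(
            f"labels span [{lo}, {hi}] but the network outputs {n_classes} classes"
        )


logits = torch.randn(2, 3, 4, 4)                        # 3 output classes
good_target = torch.zeros(2, 4, 4, dtype=torch.long)    # labels within 0..2
bad_target = torch.full((2, 4, 4), 3, dtype=torch.long)  # label 3 is out of range

check_targets(logits, good_target)
try:
    check_targets(logits, bad_target)
    bad_target_rejected = False
except ValueError:
    bad_target_rejected = True
```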

References:
1 Bug fix: SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel

7 RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 ‘other’

Cause: a tensor was not moved to the GPU. Move it before use:
data = data.cuda()
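
A device-agnostic variant of the same fix is sketched below. It falls back to the CPU when no GPU is present; the model and tensor shapes are placeholders.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)

# Tensor.to() returns a new tensor, so the result must be reassigned.
data = torch.randn(8, 4)
data = data.to(device)

output = model(data)
```

Forgetting the reassignment (writing data.to(device) on its own line) leaves data on the CPU and produces the same error.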

8 RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:519

Possible cause 1:
A submodule or method of the network is used from outside the network without going through .module after the network has been wrapped in DataParallel, for example:

if __name__ == '__main__':
    seq = yourmodel()
    seq = torch.nn.DataParallel(seq, device_ids=gpu_id)
    seq.to(device)
    if args.run_type == 'train':
        seq.train()  # single-GPU style call
    elif args.run_type == 'predict':
        seq.predict()  # single-GPU style call

The correct approach:

if __name__ == '__main__':
    seq = yourmodel()

    seq = torch.nn.DataParallel(seq, device_ids=gpu_id)
    seq.to(device)
    if args.run_type == 'train':
        seq.module.train()  # multi-GPU: go through .module
    elif args.run_type == 'predict':
        seq.module.predict()  # multi-GPU: go through .module
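
Where the wrapped network lives can be seen in a small sketch. Net and its predict method are hypothetical, and the example does not reproduce the multi-GPU error itself; it only shows that custom methods are reached through .module after wrapping.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical model with a custom predict method."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

    def predict(self, x):
        with torch.no_grad():
            return self.forward(x).argmax(dim=1)


wrapped = nn.DataParallel(Net())

# The wrapper does not expose the custom method; the original model does.
wrapper_has_predict = hasattr(wrapped, "predict")
module_has_predict = hasattr(wrapped.module, "predict")

preds = wrapped.module.predict(torch.randn(3, 4))
```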

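Possible cause 2:
Somewhere a tensor is moved with .cuda() instead of ".to(device)". A bare .cuda() places the tensor on the current default GPU, which may not be the device the model is on. For example:

if self.use_cuda:
    decoder_input = decoder_input.cuda()
    decoder_context = decoder_context.cuda()

The correct approach:

if self.use_cuda:
    decoder_input = decoder_input.to(device)
    decoder_context = decoder_context.to(device)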
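
One way to keep such tensors on the right device is to read the device from the model itself rather than from a separate flag. The sketch below assumes a hypothetical GRU decoder; only the names decoder_input and decoder_context come from the snippet above.

```python
import torch
import torch.nn as nn

# Hypothetical decoder; any nn.Module with parameters works the same way.
decoder = nn.GRU(input_size=4, hidden_size=8)

# Ask the model where its parameters live instead of assuming a device.
device = next(decoder.parameters()).device

decoder_input = torch.zeros(1, 2, 4).to(device)    # (seq_len, batch, input_size)
decoder_context = torch.zeros(1, 2, 8).to(device)  # (num_layers, batch, hidden_size)

output, hidden = decoder(decoder_input, decoder_context)
```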

9 RuntimeError: copy_if failed to synchronize: device-side assert triggered

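Problem description:

I ran into this while using SSD for object detection. I had 5 target classes, so I set num_classes to 5 in data/config.py. After a lot of searching I found a detail I had missed: the value should be 5 + 1, where the extra 1 is the background class.

Another possible cause is class labels that do not start from 0.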
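
The off-by-one can be reproduced on the CPU, where an out-of-range label produces a readable exception instead of a device-side assert. This is a sketch: the class names are invented, and placing background at index 0 is an assumption about the label convention.

```python
import torch
import torch.nn.functional as F

# Hypothetical list of the 5 object classes to detect.
object_classes = ["car", "bus", "person", "bike", "dog"]
num_classes = len(object_classes) + 1  # +1 for background (assumed index 0)

# Labels 0..5: background plus five objects, so label 5 needs 6 outputs.
labels = torch.tensor([0, 3, 5])


def loss_on_cpu(n_outputs: int) -> torch.Tensor:
    logits = torch.zeros(labels.shape[0], n_outputs)
    return F.cross_entropy(logits, labels)


loss = loss_on_cpu(num_classes)  # 6 outputs: every label is valid

try:
    loss_on_cpu(len(object_classes))  # 5 outputs: label 5 is out of range
    too_few_outputs_failed = False
except (IndexError, RuntimeError):
    too_few_outputs_failed = True
```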
