Error Log (Continuously Updated)

任小喵r

Last modified 2023-03-15 16:57:47

Category: Python. Tags: deep learning, object detection

First published 2021-10-21 14:31:57

Article link: https://blog.csdn.net/renzijing/article/details/120885454

1 Error: RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0

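Code source: GeneralizedRCNNWithTTA in detectron2

Analysis: the tensors collected into all_boxes come from different devices, so start at the failing line and trace upward to see which tensors are appended to all_boxes.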

for output, tfm in zip(outputs, tfms):
    # Need to inverse the transforms on boxes, to obtain results on original image
    pred_boxes = output.pred_boxes.tensor
    original_pred_boxes = tfm.inverse().apply_rotated_box(pred_boxes.cpu().numpy())
    all_boxes.append(torch.from_numpy(original_pred_boxes).to(pred_boxes.device))
    all_scores.extend(output.scores)
    all_classes.extend(output.pred_classes)

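The all_boxes.append(...) line moves each new tensor to pred_boxes.device. Note that pred_boxes.cpu() on the line before it returns a copy and leaves pred_boxes itself where it was, so the error means the tensors involved did not all share one device (cpu for some, cuda:0 for others). The workaround is to choose the target device explicitly instead of relying on pred_boxes.device.

The modified code: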

for output, tfm in zip(outputs, tfms):
    # Need to inverse the transforms on boxes, to obtain results on original image
    pred_boxes = output.pred_boxes.tensor
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    original_pred_boxes = tfm.inverse().apply_rotated_box(pred_boxes.cpu().numpy())
    all_boxes.append(torch.from_numpy(original_pred_boxes).to(device))
    all_scores.extend(output.scores)
    all_classes.extend(output.pred_classes)

2 Error: RuntimeError: Address already in use

Code source: moco

Analysis: run netstat -nltp to list the processes that are listening on each port.

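If the process holding the port is your own, you can simply kill it (PID 4041109 in my netstat output). If you cannot kill it, pick a port that is not in use, for example 16308, and change moco's default --dist-url 'tcp://localhost:10001' to --dist-url 'tcp://localhost:16308'.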
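As an alternative to reading netstat output by hand, here is a minimal sketch that checks a port from Python and asks the operating system for an unused one. The helper names port_is_free and pick_free_port are my own, not part of moco, and the port could in principle be taken by another process between picking it and launching training.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use".
            return False
        return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = pick_free_port()
    # Paste the printed value into the moco launch command.
    print(f"--dist-url 'tcp://localhost:{port}'")
```

The printed string can be used in place of the default --dist-url value.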

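3 Error when running python setup.py install for ReDet's riroi align

Many warnings show up, but the root cause is that the source code targets an old PyTorch version. Some of its coding patterns have since been deprecated, which breaks compilation.

The official repo offers one fix: replace every AT_CHECK in the .cpp and .cu files with TORCH_CHECK. Beyond that, a few more changes are needed (the ones listed here are only those from riroi align; for anything else, follow the compiler messages in the same way):

For example: warning: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations]
AT_DISPATCH_FLOATING_TYPES_AND_HALF(

That is, code such as top_grad.data<scalar_t>() has to become top_grad.data_ptr<scalar_t>().

Other changes I needed: tensor.type() becomes tensor.scalar_type(); in CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor "), x.type() becomes x.device(); and so on. The approach is simply to read every warning and fix it.

(PS: these changes worked with CUDA 11.0 + PyTorch 1.7 but not with CUDA 11.4 + PyTorch 1.7. A wild guess, pure speculation: maybe mmcv did not support CUDA 11.4 yet?)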
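The mechanical renames above can be scripted. The following is a rough sketch, not a ReDet tool: the function name modernize and the pattern list are my own, and it covers only the three renames that are safe to apply textually. The tensor.type() to tensor.scalar_type() change depends on context, so that one is better done by hand while reading the warnings.

```python
import re

# (pattern, replacement) pairs for the renames described above.
RENAMES = [
    (re.compile(r"\bAT_CHECK\b"), "TORCH_CHECK"),
    (re.compile(r"\.data<"), ".data_ptr<"),
    (re.compile(r"\.type\(\)\.is_cuda\(\)"), ".device().is_cuda()"),
]


def modernize(source: str) -> str:
    """Apply the deprecated-API renames to one C++/CUDA source string."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source


if __name__ == "__main__":
    old = (
        "#define CHECK_CUDA(x) "
        'AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")\n'
        "top_grad.data<scalar_t>()\n"
    )
    print(modernize(old))
```

To use it on real files, read each .cpp and .cu file, pass its text through modernize, and write the result back after keeping a backup. Running it twice is harmless because the rewritten forms no longer match the patterns.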

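4 Error: RuntimeError: CUDA error: device-side assert triggered

Code source: mmrotate training

Analysis: the error message points to the data side. Searching online turns up many possible causes, but most of them come down to some value in the data going out of bounds. In my case I had forgotten to change the number of classes (a very silly mistake). Setting both num_classes entries to the number of classes in my own dataset fixed it.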
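A quick sanity check for this kind of mistake is to compare the label ids in the annotations against num_classes before training. This is a minimal sketch in plain Python; find_bad_labels is a hypothetical helper, not an mmrotate function, and the label list is made up.

```python
def find_bad_labels(labels, num_classes):
    """Return (index, label) pairs whose label is outside [0, num_classes)."""
    return [
        (i, label)
        for i, label in enumerate(labels)
        if not 0 <= label < num_classes
    ]


if __name__ == "__main__":
    # Made-up labels from a 5-class dataset, checked against a stale
    # num_classes=3 left over from another config.
    labels = [0, 2, 4, 1, 3]
    print(find_bad_labels(labels, num_classes=3))  # [(2, 4), (4, 3)]
```

Any non-empty result means a class index would fall outside the range the classification head expects, which is a common trigger for the device-side assert. Running once on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a more precise traceback.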

5 Error: IndexError: invalid index of a 0-dim tensor. Use tensor.item() in Python or tensor.item<T>() in C++ to convert a 0-dim tensor to a number

Code source: tps_stn_pytorch

Analysis: this comes from a PyTorch version update. Change the code as the message suggests: loss.data[0] becomes loss.item(), and F.nll_loss(output, target).data[0] becomes F.nll_loss(output, target).item().

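6 Error: RuntimeError: invalid argument 7: equal number of batches expected at /opt/conda/conda-bld/pytorch_1607370156314/work/aten/src/THC/generic/THCTensorMathBlas.cu:32

Code source: tps_stn_pytorch

Analysis: the tensor dimensions do not match. The original code was written for MNIST, so in operations such as nn.Linear(x1, x2) and T.view(x1, x2) the values of x1 and x2 are set for MNIST and have to be changed for your own dataset. (If you are not familiar with the data or with how to change this, debug first to see how the original data dimensions correspond to the dimensions in the code, then adjust for your own data.)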

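7 Error: ValueError: Expected input batch_size (1) to match target batch_size (4)

Code source: tps_stn_pytorch

Analysis: again a tensor dimension problem. batch_size is set to 4, but the data arriving at the loss has a batch dimension of 1. The cause is once more that the dimensions in T.view(x1, x2) do not fit the dataset in use; fix them accordingly.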
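The arithmetic behind errors 6 and 7 can be worked out without running the model. The sketch below assumes a purely illustrative conv5, pool2, conv5, pool2 stack with 20 output channels; it is not the actual tps_stn_pytorch architecture, and the 20x20 custom image size is made up. It shows how a view size hard-coded for 28x28 MNIST images folds a batch of 4 smaller images into a batch of 1.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a conv or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_features(size: int, channels: int) -> int:
    """Features per sample after a hypothetical conv5-pool2-conv5-pool2 stack."""
    size = conv_out(size, kernel=5)            # 5x5 convolution
    size = conv_out(size, kernel=2, stride=2)  # 2x2 max pooling
    size = conv_out(size, kernel=5)            # 5x5 convolution
    size = conv_out(size, kernel=2, stride=2)  # 2x2 max pooling
    return channels * size * size


if __name__ == "__main__":
    mnist = flat_features(28, channels=20)   # 20 * 4 * 4 = 320
    custom = flat_features(20, channels=20)  # 20 * 2 * 2 = 80
    batch_size = 4
    # view(-1, 320) on a batch of 4 custom images: 4 * 80 elements
    # are regrouped into rows of 320, so the batch axis shrinks.
    rows = batch_size * custom // mnist
    print(mnist, custom, rows)  # 320 80 1
```

With these made-up sizes, the flattened tensor has 1 row while the target still has 4, which is the same shape of mismatch that error 7 reports. Replacing the hard-coded 320 with the value computed for your own image size (or flattening with view(x.size(0), -1) and sizing nn.Linear to match) avoids it.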

8 Error: PyCharm debugger hangs at "connected"

Analysis: File > Settings > Build, Execution, Deployment > Python Debugger > tick "Gevent compatible". (I did not look into the reason; I tried this method from the web and it worked.)

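9 Error: RuntimeError: Tensor for ‘out’ is on CPU, Tensor for argument #1 ‘self’ is on CPU, but expected them to be on GPU (while checking arguments for addmm)

Code source: an attempt to add stn to resnet50

Analysis:
[Image: the offending code]

According to a blog post on this error, the network was probably being initialized only after the code had already reached the GPU stage, so the new layers and their parameters ended up on the CPU (probably). This fits how .cuda() and .to(device) behave: they move only the submodules that already exist when they are called.

The fix is to move the network initialization into __init__, so that everything is initialized once at the very start. After the change:

[Image: the code after the change]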

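10 Error: RuntimeError: mat1 dim 1 must match mat2 dim 0

Code source: modifying the network structure of mocov2

Analysis: I located the code from the error output and confirmed by searching online that two layers in the network had mismatched dimensions, but for a long time I could not find the exact spot.

The resnet50 structure ends with: layer, avgpool, fc. So this is how I had modified the network: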

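    def __init__(self, base_encoder, dim=128, mlp=False, preangle=1):
        super(stn_net, self).__init__()
        # Assumption: net is the resnet50 built from base_encoder, as MoCo does.
        net = base_encoder(num_classes=dim)
        # Keep everything except the last two children (avgpool and fc).
        self.backbone = nn.Sequential(*list(net.children())[:-2])
        self.stn = stn(2048, 7, 7, 0.8)
        self.avgpool = net.avgpool
        self.fc = net.fc

    def forward(self, x, angles):
        feature = self.backbone(x)
        stn_feature = self.stn(feature, angles)
        stn_feature = self.avgpool(stn_feature)  # [batch_size, 2048, 1, 1]
        net_out = self.fc(stn_feature)  # fails: fc expects [batch_size, 2048]
        return net_out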

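After ruling out the stn as the source of a dimension change, I fed a single image through a standard resnet50 and checked the shape after every layer. The output of avgpool has shape [batch_size, 2048, 1, 1], but fc needs an input of shape [batch_size, 2048]. The standard resnet50 forward flattens the tensor between the two (torch.flatten(x, 1)); once the model is taken apart into its children, that step is gone and the tensor stays four-dimensional, so the shape has to be adjusted explicitly.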

    def forward(self, x, angles):
        feature = self.backbone(x)
        stn_feature = self.stn(feature, angles)
        # Flatten [batch_size, 2048, 1, 1] to [batch_size, 2048] before fc.
        stn_feature = self.avgpool(stn_feature).view(stn_feature.size(0), -1)
        net_out = self.fc(stn_feature)
        return net_out

Reshaping the avgpool output this way makes the error go away.

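11 Error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Code source: modifying the network structure of mocov2

Analysis: after yet another round of reworking mocov2, the code finally ran without errors up to the start of training, then failed as soon as training began.

Judging from the error message, the key is the in-place operation.

I found two similar questions with answers online.

Here are some in-place operations I came across online (in-place roughly means that a variable is changed directly in its existing memory):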

x += 1        x *= 2
tanh_()       scatter_add_()      add_()
x[:,1] /= 4   x[:,1] = x[:,3]/4   x[:,3] = x[:,1]

Operations that do not cause it:

y = x + 1
y = x.clone()

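(Not every in-place operation fails in every situation, though; each case has to be looked at on its own.)

Because I eliminated the errors one at a time while debugging (several of the in-place operations above appeared in my code), I cannot say for sure which one was the actual cause.

I also ran into one more form of this error: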

def forward(self, img, angles):
    nn.init.constant_(self.fc[2].weight, 0)
    self.fc[2].bias.data.copy_(self.bias[angles[i]])

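The copy here probably also changes the existing memory directly, which triggers the error. Since I really did need different fc weights, I later created three separate fc layers in __init__, assigned each its own values, and in forward simply pick the fc layer according to the label passed in.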

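12 Error: ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Code source: mmrotate training

Analysis: a required package is clearly missing. Installing the headless OpenCV build, which does not depend on libGL, fixes it: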

pip install opencv-python-headless
