Trying Out mmdetection

xiaoshuzi666

Category: Object Detection

Article link: https://blog.csdn.net/xiaoshuzi666/article/details/100123195

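I created the container crazy_murdock (container ID 32f1a377fd2f) from the vistart/mmdetection image and configured its ports, frp, and ssh. I then connected to it with MobaXterm, downloaded the whole /mmdetection folder to my local D: drive, and set up a remote connection with path mapping in PyCharm.
I. Reading the documentation
I opened README.md, INSTALL.md, GETTING_STARTED.md, MODEL_ZOO.md, and TECHNICAL_DETAILS.md in Notepad++.
1. Following INSTALL.md, create a data directory under mmdetection and symlink the COCO dataset into it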

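cd /mmdetection
mkdir data
# $COCO_ROOT is a placeholder for an existing directory outside data/ that holds the COCO dataset;
# pointing the link at /mmdetection/data/coco itself would create a self-referencing link.
ln -s $COCO_ROOT data/coco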

Then enter the coco folder and download the COCO 2017 dataset (wget followed by each file's URL).
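Before running anything, it can help to confirm that the download landed where the configs look for it. The following is a minimal sketch, assuming the layout described in INSTALL.md (annotations, train2017, and val2017 under data/coco); the helper name missing_coco_dirs is made up for illustration and is not part of mmdetection.

```python
import os

# Sub-folders the COCO configs are assumed to read from (per INSTALL.md).
EXPECTED_DIRS = ("annotations", "train2017", "val2017")


def missing_coco_dirs(coco_root):
    """Return the expected sub-folders that do not exist under coco_root."""
    return [d for d in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(coco_root, d))]


if __name__ == "__main__":
    # Path used in this post; adjust it if the dataset lives elsewhere.
    print(missing_coco_dirs("/mmdetection/data/coco"))
```

An empty list means all three folders are present; it says nothing about whether the archives inside were fully extracted.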
2. Following the examples in GETTING_STARTED.md, first download the checkpoint files listed in MODEL_ZOO.md (https://open-mmlab.oss-cn-beijing.aliyuncs.com/mmdetection/models/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth and https://open-mmlab.oss-cn-beijing.aliyuncs.com/mmdetection/models/mask_rcnn_r50_fpn_1x_20181010-069fa190.pth) into the /mmdetection/checkpoints folder.
II. Running the code and fixing bugs
1. In the MobaXterm terminal, run

python tools/test.py configs/faster_rcnn_r50_fpn_1x.py \
    checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth \
    --show

This fails with:
AttributeError: module 'torch.nn' has no attribute 'SyncBatchNorm'
A labmate suggested upgrading torch to 1.1.0.
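The AttributeError can be checked for before launching tools/test.py. This is a small sketch, assuming that torch.nn.SyncBatchNorm is available from torch 1.1.0 onward (which matches the labmate's advice); version_tuple and has_sync_batchnorm are illustrative names, not PyTorch APIs.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.1.0+cu100' into (1, 1, 0)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def has_sync_batchnorm(torch_version):
    """Assumption: SyncBatchNorm exists from torch 1.1.0 onward."""
    return version_tuple(torch_version) >= (1, 1, 0)


if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, has_sync_batchnorm(torch.__version__))
        # The direct check, independent of any version parsing:
        print(hasattr(torch.nn, "SyncBatchNorm"))
    except ImportError:
        print("torch is not installed in this environment")
```

The hasattr check is the more reliable of the two, since it tests the installed package rather than a version number.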
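2. Installing nccl
After upgrading torch to 1.1.0 and running again, a new error appeared:
ImportError: /mmdetection/mmdet/ops/sigmoid_focal_loss/sigmoid_focal_loss_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE
I searched the GitHub issues: some people suggested renaming a folder, others suggested changing the mmdetection path, and my labmate suggested recompiling. I tried all of these and none of them helped. After rereading INSTALL.md carefully, I decided to try installing nccl and then compiling again.
First download the network-installer version of nccl for CUDA 10.1 from https://developer.nvidia.com/nccl, then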

sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2=2.4.8-1+cuda10.1 libnccl-dev=2.4.8-1+cuda10.1

Once that finished, I followed INSTALL.md and ran the following in the /mmdetection folder

./compile.sh
python setup.py develop

to recompile everything.
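The undefined symbol in the ImportError above, _ZN2at19UndefinedTensorImpl10_singletonE, demangles to at::UndefinedTensorImpl::_singleton, which is internal to PyTorch. An error like this typically means the .so was compiled against a different torch build than the one now installed, so stale extension binaries left over from before the upgrade are worth ruling out. The sketch below only lists compiled extensions under a directory so they can be inspected or deleted before recompiling. find_compiled_extensions is a hypothetical helper, and whether stale binaries were the actual cause here is not established by this post.

```python
import os


def find_compiled_extensions(root):
    """Return a sorted list of compiled extension modules (*.so) under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".so"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Directory taken from the ImportError path in this post.
    for path in find_compiled_extensions("/mmdetection/mmdet/ops"):
        print(path)
```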
3. Upgrading torchvision to 0.4
Running again gave:
OSError: checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth is not a checkpoint file
I changed the checkpoint argument from checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth to /mmdetection/checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth, which still did not work. I then upgraded torchvision to 0.4 and re-downloaded the file directly on the server, running wget https://open-mmlab.oss-cn-beijing.aliyuncs.com/mmdetection/models/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth in the /mmdetection/checkpoints folder (previously I had downloaded it on my local machine and uploaded it to the server).
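In mmcv, the "is not a checkpoint file" message appears to be raised when the given path is not an existing file (an assumption based on mmcv's load_checkpoint, not verified against the version in this container). If so, the path and the working directory are the first things to check. A minimal sketch follows; describe_checkpoint_path is an illustrative helper, not an mmcv function.

```python
import os


def describe_checkpoint_path(path, cwd=None):
    """Return a short diagnosis of why a checkpoint path might fail to load."""
    base = os.getcwd() if cwd is None else cwd
    # Relative paths are resolved against the directory the command runs in.
    full = path if os.path.isabs(path) else os.path.join(base, path)
    if not os.path.exists(full):
        return "missing"
    if not os.path.isfile(full):
        return "not a regular file"
    if os.path.getsize(full) == 0:
        return "empty file"
    return "ok"


if __name__ == "__main__":
    print(describe_checkpoint_path(
        "checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth",
        cwd="/mmdetection"))
```

A result of "ok" only shows that the file exists and is non-empty; a truncated upload would still pass this check and fail later when torch tries to read it.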
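4. Changing the shared memory size
(1) Running again gave:
ETA:: cannot connect to X server
A Baidu search suggested that imshow() cannot open a visualization window on a Linux server that has no X display, so I tried another example: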

python tools/test.py configs/mask_rcnn_r50_fpn_1x.py \
    checkpoints/mask_rcnn_r50_fpn_1x_20181010-069fa190.pth \
    --out results.pkl --eval bbox segm

This finally produced a result.
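The --show flag opens a window through imshow(), which needs an X display. Here is a minimal sketch for deciding up front whether --show can work, assuming that an unset or empty DISPLAY environment variable means no X server is reachable. A set DISPLAY does not guarantee the server accepts connections, and has_display is a hypothetical helper.

```python
import os


def has_display(environ=None):
    """Return True if the DISPLAY variable is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return bool(env.get("DISPLAY"))


if __name__ == "__main__":
    if has_display():
        print("DISPLAY is set; --show may work")
    else:
        print("no DISPLAY; use --out results.pkl instead of --show")
```

MobaXterm can forward X11 over ssh, in which case DISPLAY would be set inside the session, but that depends on the ssh configuration inside the container.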
(2) Next I tried running the training script:

python tools/train.py configs/faster_rcnn_r50_fpn_1x.py

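This fails with: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
……
RuntimeError: DataLoader worker (pid 11816) is killed by signal: Bus error.
My labmate said the container had probably been created with too little shared memory, so I stopped my container and saved it as an image.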

docker commit 32f1a377fd2f qdetection:v2

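Here 32f1a377fd2f is the ID of my original container, qdetection is the name I chose for the new image, and v2 is the tag I chose for it. docker printed the SHA-256 digest that identifies the new image:
sha256:25636cf40ce2da265f1fc2acbe54bbc1ee5e728969ab765beedc66262a8815de
Listing the images with docker images shows that qdetection has the ID b2734dab62b4. I used this image to create a new container, setting the shm size to 30g: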

nvidia-docker run --shm-size='30g' -it b2734dab62b4 /bin/bash

This drops me into the new container, whose ID is 65dc56327480 (prompt root@65dc56327480:/#), where I set up frp and ssh again.
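To confirm that --shm-size took effect, the size of /dev/shm can be read from inside the container. Docker's default is 64 MB, which DataLoader workers can exhaust quickly. This is a minimal sketch using only the standard library; shm_size_gib is an illustrative name.

```python
import os
import shutil


def shm_size_gib(path="/dev/shm"):
    """Return the total size of the shared-memory mount in GiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / 2**30


if __name__ == "__main__":
    size = shm_size_gib()
    if size is None:
        print("no /dev/shm on this system")
    else:
        print("/dev/shm: %.2f GiB" % size)
```

In a container started with --shm-size='30g' this should report roughly 30 GiB.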
5. Changing the tensor types in anchor_target.py
Running the training script again in the new container produced yet another error:
File "/mmdetection/mmdet/core/anchor/anchor_target.py", line 169, in anchor_inside_flags
(flat_anchors[:, 2] < img_w + allowed_border) &
RuntimeError: Expected object of scalar type Byte but got scalar type Bool for argument #2 'other'
Neither Baidu nor the GitHub issues turned up a fix, so I searched on Google through a proxy. One user (https://www.gitmemory.com/youkaichao) says the types have to be promoted manually, as follows:

Quoted from https://www.gitmemory.com/youkaichao, on type mismatches in torch.
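The error message says that a Byte (uint8) tensor is being combined with a Bool tensor using &. Comparisons such as flat_anchors[:, 2] < img_w + allowed_border return Bool tensors from torch 1.2 onward, and upgrading torchvision to 0.4 may have pulled in torch 1.2 as a dependency. The sketch below is not the quoted fix and not mmdetection's code. It shows one way to make the dtypes agree, assuming valid_flags is a uint8 mask; the function name and the sample data are made up for illustration.

```python
import torch


def anchor_inside_flags_sketch(flat_anchors, valid_flags, img_shape,
                               allowed_border=0):
    """Mark anchors that are valid and lie inside the image (plus a border)."""
    img_h, img_w = img_shape[:2]
    # Cast the mask explicitly so both operands of & share one dtype.
    valid = valid_flags.to(torch.bool)
    if allowed_border < 0:
        return valid
    inside = (
        (flat_anchors[:, 0] >= -allowed_border)
        & (flat_anchors[:, 1] >= -allowed_border)
        & (flat_anchors[:, 2] < img_w + allowed_border)
        & (flat_anchors[:, 3] < img_h + allowed_border)
    )
    return valid & inside


if __name__ == "__main__":
    anchors = torch.tensor([[0., 0., 10., 10.],
                            [-5., 0., 10., 10.],
                            [0., 0., 200., 10.]])
    flags = torch.ones(3, dtype=torch.uint8)
    print(anchor_inside_flags_sketch(anchors, flags, (100, 100)))
```

Casting in the other direction, converting each comparison result to uint8, would also make the dtypes match. Which direction suits the rest of anchor_target.py depends on how the returned mask is used downstream.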
6. Do in-place operations only work in PyTorch 0.3?
Running the training script again produced a new error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1024, 256, 7, 7]], which is output 0 of IndexPutBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
A search suggested changing every ReLU created with inplace=True to inplace=False, and rewriting every += operation such as out += residual as out = out + residual. However, I don't really know how to track down all the relevant files across the whole mmdetection project. I also couldn't get the hint's torch.autograd.set_detect_anomaly(True) to work for locating the offending operation (adding it to the program with a with statement kept raising errors).
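Two details in the message narrow the search. IndexPutBackward means the tensor was produced by an indexed assignment such as t[idx] = ... or t[idx] += ..., rather than by a ReLU. Also, torch.autograd.set_detect_anomaly(True) can be called as a plain function once before training starts. The sketch below is a toy reproduction, not mmdetection code: it triggers the same class of error with an in-place += on a tensor that autograd still needs, and shows the out-of-place rewrite. All function names are illustrative.

```python
import torch


def grad_with_inplace():
    """exp() saves its output for backward; the in-place += then modifies it."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()
    y += 1  # in-place: bumps the version of a tensor autograd saved
    y.sum().backward()
    return x.grad


def grad_out_of_place():
    """Same computation, but y + 1 creates a new tensor instead."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()
    y = y + 1
    y.sum().backward()
    return x.grad


def inplace_version_fails():
    """Run the in-place variant with anomaly detection switched on."""
    torch.autograd.set_detect_anomaly(True)  # plain function call
    try:
        grad_with_inplace()
    except RuntimeError:
        return True
    finally:
        torch.autograd.set_detect_anomaly(False)
    return False


if __name__ == "__main__":
    print("in-place variant fails:", inplace_version_fails())
    print("out-of-place gradient:", grad_out_of_place())
```

With anomaly detection enabled, the traceback of the forward operation whose backward failed is printed as a warning before the RuntimeError. In a real run that traceback is what points at the file and line to change. Anomaly detection slows training, so it is meant for debugging only.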
