Notes on Training mmdetection in Docker

Below is my own training log. Errors appear along the way and are resolved later in the post, so be sure to read to the end before following along.

Docker is already installed on the server, so the next step is to pull an mmdetection image from Docker Hub and train my own data inside a container.

Step 1:

The image I chose:

https://hub.docker.com/r/pengzili/mmdetection

Then pull the image and wait for the download...

docker pull pengzili/mmdetection

Once the download finishes, start a container from the image:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/  a841d5c9a7e7 /bin/bash

Next, prepare the VOC dataset. My approach is to build it in a host directory first and then copy it into the corresponding directory inside the container.

Create the directory structure on the host as shown below. Annotations holds the xml files and JPEGImages holds the jpeg images; note that every image must have exactly one matching xml. The xml files are generated by annotating with labelImg. Then copy the whole tree into the mmdetection directory.

|-- data
|   `-- VOCdevkit
|       `-- VOC2007
|           |-- Annotations
|           |-- ImageSets
|           |   `-- Main
|           `-- JPEGImages
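The split files that the VOC loader expects under ImageSets/Main (trainval.txt, test.txt) can be generated directly from the Annotations folder. A minimal sketch — the 80/20 ratio, file names, and fixed seed are my own choices, adjust as needed:

```python
import os
import random

def make_voc_splits(voc_root, train_ratio=0.8, seed=0):
    """Write ImageSets/Main/trainval.txt and test.txt from the xml files
    found in Annotations. Each line is an image stem without extension."""
    ann_dir = os.path.join(voc_root, "Annotations")
    main_dir = os.path.join(voc_root, "ImageSets", "Main")
    os.makedirs(main_dir, exist_ok=True)
    # One xml per image, so the xml stems enumerate the dataset.
    stems = sorted(f[:-4] for f in os.listdir(ann_dir) if f.endswith(".xml"))
    random.Random(seed).shuffle(stems)
    n_train = int(len(stems) * train_ratio)
    splits = {"trainval.txt": stems[:n_train], "test.txt": stems[n_train:]}
    for name, subset in splits.items():
        with open(os.path.join(main_dir, name), "w") as f:
            f.write("\n".join(subset) + "\n")
    return splits
```

Run it once against data/VOCdevkit/VOC2007 before copying the tree into the container.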

Next comes editing the various parameters. There are plenty of tutorials online for this, e.g.: https://zhuanlan.zhihu.com/p/101263456
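The edits in that tutorial boil down to pointing the config at VOC and fixing the class count. Roughly, for the v1.x-era configs/faster_rcnn_r50_fpn_1x.py used here (the exact field names follow that version, and the class count of 4 is only an illustration):

```python
# In configs/faster_rcnn_r50_fpn_1x.py (mmdetection v1.x style):

# 1) In bbox_head, set num_classes = number of your classes + 1 (background),
#    e.g. for 3 object classes:
#        num_classes=4,

# 2) Switch the dataset section to VOC:
dataset_type = 'VOCDataset'
data_root = 'data/VOCdevkit/'

# 3) Point the train/val/test dicts at the split files, e.g.:
#        ann_file=data_root + 'VOC2007/ImageSets/Main/trainval.txt',
#        img_prefix=data_root + 'VOC2007/',

# 4) Also edit mmdet/datasets/voc.py so that CLASSES lists your own
#    class names instead of the 20 stock VOC categories.
```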

Once configured, start training:

python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2

Then the errors start:

root@137fe107a909:/mmdetection# python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2
2020-03-30 05:16:06,524 - INFO - Distributed training: False
2020-03-30 05:16:06,944 - INFO - load model from: torchvision://resnet50
2020-03-30 05:16:07,152 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2020-03-30 05:16:10,334 - INFO - Start running, host: root@137fe107a909, work_dir: /mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2020-03-30 05:16:10,335 - INFO - workflow: [('train', 1)], max: 12 epochs
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 31) is killed by signal: Bus error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 108, in <module>
    main()
  File "tools/train.py", line 104, in main
    logger=logger)
  File "/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/mmdetection/mmdet/apis/train.py", line 221, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 260, in train
    for i, data_batch in enumerate(data_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
    success, data = self._try_get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 31) exited unexpectedly

For this problem see https://blog.csdn.net/zcgyq/article/details/83085266: the container's shared memory /dev/shm is too small for the DataLoader workers (Docker's default is only 64 MB), so it needs to be enlarged. I exited the container.

Then I launched docker again, this time adding the parameter --shm-size 8G:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/ --shm-size 8G a841d5c9a7e7 /bin/bash
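To confirm the new shm size actually took effect inside the container, you can query the filesystem with nothing but the standard library. A quick sketch (checking /dev/shm against the 8 GB flag above is the intended use):

```python
import shutil

def fs_size_gb(path):
    """Total size, in GiB, of the filesystem containing `path`."""
    return shutil.disk_usage(path).total / 2**30

# Inside the container, this should now report roughly 8 GiB:
# print(fs_size_gb("/dev/shm"))
```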

Once inside, another problem appeared: the changes I had made in docker earlier were all gone. It turned out this was a new container with a different id, so I used

sudo docker ps -a

to find the container that had just exited, then started it and attached to it:

sudo docker start 137fe107a909
sudo docker attach 137fe107a909

At this point I decided I had better commit the container's changes as an image, otherwise future work would be a hassle.

Exit the container first, then run:

sudo docker commit 137fe107a909 zhang:v1.0

Now list the images: sudo docker images

REPOSITORY               TAG                 IMAGE ID            CREATED              SIZE
zhang                    v1.0                a841d5c9a7e7        About a minute ago   8.01GB

Next, start a container from this image:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/ --shm-size 8G a841d5c9a7e7 /bin/bash

Then resume training:

python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2

The rest of the process, including testing the model and computing mAP, can follow this guide: https://zhuanlan.zhihu.com/p/101263456
