Notes on Training mmdetection in Docker

Below is my own training log. Errors appear along the way and are resolved later in the post, so be sure to read to the end before following along.

Docker is already installed on the server, so the next step is to pull an mmdetection image from Docker Hub and train my own data inside a container.

Step 1:

The image I chose:

https://hub.docker.com/r/pengzili/mmdetection

Then pull the image and wait for the download...

docker pull pengzili/mmdetection

Once the download finishes, start a container from the image:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/  a841d5c9a7e7 /bin/bash

Next, prepare the VOC dataset. My approach is to build it in a host directory first and then copy it into the corresponding directory inside the container.

Create the directory structure on the host as shown below. Annotations holds the xml files and JPEGImages holds the jpeg images; note that every image must have exactly one matching xml. The xml files are generated by annotating with labelImg. Then copy the whole tree into the mmdetection directory.

|-- data
|   `-- VOCdevkit
|       `-- VOC2007
|           |-- Annotations
|           |-- ImageSets
|           |   `-- Main
|           `-- JPEGImages
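The split files that the VOC loader expects under ImageSets/Main (trainval.txt, test.txt) can be generated directly from the Annotations folder. A minimal sketch — the 80/20 ratio, file names, and fixed seed are my own choices, adjust as needed:

```python
import os
import random

def make_voc_splits(voc_root, train_ratio=0.8, seed=0):
    """Write ImageSets/Main/trainval.txt and test.txt from the xml files
    found in Annotations. Each line is an image stem without extension."""
    ann_dir = os.path.join(voc_root, "Annotations")
    main_dir = os.path.join(voc_root, "ImageSets", "Main")
    os.makedirs(main_dir, exist_ok=True)
    # One xml per image, so the xml stems enumerate the dataset.
    stems = sorted(f[:-4] for f in os.listdir(ann_dir) if f.endswith(".xml"))
    random.Random(seed).shuffle(stems)
    n_train = int(len(stems) * train_ratio)
    splits = {"trainval.txt": stems[:n_train], "test.txt": stems[n_train:]}
    for name, subset in splits.items():
        with open(os.path.join(main_dir, name), "w") as f:
            f.write("\n".join(subset) + "\n")
    return splits
```

Run it once against data/VOCdevkit/VOC2007 before copying the tree into the container.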

Next comes editing the various parameters. There are plenty of tutorials online for this, e.g.: https://zhuanlan.zhihu.com/p/101263456
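The edits in that tutorial boil down to pointing the config at VOC and fixing the class count. Roughly, for the v1.x-era configs/faster_rcnn_r50_fpn_1x.py used here (the exact field names follow that version, and the class count of 4 is only an illustration):

```python
# In configs/faster_rcnn_r50_fpn_1x.py (mmdetection v1.x style):

# 1) In bbox_head, set num_classes = number of your classes + 1 (background),
#    e.g. for 3 object classes:
#        num_classes=4,

# 2) Switch the dataset section to VOC:
dataset_type = 'VOCDataset'
data_root = 'data/VOCdevkit/'

# 3) Point the train/val/test dicts at the split files, e.g.:
#        ann_file=data_root + 'VOC2007/ImageSets/Main/trainval.txt',
#        img_prefix=data_root + 'VOC2007/',

# 4) Also edit mmdet/datasets/voc.py so that CLASSES lists your own
#    class names instead of the 20 stock VOC categories.
```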

Once configured, start training:

python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2

Then the errors start:

root@137fe107a909:/mmdetection# python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2
2020-03-30 05:16:06,524 - INFO - Distributed training: False
2020-03-30 05:16:06,944 - INFO - load model from: torchvision://resnet50
2020-03-30 05:16:07,152 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2020-03-30 05:16:10,334 - INFO - Start running, host: root@137fe107a909, work_dir: /mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2020-03-30 05:16:10,335 - INFO - workflow: [('train', 1)], max: 12 epochs
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 31) is killed by signal: Bus error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 108, in <module>
    main()
  File "tools/train.py", line 104, in main
    logger=logger)
  File "/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/mmdetection/mmdet/apis/train.py", line 221, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 260, in train
    for i, data_batch in enumerate(data_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 553, in _get_batch
    success, data = self._try_get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 31) exited unexpectedly

For this problem see https://blog.csdn.net/zcgyq/article/details/83085266: the container's shared memory /dev/shm is too small for the DataLoader workers (Docker's default is only 64 MB), so it needs to be enlarged. I exited the container.

Then I launched docker again, this time adding the parameter --shm-size 8G:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/ --shm-size 8G a841d5c9a7e7 /bin/bash
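To confirm the new shm size actually took effect inside the container, you can query the filesystem with nothing but the standard library. A quick sketch (checking /dev/shm against the 8 GB flag above is the intended use):

```python
import shutil

def fs_size_gb(path):
    """Total size, in GiB, of the filesystem containing `path`."""
    return shutil.disk_usage(path).total / 2**30

# Inside the container, this should now report roughly 8 GiB:
# print(fs_size_gb("/dev/shm"))
```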

Once inside, another problem appeared: the changes I had made in docker earlier were all gone. It turned out this was a new container with a different id, so I used

sudo docker ps -a

to find the container that had just exited, then started it and attached to it:

sudo docker start 137fe107a909
sudo docker attach 137fe107a909

At this point I decided I had better commit the container's changes as an image, otherwise future work would be a hassle.

Exit the container first, then run:

sudo docker commit 137fe107a909 zhang:v1.0

Now list the images: sudo docker images

REPOSITORY               TAG                 IMAGE ID            CREATED              SIZE
zhang                    v1.0                a841d5c9a7e7        About a minute ago   8.01GB

Next, start a container from this image:

sudo nvidia-docker run -it -v /home/csswork/zzh/:/mnt/ --shm-size 8G a841d5c9a7e7 /bin/bash

Then resume training:

python tools/train.py  configs/faster_rcnn_r50_fpn_1x.py --gpus 2

The rest of the process, including testing the model and computing mAP, can follow this guide: https://zhuanlan.zhihu.com/p/101263456
