[win 10] Getting started with maskrcnn-benchmark (2): start training

*小呆

Article link: https://blog.csdn.net/qq_39575835/article/details/105277638

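Full series
[win 10] Getting started with maskrcnn-benchmark (1): setting up the environment and an introduction to the COCO dataset
 [win 10] Getting started with maskrcnn-benchmark (2): start training
 [win 10] Getting started with maskrcnn-benchmark (3): Faster R-CNN inference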

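We start with the Faster R-CNN part; once that works, moving on to Mask R-CNN is straightforward. In other words, once you have configured one model in maskrcnn-benchmark and learned its training, inference, and visualization workflow, the others can be set up quickly. That is the advantage of a model zoo.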

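Contents

1. Configuration before training
2. Fixing bugs before training
3. Changing the dataset paths
4. Start training
5. Overview of the code layout
Reference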

1. Configuration before training

The first thing that catches the eye is this sentence:
Most of the configuration files that we provide assume that we are running on 8 GPUs.

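My reaction after reading it: ??? Is everyone's lab really that rich? Tears of poverty.

The official README also has a single-GPU solution, which is to run the stock config unchanged on one card: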
But the drawback is that it will use much more GPU memory. The reason is that we set in the configuration files a global batch size that is divided over the number of GPUs. So if we only have a single GPU, this means that the batch size for that GPU will be 8x larger, which might lead to out-of-memory errors.

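The most annoying part is the line "If you have a lot of memory available, this is the easiest solution." For CVers who do not have that many GPUs, we just have to put up with it and adjust the config instead.

An example (Mask R-CNN R-50 FPN with the 1x schedule):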

python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000

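Note that in the command above the batch size drops from 16 to 2, so the learning rate is divided by 8 while the iteration counts (MAX_ITER and STEPS) are multiplied by 8. I also found a listing of which parameters to use for a given number of GPUs. I have to say, FAIR is really thorough.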

  # Equivalent schedules with...
  # 1 GPU:
  #   BASE_LR: 0.0025
  #   MAX_ITER: 60000
  #   STEPS: [0, 30000, 40000]
  # 2 GPUs:
  #   BASE_LR: 0.005
  #   MAX_ITER: 30000
  #   STEPS: [0, 15000, 20000]
  # 4 GPUs:
  #   BASE_LR: 0.01
  #   MAX_ITER: 15000
  #   STEPS: [0, 7500, 10000]
  # 8 GPUs:
  #   BASE_LR: 0.02
  #   MAX_ITER: 7500
  #   STEPS: [0, 3750, 5000]
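
The scaling in the single-GPU command above can be reproduced with a few lines of Python. This is only a sketch of the linear scaling rule: the function name `scale_schedule` is made up for illustration, and the reference values (batch size 16, BASE_LR 0.02, MAX_ITER 90000, STEPS (60000, 80000)) are assumed to be the 8-GPU defaults of the 1x schedule.

```python
def scale_schedule(ims_per_batch, ref_batch=16, ref_lr=0.02,
                   ref_max_iter=90000, ref_steps=(60000, 80000)):
    """Scale an 8-GPU schedule to a smaller global batch size.

    Hypothetical helper: the learning rate shrinks by the same factor
    the batch size shrinks, and the iteration counts grow by that factor.
    """
    if ims_per_batch <= 0:
        raise ValueError("ims_per_batch must be positive")
    factor = ref_batch / ims_per_batch
    return {
        "SOLVER.IMS_PER_BATCH": ims_per_batch,
        "SOLVER.BASE_LR": ref_lr / factor,
        "SOLVER.MAX_ITER": int(round(ref_max_iter * factor)),
        "SOLVER.STEPS": tuple(int(round(s * factor)) for s in ref_steps),
    }


if __name__ == "__main__":
    # Batch size 2 on one card, as in the command above.
    for key, value in scale_schedule(2).items():
        print(key, value)
```

With `ims_per_batch=2` this gives the same BASE_LR, MAX_ITER and STEPS as the command line shown earlier.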

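Enabling mixed-precision training

For mixed-precision training, see the linked reference. A quick translation of what it does: mixed-precision training can significantly speed up computation by performing operations in half precision, while keeping a minimal amount of data in single precision to retain as much information as possible in the critical parts of the network. Since Tensor Cores were introduced in the Volta and Turing architectures, switching to mixed precision gives a significant training speedup, up to 3x overall on the most arithmetically intense model architectures. In the code this is enabled by setting DTYPE "float16".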

export NGPUS=8
python -m torch.distributed.launch --nproc_per_node=$NGPUS /path_to_maskrcnn_benchmark/tools/train_net.py --config-file "path/to/config/file.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN images_per_gpu x 1000 DTYPE "float16"
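
The placeholder `images_per_gpu x 1000` in the command above has to be replaced by an actual number. As a rough sketch, the command can be assembled in Python like this; `build_launch_command` is a hypothetical helper, and passing `SOLVER.IMS_PER_BATCH` explicitly is my own addition so that the per-GPU image count is visible.

```python
def build_launch_command(ngpus, ims_per_batch, train_script, config_file,
                         dtype="float16"):
    """Assemble the distributed launch command as an argument list.

    Hypothetical helper: FPN_POST_NMS_TOP_N_TRAIN is set to
    images_per_gpu * 1000, as the command template describes.
    """
    if ngpus <= 0 or ims_per_batch % ngpus != 0:
        raise ValueError("ims_per_batch must be a multiple of ngpus")
    images_per_gpu = ims_per_batch // ngpus
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={ngpus}",
        train_script,
        "--config-file", config_file,
        "SOLVER.IMS_PER_BATCH", str(ims_per_batch),
        "MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN", str(images_per_gpu * 1000),
        "DTYPE", dtype,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        ngpus=8,
        ims_per_batch=16,
        train_script="/path_to_maskrcnn_benchmark/tools/train_net.py",
        config_file="path/to/config/file.yaml",
    )
    print(" ".join(cmd))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting differences between Linux and Windows.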

2. Fixing bugs before training

bug0

ImportError: numpy.core.multiarray failed to import

Uninstall the current numpy, then run pip install -U numpy

bug1

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 25: invalid start byte

This looks like a PyTorch bug... and a rather low-level one. You need to edit the PyTorch file E:/anaconda3/envs/mask-rcnn/Lib/site-packages/torch/utils/collect_env.py
See the linked reference.
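
The exact edit from the linked post is not reproduced here. The error message suggests that command output is being decoded as UTF-8 while the Windows console emits bytes in another code page (0xb0 is a typical lead byte in GBK). A minimal sketch of a more tolerant decode step, assuming that diagnosis is right; `decode_output` is a hypothetical helper, not a function from PyTorch:

```python
import locale


def decode_output(raw, fallbacks=None):
    """Decode subprocess output, trying UTF-8 first, then fallback encodings.

    Hypothetical helper. By default the fallbacks are the system's
    preferred encoding and GBK (an assumption for Chinese-language Windows).
    """
    if fallbacks is None:
        fallbacks = (locale.getpreferredencoding(False), "gbk")
    for encoding in ("utf-8", *fallbacks):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_output(b"plain ascii output"))
```

The same idea applies wherever collect_env.py turns raw bytes into text: swap the hard-coded "utf-8" for an encoding that matches your console.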

bug2

AttributeError: module 'matplotlib' has no attribute 'verbose'

This is a PyCharm issue; see the linked reference.

bug3

IndexError: list index out of range

This is a bug in the code itself; I found a perfect solution in the issue tracker.

3. Changing the dataset paths

The layout it expects looks like this:

└── coco
    ├── annotations
    │   ├── instances_train2014.json
    │   └── instances_val2014.json
    ├── test2014
    │   └── cjh_993.png
    ├── train2014
    │   ├── cjh_968.png
    │   ├── cjh_969.png
    │   ├── cjh_976.png
    │   ├── cjh_977.png
    │   └── cjh_984.png
    └── val2014
        ├── cjh_985.png
        └── cjh_992.png

The paths are changed in maskrcnn_benchmark/config/paths_catalog.py.
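
To give an idea of what gets edited there: the catalog maps a dataset name to an image directory and an annotation file. The snippet below is a simplified stand-in written for this post, not the actual contents of paths_catalog.py; `resolve_dataset` and the `datasets` root are assumptions for illustration.

```python
import os

DATA_DIR = "datasets"  # assumed root folder that contains coco/

# Simplified stand-in for the catalog: one entry per dataset name.
DATASETS = {
    "coco_2014_train": {
        "img_dir": "coco/train2014",
        "ann_file": "coco/annotations/instances_train2014.json",
    },
    "coco_2014_val": {
        "img_dir": "coco/val2014",
        "ann_file": "coco/annotations/instances_val2014.json",
    },
}


def resolve_dataset(name, data_dir=DATA_DIR):
    """Return the image directory and annotation file for a dataset name."""
    if name not in DATASETS:
        raise KeyError(f"unknown dataset {name!r}; known: {sorted(DATASETS)}")
    entry = DATASETS[name]
    return {key: os.path.join(data_dir, rel) for key, rel in entry.items()}


if __name__ == "__main__":
    print(resolve_dataset("coco_2014_train"))
```

Whatever names you register, the entries under DATASETS.TRAIN and DATASETS.TEST in the YAML config have to match a key in the catalog.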

4. Start training

[Image in the original post]
Wow, a single 2080Ti takes this long to run, haha.
Finally, here is my config file, configs/e2e_faster_rcnn_R_50_FPN_1x.yaml:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: 'E:\ins_seg\mask_rcnn\maskrcnn-benchmark-master\weight\R-101.pkl'
  BACKBONE:
    CONV_BODY: "R-50-FPN"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
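
The SOLVER values in this file are still the 8-GPU ones, so on a single card they get overridden with KEY VALUE pairs on the command line, as in section 1. The sketch below illustrates the idea of such an override merge with plain dicts; it is not the library's implementation, and `merge_overrides` is a name invented for this example.

```python
import ast
import copy


def merge_overrides(cfg, opts):
    """Apply ["A.B", "value", ...] overrides to a nested dict and return a copy.

    Hypothetical helper: keys must already exist, and values are parsed as
    Python literals when possible, otherwise kept as strings.
    """
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    merged = copy.deepcopy(cfg)
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = merged
        for part in parents:
            node = node[part]  # KeyError here means an unknown section
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        try:
            node[leaf] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            node[leaf] = raw  # e.g. a bare string such as float16
    return merged


if __name__ == "__main__":
    base = {"SOLVER": {"BASE_LR": 0.02, "MAX_ITER": 90000,
                       "STEPS": (60000, 80000)},
            "DTYPE": "float32"}
    opts = ["SOLVER.BASE_LR", "0.0025", "SOLVER.MAX_ITER", "720000",
            "SOLVER.STEPS", "(480000, 640000)", "DTYPE", "float16"]
    print(merge_overrides(base, opts))
```

Rejecting unknown keys is deliberate here: a typo in an override should fail loudly rather than silently train with the default.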

5. Overview of the code layout

Another blog post has a nice outline of the directory structure, so I am pasting it here.

(maskrcnn_benchmark) [zuosi@localhost]$tree -L 3
.
├── configs
│   ├── e2e_faster_rcnn_R_101_FPN_1x.yaml # Faster R-CNN model config used for training and validation
│   ├── e2e_mask_rcnn_R_101_FPN_1x.yaml # Mask R-CNN model config used for training and validation
│   └── quick_schedules
├── CONTRIBUTING.md
├── datasets
│   └── coco
│       ├── annotations
│       │   ├── instances_train2014.json # training-set annotation file
│       │   └── instances_val2014.json # validation-set annotation file
│       ├── train2014  # training-set images
│       └── val2014  # validation-set images
├── maskrcnn_benchmark
│   ├── config
│   │   ├── defaults.py # default maskrcnn_benchmark config; read at startup and merged with the model config file from the configs directory
│   │   ├── __init__.py
│   │   ├── paths_catalog.py # training and test set paths are configured in this file
│   │   └── __pycache__
│   ├── csrc
│   ├── data
│   │   ├── build.py # where the datasets are built
│   │   ├── datasets # coco.py in this directory provides the access interface to the COCO dataset
│   │   └── transforms
│   ├── engine
│   │   ├── inference.py # validation engine
│   │   └── trainer.py # training engine
│   ├── __init__.py
│   ├── layers
│   │   ├── batch_norm.py
│   │   ├── __init__.py
│   │   ├── misc.py
│   │   ├── nms.py
│   │   ├── __pycache__
│   │   ├── roi_align.py
│   │   ├── roi_pool.py
│   │   ├── smooth_l1_loss.py
│   │   └── _utils.py
│   ├── modeling
│   │   ├── backbone
│   │   ├── balanced_positive_negative_sampler.py
│   │   ├── box_coder.py
│   │   ├── detector
│   │   ├── __init__.py
│   │   ├── matcher.py
│   │   ├── poolers.py
│   │   ├── __pycache__
│   │   ├── roi_heads
│   │   ├── rpn
│   │   └── utils.py
│   ├── solver
│   │   ├── build.py
│   │   ├── __init__.py
│   │   ├── lr_scheduler.py # the learning-rate schedule is set here
│   │   └── __pycache__
│   ├── structures
│   │   ├── bounding_box.py
│   │   ├── boxlist_ops.py
│   │   ├── image_list.py
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   └── segmentation_mask.py
│   └── utils
│       ├── c2_model_loading.py
│       ├── checkpoint.py # checkpointing
│       ├── __init__.py
│       ├── logger.py # logging setup
│       ├── model_zoo.py
│       ├── __pycache__
│       └── README.md
├── output # output directory I set up myself
├── tools
│   ├── test_net.py # validation entry point
│   └── train_net.py # training entry point
└── TROUBLESHOOTING.md