Training MEGA on your own dataset (Docker)

1. Environment setup

conda install ipython pip

# mega and coco api dependencies
pip install ninja yacs cython matplotlib tqdm opencv-python scipy


export INSTALL_DIR=$PWD

# install pycocotools
cd $INSTALL_DIR
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
python setup.py build_ext install

# install cityscapesScripts
cd $INSTALL_DIR
git clone https://github.com/mcordts/cityscapesScripts.git
cd cityscapesScripts/
python setup.py build_ext install

# install apex
cd $INSTALL_DIR
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext

# install PyTorch Detection
cd $INSTALL_DIR
git clone https://github.com/Scalsol/mega.pytorch.git
cd mega.pytorch

# the following will install the lib with
# symbolic links, so that you can modify
# the files if you want and won't need to
# re-build it
python setup.py build develop

pip install 'pillow<7.0.0'

unset INSTALL_DIR

# or if you are on macOS
# MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build develop

[Note] If you do not strictly follow the environment the author specifies (CUDA 10.0 + Python 3.7 + PyTorch 1.3.0), the apex installation step may fail with:

Cuda extensions are being compiled with a version of Cuda that does not match the version
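Before building apex it can help to print the versions that the note above says must match. The sketch below is only a convenience check; the helper name describe_env is hypothetical, and the torch import is optional so the script still runs when PyTorch is not installed yet. Compare the reported torch_cuda value by hand with the output of nvcc --version.

```python
import sys


def describe_env():
    """Collect the versions that the note above says must match."""
    info = {"python": "%d.%d" % sys.version_info[:2], "torch": None, "torch_cuda": None}
    try:
        import torch  # optional: fields stay None if PyTorch is not installed
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    except ImportError:
        pass
    return info


print(describe_env())
```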

[The following is a failed attempt]

Reference: "apex installation error: Cuda extensions are being compiled with a version of Cuda that does not match the version", 沃特么.拆基.达柴机's blog, CSDN

Change the apex install command from

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

to

pip install -v --no-cache-dir ./

However, running the code afterwards still fails with:

AttributeError: module 'torch.cuda' has no attribute 'amp'

If instead the apex installation fails with: Error: command '/usr/local/cuda-10.0/bin/nvcc' failed with exit status 1

see: http://www.cxyzjd.com/article/qq_22565865/107532012

2. Preparing your own dataset

See the customize.md file provided by the author.

2.1 Dataset layout

datasets
├── vid_custom
|   |── train
|   |   |── video_snippet_1
|   |   |   |── 000000.JPEG
|   |   |   |── 000001.JPEG
|   |   |   |── 000002.JPEG
|   |   |   ...
|   |   |── video_snippet_2
|   |   |   |── 000000.JPEG
|   |   |   |── 000001.JPEG
|   |   |   |── 000002.JPEG
|   |   |   ...
|   |   ...
|   |── val
|   |   |── video_snippet_1
|   |   |   |── 000000.JPEG
|   |   |   |── 000001.JPEG
|   |   |   |── 000002.JPEG
|   |   |   ...
|   |   |── video_snippet_2
|   |   |   |── 000000.JPEG
|   |   |   |── 000001.JPEG
|   |   |   |── 000002.JPEG
|   |   |   ...
|   |   ...
|   |── annotation
|   |   |── train
|   |   |   |── video_snippet_1
|   |   |   |   |── 000000.xml
|   |   |   |   |── 000001.xml
|   |   |   |   |── 000002.xml
|   |   |   |   ...
|   |   |   |── video_snippet_2
|   |   |   |   |── 000000.xml
|   |   |   |   |── 000001.xml
|   |   |   |   |── 000002.xml
|   |   |   |   ...
|   |   ...
|   |   |── val
|   |   |   |── video_snippet_1
|   |   |   |   |── 000000.xml
|   |   |   |   |── 000001.xml
|   |   |   |   |── 000002.xml
|   |   |   |   ...
|   |   |   |── video_snippet_2
|   |   |   |   |── 000000.xml
|   |   |   |   |── 000001.xml
|   |   |   |   |── 000002.xml
|   |   |   |   ...
|   |   ...

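[Notes]

1. Frame file names are 6-digit numbers starting from 0;

2. The XML files under annotation correspond one-to-one with the image files under train and val.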
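Both notes can be checked mechanically before training. The following is a minimal sketch, assuming the directory layout shown above (the vid_custom root, a split folder, and annotation/split); the function names are hypothetical and not part of MEGA.

```python
from pathlib import Path


def find_unpaired_frames(root, split):
    """Return (frames without XML, XMLs without frame) for one split."""
    root = Path(root)
    img_dir = root / split
    anno_dir = root / "annotation" / split
    # Relative paths without extension, e.g. "video_snippet_1/000000".
    frames = {p.relative_to(img_dir).with_suffix("").as_posix()
              for p in img_dir.glob("*/*.JPEG")}
    annos = {p.relative_to(anno_dir).with_suffix("").as_posix()
             for p in anno_dir.glob("*/*.xml")}
    return sorted(frames - annos), sorted(annos - frames)


def find_badly_named_frames(root, split):
    """Return frame paths whose file stem is not a 6-digit number."""
    img_dir = Path(root) / split
    return sorted(p.relative_to(img_dir).as_posix()
                  for p in img_dir.glob("*/*.JPEG")
                  if not (len(p.stem) == 6 and p.stem.isdigit()))
```

For example, find_unpaired_frames("datasets/vid_custom", "train") should return two empty lists when every frame has its annotation and vice versa.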

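2.2 XML file format

Reference: "Introduction to the ImageNet VID dataset" (ImageNet VID数据集介绍), Zhihu (zhihu.com)

[Notes]

1. The name field in each XML file must be a standard ImageNet VID category ID; each ID corresponds to one category, and the mapping can also be found in the code (see the parameter changes in 2.4);

2. For more detail, see the reference above.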
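To make the role of the name field concrete, here is a small sketch that reads the objects out of one annotation. The sample XML is hypothetical and only assumes the commonly used ImageNet VID fields (object, name, bndbox with xmin/ymin/xmax/ymax); check it against your own files rather than treating it as the full schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one frame, following the usual ImageNet VID fields.
SAMPLE_XML = """<annotation>
  <folder>video_snippet_1</folder>
  <filename>000000</filename>
  <size><width>1280</width><height>720</height></size>
  <object>
    <trackid>0</trackid>
    <name>n02958343</name>
    <bndbox><xmax>600</xmax><xmin>400</xmin><ymax>350</ymax><ymin>200</ymin></bndbox>
  </object>
</annotation>"""


def read_objects(xml_text):
    """Return a list of (name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.find("name").text, coords))
    return objects


print(read_objects(SAMPLE_XML))
# [('n02958343', (400, 200, 600, 350))]
```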

2.3 Preparing the training and test index files

1. VID_train_15frames.txt

train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 10 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 30 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 50 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 70 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 90 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 110 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 130 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 150 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 170 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 190 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 210 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 230 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 250 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 270 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 290 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 1 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 4 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 8 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 11 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 14 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 17 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 20 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 24 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 27 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 30 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 33 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 36 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 40 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 43 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 46 48
...

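[Explanation]

(1) The author's explanation: the four columns of each line are, in order: video folder, no meaning (just ignore it), frame number, video length;

(2) Note that video length can be read as the total number of images in a video folder, and frame number can be chosen freely, as long as 15 frames are selected as key frames within each video folder.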
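The sample lines above happen to be consistent with picking frame floor((2i + 1) * length / 30) for i = 0..14, which spreads 15 key frames evenly over the video. That formula is an observation from the sample, not a rule stated by the author, and the function names below are hypothetical. A minimal sketch for generating the lines of one video:

```python
def key_frames(video_length, num_keys=15):
    """Evenly spaced frame numbers: floor((2*i + 1) * length / (2 * num_keys))."""
    return [((2 * i + 1) * video_length) // (2 * num_keys) for i in range(num_keys)]


def train_index_lines(video_folder, video_length, num_keys=15):
    """Format lines as: video folder, ignored column, frame number, video length."""
    return [f"{video_folder} 1 {frame} {video_length}"
            for frame in key_frames(video_length, num_keys)]


for line in train_index_lines("train/video_snippet_1", 48)[:3]:
    print(line)
```

For videos shorter than 15 frames this spacing repeats some frame numbers, so such videos may need separate handling.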

2. VID_val_frames.txt

val/ILSVRC2015_val_00000000/000000 1
val/ILSVRC2015_val_00000000/000001 2
val/ILSVRC2015_val_00000000/000002 3
val/ILSVRC2015_val_00000000/000003 4
val/ILSVRC2015_val_00000000/000004 5
val/ILSVRC2015_val_00000000/000005 6

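However, using this file during training raises AttributeError: 'VIDMEGADataset' object has no attribute 'frame_seg_len', so the file described next was used instead.

3. VID_val_videos.txt (fairly simple, so not described further)

For all three files, refer to the examples shipped with the original MEGA code in datasets\ILSVRC2015\ImageSets.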

2.4 Parameter changes

1. In mega_core/data/datasets/vid.py, modify classes and classes_map in VIDDataset:

classes = ['__background__',  # always index 0
           'car']

classes_map = ['__background__',  # always index 0
               'n02958343']
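The two lists have to stay aligned index by index: classes holds the readable names and classes_map the names written in the XML files. The sketch below illustrates that relationship and where the class count of 2 comes from; the lookup table name_to_index is a hypothetical helper, not a quote of the MEGA source.

```python
classes = ['__background__', 'car']            # readable names
classes_map = ['__background__', 'n02958343']  # names used in the XML files

assert len(classes) == len(classes_map), "the two lists must line up index by index"

# Hypothetical lookup table from XML name to label index.
name_to_index = {name: i for i, name in enumerate(classes_map)}
num_classes = len(classes)  # object classes + background

print(name_to_index["n02958343"], classes[name_to_index["n02958343"]], num_classes)
# 1 car 2
```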

2. mega_core/config/paths_catalog.py

(1) Modify DatasetCatalog.DATASETS

"vid_custom_train": {
    "img_dir": "vid_custom/train",
    "anno_path": "vid_custom/annotation",
    "img_index": "vid_custom/VID_train_15frames.txt"
},
"vid_custom_val": {
    "img_dir": "vid_custom/val",
    "anno_path": "vid_custom/annotation",
    # VID_val_videos.txt is used because VID_val_frames.txt raised the frame_seg_len error noted in 2.3
    "img_index": "vid_custom/VID_val_videos.txt"
}

(2) Modify the if statement that checks the dataset name, adding the "vid" condition

if ("DET" in name) or ("VID" in name) or ("vid" in name):
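The extra condition is needed because Python substring tests are case-sensitive, so a lowercase dataset name such as vid_custom_train does not match "VID". A tiny illustration (both function names are hypothetical):

```python
def original_condition(name):
    """The check before the change: only upper-case markers."""
    return ("DET" in name) or ("VID" in name)


def modified_condition(name):
    """The check after adding the lower-case "vid" test."""
    return ("DET" in name) or ("VID" in name) or ("vid" in name)


print(original_condition("vid_custom_train"), modified_condition("vid_custom_train"))
# False True
```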

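3. Modify configs/BASE_RCNN_4gpu.yaml (which file depends on how many GPUs you train with)

NUM_CLASSES: 2  # number of object classes + background

DATASETS:
  TRAIN: ("vid_custom_train",)  # remember the trailing ","
  TEST: ("vid_custom_val",)  # remember the trailing ","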
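The trailing comma matters because, in Python syntax, parentheses around a single string are just grouping; only the comma makes a tuple. The sketch below assumes the config loader evaluates the value as a Python literal, which is consistent with the "tuple vs. str" type mismatch listed in the error summary.

```python
import ast

with_comma = ast.literal_eval('("vid_custom_train",)')
without_comma = ast.literal_eval('("vid_custom_train")')

print(type(with_comma).__name__, type(without_comma).__name__)
# tuple str
```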

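3. Training and testing commands

1. Training command (the training script automatically runs testing once training finishes; comment out the test code if you do not want that)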

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    tools/train_net.py \
    --master_port=$((RANDOM + 10000)) \
    --config-file configs/MEGA/vid_R_101_C4_MEGA_1x.yaml \
    OUTPUT_DIR training_dir/MEGA_R_101_1x

2. Testing command

python -m torch.distributed.launch \
    --nproc_per_node 4 \
    tools/test_net.py \
    --config-file configs/MEGA/vid_R_101_C4_MEGA_1x.yaml \
    MODEL.WEIGHT training_dir/MEGA_R_101_1x/model_final.pth

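[Note] The training and testing commands above differ slightly from the author's: both drop the --motion-specific flag.

Keeping the flag raises IndexError: list index out of range (full traceback in 4 (8), where the failure is at the motion_iou[gt_index] lookup in vid_eval.py). My guess is that with this flag the code loads something like a test model that does not fit a custom dataset, so I removed it. (This is only my rough guess, for reference; if anyone knows the actual cause, please let me know.)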

4. Error summary

(1) FileNotFoundError: [Errno 2] No such file or directory: 'configs/BASE_RCNN_3gpu.yaml'

Fix: the author only provides training config files for 1 or 4 GPUs; set --nproc_per_node in the training command to 1 or 4.

(2) FileNotFoundError: [Errno 2] No such file or directory: 'datasets/ILSVRC2015/Annotations/DET/train/ILSVRC2014_train_0000/ILSVRC2014_train_00000663.xml'

Fix: check whether your dataset paths are wrong.

(3) ValueError: Type mismatch (<class 'tuple'> vs. <class 'str'>) with values (() vs. vid_custom_train) for config key: DATASETS.TRAIN

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/ILSVRC2015/ImageSets/DET_train_30classes.txt'

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/ILSVRC2015/ImageSets/VID_train_15frames.txt'

Fix: update the dataset entries in both yml files, and make sure each of the two values ends with a comma inside the parentheses:

TRAIN: ("vid_custom_train",)
TEST: ("vid_custom_val",)

(4) RuntimeError: copy_if failed to synchronize: device-side assert triggered

Fix: num_classes is set incorrectly. The count should be 1 + 1, because the background class must be included.

(5) xml.etree.ElementTree.ParseError: mismatched tag: line 197, column 14

Fix: check your XML files; the broken one is usually the first or last file in a folder.
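Rather than opening files by hand, the broken annotations can be located by trying to parse each one. A minimal sketch, assuming all annotations live under one root directory; find_broken_xml is a hypothetical helper name.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_broken_xml(anno_root):
    """Return (path, error message) for every XML file that fails to parse."""
    broken = []
    for path in sorted(Path(anno_root).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            broken.append((path.as_posix(), str(err)))
    return broken
```

Calling find_broken_xml("datasets/vid_custom/annotation") lists each offending file together with the parser's line and column, which matches the position reported in the error above.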

(6) RuntimeError: Error(s) in loading state_dict for GeneralizedRCNN

Fix: training used 4 GPUs but testing used 1 GPU, so either modify the test code (as below) or modify BASE_RCNN_1gpu.yaml.

args = parser.parse_args()
cfg.merge_from_file("configs/BASE_RCNN_4gpu.yaml")  # switched to 4gpu, because the 1gpu config file was not modified
cfg.merge_from_file(args.config)
cfg.merge_from_list(["MODEL.WEIGHT", args.checkpoint])

(7) AttributeError: 'VIDMEGADataset' object has no attribute 'frame_seg_len'

Fix: as described in 2.3, switch to the other validation index file.

(8) Error when using the --motion-specific flag

  for i in range(len(motion_ious['motion_iou']))])
Evaluating motion iou range 0.0 - 1.0
Traceback (most recent call last):
  File "tools/train_net.py", line 226, in <module>
    main()
  File "tools/train_net.py", line 222, in main
    run_test(cfg, model, args.distributed, args.motion_specific)
  File "tools/train_net.py", line 137, in run_test
    output_folder=output_folder,
  File "/workspace/mega2/mega.pytorch/mega_core/engine/inference.py", line 132, in inference
    **extra_args)
  File "/workspace/mega2/mega.pytorch/mega_core/data/datasets/evaluation/__init__.py", line 29, in evaluate
    return vid_evaluation(**args)
  File "/workspace/mega2/mega.pytorch/mega_core/data/datasets/evaluation/vid/__init__.py", line 15, in vid_evaluation
    logger=logger,
  File "/workspace/mega2/mega.pytorch/mega_core/data/datasets/evaluation/vid/vid_eval.py", line 51, in do_vid_evaluation
    use_07_metric=False
  File "/workspace/mega2/mega.pytorch/mega_core/data/datasets/evaluation/vid/vid_eval.py", line 149, in eval_detection_vid
    motion_range=motion_range,
  File "/workspace/mega2/mega.pytorch/mega_core/data/datasets/evaluation/vid/vid_eval.py", line 180, in calc_detection_vid_prec_rec
    if motion_iou[gt_index] < motion_range[0] or motion_iou[gt_index] > motion_range[1]:
IndexError: list index out of range

(9) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tools/train_net.py', '--local_rank=3', '--master_port=36874', '--config-file', 'configs/MEGA/vid_R_101_C4_MEGA_1x.yaml', '--motion-specific', 'OUTPUT_DIR', 'training_dir/MEGA_R_101_1x']' returned non-zero exit status 1.

Not resolved directly. This message is the launcher reporting that a worker process exited with an error, so it no longer appears once everything else works.

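[References]

"2021-06-30 MEGA paper reading + code reproduction", qq_42068904's blog, CSDN

"MEGA setup and training on a single GPU", 莫愁-前路's blog, CSDN

"MEGA environment setup: countless pitfalls, recorded", 没拿过奖学金的废物's blog, CSDN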