MMSegmentation Series: Model Training and Inference (Part 2)

qq_41627642

Last modified: 2023-03-26 20:09:39

First published: 2022-06-01 17:03:14

Article link: https://blog.csdn.net/qq_41627642/article/details/125011752

1、Model training

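MMSegmentation supports both distributed and non-distributed training, using MMDistributedDataParallel and MMDataParallel respectively. All outputs (log files and checkpoints) are saved to the working directory specified by work_dir in the config file.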

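By default, the model is evaluated on the validation set after a fixed number of iterations. You can change the evaluation interval by adding the interval argument to the training config.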

evaluation = dict(interval=4000)  # Evaluate the model every 4000 iterations.

Important: the default learning rate in the config files is for 4 GPUs and 2 img/GPU (batch size = 4x2 = 8). Equivalently, you can use 8 GPUs and 1 img/GPU, since all models use cross-GPU SyncBN.
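The batch-size arithmetic in the note above can be checked with a short Python sketch. The helper names `total_batch_size` and `scaled_lr` are hypothetical, and the linear learning-rate scaling is a common heuristic assumed here for illustration; the note itself only states that 4x2 and 8x1 are equivalent.

```python
# Reference setup stated in the note: 4 GPUs x 2 images per GPU.
REFERENCE_BATCH_SIZE = 4 * 2


def total_batch_size(num_gpus: int, imgs_per_gpu: int) -> int:
    """Effective batch size of one optimizer step across all GPUs."""
    return num_gpus * imgs_per_gpu


def scaled_lr(base_lr: float, num_gpus: int, imgs_per_gpu: int) -> float:
    """Linear scaling heuristic (an assumption, not something the configs do
    automatically): scale the learning rate with the batch-size ratio."""
    return base_lr * total_batch_size(num_gpus, imgs_per_gpu) / REFERENCE_BATCH_SIZE


if __name__ == "__main__":
    print(total_batch_size(4, 2))  # 8
    print(total_batch_size(8, 1))  # 8, so the default learning rate still applies
    # Hypothetical base learning rate, doubled batch size.
    print(scaled_lr(0.01, 8, 2))
```

If the total batch size differs from 8, the learning rate in the config probably needs adjusting by hand; the heuristic above is only one starting point.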

To trade speed for lower GPU memory use, you can pass --cfg-options model.backbone.with_cp=True to enable checkpointing in the backbone.
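To make the effect of `--cfg-options model.backbone.with_cp=True` concrete, here is a minimal sketch of how a dotted key can be merged into a nested config dictionary. The function `apply_override` and the trimmed sample config are hypothetical illustrations; the real merging happens inside the config machinery used by the training script, which this article does not show.

```python
def apply_override(cfg: dict, dotted_key: str, value) -> dict:
    """Set cfg[a][b][c] = value for dotted_key 'a.b.c', creating levels as needed."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return cfg


if __name__ == "__main__":
    # Hypothetical, heavily trimmed config for illustration only.
    cfg = {"model": {"backbone": {"depth": 50}}}
    apply_override(cfg, "model.backbone.with_cp", True)
    print(cfg["model"]["backbone"])  # {'depth': 50, 'with_cp': True}
```

The point of the dotted syntax is that a single nested field can be changed from the command line without editing the config file.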

1、Train on a single machine

1、Train with a single GPU

Official support:
sh tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
Experimental support (converts SyncBN to BN):
python tools/train.py ${CONFIG_FILE} [optional arguments]

If you want to specify the working directory in the command, add the argument --work-dir ${YOUR_WORK_DIR}.

2、Train with CPU

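If the machine has no GPU, training on CPU works the same way as single-GPU training. If it has GPUs but you do not want to use them, disable the GPUs before starting training.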

export CUDA_VISIBLE_DEVICES=-1
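The same effect can be obtained from inside a Python script by setting the variable before any CUDA-using library is imported. This is a sketch that assumes PyTorch is the framework in use; the `torch` check is skipped when PyTorch is not installed.

```python
import os

# Must be set before torch (or any other CUDA library) is imported,
# otherwise the process may already have enumerated the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed; nothing further to check.

if torch is not None:
    # With all devices hidden, CUDA is expected to be reported as unavailable.
    print("CUDA available:", torch.cuda.is_available())
```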

3、Train with multiple GPUs

sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

Optional arguments:
--no-validate (not suggested): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use --no-validate.
--work-dir ${WORK_DIR}: Override the working directory specified in the config file.
--resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file (to continue the training process).
--load-from ${CHECKPOINT_FILE}: Load weights from a checkpoint file (to start fine-tuning for another task).
--deterministic: Switch on "deterministic" mode, which slows down training but makes the results reproducible.

Difference between resume-from and load-from:

resume-from loads both the model weights and optimizer state including the iteration number.

load-from loads only the model weights and starts the training from iteration 0.
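The difference can be sketched with plain dictionaries. This is a conceptual illustration only: the functions `resume_from` and `load_from` and the simplified checkpoint layout (`state_dict`, `optimizer`, `meta["iter"]`) are assumptions made for the example, not the training runner's actual code.

```python
def resume_from(checkpoint: dict) -> dict:
    """Continue an interrupted run: restore weights, optimizer state and progress."""
    return {
        "weights": checkpoint["state_dict"],
        "optimizer": checkpoint["optimizer"],
        "iter": checkpoint["meta"]["iter"],
    }


def load_from(checkpoint: dict) -> dict:
    """Start a new run (e.g. fine-tuning): reuse only the weights."""
    return {
        "weights": checkpoint["state_dict"],
        "optimizer": None,  # a fresh optimizer would be built from the config
        "iter": 0,
    }


if __name__ == "__main__":
    # Hypothetical checkpoint contents.
    ckpt = {
        "state_dict": {"conv.weight": [0.1, 0.2]},
        "optimizer": {"momentum_buffer": [0.0, 0.0]},
        "meta": {"iter": 40000},
    }
    print(resume_from(ckpt)["iter"])  # 40000
    print(load_from(ckpt)["iter"])    # 0
```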

An example:
# Checkpoints and logs are saved in WORK_DIR=work_dirs/pspnet_r50-d8_512x512_80k_ade20k/
# If the work dir is not set, it is generated automatically.

sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 --work-dir work_dirs/pspnet_r50-d8_512x512_80k_ade20k/ --deterministic

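Note: during training, checkpoints and logs are saved under work_dirs/ in the same folder structure as the config file. A custom working directory is not recommended, because the evaluation scripts infer the working directory from the config file name. If you want to save your weights somewhere else, use a symbolic link, for example: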

ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs

4、Launch multiple jobs on a single machine

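If you launch multiple jobs on a single machine, for example two 4-GPU training jobs on a machine with 8 GPUs, you need to specify a different port (29500 by default) for each job to avoid communication conflicts. Otherwise, an error is reported.
If you use dist_train.sh to launch training jobs, you can set the port in the command with the environment variable PORT.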

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
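Before picking ports for several jobs, it can help to check which ones are actually free. The following is a small standard-library sketch; the helper name `port_is_free` is hypothetical, and the check only reflects the moment it runs, so another process could still grab the port afterwards.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # The two ports used in the commands above.
    for port in (29500, 29501):
        print(port, "free" if port_is_free(port) else "in use")
```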

5、Train with multiple machines

If you launch training on multiple machines that are connected only by Ethernet, you can simply run the following commands:
On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
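The two commands differ only in NODE_RANK, which a short sketch makes explicit. The helper `node_commands` is hypothetical, and the address, port, config path and GPU count in the usage example are placeholder values, not recommendations.

```python
def node_commands(nnodes, master_addr, master_port, config, gpus_per_node):
    """Build one launch command per node; only NODE_RANK differs between them."""
    return [
        f"NNODES={nnodes} NODE_RANK={rank} PORT={master_port} "
        f"MASTER_ADDR={master_addr} sh tools/dist_train.sh {config} {gpus_per_node}"
        for rank in range(nnodes)
    ]


if __name__ == "__main__":
    # Placeholder values for illustration.
    for cmd in node_commands(2, "10.0.0.1", 29500, "my_config.py", 8):
        print(cmd)
```

Every node must use the same MASTER_ADDR and PORT, and each node runs the command carrying its own rank.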

6、Manage training jobs with Slurm

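Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. Both single-node and multi-node training are supported.
Train with multiple machines: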

[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}

Here is an example of using 16 GPUs to train PSPNet on the dev partition.

GPUS=16 sh tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dirs/pspnet_r50-d8_512x1024_40k_cityscapes/

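You can set different communication ports without modifying the config file, but you need to use --cfg-options to override the default port in the config file.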

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py --work-dir tmp_work_dir_1 --cfg-options dist_params.port=29500
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py --work-dir tmp_work_dir_2 --cfg-options dist_params.port=29501

Alternatively, you can set the port in the command with the environment variable MASTER_PORT:

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py --work-dir tmp_work_dir_1
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py --work-dir tmp_work_dir_2

2、Inference with pretrained models

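We provide testing scripts to evaluate a whole dataset (Cityscapes, PASCAL VOC, ADE20k, etc.), as well as some high-level APIs for easier integration with other projects.
Test a dataset
Single GPU
CPU
Single node, multiple GPUs
Multiple nodes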

You can use the following commands to test a dataset.

# single-gpu testing
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]

# CPU: If GPU unavailable, directly running single-gpu testing command above
# CPU: If GPU available, disable GPUs and run single-gpu testing script
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]

# multi-gpu testing
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}]

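Optional arguments:

RESULT_FILE: File name of the output results in pickle format. If not specified, the results are not saved to a file. (After mmseg v0.17, the output results become pre-evaluation results or format result paths.)

EVAL_METRICS: Items to be evaluated on the results. Allowed values depend on the dataset; for example, mIoU is available for all datasets. Cityscapes can be evaluated with cityscapes as well as the standard mIoU metric.

--show: If specified, segmentation results are plotted on the images and shown in a new window. It is only applicable to single-GPU testing and is used for debugging and visualization. Make sure a GUI is available in your environment; otherwise you may encounter an error like cannot connect to X server.

--show-dir: If specified, segmentation results are plotted on the images and saved to the specified directory. It is only applicable to single-GPU testing and is used for debugging and visualization. You do not need a GUI in your environment to use this option.

--eval-options: Optional parameters for dataset.format_results and dataset.evaluate during evaluation. When efficient_test=True, intermediate results are saved to local files to save CPU memory. Make sure you have enough local storage space (more than 20GB). (The efficient_test argument has no effect after mmseg v0.17; a progressive mode is used to evaluate and format results, which greatly reduces memory cost and evaluation time.)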
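The memory saving of the progressive mode can be illustrated with a minimal sketch: instead of keeping every prediction map until the end, only per-class intersection and union counts are accumulated image by image. This is a simplified pure-Python illustration with hypothetical names (`ProgressiveIoU`, `update`, `mean_iou`) that works on flat lists of labels and ignores details such as an ignore index; it is not MMSegmentation's actual implementation.

```python
class ProgressiveIoU:
    """Accumulate per-class intersection/union counts one image at a time."""

    def __init__(self, num_classes: int):
        self.intersect = [0] * num_classes
        self.union = [0] * num_classes

    def update(self, pred, label):
        """pred and label are equally long sequences of class indices (one per pixel)."""
        for p, g in zip(pred, label):
            if p == g:
                self.intersect[p] += 1
                self.union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                self.union[p] += 1
                self.union[g] += 1

    def mean_iou(self) -> float:
        """Unweighted mean of per-class IoU over classes that appear at all."""
        ious = [i / u for i, u in zip(self.intersect, self.union) if u > 0]
        return sum(ious) / len(ious)


if __name__ == "__main__":
    metric = ProgressiveIoU(num_classes=2)
    metric.update(pred=[0, 0], label=[0, 1])  # first "image"
    metric.update(pred=[1, 1], label=[1, 1])  # second "image"
    print(metric.intersect, metric.union)     # [1, 2] [2, 3]
    print(metric.mean_iou())
```

Only two small counters per class survive between images, which is why memory use no longer grows with the size of the dataset.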

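Examples:

Assume that you have already downloaded the checkpoints to the directory checkpoints/.

1、Test PSPNet and visualize the results. Press any key to see the next image.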

python tools/test.py configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth  --show

2、Test PSPNet and save the painted images for later visualization.

python tools/test.py configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth --show-dir psp_r50_512x1024_40ki_cityscapes_results

3、Test PSPNet on PASCAL VOC (without saving the test results) and evaluate the mIoU.

python tools/test.py configs/pspnet/pspnet_r50-d8_512x1024_20k_voc12aug.py \
    checkpoints/pspnet_r50-d8_512x1024_20k_voc12aug_20200605_003338-c57ef100.pth \
    --eval mIoU

4、Test PSPNet with 4 GPUs, and evaluate the standard mIoU and the cityscapes metric.

./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py   checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth   4 --out results.pkl --eval mIoU cityscapes

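Note: there is a small gap (~0.1%) between the Cityscapes mIoU and our mIoU. The reason is that, by default, Cityscapes averages over classes weighted by class size. We use the simple version without this averaging for all datasets.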
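One way to read this remark is as the difference between a plain mean over per-class IoUs and a mean weighted by class size. The sketch below uses made-up per-class IoUs and class sizes that deliberately exaggerate the effect; the helper names are hypothetical, and this is not the Cityscapes evaluation script.

```python
def simple_mean(ious):
    """Plain average: every class counts equally."""
    return sum(ious) / len(ious)


def size_weighted_mean(ious, class_sizes):
    """Average in which each class is weighted by its size (e.g. pixel count)."""
    total = sum(class_sizes)
    return sum(iou * size for iou, size in zip(ious, class_sizes)) / total


if __name__ == "__main__":
    # Hypothetical per-class IoUs and relative class sizes.
    ious = [0.75, 0.5, 0.25]
    sizes = [6, 1, 1]
    print(simple_mean(ious))                # 0.5
    print(size_weighted_mean(ious, sizes))  # 0.65625
```

Large classes dominate the weighted version, so the two numbers agree only when class sizes are equal or all classes score the same.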

5、Test PSPNet on the Cityscapes test split with 4 GPUs, and generate the png files to be submitted to the official evaluation server.

First, add the following to the config file configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py:

data = dict(
    test=dict(
        img_dir='leftImg8bit/test',
        ann_dir='gtFine/test'))

Then run the test.

./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth  4 --format-only --eval-options "imgfile_prefix=./pspnet_test_results"

You will get png files under the ./pspnet_test_results directory. You can run zip -r results.zip pspnet_test_results/ and submit the zip file to the evaluation server.

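6、CPU memory efficient test of DeeplabV3+ on Cityscapes (without saving the test results), evaluating the mIoU.

python tools/test.py \
    configs/deeplabv3plus/deeplabv3plus_r18-d8_512x1024_80k_cityscapes.py \
    deeplabv3plus_r18-d8_512x1024_80k_cityscapes_20201226_080942-cff257fe.pth \
    --eval-options efficient_test=True \
    --eval mIoU

Measured with pmap, the test used 2.25GB of CPU memory with efficient_test=True and 11.06GB with efficient_test=False, so this optional parameter can save a lot of memory. (After mmseg v0.17, efficient_test no longer has any effect; the progressive mode is used by default to evaluate and format results efficiently.)

7、Test PSPNet on the LoveDA test split with 1 GPU, and generate the png files to be submitted to the official evaluation server. First, add the following to the config file configs/pspnet/pspnet_r50-d8_512x512_80k_loveda.py: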

data = dict(
    test=dict(
        img_dir='img_dir/test',
        ann_dir='ann_dir/test'))

Then run the test.

python ./tools/test.py configs/pspnet/pspnet_r50-d8_512x512_80k_loveda.py  checkpoints/pspnet_r50-d8_512x512_80k_loveda_20211104_155728-88610f9f.pth  --format-only --eval-options "imgfile_prefix=./pspnet_test_results"

You will get png files under the ./pspnet_test_results directory. You can run zip -r -j Results.zip pspnet_test_results/ and submit the zip file to the evaluation server.
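Where the zip command-line tool is not available, a similar flat archive can be produced with Python's standard library. This is a sketch with a hypothetical helper `zip_flat`; unlike zip -r -j it only picks up .png files, and files with the same name in different subfolders would collide in the archive.

```python
import zipfile
from pathlib import Path


def zip_flat(src_dir: str, zip_path: str) -> int:
    """Zip every .png under src_dir without directory prefixes; return the file count."""
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for png in sorted(Path(src_dir).rglob("*.png")):
            # arcname drops the folder part, similar to the -j flag of zip.
            zf.write(png, arcname=png.name)
            count += 1
    return count


if __name__ == "__main__":
    n = zip_flat("pspnet_test_results", "Results.zip")
    print(f"archived {n} png files")
```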
