Horizon Sunrise X3: DeepLabv3+ segmentation model training workflow (notes, 2024-04-02)

sudo_Ene

Last modified: 2024-04-08 10:14:44

First published: 2024-04-02 15:56:03

Original post: https://blog.csdn.net/weixin_61044335/article/details/137265003

Official documentation

Horizon X3 development materials, version 2.6.2b

Sunrise X3 Pi user manual: https://developer.horizon.ai/api/v1/fileData/documents_pi/Quick_Start/Quick_Start.html

Horizon X3J3 algorithm toolchain: https://developer.horizon.cc/api/v1/fileData/horizon_xj3_open_explorer_cn_doc/oe_mapper/source/advanced_content.html

Procedure

1. Packing the dataset

step1: On the host, prepare two directories

- Horizon OE root directory:

/home1/lixinyi/Downloads/horizon_xj3_open_explorer_v2.6.2b-py38_20230606

- cityscapes dataset root directory:

/home1/lixinyi/Downloads/dataset/cityscapes

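step2: From the Horizon OE root directory, start Docker and mount the OE directory and the dataset directory

./run_docker_lxy.sh /home1/lixinyi/Downloads/dataset/cityscapes/

This drops you into the Docker environment.

Note: if you train on a secondary machine and Docker complains about insufficient permissions, log in to the host first and set up the Docker permissions there: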

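1. Check the members of the docker user group

grep docker /etc/group

2. If the output is empty, create a docker user group. If it prints the group and its members, skip this step.

sudo groupadd docker

3. Add the user to the docker group (replace username with your own account)

sudo usermod -aG docker username

4. To make the change take effect, close the terminal and open a new one. After that Docker works normally. If the group still does not show up, log out and back in (or run newgrp docker).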
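The decision in steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration: group_members and next_step are hypothetical helper names, and the check only looks at the member list in /etc/group-style text (it does not see a user's primary group).

```python
def group_members(group_text, group):
    """Return the members listed for `group` in /etc/group-style text.

    Each line has the form name:password:gid:member1,member2.
    Returns None when the group does not exist.
    """
    for line in group_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 4 and fields[0] == group:
            return [member for member in fields[3].split(",") if member]
    return None


def next_step(group_text, user, group="docker"):
    """Return the command from the steps above that is still needed."""
    members = group_members(group_text, group)
    if members is None:
        # Step 2: the group does not exist yet.
        return "sudo groupadd " + group
    if user not in members:
        # Step 3: the group exists but the user is not in it.
        return "sudo usermod -aG {} {}".format(group, user)
    return "nothing to do"
```

On a real machine the text would come from the file itself, for example next_step(open("/etc/group").read(), "username").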

step3: Location of cityscapes_packer.py

Inside the Docker environment the dataset root directory becomes /data/horizon_x3/data. After some searching, cityscapes_packer.py turned out to be at /open_explorer/ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/datasets/cityscapes_packer.py

step4: Pack the training set

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/datasets/cityscapes_packer.py --src-data-dir /data/horizon_x3/data --split-name train --pack-type lmdb

step5: Pack the validation set

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/datasets/cityscapes_packer.py --src-data-dir /data/horizon_x3/data --split-name val --pack-type lmdb
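The train and val packing commands differ only in --split-name, so they can be generated in one loop. A minimal sketch, assuming the script path and flags shown in this post and a working directory of /open_explorer; packer_commands is a hypothetical helper, not part of the toolchain.

```python
import shlex

# Path of the packer script relative to /open_explorer, as used in this post.
PACKER = ("./ddk/samples/ai_toolchain/horizon_model_train_sample/"
          "scripts/tools/datasets/cityscapes_packer.py")


def packer_commands(src_data_dir, splits=("train", "val"), pack_type="lmdb"):
    """Build one cityscapes_packer.py argument list per split."""
    return [
        ["python3", PACKER,
         "--src-data-dir", src_data_dir,
         "--split-name", split,
         "--pack-type", pack_type]
        for split in splits
    ]


# Print the commands instead of running them, so they can be checked first.
for cmd in packer_commands("/data/horizon_x3/data"):
    print(shlex.join(cmd))
```

To actually run them, each list can be handed to subprocess.run(cmd, check=True) from inside the container.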

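step6: Leave the Docker environment

exit

step7: Get data.mdb and lock.mdb

After leaving Docker, the packed output data.mdb and lock.mdb can be found under the cityscapes dataset root directory on the host.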
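A quick way to confirm that packing finished is to check that both files exist. A small sketch: missing_lmdb_files is a hypothetical helper, and the directory you pass in is whichever folder the packer wrote to on your machine.

```python
import os

# The two files an LMDB environment directory is expected to contain.
REQUIRED_FILES = ("data.mdb", "lock.mdb")


def missing_lmdb_files(lmdb_dir):
    """Return the names from REQUIRED_FILES that are absent from lmdb_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(lmdb_dir, name))]
```

An empty list means both files are there; anything else names what is missing.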

2. Model training

Reference: https://developer.horizon.cc/api/v1/fileData/horizon_xj3_open_explorer_cn_doc/oe_mapper/source/advanced_content.html

step1: In the deeplabv3plus.py config file, set the dataset path (data_rootdir) to the location of the dataset's LMDB files

vi /home1/lixinyi/Downloads/horizon_xj3_open_explorer_v2.6.2b-py38_20230606/ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py

Change the data_rootdir variable to /data/horizon_x3/data
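If you prefer not to edit the file by hand, the same change can be scripted. A minimal sketch, assuming the config contains a plain top-level assignment of the form data_rootdir = ...; set_data_rootdir is a hypothetical helper and only rewrites the first such line.

```python
import re

# Matches a line such as:  data_rootdir = "some/path"
_ASSIGNMENT = re.compile(r"^([ \t]*data_rootdir[ \t]*=[ \t]*).*$", re.MULTILINE)


def set_data_rootdir(config_text, new_path):
    """Return config_text with the first data_rootdir assignment pointing at new_path."""
    new_text, count = _ASSIGNMENT.subn(
        lambda match: match.group(1) + repr(new_path), config_text, count=1)
    if count == 0:
        raise ValueError("no data_rootdir assignment found")
    return new_text
```

Read the config file into a string, call set_data_rootdir(text, "/data/horizon_x3/data"), and write the result back. Keeping a backup of the original file is a good idea.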

step2: Measure the network's computation cost and parameter count

As root:

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/calops.py --config ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py --input-shape "1,3,1024,2048"

step3: Start training

(Note: the command as written in the manual, python3 tools/train.py --step float --config configs/segmentation/deeplabv3plus_efficientnetm0.py, does not run. The argument is wrong and no GPU ID is specified.)

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py --stage float --device-ids 0,1 --config ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py

Training starts.

Ugh, a huge pile of errors. In short, the CUDA device ordinal is wrong:

2024-04-02 18:55:11,221 INFO [logger.py:148] Node[0] ==================================================BEGIN FLOAT STAGE==================================================
2024-04-02 18:55:11,280 INFO [logger.py:148] Node[0] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
/usr/local/lib/python3.8/dist-packages/horizon_plugin_pytorch/nn/interpolate.py:236: UserWarning: default upsampling behavior when mode=bilinear is changed to align_corners=False since torch 0.4.0. Please specify align_corners=True if the old behavior is desired.
warnings.warn(
ERROR:__main__:train failed!

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
File "/usr/local/lib/python3.8/dist-packages/hat/engine/ddp_trainer.py", line 448, in _main_func
    torch.cuda.set_device(local_rank % num_devices)
File "/root/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Traceback (most recent call last):
File "./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 287, in <module>
    raise e
File "./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 273, in <module>
    train(
File "./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 254, in train
    launch(
File "/usr/local/lib/python3.8/dist-packages/hat/engine/ddp_trainer.py", line 377, in launch
    mp.spawn(
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

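How I fixed it:

Check the GPU devices and device numbers on the system

nvidia-smi

It turned out none of the GPUs was on, so I turned one on. The command (it enables persistence mode):

nvidia-smi -pm 1

Now GPU 1 is on and GPU 0 is still off. This was done on the host.

Check again inside the Docker container. The numbering inside the container is what counts, so my GPU ID is 0. The same applies in other setups: go by what the container shows.

Note: the GPU numbering inside the container is set in the .sh script used to enter Docker.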
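The "invalid device ordinal" error is what you would expect when --device-ids names a GPU that the container cannot see. A small sketch of that check, assuming this was the cause here (the post only confirms that the container's GPU ID is 0); both function names are hypothetical.

```python
def parse_device_ids(arg):
    """Turn a --device-ids value such as "0,1" into [0, 1]."""
    return [int(part) for part in arg.split(",") if part.strip()]


def invalid_device_ids(arg, visible_gpu_count):
    """Return the requested IDs that do not exist among the visible GPUs.

    Valid CUDA ordinals run from 0 to visible_gpu_count - 1.
    """
    return [device_id for device_id in parse_device_ids(arg)
            if not 0 <= device_id < visible_gpu_count]
```

With one visible GPU, invalid_device_ids("0,1", 1) flags ID 1, while "0" passes. Inside the container the visible count can be read with torch.cuda.device_count().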

Training now runs normally. Output printed along the way:

step4: Validate the model accuracy

(Note: the command as written in the manual does not run. The arguments are wrong and no GPU ID is specified, so I changed the arguments myself.)

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py --stage float --device-ids 0 --config ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py --val-ckpt float-checkpoint-best.pth.tar --val-only

It does not run: in the Horizon 2.6.2b Docker image, train.py does not support these arguments. I didn't believe it, so I tried again.

OK, it really doesn't work.

step5: Export the ONNX model

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/export_onnx.py --config ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py --ckpt float-checkpoint-best-fd43ddf4.pth.tar --onnx-name deeplabv3plus_efficientnetm0.onnx

What a thing! The export_onnx.py script can only write a fixed ONNX file name. I didn't believe it, so I tried.

OK, the name really doesn't match what I passed... never mind.

3. Single-image test of the model

Check how the trained model does on single-image inference

python3 ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/infer.py --config ./ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/configs/segmentation/deeplabv3plus_efficientnetm0.py --dataset cityscapes --input-size 1024x2048x3 --input-images /data/horizon_x3/data/leftImg8bit/test/berlin/berlin_000467_000019_leftImg8bit.png --input-format yuv --is-plot

It fails with

FileNotFoundError: [Errno 2] No such file or directory: './tmp_models/deeplabv3plus_efficientnetm0_seg/calibration-checkpoint-best.pth.tar'

My guess is that single-image inference needs the fake-quantized model produced by the calibration step of quantization-aware training (QAT). This still needs to be verified.
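Before running infer.py it helps to see which stage checkpoints actually exist in the output directory. A minimal sketch based only on the file names that appear in this post (float-checkpoint-best.pth.tar, calibration-checkpoint-best.pth.tar); best_checkpoints_by_stage is a hypothetical helper, and the naming pattern is an assumption drawn from those names.

```python
import glob
import os


def best_checkpoints_by_stage(ckpt_dir):
    """Group files named <stage>-checkpoint-best*.pth.tar by their stage prefix."""
    found = {}
    pattern = os.path.join(ckpt_dir, "*-checkpoint-best*.pth.tar")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        # Everything before "-checkpoint-" is treated as the stage name.
        stage = name.split("-checkpoint-")[0]
        found.setdefault(stage, []).append(name)
    return found
```

Calling it on ./tmp_models/deeplabv3plus_efficientnetm0_seg after float training alone would be expected to show a float entry but no calibration entry, which matches the FileNotFoundError above.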

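4. Model conversion check

In the Horizon AI toolchain the overall flow is: Python (PyTorch) model to ONNX, then ONNX to bin. The ONNX-to-bin step not only converts ONNX into Horizon's own bin format, it also quantizes the model. There are two ways to quantize:

1. Quantization-aware training (QAT), which quantizes during training

2. Post-training quantization (PTQ), which quantizes after training is finished

The Horizon documentation says PTQ is simpler and already widely used in many scenarios, so it recommends PTQ. See section 3.2, "Algorithm model PTQ quantization + on-board quick start", in the Horizon Open Explorer documentation.

There are two steps (the second, conversion and quantization, is section 5 below):

1. Check the floating-point ONNX model to make sure it meets the constraints of the computing platform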

Example command:

hb_mapper checker --model-type onnx \
--model efficientnet_lite0_fp32.onnx \
--march bernoulli2

Generated output:

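5. Model conversion and quantization

Once the model passes the check, it can be converted with the hb_mapper makertbin tool. Reference command:

hb_mapper makertbin --config mobilenet_config.yaml \
--model-type caffe

Here mobilenet_config.yaml is the configuration file for the conversion, described in the "Yaml configuration file" section of the official documentation. model-type specifies the type of the input model and can be caffe or onnx; the configuration file parameters differ slightly between model types. In addition, PTQ quantization needs a certain number of preprocessed samples for calibration, described in the "Calibration data preprocessing" section of the official documentation.

Once the calibration data and the yaml configuration file are ready, a single command runs the whole conversion: model parsing, graph optimization, calibration, quantization and compilation.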

Configure the yaml file
Prepare the calibration data
Quantize and convert the format

hb_mapper makertbin --config deeplabv3plus_efficientnetb0_config_lxy.yaml --model-type onnx

This finally produces the quantized model in bin format.

The bin file is the model that gets loaded and run on the Horizon computing platform.
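The post does not show the contents of the yaml file passed to hb_mapper makertbin, so the following is only a rough sketch of how such a file is usually laid out. The section and key names are written as I recall them from the toolchain's sample configs and should be checked against the official documentation. Every value is a placeholder, not the setting used in this post, and render_yaml is a hypothetical helper.

```python
def render_yaml(config):
    """Render a two-level dict of sections as simple YAML text."""
    lines = []
    for section, params in config.items():
        lines.append(section + ":")
        for key, value in params.items():
            lines.append("  {}: '{}'".format(key, value))
    return "\n".join(lines) + "\n"


# All values below are placeholders (assumptions); adapt them to your model.
config = {
    "model_parameters": {
        "onnx_model": "deeplabv3plus_efficientnetm0.onnx",
        "march": "bernoulli2",
        "working_dir": "model_output",
        "output_model_file_prefix": "deeplabv3plus_efficientnetm0",
    },
    "input_parameters": {
        "input_type_rt": "nv12",
        "input_type_train": "rgb",
        "input_layout_train": "NCHW",
        "norm_type": "no_preprocess",
    },
    "calibration_parameters": {
        "cal_data_dir": "./calibration_data",
        "calibration_type": "default",
    },
    "compiler_parameters": {
        "compile_mode": "latency",
        "optimize_level": "O3",
    },
}

print(render_yaml(config))
```

The march value matches the bernoulli2 used in the checker command earlier. The input types and normalization in particular depend on how the model was trained.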

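6. Dynamic performance evaluation

1. Make sure the development board is set up. Here a Horizon Sunrise X3 is used.

2. Copy the bin file produced by the conversion to any path under /userdata on the board

3. Use the hrt_model_exec perf tool to quickly evaluate the model's latency and frame rate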
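The post gives no command line for this step. A hedged sketch of what the call might look like: the flag names --model_file, --thread_num and --frame_count are written from memory of the tool's documentation and should be confirmed with the tool's help output on the board. perf_command is a hypothetical helper and the bin path is a placeholder.

```python
import shlex


def perf_command(model_file, thread_num=1, frame_count=200):
    """Build an hrt_model_exec perf argument list (flag names are assumptions)."""
    return [
        "hrt_model_exec", "perf",
        "--model_file={}".format(model_file),
        "--thread_num={}".format(thread_num),
        "--frame_count={}".format(frame_count),
    ]


# Placeholder path: use wherever the bin file was copied under /userdata.
print(shlex.join(perf_command("/userdata/model.bin")))
```

Raising thread_num shows how throughput changes with concurrent inference, while a single thread gives the clearest per-frame latency.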
