Training a segnet model on Xavier (Jetson)
This article covers the training side of developing a segnet semantic-segmentation model, based on the source at https://github.com/dusty-nv/jetson-inference.
Environment
Hardware: Xavier
JetPack version: JetPack 4.4
Docker container: dustynv/jetson-inference:r32.4.3
Training workflow
1. Enter the training directory in the source tree:
cd /jetson-inference/python/training/segmentation
2. Modify the dataset loading function (VOC format; the dataset preparation script is still being updated):
def get_dataset(path, image_set, transform):
    """
    Load a VOC-format dataset.
    name: voc
    path: dataset root path
    image_set: "train"
    transform: get_transform(train=True, resolution=resolution)
    """
    num_classes = 2
    ds = VOCSegmentationLoad(path, image_set=image_set, transforms=transform)
    return ds, num_classes
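The custom VOCSegmentationLoad class is not shown here. As a rough illustration of what such a loader indexes (the function name and directory layout below are assumptions, following the standard Pascal VOC structure), it essentially pairs each image with its class-ID mask according to the chosen image_set file:

```python
import os

def index_voc_pairs(root, image_set="train"):
    """Build (image_path, mask_path) pairs for a VOC-style dataset.

    Assumes the standard Pascal VOC layout:
      root/ImageSets/Segmentation/<image_set>.txt  - one sample id per line
      root/JPEGImages/<id>.jpg                     - input images
      root/SegmentationClass/<id>.png              - class-id masks
    """
    split_file = os.path.join(root, "ImageSets", "Segmentation", image_set + ".txt")
    with open(split_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    images = [os.path.join(root, "JPEGImages", i + ".jpg") for i in ids]
    masks = [os.path.join(root, "SegmentationClass", i + ".png") for i in ids]
    return list(zip(images, masks))
```

A Dataset class wrapping these pairs would then open each image/mask and apply the transform passed into get_dataset.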
3. Model training
3.1 The source code uses fcn_resnet18, but torchvision only ships fcn_resnet50 and fcn_resnet101, so fcn_resnet18 has to be defined by hand.
Models supported by torchvision out of the box:
__all__ = ['fcn_resnet50', 'fcn_resnet101', 'deeplabv3_resnet50', 'deeplabv3_resnet101']

model_urls = {
    'fcn_resnet50_coco': 'https://download.pytorch.org/models/fcn_resnet50_coco-1167a1af.pth',
    'fcn_resnet101_coco': 'https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth',
    'deeplabv3_resnet50_coco': 'https://download.pytorch.org/models/deeplabv3_resnet50_coco-cd0a2569.pth',
    'deeplabv3_resnet101_coco': 'https://download.pytorch.org/models/deeplabv3_resnet101_coco-586e9e4e.pth',
}
Add the custom fcn_resnet18 model here (file: /usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/models/segmentation/segmentation.py):
def fcn_resnet18(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs):
    """Constructs a Fully-Convolutional Network model with a ResNet-18 backbone.

    Args:
        pretrained (bool): If True, returns a model pre-trained on COCO train2017 which
            contains the same classes as Pascal VOC
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _load_model('fcn', 'resnet18', pretrained, progress, num_classes, aux_loss, **kwargs)
resnet18 is built from BasicBlock, while resnet50/resnet101 use Bottleneck, so the function def _segm_resnet(name, backbone_name, num_classes, aux, pretrained_backbone=True): needs two further changes. Change

backbone = resnet.__dict__[backbone_name](
    pretrained=pretrained_backbone,
    replace_stride_with_dilation=[False, True, True])

to:

backbone = resnet.__dict__[backbone_name](
    pretrained=pretrained_backbone,
    replace_stride_with_dilation=[False, False, False])

And since resnet18 outputs 512 channels where resnet50 outputs 2048, change

inplanes = 2048

to

inplanes = 512
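With replace_stride_with_dilation=[False, False, False], no stride is converted to dilation, so the ResNet-18 backbone keeps its full 32x downsampling. A quick back-of-the-envelope check (plain Python, no torch needed) shows why the classifier output for a 512x320 input is 16x10, the exact shape that shows up in the training error discussed in the QA section:

```python
def fcn_output_size(width, height, output_stride=32):
    """Spatial size of the FCN classifier output before any upsampling.

    A ResNet backbone with no stride-to-dilation replacement downsamples
    by 2 in the stem conv, 2 in the max-pool, and 2 in each of the last
    three stages: 2*2*2*2*2 = 32 overall.
    """
    return width // output_stride, height // output_stride

print(fcn_output_size(512, 320))  # (16, 10)
```

This is why the forward pass must bilinearly upsample back to the input resolution during training: the loss compares against full-resolution 320x512 label masks.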
3.2 Return to /jetson-inference/python/training/segmentation and start training:
python3 train.py path_voc_dataset --width=512 --height=320 -b=1 -j=8 --model-dir=models_dir
Training output:
pytorch-segmentation/datasets/__init__.py
Not using distributed mode
Namespace(arch='fcn_resnet18', aux_loss=False, batch_size=1, data='/jetson-inference/data/label/ts_data/', dataset='voc', device='cuda', dist_url='env://', distributed=False, epochs=30, height=320, lr=0.01, model_dir='models/', momentum=0.9, pretrained=False, print_freq=10, resolution=320, resume='', test_only=False, weight_decay=0.0001, width=512, workers=8, world_size=1)
=> training with dataset: 'voc' (train=714, val=179)
=> training with resolution: 512x320, 3 classes
=> training with model: fcn_resnet18
arch:fcn_resnet18
.
.
.
target = torch.as_tensor(np.asarray(target), dtype=torch.int64)
Test: [ 0/179] eta: 0:01:05 time: 0.3666 data: 0.3359 max mem: 286
Test: [100/179] eta: 0:00:02 time: 0.0227 data: 0.0035 max mem: 286
Test: Total time: 0:00:04
global correct: 98.7
average row correct: ['99.7', '18.9', '65.6']
IoU: ['98.7', '15.7', '53.9']
mean IoU: 56.1
saved checkpoint to: models/model_29.pth (56.118% mean IoU, 98.651% accuracy)
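The "global correct", "average row correct", and "IoU" numbers in the log come from a confusion matrix accumulated over the validation set. A minimal stand-alone reimplementation of those metrics (plain Python lists here; the upstream script computes the same quantities with torch tensors):

```python
def segmentation_metrics(confusion):
    """Compute global accuracy, per-class accuracy, and per-class IoU
    from a square confusion matrix: confusion[gt][pred] = pixel count."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    diag = [confusion[i][i] for i in range(n)]
    # global correct: all correctly classified pixels over all pixels
    global_correct = sum(diag) / total
    # per-class accuracy ("row correct"): correct pixels / ground-truth pixels
    row_correct = [diag[i] / sum(confusion[i]) for i in range(n)]
    # IoU: intersection / (gt pixels + predicted pixels - intersection)
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    iou = [diag[i] / (sum(confusion[i]) + col_sums[i] - diag[i]) for i in range(n)]
    return global_correct, row_correct, iou
```

Mean IoU is then simply the average of the per-class IoU list, which is what the checkpoint selection above is based on.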
The PyTorch checkpoints are written to the model directory:
root@jykj-desktop:/jetson-inference/python/training/segmentation# ls models/
model_0.pth model_14.pth model_2.pth model_25.pth model_4.pth model_best.pth
model_1.pth model_15.pth model_20.pth model_26.pth model_5.pth
model_10.pth model_16.pth model_21.pth model_27.pth model_6.pth
model_11.pth model_17.pth model_22.pth model_28.pth model_7.pth
model_12.pth model_18.pth model_23.pth model_29.pth model_8.pth
model_13.pth model_19.pth model_24.pth model_3.pth model_9.pth
4. Model conversion
First: torchvision's FCN contains upsampling logic in its forward pass, while the author's reference ONNX model has no upsample op. The torchvision source therefore has to be patched to drop the upsampling before export (otherwise the model fails to load at inference time):
Source file:
/usr/local/lib/python3.6/dist-packages/torchvision-0.7.0a0+78ed10c-py3.6-linux-aarch64.egg/torchvision/models/segmentation/_utils.py
In class _SimpleSegmentationModel(nn.Module), change
def forward(self, x):
    input_shape = x.shape[-2:]
    # contract: features is a dict of tensors
    features = self.backbone(x)
    result = OrderedDict()
    x = features["out"]
    x = self.classifier(x)
    x = F.interpolate(x, size=input_shape, mode='bilinear', align_corners=False)
    result["out"] = x

    if self.aux_classifier is not None:
        x = features["aux"]
        x = self.aux_classifier(x)
        x = F.interpolate(x, size=input_shape, mode='bilinear', align_corners=False)
        result["aux"] = x

    return result
to:
def forward(self, x):
    input_shape = x.shape[-2:]
    # contract: features is a dict of tensors
    features = self.backbone(x)
    result = OrderedDict()
    x = features["out"]
    x = self.classifier(x)
    if self.train_status:  # upsample during training; skip it when exporting the model
        x = F.interpolate(x, size=input_shape, mode='bilinear', align_corners=False)
    result["out"] = x

    if self.aux_classifier is not None:
        x = features["aux"]
        x = self.aux_classifier(x)
        x = F.interpolate(x, size=input_shape, mode='bilinear', align_corners=False)
        result["aux"] = x

    return result
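Note that self.train_status does not exist in stock torchvision, so the flag must also be accepted and stored in the class's __init__ (and threaded through the model constructor's **kwargs). A stripped-down, torch-free sketch of the resulting control flow (the class body here is illustrative; only the train_status name mirrors the patch):

```python
class SimpleSegmentationSketch:
    """Torch-free mock of the patched _SimpleSegmentationModel control flow."""

    def __init__(self, train_status=True):
        # the flag added by the patch; defaults to training behaviour
        self.train_status = train_status

    def forward(self, input_shape, feature_shape):
        # the classifier runs at the backbone's downsampled resolution
        out_shape = feature_shape
        if self.train_status:
            # training: bilinear upsampling restores the input resolution
            out_shape = input_shape
        # export: the coarse classifier output is returned unchanged
        return out_shape
```

In training mode a 512x320 input comes back at 320x512; in export mode the output stays at the coarse 10x16 classifier resolution, matching the exported ONNX graph shown later.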
Go back to the training directory:
/jetson-inference/python/training/segmentation
and modify onnx_export.py:
# model = models.segmentation.__dict__[arch](num_classes=num_classes,
#                                            aux_loss=None,
#                                            pretrained=False,
#                                            export_onnx=True)  # removed: torchvision does not recognize this kwarg
model = models.segmentation.__dict__[arch](num_classes=num_classes,
                                           aux_loss=None,
                                           pretrained=False,
                                           train_status=False)  # the flag that disables upsampling; the name is up to you
Run the conversion:
python3 onnx_export.py --input=path_of_model.pth --output=output_model.onnx
Conversion output:
Namespace(input='models/model_best.pth', model_dir='', output='fcn_resnet18.onnx')
running on device cuda:0
loading checkpoint: models/model_best.pth
using model: fcn_resnet18
num classes: 3
input size: 512x320
exporting model to ONNX...
graph(%input_0 : Float(1:491520, 3:163840, 320:512, 512:1),
.
.
.
%output_0 : Float(1:480, 3:160, 10:16, 16:1) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[1, 1], pads=[0, 0, 0, 0], strides=[1, 1]](%197, %classifier.4.weight, %classifier.4.bias) # /usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py:416:0
return (%output_0)
model exported to: fcn_resnet18.onnx
At this point the segnet training and conversion workflow on Xavier is complete, and the model can be used for training and testing.
Detection results (custom output format):
{"frame_id": 199, "time": "2020-11-27 01:57:48", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 200, "time": "2020-11-27 01:57:48", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 201, "time": "2020-11-27 01:57:48", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 202, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 203, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 204, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 205, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 206, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 2}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 207, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 208, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 209, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 210, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 211, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 212, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 213, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 214, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 215, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 216, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 4}]}
{"frame_id": 217, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 218, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 219, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 220, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 3}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 221, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 222, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 223, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 3}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 224, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 225, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 1, "name": "something", "pixel_num": 3}, {"class_id": 2, "name": "machine", "pixel_num": 3}]}
{"frame_id": 226, "time": "2020-11-27 01:57:49", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 227, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 228, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 229, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 4}]}
{"frame_id": 230, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 231, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 232, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 233, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 234, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 235, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 4}]}
{"frame_id": 236, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 237, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 238, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 239, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 240, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 241, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 242, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 2}]}
{"frame_id": 243, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 244, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 3}, {"class_id": 2, "name": "machine", "pixel_num": 1}]}
{"frame_id": 245, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 160}]}
{"frame_id": 246, "time": "2020-11-27 01:57:50", "objects": [{"class_id": 1, "name": "something", "pixel_num": 1}, {"class_id": 2, "name": "machine", "pixel_num": 3}]}
5. QA
1. Training error:
AttributeError: module 'torchvision.models.segmentation.segmentation' has no attribute 'fcn_resnet18'
Fix: patch the torchvision source to add fcn_resnet18, as in 3.1.
2. Training error:
NotImplementedError: Dilation > 1 not supported in BasicBlock
Fix: resnet18 is built from BasicBlock, so change replace_stride_with_dilation to [False, False, False].
3. Training error:
RuntimeError: input and target batch or spatial sizes don't match: target [1 x 320 x 512], input [1 x 3 x 10 x 16] at /media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:2
Fix: keep the upsampling during training and remove it only for model conversion:
if self.train_status:  # upsample during training; skip it when exporting the model
    x = F.interpolate(x, size=input_shape, mode='bilinear', align_corners=False)
4. Inference error:
ERROR: builtin_op_importers.cpp:3460 In function importUpsample:
[8] Assertion failed: scales_input.is_weights()
[TRT] failed to parse ONNX model '/usr/local/bin/networks/FCN-ResNet18-Pascal-VOC-512x320/fcn_resnet18.onnx'
[TRT] device GPU, failed to load networks/FCN-ResNet18-Pascal-VOC-512x320/fcn_resnet18.onnx
[TRT] segNet -- failed to load.
jetson.inference -- segNet failed to load network
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "./segnet.py", line 151, in process
net = jetson.inference.segNet(opt.network, sys.argv)
Exception: jetson.inference -- segNet failed to load network
Fix: remove the upsampling when converting the model.
ONNX graph with the upsampling left in:
%201 = Conv[dilations = [1, 1], group = 1, kernel_shape = [3, 3], pads = [1, 1, 1, 1], strides = [1, 1]](%200, %classifier.0.weight)
%202 = BatchNormalization[epsilon = 9.99999974737875e-06, momentum = 0.899999976158142](%201, %classifier.1.weight, %classifier.1.bias, %classifier.1.running_mean, %classifier.1.running_var)
%203 = Relu(%202)
%204 = Conv[dilations = [1, 1], group = 1, kernel_shape = [1, 1], pads = [0, 0, 0, 0], strides = [1, 1]](%203, %classifier.4.weight, %classifier.4.bias)
%205 = Unsqueeze[axes = [0]](%131)
%206 = Unsqueeze[axes = [0]](%134)
%207 = Concat[axis = 0](%205, %206)
%208 = Constant[value = <Tensor>]()
%209 = Cast[to = 1](%207)
%210 = Shape(%204)
%211 = Slice[axes = [0], ends = [9223372036854775807], starts = [2]](%210)
%212 = Cast[to = 1](%211)
%213 = Div(%209, %212)
%214 = Concat[axis = 0](%208, %213)
%output_0 = Upsample[mode = 'linear'](%204, %214)
return %output_0
ONNX graph with the upsampling removed:
%194 = Relu(%193)
%195 = Conv[dilations = [1, 1], group = 1, kernel_shape = [3, 3], pads = [1, 1, 1, 1], strides = [1, 1]](%194, %classifier.0.weight)
%196 = BatchNormalization[epsilon = 9.99999974737875e-06, momentum = 0.899999976158142](%195, %classifier.1.weight, %classifier.1.bias, %classifier.1.running_mean, %classifier.1.running_var)
%197 = Relu(%196)
%output_0 = Conv[dilations = [1, 1], group = 1, kernel_shape = [1, 1], pads = [0, 0, 0, 0], strides = [1, 1]](%197, %classifier.4.weight, %classifier.4.bias)
return %output_0
}
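A quick way to confirm an export will load under TensorRT without hitting the importUpsample assertion is to scan the graph for Upsample/Resize ops. A minimal string-level check against the dumped graph text (for a real .onnx file you would iterate model.graph.node with the onnx package instead):

```python
def find_resize_ops(graph_text):
    """Return the lines of a printed ONNX graph that contain upsampling ops.

    Operates on the textual graph dump that torch.onnx prints during export;
    'Upsample' and 'Resize' are the op names that trip the TensorRT parser here.
    """
    suspects = ("Upsample", "Resize")
    return [line.strip() for line in graph_text.splitlines()
            if any(op in line for op in suspects)]
```

Running this over the two graph dumps above flags the Upsample node in the first and nothing in the second.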