CLRNet Inference Explained and Deployment (Part 2)

爱听歌的周童鞋

Last modified 2024-08-15 22:25:16


Category: Model Deployment. Tags: lane detection, clrnet, onnx, tensorrt, cuda

First published 2024-08-11 20:49:27

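Original articles on this blog may not be used for commercial purposes without my permission. Please credit the source when reposting; otherwise I reserve the right to take legal action.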

Article link: https://blog.csdn.net/qq_40672115/article/details/141107365


Notes

I. Update 2024/8/15

Fixed the video inference bug; the code has been updated accordingly.

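Preface

In CLRNet Inference Explained and Deployment (Part 1) we covered how to export CLRNet to ONNX. This article shows how to run inference with tensorRT and obtain results.

Note: before starting, follow CLRNet Inference Explained and Deployment (Part 1) to set up the environment and export the CLRNet ONNX model; those steps are not repeated here.

repo: https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8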


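I. CLRNet Inference (Python)

1. CLRNet Prediction

First, let's run inference on a single image with the official pretrained weights and save the result, to check that everything works.

Create a predict.py file in the CLRNet directory with the following content: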

import cv2
import torch
import numpy as np
from clrnet.utils.config import Config
from mmcv.parallel import MMDataParallel
from clrnet.models.registry import build_net
from clrnet.utils.visualization import imshow_lanes

def preprocess(img, img_w, img_h, cut_height):
    img_pre = img[cut_height:, :, :]
    img_pre = cv2.resize(img_pre, (img_w, img_h))
    img_pre = (img_pre / 255.0).astype(np.float32)
    img_pre = img_pre.transpose(2, 0, 1)[None]
    img_pre = torch.from_numpy(img_pre)
    return img_pre

if __name__ == "__main__":

    img = cv2.imread("./data/CULane/driver_37_30frame/05181432_0203.MP4/00210.jpg")
    img_pre = preprocess(img, 800, 320, 270).to("cuda")

    cfg = Config.fromfile("configs/clrnet/clr_resnet18_culane.py")
    checkpoint_file_path = "culane_r18.pth"
    net = build_net(cfg)
    net = MMDataParallel(net, device_ids=range(1)).cuda()
    pretrained_model = torch.load(checkpoint_file_path)
    net.load_state_dict(pretrained_model['net'], strict=False)
    net.eval()
    model = net.to("cuda")

    with torch.no_grad():
        output = model(img_pre)
        lanes  = model.module.heads.get_lanes(output)[0]
        lanes  = [lane.to_array(cfg) for lane in lanes]
        imshow_lanes(img, lanes, None, "./result.jpg")

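Note: the code, weights, and dataset can be downloaded by clicking here. Running the script requires the compiled NMS plugin; see the previous article for the environment setup, which is not repeated here.

Running the script generates the inference result image result.jpg in the current directory:

[Figure: CLRNet inference result (result.jpg)]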

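2. CLRNet Preprocessing

With inference working, the next step is to sort out CLRNet's preprocessing and postprocessing so that they can be reimplemented in C++. Let's start with preprocessing; here I mainly referred to the implementation in CLRNet-onnxruntime-and-tensorrt-demo.

Debugging (details omitted… 😄) shows that CLRNet's preprocessing lives in clrnet/datasets/base_dataset.py; see base_dataset.py#L37: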

def __getitem__(self, idx):
    data_info = self.data_infos[idx]
    img = cv2.imread(data_info['img_path'])
    img = img[self.cfg.cut_height:, :, :]
    sample = data_info.copy()
    sample.update({'img': img})

    if self.training:
        label = cv2.imread(sample['mask_path'], cv2.IMREAD_UNCHANGED)
        if len(label.shape) > 2:
            label = label[:, :, 0]
        label = label.squeeze()
        label = label[self.cfg.cut_height:, :]
        sample.update({'mask': label})

        if self.cfg.cut_height != 0:
            new_lanes = []
            for i in sample['lanes']:
                lanes = []
                for p in i:
                    lanes.append((p[0], p[1] - self.cfg.cut_height))
                new_lanes.append(lanes)
            sample.update({'lanes': new_lanes})

    sample = self.processes(sample)
    meta = {'full_img_path': data_info['img_path'],
            'img_name': data_info['img_name']}
    meta = DC(meta, cpu_only=True)
    sample.update({'meta': meta})

    return sample

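The image is first read with opencv and cropped; self.processes then applies the preprocessing steps defined in the cfg configuration file. Here is the relevant part of the configuration file clr_resnet18_culane.py: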

val_process = [
    dict(type='GenerateLaneLine',
         transforms=[
             dict(name='Resize',
                  parameters=dict(size=dict(height=img_h, width=img_w)),
                  p=1.0),
         ],
         training=False),
    dict(type='ToTensor', keys=['img']),
]

val=dict(
    type=dataset_type,
    data_root=dataset_path,
    split='test',
    processes=val_process,
),

The val_process list contains the following two operations:

Resize
ToTensor

The implementation of both operations can be found in CLRNet/clrnet/datasets/process/transforms.py:

def to_tensor(data):
    """Convert objects of various python types to :obj:`torch.Tensor`.

    Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,
    :class:`Sequence`, :class:`int` and :class:`float`.

    Args:
        data (torch.Tensor | numpy.ndarray | Sequence | int | float): Data to
            be converted.
    """

    if isinstance(data, torch.Tensor):
        return data
    elif isinstance(data, np.ndarray):
        return torch.from_numpy(data)
    elif isinstance(data, int):
        return torch.LongTensor([data])
    elif isinstance(data, float):
        return torch.FloatTensor([data])
    else:
        raise TypeError(f'type {type(data)} cannot be converted to tensor.')

@PROCESS.register_module
class Resize(object):
    def __init__(self, size, cfg=None):
        assert (isinstance(size, collections.Iterable) and len(size) == 2)
        self.size = size

    def __call__(self, sample):
        out = list()
        sample['img'] = cv2.resize(sample['img'],
                                   self.size,
                                   interpolation=cv2.INTER_CUBIC)
        if 'mask' in sample:
            sample['mask'] = cv2.resize(sample['mask'],
                                        self.size,
                                        interpolation=cv2.INTER_NEAREST)
        return sample

@PROCESS.register_module
class ToTensor(object):
    """Convert some results to :obj:`torch.Tensor` by given keys.

    Args:
        keys (Sequence[str]): Keys that need to be converted to Tensor.
    """
    def __init__(self, keys=['img', 'mask'], cfg=None):
        self.keys = keys

    def __call__(self, sample):
        data = {}
        if len(sample['img'].shape) < 3:
            sample['img'] = np.expand_dims(img, -1)
        for key in self.keys:
            if key == 'img_metas' or key == 'gt_masks' or key == 'lane_line':
                data[key] = sample[key]
                continue
            data[key] = to_tensor(sample[key])
        data['img'] = data['img'].permute(2, 0, 1)
        return data

    def __repr__(self):
        return self.__class__.__name__ + f'(keys={self.keys})'

One more preprocessing operation lives in generate_lane_line.py:

def __call__(self, sample):
    img_org = sample['img']
    line_strings_org = self.lane_to_linestrings(sample['lanes'])
    line_strings_org = LineStringsOnImage(line_strings_org,
                                            shape=img_org.shape)

    for i in range(30):
        if self.training:
            mask_org = SegmentationMapsOnImage(sample['mask'],
                                                shape=img_org.shape)
            img, line_strings, seg = self.transform(
                image=img_org.copy().astype(np.uint8),
                line_strings=line_strings_org,
                segmentation_maps=mask_org)
        else:
            img, line_strings = self.transform(
                image=img_org.copy().astype(np.uint8),
                line_strings=line_strings_org)
        line_strings.clip_out_of_image_()
        new_anno = {'lanes': self.linestrings_to_lanes(line_strings)}
        try:
            annos = self.transform_annotation(new_anno,
                                                img_wh=(self.img_w,
                                                        self.img_h))
            label = annos['label']
            lane_endpoints = annos['lane_endpoints']
            break
        except:
            if (i + 1) == 30:
                self.logger.critical(
                    'Transform annotation failed 30 times :(')
                exit()

    sample['img'] = img.astype(np.float32) / 255.
    sample['lane_line'] = label
    sample['lanes_endpoints'] = lane_endpoints
    sample['gt_points'] = new_anno['lanes']
    sample['seg'] = seg.get_arr() if self.training else np.zeros(
        img_org.shape)

    return sample

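To summarize, CLRNet preprocessing consists of the following steps:

img = img[self.cfg.cut_height:, :, :]: height crop, which removes the sky region of the image
Resize: scale the image to 800x320
ToTensor: dimension permutation, HWC -> CHW (a batch dimension is added afterwards, giving BCHW)
sample['img'] = img.astype(np.float32) / 255.: divide by 255 to normalize

This is very similar to the preprocessing of the LaneATT model discussed earlier, with only the cut_height step added, so the corresponding preprocessing code is easy to write: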

def preprocess(self, img):
    # 0. cut 
    img = img[self.cut_height:, :, :]
    # 1. resize
    img = cv2.resize(img, (self.img_w, self.img_h))
    # 2. normalize
    img = (img / 255.0).astype(np.float32)
    # 3. to bchw
    img = img.transpose(2, 0, 1)[None]
    return img

Note: the resize target is width=800, height=320, and there is no BGR->RGB conversion, so the CLRNet model input is 1x3x320x800.
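As a quick sanity check of the layout described in this note, here is a minimal NumPy-only sketch. It is not the author's code: the function name preprocess_shape_check is hypothetical, and a nearest-neighbour index lookup stands in for cv2.resize purely so the example runs without OpenCV. It only illustrates the crop, the output shape, and the fact that the channel order stays BGR.

```python
import numpy as np

def preprocess_shape_check(img, img_w=800, img_h=320, cut_height=270):
    # 0. cut: drop the top cut_height rows (the sky region)
    cropped = img[cut_height:, :, :]
    # 1. resize: nearest-neighbour stand-in for cv2.resize (assumption for this sketch)
    ys = (np.arange(img_h) * cropped.shape[0] / img_h).astype(int)
    xs = (np.arange(img_w) * cropped.shape[1] / img_w).astype(int)
    resized = cropped[ys][:, xs]
    # 2. normalize to [0, 1]
    out = (resized / 255.0).astype(np.float32)
    # 3. HWC -> CHW, then add the batch dimension
    return out.transpose(2, 0, 1)[None]

# Synthetic CULane-sized frame (590x1640) that is pure blue in OpenCV's BGR order
frame = np.zeros((590, 1640, 3), dtype=np.uint8)
frame[..., 0] = 255
blob = preprocess_shape_check(frame)
print(blob.shape)  # (1, 3, 320, 800)
```

Channel 0 of the result is all ones and channel 2 is all zeros, which confirms that no BGR->RGB swap takes place. For CULane, 590 - 270 = 320, so the crop already has the target height and only the width is actually scaled.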

3. CLRNet Postprocessing

Now let's look at the postprocessing.

Debugging (details omitted… 😄) shows that CLRNet's postprocessing lives in clrnet/models/heads/clr_head.py; see clr_head.py#L440:

def predictions_to_pred(self, predictions):
    '''
    Convert predictions to internal Lane structure for evaluation.
    '''
    self.prior_ys = self.prior_ys.to(predictions.device)
    self.prior_ys = self.prior_ys.double()
    lanes = []
    for lane in predictions:
        lane_xs = lane[6:]  # normalized value
        start = min(max(0, int(round(lane[2].item() * self.n_strips))),
                    self.n_strips)
        length = int(round(lane[5].item()))
        end = start + length - 1
        end = min(end, len(self.prior_ys) - 1)
        # end = label_end
        # if the prediction does not start at the bottom of the image,
        # extend its prediction until the x is outside the image
        mask = ~((((lane_xs[:start] >= 0.) & (lane_xs[:start] <= 1.)
                    ).cpu().numpy()[::-1].cumprod()[::-1]).astype(np.bool))
        lane_xs[end + 1:] = -2
        lane_xs[:start][mask] = -2
        lane_ys = self.prior_ys[lane_xs >= 0]
        lane_xs = lane_xs[lane_xs >= 0]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)

        lane_ys = (lane_ys * (self.cfg.ori_img_h - self.cfg.cut_height) +
                    self.cfg.cut_height) / self.cfg.ori_img_h
        if len(lane_xs) <= 1:
            continue
        points = torch.stack(
            (lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)),
            dim=1).squeeze(2)
        lane = Lane(points=points.cpu().numpy(),
                    metadata={
                        'start_x': lane[3],
                        'start_y': lane[2],
                        'conf': lane[1]
                    })
        lanes.append(lane)
    return lanes

def get_lanes(self, output, as_lanes=True):
    '''
    Convert model output to lanes.
    '''
    softmax = nn.Softmax(dim=1)

    decoded = []
    for predictions in output:
        # filter out the conf lower than conf threshold
        threshold = self.cfg.test_parameters.conf_threshold
        scores = softmax(predictions[:, :2])[:, 1]
        keep_inds = scores >= threshold
        predictions = predictions[keep_inds]
        scores = scores[keep_inds]

        if predictions.shape[0] == 0:
            decoded.append([])
            continue
        nms_predictions = predictions.detach().clone()
        nms_predictions = torch.cat(
            [nms_predictions[..., :4], nms_predictions[..., 5:]], dim=-1)
        nms_predictions[..., 4] = nms_predictions[..., 4] * self.n_strips
        nms_predictions[...,
                        5:] = nms_predictions[..., 5:] * (self.img_w - 1)

        keep, num_to_keep, _ = nms(
            nms_predictions,
            scores,
            overlap=self.cfg.test_parameters.nms_thres,
            top_k=self.cfg.max_lanes)
        keep = keep[:num_to_keep]
        predictions = predictions[keep]

        if predictions.shape[0] == 0:
            decoded.append([])
            continue

        predictions[:, 5] = torch.round(predictions[:, 5] * self.n_strips)
        if as_lanes:
            pred = self.predictions_to_pred(predictions)
        else:
            pred = predictions
        decoded.append(pred)

    return decoded

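It consists of the following steps:

softmax + conf_threshold: confidence filtering (done first, at the top of get_lanes)
nms: non-maximum suppression
predictions_to_pred: decoding of the lane proposals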
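Before analyzing the postprocessing code, let's look at what the model output means. The CLRNet model output is 1x192x78, where:

1: batch dimension
192: number of lane proposals predicted per image
78: length of the feature vector of each lane proposal, made up of the following five parts
- cls (2 values): classification scores for background and lane (softmax turns them into probabilities)
- start_y, start_x (2 values): coordinates of the lane start point
- theta (1 value): angle between the lane prior and the x axis at the start point
- length (1 value): length of the lane
- x_offset (72 values): horizontal position of the lane at each of the 72 anchor rows, normalized by the image width (the code multiplies by img_w - 1 to obtain pixels)

OK, with the meaning of every output dimension clear, the postprocessing code becomes much easier to read.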
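The layout above can be written down as a small helper. This is a sketch for orientation only; split_proposal and the dictionary keys are hypothetical names, while the index ranges follow the field list in this section.

```python
import numpy as np

N_OFFSETS = 72  # number of anchor rows per lane

def split_proposal(vec):
    # One proposal: 2 (cls) + 2 (start) + 1 (theta) + 1 (length) + 72 (xs) = 78
    assert vec.shape == (6 + N_OFFSETS,)
    return {
        "cls": vec[0:2],            # background / lane scores
        "start_y": float(vec[2]),
        "start_x": float(vec[3]),
        "theta": float(vec[4]),
        "length": float(vec[5]),
        "xs": vec[6:],              # normalized x position per anchor row
    }

# Dummy proposal whose values equal their index, to make the slicing visible
vec = np.arange(78, dtype=np.float32)
fields = split_proposal(vec)
print(fields["start_y"], fields["length"], fields["xs"].shape)
```

The same offsets (2 for start_y, 5 for length, 6 for the first x value) appear throughout the postprocessing code.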

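NMS itself is already familiar. Note, however, that the "IoU" used in lane NMS differs from the usual bounding-box IoU: to suit the shape of lanes it is computed as a horizontal distance, so a smaller value means more overlap. The official implementation is: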

template <typename scalar_t>
// __device__ inline scalar_t devIoU(scalar_t const * const a, scalar_t const * const b) {
__device__ inline bool devIoU(scalar_t const * const a, scalar_t const * const b, const float threshold) {
  const int start_a = (int) (a[2] * N_STRIPS - DATASET_OFFSET + 0.5); // 0.5 rounding trick
  const int start_b = (int) (b[2] * N_STRIPS - DATASET_OFFSET + 0.5);
  const int start = max(start_a, start_b);
  const int end_a = start_a + a[4] - 1 + 0.5 - ((a[4] - 1) < 0); //  - (x<0) trick to adjust for negative numbers (in case length is 0)
  const int end_b = start_b + b[4] - 1 + 0.5 - ((b[4] - 1) < 0);
  const int end = min(min(end_a, end_b), N_OFFSETS - 1);
  // if (end < start) return 1e9;
  if (end < start) return false;
  scalar_t dist = 0;
  for(unsigned char i = 5 + start; i <= 5 + end; ++i) {
    if (a[i] < b[i]) {
      dist += b[i] - a[i];
    } else {
      dist += a[i] - b[i];
    }
  }
  // return (dist / (end - start + 1)) < threshold;
  return dist < (threshold * (end - start + 1));
  // return dist / (end - start + 1);
}

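The function takes two lane proposals and the threshold, and measures how far apart the two proposals are horizontally within the rows they share, which suits lane detection. The steps are:

1. Compute the start and end indices of both lane proposals
2. Check whether the two proposals share any rows
3. Sum the horizontal offset differences over the shared rows
4. Compare that sum with the threshold multiplied by the number of shared rows (that is, compare the mean distance with the threshold) to decide whether the two proposals overlap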
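The four steps can be sketched in NumPy as below. This is an illustrative sketch, not the official kernel: it assumes the raw 78-value layout described earlier (length and x values still normalized), whereas the official CUDA code receives a rearranged vector; the function name lanes_overlap and the default values are assumptions.

```python
import numpy as np

def lanes_overlap(a, b, threshold, n_strips=71, img_w=800):
    # Step 1: start and end row indices of both proposals
    start_a = int(a[2] * n_strips + 0.5)
    start_b = int(b[2] * n_strips + 0.5)
    end_a = start_a + int(a[5] * n_strips + 0.5) - 1
    end_b = start_b + int(b[5] * n_strips + 0.5) - 1
    start = max(start_a, start_b, 0)
    end = min(end_a, end_b, n_strips)
    # Step 2: no shared rows means no overlap
    if end < start:
        return False
    # Step 3: summed horizontal distance in pixels over the shared rows
    xs_a = a[6 + start:6 + end + 1] * (img_w - 1)
    xs_b = b[6 + start:6 + end + 1] * (img_w - 1)
    dist = np.abs(xs_a - xs_b).sum()
    # Step 4: equivalent to mean distance < threshold
    return bool(dist < threshold * (end - start + 1))

lane_a = np.zeros(78)
lane_a[5] = 0.5          # normalized length
lane_a[6:] = 0.50        # lane at mid-image
lane_b = lane_a.copy()
lane_b[6:] = 0.51        # about 8 px to the right of lane_a
lane_c = lane_a.copy()
lane_c[2] = 0.9          # starts far above where lane_a ends

print(lanes_overlap(lane_a, lane_b, threshold=50.0))  # True
print(lanes_overlap(lane_a, lane_c, threshold=50.0))  # False
```

Comparing the sum against threshold * count, as the official code does, avoids a division and therefore never divides by zero.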

After NMS, the remaining lane proposals still need to be decoded into actual lane coordinates. The code is:

def predictions_to_pred(self, predictions):
    '''
    Convert predictions to internal Lane structure for evaluation.
    '''
    self.prior_ys = self.prior_ys.to(predictions.device)
    self.prior_ys = self.prior_ys.double()
    lanes = []
    for lane in predictions:
        lane_xs = lane[6:]  # normalized value
        start = min(max(0, int(round(lane[2].item() * self.n_strips))),
                    self.n_strips)
        length = int(round(lane[5].item()))
        end = start + length - 1
        end = min(end, len(self.prior_ys) - 1)
        # end = label_end
        # if the prediction does not start at the bottom of the image,
        # extend its prediction until the x is outside the image
        mask = ~((((lane_xs[:start] >= 0.) & (lane_xs[:start] <= 1.)
                    ).cpu().numpy()[::-1].cumprod()[::-1]).astype(np.bool))
        lane_xs[end + 1:] = -2
        lane_xs[:start][mask] = -2
        lane_ys = self.prior_ys[lane_xs >= 0]
        lane_xs = lane_xs[lane_xs >= 0]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)

        lane_ys = (lane_ys * (self.cfg.ori_img_h - self.cfg.cut_height) +
                    self.cfg.cut_height) / self.cfg.ori_img_h
        if len(lane_xs) <= 1:
            continue
        points = torch.stack(
            (lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)),
            dim=1).squeeze(2)
        lane = Lane(points=points.cpu().numpy(),
                    metadata={
                        'start_x': lane[3],
                        'start_y': lane[2],
                        'conf': lane[1]
                    })
        lanes.append(lane)
    return lanes

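It mainly consists of the following steps: (from ChatGPT)

1. Prepare the anchor y coordinates
- anchor_ys (prior_ys in the code) are fixed, evenly spaced values. CLRNet exploits prior knowledge about lanes: most lanes start at the bottom of the image and extend upwards, so the distribution of y coordinates is relatively fixed
2. Process each proposal
- Get the horizontal positions: lane_xs = lane[6:]
- Compute the start and length:
  - start: index of the lane start point among the anchors
  - length: length of the lane
  - end: index of the lane end point among the anchors, clamped to the anchor range
- Handle the points before the start: if the lane proposal does not start at the bottom of the image, extend it until x leaves the image, and set the remaining invalid points to -2
- Extract the valid lane points: filter out the invalid points and keep only the valid ones
- Flip the lane points: reverse the point order (the anchors are indexed from the bottom of the image upwards)
- Build the lane object: wrap the points in a Lane object together with some metadata (such as the start point and the confidence)
3. Return the lane list: return all lane objects.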
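The least obvious line in predictions_to_pred is the reversed cumprod that builds the mask for the points before start. The sketch below isolates that trick in NumPy. The function name extension_mask is hypothetical, and the builtin bool is used instead of np.bool because that alias is unavailable in some NumPy releases (it was removed in 1.24).

```python
import numpy as np

def extension_mask(xs_before_start):
    # A point is valid if its normalized x lies inside the image
    valid = (xs_before_start >= 0.0) & (xs_before_start <= 1.0)
    # Reversed cumulative product: entry i is 1 only if every point
    # from i up to the start index is valid (an unbroken run)
    contiguous = valid[::-1].cumprod()[::-1].astype(bool)
    # True marks points that must be invalidated (set to -2 in the original)
    return ~contiguous

# Four points before the start index; the second one lies outside the image
xs = np.array([0.30, 1.20, 0.45, 0.50])
mask = extension_mask(xs)
print(mask.tolist())  # [True, True, False, False]
```

The first point is itself inside the image but is still masked, because the run of valid points reaching back from the start is broken by the second point. That is how the lane is extended only until x first leaves the image.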

With the analysis above, the corresponding postprocessing code is straightforward to write:

def postprocess(self, pred):
    # pred->1x192x78
    # pred = 
    scores = pred[:, :, :2]
    def softmax(x):
        e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e_x / e_x.sum(axis=-1, keepdims=True)        
    scores = softmax(scores)
    pred[:, :, :2] = scores

    lanes = []
    for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
        lane = pred[img_id, lane_id]
        lanes.append(lane.tolist())
    lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
    lanes = self._nms(lanes)
    lanes_points = self._decode(lanes)
    return lanes_points[:self.nms_topk]

def _nms(self, lanes):
    
    remove_flags = [False] * len(lanes)
    
    keep_lanes = []
    for i, ilane in enumerate(lanes):
        if remove_flags[i]:
            continue
            
        keep_lanes.append(ilane)
        for j in range(i + 1, len(lanes)):
            if remove_flags[j]:
                continue
            
            jlane = lanes[j]
            if self._lane_iou(ilane, jlane) < self.nms_thres:
                remove_flags[j] = True
    return keep_lanes

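def _lane_iou(self, lane_a, lane_b):
    # lane = (_, conf, start_y, start_x, theta, length, ...) = 2+2+1+1+72 = 78
    # returns the mean horizontal distance in pixels over the shared rows (smaller = more overlap)
    start_a = int(lane_a[2] * self.n_strips + 0.5)
    start_b = int(lane_b[2] * self.n_strips + 0.5)
    start   = max(max(start_a, start_b), 0)
    end_a   = start_a + int(lane_a[5] * self.n_strips + 0.5) - 1
    end_b   = start_b + int(lane_b[5] * self.n_strips + 0.5) - 1
    end     = min(min(end_a, end_b), self.n_strips)
    # no shared rows: report an infinite distance so that NMS never suppresses this pair
    # (this also avoids dividing by zero or by a negative count below)
    if end < start:
        return float("inf")
    dist = 0
    for i in range(start, end + 1):
        dist += abs((lane_a[i + 6] - lane_b[i + 6]) * (self.img_w - 1))
    dist = dist / float(end - start + 1)
    return dist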

def _decode(self, lanes):
    lanes_points = []
    for lane in lanes:
        start  = int(lane[2] * self.n_strips + 0.5)
        # revise 1.
        end    = start + int(lane[5] * self.n_strips + 0.5) - 1
        end    = min(end, self.n_strips)
        points = []
        for i in range(start, end + 1):
            y = self.anchor_ys[i]
            factor = self.cut_height / self.ori_h
            ys = (1 - factor) * y + factor
            points.append([lane[i + 6], ys])
        points = torch.from_numpy(np.array(points))
        lanes_points.append(points)
    return lanes_points
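The y mapping inside _decode above looks different from the one in predictions_to_pred, but the two are algebraically the same. The following sketch checks this numerically with the CULane values used in this article (cut_height=270, ori_h=590); the two function names are hypothetical.

```python
def decode_y(y, cut_height=270, ori_h=590):
    # Form used in _decode: blend towards the cut line
    factor = cut_height / ori_h
    return (1 - factor) * y + factor

def upstream_y(y, cut_height=270, ori_h=590):
    # Form used in predictions_to_pred
    return (y * (ori_h - cut_height) + cut_height) / ori_h

for y in (0.0, 0.25, 0.5, 1.0):
    assert abs(decode_y(y) - upstream_y(y)) < 1e-12

# y = 0 (top of the cropped image) maps to the cut line of the full frame,
# y = 1 (bottom) stays at the bottom
print(round(decode_y(0.0) * 590, 6))  # 270.0
print(round(decode_y(1.0) * 590, 6))  # 590.0
```

In other words, anchor y values are normalized with respect to the cropped image, and this mapping converts them back to a fraction of the original image height before they are multiplied by image.shape[0].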

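4. CLRNet Inference

With preprocessing and postprocessing analyzed, the whole inference pipeline is clear. CLRNet inference has three parts: image preprocessing, model inference, and postprocessing of the predictions. Preprocessing consists mainly of cut_height and resize; postprocessing consists mainly of NMS and decoding.

The complete inference code is: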

import cv2
import torch
import numpy as np
import onnxruntime as ort

class CLRNet(object):
    def __init__(self, model_path, S=72, cut_height=270, img_w=800, img_h=320, conf_thresh=0.5, nms_thres=50., nms_topk=5) -> None:
        # execution providers are selected with the `providers` argument
        self.predictor   = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.n_strips    = S - 1
        self.n_offsets   = S
        self.cut_height  = cut_height
        self.img_w       = img_w
        self.img_h       = img_h
        self.conf_thresh = conf_thresh
        self.nms_thres   = nms_thres
        self.nms_topk    = nms_topk
        self.anchor_ys   = [1 - i / self.n_strips for i in range(self.n_offsets)]
        self.ori_w       = 1640
        self.ori_h       = 590

    def preprocess(self, img):
        # 0. cut 
        img = img[self.cut_height:, :, :]
        # 1. resize
        img = cv2.resize(img, (self.img_w, self.img_h))
        # 2. normalize
        img = (img / 255.0).astype(np.float32)
        # 3. to bchw
        img = img.transpose(2, 0, 1)[None]
        return img
    
    def forward(self, input):
        # input->1x3x320x800
        output = self.predictor.run(None, {"images": input})[0]
        return output

    def postprocess(self, pred):
        # pred->1x192x78
        # pred = 
        scores = pred[:, :, :2]
        def softmax(x):
            e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
            return e_x / e_x.sum(axis=-1, keepdims=True)        
        scores = softmax(scores)
        pred[:, :, :2] = scores

        lanes = []
        for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
            lane = pred[img_id, lane_id]
            lanes.append(lane.tolist())
        lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
        lanes = self._nms(lanes)
        lanes_points = self._decode(lanes)
        return lanes_points[:self.nms_topk]

    def _nms(self, lanes):
        
        remove_flags = [False] * len(lanes)
        
        keep_lanes = []
        for i, ilane in enumerate(lanes):
            if remove_flags[i]:
                continue
                
            keep_lanes.append(ilane)
            for j in range(i + 1, len(lanes)):
                if remove_flags[j]:
                    continue
                
                jlane = lanes[j]
                if self._lane_iou(ilane, jlane) < self.nms_thres:
                    remove_flags[j] = True
        return keep_lanes
    
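    def _lane_iou(self, lane_a, lane_b):
        # lane = (_, conf, start_y, start_x, theta, length, ...) = 2+2+1+1+72 = 78
        # returns the mean horizontal distance in pixels over the shared rows (smaller = more overlap)
        start_a = int(lane_a[2] * self.n_strips + 0.5)
        start_b = int(lane_b[2] * self.n_strips + 0.5)
        start   = max(max(start_a, start_b), 0)
        end_a   = start_a + int(lane_a[5] * self.n_strips + 0.5) - 1
        end_b   = start_b + int(lane_b[5] * self.n_strips + 0.5) - 1
        end     = min(min(end_a, end_b), self.n_strips)
        # no shared rows: report an infinite distance so that NMS never suppresses this pair
        # (this also avoids dividing by zero or by a negative count below)
        if end < start:
            return float("inf")
        dist = 0
        for i in range(start, end + 1):
            dist += abs((lane_a[i + 6] - lane_b[i + 6]) * (self.img_w - 1))
        dist = dist / float(end - start + 1)
        return dist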

    def _decode(self, lanes):
        lanes_points = []
        for lane in lanes:
            start  = int(lane[2] * self.n_strips + 0.5)
            # revise 1.
            end    = start + int(lane[5] * self.n_strips + 0.5) - 1
            end    = min(end, self.n_strips)
            points = []
            for i in range(start, end + 1):
                y = self.anchor_ys[i]
                factor = self.cut_height / self.ori_h
                ys = (1 - factor) * y + factor
                points.append([lane[i + 6], ys])
            points = torch.from_numpy(np.array(points))
            lanes_points.append(points)
        return lanes_points

if __name__ == "__main__":
    
    image = cv2.imread("data/CULane/driver_37_30frame/05181432_0203.MP4/00210.jpg")
    model_file_path = "clrnet.sim.onnx"
    model   = CLRNet(model_file_path)
    img_pre = model.preprocess(image)
    pred    = model.forward(img_pre)
    lanes_points = model.postprocess(pred)

    for points in lanes_points:
        points[:, 0] *= image.shape[1]
        points[:, 1] *= image.shape[0]
        points = points.numpy().round().astype(int)
        # for curr_p, next_p in zip(points[:-1], points[1:]):
        #     cv2.line(image, tuple(curr_p), tuple(next_p), color=(0, 255, 0), thickness=3)
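        for point in points:
            # pass plain Python ints; some OpenCV builds reject a NumPy array as the center point
            cv2.circle(image, (int(point[0]), int(point[1])), 3, color=(0, 255, 0), thickness=-1)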
    
    cv2.imwrite("result.jpg", image)
    print("save done.")

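Note: inference here runs directly on the ONNX model, which avoids the hassle of compiling the NMS plugin required by the torch model. See the previous article for how to export the ONNX model.

The inference result looks like this:

[Figure: lane points predicted by the ONNX model]

That completes the whole CLRNet inference pipeline in Python. Next we implement it in C++.

II. CLRNet Inference (C++)

For the C++ implementation we again use the tensorRT_Pro repo, and build CLRNet inference on top of it.

1. ONNX Export

For the details of ONNX export, see CLRNet Inference Explained and Deployment (Part 1); they are not repeated here.

2. CLRNet Preprocessing

As mentioned earlier, CLRNet preprocessing is just a cut plus a resize, so in tensorRT_Pro it only takes a small change to the resize CUDA kernel. Also note that when specifying CUDAKernel::Norm, no channel invert should be requested.

The preprocessing code is: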

__global__ void cut_resize_bilinear_and_normalize_kernel(
	uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height,
	float sx, float sy, int cut_height, Norm norm, int edge
){
	int position = blockDim.x * blockIdx.x + threadIdx.x;
	if (position >= edge) return;

	int dx      = position % dst_width;
	int dy      = position / dst_width;
	float src_x = (dx + 0.5f) * sx - 0.5f;
	float src_y = (dy + 0.5f) * sy - 0.5f + cut_height;	  // modify
	float c0, c1, c2;

	int y_low = floorf(src_y);
	int x_low = floorf(src_x);
	int y_high = limit(y_low + 1, 0, src_height - 1);
	int x_high = limit(x_low + 1, 0, src_width - 1);
	y_low = limit(y_low, 0, src_height - 1);
	x_low = limit(x_low, 0, src_width - 1);

	int ly    = rint((src_y - y_low) * INTER_RESIZE_COEF_SCALE);
	int lx    = rint((src_x - x_low) * INTER_RESIZE_COEF_SCALE);
	int hy    = INTER_RESIZE_COEF_SCALE - ly;
	int hx    = INTER_RESIZE_COEF_SCALE - lx;
	int w1    = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
	float* pdst = dst + dy * dst_width + dx * 3;
	uint8_t* v1 = src + y_low * src_line_size + x_low * 3;
	uint8_t* v2 = src + y_low * src_line_size + x_high * 3;
	uint8_t* v3 = src + y_high * src_line_size + x_low * 3;
	uint8_t* v4 = src + y_high * src_line_size + x_high * 3;

	c0 = resize_cast(w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0]);
	c1 = resize_cast(w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1]);
	c2 = resize_cast(w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2]);

	if(norm.channel_type == ChannelType::Invert){
		float t = c2;
		c2 = c0;  c0 = t;
	}

	if(norm.type == NormType::MeanStd){
		c0 = (c0 * norm.alpha - norm.mean[0]) / norm.std[0];
		c1 = (c1 * norm.alpha - norm.mean[1]) / norm.std[1];
		c2 = (c2 * norm.alpha - norm.mean[2]) / norm.std[2];
	}else if(norm.type == NormType::AlphaBeta){
		c0 = c0 * norm.alpha + norm.beta;
		c1 = c1 * norm.alpha + norm.beta;
		c2 = c2 * norm.alpha + norm.beta;
	}

	int area = dst_width * dst_height;
	float* pdst_c0 = dst + dy * dst_width + dx;
	float* pdst_c1 = pdst_c0 + area;
	float* pdst_c2 = pdst_c1 + area;
	*pdst_c0 = c0;
	*pdst_c1 = c1;
	*pdst_c2 = c2;
}

void cut_resize_bilinear_and_normalize(
	uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height, int cut_height,
	const Norm& norm,
	cudaStream_t stream) {

	int jobs   = dst_width * dst_height;
	auto grid  = CUDATools::grid_dims(jobs);
	auto block = CUDATools::block_dims(jobs);

	checkCudaKernel(cut_resize_bilinear_and_normalize_kernel <<< grid, block, 0, stream >>>(
		src, src_line_size,
		src_width, src_height, dst,
		dst_width, dst_height, src_width/(float)dst_width, (src_height - cut_height)/(float)dst_height,
		cut_height, norm, jobs
	));	
}

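Preprocessing simply calls the CUDA kernel above to perform the cut and the resize. Because the kernel handles one output pixel per thread, operations such as BGR->RGB and /255.0 are easy to fold in.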
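The only change relative to a plain bilinear resize is the coordinate mapping: the vertical scale is computed from the cropped height, and cut_height is added to the source row. The Python sketch below mirrors that mapping from the kernel and host function above, using the CULane sizes from this article as default values; source_coords is a hypothetical helper name.

```python
def source_coords(dx, dy, src_w=1640, src_h=590,
                  dst_w=800, dst_h=320, cut_height=270):
    # Scale factors as computed in cut_resize_bilinear_and_normalize
    sx = src_w / dst_w
    sy = (src_h - cut_height) / dst_h
    # Pixel-center mapping as in the kernel; the cut shifts the source row
    src_x = (dx + 0.5) * sx - 0.5
    src_y = (dy + 0.5) * sy - 0.5 + cut_height
    return src_x, src_y

first = source_coords(0, 0)
last = source_coords(799, 319)
print(round(first[0], 3), round(first[1], 3))  # 0.525 270.0
print(round(last[0], 3), round(last[1], 3))    # 1638.475 589.0
```

With these sizes the vertical scale is exactly 1.0, so output row dy reads source row dy + 270 directly and only the horizontal direction is interpolated.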

3. CLRNet Postprocessing

As mentioned earlier, CLRNet postprocessing consists mainly of NMS and decode.

I run NMS on the GPU, reusing the tensorRT_Pro code:

static __device__ float LaneIoU(float* a, float* b, int input_width){
    int start_a = (int)(a[2] * N_STRIPS + 0.5f);
    int start_b = (int)(b[2] * N_STRIPS + 0.5f);
    int start   = max(start_a, start_b);
    int end_a   = start_a + (int)(a[5] + 0.5f) - 1;
    int end_b   = start_b + (int)(b[5] + 0.5f) - 1;
    int end     = min(min(end_a, end_b), N_STRIPS);
    float dist  = 0.0f;
    for(int i = 7 + start; i <= 7 + end; ++i){
        dist += fabsf(a[i] - b[i]);
    }
    return dist * (input_width - 1) / (float)(end - start + 1);
}

static __global__ void nms_kernel(float* lanes, int max_lanes, int input_width, float threshold){
    
    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*lanes, max_lanes);
    if(position >= count)
        return;
    
    float* pcurrent = lanes + 1 + position * NUM_LANE_ELEMENT;
    if(pcurrent[6] == 0) return;

    for(int i = 0; i < count; ++i){
        float* pitem = lanes + 1 + i * NUM_LANE_ELEMENT;
        if(i == position)   continue;

        if(pitem[1] >= pcurrent[1]){
            if(pitem[1] == pcurrent[1] && i < position)
                continue;
            
            float iou = LaneIoU(pcurrent, pitem, input_width);
            if(iou < threshold){
                pcurrent[6] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}

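The NMS launches multiple threads, each handling one lane proposal. For every other proposal whose confidence is greater than or equal to that of the thread's own proposal, the thread computes the IoU (distance) between the two lanes and uses it to decide whether to keep its own lane. Compared with the CPU version of NMS, one loop level is gone: it is replaced by the parallel CUDA threads.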
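To make the per-thread logic concrete, here is a small Python emulation of nms_kernel. It is a sketch under stated assumptions: the pairwise distances are given as a precomputed matrix instead of being computed by LaneIoU, and fast_nms_flags is a hypothetical name. The tie-breaking rule follows the kernel shown above.

```python
import numpy as np

def fast_nms_flags(scores, dist, threshold):
    # Every proposal decides its own keep flag independently,
    # exactly as each CUDA thread does in nms_kernel.
    n = len(scores)
    keep = np.ones(n, dtype=bool)
    for cur in range(n):  # on the GPU this loop is the thread index
        for other in range(n):
            if other == cur:
                continue
            # Only proposals with a higher score can suppress the current one;
            # on equal scores the kernel skips proposals with a smaller index.
            stronger = scores[other] > scores[cur] or (
                scores[other] == scores[cur] and other > cur)
            if stronger and dist[cur, other] < threshold:
                keep[cur] = False
                break
    return keep

scores = np.array([0.9, 0.8, 0.7])
# Lane 1 is close to both lane 0 and lane 2; lanes 0 and 2 are far apart
dist = np.array([[0.0, 10.0, 100.0],
                 [10.0, 0.0, 10.0],
                 [100.0, 10.0, 0.0]])
flags = fast_nms_flags(scores, dist, threshold=50.0)
print(flags.tolist())  # [True, False, False]
```

Note one behavioural difference this exposes: a sequential greedy NMS on the same input would keep lane 2, because lane 1 is already removed by the time lane 2 is examined. In the parallel formulation a lane can be suppressed by a lane that is itself suppressed, so the two NMS methods may return slightly different sets.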

In the decode step, the confidence filtering is done on the GPU:

static __global__ void decode_kernel(float* predict, int num_lanes, float confidence_threshold, float* parray){

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= num_lanes)  return;

    float* pitem = predict + (NUM_LANE_ELEMENT - 1) * position;
    float conf   = pitem[1];
    if(conf < confidence_threshold)
        return;
    
    int index = atomicAdd(parray, 1);
    float conf1   = *pitem++;
    float conf2   = *pitem++;
    float start_y = *pitem++;
    float start_x = *pitem++;
    float theta   = *pitem++;
    float length  = *pitem++;

    float* pout_item = parray + 1 + index * NUM_LANE_ELEMENT;
    *pout_item++ = conf1;
    *pout_item++ = conf2;
    *pout_item++ = start_y;
    *pout_item++ = start_x;
    *pout_item++ = theta;
    *pout_item++ = length;
    *pout_item++ = 1;   // 1 = keep, 0 = ignore

    for(int i = 0; i < N_OFFSETS; ++i){
        float point  = *pitem++;
        *pout_item++ = point;
    }
}

Decoding the points of each proposal, on the other hand, is done on the CPU:

for(auto& lane : image_based_lanes){
    lane.points.reserve(N_OFFSETS / 2);
    int start = (int)(lane.start_y * N_STRIPS + 0.5f);
    int end   = start + (int)(lane.length + 0.5f) - 1;
    end       = min(end, N_STRIPS);
    for(int i = start; i <= end; ++i){
        lane.points.push_back(cv::Point2f(lane.lane_xs[i], anchor_ys_[i]));
    }
}

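4. CLRNet Inference

With preprocessing and postprocessing analyzed, the whole inference pipeline is clear. In C++, the preprocessing requires a small change to the CUDA resize in tensorRT_Pro, and the decode and NMS parts of the postprocessing need small changes as well.

Run the following command in a terminal to perform inference (note: the complete workflow is described later; this is only a quick demonstration):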

make clrnet -j64

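The build output looks like this:

[Figure: build output]

The inference result looks like this:

[Figure: inference result]

That completes the whole CLRNet inference pipeline in C++. Next we walk through the complete workflow.

III. CLRNet Deployment

I created a new repository, tensorRT_Pro-YOLOv8, which is based on shouxieai/tensorRT_Pro and adapted to support the various YOLOv8 tasks; classification, detection, segmentation, and pose estimation are currently supported.

Let's see how to use the tensorRT_Pro-YOLOv8 repo to run inference with the CLRNet model.

1. Source Download

The tensorRT_Pro-YOLOv8 code can be downloaded directly from GitHub at https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8. On Linux, clone it with: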

git clone https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8.git

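Alternatively, download it manually by clicking the Code button in the upper right corner. The project is then ready. You can also click here to download the source code I prepared (note that it was downloaded on 2024/8/11; refer to the latest version if anything has changed).

2. Environment Setup

The required software is TensorRT, CUDA, cuDNN, OpenCV, and Protobuf. For installation of all of them, see the Ubuntu 20.04 software installation guide; it is not repeated here, so please set up the environment yourself 😄. Since access to sites outside China can be slow, here is a download link to the installation packages I used: Baidu Drive [pwd:yolo] 🚀🚀🚀

tensorRT_Pro-YOLOv8 can be built with either CMakeLists.txt or Makefile; choose one of the two.

2.1 Configure CMakeLists.txt

Five changes are needed.

1. Line 13: set the OpenCV path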

set(OpenCV_DIR   "/usr/local/include/opencv4/")

2. Line 15: set the CUDA path

set(CUDA_TOOLKIT_ROOT_DIR     "/usr/local/cuda-11.6")

3. Line 16: set the cuDNN path

set(CUDNN_DIR    "/usr/local/cudnn8.4.0.27-cuda11.6")

4. Line 17: set the tensorRT path (version 8.6 or later is required)

set(TENSORRT_DIR "/home/jarvis/lean/TensorRT-8.6.1.6")

5. Line 20: set the protobuf path

set(PROTOBUF_DIR "/home/jarvis/protobuf")

2.2 Configure Makefile

Five changes are needed.

1. Line 4: set the protobuf path

lean_protobuf  := /home/jarvis/protobuf

2. Line 5: set the tensorRT path (version 8.6 or later is required)

lean_tensor_rt := /home/jarvis/lean/TensorRT-8.6.1.6

3. Line 6: set the cuDNN path

lean_cudnn     := /usr/local/cudnn8.4.0.27-cuda11.6

4. Line 7: set the OpenCV path

lean_opencv    := /usr/local

5. Line 8: set the CUDA path

lean_cuda      := /usr/local/cuda-11.6

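3. ONNX Export

For export details, see CLRNet Inference Explained and Deployment (Part 1); they are not repeated here. Remember to put the exported ONNX model in the tensorRT_Pro-YOLOv8/workspace folder.

4. Engine Generation

Create clrnet_build.sh in workspace with the following content: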

#! /usr/bin/bash

TRTEXEC=/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jarvis/lean/TensorRT-8.6.1.6/lib

${TRTEXEC} \
  --onnx=clrnet.sim.onnx \
  --minShapes=images:1x3x320x800 \
  --optShapes=images:1x3x320x800 \
  --maxShapes=images:8x3x320x800 \
  --memPoolSize=workspace:2048 \
  --saveEngine=clrnet.sim.FP16.trtmodel \
  --fp16 \
  > clrnet.log 2>&1

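Change the TRTEXEC path (and the library path in LD_LIBRARY_PATH) to match your own installation, then run the following in a terminal: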

cd tensorRT_Pro-YOLOv8/workspace
bash clrnet_build.sh

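After a while, clrnet.sim.FP16.trtmodel, the model engine file, is generated in the current folder. No log output appears in the terminal, because the script redirects the tensorRT log to clrnet.log; remove the redirection if you prefer to see the log in the terminal.

Note: I also provide the TRT::compile interface for generating the engine file, but deserialization may then fail with the following problem:

[Figure: deserialization error message]

This is mainly because the onnxparser built into tensorRT_Pro-YOLOv8 is too old to parse the GridSample and LayerNormalization nodes. The onnxparser can be replaced manually; for details see the article RT-DETR Inference Explained and Deployment.

5. Source Modification

To run inference with a model you trained yourself, the source code needs a small change. The CLRNet inference code is mainly in app_clrnet.cpp, and that is the only file that needs to be modified. The change is simple:

app_clrnet.cpp line 236: change "clrnet.sim" to the name of your exported ONNX model

For example: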

test(TRT::Mode::FP16, "clrnet.sim");	// change 1: line 236, replace "clrnet.sim" with your model name, e.g. "best"

6. Run

OK! With the source modified, the Makefile configured, and the engine ready, build and run by executing the following command in a terminal:

make clrnet -j64

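The inference result looks like this:

[Figure: terminal output of the inference run]

After successful inference, the clrnet.sim_CLRNet_FP16_result folder is created; it contains the inferred images.

The model output looks like this:

[Figure: CLRNet lane detection result]

OK, that is the overall workflow for running CLRNet inference with tensorRT_Pro-YOLOv8. If you spot any problems, corrections are welcome.

7. Additional Notes

While running inference on a video stream I hit a bug: with the NMS method set to FastGPU, inference crashes with a Segmentation fault, whereas with the NMS method set to CPU there is no problem. Debugging showed that the GPU NMS itself is not at fault; the problem arises earlier, in the decode part.

The segmentation fault is triggered when inferring the fourth frame of the test video. The code involved is in clrnet.cpp: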

for(int i = 0; i < count; ++i){
    float* plane = parray + 1 + i * (NUM_LANE_ELEMENT + 1);
    int keepflag = plane[6];
    if(keepflag == 0)
        continue;

    Lane lane;
    lane.unknow  = plane[0];
    lane.score   = plane[1];
    lane.start_y = plane[2];
    lane.start_x = plane[3];
    lane.theta   = plane[4];
    lane.length  = plane[5];
    for(int i = 0; i < N_OFFSETS; ++i){
        lane.lane_xs[i] = plane[i + 7];
    }
    image_based_lanes.push_back(lane);
}
// sort
std::sort(image_based_lanes.begin(), image_based_lanes.end(), [](LaneArray::const_reference a, LaneArray::const_reference b){
    return a.score > b.score;
});
if(nms_method_ == NMSMethod::CPU){
    image_based_lanes = cpu_nms(image_based_lanes, nms_threshold_, nms_topk_, input_width_);
}else if(nms_method_ == NMSMethod::FastGPU){
    if(image_based_lanes.size() > nms_topk_){
        image_based_lanes.resize(nms_topk_);
    }
}
for(auto& lane : image_based_lanes){
    lane.points.reserve(N_OFFSETS / 2);
    int start = (int)(lane.start_y * N_STRIPS + 0.5f);
    int end   = start + (int)(lane.length + 0.5f) - 1;
    end       = min(end, N_STRIPS);
    for(int i = start; i <= end; ++i){
        lane.points.push_back(cv::Point2f(lane.lane_xs[i], anchor_ys_[i]));
    }
}

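The segmentation fault occurs at lane.points.push_back(cv::Point2f(lane.lane_xs[i], anchor_ys_[i]));. Debugging shows that one lane struct contains only nan values.

Further debugging revealed something very odd. The fourth frame went through inference normally; parray is the output data, and 9 lanes remained after confidence filtering. While converting the output data into lane structs, the data of the 7th lane turned out to be corrupt, whereas the lanes before and after it were fine:

[Figure: debugger view of the lane data, with the 7th lane containing nan]

I also checked the decode code carefully and found nothing wrong. The bug was unresolved when this section was written (see the 2024/8/15 update note at the top); some part of the code is probably at fault, and anyone interested is welcome to take a look.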
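Independently of the root cause, the crash site itself can be made robust: a nan start_y or length produces a meaningless index range, and an unclamped start can fall outside the 72 anchor rows. The sketch below shows one defensive check, written in Python for brevity. It is an assumption about a possible guard, not the fix applied in the repo, and safe_point_range is a hypothetical name; length is taken as already expressed in rows, as in the C++ decode loop above.

```python
import math

def safe_point_range(start_y, length, n_strips=71):
    # Reject nan / inf before any conversion to an integer index
    if not (math.isfinite(start_y) and math.isfinite(length)):
        return None
    start = int(start_y * n_strips + 0.5)
    end = start + int(length + 0.5) - 1
    # Clamp both ends to the valid anchor range [0, n_strips]
    start = max(start, 0)
    end = min(end, n_strips)
    if end < start:
        return None
    return start, end

print(safe_point_range(float("nan"), 10.0))  # None
print(safe_point_range(0.0, 10.0))           # (0, 9)
print(safe_point_range(0.9, 30.0))           # (64, 71)
```

A lane for which the check returns None would simply be skipped instead of being indexed, which turns a crash into a dropped detection and makes the corrupt entry easier to log and investigate.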

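Conclusion

This article gave a brief analysis of CLRNet's preprocessing and postprocessing and shared the C++ implementation workflow, with the aim of helping you sort out the ideas and carry out the subsequent deployment work more smoothly 😄. Thanks for reading to the end; writing takes effort, so if you found this useful, please leave a 👍⭐️

CLRNet, a SOTA lane detection approach from CVPR2022, is well worth studying. It currently ranks 6th on the CULane dataset, which is still a very good result 🤗

Finally, if the tensorRT_Pro-YOLOv8 repo has been helpful to you, please consider giving it a ⭐️. It means a lot to me. Thank you 🙏.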