CLRerNet Inference Explained and Deployment (Part 2)

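Original articles on this blog may not be used for commercial purposes without my permission. Please credit the source when reposting; otherwise I reserve the right to pursue legal action.

Article link: https://blog.csdn.net/qq_40672115/article/details/141275949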

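Preface

In CLRerNet Inference Explained and Deployment (Part 1) we covered how to export the CLRerNet ONNX model. This article looks at how to run inference with it on tensorRT.

Note: before starting, follow Part 1 to set up the environment and export the CLRerNet ONNX model; those steps are not repeated here.

repo: https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8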


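I. CLRerNet Inference (Python)

1. CLRerNet Prediction

Let's first run the official pretrained weights on a single image and save the result, to confirm that everything works.

Run the following commands: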

cd CLRerNet
conda activate clrernet
python demo/image_demo.py demo/demo.jpg configs/clrernet/culane/clrernet_culane_dla34.py clrernet_culane_dla34.pth --out-file=result.jpg

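Note: the code and weights can be downloaded here. Running the script requires the NMS plugin to be compiled; see the previous article for the environment setup, which is not repeated here.

Running the script writes the inference result to result.jpg in the current directory:

[Image: inference result, result.jpg]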

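2. CLRerNet Preprocessing

With inference working, we now walk through CLRerNet's preprocessing and postprocessing so that they can later be reimplemented in C++. We start with preprocessing.

From our debugging (details omitted 😄), the preprocessing is laid out very clearly in tools/speed_test.py; see speed_test.py#L63: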

img = cv2.imread(args.filename)
cuty = cfg.crop_bbox[1] if "crop_bbox" in cfg else 0
img = img[cuty:, ...]
img = cv2.resize(img, cfg.img_scale)
mean = np.array(cfg.img_norm_cfg.mean)
std = np.array(cfg.img_norm_cfg.std)
img = mmcv.imnormalize(img, mean, std, False)
img = torch.unsqueeze(torch.tensor(img).permute(2, 0, 1), 0).cuda()

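It consists of the following steps:

img[cuty:, ...]: crop along the height, removing the sky region at the top of the image
cv2.resize: resize the image to 800x320 (width x height)
mmcv.imnormalize: subtract the mean and divide by the standard deviation; with mean 0 and std 255 this is simply a division by 255
torch.unsqueeze: ToTensor and dimension reordering, HWC -> BCHW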
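To make the normalization step concrete, here is a minimal pure-Python sketch. The helper name `imnormalize_pixel` is hypothetical and only mimics the per-pixel arithmetic of a mean/std normalization; it is not the mmcv implementation.

```python
def imnormalize_pixel(value, mean, std):
    """Per-pixel form of mean/std normalization: (value - mean) / std."""
    return (value - mean) / std


# With mean = 0 and std = 255, normalization reduces to a plain division by 255.
samples = [0, 51, 128, 255]
normalized = [imnormalize_pixel(v, mean=0.0, std=255.0) for v in samples]
scaled = [v / 255.0 for v in samples]
print(normalized == scaled)
```

This is why the hand-written preprocessing can replace the mean/std call with `img / 255.0`.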

This is identical to the CLRNet preprocessing covered in our earlier article, so the corresponding preprocessing code is easy to write:

def preprocess(self, img):
    # 0. cut 
    img = img[self.cut_height:, :, :]
    # 1. resize
    img = cv2.resize(img, (self.img_w, self.img_h))
    # 2. normalize
    img = (img / 255.0).astype(np.float32)
    # 3. to bchw
    img = img.transpose(2, 0, 1)[None]
    return img

Note: the resize target is width=800, height=320, and there is no BGR->RGB conversion, so the CLRerNet model input is 1x3x320x800.
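As a quick sanity check of the shapes involved, the sketch below traces the array shape through each preprocessing stage. It assumes a 1640x590 source image (the ori_w and ori_h values used later in this article) and cut_height=270; `preprocess_shapes` is a hypothetical helper that only does shape arithmetic and touches no image data.

```python
def preprocess_shapes(src_h, src_w, cut_height=270, img_w=800, img_h=320):
    """Return the (cropped, resized, tensor) shapes produced by preprocessing."""
    cropped = (src_h - cut_height, src_w, 3)   # img[cut_height:, :, :]
    resized = (img_h, img_w, 3)                # cv2.resize(img, (img_w, img_h))
    tensor = (1, 3, img_h, img_w)              # transpose(2, 0, 1)[None]
    return cropped, resized, tensor


cropped, resized, tensor = preprocess_shapes(src_h=590, src_w=1640)
print(cropped, resized, tensor)
```

Under these assumptions the crop alone already brings the height to 320 rows, so the resize only has to shrink the width.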

3. CLRerNet Postprocessing

Next, the postprocessing.

From our debugging (details omitted 😄), the postprocessing lives in libs/models/dense_heads/clrernet_head.py; see clrernet_head.py#L397:

def get_lanes(self, pred_dict, as_lanes=True):
    """
    Convert model output to lane instances.
    Args:
        pred_dict (dict): prediction dict containing multiple lanes.
            cls_logits (torch.Tensor): 2-class logits with shape (B, Np, 2).
            anchor_params (torch.Tensor): anchor parameters with shape (B, Np, 3).
            lengths (torch.Tensor): lane lengths in row numbers with shape (B, Np, 1).
            xs (torch.Tensor): x coordinates of the lane points with shape (B, Np, Nr).
        as_lanes (bool): transform to the Lane instance for interpolation.
    Returns:
        pred (List[torch.Tensor]): List of lane tensors (shape: (N, 2))
            or `Lane` objects, where N is the number of rows.
        scores (torch.Tensor): Confidence scores of the lanes.

    B: batch size, Np: num_priors, Nr: num_points (rows).
    """
    softmax = nn.Softmax(dim=1)
    assert (
        len(pred_dict["cls_logits"]) == 1
    ), "Only single-image prediction is available!"
    # filter out the conf lower than conf threshold
    threshold = self.test_cfg.conf_threshold
    scores = softmax(pred_dict["cls_logits"][0])[:, 1]
    keep_inds = scores >= threshold
    scores = scores[keep_inds]
    xs = pred_dict["xs"][0, keep_inds]
    lengths = pred_dict["lengths"][0, keep_inds]
    anchor_params = pred_dict["anchor_params"][0, keep_inds]
    if xs.shape[0] == 0:
        return [], []

    if self.test_cfg.use_nms:
        nms_anchor_params = anchor_params[..., :2].detach().clone()
        nms_anchor_params[..., 0] = 1 - nms_anchor_params[..., 0]
        nms_predictions = torch.cat(
            [
                pred_dict["cls_logits"][0, keep_inds].detach().clone(),
                nms_anchor_params[..., :2],
                lengths.detach().clone() * self.n_strips,
                xs.detach().clone() * (self.img_w - 1),
            ],
            dim=-1,
        )  # [N, 77]
        keep, num_to_keep, _ = nms(
            nms_predictions,
            scores,
            overlap=self.test_cfg.nms_thres,
            top_k=self.test_cfg.nms_topk,
        )
        keep = keep[:num_to_keep]
        xs = xs[keep]
        scores = scores[keep]
        lengths = lengths[keep]
        anchor_params = anchor_params[keep]

    lengths = torch.round(lengths * self.n_strips)
    pred = self.predictions_to_lanes(xs, anchor_params, lengths, scores, as_lanes)

    return pred, scores

def predictions_to_lanes(
    self, pred_xs, anchor_params, lengths, scores, as_lanes=True, extend_bottom=True
):
    """
    Convert predictions to the lane segment instances.
    Args:
        pred_xs (torch.Tensor): x coordinates of the lane points with shape (Nl, Nr).
        anchor_params (torch.Tensor): anchor parameters with shape (Nl, 3).
        lengths (torch.Tensor): lane lengths in row numbers with shape (Nl, 1).
        scores (torch.Tensor): confidence scores with shape (Nl,).
        as_lanes (bool): transform to the Lane instance for interpolation.
        extend_bottom (bool): if the prediction does not start at the bottom of the image,
            extend its prediction until the x is outside the image.
    Returns:
        lanes (List[torch.Tensor]): List of lane tensors (shape: (N, 2))
            or `Lane` objects, where N is the number of rows.

    B: batch size, Nl: number of lanes after NMS, Nr: num_points (rows).
    """
    prior_ys = self.prior_ys.to(pred_xs.device).double()
    lanes = []
    for lane_xs, lane_param, length, score in zip(
        pred_xs, anchor_params, lengths, scores
    ):
        start = min(
            max(0, int(round((1 - lane_param[0].item()) * self.n_strips))),
            self.n_strips,
        )
        length = int(round(length.item()))
        end = start + length - 1
        end = min(end, len(prior_ys) - 1)
        if extend_bottom:
            edge = (lane_xs[:start] >= 0.0) & (lane_xs[:start] <= 1.0)
            start -= edge.flip(0).cumprod(dim=0).sum()
        lane_ys = prior_ys[start : end + 1]
        lane_xs = lane_xs[start : end + 1]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)

        lane_ys = (
            lane_ys * (self.test_cfg.ori_img_h - self.test_cfg.cut_height)
            + self.test_cfg.cut_height
        ) / self.test_cfg.ori_img_h
        if len(lane_xs) <= 1:
            continue
        points = torch.stack(
            (lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)), dim=1
        ).squeeze(2)
        if as_lanes:
            lane = Lane(
                points=points.cpu().numpy(),
                metadata={
                    "start_x": lane_param[1],
                    "start_y": lane_param[0],
                    "conf": score,
                },
            )
        else:
            lane = points
        lanes.append(lane)
    return lanes

It consists of the following steps:

nms: non-maximum suppression
predictions_to_lanes: decoding the lane proposals

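Before analyzing the postprocessing code, let's look at what the model output means. The CLRerNet model output is 1x192x78, where:

1: the batch dimension
192: the number of lane proposals predicted per image
78: the length of each proposal's feature vector, made up of the following five parts
- cls (2 dims): classification probabilities, for background and lane respectively
- start_y, start_x (2 dims): coordinates of the lane's start point
- theta (1 dim): angle between the lane prior and the x-axis
- length (1 dim): length of the lane, counted in rows
- x_offset (72 dims): the lane's x coordinate at each of the 72 anchor rows, normalized by the image width (the code multiplies by img_w - 1 to convert to pixels)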
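The layout above can be sketched as a small helper that splits one 78-element row into named fields. `LaneProposal` and `split_proposal` are hypothetical names introduced for illustration; the field order follows the list above.

```python
from typing import List, NamedTuple


class LaneProposal(NamedTuple):
    bg_score: float    # cls[0]: background probability
    lane_score: float  # cls[1]: lane probability
    start_y: float
    start_x: float
    theta: float
    length: float
    xs: List[float]    # 72 x values, one per anchor row


def split_proposal(row):
    """Split one 78-element proposal (2 + 2 + 1 + 1 + 72) into named fields."""
    if len(row) != 78:
        raise ValueError(f"expected 78 values, got {len(row)}")
    return LaneProposal(*row[:6], xs=list(row[6:]))


proposal = split_proposal([float(i) for i in range(78)])
print(proposal.lane_score, proposal.length, len(proposal.xs))
```

The index offsets used later in the postprocessing code (1 for the score, 2 for start_y, 5 for length, 6 onward for the x values) correspond directly to these fields.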

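OK, with each output dimension sorted out, the postprocessing code becomes much easier to read.

NMS itself is familiar by now. What is worth noting is that the IoU used in lane NMS differs from the usual bounding-box IoU, because it is tailored to lanes. The official IoU code is as follows: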

template <typename scalar_t>
// __device__ inline scalar_t devIoU(scalar_t const * const a, scalar_t const * const b) {
__device__ inline bool devIoU(scalar_t const * const a, scalar_t const * const b, const float threshold) {
  const int start_a = (int) (a[2] * N_STRIPS - DATASET_OFFSET + 0.5); // 0.5 rounding trick
  const int start_b = (int) (b[2] * N_STRIPS - DATASET_OFFSET + 0.5);
  const int start = max(start_a, start_b);
  const int end_a = start_a + a[4] - 1 + 0.5 - ((a[4] - 1) < 0); //  - (x<0) trick to adjust for negative numbers (in case length is 0)
  const int end_b = start_b + b[4] - 1 + 0.5 - ((b[4] - 1) < 0);
  const int end = min(min(end_a, end_b), N_OFFSETS - 1);
  // if (end < start) return 1e9;
  if (end < start) return false;
  scalar_t dist = 0;
  for(unsigned char i = 5 + start; i <= 5 + end; ++i) {
    if (a[i] < b[i]) {
      dist += b[i] - a[i];
    } else {
      dist += a[i] - b[i];
    }
  }
  // return (dist / (end - start + 1)) < threshold;
  return dist < (threshold * (end - start + 1));
  // return dist / (end - start + 1);
}

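The function takes two lane proposals and a threshold. Despite its name it does not compute an area ratio: it measures the horizontal distance between the two proposals over the rows they share, and returns true (overlapping) when the mean distance is below the threshold. This suits lane detection. The steps are:

1. Compute the start and end row indices of the two proposals
2. Check whether the two proposals share any rows
3. Sum the horizontal differences between the two proposals over the shared rows
4. Compare the sum with the threshold scaled by the number of shared rows to decide whether the two proposals overlap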
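The four steps can be sketched in plain Python as a port of the kernel above. The constants are assumptions made for this example (N_OFFSETS = 72, DATASET_OFFSET = 0), and `make_lane` is a hypothetical helper that builds a 77-element row in the kernel's layout with a constant x value.

```python
N_OFFSETS = 72            # assumption: 72 rows per lane
N_STRIPS = N_OFFSETS - 1
DATASET_OFFSET = 0        # assumption: no dataset-specific row offset


def lanes_overlap(a, b, threshold):
    """Port of devIoU; layout [cls0, cls1, start_y, start_x, length, x_0 .. x_71]."""
    # 1. start and end row indices (int() truncates like the C cast)
    start_a = int(a[2] * N_STRIPS - DATASET_OFFSET + 0.5)
    start_b = int(b[2] * N_STRIPS - DATASET_OFFSET + 0.5)
    start = max(start_a, start_b)
    end_a = int(start_a + a[4] - 1 + 0.5 - ((a[4] - 1) < 0))
    end_b = int(start_b + b[4] - 1 + 0.5 - ((b[4] - 1) < 0))
    end = min(end_a, end_b, N_OFFSETS - 1)
    # 2. no shared rows means no overlap
    if end < start:
        return False
    # 3. summed horizontal distance over the shared rows
    dist = sum(abs(a[i] - b[i]) for i in range(5 + start, 5 + end + 1))
    # 4. compare against the threshold scaled by the number of shared rows
    return dist < threshold * (end - start + 1)


def make_lane(start_y, length, x):
    """Hypothetical helper: a lane whose x value is constant on every row."""
    return [0.0, 1.0, start_y, 0.0, float(length)] + [float(x)] * N_OFFSETS


near = lanes_overlap(make_lane(0.0, 72, 100.0), make_lane(0.0, 72, 120.0), 50.0)
far = lanes_overlap(make_lane(0.0, 72, 100.0), make_lane(0.0, 72, 400.0), 50.0)
disjoint = lanes_overlap(make_lane(0.0, 10, 100.0), make_lane(0.5, 10, 100.0), 50.0)
print(near, far, disjoint)
```

Two lanes 20 pixels apart count as overlapping at a threshold of 50, two lanes 300 pixels apart do not, and lanes that share no rows are never treated as overlapping.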

After NMS, the remaining lane proposals still need to be decoded into actual lane coordinates. The code is as follows:

def predictions_to_lanes(
    self, pred_xs, anchor_params, lengths, scores, as_lanes=True, extend_bottom=True
):
    """
    Convert predictions to the lane segment instances.
    Args:
        pred_xs (torch.Tensor): x coordinates of the lane points with shape (Nl, Nr).
        anchor_params (torch.Tensor): anchor parameters with shape (Nl, 3).
        lengths (torch.Tensor): lane lengths in row numbers with shape (Nl, 1).
        scores (torch.Tensor): confidence scores with shape (Nl,).
        as_lanes (bool): transform to the Lane instance for interpolation.
        extend_bottom (bool): if the prediction does not start at the bottom of the image,
            extend its prediction until the x is outside the image.
    Returns:
        lanes (List[torch.Tensor]): List of lane tensors (shape: (N, 2))
            or `Lane` objects, where N is the number of rows.

    B: batch size, Nl: number of lanes after NMS, Nr: num_points (rows).
    """
    prior_ys = self.prior_ys.to(pred_xs.device).double()
    lanes = []
    for lane_xs, lane_param, length, score in zip(
        pred_xs, anchor_params, lengths, scores
    ):
        start = min(
            max(0, int(round((1 - lane_param[0].item()) * self.n_strips))),
            self.n_strips,
        )
        length = int(round(length.item()))
        end = start + length - 1
        end = min(end, len(prior_ys) - 1)
        if extend_bottom:
            edge = (lane_xs[:start] >= 0.0) & (lane_xs[:start] <= 1.0)
            start -= edge.flip(0).cumprod(dim=0).sum()
        lane_ys = prior_ys[start : end + 1]
        lane_xs = lane_xs[start : end + 1]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)

        lane_ys = (
            lane_ys * (self.test_cfg.ori_img_h - self.test_cfg.cut_height)
            + self.test_cfg.cut_height
        ) / self.test_cfg.ori_img_h
        if len(lane_xs) <= 1:
            continue
        points = torch.stack(
            (lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)), dim=1
        ).squeeze(2)
        if as_lanes:
            lane = Lane(
                points=points.cpu().numpy(),
                metadata={
                    "start_x": lane_param[1],
                    "start_y": lane_param[0],
                    "conf": score,
                },
            )
        else:
            lane = points
        lanes.append(lane)
    return lanes

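It mainly consists of the following steps: (from ChatGPT)

1. Prepare the anchor y coordinates
- prior_ys are fixed, evenly spaced values. CLRerNet uses the prior that most lanes start at the bottom of the image and extend upward, so the distribution of y coordinates is essentially fixed
2. Process each proposal
- Compute the start point and length:
  - start: index of the lane's start point among the anchor rows
  - length: length of the lane
  - end: index of the lane's end point among the anchor rows, clamped to the anchor range
- Take the y coordinates: lane_ys = prior_ys[start : end + 1]
- Take the x coordinates: lane_xs = lane_xs[start : end + 1]
- Flip the lane points, reversing their order along the lane
- Map lane_ys from the cropped image back to the original image using cut_height and ori_img_h
- Build the lane object: wrap the points in a Lane object together with some metadata (start point and confidence)
3. Return the list of lanes: return all lane objects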
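As a worked example of the index and y-coordinate arithmetic in these steps, here is a minimal pure-Python sketch. `decode_rows` is a hypothetical helper; it assumes 72 anchor rows, cut_height=270 and an original image height of 590, takes the raw start_y as the model outputs it (so the start index uses 1 - start_y), and leaves out the extend_bottom logic and the flip.

```python
def decode_rows(raw_start_y, length, n_offsets=72, cut_height=270, ori_h=590):
    """Return (row_index, y) pairs, with y normalized to the original image height."""
    n_strips = n_offsets - 1
    anchor_ys = [1 - i / n_strips for i in range(n_offsets)]  # 1.0 (bottom) .. 0.0 (top)
    # clamp the start index to the anchor range, as the official code does
    start = min(max(0, int((1 - raw_start_y) * n_strips + 0.5)), n_strips)
    end = min(start + int(length + 0.5) - 1, n_strips)
    factor = cut_height / ori_h
    # map y from the cropped image back to the original image
    return [(i, (1 - factor) * anchor_ys[i] + factor) for i in range(start, end + 1)]


rows = decode_rows(raw_start_y=1.0, length=72.0)
print(len(rows), rows[0], rows[-1])
```

A lane that starts at the bottom row and spans all 72 rows runs from y = 1.0 (bottom of the original image) up to y = cut_height / ori_h, which is the top edge of the cropped region.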

With this analysis, the corresponding postprocessing code is straightforward to write:

def postprocess(self, pred):
    # pred->1x192x78

    lanes = []
    for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
        lane = pred[img_id, lane_id]
        lanes.append(lane.tolist())
        print(f"score = {lane[1]:.2f}, start_y = {lane[2]:.2f}, start_x = {lane[3]:.2f} length = {lane[5]:.2f}")
    lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
    lanes = self._nms(lanes)
    lanes_points = self._decode(lanes)
    return lanes_points[:self.nms_topk]

def _nms(self, lanes):
    
    remove_flags = [False] * len(lanes)
    
    keep_lanes = []
    for i, ilane in enumerate(lanes):
        if remove_flags[i]:
            continue
            
        keep_lanes.append(ilane)
        for j in range(i + 1, len(lanes)):
            if remove_flags[j]:
                continue
            
            jlane = lanes[j]
            if self._lane_iou(ilane, jlane) < self.nms_thres:
                remove_flags[j] = True
    return keep_lanes

def _lane_iou(self, lane_a, lane_b):
    # lane = (_, conf, start_y, start_x, theta, length, ...) = 2+2+1+1+72 = 78
    start_a = int((1 - lane_a[2]) * self.n_strips + 0.5)
    start_b = int((1 - lane_b[2]) * self.n_strips + 0.5)
    start   = max(start_a, start_b)
    end_a   = start_a + int(lane_a[5] + 0.5) - 1
    end_b   = start_b + int(lane_b[5] + 0.5) - 1
    
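    end     = min(min(end_a, end_b), self.n_strips)
    # no shared rows: return an infinite distance so this pair is never suppressed
    # (same intent as `if (end < start) return false;` in the official kernel)
    if end < start:
        return float("inf")
    dist = 0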
    for i in range(start, end + 1):
        dist += abs((lane_a[i + 6] - lane_b[i + 6]) * (self.img_w - 1))
    dist = dist / float(end - start + 1)
    return dist

def _decode(self, lanes):
    lanes_points = []
    for lane in lanes:
        # ===diff===
        start  = int((1 - lane[2]) * self.n_strips + 0.5)
        end    = start + int(lane[5] + 0.5) - 1
        end    = min(end, self.n_strips)
        points = []
        for i in range(start, end + 1):
            y = self.anchor_ys[i]
            factor = self.cut_height / self.ori_h
            ys = (1 - factor) * y + factor
            points.append([lane[i + 6], ys])
        points = torch.from_numpy(np.array(points))
        lanes_points.append(points)
    return lanes_points

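Note that the start index is computed from 1 - lane[2]: the model does not output the start_y coordinate directly, and 1 - start_y is the actual start coordinate. This was pointed out specifically in the previous article.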

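4. CLRerNet Inference

With the preprocessing and postprocessing analyzed, the whole inference pipeline follows directly. CLRerNet inference consists of image preprocessing, model inference and postprocessing of the predictions. Preprocessing mainly consists of the cut_height crop and the resize; postprocessing mainly consists of NMS and decode.

The complete inference code is as follows: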

import cv2
import torch
import numpy as np
import onnxruntime as ort

class CLRNet(object):
    def __init__(self, model_path, S=72, cut_height=270, img_w=800, img_h=320, conf_thresh=0.4, nms_thres=50., nms_topk=4) -> None:
        self.predictor   = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.n_strips    = S - 1
        self.n_offsets   = S
        self.cut_height  = cut_height
        self.img_w       = img_w
        self.img_h       = img_h
        self.conf_thresh = conf_thresh
        self.nms_thres   = nms_thres
        self.nms_topk    = nms_topk
        self.anchor_ys   = [1 - i / self.n_strips for i in range(self.n_offsets)]
        self.ori_w       = 1640
        self.ori_h       = 590

    def preprocess(self, img):
        # 0. cut 
        img = img[self.cut_height:, :, :]
        # 1. resize
        img = cv2.resize(img, (self.img_w, self.img_h))
        # 2. normalize
        img = (img / 255.0).astype(np.float32)
        # 3. to bchw
        img = img.transpose(2, 0, 1)[None]
        return img
    
    def forward(self, input):
        # input->1x3x320x800
        output = self.predictor.run(None, {"images": input})[0]
        return output

    def postprocess(self, pred):
        # pred->1x192x78
        lanes = []
        for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
            lane = pred[img_id, lane_id]
            lanes.append(lane.tolist())
        lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
        lanes = self._nms(lanes)
        lanes_points = self._decode(lanes)
        return lanes_points[:self.nms_topk]

    def _nms(self, lanes):
        
        remove_flags = [False] * len(lanes)
        
        keep_lanes = []
        for i, ilane in enumerate(lanes):
            if remove_flags[i]:
                continue
                
            keep_lanes.append(ilane)
            for j in range(i + 1, len(lanes)):
                if remove_flags[j]:
                    continue
                
                jlane = lanes[j]
                if self._lane_iou(ilane, jlane) < self.nms_thres:
                    remove_flags[j] = True
        return keep_lanes
    
    def _lane_iou(self, lane_a, lane_b):
        # lane = (_, conf, start_y, start_x, theta, length, ...) = 2+2+1+1+72 = 78
        start_a = int((1 - lane_a[2]) * self.n_strips + 0.5)
        start_b = int((1 - lane_b[2]) * self.n_strips + 0.5)
        start   = max(start_a, start_b)
        end_a   = start_a + int(lane_a[5] + 0.5) - 1
        end_b   = start_b + int(lane_b[5] + 0.5) - 1
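        end     = min(min(end_a, end_b), self.n_strips)
        # no shared rows: return an infinite distance so this pair is never suppressed
        # (same intent as `if (end < start) return false;` in the official kernel)
        if end < start:
            return float("inf")
        dist = 0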
        for i in range(start, end + 1):
            dist += abs((lane_a[i + 6] - lane_b[i + 6]) * (self.img_w - 1))
        dist = dist / float(end - start + 1)
        return dist

    def _decode(self, lanes):
        lanes_points = []
        for lane in lanes:
            # ===diff===
            start  = int((1 - lane[2]) * self.n_strips + 0.5)
            end    = start + int(lane[5] + 0.5) - 1
            end    = min(end, self.n_strips)
            points = []
            for i in range(start, end + 1):
                y = self.anchor_ys[i]
                factor = self.cut_height / self.ori_h
                ys = (1 - factor) * y + factor
                points.append([lane[i + 6], ys])
            points = torch.from_numpy(np.array(points))
            lanes_points.append(points)
        return lanes_points

if __name__ == "__main__":
    
    image = cv2.imread("demo/demo.jpg")
    model_file_path = "clrernet.sim.onnx"
    model   = CLRNet(model_file_path)
    img_pre = model.preprocess(image)
    pred    = model.forward(img_pre)
    lanes_points = model.postprocess(pred)

    for points in lanes_points:
        points[:, 0] *= image.shape[1]
        points[:, 1] *= image.shape[0]
        points = points.numpy().round().astype(int)
        # for curr_p, next_p in zip(points[:-1], points[1:]):
        #     cv2.line(image, tuple(curr_p), tuple(next_p), color=(0, 255, 0), thickness=3)
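        for point in points:
            # pass plain Python ints: some OpenCV builds reject NumPy values as a point
            cv2.circle(image, (int(point[0]), int(point[1])), 3, color=(0, 255, 0), thickness=-1)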
    
    cv2.imwrite("result.jpg", image)
    print("save done.")

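Note: inference here runs directly on the ONNX model, which avoids the hassle of compiling the NMS plugin that the torch model requires. See the previous article for how to export the ONNX model.

The inference result is shown below:

[Image: CLRerNet inference result from the Python ONNX pipeline]

That completes the whole CLRerNet inference pipeline in Python. Next, we implement it in C++.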

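II. CLRerNet Inference (C++)

For the C++ implementation we again use the tensorRT_Pro repo, and build CLRerNet inference on top of it.

1. ONNX Export

For the details of the ONNX export, see CLRerNet Inference Explained and Deployment (Part 1); they are not repeated here.

2. CLRerNet Preprocessing

As mentioned earlier, CLRerNet's preprocessing is just a cut followed by a resize, so in tensorRT_Pro it only takes a small change to the resize CUDA kernel. Also note that CUDAKernel::Norm must be configured without the channel invert operation.

The preprocessing code is as follows: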

__global__ void cut_resize_bilinear_and_normalize_kernel(
	uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height,
	float sx, float sy, int cut_height, Norm norm, int edge
){
	int position = blockDim.x * blockIdx.x + threadIdx.x;
	if (position >= edge) return;

	int dx      = position % dst_width;
	int dy      = position / dst_width;
	float src_x = (dx + 0.5f) * sx - 0.5f;
	float src_y = (dy + 0.5f) * sy - 0.5f + cut_height;	  // modify
	float c0, c1, c2;

	int y_low = floorf(src_y);
	int x_low = floorf(src_x);
	int y_high = limit(y_low + 1, 0, src_height - 1);
	int x_high = limit(x_low + 1, 0, src_width - 1);
	y_low = limit(y_low, 0, src_height - 1);
	x_low = limit(x_low, 0, src_width - 1);

	int ly    = rint((src_y - y_low) * INTER_RESIZE_COEF_SCALE);
	int lx    = rint((src_x - x_low) * INTER_RESIZE_COEF_SCALE);
	int hy    = INTER_RESIZE_COEF_SCALE - ly;
	int hx    = INTER_RESIZE_COEF_SCALE - lx;
	int w1    = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
	float* pdst = dst + dy * dst_width + dx * 3;
	uint8_t* v1 = src + y_low * src_line_size + x_low * 3;
	uint8_t* v2 = src + y_low * src_line_size + x_high * 3;
	uint8_t* v3 = src + y_high * src_line_size + x_low * 3;
	uint8_t* v4 = src + y_high * src_line_size + x_high * 3;

	c0 = resize_cast(w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0]);
	c1 = resize_cast(w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1]);
	c2 = resize_cast(w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2]);

	if(norm.channel_type == ChannelType::Invert){
		float t = c2;
		c2 = c0;  c0 = t;
	}

	if(norm.type == NormType::MeanStd){
		c0 = (c0 * norm.alpha - norm.mean[0]) / norm.std[0];
		c1 = (c1 * norm.alpha - norm.mean[1]) / norm.std[1];
		c2 = (c2 * norm.alpha - norm.mean[2]) / norm.std[2];
	}else if(norm.type == NormType::AlphaBeta){
		c0 = c0 * norm.alpha + norm.beta;
		c1 = c1 * norm.alpha + norm.beta;
		c2 = c2 * norm.alpha + norm.beta;
	}

	int area = dst_width * dst_height;
	float* pdst_c0 = dst + dy * dst_width + dx;
	float* pdst_c1 = pdst_c0 + area;
	float* pdst_c2 = pdst_c1 + area;
	*pdst_c0 = c0;
	*pdst_c1 = c1;
	*pdst_c2 = c2;
}

void cut_resize_bilinear_and_normalize(
	uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height, int cut_height,
	const Norm& norm,
	cudaStream_t stream) {

	int jobs   = dst_width * dst_height;
	auto grid  = CUDATools::grid_dims(jobs);
	auto block = CUDATools::block_dims(jobs);

	checkCudaKernel(cut_resize_bilinear_and_normalize_kernel <<< grid, block, 0, stream >>>(
		src, src_line_size,
		src_width, src_height, dst,
		dst_width, dst_height, src_width/(float)dst_width, (src_height - cut_height)/(float)dst_height,
		cut_height, norm, jobs
	));	
}

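The preprocessing simply calls the CUDA kernel above to perform the cut and resize. Because the kernel works on each output pixel individually, per-pixel operations such as BGR->RGB or /255.0 are very easy to fold in.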
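The coordinate mapping inside the kernel can be sketched in a few lines of Python. `source_coords` is a hypothetical helper that mirrors how the kernel computes sx, sy, src_x and src_y; the 1640x590 source size and cut_height=270 are the values assumed throughout this article.

```python
def source_coords(dx, dy, src_w, src_h, dst_w, dst_h, cut_height):
    """Map an output pixel (dx, dy) to its sampling position in the source image."""
    sx = src_w / dst_w                    # horizontal scale, full width is used
    sy = (src_h - cut_height) / dst_h     # vertical scale, only rows below the cut
    src_x = (dx + 0.5) * sx - 0.5         # pixel-center alignment
    src_y = (dy + 0.5) * sy - 0.5 + cut_height  # shifted down by the cut
    return src_x, src_y


top_left = source_coords(0, 0, 1640, 590, 800, 320, 270)
bottom_row = source_coords(0, 319, 1640, 590, 800, 320, 270)
print(top_left, bottom_row)
```

With these sizes the vertical scale is exactly 1, so output row 0 samples source row 270 and output row 319 samples source row 589. The cut costs nothing beyond an offset added to src_y.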

3. CLRerNet Postprocessing

As mentioned earlier, CLRerNet's postprocessing consists mainly of NMS and decode.

The NMS is done on the GPU here, reusing the tensorRT_Pro code, as shown below:

static __device__ float LaneIoU(float* a, float* b, int input_width){
    int start_a = (int)((1 - a[2]) * N_STRIPS + 0.5f);
    int start_b = (int)((1 - b[2]) * N_STRIPS + 0.5f);
    int start   = max(start_a, start_b);
    int end_a   = start_a + (int)(a[5] + 0.5f) - 1;
    int end_b   = start_b + (int)(b[5] + 0.5f) - 1;
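    int end     = min(min(end_a, end_b), N_STRIPS);
    if(end < start) return 1e9f;    // no shared rows: report a huge distance so the pair is never suppressed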
    float dist  = 0.0f;
    for(int i = 7 + start; i <= 7 + end; ++i){
        dist += fabsf(a[i] - b[i]);
    }
    return dist * (input_width - 1) / (float)(end - start + 1);
}

static __global__ void nms_kernel(float* lanes, int max_lanes, int input_width, float threshold){
    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*lanes, max_lanes);
    if(position >= count)
        return;
    
    float* pcurrent = lanes + 1 + position * NUM_LANE_ELEMENT;
    if(pcurrent[6] == 0) return;

    for(int i = 0; i < count; ++i){
        float* pitem = lanes + 1 + i * NUM_LANE_ELEMENT;
        if(i == position)   continue;

        if(pitem[1] >= pcurrent[1]){
            if(pitem[1] == pcurrent[1] && i < position)
                continue;
            
            float iou = LaneIoU(pcurrent, pitem, input_width);
            if(iou < threshold){
                pcurrent[6] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}

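The NMS launches many threads, each handling one lane proposal. If another proposal has a higher confidence than the one handled by the current thread, the distance between the two lanes (the "IoU" above) is computed and used to decide whether the current lane is kept. Compared with the CPU NMS, one level of looping is gone: that loop is replaced by the parallel CUDA threads.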
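To illustrate the per-thread formulation, the sketch below simulates it in plain Python next to a sequential greedy NMS. It uses a toy 1-D distance (the absolute difference of a single x value) as a stand-in for the lane distance, and both function names are hypothetical. One behaviour that can be read off the kernel above: a thread compares against every higher-scoring lane without checking that lane's keep flag, so the two variants can give different results on chains of nearby lanes.

```python
def greedy_nms(scores, xs, thr):
    """Sequential NMS: only lanes that were kept can suppress others."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    removed, keep = set(), []
    for pos, i in enumerate(order):
        if i in removed:
            continue
        keep.append(i)
        for j in order[pos + 1:]:
            if abs(xs[i] - xs[j]) < thr:
                removed.add(j)
    return sorted(keep)


def per_thread_nms(scores, xs, thr):
    """One 'thread' per lane: suppressed if any higher-scoring lane is close."""
    keep = []
    for cur in range(len(scores)):          # each iteration plays one CUDA thread
        suppressed = False
        for other in range(len(scores)):
            if other == cur or scores[other] < scores[cur]:
                continue
            if scores[other] == scores[cur] and other < cur:
                continue                    # tie rule taken from the kernel
            if abs(xs[cur] - xs[other]) < thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(cur)
    return keep


scores, xs = [0.9, 0.8, 0.7], [0.0, 40.0, 80.0]
print(greedy_nms(scores, xs, 50.0), per_thread_nms(scores, xs, 50.0))
```

In this toy case the greedy version keeps lanes 0 and 2, while the per-thread version keeps only lane 0, because lane 2 is suppressed by lane 1 even though lane 1 is itself suppressed. Whether this matters in practice depends on how closely the proposals cluster.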

In the decode step, the confidence filtering is done on the GPU:

static __global__ void decode_kernel(float* predict, int num_lanes, float confidence_threshold, float* parray){

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= num_lanes)  return;

    float* pitem = predict + (NUM_LANE_ELEMENT - 1) * position;
    float conf   = pitem[1];
    if(conf < confidence_threshold)
        return;
    
    int index = atomicAdd(parray, 1);
    float conf1   = *pitem++;
    float conf2   = *pitem++;
    float start_y = *pitem++;
    float start_x = *pitem++;
    float theta   = *pitem++;
    float length  = *pitem++;

    float* pout_item = parray + 1 + index * NUM_LANE_ELEMENT;
    *pout_item++ = conf1;
    *pout_item++ = conf2;
    *pout_item++ = start_y;
    *pout_item++ = start_x;
    *pout_item++ = theta;
    *pout_item++ = length;
    *pout_item++ = 1;   // 1 = keep, 0 = ignore

    for(int i = 0; i < N_OFFSETS; ++i){
        float point  = *pitem++;
        *pout_item++ = point;
    }
}
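The output buffer that this kernel fills can be sketched in Python as a flat list: a counter in slot 0, followed by one fixed-size record per lane that passes the confidence threshold. NUM_LANE_ELEMENT = 79 is an assumption inferred from the kernel (6 fields, a keep flag and 72 x values); `compact_lanes` is a hypothetical helper, and it fills the records sequentially, whereas on the GPU the order depends on which thread reaches atomicAdd first.

```python
N_OFFSETS = 72
NUM_LANE_ELEMENT = 7 + N_OFFSETS   # assumption: 6 fields + keep flag + 72 xs = 79


def compact_lanes(predict_rows, conf_thr):
    """Copy rows with lane score >= conf_thr into [count, record0, record1, ...]."""
    parray = [0.0] * (1 + len(predict_rows) * NUM_LANE_ELEMENT)
    for row in predict_rows:               # each row has 78 values
        if row[1] < conf_thr:
            continue
        index = int(parray[0])             # sequential stand-in for atomicAdd(parray, 1)
        parray[0] += 1
        out = 1 + index * NUM_LANE_ELEMENT
        parray[out:out + 6] = row[:6]      # conf1, conf2, start_y, start_x, theta, length
        parray[out + 6] = 1.0              # keep flag: 1 = keep, 0 = ignore
        parray[out + 7:out + NUM_LANE_ELEMENT] = row[6:6 + N_OFFSETS]
    return parray


rows = [
    [0.8, 0.2] + [0.0] * 76,                           # below threshold, dropped
    [0.1, 0.9, 0.5, 0.3, 0.0, 40.0] + [0.25] * 72,     # kept
]
buf = compact_lanes(rows, conf_thr=0.4)
print(buf[0], buf[1:8], buf[8])
```

The extra keep flag at offset 6 is why the NMS kernel reads the x values starting at index 7 rather than 6.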

The decoding of the points in each proposal, in turn, is done on the CPU:

for(auto& lane : image_based_lanes){
    lane.points.reserve(N_OFFSETS / 2);
    int start = (int)(lane.start_y * N_STRIPS + 0.5f);
    int end   = start + (int)(lane.length + 0.5f) - 1;
    end       = min(end, N_STRIPS);
    for(int i = start; i <= end; ++i){
        lane.points.push_back(cv::Point2f(lane.lane_xs[i], anchor_ys_[i]));
    }
}

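4. CLRerNet Inference

With the preprocessing and postprocessing analyzed, the whole inference pipeline follows directly. In C++, the preprocessing requires a small change to the CUDA resize in tensorRT_Pro, and the decode and NMS parts of the postprocessing need small changes as well.

Run the following command in a terminal to perform inference (note: the full workflow is described later; this is only a quick demonstration):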

make clrernet -j64

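The build output looks like this:

[Image: terminal output of the build]

The inference result is shown below:

[Image: CLRerNet inference result from the C++ pipeline]

That completes the whole CLRerNet inference pipeline in C++. Next, we walk through the complete workflow.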

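III. CLRerNet Deployment

I created a new repository, tensorRT_Pro-YOLOv8, based on shouxieai/tensorRT_Pro and adapted to support the various YOLOv8 tasks. It currently supports classification, detection, segmentation and pose estimation.

Let's see how to use the tensorRT_Pro-YOLOv8 repo to run CLRerNet inference.

1. Source Download

The tensorRT_Pro-YOLOv8 code can be downloaded directly from GitHub at https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8. On Linux, clone it with: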

git clone https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8

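You can also download it manually by clicking the Code button in the upper right corner. The project is then ready. Alternatively, click here to download the source I prepared (note that this copy was downloaded on 2024/8/18; if the repo has changed since, refer to the latest version).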

2. Environment Setup

The required software is TensorRT, CUDA, cuDNN, OpenCV and Protobuf. For installation, refer to the Ubuntu20.04 Software Installation Guide; the steps are not repeated here, so please set up the environment yourself 😄. Since access to overseas sites can be slow, here is a download link for the installation packages I used: Baidu Drive [pwd:yolo] 🚀🚀🚀

tensorRT_Pro-YOLOv8 can be built with either CMakeLists.txt or the Makefile; pick one.

2.1 Configuring CMakeLists.txt

Five changes are needed:

1. Line 13: set the OpenCV path

set(OpenCV_DIR   "/usr/local/include/opencv4/")

2. Line 15: set the CUDA path

set(CUDA_TOOLKIT_ROOT_DIR     "/usr/local/cuda-11.6")

3. Line 16: set the cuDNN path

set(CUDNN_DIR    "/usr/local/cudnn8.4.0.27-cuda11.6")

4. Line 17: set the tensorRT path (version 8.6 or newer is required)

set(TENSORRT_DIR "/home/jarvis/lean/TensorRT-8.6.1.6")

5. Line 20: set the protobuf path

set(PROTOBUF_DIR "/home/jarvis/protobuf")

2.2 Configuring the Makefile

Five changes are needed:

1. Line 4: set the protobuf path

lean_protobuf  := /home/jarvis/protobuf

2. Line 5: set the tensorRT path (version 8.6 or newer is required)

lean_tensor_rt := /home/jarvis/lean/TensorRT-8.6.1.6

3. Line 6: set the cuDNN path

lean_cudnn     := /usr/local/cudnn8.4.0.27-cuda11.6

4. Line 7: set the OpenCV path

lean_opencv    := /usr/local

5. Line 8: set the CUDA path

lean_cuda      := /usr/local/cuda-11.6

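3. ONNX Export

For export details, see CLRerNet Inference Explained and Deployment (Part 1); they are not repeated here. Remember to put the exported ONNX model in the tensorRT_Pro-YOLOv8/workspace folder.

4. Engine Generation

Create clrernet_build.sh under workspace with the following content: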

#! /usr/bin/bash

TRTEXEC=/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jarvis/lean/TensorRT-8.6.1.6/lib

${TRTEXEC} \
  --onnx=clrernet.sim.onnx \
  --minShapes=images:1x3x320x800 \
  --optShapes=images:1x3x320x800 \
  --maxShapes=images:8x3x320x800 \
  --memPoolSize=workspace:2048 \
  --saveEngine=clrernet.sim.FP16.trtmodel \
  --fp16 \
  > clrnet.log 2>&1

Change TRTEXEC to your own path, then run the following in a terminal:

cd tensorRT_Pro-YOLOv8/workspace
bash clrernet_build.sh

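After a while, clrernet.sim.FP16.trtmodel, the model engine file, is generated in the current folder. Nothing is printed to the terminal, because the script redirects tensorRT's log output to the clrnet.log file; remove the redirection if you prefer to see the log in the terminal.

Note: the repo also provides the TRT::compile interface for generating the engine file, but deserialization may then fail with the following error:

[Image: screenshot of the deserialization error]

This happens mainly because the onnxparser that tensorRT_Pro-YOLOv8 builds itself is too old and cannot parse the GridSample and LayerNormalization nodes. The onnxparser can be replaced manually; for details see: RT-DETR Inference Explained and Deployment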

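5. Source Modification

To run inference with a model you trained yourself, the source needs a small change. The CLRerNet inference code lives mainly in app_clrernet.cpp, which is the only file to modify. The change is simple:

app_clrernet.cpp, line 238: change "clrernet.sim" to the name of your exported ONNX model

For example:

test(TRT::Mode::FP16, "clrernet.sim");	// change 1, line 238: replace "clrernet.sim" with "best"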

6. Run

OK! With the source modified, the Makefile configured and the engine ready, we can build and run. Execute the following in a terminal:

make clrernet -j64

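The inference output is shown below:

[Image: terminal output of the inference run]

On success, a clrernet.sim_CLRerNet_FP16_result folder is created containing the inference images.

The model's inference result looks like this:

[Image: CLRerNet inference result]

OK, that is the overall workflow for running CLRerNet with tensorRT_Pro-YOLOv8. If you spot any problems, corrections are welcome.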

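Conclusion

This post gave a brief analysis of CLRerNet's preprocessing and postprocessing and shared the C++ implementation workflow, with the aim of helping you organize your thinking for the deployment work that follows 😄. Thanks for reading to the end. Writing takes effort, so if you found this useful, please leave a 👍⭐️

As the current SOTA approach on the CULane dataset, CLRerNet is well worth studying 🤗

Finally, if the tensorRT_Pro-YOLOv8 repo helped you, consider giving it a ⭐️. It means a lot to me, thank you 🙏.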