I. Background
In BEV pipelines, the method used to transform image-view features into the BEV view has a large impact on model performance. FastBEV is fast, but the upper bound of its projection quality is limited; LSS has a higher projection upper bound, but it is slow (its latency is comparatively high). Is there a middle ground that raises the model's upper bound with only a modest increase in latency (i.e., improves model capability on mid-to-high compute platforms)?
II. The key view-transformation operator: grid_sample
The PyTorch documentation describes how to use the grid_sample operator. It supports both 4-D (FastBEV/IPM) and 5-D (LSS) sampling, gathering image features into the corresponding BEV features and thereby completing the camera-to-BEV view transformation: https://pytorch.org/docs/stable/generated/torch.nn.functional.grid_sample.html
The 5-D grid_sample is dramatically more expensive than the 4-D one. On a given driving SoC, if the 4-D grid_sample takes 2 ms, the 5-D grid_sample under the same conditions may take on the order of 200 ms (the exact latency depends on the number of feature-map channels). Such a steep increase in latency makes the approach very hard to deploy in a driving stack; a simple way to measure the gap is sketched below.
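The sketch below is only illustrative: the shapes are made-up assumptions, and the numbers printed on a desktop CPU/GPU will not reproduce the SoC latencies quoted above, but the same harness can be reused on the target platform. Both calls produce the same number of output samples so the comparison is fair:
import time
import torch
import torch.nn.functional as F

N, C, D, H, W = 6, 32, 60, 16, 44      # assumed camera-feature shape
Zo, Ho, Wo = 4, 100, 100               # assumed BEV grid shape

feats_4d = torch.randn(N, C, H, W)
feats_5d = torch.randn(N, C, D, H, W)
grid_4d = torch.rand(N, Zo * Ho, Wo, 2) * 2 - 1    # same output count as the 5-D case
grid_5d = torch.rand(N, Zo, Ho, Wo, 3) * 2 - 1

def bench(fn, warmup=3, iters=20):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

t4 = bench(lambda: F.grid_sample(feats_4d, grid_4d, mode="nearest", align_corners=True))
t5 = bench(lambda: F.grid_sample(feats_5d, grid_5d, mode="nearest", align_corners=True))
print(f"4-D grid_sample: {t4:.2f} ms, 5-D grid_sample: {t5:.2f} ms")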
III. Optimizing the LSS projection
1. First, compare the input/output relations of the 4-D and 5-D grid_sample:
4-D gridsample
input: (N, C, H_in, W_in);
bev_grid: (N, H_out, W_out, 2), where the trailing 2 holds the image coordinates (x, y) obtained by projecting each BEV grid cell onto the image with the camera intrinsics and extrinsics;
output: (N, C, H_out, W_out)
5-D gridsample
input: (N, C, H_in, W_in);
How the official LSS code builds the depth-weighted feature: https://github.com/nv-tlabs/lift-splat-shoot/blob/master/src/models.py
A 1x1 convolution turns the C-channel input feature into C+D channels (D is the number of depth bins). A softmax over the D depth channels gives input_i: (N, D, H_in, W_in), which is expanded at dim=1 to input_i_1: (N, 1, D, H_in, W_in); the original (N, C, H_in, W_in) feature is expanded at dim=2 to input_i_2: (N, C, 1, H_in, W_in); broadcasting input_i_1 * input_i_2 then yields the depth-weighted input input_2: (N, C, D, H_in, W_in). A short sketch of this construction follows the shape list below;
bev_grid: (N, Z_out, H_out, W_out, 3), where the trailing 3 holds the image coordinates (x, y, d) obtained by projecting each BEV grid cell with the camera intrinsics and extrinsics, d being the depth-bin coordinate;
output: (N, C, Z_out, H_out, W_out);
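As a reference, the construction described above can be written in a few lines of PyTorch. This is only a sketch: the 1x1 depthnet and the C/D split mirror the official LSS CamEncode, but the sizes are made-up assumptions:
import torch
import torch.nn as nn

N, C, D, H_in, W_in = 2, 32, 60, 36, 64           # assumed shapes
x = torch.randn(N, C, H_in, W_in)                 # image feature
depthnet = nn.Conv2d(C, C + D, kernel_size=1)     # 1x1 conv -> C+D channels

out = depthnet(x)
depth = out[:, :D].softmax(dim=1)                 # input_i: (N, D, H_in, W_in)
feat = out[:, D:]                                 # image feature part: (N, C, H_in, W_in)

input_i_1 = depth.unsqueeze(1)                    # (N, 1, D, H_in, W_in)
input_i_2 = feat.unsqueeze(2)                     # (N, C, 1, H_in, W_in)
input_2 = input_i_1 * input_i_2                   # (N, C, D, H_in, W_in)
print(input_2.shape)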
Since obtaining the depth information is what requires the 5-D grid_sample, we lower its cost by minimizing the influence of the channel count, i.e., we run the 5-D grid_sample with the channel dimension set to C = 1;
2. Method: decomposing the 5-D grid_sample
Instead of forming the depth-weighted input input_2 first, we grid_sample input_i_1 and input_i_2 separately and then multiply the sampled results to get the final projection. Concretely, the 5-D grid_sample is split into one 4-D grid_sample and one single-channel (C = 1) 5-D grid_sample: the 4-D grid_sample extracts the multi-channel feature, the single-channel 5-D grid_sample extracts the depth feature, and multiplying the two gives the depth-weighted multi-channel feature. The equivalent transformation is as follows:
step1:
4-D gridsample
input: (N, C, H_in, W_in);
bev_grid: (N, Z_out, H_out, W_out, 2), where the trailing 2 holds the image coordinates (x, y) obtained by projecting each BEV grid cell with the camera intrinsics and extrinsics;
In a for loop, take the slice bev_grid_i: (N, H_out, W_out, 2) for each of the Z_out heights, run a 4-D grid_sample to get output_i: (N, C, H_out, W_out), and stack the results along dim=2 to obtain the BEV feature map output_1 (still without depth probabilities):
output_1: (N, C, Z_out, H_out, W_out)
step2:
Single-channel 5-D gridsample
input: (N, C, H_in, W_in);
The softmax over depth applied to input gives input_2: (N, D, H_in, W_in), where D is the number of depth bins; expand input_2 at dim=1 to get input_3: (N, 1, D, H_in, W_in)
bev_grid: (N, Z_out, H_out, W_out, 3), where the trailing 3 holds the image coordinates (x, y, d) obtained by projecting each BEV grid cell with the camera intrinsics and extrinsics, d being the depth-bin coordinate;
output_2: (N, 1, Z_out, H_out, W_out);
step3:
Multiply output_1 and output_2 to obtain the BEV feature map carrying depth probabilities (a minimal verification sketch follows after this step):
output = output_1 * output_2 = (N, C, Z_out, H_out, W_out) * (N, 1, Z_out, H_out, W_out) = (N, C, Z_out, H_out, W_out)
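A minimal sketch verifying the equivalence (shapes are illustrative assumptions; nearest-neighbor sampling is used, matching the deployment code in the next section):
import torch
import torch.nn.functional as F

N, C, D, H, W = 2, 16, 30, 24, 32
Zo, Ho, Wo = 4, 48, 48

feat = torch.randn(N, C, H, W)
depth = torch.randn(N, D, H, W).softmax(dim=1)
grid = torch.rand(N, Zo, Ho, Wo, 3) * 2 - 1          # normalized (x, y, d)

# Reference: one 5-D grid_sample on the pre-multiplied depth-weighted volume.
volume = depth.unsqueeze(1) * feat.unsqueeze(2)      # (N, C, D, H, W)
ref = F.grid_sample(volume, grid, mode="nearest", align_corners=True)

# Decomposition: per-Z 4-D sampling of the multi-channel feature ...
bev = torch.stack(
    [F.grid_sample(feat, grid[:, z, ..., :2], mode="nearest", align_corners=True)
     for z in range(Zo)], dim=2)                     # (N, C, Zo, Ho, Wo)
# ... plus one single-channel 5-D sampling of the depth probabilities.
dep = F.grid_sample(depth.unsqueeze(1), grid, mode="nearest", align_corners=True)
out = bev * dep                                      # (N, C, Zo, Ho, Wo)

print(torch.allclose(ref, out, atol=1e-6))           # True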
IV. Code excerpts
1. BEV grid coordinate indices for IPM
class UpdateIndicesIPM:
def __init__(self, height, range, voxel_size, feature_size, downsample):
self.height = height
self.range = range
self.voxel_size = voxel_size
self.feature_size = feature_size
self.ds_matrix = np.eye(4)
self.ds_matrix[:2] /= downsample
def __call__(self, data):
num = len(data["cam2egoes"])
ego2feats = torch.zeros((num, 4, 4), dtype=torch.float32)
for i in range(num):
ego2cam = np.linalg.inv(data["cam2egoes"][i])
tmp = np.eye(4)
tmp[:3, :3] = data["cam_intrinsics"][i]
ego2feats[i] = torch.tensor(self.ds_matrix @ tmp @ ego2cam)
grid = torch.stack(torch.meshgrid([
torch.arange(self.range[0], self.range[3], self.voxel_size[0]),
torch.arange(self.range[1], self.range[4], self.voxel_size[1]),
torch.tensor(self.height), torch.tensor(1.0)
], indexing="ij")) # [4, 188, 64, 4, 1]
grid_h, grid_w = grid.shape[1:3]
grid = grid.view(1, 4, -1).expand(num, 4, -1) # [7, 4, 192512]
points_2d = torch.bmm(ego2feats[:, :3, :], grid)
x = (points_2d[:, 0] / points_2d[:, 2]).round().long()
y = (points_2d[:, 1] / points_2d[:, 2]).round().long()
z = points_2d[:, 2]
valid = ~((x >= 0) & (y >= 0) & (x < self.feature_size[1]) &
(y < self.feature_size[0]) & (z > 0))
x[valid] = 0
y[valid] = 0
x = (x.float() / self.feature_size[1] * 2.) - 1.0
y = (y.float() / self.feature_size[0] * 2.) - 1.0
indices = torch.cat([x.unsqueeze(2), y.unsqueeze(2)], dim=2)
        indices = indices.reshape(-1, grid_h, grid_w, len(self.height), 2) # (batch*num_img, bev_w, bev_h, num_height, 2)
data["indices"] = indices
return data
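A hedged usage sketch of UpdateIndicesIPM (all calibration values, ranges, and sizes below are made-up placeholders; identity extrinsics are only for shape checking, assuming the class above is importable together with its numpy/torch imports):
import numpy as np

op = UpdateIndicesIPM(height=[0.5, 1.0, 1.5, 2.0],
                      range=[0.0, -16.0, -3.0, 94.0, 16.0, 5.0],
                      voxel_size=[0.5, 0.5],
                      feature_size=[36, 64],     # feature-map H, W after downsample
                      downsample=16)
data = {
    "cam2egoes": [np.eye(4, dtype=np.float32) for _ in range(7)],
    "cam_intrinsics": [np.array([[1000.0, 0.0, 512.0],
                                 [0.0, 1000.0, 288.0],
                                 [0.0, 0.0, 1.0]], dtype=np.float32) for _ in range(7)],
}
data = op(data)
print(data["indices"].shape)   # expected (7, 188, 64, 4, 2) for this configuration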
2. FastBEV
class FastBevTransform(nn.Module):
def __init__(self, feats_channels, num_height):
super().__init__()
self._num_height = num_height
self._conv = nn.Conv2d(feats_channels * num_height, feats_channels, kernel_size=1)
self._grid_sample = GridSample(mode="nearest",
padding_mode="zeros",
align_corners=True)
self._cat = Concat(dim=1)
def forward(self, feats, indices):
# feats: (7B, C, H, W), indices: (7B, Hg, Wg, Z, 2)
bev_feats = []
for i in range(self._num_height):
output = self._grid_sample(feats, indices[:,:,:,i])
bev_feats.append(output)
bev_feats = self._cat(bev_feats) # (7B, Z*C, Hg, Wg)
bev_feats = self._conv(bev_feats) # (7B, C, Hg, Wg)
return bev_feats
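A torch-only usage sketch of the sampling loop above (the GridSample and Concat wrappers are replaced by their torch.nn.functional equivalents here; shapes are illustrative assumptions):
import torch
import torch.nn.functional as F

BN, C, H, W = 7, 64, 36, 64            # 7 cameras, assumed feature shape
Hg, Wg, Z = 64, 64, 4                  # assumed BEV grid and number of height bins

feats = torch.randn(BN, C, H, W)
indices = torch.rand(BN, Hg, Wg, Z, 2) * 2 - 1      # normalized (x, y) per height bin

bev_feats = [F.grid_sample(feats, indices[:, :, :, i], mode="nearest",
                           padding_mode="zeros", align_corners=True)
             for i in range(Z)]                     # Z tensors of (BN, C, Hg, Wg)
bev_feats = torch.cat(bev_feats, dim=1)             # (BN, Z*C, Hg, Wg)
fuse = torch.nn.Conv2d(Z * C, C, kernel_size=1)
print(fuse(bev_feats).shape)                        # (BN, C, Hg, Wg)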
3. BEV grid coordinate indices for LSS
class UpdateIndicesLSS:
def __init__(self, height, range, voxel_size, feature_size,
resolution, max_num_depth, downsample):
self.height = height
self.range = range
self.voxel_size = voxel_size
self.feature_size = feature_size
self.resolution = resolution
self.max_num_depth = max_num_depth
self.ds = np.eye(3)
self.ds[:2] /= downsample
def __call__(self, data):
num = len(data["cam2egoes"])
ego2cams = torch.zeros((num, 4, 4), dtype=torch.float32)
cam2feats = torch.zeros((num, 3, 3), dtype=torch.float32)
for i in range(num):
ego2cams[i] = torch.tensor(np.linalg.inv(data["cam2egoes"][i]))
cam2feats[i] = torch.tensor(self.ds @ data["cam_intrinsics"][i])
grid = torch.stack(torch.meshgrid([
torch.arange(self.range[0], self.range[3], self.voxel_size[0]),
torch.arange(self.range[1], self.range[4], self.voxel_size[1]),
torch.tensor(self.height), torch.tensor(1.0)
], indexing="ij")) # [4, 188, 64, 4, 1]
grid_h, grid_w = grid.shape[1:3]
grid4 = grid.view(1, 4, -1).expand(num, 4, -1) # [7, 4, 192512]
points_2d = torch.bmm(ego2cams[:, :3, :], grid4)
x = (points_2d[:, 0] / points_2d[:, 2]) # [7, 48128]
y = (points_2d[:, 1] / points_2d[:, 2]) # [7, 48128]
z = points_2d[:, 2] # [7, 48128]
r = points_2d.norm(dim=1) # [B*N, Hg*Wg]
d = torch.floor(r / self.resolution)
distortions = torch.tensor(np.array(data["cam_distortions"]).T)
k1,k2,k3,p1,p2,k4,k5,k6 = distortions[:,:,None]
fovs = torch.tensor(data['crop_fovs']).unsqueeze(-1) / 2.0
in_fov = np.abs(np.arctan2(points_2d[:, 0], z)) < fovs
r2 = x**2 + y**2
ratio = (1 + k1 * r2 + k2 * r2**2 + k3 * r2**3) / (1 + k4 * r2 + k5 * r2**2 + k6 * r2**3)
x_undist = x * ratio + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)
y_undist = y * ratio + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
x = cam2feats[:, 0, [0]] * x_undist + cam2feats[:, 0, [2]]
y = cam2feats[:, 1, [1]] * y_undist + cam2feats[:, 1, [2]]
valid = ~((x >= 0) & (y >= 0) & (x < self.feature_size[1]) & \
(y < self.feature_size[0]) & (z > 0) & in_fov & \
(d >= 0) & (d < self.max_num_depth)) # [7, 48128]
x[valid], y[valid], d[valid] = -1, -1, -1
x = (x.float() / self.feature_size[1] * 2.) - 1.0
y = (y.float() / self.feature_size[0] * 2.) - 1.0
d = (d.float() / self.max_num_depth * 2.) - 1.0
indices = torch.cat([x[:,:,None], y[:,:,None], d[:,:,None]], dim=2) # [7, 48128, 3]
indices = indices.reshape(-1, grid_h, grid_w, len(self.height), 3) # batch*num_img, bev_w, bev_h, num_height, 3(x, y, d)
data["indices"] = indices.permute(0, 3, 1, 2, 4) # batch*num_img, num_height, bev_w, bev_h, 3(x, y, d)
return data
4. BEV projection for LSS
class LssBevTransform(nn.Module):
def __init__(self, num_channels, num_zs, max_num_depth):
super().__init__()
self._num_zs = num_zs
self._max_num_depth = max_num_depth
        self._conv = nn.Conv2d(num_channels * num_zs, num_channels, kernel_size=1)  # TODO: a few more conv layers could be stacked here
self._depth_proj = nn.Sequential(
nn.Conv2d(num_channels, max_num_depth, kernel_size=3, padding=1),
nn.Softmax(dim=1)
)
self._grid_sampler = opm.GridSample(mode="nearest",
padding_mode="zeros",
align_corners=True)
def forward(self, feats, indices):
# feats: (B*N, C, H, W)
# indices: (B*N, Z, X, Y, 3) where 3 dims represent (w, h, d).
bev_feats = self._sample_bev_feats(feats, indices[..., :2]) # (B*N, C, Z, X, Y)
depth_feats = self._sample_depth_feats(feats, indices) # (B*N, 1, Z, X, Y)
        final_feats = bev_feats * depth_feats  # (B*N, C, Z, X, Y)
        N, C, Z, X, Y = final_feats.shape
        final_feats = final_feats.view(N, C * Z, X, Y)  # (B*N, C*Z, Hg, Wg)
final_feats = self._conv(final_feats) # (B*N, C, Hg, Wg)
return final_feats
def _sample_bev_feats(self, feats, indices):
# feats: (B*N, C, H, W)
# indices: (B*N, Z, X, Y, 2) where 2 dims represent (w, h)
bev_feats = [self._grid_sampler(feats, indices[:, i]) for i in range(self._num_zs)]
return torch.stack(bev_feats, dim=2) # (B*N, C, Z, X, Y)
def _sample_depth_feats(self, feats, indices):
# feats: (B*N, C, H, W)
# indices: (B*N, Z, X, Y, 3) where 3 dims represent (w, h, d).
depths = self._depth_proj(feats)[:, None] # (B*N, 1, D, H, W)
return self._grid_sampler(depths, indices) # (B*N, 1, Z, X, Y)
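A torch-only usage sketch of the forward pass above (opm.GridSample is replaced by F.grid_sample; shapes are illustrative assumptions). It reproduces the split from Section III: a per-Z 4-D sample of the feature multiplied by a single-channel 5-D sample of the depth probabilities:
import torch
import torch.nn.functional as F

BN, C, D, H, W = 7, 64, 100, 36, 64
Z, X, Y = 4, 64, 64

feats = torch.randn(BN, C, H, W)
indices = torch.rand(BN, Z, X, Y, 3) * 2 - 1          # normalized (x, y, d)

depth_proj = torch.nn.Sequential(
    torch.nn.Conv2d(C, D, kernel_size=3, padding=1), torch.nn.Softmax(dim=1))

bev = torch.stack([F.grid_sample(feats, indices[:, i, ..., :2], mode="nearest",
                                 align_corners=True) for i in range(Z)], dim=2)
dep = F.grid_sample(depth_proj(feats)[:, None], indices, mode="nearest",
                    align_corners=True)               # (BN, 1, Z, X, Y)
print((bev * dep).shape)                              # (BN, C, Z, X, Y)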
V. Outlook
In the LSS projection, input_3: (N, 1, D, H_in, W_in) can be reshaped by merging D and H_in into (N, 1, D*H_in, W_in), so the depth feature can also be extracted entirely with a 4-D grid_sample, reducing latency even further (be careful to merge the correct pair of dimensions!). The equivalence test code is as follows:
#!/usr/bin/env python3
import unittest
import torch
import torch.nn.functional as F
class GridSampleTest(unittest.TestCase):
def test_grid_sample_equivalence(self):
D, H, W = 100, 144, 256
Y, X = 64, 128
C = 32
# Generate random features.
feats_5d = torch.randn(1, C, D, H, W)
# Generate random indices.
d = torch.randint(high=D, size=(Y, X))
h = torch.randint(high=H, size=(Y, X))
w = torch.randint(high=W, size=(Y, X))
# Prepare grid for 5D grid_sample.
indices_5d = torch.stack([
2.0 * w / (W - 1) - 1.0,
2.0 * h / (H - 1) - 1.0,
2.0 * d / (D - 1) - 1.0
], dim=-1).view(1, 1, Y, X, 3)
bev_feats_5d = F.grid_sample(feats_5d, indices_5d, mode="nearest", align_corners=True).view(C, Y, X)
# Flatten D and H dimensions and prepare grid for 4D grid_sample.
dh = d * H + h
indices_4d = torch.stack([
2.0 * w / (W - 1) - 1.0,
2.0 * dh / (D * H - 1) - 1.0
], dim=-1).view(1, Y, X, 2)
feats_4d = feats_5d.view(1, C, D * H, W)
        # The commented-out version below is the WRONG way: it merges the wrong pair of dimensions
# dw = d * W + w
# indices_4d = torch.stack([
# 2.0 * dw / (D * W - 1) - 1.0,
# 2.0 * h / (H - 1) - 1.0
# ], dim=-1).view(1, Y, X, 2)
# feats_4d = feats_5d.view(1, C, H, D * W)
bev_feats_4d = F.grid_sample(feats_4d, indices_4d, mode="nearest", align_corners=True).view(C, Y, X)
# Check if the results are close.
self.assertTrue(torch.allclose(bev_feats_5d, bev_feats_4d, atol=1e-6))
if __name__ == "__main__":
unittest.main()
When two dimensions are merged before indexing features, round the index coordinates to integers first; otherwise some floating-point coordinates, after being combined and then rounded, differ slightly and the sampled features show small deviations. A small numeric sketch of this follows.
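The sketch below uses illustrative sizes: the merged coordinate d*H + y spans a range roughly D times larger than y alone, so the float32 normalize/denormalize round trip that grid_sample performs internally loses correspondingly more absolute precision, and un-rounded coordinates near a .5 boundary can fall onto a neighboring row:
import torch

# Compare the float32 round trip through grid_sample-style normalization for a
# plain row coordinate y versus the merged coordinate d * H + y.
D, H = 100, 144
y = torch.rand(100_000) * (H - 1)                   # float row coordinates
d = torch.randint(high=D, size=(100_000,)).float()  # integer depth bins

def round_trip(v, size):
    g = (v / (size - 1)) * 2.0 - 1.0                # normalization fed to grid_sample
    return (g + 1.0) / 2.0 * (size - 1)             # what grid_sample recovers internally

print((round_trip(y, H) - y).abs().max())                       # tiny error
merged = d * H + y
print((round_trip(merged, D * H) - merged).abs().max())         # roughly 100x larger error
flips = (round_trip(merged, D * H).round() != merged.round()).sum()
print(flips.item(), "of 100000 merged coordinates round to a different row")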
Code updates:
1. BEV grid coordinate indices for LSS with the 4-D grid_sample
def _get_lss_fixed_resolution_depth(self, r, resolution):
d = torch.floor(r / resolution)
return d
def get_fov(self, intrinsic, distortion):
        # Reference usage of cv2.undistortPoints (camera_matrix and dist_coeffs come from calibration):
        # camera_matrix = np.array([[fx, 0, cx],
        #                           [0, fy, cy],
        #                           [0,  0,  1]])
        # dist_coeffs = np.array([k1, k2, p1, p2, k3])  # distortion coefficients
        # src_point = np.array([[x, y]], dtype=np.float32)  # a distorted 2-D point
        # dst_point = cv2.undistortPoints(src_point, camera_matrix, dist_coeffs, P=camera_matrix)
        # dst_point = dst_point.astype(np.int)  # if integer coordinates are needed
        # To draw the undistorted point: cv2.circle(img, tuple(dst_point[0][0]), 5, (0, 255, 0), -1)
        # Note: img is the distorted image; to undistort a whole image, every pixel must be processed.
        fx = intrinsic[0][0]  # focal length along x (horizontal)
        cx = intrinsic[0][2]
        fy = intrinsic[1][1]  # focal length along y (vertical)
        cy = intrinsic[1][2]
        point0 = np.array([cx, cy])   # principal point (image center)
        point1 = np.array([[0, cy]])  # a point on the image plane (here the left edge)
        # Undistort the point
        dst_point = cv2.undistortPoints(point1, intrinsic, distortion, P=intrinsic)
        # Distance between the two points
        dis = point0 - dst_point[0][0]
        dis = math.sqrt(dis[0]*dis[0] + dis[1]*dis[1])
        # Horizontal field of view (FOV)
        tan = dis/fx
        fov = 2 * math.atan(tan)
        fov = fov * 180 / math.pi
return fov
def _preprocess_img_lss(self, data):
# data["imgs"] = data["imgs"].reshape(-1, 3, 576, 1024) # [8, 7, 3, 576, 1024] -> [56, 3, 576, 1024]
data["imgs"] = data["imgs"].reshape(-1, 3, self.camera_in_image_size[0], self.camera_in_image_size[1])
ego2cam = data["aug_ego2cam"].reshape(-1, 4, 4).float()
grid = torch.stack(torch.meshgrid([
torch.arange(self.cam_range[0], self.cam_range[3], self.cam_voxel_size[0]),
torch.arange(self.cam_range[1], self.cam_range[4], self.cam_voxel_size[1]),
torch.tensor(self.cam_height),
], indexing="ij")).to(ego2cam.device) # torch.Size([3, 376, 128, 4]) #[3, 188, 64, 4]
channel, grid_size_x, grid_size_y, z = grid.shape
grid = grid.view(1, 3, -1).expand(data["imgs"].shape[0], 3, -1) # [7, 3, 192512] # [56, 3, 48128]
        grid = torch.cat((grid, torch.ones_like(grid[:, :1])), dim=1) # [7, 4, 192512] homogeneous coordinates #[56, 4, 48128]
points_2d = torch.bmm(ego2cam[:, :3, :], grid)
x = (points_2d[:, 0] / points_2d[:, 2]) # [56, 48128]
y = (points_2d[:, 1] / points_2d[:, 2]) # [56, 48128]
z = points_2d[:, 2] # [56, 48128]
r = points_2d.norm(dim=1) # [B*N, Hg*Wg]
resolution = self.lss_grid_indices_parameter["fixed_depth_resolution"]["resolution"]
max_num_depth = self.lss_grid_indices_parameter["fixed_depth_resolution"]["max_num_depth"]
d = self._get_lss_fixed_resolution_depth(r, resolution)
temp_distortion = data["cam_distortions"]
distortion = []
for j in range(temp_distortion[0].shape[0]):
for i in range(len(temp_distortion)):
distortion.append(list(temp_distortion[i][j][0]))
distortion = torch.tensor(distortion)
k1 = distortion[:, 0].unsqueeze(-1).to(x.device)
k2 = distortion[:, 1].unsqueeze(-1).to(x.device)
k3 = distortion[:, 2].unsqueeze(-1).to(x.device)
p1 = distortion[:, 3].unsqueeze(-1).to(x.device)
p2 = distortion[:, 4].unsqueeze(-1).to(x.device)
k4 = distortion[:, 5].unsqueeze(-1).to(x.device)
k5 = distortion[:, 6].unsqueeze(-1).to(x.device)
k6 = distortion[:, 7].unsqueeze(-1).to(x.device)
intrinsic = data["fov_cam_intrinsics"]
intrinsic = intrinsic.reshape(-1, 4, 4)
fov = []
fov_valid = []
img_num = len(data["imgs"])
for i in range(img_num):
current_intrinsic = np.array(intrinsic[i][:3,:3].to('cpu'))
current_fov = self.get_fov(current_intrinsic, \
np.array([np.array(k1[i].to('cpu')), np.array(k2[i].to('cpu')), np.array(p1[i].to('cpu')), np.array(p2[i].to('cpu')), np.array(k3[i].to('cpu')), np.array(k4[i].to('cpu')), np.array(k5[i].to('cpu')), np.array(k6[i].to('cpu'))]))
fov.append(current_fov)
fov_valid = (np.abs(np.arctan2(points_2d[:, 0].to('cpu'), z.to('cpu'))) / np.pi * 180 < torch.tensor([m / 2 for m in fov]).unsqueeze(-1)) & (z.to('cpu') > 0)
r2 = x**2 + y**2
x_temp = x * (1 + k1 * r2 + k2 * r2**2 + k3 * r2**3) / (1 + k4 * r2 + k5 * r2**2 + k6 * r2**3) + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)
y_temp = y * (1 + k1 * r2 + k2 * r2**2 + k3 * r2**3) / (1 + k4 * r2 + k5 * r2**2 + k6 * r2**3) + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
intrinsic = data["aug_cam_intrinsics"]
intrinsic = intrinsic.reshape(-1, 4, 4)
x = intrinsic[:, 0, 0].unsqueeze(-1) * x_temp + intrinsic[:, 0, 1].unsqueeze(-1) * y_temp + intrinsic[:, 0, 2].unsqueeze(-1)
y = intrinsic[:, 1, 0].unsqueeze(-1) * x_temp + intrinsic[:, 1, 1].unsqueeze(-1) * y_temp + intrinsic[:, 1, 2].unsqueeze(-1)
        x = torch.round(x)  # round to integers here so the later merged-coordinate arithmetic cannot flip the nearest-neighbor rounding
y = torch.round(y)
d = torch.round(d)
valid = ~((x >= 0) & (y >= 0) & (x < self.cam_feature_size[1]) & (y < self.cam_feature_size[0]) & (z > 0) & fov_valid.to(x.device) \
& (d >= 0) & (d < max_num_depth)) # [56, 48128]
####################
x[valid] = 0 # (BN, ZXY)
y[valid] = 0 # (BN, ZXY)
d[valid] = 0 # (BN, ZXY)
yd = d.float() * self.cam_feature_size[0] + y #(BN, ZXY)
x = (x.float() / (self.cam_feature_size[1] - 1)) * 2.0 - 1.0
y = (y.float() / (self.cam_feature_size[0] - 1)) * 2.0 - 1.0
yd = (yd.float() / (self.cam_feature_size[0] * max_num_depth - 1)) * 2.0 - 1.0
        indices = torch.stack([x, y, x, yd], dim=-1) # (BN, ZXY, 4): (x, y) for the BEV sampler, (x, yd) for the depth sampler
indices = indices.view(-1, len(self.cam_height), grid_size_x, grid_size_y, 4)
self.batch_size = indices.shape[0] // self.num_cam
return indices
2. BEV projection for LSS with the 4-D grid_sample
class LssBevTransform(nn.Module):
def __init__(self, num_channels, num_zs, max_num_depth):
super().__init__()
self._num_zs = num_zs
self._max_num_depth = max_num_depth
        self._conv = nn.Conv2d(num_channels * num_zs, num_channels, kernel_size=1)  # TODO: a few more conv layers could be stacked here
self._depth_proj = nn.Sequential(
nn.Conv2d(num_channels, max_num_depth, kernel_size=3, padding=1),
nn.Softmax(dim=1)
)
self._grid_sampler = opm.GridSample(mode="nearest",
padding_mode="zeros",
align_corners=True)
self._cat = opm.Concat(dim=1)
def forward(self, feats, indices):
# feats: (B*N, C, H, W)
        # indices: (B*N, Z, X, Y, 4) where the 4 dims are (x, y, x, yd); the first two feed the BEV sampler, the last two feed the depth sampler.
bev_feats = self._sample_bev_feats(feats, indices[..., :2]) # (B*N, C, Z, X, Y)
# depth_feats = self._sample_depth_feats(feats, indices) # (B*N, 1, Z, X, Y)
depth_feats = self._sample_depth_feats(feats, indices[..., 2:4]) # (B*N, 1, Z, X, Y)
        final_feats = bev_feats * depth_feats  # (B*N, C, Z, X, Y)
        N, C, Z, X, Y = final_feats.shape
        final_feats = final_feats.view(N, C * Z, X, Y)  # (B*N, C*Z, Hg, Wg)
final_feats = self._conv(final_feats) # (B*N, C, Hg, Wg)
return final_feats
def _sample_bev_feats(self, feats, indices):
# feats: (B*N, C, H, W)
        # indices: (B*N, Z, X, Y, 2) where the 2 dims are (w, h)
        bev_feats = [self._grid_sampler(feats, indices[:, i]) for i in range(self._num_zs)]
        return torch.stack(bev_feats, dim=2) # (B*N, C, Z, X, Y)
def _sample_depth_feats(self, feats, indices):
# feats: (B*N, C, H, W)
        # indices: (B*N, Z, X, Y, 2) where the 2 dims are (x, yd), with yd = d * H + y indexing the flattened (D*H, W) depth map.
# depths = self._depth_proj(feats)[:, None] # (B*N, 1, D, H, W)
# return self._grid_sampler(depths, indices) # (B*N, 1, Z, X, Y)
depths = self._depth_proj(feats)[:, None]
N, C, D, H, W = depths.shape
depths_new = depths.view(N, C, D * H, W)
bev_feats = [self._grid_sampler(depths_new, indices[:, i]) for i in range(self._num_zs)]
return torch.stack(bev_feats, dim=2)
Camera intrinsics/extrinsics handling:
class ImageCropResizeAug3D:
"""Do crop, resize, FLIP_LEFT_RIGHT and rotate For multi channel images.
Args:
input_image_info: For load multi channel images, first do crop, then do resize,
for example, one of the params format like this:
- 'CAM_FRONT_SHORT': this is one of the camera_types, you can define in get_data_info, here is a map key;
- x_min, y_min:start of the crop coordinate, x_min is the width-axis direction, y_min is the height-axis direction;
- width, height: the image size of after crop;
- new_width, new_height: the image size of after resize;
input_image_info:
'CAM_FRONT_SHORT':
crop_params:
x_min_scale: 4
y_min_scale: 3
width_scale: 2
height_scale: 2
resize_params:
new_width: 1024
new_height: 576
rot_lim (list): range of image rotate
rand_flip (bool): Whether to FLIP_LEFT_RIGHT the img.
"""
def __init__(
self, input_image_info, rot_lim, rand_flip, is_train, crop_noise
):
self.input_image_info = input_image_info
self.rand_flip = rand_flip
self.rot_lim = rot_lim
self.is_train = is_train
self.crop_noise = crop_noise
print("start ImageCropResizeAug3D")
def get_fov(self, intrinsic, distortion):
        # Reference usage of cv2.undistortPoints (camera_matrix and dist_coeffs come from calibration):
        # camera_matrix = np.array([[fx, 0, cx],
        #                           [0, fy, cy],
        #                           [0,  0,  1]])
        # dist_coeffs = np.array([k1, k2, p1, p2, k3])  # distortion coefficients
        # src_point = np.array([[x, y]], dtype=np.float32)  # a distorted 2-D point
        # dst_point = cv2.undistortPoints(src_point, camera_matrix, dist_coeffs, P=camera_matrix)
        # dst_point = dst_point.astype(np.int)  # if integer coordinates are needed
        # To draw the undistorted point: cv2.circle(img, tuple(dst_point[0][0]), 5, (0, 255, 0), -1)
        # Note: img is the distorted image; to undistort a whole image, every pixel must be processed.
        fx = intrinsic[0][0]  # focal length along x (horizontal)
        cx = intrinsic[0][2]
        fy = intrinsic[1][1]  # focal length along y (vertical)
        cy = intrinsic[1][2]
        point0 = np.array([cx, cy])   # principal point (image center)
        point1 = np.array([[0, cy]])  # a point on the image plane (here the left edge)
        # Undistort the point
        dst_point = cv2.undistortPoints(point1, intrinsic, distortion, P=intrinsic)
        # Distance between the two points
        dis = point0 - dst_point[0][0]
        dis = math.sqrt(dis[0]*dis[0] + dis[1]*dis[1])
        # Horizontal field of view (FOV)
        tan = dis/fx
        fov = 2 * math.atan(tan)
        fov = fov * 180 / math.pi
return fov
def sample_augmentation(self, results, index):
# print("start ImageCropResizeAug3D sample_augmentation")
image_type = results["image_types"]
# print("start get image_type")
image_crop_resize_params = copy.deepcopy(self.input_image_info[image_type[index]])
crop_params = image_crop_resize_params["crop_params"]
        # Check whether the image FOV matches the design spec
current_file_path = results['path']
intrinsic = results['cam_intrinsics'][index][:3, :3]
distortion = results['cam_distortions'][index][0]
camera_matrix = np.array(intrinsic)
if len(distortion) == 5:
results['cam_distortions'][index] = np.array([np.array([distortion[0],distortion[1],distortion[2],distortion[3],distortion[4], 0, 0, 0])])
distortion = results['cam_distortions'][index][0]
elif len(distortion) == 4:
results['cam_distortions'][index] = np.array([np.array([distortion[0],distortion[1],distortion[2],distortion[3], 0, 0, 0, 0])])
distortion = results['cam_distortions'][index][0]
if len(distortion) == 8:
distortion = np.array([distortion[0],distortion[1],distortion[3],distortion[4],distortion[2],distortion[5],distortion[6],distortion[7]])
else:
print(f"!!!error:{current_file_path} lenth of distortion is {len(distortion)} not 8 or 5")
# exit()
        # Check whether this frame's FOV is within tolerance of the design value
current_caluate_cam_fov = self.get_fov(camera_matrix, distortion)
current_camera_design_fov = image_crop_resize_params["crop_params"]["design_fov"]
if image_type[index] == 'CAM_BACK_MIDDLE':
if abs(current_camera_design_fov - current_caluate_cam_fov) > 5:
                if abs(60 - current_caluate_cam_fov) < 5:  # some legacy data uses a 60-degree rear camera
crop_params["x_min_scale"] = 18.00
crop_params["y_min_scale"] = 18.00
crop_params["width_scale"] = 1.125
crop_params["height_scale"] = 1.125
                elif abs(120 - current_caluate_cam_fov) < 5:  # some legacy data uses a 120-degree rear camera
crop_params["x_min_scale"] = 3.60
crop_params["y_min_scale"] = 3.60
crop_params["width_scale"] = 2.25
crop_params["height_scale"] = 2.25
else:
print(f"!!!error:the {image_type[index]} fov of {current_file_path} is {current_caluate_cam_fov}, not match disign fov {current_camera_design_fov} or 60 or 120!!!")
# exit()
else:
            # assert (abs(current_camera_design_fov - current_caluate_cam_fov) < 5), f"!!!error: the {image_type[index]} fov of {current_file_path} is {current_caluate_cam_fov}, which does not match design fov {current_camera_design_fov} !!!"
            if abs(current_camera_design_fov - current_caluate_cam_fov) >= 5:
                print(f"!!!error: the {image_type[index]} fov of {current_file_path} is {current_caluate_cam_fov}, which does not match design fov {current_camera_design_fov} !!!")
if self.crop_noise > 0:
crop_noise = random.randint(-self.crop_noise, self.crop_noise)
else:
crop_noise = 0
if (crop_params["x_min_scale"] - 1) < 1e-6:
x_min = 0
fov_x_min = x_min
else:
x_min = round(float(results["ori_shape"][index][0]) / float(crop_params["x_min_scale"]))
fov_x_min = x_min
x_min += crop_noise
if (crop_params["y_min_scale"] - 1) < 1e-6:
y_min = 0
fov_y_min = y_min
else:
y_min = round(float(results["ori_shape"][index][1]) / float(crop_params["y_min_scale"]))
fov_y_min = y_min
y_min += crop_noise
width = round(float(results["ori_shape"][index][0]) / float(crop_params["width_scale"]))
height = round(float(results["ori_shape"][index][1]) / float(crop_params["height_scale"]))
crop_area = (x_min, y_min, x_min + width, y_min + height)
fov_crop_area = (fov_x_min, fov_y_min, fov_x_min + width, fov_y_min + height)
resize_params = image_crop_resize_params["resize_params"]
resize = float(resize_params["new_width"])/float(width)
resize_dims = (resize_params["new_width"], resize_params["new_height"])
flip = self.rand_flip and np.random.choice([0, 1])
rotate = np.random.uniform(*self.rot_lim)
results["img_shape"].append(resize_dims)
return resize, resize_dims, crop_area, flip, rotate, fov_crop_area
def img_transform(
self, img, rotation, translation, resize, resize_dims, crop, flip, rotate
):
        # See https://zhuanlan.zhihu.com/p/608931944 for the derivation of these formulas
        # Note: the crop/resize order is changed here -- crop first, then resize
# adjust image
img = img.crop(crop)
# resampling_method = {0:'Resampling.NEAREST',2:'Resampling.BILINEAR',3:'Resampling.HAMMING', 4:'Resampling.BICUBIC'}
img = img.resize(resize_dims,resample=random.choice([0,2,3]))
if flip:
img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
img = img.rotate(rotate)
# post-homography transformation
rotation *= resize
translation -= (resize*torch.Tensor(crop[:2]))
if flip:
A = torch.Tensor([[-1, 0], [0, 1]])
b = torch.Tensor([crop[2] - crop[0], 0])
rotation = A.matmul(rotation)
translation = A.matmul(translation) + resize*b
theta = rotate / 180 * np.pi
A = torch.Tensor(
[
[np.cos(theta), np.sin(theta)],
[-np.sin(theta), np.cos(theta)],
]
)
b = resize*torch.Tensor([crop[2] - crop[0], crop[3] - crop[1]]) / 2
b = A.matmul(-b) + b
rotation = A.matmul(rotation)
translation = A.matmul(translation) + b
return img, rotation, translation
def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
imgs = data["imgs"]
new_imgs = []
transforms = []
fov_transforms = []
for index, img in enumerate(imgs):
resize, resize_dims, crop, flip, rotate, fov_crop_area = self.sample_augmentation(data, index)
post_rot = torch.eye(2)
post_tran = torch.zeros(2)
new_img, rotation, translation = self.img_transform(
img,
post_rot,
post_tran,
resize=resize,
resize_dims=resize_dims,
crop=crop,
flip=flip,
rotate=rotate,
)
transform = torch.eye(4)
transform[:2, :2] = rotation
transform[:2, 3] = translation
new_imgs.append(new_img)
transforms.append(transform.numpy())
fov_post_rot = torch.eye(2)
fov_post_tran = torch.zeros(2)
fov_rotation = fov_post_rot * resize
fov_translation = fov_post_tran - (resize*torch.Tensor(fov_crop_area[:2]))
fov_transform = torch.eye(4)
fov_transform[:2, :2] = fov_rotation
fov_transform[:2, 3] = fov_translation
fov_transforms.append(fov_transform.numpy())
data["imgs"] = new_imgs
# update the calibration matrices
data["img_aug_matrix"] = transforms
data["img_fov_aug_matrix"] = fov_transforms
return data
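A quick sanity check of get_fov above (hypothetical numbers; assumes the module-level imports such as cv2, math and numpy that this pipeline file already relies on): with zero distortion the result should match the pinhole formula 2*atan(cx/fx):
import math
import numpy as np

aug = ImageCropResizeAug3D(input_image_info={}, rot_lim=[0.0, 0.0],
                           rand_flip=False, is_train=False, crop_noise=0)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(8)
print(aug.get_fov(K, dist))                            # measured horizontal FOV in degrees
print(2 * math.atan(960.0 / 1000.0) * 180 / math.pi)   # about 87.7, pinhole reference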
@PIPELINES.register_module()
class UpdateEgo2img:
"""update final ego2img matrix.
"""
def __init__(self, downsample):
self.downsample = downsample
print("start UpdateEgo2img")
def update_ego2img(self, cam_intrinsics, rot, tran, img_aug_matrix, img_fov_aug_matrix, lidar_aug_matrix, downsample):
lidar2img = np.zeros((len(rot), 3, 4), dtype=np.float32)
aug_cam_intrinsics = np.zeros((len(rot), 4, 4), dtype=np.float32)
fov_cam_intrinsics = np.zeros((len(rot), 4, 4), dtype=np.float32)
aug_lidar2cam = np.zeros((len(rot), 4, 4), dtype=np.float32)
for i in range(len(rot)):
            # Build the cam2ego rotation/translation matrix
transform = np.zeros((4, 4), dtype=np.float32)
transform[:3, :3] = rot[i]
transform[:3, -1] = tran[i]
transform[-1, -1] = 1.0
            # Apply the lidar augmentation to get the augmented cam2lidar matrix
new_transform = lidar_aug_matrix @ transform
rotation = new_transform[:3, :3]
translation = new_transform[:3, 3]
            # Compute the lidar2cam rotation/translation matrix
lidar2cam_r = np.linalg.inv(rotation)
lidar2cam_t = translation @ lidar2cam_r.T
lidar2cam_rt = np.eye(4)
lidar2cam_rt[:3, :3] = lidar2cam_r.T
lidar2cam_rt[3, :3] = -lidar2cam_t
intrinsic = cam_intrinsics[i][:3, :3]
            # Apply the image augmentation matrix to the camera intrinsics
viewpad = np.eye(4)
if img_aug_matrix is not None:
assert img_aug_matrix is not None, img_aug_matrix
post_rot = img_aug_matrix[i][:3, :3]
post_tran = img_aug_matrix[i][:3, 3]
viewpad[:3, :2] = post_rot @ intrinsic[:3, :2]
viewpad[:3, 2] = post_rot @ intrinsic[:3, 2]
viewpad[:3, 2] += post_tran
else:
viewpad[:intrinsic.shape[0], :intrinsic.shape[1]] = intrinsic
            # FOV version of the intrinsics/extrinsics update
fov_viewpad = np.eye(4)
if img_fov_aug_matrix is not None:
assert img_fov_aug_matrix is not None, img_fov_aug_matrix
fov_post_rot = img_fov_aug_matrix[i][:3, :3]
fov_post_tran = img_fov_aug_matrix[i][:3, 3]
fov_viewpad[:3, :2] = fov_post_rot @ intrinsic[:3, :2]
fov_viewpad[:3, 2] = fov_post_rot @ intrinsic[:3, 2]
fov_viewpad[:3, 2] += fov_post_tran
else:
fov_viewpad[:intrinsic.shape[0], :intrinsic.shape[1]] = intrinsic
            # Scale the camera intrinsics by the downsample factor
intrinsic=np.eye(4)
intrinsic[:2] /= downsample
viewpad = intrinsic @ viewpad
aug_cam_intrinsics[i] = copy.deepcopy(viewpad)
fov_viewpad = intrinsic @ fov_viewpad
fov_cam_intrinsics[i] = copy.deepcopy(fov_viewpad)
aug_lidar2cam[i] = copy.deepcopy(lidar2cam_rt.T)
            # Compute and store the lidar-to-image transform
lidar2img[i] = (viewpad @ lidar2cam_rt.T)[:3]
return lidar2img, aug_cam_intrinsics, fov_cam_intrinsics, aug_lidar2cam
def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Decouple the BEV feature-point mapping from the model code and run it here in preprocessing; update the ego2img parameters
ego2img, aug_cam_intrinsics, fov_cam_intrinsics, aug_lidar2cam = self.update_ego2img(
data["cam_intrinsics"], data["rot"], data["tran"],
data["img_aug_matrix"],
data["img_fov_aug_matrix"],
data["lidar_aug_matrix"],
self.downsample
)
data["ego2img"] = ego2img
data["aug_cam_intrinsics"] = np.array(aug_cam_intrinsics)
data["fov_cam_intrinsics"] = np.array(fov_cam_intrinsics)
data["aug_ego2cam"] = np.array(aug_lidar2cam)
# print("finished UpdateEgo2img")
return data