[MIT-BEVFusion Code Walkthrough] Part 3: The Camera Encoder

非晚非晚

Last modified: 2024-08-29 13:42:46

Category: Object Detection in Practice. Tags: bevfusion, object detection, 3D object detection, mmdet, BEV

First published: 2024-08-29 13:30:27

Article link: https://blog.csdn.net/QLeelq/article/details/122580875

Table of contents

1. The backbone module
2. The neck module
3. The vtransform module

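Links to the other BEVFusion articles:

[Paper Reading] ICRA 2023 | BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
MIT-BEVFusion training environment setup and troubleshooting notes
[MIT-BEVFusion Code Walkthrough] Part 1: Overall structure and config parameters
[MIT-BEVFusion Code Walkthrough] Part 2: The LiDAR encoder
[MIT-BEVFusion Code Walkthrough] Part 3: The camera encoder
[MIT-BEVFusion Code Walkthrough] Part 4: The fuser (feature fusion) and the decoder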

The camera encoder has 3 main parts: backbone, neck and vtransform. The backbone uses SwinTransformer, the neck uses GeneralizedLSSFPN, and the vtransform part uses DepthLSSTransform.

They are called in the order backbone => neck => vtransform. The code is shown below.

        B, N, C, H, W = x.size()
        x = x.view(B * N, C, H, W)
        # backbone => SwinTransformer
        x = self.encoders["camera"]["backbone"](x)
        # neck => GeneralizedLSSFPN
        x = self.encoders["camera"]["neck"](x)

        if not isinstance(x, torch.Tensor):
            x = x[0]

        BN, C, H, W = x.size()
        x = x.view(B, int(BN / B), C, H, W)
        # vtransform => DepthLSSTransform
        x = self.encoders["camera"]["vtransform"](
            x,
            points,
            camera2ego,
            lidar2ego,
            lidar2camera,
            lidar2image,
            camera_intrinsics,
            camera2lidar,
            img_aug_matrix,
            lidar_aug_matrix,
            img_metas,
        )
        return x

1. The backbone module

1.1 Swin Transformer theory

The backbone uses SwinTransformer. To make the code easier to follow, this section first gives a brief introduction to the theory behind it.

Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Paper URL: https://arxiv.org/abs/2103.14030
Code: https://github.com/microsoft/Swin-Transformer

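The main idea of Swin Transformer is to combine the strong modeling capacity of the transformer with the priors of visual signals: hierarchy, locality and translation invariance. Concretely, it uses shifted windows to build hierarchical feature maps. With hierarchical feature maps, structures such as FPN or U-Net can be used for dense prediction tasks, and the computational cost grows linearly with image size.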

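Differences between Swin Transformer and Vision Transformer:

Swin Transformer builds hierarchical feature maps in a way similar to convolutional networks: its feature maps are downsampled by, for example, 4x, 8x and 16x relative to the input image. Such a backbone is a good basis for tasks like object detection and instance segmentation. The earlier Vision Transformer instead downsamples by 16x at the very start and keeps that rate for all later feature maps.
Swin Transformer introduces Windows Multi-Head Self-Attention (W-MSA): the feature map is divided into non-overlapping regions (windows), and Multi-Head Self-Attention is computed only inside each window. Compared with Vision Transformer, which applies Multi-Head Self-Attention to the whole (global) feature map, this reduces the amount of computation, especially for the large feature maps in the shallow layers. The price is that information cannot pass between windows, so the authors also propose Shifted Windows Multi-Head Self-Attention (SW-MSA), which lets information flow between neighboring windows.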

The network architecture of Swin Transformer is shown in the figure below.
[Figure: Swin Transformer network architecture]

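Patch Partition module:

The image is first fed to the Patch Partition module, which splits it into patches: every 4x4 block of adjacent pixels forms one patch, and each patch is flattened along the channel dimension. For an RGB input, each patch has 4x4=16 pixels and each pixel has three values (R, G, B), so a flattened patch has 16x3=48 values. Patch Partition therefore changes the image shape from [H, W, 3] to [H/4, W/4, 48].

Linear Embedding module:

The Linear Embedding layer applies a linear transform to the channel data of each patch, mapping 48 to C, so the shape changes from [H/4, W/4, 48] to [H/4, W/4, C].

In the source code, Patch Partition and Linear Embedding are implemented together by a single convolution layer.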
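The equivalence between "partition into patches, then apply a linear layer" and a single strided convolution can be checked numerically. The following is a minimal sketch under assumptions: the small input size and the variable names are illustrative, and only the 4x4 patch size and the 96 output channels come from this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# One strided convolution: kernel_size = stride = patch size (4), 3 -> 96 channels.
conv = nn.Conv2d(3, 96, kernel_size=4, stride=4)
x = torch.randn(1, 3, 16, 32)  # small illustrative image

# Path 1: the convolution, then flatten the spatial grid into a token sequence.
out_conv = conv(x).flatten(2).transpose(1, 2)  # (1, 4*8, 96)

# Path 2: explicit patch partition (48 values per patch), then a linear map
# that reuses the convolution's weights.
patches = F.unfold(x, kernel_size=4, stride=4).transpose(1, 2)  # (1, 32, 48)
weight = conv.weight.reshape(96, -1)  # (96, 48)
out_linear = patches @ weight.t() + conv.bias  # (1, 32, 96)

same = torch.allclose(out_conv, out_linear, atol=1e-5)
print(out_conv.shape, same)
```

Both paths produce the same tokens, which is why a single Conv2d is enough in practice.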

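Patch Merging module:

Every stage except Stage 1 starts with a Patch Merging layer that downsamples its input. As an example, suppose the input to Patch Merging is a single-channel 4x4 feature map. Patch Merging groups every 2x2 block of adjacent pixels into a patch, then gathers the pixels at the same position within each patch, which gives 4 feature maps of half the height and width. These four feature maps are concatenated along the depth dimension and passed through a LayerNorm layer. Finally, a fully connected layer applies a linear transform along the depth dimension, reducing the depth from 4C to 2C. This simple example shows that after a Patch Merging layer the height and width of the feature map are halved and the depth is doubled.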
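The Patch Merging steps above can be sketched with plain tensor slicing. This is an illustrative sketch, not the mmdet code (which uses nn.Unfold, as discussed later in this article); the function name `patch_merging` and the channels-last layout are assumptions made for readability.

```python
import torch
import torch.nn as nn

def patch_merging(x, norm, reduction):
    """x: (B, H, W, C) with even H and W. Returns (B, H/2, W/2, 2C)."""
    # Pick the pixel at each of the four positions inside every 2x2 patch.
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
    return reduction(norm(x))                # LayerNorm, then 4C -> 2C

C = 96
norm = nn.LayerNorm(4 * C)
reduction = nn.Linear(4 * C, 2 * C, bias=False)
x = torch.randn(2, 64, 176, C)
y = patch_merging(x, norm, reduction)
print(y.shape)
```

The spatial size goes from 64x176 to 32x88 while the channel count goes from 96 to 192.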

W-MSA (Windows Multi-head Self-Attention) module:

The Windows Multi-head Self-Attention (W-MSA) module is introduced to reduce computation. With the ordinary Multi-head Self-Attention (MSA) module, every pixel of the feature map (also called a token or patch) attends to all other pixels during Self-Attention. With W-MSA, the feature map is first divided into windows of size MxM (for example M=2), and Self-Attention is then computed separately inside each window.
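The window partition used by W-MSA can be sketched as a reshape. This is a minimal sketch assuming a channels-last layout and H, W divisible by M; the helper name `window_partition` is chosen for illustration.

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows * B, M * M, C), with H and W divisible by M."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    # Bring the two window-grid axes together, then the two in-window axes.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

# A 4x4 single-channel map whose values are 0..15 in row-major order.
x = torch.arange(16, dtype=torch.float32).view(1, 4, 4, 1)
windows = window_partition(x, M=2)        # 4 windows of 4 tokens each
first_window = windows[0, :, 0].tolist()  # tokens of the top-left window
print(windows.shape, first_window)
```

Self-attention is then run over the M*M tokens of each window independently, so the attention cost per token depends on M*M rather than on H*W.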

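SW-MSA module:

With W-MSA, self-attention is computed only inside each window, so no information passes between windows. To solve this, the authors introduce the Shifted Windows Multi-Head Self-Attention (SW-MSA) module, which is W-MSA with shifted windows. W-MSA and SW-MSA are used in pairs: if layer L uses W-MSA, layer L+1 uses SW-MSA. Comparing the two layers, the windows have moved (think of the window grid being shifted from the top-left corner by M/2 pixels to the right and M/2 pixels down). This solves the problem that different windows cannot exchange information.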
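One common way to realize the shift is a cyclic roll of the feature map by M/2, after which the regular window partition is applied again. The sketch below shows only the roll and its inverse on a toy 4x4 map; the Swin paper additionally masks attention between tokens that were not neighbors before the roll, which is omitted here.

```python
import torch

M = 2  # window size of the toy example
x = torch.arange(16).view(1, 4, 4, 1)  # (B, H, W, C), values 0..15 row-major

# Shift the map up and to the left by M // 2 so that the regular window grid
# now covers pixels that belonged to different windows in the previous layer.
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# After attention, the shift is undone with the opposite roll.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))

print(shifted[0, :, :, 0])
print(torch.equal(restored, x))
```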

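1.2 patch embedding

Note first that the SwinTransformer used by BEVFusion comes from the mmdet library. In the author's Anaconda environment it is located under lib/python3.8/site-packages/mmdet.

The patch embedding module implements patch partition and linear embedding: it cuts the image into patches and embeds each patch into the target dimension. It is implemented directly as a convolution with kernel_size=patch_size and stride=patch_size. The model default is patch_size=4.

The backbone input is (B * N, C, H, W) = (4 * 6, 3, 256, 704), where B = 4 is the batch size and N = 6 is the number of cameras.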

x, hw_shape = self.patch_embed(x)

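This step is a Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4)) convolution. The input has size (B * N, C, H, W) = (4 * 6, 3, 256, 704) and the convolution output is [24, 96, 64, 176]. After flattening, the size becomes [24, 11264, 96], and the result finally passes through a LayerNorm layer. The code is as follows: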

        if self.adap_padding:
            x = self.adap_padding(x)

        x = self.projection(x)  # Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
        out_size = (x.shape[2], x.shape[3])  # feature map = (64, 176)
        x = x.flatten(2).transpose(1, 2)  # flatten to [24, 11264, 96]
        if self.norm is not None:  # LayerNorm
            x = self.norm(x)
        return x, out_size

1.3 stages

The tokens then pass through 4 stages. The outputs of the last 3 stages go through LayerNorm and are finally rearranged into 4-D tensors.

        outs = []
        # 4 stages
        for i, stage in enumerate(self.stages):
            x, hw_shape, out, out_hw_shape = stage(x, hw_shape)
            # self.out_indices=[1, 2, 3]: keep only the last three outputs
            if i in self.out_indices:
                norm_layer = getattr(self, f'norm{i}')
                out = norm_layer(out)
                out = out.view(-1, *out_hw_shape,
                               self.num_features[i]).permute(0, 3, 1,
                                                             2).contiguous()
                
                outs.append(out)

        return outs

The output is a list of 3 elements, with sizes [24, 192, 32, 88], [24, 384, 16, 44] and [24, 768, 8, 22].
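These three shapes follow from simple arithmetic: the patch embedding divides the resolution by 4, and each of the first three stages ends with a downsample that halves the resolution and doubles the channels. The helper below is a hypothetical sketch of that bookkeeping (its name and parameters are not part of mmdet); it assumes each stage's output is taken before its downsample, as in the code above.

```python
def swin_output_shapes(batch, height, width, embed_dim=96, patch_size=4,
                       num_stages=4, out_indices=(1, 2, 3)):
    """Return the (B, C, H, W) shape of every stage output listed in out_indices."""
    shapes = []
    h, w, c = height // patch_size, width // patch_size, embed_dim
    for i in range(num_stages):
        if i in out_indices:
            shapes.append((batch, c, h, w))  # stage output, taken before downsampling
        if i < num_stages - 1:               # the first three stages end with patch merging
            h, w, c = h // 2, w // 2, c * 2
    return shapes

shapes = swin_output_shapes(batch=24, height=256, width=704)
print(shapes)
```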

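(1) Patch Merging

The inputs of the last three stages need Patch Merging, so the downsample module is stored in the first three stages: each of them runs its sequence of SwinBlocks and then performs one downsample, as shown below: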

        for i in range(num_layers):
            # the first three stages hold a downsample operation
            if i < num_layers - 1:
                downsample = PatchMerging(
                    in_channels=in_channels,
                    out_channels=2 * in_channels,
                    stride=strides[i + 1],
                    norm_cfg=norm_cfg if patch_norm else None,
                    init_cfg=None)
            else:
                downsample = None

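This uses torch.nn.Unfold from PyTorch, which extracts sliding local blocks.

torch.nn.Unfold(kernel_size, dilation=1, padding=0, stride=1)

kernel_size: size of the sliding window
stride: stride of the sliding window in the spatial dimensions. Default: 1
padding: implicit zero padding added on both sides of the input. Default: 0
dilation: spacing between kernel elements, as in dilated convolution. Default: 1

One of the downsample modules is printed below. It first extracts 2x2 patches with Unfold, then normalizes with LayerNorm, and finally uses a Linear layer to halve the depth of the concatenated patches (here from 4C = 384 to 2C = 192).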

  (downsample): PatchMerging(
    (adap_padding): AdaptivePadding()
    (sampler): Unfold(kernel_size=(2, 2), dilation=(1, 1), padding=(0, 0), stride=(2, 2))
    (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
    (reduction): Linear(in_features=384, out_features=192, bias=False)
  )
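The printed module can be reproduced shape-for-shape with the same three layers. This is a minimal sketch with freshly initialized weights and a batch of 1; the AdaptivePadding step is omitted because the assumed 64x176 input is already divisible by 2.

```python
import torch
import torch.nn as nn

sampler = nn.Unfold(kernel_size=2, stride=2)
norm = nn.LayerNorm(384)
reduction = nn.Linear(384, 192, bias=False)

x = torch.randn(1, 96, 64, 176)   # (B, C, H, W) feature map of the first stage
cols = sampler(x)                 # (1, 96 * 2 * 2, 32 * 88) = (1, 384, 2816)
tokens = cols.transpose(1, 2)     # (1, 2816, 384): one row per 2x2 patch
merged = reduction(norm(tokens))  # (1, 2816, 192)
print(cols.shape, merged.shape)
```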

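(2) SwinBlock

The figure of the SwinBlock module is repeated here for comparison. A SwinBlock unit consists of two blocks that differ in one respect: one uses W-MSA and the other uses SW-MSA.

[Figure: two successive Swin Transformer blocks, W-MSA followed by SW-MSA]

block1

The structure of the first Swin Transformer block is shown below.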

    (0): SwinBlock(
      (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): ShiftWindowMSA(
        (w_msa): WindowMSA(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (softmax): Softmax(dim=-1)
        )
        (drop): DropPath()
      )
      (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): FFN(
        (activate): GELU()
        (layers): Sequential(
          (0): Sequential(
            (0): Linear(in_features=768, out_features=3072, bias=True)
            (1): GELU()
            (2): Dropout(p=0.0, inplace=False)
          )
          (1): Linear(in_features=3072, out_features=768, bias=True)
          (2): Dropout(p=0.0, inplace=False)
        )
        (dropout_layer): DropPath()
      )
    )

block2

The structure of the second Swin Transformer block is shown below.

    (1): SwinBlock(
      (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): ShiftWindowMSA(
        (w_msa): WindowMSA(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (softmax): Softmax(dim=-1)
        )
        (drop): DropPath()
      )
      (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): FFN(
        (activate): GELU()
        (layers): Sequential(
          (0): Sequential(
            (0): Linear(in_features=768, out_features=3072, bias=True)
            (1): GELU()
            (2): Dropout(p=0.0, inplace=False)
          )
          (1): Linear(in_features=3072, out_features=768, bias=True)
          (2): Dropout(p=0.0, inplace=False)
        )
        (dropout_layer): DropPath()
      )
    )

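2. The neck module

The neck uses GeneralizedLSSFPN. Its input is the backbone output, that is, the 3-element list produced by the Swin Transformer, with sizes [24, 192, 32, 88], [24, 384, 16, 44] and [24, 768, 8, 22].

Because the features output by the Swin Transformer have different sizes, this part uses an FPN-like method to fuse the features of the different branches.

The code flow is shown below, where self.lateral_convs are 2-D convolutions that reduce the channels to 256 and self.fpn_convs are 2-D convolutions that keep the channel count unchanged.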

    def forward(self, inputs):
        """Forward function."""
        # inputs is a list of 3 feature maps
        # upsample -> cat -> conv1x1 -> conv3x3
        assert len(inputs) == len(self.in_channels)

        # build laterals
        # self.start_level = 0, so this step is a no-op here
        laterals = [inputs[i + self.start_level] for i in range(len(inputs))]

        # build top-down path
        used_backbone_levels = len(laterals) - 1
        # fuse adjacent levels pairwise, from the top down
        for i in range(used_backbone_levels - 1, -1, -1):
            # upsample to the size of the next (larger) feature map
            x = F.interpolate(
                laterals[i + 1],
                size=laterals[i].shape[2:],
                **self.upsample_cfg,
            )
            # after upsampling the feature maps have the same size: fuse by concat
            laterals[i] = torch.cat([laterals[i], x], dim=1)
            # reduce the channels to 256
            laterals[i] = self.lateral_convs[i](laterals[i])
            # extract features further with another convolution
            laterals[i] = self.fpn_convs[i](laterals[i])

        # build outputs
        outs = [laterals[i] for i in range(used_backbone_levels)]
        return tuple(outs)
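The channel and shape bookkeeping of this top-down path can be traced with plain convolutions. The sketch below is an approximation under assumptions: it uses bare nn.Conv2d layers instead of the real conv modules (no normalization or activation), assumes bilinear upsampling, and derives the lateral input channels (384+768 and 192+256) from the concat order in the forward code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Backbone outputs for a single image (batch of 1 to keep the sketch light).
c3 = torch.randn(1, 192, 32, 88)
c4 = torch.randn(1, 384, 16, 44)
c5 = torch.randn(1, 768, 8, 22)

lateral_1 = nn.Conv2d(384 + 768, 256, kernel_size=1)   # concat of c4 and upsampled c5
fpn_1 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
lateral_0 = nn.Conv2d(192 + 256, 256, kernel_size=1)   # concat of c3 and upsampled p4
fpn_0 = nn.Conv2d(256, 256, kernel_size=3, padding=1)

up5 = F.interpolate(c5, size=c4.shape[2:], mode="bilinear", align_corners=False)
p4 = fpn_1(lateral_1(torch.cat([c4, up5], dim=1)))     # (1, 256, 16, 44)
up4 = F.interpolate(p4, size=c3.shape[2:], mode="bilinear", align_corners=False)
p3 = fpn_0(lateral_0(torch.cat([c3, up4], dim=1)))     # (1, 256, 32, 88)
print(p3.shape, p4.shape)
```

Note that the coarser level is fused first, so the finest output already contains information from all three backbone levels.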

The neck returns a tuple of 2 elements, [24, 256, 32, 88] and [24, 256, 16, 44]. Only element 0, the [24, 256, 32, 88] tensor, is actually used, because the outer code selects it, as shown below.

        if not isinstance(x, torch.Tensor):
            x = x[0]

        # B = batch size, N = number of cameras
        # [24, 256, 32, 88]
        BN, C, H, W = x.size()
        # [4, 6, 256, 32, 88]
        x = x.view(B, int(BN / B), C, H, W)

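3. The vtransform module

The view transform (VT) part of MIT-BEVFusion is based on LSS but improves BEV pooling (BEV pooling computed in CUDA, interval reduction, and precomputation). vtransform uses DepthLSSTransform.

3.1 Converting point cloud depth into the camera coordinate frames

Initialize the depth in the camera frames: depth = [batch size, number of cameras, 1, 256, 704].
Inverse point cloud augmentation: undo the data augmentation on the point cloud to recover the original points.
lidar2image: project the point cloud onto the 6 cameras, giving the points in each of the 6 camera frames.
Depth data: take the depth values and obtain the perspective 2D coordinates (image-plane coordinates).
Image data augmentation: apply the same augmentation to the projected points as was applied to the images.
Create a boolean mask: after aligning with the image axes, filter out the points outside the image bounds.
Assign depth: write the depth of the points that fall inside each camera's image into depth.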

        batch_size = len(points)
        # initialize the depth map: (batch size, number of cameras, 1, 256, 704)
        depth = torch.zeros(batch_size, img.shape[1], 1, *self.image_size).to(
            points[0].device
        )

        for b in range(batch_size):
            cur_coords = points[b][:, :3]  # point cloud xyz, shape=[nums, 3]
            # 3 transform matrices: the image ones are [6, 4, 4], the lidar one is [4, 4]
            cur_img_aug_matrix = img_aug_matrix[b]  # image augmentation matrix
            cur_lidar_aug_matrix = lidar_aug_matrix[b]  # lidar augmentation matrix
            cur_lidar2image = lidar2image[b]  # lidar-to-image transform matrix

            # inverse aug
            # undo the translation and rotation applied to the lidar data
            # and transpose the lidar coordinates to [3, nums]
            cur_coords -= cur_lidar_aug_matrix[:3, 3]
            cur_coords = torch.inverse(cur_lidar_aug_matrix[:3, :3]).matmul(
                cur_coords.transpose(1, 0)
            )
            # lidar2image
            # project the points into the image frames: data for 6 cameras, [6, 3, nums]
            cur_coords = cur_lidar2image[:, :3, :3].matmul(cur_coords)
            cur_coords += cur_lidar2image[:, :3, 3].reshape(-1, 3, 1)
            # get 2d coords
            dist = cur_coords[:, 2, :]  # distance, i.e. the z component in the camera frame, [6, nums]
            cur_coords[:, 2, :] = torch.clamp(cur_coords[:, 2, :], 1e-5, 1e5)  # clamp the z values
            cur_coords[:, :2, :] /= cur_coords[:, 2:3, :]  # perspective division: x and y divided by z give image-plane coordinates

            # imgaug
            # apply the image augmentation to the projected lidar points
            cur_coords = cur_img_aug_matrix[:, :3, :3].matmul(cur_coords)
            cur_coords += cur_img_aug_matrix[:, :3, 3].reshape(-1, 3, 1)
            cur_coords = cur_coords[:, :2, :].transpose(1, 2)  # keep only the xy components; after the transpose shape=[6, nums, 2]

            # normalize coords for grid sample
            cur_coords = cur_coords[..., [1, 0]]  # align the axes: swap x and y to get (row, col)
            # boolean mask of the points that fall inside the image, shape [6, nums]
            on_img = (
                (cur_coords[..., 0] < self.image_size[0])
                & (cur_coords[..., 0] >= 0)
                & (cur_coords[..., 1] < self.image_size[1])
                & (cur_coords[..., 1] >= 0)
            )
            # process the 6 cameras one by one and assign the depth values
            for c in range(on_img.shape[0]):
                # lidar points of this camera that are inside the mask
                masked_coords = cur_coords[c, on_img[c]].long()
                # distances of this camera's points that are inside the mask
                masked_dist = dist[c, on_img[c]]
                # write the depth at the corresponding pixel positions
                depth[b, c, 0, masked_coords[:, 0], masked_coords[:, 1]] = masked_dist
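3.2 Fusing image and lidar features into a new feature

Let us first look at the LSS part, shown in the figure below. In the grid on the right of the figure, a denotes the softmax probability of one depth bin (of size H * W) and c denotes one channel of the semantic feature. ac is then the element-wise product of these two matrices, which assigns a depth probability to every point of the feature. Broadcasting over all pairs ac gives the feature map of every semantic channel at every depth bin. After training, important features become darker in the figure (their softmax probability is high), while the others fade toward 0.
[Figure: LSS lift step, outer product of the depth distribution and the context features]

Next, let us see how BEVFusion fuses the point cloud feature, which carries real depth information, with the image feature to obtain the depth probability distribution.

The steps are roughly as follows:

Take the depth map obtained from the lidar projection, of size [24, 1, 256, 704], and pass it through dtransform for feature extraction, giving a lidar depth feature of size [24, 64, 32, 88].
Take the image feature, of size [24, 256, 32, 88].
Concatenate the lidar and image features, giving size [24, 320, 32, 88]; pass the fused feature through the convolutions of depthnet to extract features of size [24, 198, 32, 88].
The 198 channels are self.D + self.C. The first self.D = 118 channels go through softmax to predict the depth probability distribution, giving the depth probability weights.
Take the outer product of the self.D part and the self.C part to obtain the new feature. Depth probability * feature turns the 2D feature into a feature in 3D space (top view).

self.D is the same D as in the frustum of Section 3.3 and stores the depth distribution; self.C is the semantic feature of the image.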

    def get_cam_feats(self, x, d):
        '''
        x: image features
        d: depth of the lidar points in image-plane coordinates.
        '''
        # x is the image feature output by the neck, [4, 6, 256, 32, 88]
        B, N, C, fH, fW = x.shape

        # d is the depth map built from lidar: (batch size, number of cameras, 1, 256, 704)
        d = d.view(B * N, *d.shape[2:])  # [24, 1, 256, 704]
        x = x.view(B * N, C, fH, fW)  # [24, 256, 32, 88]

        d = self.dtransform(d)  # three convolutions bring the depth map to the feature resolution, [24, 64, 32, 88]
        x = torch.cat([d, x], dim=1)  # fuse by concat, [24, 320, 32, 88]
        # three convolutions extract features, [24, 198, 32, 88], 198=118+80
        x = self.depthnet(x)

        # softmax => [24, 118, 32, 88]
        # self.D = 118 depth bins; the softmax can be read as a weight for each candidate depth
        depth = x[:, : self.D].softmax(dim=1)
        # [24, 80, 118, 32, 88]: depth probability * feature lifts the 2D feature into 3D space
        x = depth.unsqueeze(1) * x[:, self.D : (self.D + self.C)].unsqueeze(2)  # self.C = 80

        x = x.view(B, N, self.C, self.D, fH, fW)  # [4, 6, 80, 118, 32, 88]
        x = x.permute(0, 1, 3, 4, 5, 2)  # [4, 6, 118, 32, 88, 80]
        return x
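The outer product between the depth distribution and the context feature can be checked on a tiny tensor. This is a minimal sketch with made-up sizes (D=4, C=3, a 2x2 feature map) in place of the real 118, 80 and 32x88; the random `logits` tensor stands in for the depthnet output.

```python
import torch

D, C, H, W = 4, 3, 2, 2                # tiny stand-ins for 118, 80, 32, 88
logits = torch.randn(1, D + C, H, W)   # stand-in for the depthnet output

depth = logits[:, :D].softmax(dim=1)   # (1, D, H, W), sums to 1 over the depth bins
feat = logits[:, D:D + C]              # (1, C, H, W)

# Broadcasted product: every feature channel is weighted by every depth bin.
volume = depth.unsqueeze(1) * feat.unsqueeze(2)  # (1, C, D, H, W)

# Because the depth weights sum to 1 per pixel, summing over depth returns feat.
recovered = volume.sum(dim=2)
print(volume.shape, torch.allclose(recovered, feat, atol=1e-5))
```

The check at the end shows that the lift only redistributes each pixel's feature along the depth axis; it does not change its total.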

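3.3 Creating the frustum point cloud

First create the frustum points in the image coordinate frame, then project them into the lidar frame to obtain the frustum point cloud.

(1) Creating the frustum

Depth ds: starting at 1 m with a step of 0.5 m and an exclusive end of 60 m, 118 values are created (1.0, 1.5, ..., 59.5), then expanded to [118, 32, 88].
xs, the iW values that correspond to fW on the feature map: fW=88 points from 0 to iW-1=703, then expanded to [118, 32, 88].
ys, the iH values that correspond to fH on the feature map: fH=32 points from 0 to iH-1=255, then expanded to [118, 32, 88].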

    def create_frustum(self):
        iH, iW = self.image_size  # [256, 704]
        fH, fW = self.feature_size  # [32, 88]

        # self.dbound = [1.0, 60.0, 0.5]
        # start at 1.0 and take one value every 0.5, 118 values in total
        # expand repeats the values over the 32*88 grid
        ds = (
            torch.arange(*self.dbound, dtype=torch.float)
            .view(-1, 1, 1)
            .expand(-1, fH, fW)
        )
        # ds.shape = [118, 32, 88]
        D, _, _ = ds.shape  # D = 118

        # xs.shape = ys.shape = [118, 32, 88]
        # fW=88 points from 0 to iW-1=703
        xs = (
            torch.linspace(0, iW - 1, fW, dtype=torch.float)
            .view(1, 1, fW)
            .expand(D, fH, fW)
        )
        # fH=32 points from 0 to iH-1=255
        ys = (
            torch.linspace(0, iH - 1, fH, dtype=torch.float)
            .view(1, fH, 1)
            .expand(D, fH, fW)
        )

        frustum = torch.stack((xs, ys, ds), -1)  # [118, 32, 88, 3]
        return nn.Parameter(frustum, requires_grad=False)
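The number of depth bins and the pixel spacing of the frustum grid can be verified directly. This is a small sketch that reuses the dbound, image size and feature size quoted in this article; the variable names are chosen for illustration.

```python
import torch

dbound = [1.0, 60.0, 0.5]
ds = torch.arange(*dbound, dtype=torch.float)  # end value 60.0 is exclusive
xs = torch.linspace(0, 704 - 1, 88)            # pixel columns sampled by the feature map
ys = torch.linspace(0, 256 - 1, 32)            # pixel rows sampled by the feature map

num_bins = ds.numel()
first_depth, last_depth = ds[0].item(), ds[-1].item()
print(num_bins, first_depth, last_depth)       # 118 bins from 1.0 m to 59.5 m
print((xs[1] - xs[0]).item(), (ys[1] - ys[0]).item())  # spacing in pixels
```

The spacing is slightly larger than the stride of 8 because linspace includes both end points.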

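(2) Projecting the frustum points into the lidar frame

Section 3.2 above produced a feature map with depth information. How do we find out which point in 3D space each of these features corresponds to?

The frustum points in the image frame are projected into the lidar frame, giving a frustum point cloud of size [4, 6, 118, 32, 88, 3]: batchsize=4, 6 camera images, 118 depth bins, and a 32*88 feature map. In other words, given the right index, we know the true xyz coordinate in the lidar frame that the index corresponds to.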

    def get_geometry(
        self,
        camera2lidar_rots,
        camera2lidar_trans,
        intrins,
        post_rots,
        post_trans,
        **kwargs,
    ):
        B, N, _ = camera2lidar_trans.shape

        # undo post-transformation
        # B x N x D x H x W x 3
        # undo the effect of data augmentation and preprocessing on the pixels
        points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)

        points = (
            torch.inverse(post_rots)
            .view(B, N, 1, 1, 1, 3, 3)
            .matmul(points.unsqueeze(-1))
        )
        # cam_to_lidar
        # coordinate frame transform
        points = torch.cat(
            (
                points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                points[:, :, :, :, :, 2:3],
            ),
            5,
        )
        combine = camera2lidar_rots.matmul(torch.inverse(intrins))
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
        points += camera2lidar_trans.view(B, N, 1, 1, 1, 3)

        if "extra_rots" in kwargs:
            extra_rots = kwargs["extra_rots"]
            points = (
                extra_rots.view(B, 1, 1, 1, 1, 3, 3)
                .repeat(1, N, 1, 1, 1, 1, 1)
                .matmul(points.unsqueeze(-1))
                .squeeze(-1)
            )
        if "extra_trans" in kwargs:
            extra_trans = kwargs["extra_trans"]
            points += extra_trans.view(B, 1, 1, 1, 1, 3).repeat(1, N, 1, 1, 1, 1)
        # (bs, N, depth, H, W, 3)
        return points
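For a single frustum point, the camera-to-lidar part of get_geometry reduces to three steps: scale the pixel by its depth, apply the inverse intrinsics, then apply the camera-to-lidar rotation and translation. The sketch below is a simplified illustration under assumptions: the helper name `pixel_to_lidar` and all matrix values are hypothetical, and the undo of the image augmentation (post_rots, post_trans) is omitted.

```python
import torch

def pixel_to_lidar(u, v, d, intrinsics, cam2lidar_rot, cam2lidar_trans):
    """Back-project pixel (u, v) at depth d into the lidar frame."""
    pixel = torch.tensor([u * d, v * d, d])          # (u*d, v*d, d), as in get_geometry
    cam_point = torch.inverse(intrinsics) @ pixel    # 3D point in the camera frame
    return cam2lidar_rot @ cam_point + cam2lidar_trans

# Hypothetical calibration: focal length 100 px, principal point (352, 128),
# identity rotation, camera mounted 1.5 units along the third lidar axis.
intrinsics = torch.tensor([
    [100.0, 0.0, 352.0],
    [0.0, 100.0, 128.0],
    [0.0, 0.0, 1.0],
])
rot = torch.eye(3)
trans = torch.tensor([0.0, 0.0, 1.5])

p = pixel_to_lidar(362.0, 133.0, 10.0, intrinsics, rot, trans)
print(p)  # approximately (1.0, 0.5, 11.5)
```

get_geometry does the same for all B x N x D x H x W frustum points at once by folding the rotation and the inverse intrinsics into one matrix.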

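3.4 bev pooling

First, let us look at how BEV pooling differs between LSS and BEVFusion, starting with the comparison figure from the original paper.
[Figure: BEV pooling comparison from the BEVFusion paper]

The BEV pooling steps in LSS are as follows:

First compute an index for every point from the XYZ coordinates of the feature point and its batch. Points with the same index lie in the same grid cell, e.g. in the figure above index_0 => [1, 3], index_1 => [7, -1, -2], index_2 => [4, -3, 6].
Then simply "iterate" over the index values and sum the entries that share an index to complete the pooling.

In the LSS code, points with the same index are grouped by sorting. A prefix sum can then be computed first; wherever the index changes, subtracting the prefix sum at the previous index boundary gives the sum over that interval.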
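The sort-then-prefix-sum idea can be sketched in a few lines. This is an illustrative sketch in the spirit of the LSS approach, not a copy of its code; the function name `cumsum_pool` is an assumption, and the example values are the ones listed for index_0, index_1 and index_2 above, given in shuffled order.

```python
import torch

def cumsum_pool(feats, ranks):
    """Sum feats (P, C) over points that share the same cell index in ranks (P,)."""
    order = ranks.argsort()
    feats, ranks = feats[order], ranks[order]     # group equal indices together
    csum = feats.cumsum(dim=0)                    # prefix sum over the sorted points
    keep = torch.ones(ranks.shape[0], dtype=torch.bool)
    keep[:-1] = ranks[1:] != ranks[:-1]           # True at the last point of each group
    csum = csum[keep]
    # The difference of consecutive group-end prefix sums is the per-group sum.
    pooled = torch.cat([csum[:1], csum[1:] - csum[:-1]])
    return ranks[keep], pooled

feats = torch.tensor([[1.0], [7.0], [4.0], [3.0], [-1.0], [-3.0], [-2.0], [6.0]])
ranks = torch.tensor([0, 1, 2, 0, 1, 2, 1, 2])
cells, pooled = cumsum_pool(feats, ranks)
print(cells.tolist(), pooled.squeeze(1).tolist())  # [0, 1, 2] [4.0, 4.0, 7.0]
```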

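BEVFusion instead uses Interval Reduction: different indices are handled by different threads, and in the actual code this step is accelerated with CUDA.

The depth-weighted image feature [4, 6, 118, 32, 88, 80] and the frustum point cloud [4, 6, 118, 32, 88, 3] go through bev pooling to produce a feature of size [4, 80, 360, 360].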
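The result of BEV pooling (not the CUDA kernel itself) can be reproduced on the CPU with a scatter-add, which is useful as a reference when reasoning about shapes. The sketch below swaps in torch's index_add_ for the interval reduction kernel; the function name `bev_pool_reference`, the 2x3 grid and the point values are all illustrative assumptions, and the batch dimension is left out.

```python
import torch

def bev_pool_reference(feats, coords, grid_hw):
    """Sum point features into BEV cells.

    feats: (P, C) features, coords: (P, 2) integer (row, col) cell of each point.
    Returns a (C, H, W) BEV feature map.
    """
    H, W = grid_hw
    flat = coords[:, 0] * W + coords[:, 1]     # one flat index per BEV cell
    out = torch.zeros(H * W, feats.shape[1])
    out.index_add_(0, flat, feats)             # points in the same cell are summed
    return out.t().reshape(feats.shape[1], H, W)

feats = torch.tensor([[1.0, 10.0], [2.0, 20.0], [5.0, 50.0]])
coords = torch.tensor([[0, 0], [0, 0], [1, 2]])
bev = bev_pool_reference(feats, coords, grid_hw=(2, 3))
print(bev.shape, bev[:, 0, 0].tolist(), bev[:, 1, 2].tolist())
```

In the real pipeline the cell of each point comes from quantizing the frustum point cloud onto the BEV grid, and points outside the grid are discarded before pooling.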
