Lift-Splat-Shot (LSS), the Pioneering Work in BEV Perception: An In-Depth Explanation

Latest recommended article published 2025-03-18 01:14:57

shuaishuaideyuzi

Category: Introduction to 3D Vision. Tags: artificial intelligence, python, pytorch, 3d, computer vision

Article link: https://blog.csdn.net/shyr_sheyu/article/details/145539249

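In autonomous-driving perception systems, converting multi-view images into a bird's-eye view (BEV) is a key step. Lift-Splat-Shot (LSS) is an efficient view-transformation method that converts perspective-view features into BEV space, which supports downstream tasks such as 3D object detection. This article explains how LSS works, its technical details, and where it is applied.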

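I. LSS Overview

LSS (published as "Lift, Splat, Shoot"; written Lift-Splat-Shot in this article) was proposed by Jonah Philion and Sanja Fidler at ECCV 2020 as a view-transformation method for autonomous-driving perception. Through three main steps (Lift, Splat and Shot), it maps multi-view image features into a unified bird's-eye-view feature space that downstream tasks such as 3D object detection can build on. Note that in the original paper the third step, "Shoot", refers to motion planning on the BEV output; this article follows a common simplified reading in which the last step is the collapse of the height dimension.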

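1.1 The Core Idea of LSS

The main goal of LSS is to address the limitations of conventional camera-based 3D perception, such as perspective distortion and the difficulty of aligning multi-view and multi-modal data in a common frame. By building a unified BEV feature representation, LSS keeps the geometry consistent across cameras while capturing more of the structure of a complex scene.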

II. Technical Implementation of LSS

The LSS pipeline can be divided into three stages: Lift, Splat and Shot. Each stage has its own role and implementation.

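2.1 Lift

Function: predict a depth distribution for each pixel.

Implementation:

Input: multi-view image features.
Output: a probability distribution over depth for each pixel.

Specifically, the Lift stage uses a depth head to predict, for every pixel of the feature map, a distribution over depth rather than a single depth value. The depth range is discretized into bins (for example, 112 bins) and the prediction is treated as classification over those bins. A categorical distribution can express ambiguity by placing probability on several plausible depths, which a single regressed value cannot. Each pixel's feature vector is then weighted by its depth probabilities, so one pixel contributes a feature at every candidate depth along its camera ray. A simplified code sketch: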

import torch.nn as nn
import torch.nn.functional as F


def build_depth_net(in_channels, num_depth_bins):
    """Create the depth head once so its weights persist across calls."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, num_depth_bins, kernel_size=1),
    )


def lift(image_features, depth_net):
    """
    Predict depth distribution for each pixel.

    Args:
        image_features (Tensor): (B, C, H, W) feature maps from the image encoder.
        depth_net (nn.Module): Depth head, e.g. the one returned by build_depth_net.

    Returns:
        depth_distributions (Tensor): (B, D, H, W) probabilities over D depth bins.
    """
    depth_logits = depth_net(image_features)
    # Softmax over the bin axis turns the logits into a per-pixel distribution.
    depth_distributions = F.softmax(depth_logits, dim=1)
    return depth_distributions
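
To make the discretization concrete, the sketch below builds uniformly spaced depth bins and lifts a single pixel by taking the outer product of its depth distribution and its feature vector. It is a dependency-free illustration rather than a reference implementation: the helper names (`make_depth_bins`, `softmax`, `lift_pixel`), the 4 m to 45 m range with 1 m steps, and the sample logits are all assumptions chosen for the example.

```python
import math


def make_depth_bins(d_min, d_max, step):
    """Return the lower edge of each uniformly spaced depth bin in [d_min, d_max)."""
    num_bins = int(round((d_max - d_min) / step))
    return [d_min + i * step for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_logits, feature):
    """Outer product of a depth distribution (D,) and a feature vector (C,).

    Returns a D x C nested list: one depth-weighted copy of the feature per bin.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# Hypothetical setup: 41 bins, with the network confident about bin 10.
bins = make_depth_bins(4.0, 45.0, 1.0)
logits = [0.0] * len(bins)
logits[10] = 5.0
frustum = lift_pixel(logits, feature=[1.0, -2.0, 0.5])

# The bin that receives the largest share of the feature.
best_bin = max(range(len(bins)), key=lambda i: frustum[i][0])
```

Because the probabilities sum to one, the feature is spread along the ray rather than duplicated: summing the depth-weighted copies over all bins gives back the original feature vector.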

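2.2 Splat

Function: project the surround-view camera features extracted by the backbone onto the BEV grid, using the depth computed in Lift together with the camera intrinsics and extrinsics.

Implementation:

Input: the depth distribution of each pixel and its corresponding perspective-view feature.
Output: the feature representation on the BEV grid.

The Splat stage uses voxel pooling to project perspective-view features onto the BEV grid. Concretely, each pixel's feature is distributed, according to its depth distribution, to the BEV cells that its candidate 3D points fall into. Where several points land in the same cell, for example in regions covered by overlapping cameras, their depth-weighted features are aggregated by sum pooling.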
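
The geometry behind this projection can be traced for a single pixel. The sketch below assumes a pinhole camera, a camera frame with x right, y down and z forward, and an ego frame with x forward, y left and z up; the function names, the intrinsics, the mounting offset and the grid layout are hypothetical values picked for illustration.

```python
import math


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_ego(point, rotation, translation):
    """Apply a rigid camera-to-ego transform: p_ego = R @ p_cam + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def ego_to_bev_cell(point, x_min, y_min, cell_size):
    """Map ego-frame (x, y) to integer BEV cell indices; the height z is dropped."""
    row = math.floor((point[0] - x_min) / cell_size)
    col = math.floor((point[1] - y_min) / cell_size)
    return (row, col)


# Assumed front camera: ego_x = cam_z, ego_y = -cam_x, ego_z = -cam_y.
R = [[0.0, 0.0, 1.0],
     [-1.0, 0.0, 0.0],
     [0.0, -1.0, 0.0]]
t = [1.5, 0.0, 1.6]  # hypothetical mounting position in metres

# A pixel at the principal point, 10 m away.
p_cam = pixel_to_camera(u=320, v=240, depth=10.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_ego = camera_to_ego(p_cam, R, t)
cell = ego_to_bev_cell(p_ego, x_min=-50.0, y_min=-50.0, cell_size=0.5)

# A pixel 100 columns to the right at the same depth lands further to the ego's right.
p_cam_right = pixel_to_camera(u=420, v=240, depth=10.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
cell_right = ego_to_bev_cell(camera_to_ego(p_cam_right, R, t), -50.0, -50.0, 0.5)
```

In a full pipeline the same computation is repeated for every pixel and every depth bin, which is why practical implementations precompute the frustum coordinates once per camera configuration.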

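import torch


def splat(depth_distributions, image_features, cam_intrinsics, cam_extrinsics,
          bev_grid, depth_values):
    """
    Project perspective view features to BEV grid (one camera per sample).

    Args:
        depth_distributions (Tensor): (B, D, H, W) depth distributions for each pixel.
        image_features (Tensor): (B, C, H, W) feature maps from the image encoder.
        cam_intrinsics (Tensor): (B, 3, 3) intrinsics at feature-map resolution.
        cam_extrinsics (Tensor): (B, 4, 4) camera-to-ego transforms.
        bev_grid (tuple): Assumed layout (x_min, y_min, cell_size, bev_height, bev_width).
        depth_values (Tensor): (D,) metric depth of each bin (added argument).

    Returns:
        bev_features (Tensor): (B, C, bev_height, bev_width) features in BEV space.
    """
    B, D, H, W = depth_distributions.shape
    C = image_features.shape[1]
    x_min, y_min, cell_size, bev_height, bev_width = bev_grid
    dev, dtype = image_features.device, image_features.dtype
    depth_values = depth_values.to(device=dev, dtype=dtype)

    # Homogeneous pixel coordinates (u, v, 1) for every feature-map location.
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dtype),
                          torch.arange(W, device=dev, dtype=dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)  # (3, H*W)

    bev_features = torch.zeros(B, C, bev_height * bev_width, device=dev, dtype=dtype)
    for b in range(B):
        # Back-project each pixel to one 3D point per depth bin, then move to the ego frame.
        rays = torch.linalg.inv(cam_intrinsics[b]) @ pix             # (3, H*W)
        pts_cam = depth_values.view(D, 1, 1) * rays.unsqueeze(0)     # (D, 3, H*W)
        R, t = cam_extrinsics[b, :3, :3], cam_extrinsics[b, :3, 3]
        pts_ego = R @ pts_cam + t.view(1, 3, 1)                      # (D, 3, H*W)

        # Discretize ego-frame x/y into BEV cell indices; points off the grid are dropped.
        ix = torch.floor((pts_ego[:, 0] - x_min) / cell_size).long()
        iy = torch.floor((pts_ego[:, 1] - y_min) / cell_size).long()
        valid = (ix >= 0) & (ix < bev_height) & (iy >= 0) & (iy < bev_width)

        # Depth-weighted features, then sum pooling into the flattened grid.
        feats = (depth_distributions[b].reshape(D, 1, H * W)
                 * image_features[b].reshape(1, C, H * W))           # (D, C, H*W)
        feats = feats.permute(1, 0, 2)[:, valid]                     # (C, N)
        flat_idx = (ix * bev_width + iy)[valid]                      # (N,)
        bev_features[b].index_add_(1, flat_idx, feats)

    return bev_features.view(B, C, bev_height, bev_width)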
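
The aggregation step alone can be shown on a toy example. The sketch below accumulates depth-weighted features per BEV cell with a plain dictionary; `sum_pool` and the sample points are hypothetical and only illustrate that contributions falling into the same cell are summed.

```python
def sum_pool(points, num_channels):
    """Accumulate depth-weighted features per BEV cell.

    points: iterable of (cell, weight, feature), where cell is a (row, col)
    tuple, weight is a depth probability and feature is a list of floats.
    """
    pooled = {}
    for cell, weight, feature in points:
        acc = pooled.setdefault(cell, [0.0] * num_channels)
        for c in range(num_channels):
            acc[c] += weight * feature[c]
    return pooled


points = [
    ((3, 4), 0.75, [1.0, 2.0]),  # pixel A, one of its depth bins
    ((3, 4), 0.5, [2.0, 0.0]),   # pixel B lands in the same cell
    ((0, 1), 0.25, [4.0, 4.0]),  # pixel C lands elsewhere
]
pooled = sum_pool(points, num_channels=2)
```

Cells that receive no points never appear in the dictionary, which corresponds to the zero-initialized entries of a dense BEV tensor.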

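2.3 Shot

Function: compress along the height dimension to form 2D BEV features.

Implementation:

Input: the feature representation on the BEV grid, kept as a voxel grid with an explicit height axis.
Output: the compressed 2D BEV features.

The Shot stage collapses the height dimension of the voxel grid to obtain the final 2D BEV features. This reduces computation and makes the downstream 3D detection task more efficient. The step only applies when Splat keeps a height axis; if Splat already pools directly into a 2D grid, as the splat function in section 2.2 does, there is no height axis left to reduce.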

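import torch


def shot(bev_features):
    """
    Compress BEV features along the height dimension.

    Args:
        bev_features (Tensor): (B, C, Z, X, Y) voxel features, where Z is the
            vertical axis.

    Returns:
        compressed_bev_features (Tensor): (B, C, X, Y) compressed 2D BEV features.
    """
    if bev_features.dim() != 5:
        # A 4D (B, C, X, Y) input has no height axis; dim=2 would be a BEV axis.
        raise ValueError("expected a (B, C, Z, X, Y) tensor")
    # Example implementation using max pooling over Z; summing is another option.
    compressed_bev_features = torch.max(bev_features, dim=2).values
    return compressed_bev_features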
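
The effect of the reduction can be checked by hand on a tiny grid. The sketch below collapses a single-channel voxel grid stored as nested lists, once with max and once with sum; `collapse_height` and the sample values are hypothetical.

```python
def collapse_height(voxels, reduce=max):
    """Reduce voxels[z][x][y] over z and return bev[x][y]."""
    num_z = len(voxels)
    num_x = len(voxels[0])
    num_y = len(voxels[0][0])
    return [
        [reduce(voxels[z][x][y] for z in range(num_z)) for y in range(num_y)]
        for x in range(num_x)
    ]


# Two height slices of a 2 x 2 BEV grid.
voxels = [
    [[1.0, 0.0], [0.0, 2.0]],  # z = 0
    [[0.5, 3.0], [0.0, 1.0]],  # z = 1
]
bev_max = collapse_height(voxels, reduce=max)
bev_sum = collapse_height(voxels, reduce=sum)
```

Max keeps only the strongest response in each vertical column, while sum keeps the contribution of every height slice; which one works better is a design choice that depends on the downstream head.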