ECCV 2022: PETR, a Simple and Elegant Solution for Multi-View 3D Object Detection


advanced_shq


Tags: object detection, deep learning, computer vision

Article link: https://blog.csdn.net/sq_wu/article/details/131838997


Paper title: PETR: Position Embedding Transformation for Multi-View 3D Object Detection

Paper link: https://arxiv.org/abs/2203.05625

Code link: https://github.com/megvii-research/PETR

Annotated code: https://github.com/qshuiqing/PETR

Introduction


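In DETR, object queries interact with 2D features to produce 2D predictions. In DETR3D, object queries predict reference points, which are projected into 2D image space to sample features; the sampled features then interact with the object queries to produce 3D predictions. In PETR, 3D position embeddings (3D PE) are fused with the 2D features to form a 3D position-aware representation, which then interacts with the queries to produce 3D predictions.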

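Contributions of the paper:

A simple and elegant framework for multi-view 3D object detection
A new 3D position-aware representation for multi-view 3D object detection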

Network Architecture


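Core pipeline:

backbone : extracts 2D image features
3D Coordinates Generator : transforms points in the camera frustum space into the 3D world coordinate system
3D Position Encoder : combines the 3D position information with the 2D image features to produce 3D position-aware features
Query Generator : generates the initial queries
Decoder : iteratively updates the queries, which are used for the final prediction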

3D Coordinates Generator

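Points in the camera frustum space are transformed into 3D space.

First, the camera frustum space is discretized into a grid of size $(W_F, H_F, D)$. Each grid point is written as $p_j^m = (u_j \times d_j, v_j \times d_j, d_j, 1)^T$, where $(u_j, v_j)$ is a pixel coordinate in the image and $d_j$ is a depth value along the axis orthogonal to the image plane.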

petr_head.py

coords_h = torch.arange(H, device=img_feats[0].device).float() * pad_h / H
coords_w = torch.arange(W, device=img_feats[0].device).float() * pad_w / W
# Sample 64 points along the depth axis with linear-increasing discretization (LID), following CaDDN
index  = torch.arange(start=0, end=self.depth_num, step=1, device=img_feats[0].device).float()
index_1 = index + 1
bin_size = (self.position_range[3] - self.depth_start) / (self.depth_num * (1 + self.depth_num))
coords_d = self.depth_start + bin_size * index * index_1

coords = torch.stack(torch.meshgrid([coords_w, coords_h, coords_d])).permute(1, 2, 3, 0) # W, H, D, 3
coords = torch.cat((coords, torch.ones_like(coords[..., :1])), -1) # torch.Size([88, 32, 64, 4]); homogeneous coordinate for the later matrix transform
coords[..., :2] = coords[..., :2] * torch.maximum(coords[..., 2:3], torch.ones_like(coords[..., 2:3])*eps)
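To make the depth sampling above concrete, here is a minimal pure-Python sketch of the same LID formula. The function name `lid_depths` is hypothetical, and the default values (`depth_start=1.0`, a maximum depth of `61.2` standing in for `position_range[3]`) are assumptions chosen for illustration; only `depth_num=64` comes from the comment in the excerpt.

```python
def lid_depths(depth_start=1.0, depth_max=61.2, depth_num=64):
    """Depth bin positions under linear-increasing discretization (LID).

    Mirrors the excerpt: d_k = depth_start + bin_size * k * (k + 1).
    The default values are assumptions for illustration.
    """
    bin_size = (depth_max - depth_start) / (depth_num * (1 + depth_num))
    return [depth_start + bin_size * k * (k + 1) for k in range(depth_num)]


depths = lid_depths()
# Spacing between consecutive samples: 2 * bin_size * (k + 1), i.e. it grows linearly.
gaps = [b - a for a, b in zip(depths, depths[1:])]
print(depths[0], depths[-1])
print(gaps[0], gaps[-1])
```

The gap between consecutive samples grows linearly with the bin index, so nearby depths are sampled densely and far depths sparsely.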

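Each frustum point is then mapped to 3D space with the inverse of the view's transformation matrix, $p_{i,j}^{3d} = K_i^{-1}p_j^m$, which yields the 3D coordinates $p_{i,j}^{3d} = (x_{i,j}, y_{i,j}, z_{i,j}, 1)^T$. Here $K_i$ is the $4\times4$ matrix of view $i$ that maps 3D world space to the camera frustum space, so it combines the camera intrinsics and extrinsics rather than the extrinsics alone; in the code its inverse is `img2lidars`.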

petr_head.py

coords3d = torch.matmul(img2lidars, coords).squeeze(-1)[..., :3]
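The following is a small worked example of the transform $K^{-1}p^m$ for a single point, written in plain Python. It assumes a hypothetical pinhole camera with identity extrinsics (so the 3D frame coincides with the camera frame); the values of `f`, `cx`, `cy` and the helper names are illustrative and do not come from the PETR code.

```python
def matvec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a length-4 vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


# Hypothetical pinhole camera: focal length f, principal point (cx, cy),
# identity extrinsics. These numbers are assumptions for illustration.
f, cx, cy = 1000.0, 800.0, 450.0
K = [[f, 0.0, cx, 0.0],
     [0.0, f, cy, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
# Closed-form inverse of K for this special case.
K_inv = [[1.0 / f, 0.0, -cx / f, 0.0],
         [0.0, 1.0 / f, -cy / f, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]


def frustum_to_3d(u, v, d):
    """Lift pixel (u, v) at depth d: build (u*d, v*d, d, 1) and apply K^-1."""
    return matvec(K_inv, [u * d, v * d, d, 1.0])


point = [2.0, -1.0, 10.0, 1.0]
ud, vd, d, _ = matvec(K, point)   # forward projection gives (u*d, v*d, d, 1)
u, v = ud / d, vd / d             # pixel coordinates after dividing by depth
recovered = frustum_to_3d(u, v, d)
print(u, v, recovered)
```

Multiplying the pixel coordinates by the depth before applying the inverse matrix is what makes the round trip exact; this is the same role the `coords[..., :2] * depth` line plays in the excerpt.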

Finally, the 3D coordinates are normalized.

# Normalize the 3D world-space coordinates to [0, 1]
coords3d[..., 0:1] = (coords3d[..., 0:1] - self.position_range[0]) / (self.position_range[3] - self.position_range[0])
coords3d[..., 1:2] = (coords3d[..., 1:2] - self.position_range[1]) / (self.position_range[4] - self.position_range[1])
coords3d[..., 2:3] = (coords3d[..., 2:3] - self.position_range[2]) / (self.position_range[5] - self.position_range[2])

$\left\{ \begin{matrix} x_{i,j} = (x_{i,j} - x_{min}) / (x_{max} - x_{min}) \\ y_{i,j} = (y_{i,j} - y_{min}) / (y_{max} - y_{min}) \\ z_{i,j} = (z_{i,j} - z_{min}) / (z_{max} - z_{min}) \end{matrix} \right.$
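As a quick numeric check of the normalization, here is a minimal sketch in plain Python. The range values in `POSITION_RANGE` and the helper name `normalize_point` are assumptions for illustration; the layout `[x_min, y_min, z_min, x_max, y_max, z_max]` follows the indexing used in the excerpt.

```python
# Assumed example range, laid out as [x_min, y_min, z_min, x_max, y_max, z_max].
POSITION_RANGE = [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0]


def normalize_point(p, r=POSITION_RANGE):
    """Min-max normalize an (x, y, z) point with the region of interest r."""
    return [(p[i] - r[i]) / (r[i + 3] - r[i]) for i in range(3)]


print(normalize_point([0.0, 0.0, 0.0]))       # centre of the range
print(normalize_point([-61.2, -61.2, -10.0]))  # minimum corner
print(normalize_point([61.2, 61.2, 10.0]))     # maximum corner
```

Points outside the range map to values outside [0, 1]; this sketch does not clamp or mask them.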

3D Position Encoder

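The 3D position encoder associates the 2D image features $F^{2d} = \{F_i^{2d} \in R^{C\times H_F\times W_F},i=1,2,...,N\}$ with the 3D position information to produce the 3D position-aware features $F^{3d} = \{F_i^{3d} \in R^{C\times H_F\times W_F},i=1,2,...,N\}$ :
$F_i^{3d} = \psi(F_i^{2d}, P_i^{3d}), i = 1, 2, ...,N$
Here $\psi$ denotes the position encoder and $P_i^{3d}$ the 3D coordinates of view $i$.

The 3D coordinates are passed through 1x1 convolutions to produce the 3D positional embedding (3D PE).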

petr_head.py

# In __init__: two 1x1 convolutions map the coordinates to the embedding width
self.position_encoder = nn.Sequential( # 192 -> 1024 -> 256
                nn.Conv2d(self.position_dim, self.embed_dims*4, kernel_size=1, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(self.embed_dims*4, self.embed_dims, kernel_size=1, stride=1, padding=0),
            )
# In the forward pass: fold the depth and xyz axes into channels, torch.Size([6, 192, 32, 88])
coords3d = coords3d.permute(0, 1, 4, 5, 3, 2).contiguous().view(B*N, -1, H, W)
coords3d = inverse_sigmoid(coords3d) # torch.Size([6, 192, 32, 88])
# Convolve the 3D coordinates to produce the 3D position embedding, torch.Size([6, 256, 32, 88])
coords_position_embeding = self.position_encoder(coords3d)
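The excerpt calls `inverse_sigmoid` on the normalized coordinates without showing it. A common definition in DETR-style code is a clamped logit; the scalar sketch below assumes that behavior and an `eps` of `1e-5`, both of which are assumptions rather than a copy of the repository's helper.

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Clamped logit: log(x / (1 - x)), with x limited to [0, 1].

    Both numerator and denominator are floored at eps so that the
    endpoints 0 and 1 stay finite. eps is an assumed value.
    """
    x = min(max(x, 0.0), 1.0)
    return math.log(max(x, eps) / max(1.0 - x, eps))


def sigmoid(z):
    """Logistic function, used here only to check the round trip."""
    return 1.0 / (1.0 + math.exp(-z))


print(inverse_sigmoid(0.5))
print(sigmoid(inverse_sigmoid(0.3)))
print(inverse_sigmoid(0.0), inverse_sigmoid(1.0))
```

Applying the logit spreads normalized coordinates in [0, 1] over an unbounded range before they enter the convolutions.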

The image features are passed through a 1x1 convolution.

petr_head.py

self.input_proj = Conv2d( # 256 -> 256
                self.in_channels, self.embed_dims, kernel_size=1)
# Apply the 1x1 convolution to the image features
x = self.input_proj(x.flatten(0, 1))

The 3D positional embedding is added to the convolved image features to obtain the 3D position-aware features.

petr_transformer.py

# key - x; key_pos - coords_position_embedding
key = key + key_pos


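The authors also analyze the 3D position embedding: they randomly select the PEs at three points in the front view and compute the similarity between each of them and the PEs of all views. The resulting visualization shows that regions close to the selected points have higher similarity.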

Query Generator and Decoder

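To ease the convergence difficulty in 3D scenes, the authors first initialize a set of learnable anchor points in 3D space, drawn from a uniform distribution on [0, 1]. The anchor coordinates are then fed into a small multi-layer perceptron (MLP) to generate the initial object queries $Q_0$ .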

self.query_embedding = nn.Sequential( # 384 -> 256 -> 256
            nn.Linear(self.embed_dims*3//2, self.embed_dims),
            nn.ReLU(),
            nn.Linear(self.embed_dims, self.embed_dims),
        )
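# pos2posemb3d encodes the learnable anchor points (reference_points) as sinusoidal embeddings
# query_embedding is the MLP that maps these embeddings to the initial object queries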
query_embeds = self.query_embedding(pos2posemb3d(reference_points))
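To show why the MLP input width is 384, here is a minimal pure-Python sketch of a sinusoidal embedding for one 3D anchor point. The function `sincos_embed_3d` is hypothetical: it is written in the spirit of `pos2posemb3d`, but the feature count per axis (128), the temperature (10000), the 2π scaling and the x, y, z ordering are assumptions and may differ from the repository.

```python
import math


def sincos_embed_3d(point, num_feats=128, temperature=10000.0):
    """Sinusoidal embedding of a normalized 3D point.

    Returns 3 * num_feats values (384 for the assumed default), with
    sine in even slots and cosine in odd slots for each axis.
    """
    embedding = []
    for coord in point:
        angle = coord * 2.0 * math.pi
        for i in range(num_feats):
            # Pairs of slots (2k, 2k+1) share one frequency.
            freq = temperature ** (2 * (i // 2) / num_feats)
            value = angle / freq
            embedding.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return embedding


emb = sincos_embed_3d([0.25, 0.5, 0.75])
print(len(emb))  # 3 axes * 128 features = 384
```

With `embed_dims = 256`, the first linear layer in the excerpt takes `embed_dims*3//2 = 384` inputs, which matches the length of this embedding.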

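The decoder has 6 layers. In each layer, the object queries interact with the 3D position-aware features through multi-head attention and a feed-forward network (FFN). After 6 iterations, the updated object queries carry a high-level representation and can be used to predict the corresponding objects.
$Q_l = \mathit\Omega _l(F^{3d}, Q_{l-1}), \ l = 1,...,L$
$Q_l \in R^{M\times C}$ : the object queries updated by layer $l$ ; $M$ and $C$ are the number of queries and the number of channels

$\mathit \Omega _l$ : the $l$ -th decoder layer; $L$ is the number of decoder layers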
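The recurrence $Q_l = \Omega_l(F^{3d}, Q_{l-1})$ can be illustrated with a deliberately simplified toy in plain Python. This is not the PETR decoder: it uses a single attention head, no learned weights, no self-attention, no FFN and no layer normalization, and all function names and input values are hypothetical. Keys are the features plus the position embedding; using the features without the embedding as values is a choice made for this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention with a residual connection."""
    scale = math.sqrt(len(queries[0]))
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(q))]
        updated.append([a + b for a, b in zip(q, attended)])
    return updated


def decode(queries, feats_2d, pos_3d, num_layers=6):
    """Apply the toy layer num_layers times and keep every intermediate Q_l."""
    # Position-aware keys: image feature plus 3D position embedding.
    keys = [[f + p for f, p in zip(fr, pr)] for fr, pr in zip(feats_2d, pos_3d)]
    outputs = []
    for _ in range(num_layers):
        queries = cross_attend(queries, keys, feats_2d)
        outputs.append(queries)
    return outputs


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy image features
pos = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]     # toy 3D position embeddings
q0 = [[0.0, 0.0], [1.0, -1.0]]                 # toy initial queries Q_0
per_layer = decode(q0, feats, pos)
print(len(per_layer), per_layer[0][0])
```

Each pass consumes the previous queries and the same position-aware features, which is the structure of the recurrence; in a real decoder every layer has its own learned parameters.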

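Conclusion

The paper presents a simple and elegant solution for multi-view 3D object detection. Through 3D coordinates generation and position encoding, 2D features are transformed into a 3D position-aware feature representation. This 3D representation can be plugged directly into a query-based DETR architecture to achieve end-to-end detection.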