Paper reading notes: (2021.10 CoRL) DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Paper link: https://openreview.net/forum?id=xHnJS2GYFDz

Abstract: We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to generate input for 3D object detection from 2D information, our method manipulates predictions directly in 3D space. Our architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to multi-view images using camera transformation matrices. Finally, our model makes a bounding box prediction per object query, using a set-to-set loss to measure the discrepancy between the ground-truth and the prediction. This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model. Moreover, our method does not require post-processing such as non-maximum suppression, dramatically improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.

The code is open-sourced, built on top of mmDetection3D: https://github.com/WangYueFt/detr3d

I. Purpose, Contributions, and Innovations

2D-based methods require heavy post-processing, and pseudo-LiDAR methods are sensitive to depth-estimation errors; this method is proposed to address both issues.

1. A top-down monocular multi-view 3D object detection network that fuses information from multiple views at every stage; the first attempt to cast multi-camera detection as 3D set-to-set prediction.

2. Directly associates 2D features with 3D boxes, avoiding the impact of inaccurate depth estimation: it uses information from multiple cameras by back-projecting 3D information onto all available frames.

3. No NMS, and performs well in regions where camera views overlap. This is really a consequence of point 2: the multi-head attention among queries already models interactions between boxes, and the set-to-set (Hungarian) loss drives the network to produce exactly one best prediction per ground truth.

II. Accuracy

Results are reported mainly on the nuScenes dataset.

III. Implementation

3.1 Feature extraction: a ResNet-FPN backbone extracts image features from each camera.

3.2 Detection head (2D-to-3D feature transformation), where: i is the query index, L is the transformer layer, m is the camera index, and k is the backbone feature level.
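The core of this head can be sketched as follows: each query's 3D reference point is projected into every camera image using that camera's projection matrix, and 2D features are then sampled at the valid projected locations. A minimal NumPy sketch of the projection step, assuming pinhole cameras with known 3×4 projection matrices (function and variable names are my own, not from the released code):

```python
import numpy as np

def project_points(ref_pts, proj_mats):
    """Project N 3D reference points into M camera image planes.

    ref_pts:   (N, 3) 3D points in the ego/world frame.
    proj_mats: (M, 3, 4) camera projection matrices (intrinsics @ extrinsics).
    Returns (M, N, 2) pixel coordinates and an (M, N) validity mask;
    points behind a camera are marked invalid and their features are not sampled.
    """
    n = ref_pts.shape[0]
    homo = np.concatenate([ref_pts, np.ones((n, 1))], axis=1)   # (N, 4) homogeneous
    cam = np.einsum('mij,nj->mni', proj_mats, homo)             # (M, N, 3)
    depth = cam[..., 2]
    valid = depth > 1e-5                                        # in front of the camera
    uv = cam[..., :2] / np.clip(depth[..., None], 1e-5, None)   # perspective divide
    return uv, valid

# toy check: one axis-aligned camera looking down +z
P = np.array([[[1., 0., 0., 0.],
               [0., 1., 0., 0.],
               [0., 0., 1., 0.]]])
pts = np.array([[2., 4., 2.],    # in front of the camera
                [0., 0., -1.]])  # behind the camera
uv, valid = project_points(pts, P)
```

In the actual model the valid (u, v) locations are used to bilinearly sample the FPN feature maps at each level k and camera m, and the sampled features refine the query across transformer layers.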

3.3 Loss: Hungarian (set-to-set) loss; see:

(2015.06 CVPR) End-to-end people detection in crowded scenes

Or see the DETR paper; the overall procedure is roughly:

    1. Pad the set of ground truths to the number of object queries N; there are usually fewer ground truths, so the remainder is filled with empty ("no object") entries.

    2. Compute the N×N cost matrix between the {gt} set and the {object query} set; each cost is a weighted sum of a classification cost and a box IoU cost.

    3. Run weighted Hungarian matching (e.g. the Munkres algorithm) on this cost matrix to compute the optimal bipartite assignment.

    4. Given this assignment, minimize the resulting loss, which is again composed of a classification loss and a box loss.
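The four steps above can be sketched with a brute-force bipartite matcher (for clarity only; real implementations use the Hungarian/Munkres algorithm, e.g. `scipy.optimize.linear_sum_assignment`). All names here are illustrative, not taken from the DETR or DETR3D code:

```python
from itertools import permutations

def match_queries_to_gt(cost, num_queries):
    """Find the minimum-cost assignment of ground truths to object queries.

    cost: num_gt rows, each with num_queries entries (weighted sum of
          classification cost and box cost, i.e. the matrix from step 2).
    Padded "no object" rows have zero cost for every query, so unmatched
    queries contribute nothing; we simply leave them unassigned.
    Returns assign[q] = matched gt index for query q, or -1 for "no object".
    """
    num_gt = len(cost)
    best, best_perm = float('inf'), None
    # try every way of assigning the real gt rows to distinct queries
    for perm in permutations(range(num_queries), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    assign = [-1] * num_queries
    for g, q in enumerate(best_perm):
        assign[q] = g
    return assign

# toy example: 2 ground truths, 4 queries, costs already weighted and summed
cost = [[0.1, 0.9, 0.9, 0.9],
        [0.9, 0.9, 0.1, 0.9]]
assign = match_queries_to_gt(cost, num_queries=4)
```

In step 4, the classification loss for unassigned queries is computed against the "no object" class, while assigned queries get both classification and box regression losses against their matched ground truth.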

IV. Personal Takeaways and Key References

Projecting 3D proposals into 2D + iterating over multiple layers solves the multi-camera problem.

Multi-head attention among queries + the set-to-set loss eliminates NMS.

Impressive!
