Object Detection

Transformer-Based Models

DETR (DEtection TRansformer; "End-to-End Object Detection with Transformers", 2020)

Traditional object detection: predict a large set of anchors, then use NMS as post-processing to decide which ones become the final outputs. The NMS step is what keeps the model from being end-to-end.

DETR: casts object detection as a set prediction problem.

Prediction loss: Hungarian loss, which forces a unique matching between predicted and ground-truth boxes (bipartite matching).
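A minimal sketch of the bipartite-matching step using SciPy's Hungarian solver. The cost here is just an L1 distance between toy boxes (illustrative only; DETR's actual matching cost also includes class probabilities and a generalized-IoU term).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy boxes in (x1, y1, x2, y2) format, normalized coordinates
pred_boxes = np.array([[0.20, 0.20, 0.40, 0.40],
                       [0.70, 0.70, 0.90, 0.90],
                       [0.10, 0.50, 0.30, 0.80]])      # N = 3 predictions
gt_boxes   = np.array([[0.68, 0.72, 0.90, 0.88],
                       [0.18, 0.22, 0.42, 0.40]])      # M = 2 ground-truth boxes

# cost[i, j] = L1 distance between prediction i and ground truth j
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Hungarian algorithm returns a unique one-to-one assignment minimizing total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))   # each ground-truth box matched to exactly one prediction
```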

Decoder: uses learned object queries (conceptually similar to anchors) and predicts a set of objects in parallel, in a single pass. (Unlike autoregressive language-model decoders, which emit outputs one at a time, the vision decoder outputs everything at once.)

In DETR, the transformer encoder's input embeddings correspond to the CNN feature map: each embedding position corresponds to one pixel of the feature map, and the embedding's input dimension corresponds to the feature map's output channels.
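A shape-only sketch of how the CNN feature map becomes the encoder's token sequence; the 2048-to-256 projection and the tensor sizes are illustrative (DETR does reduce backbone channels with a 1×1 conv).

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 2048, 25, 34)              # (B, C, H, W) backbone feature map
proj = nn.Conv2d(2048, 256, kernel_size=1)       # reduce channels to the model dimension
tokens = proj(feat).flatten(2).permute(2, 0, 1)  # (H*W, B, 256): one token per feature-map pixel
print(tokens.shape)                              # torch.Size([850, 2, 256])
```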


Deformable DETR

The standard transformer attention is replaced with deformable attention: a hyperparameter fixes how many keys (sampling points) each query may look at, and both the sampling offsets and the attention weights are predicted by the model directly from the query (instead of the dot-product machinery; no keys are needed).

Note: 1. there is no separate key and value; a single representation is used.

2. p_q is the reference point of query z_q on the input feature map (its corresponding location). In the encoder's self-attention this is straightforward, because each query itself comes from the ResNet feature maps, so its own location serves as the reference point; for each object query in the decoder, the reference point is learned by a small network.

Multi-scale deformable attention only adds the bookkeeping of mapping sampling points to scales, so it is not elaborated here; a single-scale sketch follows below.
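A hedged, single-scale, single-head sketch of deformable attention, simplified from the Deformable DETR formulation; the class and parameter names here are illustrative. Offsets and attention weights come straight from the query, and values are sampled with grid_sample at reference point + offset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # offsets and attention weights are predicted directly from the query (no key dot-product)
        self.offset_proj = nn.Linear(dim, n_points * 2)
        self.weight_proj = nn.Linear(dim, n_points)
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)   # one shared representation
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2), normalized (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Nq, _ = query.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat)

        offsets = self.offset_proj(query).reshape(B, Nq, self.n_points, 2)  # pixel-unit offsets
        weights = self.weight_proj(query).softmax(-1)                       # (B, Nq, K)

        # sampling locations = reference point + learned offsets
        locs = ref_points[:, :, None, :] + offsets / offsets.new_tensor([W, H])
        grid = 2.0 * locs - 1.0                                     # grid_sample expects [-1, 1]
        sampled = F.grid_sample(value, grid, align_corners=False)   # (B, C, Nq, K)
        out = (sampled * weights[:, None]).sum(-1).transpose(1, 2)  # (B, Nq, C)
        return self.out_proj(out)

attn = SimpleDeformableAttention(dim=256, n_points=4)
q, ref = torch.randn(2, 300, 256), torch.rand(2, 300, 2)    # e.g. 300 object queries
feat = torch.randn(2, 256, 25, 34)
print(attn(q, ref, feat).shape)                              # torch.Size([2, 300, 256])
```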

BEVFormer

A paradigm: obtain BEV (bird's-eye-view) features from multi-camera inputs, then use them for downstream detection or segmentation tasks.

The task goes from multiple image sensors to BEV. The model trunk resembles DETR's encoder and outputs BEV features, to which heads for different downstream tasks are attached.

1. Temporal Self-Attention (TSA): the learnable queries perform deformable attention against the history BEV features B_{t-1}. Similar to an RNN structure, but without gates.

2. Spatial Cross-Attention (SCA): the intermediate BEV layer in the encoder performs deformable cross-attention over the camera features (extracted by the backbone network).


The BEV acts as the query. For a point p on the 2D BEV grid, lift it into 3D and sample N_{ref} points along the height (vs. standard deformable DETR, which fixes a single reference point); then, for the j-th 3D point, use the camera intrinsics/extrinsics to find its reference point on the i-th camera's feature map, P(p, i, j), and apply deformable attention there (see the sketch below).
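A sketch of that lifting/projection step P(p, i, j), assuming a single 4×4 lidar-to-image matrix per camera (intrinsics @ extrinsics); names such as lidar2img, z_min, z_max, and n_ref are illustrative, not BEVFormer's actual API.

```python
import numpy as np

def bev_point_to_cam_refs(p_xy, lidar2img, z_min=-5.0, z_max=3.0, n_ref=4):
    """p_xy: (2,) BEV point in ego-frame meters; lidar2img: (4, 4) projection matrix
    for one camera. Returns pixel reference points and a visibility mask."""
    zs = np.linspace(z_min, z_max, n_ref)        # lift the BEV cell into a pillar of 3D points
    pts = np.stack([np.full(n_ref, p_xy[0]),
                    np.full(n_ref, p_xy[1]),
                    zs,
                    np.ones(n_ref)], axis=-1)    # (n_ref, 4) homogeneous coordinates
    cam = pts @ lidar2img.T                      # project into the camera
    depth = np.maximum(cam[:, 2:3], 1e-5)
    uv = cam[:, :2] / depth                      # perspective divide -> pixel coordinates
    visible = cam[:, 2] > 1e-5                   # only points in front of the camera are valid
    return uv, visible

uv, vis = bev_point_to_cam_refs(np.array([12.0, 3.5]), np.eye(4))
print(uv.shape, vis)                             # (4, 2) reference points + visibility mask
```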

BEVFormer v2

Backbone: VoVNet is commonly used. Research found that a ConvNeXt-XL backbone vastly outperforms it on ImageNet, yet the gain does not carry over to BEV detection. Reasons: 1) a domain gap between general CV and autonomous driving, since backbones are usually pretrained on 2D tasks; 2) the BEV detector structure is too complex: between the output boxes and the backbone sit the encoder (which generates the BEV features) and the object decoder (subsumed under the head), both consisting of many attention layers.

BEVFusion & TransFusion

Non-Transformer Based Models

Two-stage detectors:

Faster R-CNN


Reference: faster rcnn详解 (CSDN blog)

*Using a 1×1 convolution in the head is standard practice: the feature map size stays the same, and at each position it takes a linear combination of the input channels (see the sketch below).*
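A tiny sketch of that point: a 1×1 convolution leaves H and W unchanged and acts as a per-position linear layer over channels. The channel counts below are illustrative (e.g. an RPN-style classification head).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 38, 50)                   # (B, C_in, H, W)
cls_head = nn.Conv2d(256, 2 * 9, kernel_size=1)   # e.g. 2 classes x 9 anchors per position
print(cls_head(x).shape)                          # torch.Size([1, 18, 38, 50]): H, W unchanged
```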


R-CNN → Fast R-CNN → Faster R-CNN

Feature Pyramid Network

Motivation: using a single high-level feature map (e.g., Faster R-CNN uses the downsampled Conv4 stage) for the subsequent object classification and bounding-box regression has an obvious drawback: small objects carry little pixel information to begin with, and it is very easily lost during downsampling.

High level feature map: spatially coarser, but semantically stronger

Top-down pathway: upsample the coarser, semantically stronger maps and merge them with the finer maps via lateral connections.

Customized FPN example
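A minimal FPN-style neck as a sketch, assuming three backbone stages C3-C5 with ResNet-like channel counts (512, 1024, 2048); the 1×1 lateral convs, nearest-neighbor upsampling, and 3×3 smoothing convs follow the common FPN recipe, but this is illustrative code, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs bring every backbone stage to the same channel count
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth the merged maps
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                     # feats: [C3, C4, C5], fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map and add the lateral connection
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]  # [P3, P4, P5]

fpn = TinyFPN()
c3, c4, c5 = (torch.randn(1, 512, 80, 80),
              torch.randn(1, 1024, 40, 40),
              torch.randn(1, 2048, 20, 20))
print([tuple(p.shape) for p in fpn([c3, c4, c5])])
```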

the backbone-neck-head paradigm

Application in Faster R-CNN

1. For RPN: anchors come in multiple pre-defined scales and aspect ratios in order to cover objects of different shapes. RPN is adapted by replacing the single-scale feature map with the FPN, attaching a head of the same design (a 3×3 conv and two sibling 1×1 convs) to each level of the feature pyramid; each level then only needs anchors of a single scale.

2. For Fast R-CNN: each RoI is assigned to a pyramid level by its size, k = ⌊k0 + log2(√(wh)/224)⌋ with k0 = 4, and RoI pooling is then performed on that level's feature map (see the sketch below).
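A sketch of that level-assignment rule; the k0, k_min, k_max values assume a P2-P5 pyramid and should be treated as assumptions here.

```python
import math

def roi_to_fpn_level(w, h, k0=4, k_min=2, k_max=5):
    # k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the available pyramid levels
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(roi_to_fpn_level(224, 224))   # 4: a canonical ImageNet-sized RoI maps to P4
print(roi_to_fpn_level(64, 64))     # 2: small RoIs go to finer levels
```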

One-stage detectors:

  • One-stage detectors perform object classification and localization in a single forward pass through the network.
  • They are generally faster because they do not require a separate region proposal step.
  • Examples include YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and RetinaNet.

Understanding the one-stage framework: the RPN module is removed; after the backbone outputs the feature map, a class + bbox prediction is made at every position (YOLOv1). A dense-head sketch follows below.
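A dense-head sketch of that idea: a single convolution predicts class scores plus four box values at every feature-map position. The channel counts and the 1×1 kernel here are illustrative (YOLOv1 itself uses fully connected layers over a grid).

```python
import torch
import torch.nn as nn

num_classes = 20
feat = torch.randn(1, 512, 13, 13)                        # backbone output feature map
head = nn.Conv2d(512, num_classes + 4, kernel_size=1)     # per-position class scores + box offsets
out = head(feat)                                          # (1, 24, 13, 13)
cls_logits, bbox = out[:, :num_classes], out[:, num_classes:]
print(cls_logits.shape, bbox.shape)
```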

RetinaNet

An FPN-based model structure (classification and box-regression subnets attached to each pyramid level)

Focal Loss

For each level's feature map in the FPN, a class is predicted at every position; the overwhelming majority of these positions are background, which is the extreme class imbalance focal loss is designed to handle (sketch below).
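A hedged sketch of binary (object vs. background) focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); alpha = 0.25 and gamma = 2 are the commonly cited defaults.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits and targets have the same shape; targets are in {0, 1}
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()          # down-weights easy negatives

logits = torch.randn(8, 100)                      # e.g. 8 images x 100 positions
targets = (torch.rand(8, 100) < 0.05).float()     # mostly background, as in dense detection
print(focal_loss(logits, targets))
```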

YOLO

v2: adds anchor boxes of different aspect ratios

v3: adds FPN-style multi-scale predictions; starts using Darknet-53

Reference: YOLO系列算法全家桶——YOLOv1-YOLOv9详细介绍 (CSDN blog)

Talks

Tesla FSD

Phil Duan 

  • No HD Map or additional sensors
  • AEB (automatic emergency braking)
