Object Detection

Transformer-Based Models

DETR (DEtection TRansformer; "End-to-End Object Detection with Transformers", 2020)

Traditional object detection: predict a large set of anchors, then use NMS as post-processing to decide which ones become the final outputs. The NMS step is what keeps the model from being end-to-end.

DETR: casts object detection as a set prediction problem.

Prediction loss: Hungarian loss, which forces a unique matching between predicted and ground-truth boxes (bipartite matching).
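A minimal sketch of the bipartite-matching step using SciPy's Hungarian solver. The cost here is just an L1 distance between toy boxes (illustrative only; DETR's actual matching cost also includes class probabilities and a generalized-IoU term).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy boxes in (x1, y1, x2, y2) format, normalized coordinates
pred_boxes = np.array([[0.20, 0.20, 0.40, 0.40],
                       [0.70, 0.70, 0.90, 0.90],
                       [0.10, 0.50, 0.30, 0.80]])      # N = 3 predictions
gt_boxes   = np.array([[0.68, 0.72, 0.90, 0.88],
                       [0.18, 0.22, 0.42, 0.40]])      # M = 2 ground-truth boxes

# cost[i, j] = L1 distance between prediction i and ground truth j
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Hungarian algorithm returns a unique one-to-one assignment minimizing total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))   # each ground-truth box matched to exactly one prediction
```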

Decoder: uses learned object queries (conceptually similar to anchors) and predicts a set of objects in parallel, in a single pass. (Unlike autoregressive language-model decoders, which emit outputs one at a time, the vision decoder outputs everything at once.)

In DETR, the transformer encoder's input embeddings correspond to the CNN feature map: each embedding position corresponds to one pixel of the feature map, and the embedding's input dimension corresponds to the feature map's output channels.
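A shape-only sketch of how the CNN feature map becomes the encoder's token sequence; the 2048-to-256 projection and the tensor sizes are illustrative (DETR does reduce backbone channels with a 1×1 conv).

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 2048, 25, 34)              # (B, C, H, W) backbone feature map
proj = nn.Conv2d(2048, 256, kernel_size=1)       # reduce channels to the model dimension
tokens = proj(feat).flatten(2).permute(2, 0, 1)  # (H*W, B, 256): one token per feature-map pixel
print(tokens.shape)                              # torch.Size([850, 2, 256])
```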


Deformable DETR

The standard transformer attention is replaced with deformable attention: a hyperparameter fixes how many keys (sampling points) each query may look at, and both the sampling offsets and the attention weights are predicted by the model directly from the query (instead of the dot-product machinery; no keys are needed).

Note: 1. there is no separate key and value; a single representation is used.

2. p_q is the reference point of query z_q on the input feature map (its corresponding location). In the encoder's self-attention this is straightforward, because each query itself comes from the ResNet feature maps, so its own location serves as the reference point; for each object query in the decoder, the reference point is learned by a small network.

Multi-scale deformable attention only adds the bookkeeping of mapping sampling points to scales, so it is not elaborated here; a single-scale sketch follows below.
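A hedged, single-scale, single-head sketch of deformable attention, simplified from the Deformable DETR formulation; the class and parameter names here are illustrative. Offsets and attention weights come straight from the query, and values are sampled with grid_sample at reference point + offset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # offsets and attention weights are predicted directly from the query (no key dot-product)
        self.offset_proj = nn.Linear(dim, n_points * 2)
        self.weight_proj = nn.Linear(dim, n_points)
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)   # one shared representation
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2), normalized (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Nq, _ = query.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat)

        offsets = self.offset_proj(query).reshape(B, Nq, self.n_points, 2)  # pixel-unit offsets
        weights = self.weight_proj(query).softmax(-1)                       # (B, Nq, K)

        # sampling locations = reference point + learned offsets
        locs = ref_points[:, :, None, :] + offsets / offsets.new_tensor([W, H])
        grid = 2.0 * locs - 1.0                                     # grid_sample expects [-1, 1]
        sampled = F.grid_sample(value, grid, align_corners=False)   # (B, C, Nq, K)
        out = (sampled * weights[:, None]).sum(-1).transpose(1, 2)  # (B, Nq, C)
        return self.out_proj(out)

attn = SimpleDeformableAttention(dim=256, n_points=4)
q, ref = torch.randn(2, 300, 256), torch.rand(2, 300, 2)    # e.g. 300 object queries
feat = torch.randn(2, 256, 25, 34)
print(attn(q, ref, feat).shape)                              # torch.Size([2, 300, 256])
```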

BEVFormer

A paradigm: obtain BEV (bird's-eye-view) features from multi-camera inputs, then use them for downstream detection or segmentation tasks.

The task goes from multiple image sensors to BEV. The model trunk resembles DETR's encoder and outputs BEV features, to which heads for different downstream tasks are attached.

1. Temporal Self-Attention (TSA): the learnable queries perform deformable attention against the history BEV features B_{t-1}. Similar to an RNN structure, but without gates.

2. Spatial Cross-Attention (SCA): the intermediate BEV layer in the encoder performs deformable cross-attention over the camera features (extracted by the backbone network).


The BEV acts as the query. For a point p on the 2D BEV grid, lift it into 3D and sample N_{ref} points along the height (vs. standard deformable DETR, which fixes a single reference point); then, for the j-th 3D point, use the camera intrinsics/extrinsics to find its reference point on the i-th camera's feature map, P(p, i, j), and apply deformable attention there (see the sketch below).
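A sketch of that lifting/projection step P(p, i, j), assuming a single 4×4 lidar-to-image matrix per camera (intrinsics @ extrinsics); names such as lidar2img, z_min, z_max, and n_ref are illustrative, not BEVFormer's actual API.

```python
import numpy as np

def bev_point_to_cam_refs(p_xy, lidar2img, z_min=-5.0, z_max=3.0, n_ref=4):
    """p_xy: (2,) BEV point in ego-frame meters; lidar2img: (4, 4) projection matrix
    for one camera. Returns pixel reference points and a visibility mask."""
    zs = np.linspace(z_min, z_max, n_ref)        # lift the BEV cell into a pillar of 3D points
    pts = np.stack([np.full(n_ref, p_xy[0]),
                    np.full(n_ref, p_xy[1]),
                    zs,
                    np.ones(n_ref)], axis=-1)    # (n_ref, 4) homogeneous coordinates
    cam = pts @ lidar2img.T                      # project into the camera
    depth = np.maximum(cam[:, 2:3], 1e-5)
    uv = cam[:, :2] / depth                      # perspective divide -> pixel coordinates
    visible = cam[:, 2] > 1e-5                   # only points in front of the camera are valid
    return uv, visible

uv, vis = bev_point_to_cam_refs(np.array([12.0, 3.5]), np.eye(4))
print(uv.shape, vis)                             # (4, 2) reference points + visibility mask
```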

BEVFormer v2

Backbone: VoVNet is commonly used. Research found that a ConvNeXt-XL backbone vastly outperforms it on ImageNet, yet the gain does not carry over to BEV detection. Reasons: 1) a domain gap between general CV and autonomous driving, since backbones are usually pretrained on 2D tasks; 2) the BEV detector structure is too complex: between the output boxes and the backbone sit the encoder (which generates the BEV features) and the object decoder (subsumed under the head), both consisting of many attention layers.

BEVFusion & TransFusion

Non-Transformer Based Models

Two-stage detectors:

Faster R-CNN


Reference: faster rcnn详解 (CSDN blog)

*Using a 1×1 convolution in the head is standard practice: the feature map size stays the same, and at each position it takes a linear combination of the input channels (see the sketch below).*
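A tiny sketch of that point: a 1×1 convolution leaves H and W unchanged and acts as a per-position linear layer over channels. The channel counts below are illustrative (e.g. an RPN-style classification head).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 38, 50)                   # (B, C_in, H, W)
cls_head = nn.Conv2d(256, 2 * 9, kernel_size=1)   # e.g. 2 classes x 9 anchors per position
print(cls_head(x).shape)                          # torch.Size([1, 18, 38, 50]): H, W unchanged
```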


R-CNN → Fast R-CNN → Faster R-CNN

Feature Pyramid Network

Motivation: using a single high-level feature map (e.g., Faster R-CNN uses the downsampled Conv4 stage) for the subsequent object classification and bounding-box regression has an obvious drawback: small objects carry little pixel information to begin with, and it is very easily lost during downsampling.

High level feature map: spatially coarser, but semantically stronger

Top-down pathway: upsample the coarser, semantically stronger maps and merge them with the finer maps via lateral connections.

Customized FPN example
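A minimal FPN-style neck as a sketch, assuming three backbone stages C3-C5 with ResNet-like channel counts (512, 1024, 2048); the 1×1 lateral convs, nearest-neighbor upsampling, and 3×3 smoothing convs follow the common FPN recipe, but this is illustrative code, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs bring every backbone stage to the same channel count
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth the merged maps
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                     # feats: [C3, C4, C5], fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map and add the lateral connection
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]  # [P3, P4, P5]

fpn = TinyFPN()
c3, c4, c5 = (torch.randn(1, 512, 80, 80),
              torch.randn(1, 1024, 40, 40),
              torch.randn(1, 2048, 20, 20))
print([tuple(p.shape) for p in fpn([c3, c4, c5])])
```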

the backbone-neck-head paradigm

Application in Faster R-CNN

1. For RPN: anchors come in multiple pre-defined scales and aspect ratios in order to cover objects of different shapes. RPN is adapted by replacing the single-scale feature map with the FPN, attaching a head of the same design (a 3×3 conv and two sibling 1×1 convs) to each level of the feature pyramid; each level then only needs anchors of a single scale.

2. For Fast R-CNN: each RoI is assigned to a pyramid level by its size, k = ⌊k0 + log2(√(wh)/224)⌋ with k0 = 4, and RoI pooling is then performed on that level's feature map (see the sketch below).
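A sketch of that level-assignment rule; the k0, k_min, k_max values assume a P2-P5 pyramid and should be treated as assumptions here.

```python
import math

def roi_to_fpn_level(w, h, k0=4, k_min=2, k_max=5):
    # k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the available pyramid levels
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(roi_to_fpn_level(224, 224))   # 4: a canonical ImageNet-sized RoI maps to P4
print(roi_to_fpn_level(64, 64))     # 2: small RoIs go to finer levels
```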

One-stage detectors:

  • One-stage detectors perform object classification and localization in a single forward pass through the network.
  • They are generally faster because they do not require a separate region proposal step.
  • Examples include YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and RetinaNet.

Understanding the one-stage framework: the RPN module is removed; after the backbone outputs the feature map, a class + bbox prediction is made at every position (YOLOv1). A dense-head sketch follows below.
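A dense-head sketch of that idea: a single convolution predicts class scores plus four box values at every feature-map position. The channel counts and the 1×1 kernel here are illustrative (YOLOv1 itself uses fully connected layers over a grid).

```python
import torch
import torch.nn as nn

num_classes = 20
feat = torch.randn(1, 512, 13, 13)                        # backbone output feature map
head = nn.Conv2d(512, num_classes + 4, kernel_size=1)     # per-position class scores + box offsets
out = head(feat)                                          # (1, 24, 13, 13)
cls_logits, bbox = out[:, :num_classes], out[:, num_classes:]
print(cls_logits.shape, bbox.shape)
```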

RetinaNet

An FPN-based model structure (classification and box-regression subnets attached to each pyramid level)

Focal Loss

For each level's feature map in the FPN, a class is predicted at every position; the overwhelming majority of these positions are background, which is the extreme class imbalance focal loss is designed to handle (sketch below).
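A hedged sketch of binary (object vs. background) focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); alpha = 0.25 and gamma = 2 are the commonly cited defaults.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits and targets have the same shape; targets are in {0, 1}
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()          # down-weights easy negatives

logits = torch.randn(8, 100)                      # e.g. 8 images x 100 positions
targets = (torch.rand(8, 100) < 0.05).float()     # mostly background, as in dense detection
print(focal_loss(logits, targets))
```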

YOLO

v2: adds anchor boxes of different aspect ratios

v3: adds FPN-style multi-scale predictions; starts using Darknet-53

Reference: YOLO系列算法全家桶——YOLOv1-YOLOv9详细介绍 (CSDN blog)

Talks

Tesla FSD

Phil Duan 

  • No HD Map or additional sensors
  • AEB (automatic emergency braking)
