GitHub source: facebookresearch/detr
Annotated GitHub source: HuKai97/detr-annotations
Paper: End-to-End Object Detection with Transformers
Reposted from: 【DETR源码解析】 (DETR Source Code Analysis)
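Overview
DETR (DEtection TRansformer) is a computer vision model from Facebook AI Research, aimed mainly at object detection and also usable for segmentation. It replaces much of the hand-designed machinery of conventional detectors with a Transformer: the two-stage versus one-stage split, anchor-based versus anchor-free designs, NMS post-processing and so on. It also avoids fancy tricks such as multi-scale feature fusion, special convolutions for feature extraction (grouped, deformable, dynamically generated convolutions, etc.), separate feature mappings that decouple classification from regression, or elaborate data augmentation. The whole pipeline is simply: extract features with a CNN, then encode and decode them to obtain the predictions.
The work is solid overall. Its accuracy was not SOTA, but taking the Transformer, which practitioners generally regarded as an NLP architecture, into CV and making it work is highly significant, and its ideas are worth studying. Work that breaks with convention and opens a new line of research tends to be well received; Faster R-CNN and YOLO are examples, and much of the later work builds on them.
In short, DETR treats object detection as a set prediction problem. For each image it predicts a fixed number of objects (100 in the original work, configurable in the code), and the model outputs the whole prediction set directly and in parallel, based on the relations between these object slots and the global context of the image. In other words, the Transformer decodes the predictions for all objects in the image in a single pass, and this parallelism is what makes DETR's decoding efficient.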
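Features:
- End-to-end: NMS and anchors are removed, so there are fewer hyperparameters and less post-processing computation, and the whole network becomes very simple;
- Transformer-based: one of the first works to bring the Transformer into object detection;
- A new set-based loss function: bipartite matching forces the model to output a set of unique predictions, with exactly one predicted box per object. This turns object detection directly into a set prediction problem, which is why NMS is not needed and the model is end-to-end;
- The decoder takes a set of learnable object queries together with the global context features output by the encoder, and directly produces the final 100 predicted boxes in parallel, which replaces anchors;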
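Disadvantages:
- Detection works well on large objects but not so well on small objects; training is slow;
- Because of the query design, initialization and related issues, training DETR from scratch takes a very long time;
Advantages:
- On the COCO dataset its speed and accuracy are roughly on par with Faster R-CNN; it can be extended to many related tasks such as segmentation, tracking and multimodal learning;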
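Pipeline:
- The input image passes through the CNN to produce the image feature map;
- The feature map is flattened and the positional encoding is added;
- The result is fed into the Transformer encoder to learn the correlations between features;
- The encoder output and the object queries are fed into the decoder to obtain the decoded information;
- The decoded information is passed to the FFN to obtain the predictions;
- Each FFN prediction is checked for whether it contains a real object;
- If it does, the predicted box and class are output; otherwise a no object class is output.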
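The last two steps of the pipeline can be sketched as follows. This is a minimal NumPy illustration, not the repository's `PostProcess` class: the function name `keep_detections` and the `threshold=0.7` default are assumptions made for the example, and the no-object class is assumed to be the last logit column.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def keep_detections(pred_logits, pred_boxes, threshold=0.7):
    """Hypothetical helper: filter one image's queries.

    pred_logits: [num_queries, num_classes + 1], last column = no-object (assumed)
    pred_boxes:  [num_queries, 4]
    Returns labels, scores and boxes of queries whose best real class beats `threshold`.
    """
    probs = softmax(pred_logits)[:, :-1]  # drop the no-object column
    scores = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = scores > threshold
    return labels[keep], scores[keep], pred_boxes[keep]

# Toy input: 3 queries, 2 real classes + no-object
logits = np.array([[8.0, 0.0, 0.0],    # confident class 0
                   [0.0, 0.0, 8.0],    # confident no-object
                   [0.0, 6.0, 0.0]])   # confident class 1
boxes = np.array([[0.5, 0.5, 0.2, 0.2],
                  [0.1, 0.1, 0.1, 0.1],
                  [0.3, 0.7, 0.4, 0.2]])
labels, scores, kept = keep_detections(logits, boxes)
print(labels, kept.shape)
```

Queries whose probability mass sits on the no-object column never pass the threshold, so they are dropped without any NMS step.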
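Principle
Set-prediction loss
Bipartite matching to select the valid predicted boxes
The model predicts N (100) boxes and there are M gt boxes, usually with N > M. How is the loss computed then?
First, a bipartite matching is performed between the 100 predicted boxes and the gt boxes to decide which predicted box each gt box corresponds to; the total loss is then computed over the M matched predictions and the M gt boxes.
The idea is simple. Picture a matrix in which one axis indexes the 100 predicted boxes and the other axis indexes the gt boxes. Computing the cost between every predicted box and every gt box gives a cost matrix, and the task is to assign the gt boxes to predicted boxes so that the total cost is minimal.
This is solved with the classic Hungarian algorithm, usually by calling the linear_sum_assignment function from the scipy package. The function takes the cost matrix as input and returns an array of row indices and an array of corresponding column indices that together give the optimal assignment.
These steps decide which of the 100 predicted boxes count as valid predictions and which become background. The final loss is then computed between the valid predictions and the gt boxes (the number of valid predictions equals the number of gt boxes).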
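The matching step described above can be tried on a toy cost matrix. This is a small sketch with made-up cost values; the layout (rows = predictions, columns = gt boxes) is a choice made for the example, and only `scipy.optimize.linear_sum_assignment` itself comes from the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: 4 predictions (rows) x 2 gt boxes (columns); values are invented
cost = np.array([[0.9, 0.2],
                 [0.1, 0.8],
                 [0.5, 0.6],
                 [0.7, 0.3]])

# Rectangular matrices are supported: every gt box gets exactly one prediction
pred_idx, gt_idx = linear_sum_assignment(cost)

matched = list(zip(pred_idx.tolist(), gt_idx.tolist()))
background = sorted(set(range(cost.shape[0])) - set(pred_idx.tolist()))
total_cost = cost[pred_idx, gt_idx].sum()

print("matched (pred, gt):", matched)
print("background predictions:", background)
print("total cost:", total_cost)
```

Predictions that do not appear in `pred_idx` are the ones treated as background, which is how N predictions are reduced to M valid ones.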
Loss function
Loss: classification loss + regression loss
$$
\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right) + \mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\text{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]
$$
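Classification loss: cross-entropy, i.e. the $-\log \hat{p}_{\hat{\sigma}(i)}(c_i)$ term above. The log is dropped only in the matching cost, which uses the class probability directly
Regression loss: GIoU loss + L1 loss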
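As a worked example of the regression term, here is a pure-Python sketch of an L1 + GIoU box loss. It is not the repository's implementation: the helper names are hypothetical, both terms use corner-format boxes (x1, y1, x2, y2) with positive area for simplicity, and the weights `w_l1=5.0` and `w_giou=2.0` are assumed values.

```python
def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU of two non-degenerate boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # Smallest box enclosing both inputs
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return iou - (enclose - union) / enclose

def box_loss(pred, gt, w_l1=5.0, w_giou=2.0):
    """Weighted sum of L1 distance and (1 - GIoU); weights are assumed values."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))

same = box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(same, apart)
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap at all: it becomes negative and shrinks further as the boxes move apart, so the loss still provides a gradient.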
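Source code
Overall structure
- Backbone:
  - Purpose: extract image features and reduce the spatial size of the image
  - Structure: ResNet
  - Input: a batch of images [B, 3, H, W]
  - Output: the feature map of the last ResNet-50 stage [B, 2048, H_feat, W_feat]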
- Transition layer:
  - Purpose: convert the backbone output into the shape the encoder expects, and prepare the other variables the Transformer needs, such as masks and positional_embedding
- Transformer:
  - Structure: encoder, decoder
  - Overall input:
    - x: the input feature map [B, 256, H_feat, W_feat]
    - mask: the padding mask used in the encoder and decoder [B, H_feat, W_feat]
    - query_embed: the embedding vectors of the decoder queries [num_query(100), 256]
    - positional_embedding: the positional encoding of the image feature, used as query_pos in the encoder and as key_pos in the decoder [B, 256, H_feat, W_feat]
  - Overall output:
    - out_dec: the decoder output (object query output) [num_dec_layers(6), B, num_query(100), 256]. The outputs of all decoder layers are kept, but the final result we want is only the output of the last decoder layer
    - memory: the encoder output [B, 256, H_feat, W_feat]
- Encoder in the Transformer:
  - Purpose: apply positional encoding and self attention to the feature map, aggregating global features of the feature map through self attention
  - Structure: self_attention, normalization, feed forward network, normalization
  - Input:
    - query: the input query, i.e. the flattened feature map [H_feat*W_feat, B, 256]
    - value, key: in self attention, k and v are computed from q, so the inputs are None
    - query_key_padding_mask: the padding mask of the feature map [B, H_feat*W_feat]
    - query_positional_embedding: the positional encoding of the feature map [H_feat*W_feat, B, 256]
  - Output:
    - encoder_out: the flattened feature map after self attention [H_feat*W_feat, B, 256]
- Decoder in the Transformer:
  - Purpose:
    - The first part applies self attention to the object queries, so that each object query attends to different object information
    - The second part applies cross attention between the object queries and encoder_out, so that each object query finds a different object on the feature map
  - Structure:
    - self attention (self_attention, normalization)
    - cross attention (cross_attention, normalization, feed forward network, normalization)
  - self attention:
    - Input:
      - query: the initialized object queries [num_layer(6), num_query(100), B, 256]
      - key, value: in self attention, k and v are computed from q, so the inputs are None
      - query_positional_embedding: the positional encoding of the queries [num_layer(6), num_query(100), B, 256]
      - mask: the decoder self attention takes no mask, so the input is None
    - Output:
      - the object queries after self attention [num_layer(6), num_query(100), B, 256]
  - cross attention:
    - Input:
      - query: the object queries after self attention, i.e. the output of the self attention module [num_layer(6), num_query(100), B, 256]
      - key, value: in cross attention, k and v are both the encoder output [H_feat*W_feat, B, 256]
      - query_positional_embedding: the positional encoding of the queries [num_layer(6), num_query(100), B, 256]
      - key_positional_embedding: the positional encoding of the keys, i.e. of the feature map [H_feat*W_feat, B, 256]
      - key_padding_mask: the padding mask of the keys, i.e. of the feature map [B, H_feat*W_feat]
    - Output:
      - out_dec: the decoder output [num_layers(6), B, num_query(100), 256]
      - memory: the encoder output [H_feat*W_feat, B, 256]
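To make the shapes in the structure list concrete, here is a small NumPy walk-through of the transition layer, the flattening step and one masked cross attention. It is only a sketch under simplifying assumptions: random arrays stand in for the backbone output and the positional encoding, the encoder is skipped (the projected features are used directly as `memory`), and the attention is single-head without learned q/k/v projections. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C_in, D, H, W, Q = 2, 2048, 256, 4, 5, 100

feat = rng.standard_normal((B, C_in, H, W)).astype(np.float32)  # stand-in backbone output
pos = rng.standard_normal((B, D, H, W)).astype(np.float32)      # stand-in positional encoding
mask = np.zeros((B, H, W), dtype=bool)
mask[1, :, 3:] = True                                           # True marks padded pixels

# Transition layer: a 1x1 convolution is a per-pixel linear projection 2048 -> 256
w_proj = (rng.standard_normal((D, C_in)) * 0.01).astype(np.float32)
src = np.einsum("oc,bchw->bohw", w_proj, feat)                  # [B, 256, H, W]

# Flatten the spatial dimensions into a sequence for the transformer
memory = src.reshape(B, D, H * W).transpose(2, 0, 1)            # [H*W, B, 256]
pos_seq = pos.reshape(B, D, H * W).transpose(2, 0, 1)           # [H*W, B, 256]
mask_seq = mask.reshape(B, H * W)                               # [B, H*W]

# Object queries: content starts at zero, the learned embedding acts as query position
query_embed = rng.standard_normal((Q, D)).astype(np.float32)    # [100, 256]
query_pos = np.repeat(query_embed[:, None, :], B, axis=1)       # [100, B, 256]
tgt = np.zeros_like(query_pos)

# Cross attention: q = tgt + query_pos, k = memory + pos, v = memory
q = (tgt + query_pos).transpose(1, 0, 2)                        # [B, 100, 256]
k = (memory + pos_seq).transpose(1, 0, 2)                       # [B, H*W, 256]
v = memory.transpose(1, 0, 2)                                   # [B, H*W, 256]
scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)                  # [B, 100, H*W]
scores = np.where(mask_seq[:, None, :], -np.inf, scores)        # padded keys get zero weight
scores = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(scores)
attn = attn / attn.sum(axis=-1, keepdims=True)
out = (attn @ v).transpose(1, 0, 2)                             # [100, B, 256]

print(memory.shape, mask_seq.shape, out.shape)
```

The padding mask only affects the key side: every query still produces an output, but no attention weight lands on padded feature-map positions.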
Building the model consists of:
- Building DETR: Backbone + Transformer + MLP (multilayer perceptron)
- Initializing the loss function (criterion) and the post-processing (postprocessors)
def build(args):
    # the `num_classes` naming here is somewhat misleading.
    # it indeed corresponds to `max_obj_id + 1`, where max_obj_id
    # is the maximum id for a class in your dataset. For example,
    # COCO has a max_obj_id of 90, so we pass `num_classes` to be 91.
    # As another example, for a dataset that has a single class with id 1,
    # you should pass `num_classes` to be 2 (max_obj_id + 1).
    # For more details on this, check the following discussion
    # https://github.com/facebookresearch/detr/issues/108#issuecomment-650269223
    # num_classes = largest class id + 1
    num_classes = 20 if args.dataset_file != 'coco' else 91
    if args.dataset_file == "coco_panoptic":
        # for panoptic, we just add a num_classes that is large enough to hold
        # max_obj_id + 1, but the exact value doesn't really matter
        num_classes = 250
    device = torch.device(args.device)

    # build the backbone: ResNet + PositionEmbeddingSine
    backbone = build_backbone(args)
    # build the transformer
    transformer = build_transformer(args)
    # build the full DETR model
    model = DETR(
        backbone,
        transformer,
        num_classes=num_classes,
        num_queries=args.num_queries,
        aux_loss=args.aux_loss,
    )
    # optionally wrap the model for the extra segmentation task
    if args.masks:
        model = DETRsegm(model, freeze_detr=(args.frozen_weights is not None))

    # HungarianMatcher(): bipartite matching
    matcher = build_matcher(args)
    # loss weights
    weight_dict = {'loss_ce': 1, 'loss_bbox': args.bbox_loss_coef}
    weight_dict['loss_giou'] = args.giou_loss_coef
    if args.masks:  # segmentation task (False for plain detection)
        weight_dict["loss_mask"] = args.mask_loss_coef
        weight_dict["loss_dice"] = args.dice_loss_coef
    # TODO this is a hack
    if args.aux_loss:  # auxiliary loss: every decoder layer contributes to the loss (True here)
        aux_weight_dict = {}
        for i in range(args.dec_layers - 1):
            aux_weight_dict.update({k + f'_{i}': v for k, v in weight_dict.items()})
        weight_dict.update(aux_weight_dict)

    losses = ['labels', 'boxes', 'cardinality']
    if args.masks:
        losses += ["masks"]
    # define the loss function
    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
    criterion.to(device)
    # define the post-processing
    postprocessors = {'bbox': PostProcess()}
    # segmentation
    if args.masks:
        postprocessors['segm'] = PostProcessSegm()
        if args.dataset_file == "coco_panoptic":
            is_thing_map = {i: i <= 90 for i in range(201)}
            postprocessors["panoptic"] = PostProcessPanoptic(is_thing_map, threshold=0.85)

    return model, criterion, postprocessors
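The `aux_loss` branch of `build` is easier to follow on concrete numbers. The sketch below repeats the same dictionary expansion in isolation; the function name `expand_aux_weights` is hypothetical, and the weights 1, 5 and 2 and the 6 decoder layers are assumed values used only for illustration.

```python
def expand_aux_weights(weight_dict, dec_layers):
    """Copy every loss weight once per intermediate decoder layer.

    Layer i (0 .. dec_layers - 2) gets the keys suffixed with `_{i}`;
    the last decoder layer keeps the unsuffixed keys.
    """
    expanded = dict(weight_dict)
    for i in range(dec_layers - 1):
        expanded.update({f"{k}_{i}": v for k, v in weight_dict.items()})
    return expanded

# Assumed example values, not read from any config
weights = expand_aux_weights({"loss_ce": 1, "loss_bbox": 5, "loss_giou": 2}, dec_layers=6)

print(len(weights))
print(sorted(k for k in weights if k.startswith("loss_ce")))
```

With 3 base losses and 6 decoder layers this gives 3 + 3 * 5 = 18 weighted terms: the unsuffixed keys supervise the last layer and the suffixed ones supervise the five intermediate layers.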
Building the model
class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        # classification head
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        # box regression head
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        # self.query_embed plays a role similar to anchors in traditional detectors; 100 of them here, [100, 256]
        # only the .weight of nn.Embedding is used, so it behaves like a learnable nn.Parameter
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss  # True
    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels
            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, width, height). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxiliary losses are activated. It is a list of
                                dictionaries containing the two above keys for each decoder layer.
        """
        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)
        # features: list{0: tensor=[bs,2048,19,26] + mask=[bs,19,26]}, the output of ResNet-50 block5 in the backbone
        # pos: list{0: [bs,256,19,26]}, the positional encoding
        features, pos = self.backbone(samples)
        # src: Tensor [bs,2048,19,26]
        # mask: Tensor [bs,19,26]
        src, mask = features[-1].decompose()
        assert mask is not None
        # forward pass through the transformer
        # self.input_proj(src): [bs,2048,19,26] -> [bs,256,19,26]
        # mask: True marks padded pixels, which are excluded from the attention computation
        # self.query_embed.weight: similar to anchors in traditional detectors; 100 of them here
        # pos[-1]: positional encoding [bs, 256, 19, 26]
        # hs: [6, bs, 100, 256]
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
        # classification: [6 decoder layers, bs, 100, 256] -> [6, bs, 100, 92 (91 classes + no-object)]
        outputs_class = self.class_embed(hs)
        # regression: [6 decoder layers, bs, 100, 256] -> [6, bs, 100, 4]
        outputs_coord = self.bbox_embed(hs).sigmoid()
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:  # True
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        # out is a dict with 3 entries:
        # 0 pred_logits: classification head output [bs, 100, 92 (number of classes + 1)]
        # 1 pred_boxes: regression head output [bs, 100, 4]
        # 2 aux_outputs: list of 5, the outputs of the first 5 decoder layers,
        #   i.e. 5 pred_logits [bs, 100, 92] and 5 pred_boxes [bs, 100, 4]
        return out
    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[