【mmdetection】DETR, DeformableDETR and DINO

1. Overview

A brief overview of the class inheritance hierarchy and the incremental methods each subclass adds

Inheritance chain (each subclass extends the previous one via `super`):

DINO → DeformableDETR → DETR → DetectionTransformer → BaseDetector

2. DetectionTransformer

2.1 Overall structure

The structure splits cleanly into three parts:
(1) self.extract_feat calls the backbone and neck modules to extract features;
(2) self.forward_transformer defines the call logic of the Transformer module but leaves each step unimplemented; the subclass detectors mainly implement these steps;
(3) self.bbox_head.predict/loss calls the head's inference and loss-computation modules; the corresponding head classes mainly implement these.

Module structure diagram

Backbone+neck → Transformer (encoder+decoder) → bbox_head (loss+predict)

Module code diagram (best read alongside the code)

batch_inputs ──self.extract_feat──> img_feats
img_feats, batch_data_samples ──self.forward_transformer──> head_inputs_dict
head_inputs_dict, batch_data_samples ──self.bbox_head.predict──> scores, labels, bboxes
head_inputs_dict, batch_data_samples ──self.bbox_head.loss──> losses

batch_inputs: images, shape=(B, C, H, W)
batch_data_samples: usually includes information such as `gt_instance`, `gt_panoptic_seg` or `gt_sem_seg`.
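
To make the wiring concrete, here is a simplified sketch of how the base class chains the three parts together in loss and predict (a sketch mirroring the diagram above; the actual mmdetection code adds type hints, docstrings and result packaging):

class DetectionTransformerSketch:
    def loss(self, batch_inputs, batch_data_samples):
        img_feats = self.extract_feat(batch_inputs)      # (1) backbone + neck
        head_inputs_dict = self.forward_transformer(
            img_feats, batch_data_samples)               # (2) transformer logic
        return self.bbox_head.loss(                      # (3) assignment + loss
            **head_inputs_dict, batch_data_samples=batch_data_samples)

    def predict(self, batch_inputs, batch_data_samples, rescale=True):
        img_feats = self.extract_feat(batch_inputs)
        head_inputs_dict = self.forward_transformer(img_feats, batch_data_samples)
        results_list = self.bbox_head.predict(           # (3) inference + post-processing
            **head_inputs_dict, rescale=rescale,
            batch_data_samples=batch_data_samples)
        # attach predictions (scores, labels, bboxes) back onto the samples
        return self.add_pred_to_datasample(batch_data_samples, results_list)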

2.2 Code diagram for self.forward_transformer

Each subclass of DetectionTransformer mainly implements the four functions called inside self.forward_transformer:

  • pre_transformer
  • forward_encoder
  • pre_decoder
  • forward_decoder

def forward_transformer(self,
                        img_feats: Tuple[Tensor],
                        batch_data_samples: OptSampleList = None) -> Dict:
    encoder_inputs_dict, decoder_inputs_dict = self.pre_transformer(
        img_feats, batch_data_samples)

    encoder_outputs_dict = self.forward_encoder(**encoder_inputs_dict)

    tmp_dec_in, head_inputs_dict = self.pre_decoder(**encoder_outputs_dict)
    decoder_inputs_dict.update(tmp_dec_in)

    decoder_outputs_dict = self.forward_decoder(**decoder_inputs_dict)
    head_inputs_dict.update(decoder_outputs_dict)
    return head_inputs_dict
Input: img_feats, batch_data_samples
  ──pre_transformer──> encoder_inputs_dict, decoder_inputs_dict
  ──forward_encoder──> encoder_outputs_dict
  ──pre_decoder──> tmp_dec_in (merged into decoder_inputs_dict), head_inputs_dict
  ──forward_decoder──> decoder_outputs_dict (merged into head_inputs_dict)
Output: head_inputs_dict

The concrete DETR implementation below will make the meaning of each of these internal variables clearer.

3. DETR (a rough look at the structure)


3.1 DETR Detector

As noted above, a DetectionTransformer subclass detector implements the four functions called in self.forward_transformer. Its main job is to prepare the inputs for the encoder and decoder: features, positional encodings (PE), and masks.

  • The encoder's core is SelfAttention(x, PE, masks): q, k, v all come from the same variable, and PE and masks are the position embedding and the input mask respectively.
  • The decoder's core is CrossAttention(q, kv, q_PE, kv_PE, kv_masks); the code refers to kv as memory. (See the sketch below for how the PEs enter the attention calls.)
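
For intuition, here is a minimal, illustrative sketch of how DETR-style attention consumes the positional embeddings (not the mmdetection implementation; module and function names here are made up for illustration):

import torch
import torch.nn as nn

self_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def encoder_self_attention(x, pe, key_padding_mask):
    # q and k carry the positional embedding; v stays content-only
    q = k = x + pe
    out, _ = self_attn(q, k, x, key_padding_mask=key_padding_mask)
    return out

def decoder_cross_attention(query, query_pos, memory, memory_pe, memory_mask):
    # the queries attend to the encoder memory; each side adds its own PE
    q = query + query_pos
    k = memory + memory_pe
    out, _ = cross_attn(q, k, memory, key_padding_mask=memory_mask)
    return out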

In DETR, these four functions do the following:

  • pre_transformer: prepares x, PE and masks for the encoder; the same PE and masks also serve as the decoder's kv_PE and kv_masks (see the mask sketch after this list).
  • forward_encoder: simply calls the encoder and outputs the enhanced features x'.
  • pre_decoder: prepares query and query_pos for the decoder. A notable detail: query_pos is a learnable Embedding, while query is an all-zero, non-learnable tensor.
  • forward_decoder: calls the decoder with x' as k/v, together with kv_PE, kv_masks, query and query_pos.
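
As a concrete example of the mask preparation, here is a simplified sketch of what pre_transformer does to build the padding masks (build_padding_masks is a hypothetical helper; the real logic lives inside DETR.pre_transformer):

import torch
import torch.nn.functional as F

def build_padding_masks(feat, batch_img_metas, input_shape):
    # feat: (bs, c, h, w) single-level feature map
    # input_shape: padded (input_h, input_w) of the batched images
    bs = feat.size(0)
    masks = feat.new_ones((bs, *input_shape))
    for i, meta in enumerate(batch_img_metas):
        h, w = meta['img_shape'][:2]
        masks[i, :h, :w] = 0  # 0 = valid pixel, 1 = padding
    # downsample the masks to the feature resolution for use in attention
    masks = F.interpolate(
        masks.unsqueeze(1), size=feat.shape[-2:]).to(torch.bool).squeeze(1)
    return masks
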
img_feats ──pre_transformer──> feat, pos_embed, masks
          ──forward_encoder──> memory (the decoder's kv)
          ──pre_decoder──> query, query_pos (the decoder's q)
          ──forward_decoder──> hidden_states

The pre_decoder implementation shows the query initialization detail mentioned above:
def pre_decoder(self, memory: Tensor) -> Tuple[Dict, Dict]:
    batch_size = memory.size(0)  # (bs, num_feat_points, dim)
    query_pos = self.query_embedding.weight
    # (num_queries, dim) -> (bs, num_queries, dim)
    query_pos = query_pos.unsqueeze(0).repeat(batch_size, 1, 1)
    query = torch.zeros_like(query_pos)

    decoder_inputs_dict = dict(query_pos=query_pos, query=query, memory=memory)
    head_inputs_dict = dict()
    return decoder_inputs_dict, head_inputs_dict      

3.2 DETRHead

The head mainly implements three things:

  • forward: produces cls/reg results through the classification and regression heads;
  • predict: called only at inference time; runs prediction and the corresponding post-processing;
  • loss: called only at training time; runs the label-assignment strategy and computes the loss functions.

forward

    def _init_layers(self) -> None:
        """Initialize layers of the transformer head."""
        # cls branch
        self.fc_cls = Linear(self.embed_dims, self.cls_out_channels)
        # reg branch
        self.activate = nn.ReLU()
        self.reg_ffn = FFN(self.embed_dims, self.embed_dims, self.num_reg_fcs,
            dict(type='ReLU', inplace=True), dropout=0.0, add_residual=False)
        # NOTE the activations of reg_branch here is the same as
        # those in transformer, but they are actually different
        # in DAB-DETR (prelu in transformer and relu in reg_branch)
        self.fc_reg = Linear(self.embed_dims, 4)

    def forward(self, hidden_states: Tensor) -> Tuple[Tensor]:
        # hidden_states: (num_decoder_layers, bs, num_queries, dim) if
        # `return_intermediate_dec` in detr.py is True, else (1, bs, num_queries, dim)

        # Note cls_out_channels should include background.
        # (num_decoder_layers, bs, num_queries, cls_out_channels)
        layers_cls_scores = self.fc_cls(hidden_states)
        # normalized coordinate format (cx, cy, w, h), has shape
        # (num_decoder_layers, bs, num_queries, 4)
        layers_bbox_preds = self.fc_reg(
            self.activate(self.reg_ffn(hidden_states))).sigmoid()
        return layers_cls_scores, layers_bbox_preds

predict

predict calls forward to obtain the cls/reg predictions, then calls predict_by_feat for post-processing. predict_by_feat in turn calls _predict_by_feat_single on each image, so the core logic lives in _predict_by_feat_single, which does two things:

  • Score-based top-k filtering: keep the top max_per_img predictions by score (the sigmoid branch takes the top-k over the flattened query × class scores; the softmax branch first takes each query's best class, then the top-k);
  • Box format conversion and scaling: convert (cx, cy, w, h) to (x1, y1, x2, y2), multiply by the image width and height, and rescale back to the original image scale.
    def _predict_by_feat_single(self,
                                cls_score: Tensor,
                                bbox_pred: Tensor,
                                img_meta: dict,
                                rescale: bool = True) -> InstanceData:
        """Transform outputs from the last decoder layer into bbox predictions
        for each image.

        Args:
            cls_score (Tensor): Box score logits from the last decoder layer
                for each image. Shape [num_queries, cls_out_channels].
            bbox_pred (Tensor): Sigmoid outputs from the last decoder layer
                for each image, with coordinate format (cx, cy, w, h) and
                shape [num_queries, 4].
            img_meta (dict): Image meta info.
            rescale (bool): If True, return boxes in original image
                space. Default True.

        Returns:
            :obj:`InstanceData`: Detection results of each image
            after the post process.
            Each item usually contains following keys.

                - scores (Tensor): Classification scores, has a shape
                  (num_instance, )
                - labels (Tensor): Labels of bboxes, has a shape
                  (num_instances, ).
                - bboxes (Tensor): Has a shape (num_instances, 4),
                  the last dimension 4 arrange as (x1, y1, x2, y2).
        """
        assert len(cls_score) == len(bbox_pred)  # num_queries
        max_per_img = self.test_cfg.get('max_per_img', len(cls_score))
        img_shape = img_meta['img_shape']
        # exclude background
        if self.loss_cls.use_sigmoid:
            cls_score = cls_score.sigmoid()
            scores, indexes = cls_score.view(-1).topk(max_per_img)
            det_labels = indexes % self.num_classes
            bbox_index = indexes // self.num_classes
            bbox_pred = bbox_pred[bbox_index]
        else:
            scores, det_labels = F.softmax(cls_score, dim=-1)[..., :-1].max(-1)
            scores, bbox_index = scores.topk(max_per_img)
            bbox_pred = bbox_pred[bbox_index]
            det_labels = det_labels[bbox_index]

        det_bboxes = bbox_cxcywh_to_xyxy(bbox_pred)
        det_bboxes[:, 0::2] = det_bboxes[:, 0::2] * img_shape[1]
        det_bboxes[:, 1::2] = det_bboxes[:, 1::2] * img_shape[0]
        det_bboxes[:, 0::2].clamp_(min=0, max=img_shape[1])
        det_bboxes[:, 1::2].clamp_(min=0, max=img_shape[0])
        if rescale:
            assert img_meta.get('scale_factor') is not None
            det_bboxes /= det_bboxes.new_tensor(
                img_meta['scale_factor']).repeat((1, 2))

        results = InstanceData()
        results.bboxes = det_bboxes
        results.scores = scores
        results.labels = det_labels
        return results

loss
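
loss is called only during training: it runs forward, then hands the per-layer predictions to loss_by_feat, which performs one-to-one (Hungarian) label assignment via the configured assigner and computes the classification, L1 box and GIoU losses. A simplified sketch of the flow (the real mmdetection code also distributes auxiliary losses across decoder layers and splits the work across loss_by_feat/get_targets):

def loss(self, hidden_states, batch_data_samples):
    # method of DETRHead (sketch)
    # unpack gt instances and image metas from the data samples
    batch_gt_instances = [ds.gt_instances for ds in batch_data_samples]
    batch_img_metas = [ds.metainfo for ds in batch_data_samples]
    # run the cls/reg heads on every decoder layer's output
    outs = self(hidden_states)  # (layers_cls_scores, layers_bbox_preds)
    # loss_by_feat performs Hungarian label assignment per image, then
    # computes classification, L1 box and GIoU losses per decoder layer
    return self.loss_by_feat(*outs, batch_gt_instances, batch_img_metas)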

4. DeformableDETR

DeformableDETR's two key changes relative to DETR (see the level-embedding sketch below):

  • multi-scale features (distinguished by a learnable scale/level embedding);
  • two-stage prediction (encoder outputs are scored to generate proposals that initialize the decoder queries).
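
The scale/level embedding idea can be sketched as follows (illustrative only, with made-up names; in mmdetection this flattening happens inside DeformableDETR.pre_transformer): each feature level gets a learnable embedding added to its positional encoding, so the levels stay distinguishable after flattening.

import torch
import torch.nn as nn

num_levels, embed_dims = 4, 256
level_embed = nn.Parameter(torch.randn(num_levels, embed_dims))

def flatten_multi_scale(mlvl_feats, mlvl_pos_embeds):
    feat_flatten, lvl_pos_flatten = [], []
    for lvl, (feat, pos) in enumerate(zip(mlvl_feats, mlvl_pos_embeds)):
        # (bs, c, h, w) -> (bs, h*w, c)
        feat_flatten.append(feat.flatten(2).transpose(1, 2))
        # add the per-level embedding to this level's positional encoding
        lvl_pos_flatten.append(
            pos.flatten(2).transpose(1, 2) + level_embed[lvl].view(1, 1, -1))
    # concatenate all levels along the token dimension
    return torch.cat(feat_flatten, 1), torch.cat(lvl_pos_flatten, 1)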

5. DINO

DINO pipeline (reconstructed from the diagram):

Backbone+neck (feature extraction) ──> multi-scale feat
Encoders (feature enhancement): k = v = feat, with PE and mask (reused by the decoder as kv_PE, kv_mask)
cls/reg + query selection (proposal generation): score the enhanced features and select proposals
Init: the selected proposals initialize the decoder's query / q_PE
Decoder + cls/reg: final predictions
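
The query-selection step can be sketched as follows (illustrative, with hypothetical names; in mmdetection the corresponding logic lives in the detector's pre_decoder): every enhanced encoder token is scored, the top-k tokens become proposals, and these initialize the decoder's reference points/queries.

import torch

def select_topk_proposals(enc_cls_scores, enc_bbox_preds, k):
    # enc_cls_scores: (bs, num_tokens, num_classes) classification logits
    # enc_bbox_preds: (bs, num_tokens, 4), normalized (cx, cy, w, h)
    scores = enc_cls_scores.max(-1).values           # best class score per token
    topk_idx = torch.topk(scores, k, dim=1).indices  # (bs, k)
    proposals = torch.gather(
        enc_bbox_preds, 1, topk_idx.unsqueeze(-1).repeat(1, 1, 4))
    return proposals  # used to initialize the decoder reference points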

6. How to use (Inference Demo)
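
A minimal inference sketch using mmdetection's high-level API (the config/checkpoint paths are examples; substitute the files you actually use):

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/detr/detr_r50_8xb2-150e_coco.py'  # example config path
checkpoint_file = 'detr_r50_8xb2-150e_coco.pth'          # downloaded weights
model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')
# result is a DetDataSample; predictions live in result.pred_instances
print(result.pred_instances.scores,
      result.pred_instances.labels,
      result.pred_instances.bboxes)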

7. How to train (Train Script)
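
Training follows the standard mmdetection workflow: single-GPU training with tools/train.py, or multi-GPU with tools/dist_train.sh (the config path is an example):

python tools/train.py configs/detr/detr_r50_8xb2-150e_coco.py
bash tools/dist_train.sh configs/detr/detr_r50_8xb2-150e_coco.py 8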
