目标检测-RT-DETR的Decoder部分

RT-DETR论文:2304.08069.pdf (arxiv.org)

RT-DETR代码:lyuwenyu/RT-DETR: [CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥 (github.com)

 RT-DETR整体网络结构图

Decoder介绍

RT-DETR(Real-Time Detection Transformer)是一种高效的目标检测模型,它结合了Transformer架构的优势,特别是通过其Decoder部分实现了对目标检测任务的端到端解决方案。RT-DETR的Decoder部分是模型的核心之一,它负责将编码器的输出转换为最终的检测结果,包括边界框坐标和类别预测。

Decoder部分的主要特点:

  1. 基于Transformer的Decoder:RT-DETR的Decoder采用了标准的Transformer Decoder结构,它通过自注意力机制(Self-Attention)和跨注意力机制(Cross-Attention)处理编码器的输出特征。

  2. Object Query:在Transformer Decoder中,每个目标由一个对象查询(Object Query)表示,这些查询向量通过与编码器输出的特征图进行交互,生成目标的类别和边界框预测。

  3. Cross-Attention:Decoder中的Cross-Attention机制允许对象查询与编码器的特征图进行交互,这有助于模型学习到目标的空间位置信息。

  4. Self-Attention:Self-Attention机制使Decoder中的每个查询都能够考虑到其他查询的信息,这有助于模型处理重叠或相互关联的目标。

  5. 分层结构:RT-DETR的Decoder可能包含多个层次,每一层都对前一层的输出进行进一步的细化,以提高检测的准确性。

  6. IoU感知的查询选择:RT-DETR引入了一种IoU(交并比)感知的查询选择机制,这有助于优化解码器查询的初始化,从而提高检测性能。

  7. 灵活性:RT-DETR支持通过使用不同数量的解码器层来灵活调整模型的推理速度,而不需要重新训练模型。

  8. 去噪思想:RT-DETR的Decoder部分采用了DINO(Distilled Knowledge from Transformers)的思想,使用“去噪学习”来提升双边匹配的样本质量,加快训练的收敛速度。

  9. IoU软标签:在RT-DETR的Decoder中,分类标签被替换为IoU软标签,这有助于模型更精确地预测边界框。

  10. 端到端学习:Decoder的设计允许RT-DETR直接从图像像素到边界框和类别预测的端到端学习,无需额外的后处理步骤,如非极大值抑制(NMS)。

实现细节:

RT-DETR的Decoder通常由以下步骤组成:

  • 特征嵌入:将编码器的输出转换为适合Decoder处理的嵌入表示。
  • 位置编码:为对象查询添加位置编码,以提供序列中的位置信息。
  • 多头自注意力:Decoder内部使用多头自注意力机制来处理对象查询。
  • 交叉注意力:Decoder通过交叉注意力机制与编码器的输出进行交互。
  • 输出层:最终,Decoder的输出通过一个线性层或类似的结构转换为类别预测和边界框坐标。

RT-DETR的Decoder部分的设计是模型实现实时目标检测的关键,它通过优化特征处理和查询选择机制,显著提高了模型的处理速度和检测精度。这些特点使得RT-DETR在多个标准数据集上取得了优异的性能,成为目标检测领域的一个重要进展。

实现代码

class RTDETRDecoder(nn.Module):
    """
    Real-Time Deformable Transformer Decoder (RTDETRDecoder) module for object detection.

    This decoder module utilizes Transformer architecture along with deformable convolutions to predict bounding boxes
    and class labels for objects in an image. It integrates features from multiple layers and runs through a series of
    Transformer decoder layers to output the final predictions.
    """
    export = False  # export mode

    def __init__(
            self,
            nc=80,
            ch=(512, 1024, 2048),
            hd=256,  # hidden dim
            nq=300,  # num queries
            ndp=4,  # num decoder points
            nh=8,  # num head
            ndl=6,  # num decoder layers
            d_ffn=1024,  # dim of feedforward
            eval_idx=-1,
            dropout=0.,
            act=nn.ReLU(),
            # Training args
            nd=100,  # num denoising
            label_noise_ratio=0.5,
            box_noise_scale=1.0,
            learnt_init_query=False):
        """
        Initializes the RTDETRDecoder module with the given parameters.

        Args:
            nc (int): Number of classes. Default is 80.
            ch (tuple): Channels in the backbone feature maps. Default is (512, 1024, 2048).
            hd (int): Dimension of hidden layers. Default is 256.
            nq (int): Number of query points. Default is 300.
            ndp (int): Number of decoder points. Default is 4.
            nh (int): Number of heads in multi-head attention. Default is 8.
            ndl (int): Number of decoder layers. Default is 6.
            d_ffn (int): Dimension of the feed-forward networks. Default is 1024.
            dropout (float): Dropout rate. Default is 0.
            act (nn.Module): Activation function. Default is nn.ReLU.
            eval_idx (int): Evaluation index. Default is -1.
            nd (int): Number of denoising. Default is 100.
            label_noise_ratio (float): Label noise ratio. Default is 0.5.
            box_noise_scale (float): Box noise scale. Default is 1.0.
            learnt_init_query (bool): Whether to learn initial query embeddings. Default is False.
        """
        super().__init__()
        self.hidden_dim = hd
        self.nhead = nh
        self.nl = len(ch)  # num level
        self.nc = nc
        self.num_queries = nq
        self.num_decoder_layers = ndl

        # Backbone feature projection
        self.input_proj = nn.ModuleList(nn.Sequential(nn.Conv2d(x, hd, 1, bias=False), nn.BatchNorm2d(hd)) for x in ch)
        # NOTE: simplified version but it's not consistent with .pt weights.
        # self.input_proj = nn.ModuleList(Conv(x, hd, act=False) for x in ch)

        # Transformer module
        decoder_layer = DeformableTransformerDecoderLayer(hd, nh, d_ffn, dropout, act, self.nl, ndp)
        self.decoder = DeformableTransformerDecoder(hd, decoder_layer, ndl, eval_idx)

        # Denoising part
        self.denoising_class_embed = nn.Embedding(nc + 1, hd)
        self.num_denoising = nd
        self.label_noise_ratio = label_noise_ratio
        self.box_noise_scale = box_noise_scale

        # Decoder embedding
        self.learnt_init_query = learnt_init_query
        if learnt_init_query:
            self.tgt_embed = nn.Embedding(nq, hd)
        self.query_pos_head = MLP(4, 2 * hd, hd, num_layers=2)

        # Encoder head
        self.enc_output = nn.Sequential(nn.Linear(hd, hd), nn.LayerNorm(hd))
        self.enc_score_head = nn.Linear(hd, nc)
        self.enc_bbox_head = MLP(hd, hd, 4, num_layers=3)

        # Decoder head
        self.dec_score_head = nn.ModuleList([nn.Linear(hd, nc) for _ in range(ndl)])
        self.dec_bbox_head = nn.ModuleList([MLP(hd, hd, 4, num_layers=3) for _ in range(ndl)])

        self._reset_parameters()

    def forward(self, x, batch=None):
        """Runs the forward pass of the module, returning bounding box and classification scores for the input."""
        from ultralytics.models.utils.ops import get_cdn_group

        # Input projection and embedding
        feats, shapes = self._get_encoder_input(x)

        # Prepare denoising training
        dn_embed, dn_bbox, attn_mask, dn_meta = \
            get_cdn_group(batch,
                          self.nc,
                          self.num_queries,
                          self.denoising_class_embed.weight,
                          self.num_denoising,
                          self.label_noise_ratio,
                          self.box_noise_scale,
                          self.training)

        embed, refer_bbox, enc_bboxes, enc_scores = \
            self._get_decoder_input(feats, shapes, dn_embed, dn_bbox)

        # Decoder
        dec_bboxes, dec_scores = self.decoder(embed,
                                              refer_bbox,
                                              feats,
                                              shapes,
                                              self.dec_bbox_head,
                                              self.dec_score_head,
                                              self.query_pos_head,
                                              attn_mask=attn_mask)
        x = dec_bboxes, dec_scores, enc_bboxes, enc_scores, dn_meta
        if self.training:
            return x
        # (bs, 300, 4+nc)
        y = torch.cat((dec_bboxes.squeeze(0), dec_scores.squeeze(0).sigmoid()), -1)
        return y if self.export else (y, x)

    def _generate_anchors(self, shapes, grid_size=0.05, dtype=torch.float32, device='cpu', eps=1e-2):
        """Generates anchor bounding boxes for given shapes with specific grid size and validates them."""
        anchors = []
        for i, (h, w) in enumerate(shapes):
            sy = torch.arange(end=h, dtype=dtype, device=device)
            sx = torch.arange(end=w, dtype=dtype, device=device)
            grid_y, grid_x = torch.meshgrid(sy, sx, indexing='ij') if TORCH_1_10 else torch.meshgrid(sy, sx)
            grid_xy = torch.stack([grid_x, grid_y], -1)  # (h, w, 2)

            valid_WH = torch.tensor([h, w], dtype=dtype, device=device)
            grid_xy = (grid_xy.unsqueeze(0) + 0.5) / valid_WH  # (1, h, w, 2)
            wh = torch.ones_like(grid_xy, dtype=dtype, device=device) * grid_size * (2.0 ** i)
            anchors.append(torch.cat([grid_xy, wh], -1).view(-1, h * w, 4))  # (1, h*w, 4)

        anchors = torch.cat(anchors, 1)  # (1, h*w*nl, 4)
        valid_mask = ((anchors > eps) * (anchors < 1 - eps)).all(-1, keepdim=True)  # 1, h*w*nl, 1
        anchors = torch.log(anchors / (1 - anchors))
        anchors = anchors.masked_fill(~valid_mask, float('inf'))
        return anchors, valid_mask

    def _get_encoder_input(self, x):
        """Processes and returns encoder inputs by getting projection features from input and concatenating them."""
        # Get projection features
        x = [self.input_proj[i](feat) for i, feat in enumerate(x)]
        # Get encoder inputs
        feats = []
        shapes = []
        for feat in x:
            h, w = feat.shape[2:]
            # [b, c, h, w] -> [b, h*w, c]
            feats.append(feat.flatten(2).permute(0, 2, 1))
            # [nl, 2]
            shapes.append([h, w])

        # [b, h*w, c]
        feats = torch.cat(feats, 1)
        return feats, shapes

    def _get_decoder_input(self, feats, shapes, dn_embed=None, dn_bbox=None):
        """Generates and prepares the input required for the decoder from the provided features and shapes."""
        bs = len(feats)
        # Prepare input for decoder
        anchors, valid_mask = self._generate_anchors(shapes, dtype=feats.dtype, device=feats.device)
        features = self.enc_output(valid_mask * feats)  # bs, h*w, 256

        enc_outputs_scores = self.enc_score_head(features)  # (bs, h*w, nc)

        # Query selection
        # (bs, num_queries)
        topk_ind = torch.topk(enc_outputs_scores.max(-1).values, self.num_queries, dim=1).indices.view(-1)
        # (bs, num_queries)
        batch_ind = torch.arange(end=bs, dtype=topk_ind.dtype).unsqueeze(-1).repeat(1, self.num_queries).view(-1)

        # (bs, num_queries, 256)
        top_k_features = features[batch_ind, topk_ind].view(bs, self.num_queries, -1)
        # (bs, num_queries, 4)
        top_k_anchors = anchors[:, topk_ind].view(bs, self.num_queries, -1)

        # Dynamic anchors + static content
        refer_bbox = self.enc_bbox_head(top_k_features) + top_k_anchors

        enc_bboxes = refer_bbox.sigmoid()
        if dn_bbox is not None:
            refer_bbox = torch.cat([dn_bbox, refer_bbox], 1)
        enc_scores = enc_outputs_scores[batch_ind, topk_ind].view(bs, self.num_queries, -1)

        embeddings = self.tgt_embed.weight.unsqueeze(0).repeat(bs, 1, 1) if self.learnt_init_query else top_k_features
        if self.training:
            refer_bbox = refer_bbox.detach()
            if not self.learnt_init_query:
                embeddings = embeddings.detach()
        if dn_embed is not None:
            embeddings = torch.cat([dn_embed, embeddings], 1)

        return embeddings, refer_bbox, enc_bboxes, enc_scores

    # TODO
    def _reset_parameters(self):
        """Initializes or resets the parameters of the model's various components with predefined weights and biases."""
        # Class and bbox head init
        bias_cls = bias_init_with_prob(0.01) / 80 * self.nc
        # NOTE: the weight initialization in `linear_init_` would cause NaN when training with custom datasets.
        # linear_init_(self.enc_score_head)
        constant_(self.enc_score_head.bias, bias_cls)
        constant_(self.enc_bbox_head.layers[-1].weight, 0.)
        constant_(self.enc_bbox_head.layers[-1].bias, 0.)
        for cls_, reg_ in zip(self.dec_score_head, self.dec_bbox_head):
            # linear_init_(cls_)
            constant_(cls_.bias, bias_cls)
            constant_(reg_.layers[-1].weight, 0.)
            constant_(reg_.layers[-1].bias, 0.)

        linear_init_(self.enc_output[0])
        xavier_uniform_(self.enc_output[0].weight)
        if self.learnt_init_query:
            xavier_uniform_(self.tgt_embed.weight)
        xavier_uniform_(self.query_pos_head.layers[0].weight)
        xavier_uniform_(self.query_pos_head.layers[1].weight)
        for layer in self.input_proj:
            xavier_uniform_(layer[0].weight)
  • 13
    点赞
  • 37
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
根据提供的引用内容,Grounding DINO是一种结合了DINO和基于Transformer的检测器的模型,用于开放式目标检测。它的输入是图像和文本,输出是多个[物体框,名词短语]对。具体来说,Grounding DINO使用DINO模型对图像和文本进行编码,然后使用基于Transformer的检测器对编码后的特征进行检测,最终输出[物体框,名词短语]对。 下面是一个简单的示例代码,演示如何使用Grounding DINO进行开放式目标检测: ```python import torch from torchvision.models.detection import fasterrcnn_resnet50_fpn from transformers import ViTFeatureExtractor, ViTForImageClassification from transformers.models.dino.modeling_dino import DINOHead # 加载预训练的DINO模型和ViT模型 dino = ViTForImageClassification.from_pretrained('facebook/dino-vit-base') dino_head = DINOHead(dino.config) dino_head.load_state_dict(torch.load('dino_head.pth')) dino.eval() vit_feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224') # 加载预训练的Faster R-CNN检测器 model = fasterrcnn_resnet50_fpn(pretrained=True) model.eval() # 输入图像和文本 image = Image.open('example.jpg') text = 'a person riding a bike' # 对图像和文本进行编码 image_features = vit_feature_extractor(images=image, return_tensors='pt')['pixel_values'] text_features = dino_head.get_text_features(text) image_embedding, text_embedding = dino(image_features, text_features) # 使用Faster R-CNN检测器进行目标检测 outputs = model(image_embedding) boxes = outputs[0]['boxes'] labels = outputs[0]['labels'] # 输出[物体框,名词短语]对 for i in range(len(boxes)): print([boxes[i], labels[i]]) ```
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

张飞飞飞飞飞

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值