[DETR] End-to-End Object Detection with Transformers (ECCV 2020 oral): Code Notes

End-to-End Object Detection with Transformers

If this post helps you, I'd appreciate a like~

论文:https://arxiv.org/pdf/2005.12872.pdf
代码:https://github.com/facebookresearch/detr

I. Transformer

The Transformer is a landmark NLP paper: it discards the recurrent networks (RNNs) used before it and builds a network composed almost entirely of attention layers.
论文:https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Two excellent write-ups that give non-NLP readers a quick introduction to the Transformer:
link1:https://zhuanlan.zhihu.com/p/150635505
link2:http://jalammar.github.io/illustrated-transformer/

1. Network architecture

[Figure: Transformer network architecture]

2. Scaled dot-product attention and multi-head attention

[Figures: scaled dot-product attention and multi-head attention diagrams]

3. Attention(Q, K, V) formula

The scaled dot-product attention computation over Q, K, V; K and V have matching dimensions (one value per key):
[Figure: scaled dot-product attention formula]
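For reference, the formula as stated in the Transformer paper (d_k is the key dimension):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$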

4. MultiHead(Q, K, V) formula

[Figure: multi-head attention formula]
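Again from the Transformer paper: each head applies scaled dot-product attention to separately projected Q, K, V, and the heads are concatenated and projected by W^O:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^Q,\ K W_i^K,\ V W_i^V\right)
$$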

5. Position Embedding

[Figure: sinusoidal positional encoding formula]
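The sinusoidal positional encoding from the Transformer paper, where pos is the token position and i indexes the embedding dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$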

II. DETR

1. Motivation

This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection.
Previous attempts either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks.
This paper aims to bridge this gap.

2. Network architecture

For the concrete implementation, see the corresponding code in Part III, which contains fairly detailed comments.
Note: the 32 and 24 in the figure are the input image's height and width ([H, W, 3]) downscaled by a factor of 32, i.e. H/32 and W/32. For example, with the [2, 3, 768, 1024] batch traced in the code walkthrough below, the backbone's last feature map is 768/32 x 1024/32 = 24 x 32.

[Figures: DETR architecture and encoder/decoder data flow, with tensor shapes]

A mask tensor of shape [batch, 24, 32] is also introduced so that the positional embedding only covers valid pixels: it contains 1 on pixels that were added due to padding when batching images of different sizes, and 0 otherwise (a sketch of how such a mask is built follows below).
N = 6: six such decoder (multi-head attention) layers are stacked in series. Every layer contributes to the auxiliary loss (aux loss), but only the last layer's output [100, batch, 256] is used to compute the final bbox and label predictions.
tgt and tgt2 are both derived from the object queries; tgt2 is simply the (layer-)normalized tgt.
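A minimal sketch of how such a padding mask can be built when batching images of different sizes. It mirrors the idea behind nested_tensor_from_tensor_list in util/misc.py, but is a simplified illustration rather than the repo's exact code:

import torch

def pad_and_mask(images):
    """images: list of [3, H_i, W_i] tensors of different sizes."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, max_h, max_w)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # 1 (True) = padded pixel
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img  # copy the real image into the top-left corner
        mask[i, :h, :w] = False    # real pixels get 0
    return batch, mask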

In the decoder's second (cross-)attention, as sketched below:
q = tgt2 + query_pos
k = memory + pos
v = memory (the encoder output, without positional encoding)
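A condensed sketch of that cross-attention step, following the pattern of the decoder layer in detr/models/transformer.py; dropout, the preceding self-attention, and the FFN are omitted, and the pre-norm variant is assumed, so treat this as an illustration rather than the repo's exact code:

import torch
from torch import nn

d_model, nhead = 256, 8
norm2 = nn.LayerNorm(d_model)
cross_attn = nn.MultiheadAttention(d_model, nhead)

def decoder_cross_attention(tgt, query_pos, memory, pos, memory_key_padding_mask=None):
    # tgt: [100, batch, 256] object queries; memory: [H*W, batch, 256] encoder output
    tgt2 = norm2(tgt)                                 # tgt2 is just the normalized tgt
    attn_out, _ = cross_attn(query=tgt2 + query_pos,  # q = tgt2 + query_pos
                             key=memory + pos,        # k = memory + pos
                             value=memory,            # v = memory (no positional encoding added)
                             key_padding_mask=memory_key_padding_mask)
    return tgt + attn_out                             # residual connection back onto tgt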

3. HungarianMatcher

[Figure: Hungarian matching between predictions and ground-truth boxes]

① compute match cost

[Figure: matching cost formula]
First, the matching cost is used to obtain indices, i.e. the optimal one-to-one matching between predictions and targets.
This cost has three terms: an L1 box cost, -GIoU (not 1 - GIoU), and -prob[target class] (not 1 - prob).
Note that this matching cost is treated as a constant: it does not take part in gradient computation and is only used to build the cost matrix and obtain the indices (see the sketch below).
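A condensed sketch of this cost computation, following the structure of HungarianMatcher in detr/models/matcher.py; the cost weights w_class, w_bbox, w_giou are assumed defaults and some reshaping details are simplified:

import torch
from scipy.optimize import linear_sum_assignment
from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou

@torch.no_grad()  # the match cost is a constant: no gradients flow through the matching
def hungarian_match(outputs, targets, w_class=1.0, w_bbox=5.0, w_giou=2.0):
    bs, num_queries = outputs["pred_logits"].shape[:2]
    out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [bs*100, num_classes+1]
    out_bbox = outputs["pred_boxes"].flatten(0, 1)               # [bs*100, 4]
    tgt_ids = torch.cat([t["labels"] for t in targets])
    tgt_bbox = torch.cat([t["boxes"] for t in targets])

    cost_class = -out_prob[:, tgt_ids]                 # -prob[target class], not 1 - prob
    cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)   # L1 cost between boxes
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox),
                                     box_cxcywh_to_xyxy(tgt_bbox))  # -GIoU, not 1 - GIoU

    C = w_bbox * cost_bbox + w_class * cost_class + w_giou * cost_giou
    C = C.view(bs, num_queries, -1).cpu()
    sizes = [len(t["boxes"]) for t in targets]
    # solve one assignment problem per image; each result is (pred_idx, gt_idx)
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
    return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))
            for i, j in indices]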

② compute Hungarian loss

[Figure: Hungarian loss formula]

In the second step, three losses are again computed: a cross-entropy loss for the class (note this differs from the cost used in matching, as the paper also points out), plus an L1 loss and a GIoU loss for the bbox; the two box losses are normalized by the total number of target boxes (num_boxes in the code below).
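For reference, the Hungarian loss as defined in the DETR paper, summed over the matched pairs given the optimal assignment σ̂ found above:

$$
\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)\Big]
$$

$$
\mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) = \lambda_{\text{iou}}\,\mathcal{L}_{\text{iou}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) + \lambda_{L1}\,\big\lVert b_i - \hat{b}_{\hat{\sigma}(i)}\big\rVert_1
$$

In this code the corresponding weights are λ_L1 = 5, λ_iou = 2, and 1 for the classification term (see weight_dict in build below).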

③ Core code for the match and loss parts

match: [Figures: HungarianMatcher code screenshots]
loss: [Figures: SetCriterion loss code screenshots]

III. Core code

1. detr/models/detr.py

Code:

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR model and criterion classes.
"""
import torch
import torch.nn.functional as F
from torch import nn

from util import box_ops
from util.misc import (NestedTensor, nested_tensor_from_tensor_list,
                       accuracy, get_world_size, interpolate,
                       is_dist_avail_and_initialized)

from .backbone import build_backbone
from .matcher import build_matcher
from .segmentation import (DETRsegm, PostProcessPanoptic, PostProcessSegm,
                           dice_loss, sigmoid_focal_loss)
from .transformer import build_transformer
import pdb

class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """

        super().__init__()
        self.num_queries = num_queries # 100
        self.transformer = transformer
        hidden_dim = transformer.d_model # 256
        
        # class and Bounding Box
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1) # [256, 92] 
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3) # 3 layers  the last layer : [256, 4] 
        '''
        (Pdb)  self.bbox_embed
        MLP(
          (layers): ModuleList(
            (0): Linear(in_features=256, out_features=256, bias=True)
            (1): Linear(in_features=256, out_features=256, bias=True)
            (2): Linear(in_features=256, out_features=4, bias=True)
          )
        )
        '''
        # embedding
        self.query_embed = nn.Embedding(num_queries, hidden_dim) # [100, 256]

        # in paper: a 1x1 convolution reduces the channel dimension of the high-level activation map f from C to a smaller dimension d
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1) # 2048 --> 256
        self.backbone = backbone
        self.aux_loss = aux_loss # True
        pdb.set_trace()
    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)] # [N, 100, 92]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding). # [x, y, h, w]
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                dictionnaries containing the two above keys for each decoder layer.
        """
        
        # execution enters here first
        pdb.set_trace()
        # type(samples) NestedTensor [2, 3, 768, 1024]
        if isinstance(samples, (list, torch.Tensor)):   # False
            # github issue https://github.com/facebookresearch/detr/issues/133
            # This is not a segmentation mask, this is a padding mask
            # which contains 1 on pixels that were added due to padding when batching images of different sizes, and 0 otherwise.
            samples = nested_tensor_from_tensor_list(samples)
        
        # 1.forward backbone
        features, pos = self.backbone(samples)

        src, mask = features[-1].decompose() #  src [2, 2048, 24, 32] mask [2, 24, 32]
        pdb.set_trace()
        assert mask is not None

        # 2.forward transformer
        # self.input_proj(src) is  1x1 conv  channel 2048 to 256
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0] # [6,2,100,256]

        # 3.forward class and bbox
        outputs_class = self.class_embed(hs) # [6, 2, 100, 92]
        outputs_coord = self.bbox_embed(hs).sigmoid() # torch.Size([6, 2, 100, 4])
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]} # last layer only: [2, 100, 92], [2, 100, 4]

        # aux_loss
        if self.aux_loss: # True
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        pdb.set_trace()
        return out # dict

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]


class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category 
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
        super().__init__()
        self.num_classes = num_classes # 91
        self.matcher = matcher
        self.weight_dict = weight_dict
        self.eos_coef = eos_coef # 0.1
        self.losses = losses # ['labels', 'boxes', 'cardinality']
        empty_weight = torch.ones(self.num_classes + 1)# [92]
        empty_weight[-1] = self.eos_coef # empty_weight[91] = 0.1 , others = 1
        self.register_buffer('empty_weight', empty_weight)
        #pdb.set_trace()

    def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
        """Classification loss (NLL)
        targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
        """
        # indices [(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0])), (tensor([54]), tensor([0]))]
        # indices[i][0] is the pred index, indices[i][1] is the gt index; they are used below to gather the gt labels
        # targets[0]['labels']: tensor([18,  1,  1, 15, 27, 44, 84, 27]), targets[1]['labels']: tensor([6])
        assert 'pred_logits' in outputs
        src_logits = outputs['pred_logits'] # [2,100,92]

        # get batch_idx and src_idx: idx[0] says which image each matched gt box belongs to, idx[1] is its index among the 100 queries
        idx = self._get_src_permutation_idx(indices) 
        # idx (tensor([0, 0, 0, 0, 0, 0, 0, 0, 1]), tensor([ 2,  4, 31, 42, 83, 84, 89, 91, 54]))

        target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)]) 
        # target_classes_o : tensor([27,  1, 84, 44, 15, 27,  1, 18,  6], device='cuda:0')
        '''
        temp = []
        for t, (_, J) in zip(targets, indices): # t is the target dict of one image (batch_size = 2); (_, J) is the tuple indices[i]
          temp.append(t["labels"][J]) # reorder the gt labels in targets['labels'] according to indices
          pdb.set_trace()
        target_classes_o = torch.cat(temp)
        '''

        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                    dtype=torch.int64, device=src_logits.device) # [2, 100], all entries initialized to self.num_classes, e.g. 91

        # target_classes[idx]: positions [0, 2], [0, 4], ..., [1, 54] are assigned the values in target_classes_o
        target_classes[idx] = target_classes_o
        # This criterion combines log_softmax and nll_loss in a single function.
        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight) 
        losses = {'loss_ce': loss_ce}

        # class_error below is for logging only
        if log:
            # TODO this should probably be a separate loss, not hacked in this one here
            losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]
        pdb.set_trace()
        return losses

    @torch.no_grad()
    def loss_cardinality(self, outputs, targets, indices, num_boxes):
        """ Compute the cardinality error, ie the absolute error in the number of predicted non-empty boxes
        This is not really a loss, it is intended for logging purposes only. It doesn't propagate gradients
        """
        pred_logits = outputs['pred_logits'] # [2, 100, 92]
        device = pred_logits.device
        tgt_lengths = torch.as_tensor([len(v["labels"]) for v in targets], device=device) # torch.Size([2]), value tensor([8, 1]) ([8, 1] is the value, not the shape; the shape is [2]; same convention below)

        # Count the number of predictions that are NOT "no-object" (which is the last class)
        card_pred = (pred_logits.argmax(-1) != pred_logits.shape[-1] - 1).sum(1) # torch.Size([2]); counts the queries whose argmax is not the last (no-object) class, e.g. value [100, 100]
        # cardinality_error: L1 distance between the number of predictions classified as a real class (everything except the last class) and the number of gt boxes per image
        card_err = F.l1_loss(card_pred.float(), tgt_lengths.float())
        losses = {'cardinality_error': card_err}
        pdb.set_trace()
        return losses

    def loss_boxes(self, outputs, targets, indices, num_boxes):
        """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
           targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
           The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
        """
        assert 'pred_boxes' in outputs
        idx = self._get_src_permutation_idx(indices)
        src_boxes = outputs['pred_boxes'][idx] # [9, 4]
        target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0) # [9, 4]

        # l1 loss
        loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')

        losses = {}
        losses['loss_bbox'] = loss_bbox.sum() / num_boxes

        # giou loss = 1 - GIOU
        loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
            box_ops.box_cxcywh_to_xyxy(src_boxes),
            box_ops.box_cxcywh_to_xyxy(target_boxes)))
        losses['loss_giou'] = loss_giou.sum() / num_boxes
        pdb.set_trace()

        return losses

    def loss_masks(self, outputs, targets, indices, num_boxes):
        """Compute the losses related to the masks: the focal loss and the dice loss.
           targets dicts must contain the key "masks" containing a tensor of dim [nb_target_boxes, h, w]
        """
        assert "pred_masks" in outputs

        src_idx = self._get_src_permutation_idx(indices)
        tgt_idx = self._get_tgt_permutation_idx(indices)
        src_masks = outputs["pred_masks"]
        src_masks = src_masks[src_idx]
        masks = [t["masks"] for t in targets]
        # TODO use valid to mask invalid areas due to padding in loss
        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()
        target_masks = target_masks.to(src_masks)
        target_masks = target_masks[tgt_idx]

        # upsample predictions to the target size
        src_masks = interpolate(src_masks[:, None], size=target_masks.shape[-2:],
                                mode="bilinear", align_corners=False)
        src_masks = src_masks[:, 0].flatten(1)

        target_masks = target_masks.flatten(1)
        target_masks = target_masks.view(src_masks.shape)
        losses = {
            "loss_mask": sigmoid_focal_loss(src_masks, target_masks, num_boxes),
            "loss_dice": dice_loss(src_masks, target_masks, num_boxes),
        }
        pdb.set_trace()

        return losses

    def _get_src_permutation_idx(self, indices):
        '''
        indices:
        eg.
        [(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0])), (tensor([54]), tensor([0]))]
        '''
        # permute predictions following indices
        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)]) # tensor([0, 0, 0, 0, 0, 0, 0, 0, 1]): the first image has 8 matched gt boxes, the second has 1

        src_idx = torch.cat([src for (src, _) in indices]) #  tensor([ 2,  4, 31, 42, 83, 84, 89, 91, 54])

        pdb.set_trace()

        return batch_idx, src_idx

    def _get_tgt_permutation_idx(self, indices):
        # permute targets following indices
        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])
        tgt_idx = torch.cat([tgt for (_, tgt) in indices])
        pdb.set_trace()

        return batch_idx, tgt_idx

    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
        loss_map = {
            'labels': self.loss_labels,
            'cardinality': self.loss_cardinality,
            'boxes': self.loss_boxes,
            'masks': self.loss_masks
        }
        assert loss in loss_map, f'do you really want to compute {loss} loss?'

        pdb.set_trace()
        return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)

    def forward(self, outputs, targets):
        """ This performs the loss computation.
        Parameters:
             outputs: dict of tensors, see the output specification of the model for the format 
             etc.
             outputs: dict_keys(['pred_logits', 'pred_boxes', 'aux_outputs'])
             'pred_logits' : torch.Size([2, 100, 92])
             'pred_boxes' : torch.Size([2, 100, 4])
             'aux_outputs' : len(outputs['aux_outputs']) = 5, outputs['aux_outputs'][0].keys() = dict_keys(['pred_logits', 'pred_boxes']); the outputs of the 5 decoder layers other than the last one.
             
             targets: list of dicts, such that len(targets) == batch_size.
                      The expected keys in each dict depends on the losses applied, see each loss' doc
        """ 
        outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}  # len = 2

        # Retrieve the matching between the outputs of the last layer and the targets
        # forward matcher
        indices = self.matcher(outputs_without_aux, targets)
        pdb.set_trace()

        # Compute the average number of target boxes accross all nodes, for normalization purposes
        num_boxes = sum(len(t["labels"]) for t in targets) # 9
        num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)  # tensor[9.]
        if is_dist_avail_and_initialized(): # True
            torch.distributed.all_reduce(num_boxes) # None
        num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item() # clamp to a minimum of 1, e.g. 9.0
        
        pdb.set_trace()

        # Compute all the requested losses
        losses = {}
        for loss in self.losses: # ['labels', 'boxes', 'cardinality']
            losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))
        pdb.set_trace()

        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
        if 'aux_outputs' in outputs:
            for i, aux_outputs in enumerate(outputs['aux_outputs']):
                indices = self.matcher(aux_outputs, targets)
                for loss in self.losses:
                    if loss == 'masks':
                        # Intermediate masks losses are too costly to compute, we ignore them.
                        continue
                    kwargs = {}
                    if loss == 'labels':
                        # Logging is enabled only for the last layer
                        kwargs = {'log': False}
                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
                    l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
                    losses.update(l_dict)
                    pdb.set_trace()
        pdb.set_trace()
        return losses


class PostProcess(nn.Module):
    """ This module converts the model's output into the format expected by the coco api"""
    @torch.no_grad()
    def forward(self, outputs, target_sizes):
        """ Perform the computation
        Parameters:
            outputs: raw outputs of the model
            target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
                          For evaluation, this must be the original image size (before any data augmentation)
                          For visualization, this should be the image size after data augment, but before padding
        """
        out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']

        assert len(out_logits) == len(target_sizes)
        assert target_sizes.shape[1] == 2

        prob = F.softmax(out_logits, -1)
        scores, labels = prob[..., :-1].max(-1)

        # convert to [x0, y0, x1, y1] format
        boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
        # and from relative [0, 1] to absolute [0, height] coordinates
        img_h, img_w = target_sizes.unbind(1)
        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
        boxes = boxes * scale_fct[:, None, :]

        results = [{'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]
        pdb.set_trace()
        return results


class MLP(nn.Module): # Feed Forward Network
    """ Very simple multi-layer perceptron (also called FFN)"""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers # 3
        h = [hidden_dim] * (num_layers - 1) # [256, 256]
        # [input_dim] + h : [256, 256, 256] 
        # h + [output_dim] : [256, 256, 4]
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))
        '''
        
        self.layers:
            ModuleList(
              (0): Linear(in_features=256, out_features=256, bias=True)
              (1): Linear(in_features=256, out_features=256, bias=True)
              (2): Linear(in_features=256, out_features=4, bias=True)
            )
        '''

    def forward(self, x): # x: [6, 2, 100, 256] -> output: [6, 2, 100, 4]
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        pdb.set_trace()
        return x


def build(args):
    # the `num_classes` naming here is somewhat misleading.
    # it indeed corresponds to `max_obj_id + 1`, where max_obj_id
    # is the maximum id for a class in your dataset. For example,
    # COCO has a max_obj_id of 90, so we pass `num_classes` to be 91.
    # As another example, for a dataset that has a single class with id 1,
    # you should pass `num_classes` to be 2 (max_obj_id + 1).
    # For more details on this, check the following discussion
    # https://github.com/facebookresearch/detr/issues/108#issuecomment-650269223
    num_classes = 20 if args.dataset_file != 'coco' else 91 # 91, not 81
    if args.dataset_file == "coco_panoptic":
        # for panoptic, we just add a num_classes that is large enough to hold
        # max_obj_id + 1, but the exact value doesn't really matter
        num_classes = 250
    device = torch.device(args.device)
    
    # 1. build backbone entry
    backbone = build_backbone(args) 

    # 2. build transformer entry
    transformer = build_transformer(args)
    # 3. build DETR entry   integrate backbone and transformer     
    model = DETR( 
        backbone,
        transformer,
        num_classes=num_classes,
        num_queries=args.num_queries,
        aux_loss=args.aux_loss,
    )
    if args.masks:
        model = DETRsegm(model, freeze_detr=(args.frozen_weights is not None))

    # 4. build matcher 
    matcher = build_matcher(args) 

    weight_dict = {'loss_ce': 1, 'loss_bbox': args.bbox_loss_coef} # {'loss_ce': 1, 'loss_bbox': 5}
    weight_dict['loss_giou'] = args.giou_loss_coef # 2
    if args.masks:
        weight_dict["loss_mask"] = args.mask_loss_coef
        weight_dict["loss_dice"] = args.dice_loss_coef
    # TODO this is a hack
    if args.aux_loss: # True
        aux_weight_dict = {}
        for i in range(args.dec_layers - 1):# 5
            aux_weight_dict.update({k + f'_{i}': v for k, v in weight_dict.items()})
        weight_dict.update(aux_weight_dict)

    losses = ['labels', 'boxes', 'cardinality']
    if args.masks: # F
        losses += ["masks"]

    # 5. build criterion
    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
    criterion.to(device)

    postprocessors = {'bbox': PostProcess()}
    if args.masks:
        postprocessors['segm'] = PostProcessSegm()
        if args.dataset_file == "coco_panoptic":
            is_thing_map = {i: i <= 90 for i in range(201)}
            postprocessors["panoptic"] = PostProcessPanoptic(is_thing_map, threshold=0.85)
    #pdb.set_trace()
    return model, criterion, postprocessors


'''
{
  'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2, 
  'loss_ce_0': 1, 'loss_bbox_0': 5, 'loss_giou_0': 2, 
  'loss_ce_1': 1, 'loss_bbox_1': 5, 'loss_giou_1': 2, 
  'loss_ce_2': 1, 'loss_bbox_2': 5, 'loss_giou_2': 2, 
  'loss_ce_3': 1, 'loss_bbox_3': 5, 'loss_giou_3': 2, 
  'loss_ce_4': 1, 'loss_bbox_4': 5, 'loss_giou_4': 2
}


Detr(
  (detr): DETR(
    (transformer): Transformer(
      (encoder): TransformerEncoder(
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (1): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (2): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        