Article link: https://blog.csdn.net/litt1e/article/details/132611196

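Training pipeline
1. The input image passes through the CNN backbone to produce deep features at 32x downsampling;
2. The feature map is flattened into a token sequence, positional encodings are added, and the result is fed to the encoder;
3. The encoder output and the Object Query are fed to the decoder to obtain decoded features;
4. The decoded features are passed through the FFN to obtain predictions;
5. A cost matrix is computed from the predictions, and the Hungarian algorithm matches predictions to GT, yielding positive and negative samples;
6. The classification and regression losses are computed from the positive and negative samples.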

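The previous two chapters covered steps 1, 2 and 3 of the training pipeline: DETR takes 32x-downsampled features from the CNN backbone, builds learnable Object Query embeddings, and uses an encoder-decoder structure to output decoded features. Next we need to understand how DETR assigns positive and negative samples (label assignment) and how the loss is built from those samples. This article contains a lot of code, so please bear with it.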

Code analysis

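Picking up where we left off, x is fed into self.transformer (outs_dec, _ = self.transformer(...)), which returns outs_dec with shape [6,2,100,256]. The first dimension is 6 because the outputs of the intermediate layers of self.decoder are kept, 2 is the batch size, 100 is the number of object queries, and 256 is the embedding dimension.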

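The classification head is self.fc_cls = Linear(in_features=256, out_features=8, bias=True). Feeding outs_dec into self.fc_cls gives all_cls_scores, whose last dimension equals the out_features of that head. With the COCO setup of 91 categories plus background, 92 classes in total, the shape is [6,2,100,92]; the layer printed here has out_features=8, so in that run the last dimension is 8 (this matches the [100,8] shape quoted later for cls_score). self.reg_ffn is an FFN module, printed below, made of two fully connected layers with an activation layer and dropout layers. The regression head is self.fc_reg = Linear(in_features=256, out_features=4, bias=True): outs_dec goes through self.reg_ffn, the activation and self.fc_reg, followed by a sigmoid, to give all_bbox_preds (shape [6,2,100,4]), which holds the 4 normalized box coordinates (cx, cy, w, h).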

FFN(
(activate): ReLU(inplace=True)
(layers): Sequential(
(0): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.0, inplace=False)
)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Dropout(p=0.0, inplace=False)
)
(dropout_layer): Identity()
)

def forward_single(self, x, img_metas):
        """Forward function for a single feature level.

        Args:
            x (Tensor): Input feature from backbone's single stage, shape
                [bs, c, h, w].
            img_metas (list[dict]): List of image information.

        Returns:
            all_cls_scores (Tensor): Outputs from the classification head,
                shape [nb_dec, bs, num_query, cls_out_channels]. Note
                cls_out_channels should includes background.
            all_bbox_preds (Tensor): Sigmoid outputs from the regression
                head with normalized coordinate format (cx, cy, w, h).
                Shape [nb_dec, bs, num_query, 4].
        """
        # construct binary masks which used for the transformer.
        # NOTE following the official DETR repo, non-zero values representing
        # ignored positions, while zero values means valid positions.
        batch_size = x.size(0)
        input_img_h, input_img_w = img_metas[0]['batch_input_shape']
        masks = x.new_ones((batch_size, input_img_h, input_img_w))
        for img_id in range(batch_size):
            img_h, img_w, _ = img_metas[img_id]['img_shape']
            masks[img_id, :img_h, :img_w] = 0

        x = self.input_proj(x)
        # interpolate masks to have the same spatial shape with x
        masks = F.interpolate(
            masks.unsqueeze(1), size=x.shape[-2:]).to(torch.bool).squeeze(1)
        # position encoding
        pos_embed = self.positional_encoding(masks)  # [bs, embed_dim, h, w]
        # outs_dec: [nb_dec, bs, num_query, embed_dim]
        outs_dec, _ = self.transformer(x, masks, self.query_embedding.weight,
                                       pos_embed)

        all_cls_scores = self.fc_cls(outs_dec)
        all_bbox_preds = self.fc_reg(self.activate(
            self.reg_ffn(outs_dec))).sigmoid()
        return all_cls_scores, all_bbox_preds
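To make the mask construction at the top of forward_single concrete, here is a minimal sketch in plain Python (nested lists instead of tensors, so it runs without PyTorch). The helper name build_padding_masks and the toy sizes are assumptions made for illustration only; the convention follows the code comment above: non-zero marks padded (ignored) positions, zero marks valid positions.

```python
def build_padding_masks(batch_input_shape, img_shapes):
    """Return one mask per image: 1 marks padded pixels, 0 marks valid pixels."""
    pad_h, pad_w = batch_input_shape
    masks = []
    for img_h, img_w in img_shapes:
        # A pixel is valid only if it lies inside the un-padded image area.
        mask = [[0 if (y < img_h and x < img_w) else 1 for x in range(pad_w)]
                for y in range(pad_h)]
        masks.append(mask)
    return masks


# Hypothetical toy batch: everything is padded to 4x6,
# the first image fills the canvas, the second one is only 3x4.
masks = build_padding_masks((4, 6), [(4, 6), (3, 4)])
for row in masks[1]:
    print(row)
```

In the real head the mask is then resized to the feature-map resolution with F.interpolate before it is passed to the positional encoding and the transformer.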

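The path from forward_single to loss_single is straightforward, so it is not detailed here. loss_single receives cls_scores with shape [2,100,92] and bbox_preds with shape [2,100,4]; gt_bboxes_list holds the GT boxes of the batch_size=2 images, and gt_labels_list holds their class labels.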

def loss_single(self,
                    cls_scores,
                    bbox_preds,
                    gt_bboxes_list,
                    gt_labels_list,
                    img_metas,
                    gt_bboxes_ignore_list=None):
        """Loss function for outputs from a single decoder layer of a single
        feature level.

        Args:
            cls_scores (Tensor): Box score logits from a single decoder layer
                for all images. Shape [bs, num_query, cls_out_channels].
            bbox_preds (Tensor): Sigmoid outputs from a single decoder layer
                for all images, with normalized coordinate (cx, cy, w, h) and
                shape [bs, num_query, 4].
            gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
                with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
            gt_labels_list (list[Tensor]): Ground truth class indices for each
                image with shape (num_gts, ).
            img_metas (list[dict]): List of image meta information.
            gt_bboxes_ignore_list (list[Tensor], optional): Bounding
                boxes which can be ignored for each image. Default None.

        Returns:
            dict[str, Tensor]: A dictionary of loss components for outputs from
                a single decoder layer.
        """
        num_imgs = cls_scores.size(0)
        cls_scores_list = [cls_scores[i] for i in range(num_imgs)]
        bbox_preds_list = [bbox_preds[i] for i in range(num_imgs)]
        cls_reg_targets = self.get_targets(cls_scores_list, bbox_preds_list,
                                           gt_bboxes_list, gt_labels_list,
                                           img_metas, gt_bboxes_ignore_list)
        (labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
         num_total_pos, num_total_neg) = cls_reg_targets
        labels = torch.cat(labels_list, 0)
        label_weights = torch.cat(label_weights_list, 0)
        bbox_targets = torch.cat(bbox_targets_list, 0)
        bbox_weights = torch.cat(bbox_weights_list, 0)

        # classification loss
        cls_scores = cls_scores.reshape(-1, self.cls_out_channels)
        # construct weighted avg_factor to match with the official DETR repo
        cls_avg_factor = num_total_pos * 1.0 + \
            num_total_neg * self.bg_cls_weight
        if self.sync_cls_avg_factor:
            cls_avg_factor = reduce_mean(
                cls_scores.new_tensor([cls_avg_factor]))
        cls_avg_factor = max(cls_avg_factor, 1)

        loss_cls = self.loss_cls(
            cls_scores, labels, label_weights, avg_factor=cls_avg_factor)

        # Compute the average number of gt boxes across all gpus, for
        # normalization purposes
        num_total_pos = loss_cls.new_tensor([num_total_pos])
        num_total_pos = torch.clamp(reduce_mean(num_total_pos), min=1).item()

        # construct factors used for rescale bboxes
        factors = []
        for img_meta, bbox_pred in zip(img_metas, bbox_preds):
            img_h, img_w, _ = img_meta['img_shape']
            factor = bbox_pred.new_tensor([img_w, img_h, img_w,
                                           img_h]).unsqueeze(0).repeat(
                                               bbox_pred.size(0), 1)
            factors.append(factor)
        factors = torch.cat(factors, 0)

        # DETR regress the relative position of boxes (cxcywh) in the image,
        # thus the learning target is normalized by the image size. So here
        # we need to re-scale them for calculating IoU loss
        bbox_preds = bbox_preds.reshape(-1, 4)
        bboxes = bbox_cxcywh_to_xyxy(bbox_preds) * factors
        bboxes_gt = bbox_cxcywh_to_xyxy(bbox_targets) * factors

        # regression IoU loss, defaultly GIoU loss
        loss_iou = self.loss_iou(
            bboxes, bboxes_gt, bbox_weights, avg_factor=num_total_pos)

        # regression L1 loss
        loss_bbox = self.loss_bbox(
            bbox_preds, bbox_targets, bbox_weights, avg_factor=num_total_pos)
        return loss_cls, loss_bbox, loss_iou
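The classification normalizer in loss_single is easy to misread, so here is a small sketch of the arithmetic only. It leaves out the cross-GPU reduce_mean step, and the default bg_cls_weight=0.1 is an assumption for illustration (the actual value comes from the head's configuration).

```python
def cls_avg_factor(num_total_pos, num_total_neg, bg_cls_weight=0.1):
    """Weighted sample count used to average the classification loss.

    Positives count fully, negatives (background queries) are down-weighted
    by bg_cls_weight. The result is clamped to at least 1 to avoid dividing
    by a value close to zero when a batch has no objects.
    """
    factor = num_total_pos * 1.0 + num_total_neg * bg_cls_weight
    return max(factor, 1)


# Hypothetical batch: 2 images x 100 queries, 7 GT boxes in total,
# so 7 queries are positive and 193 are background.
print(cls_avg_factor(num_total_pos=7, num_total_neg=193))
```

The regression losses use a different normalizer: they are averaged over num_total_pos only, because bbox_weights is zero for every background query.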

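Next comes _get_target_single, which assigns positive and negative samples for each image. Its inputs have no batch dimension, i.e. cls_score has shape [100,8]. self.assigner was already built when detr_head was initialized; it is an instance of the HungarianAssigner class.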

def _get_target_single(self,
                           cls_score,
                           bbox_pred,
                           gt_bboxes,
                           gt_labels,
                           img_meta,
                           gt_bboxes_ignore=None):
        """Compute regression and classification targets for one image.

        Outputs from a single decoder layer of a single feature level are used.

        Args:
            cls_score (Tensor): Box score logits from a single decoder layer
                for one image. Shape [num_query, cls_out_channels].
            bbox_pred (Tensor): Sigmoid outputs from a single decoder layer
                for one image, with normalized coordinate (cx, cy, w, h) and
                shape [num_query, 4].
            gt_bboxes (Tensor): Ground truth bboxes for one image with
                shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
            gt_labels (Tensor): Ground truth class indices for one image
                with shape (num_gts, ).
            img_meta (dict): Meta information for one image.
            gt_bboxes_ignore (Tensor, optional): Bounding boxes
                which can be ignored. Default None.

        Returns:
            tuple[Tensor]: a tuple containing the following for one image.

                - labels (Tensor): Labels of each image.
                - label_weights (Tensor): Label weights of each image.
                - bbox_targets (Tensor): BBox targets of each image.
                - bbox_weights (Tensor): BBox weights of each image.
                - pos_inds (Tensor): Sampled positive indices for each image.
                - neg_inds (Tensor): Sampled negative indices for each image.
        """

        num_bboxes = bbox_pred.size(0)
        # assigner and sampler
        assign_result = self.assigner.assign(bbox_pred, cls_score, gt_bboxes,
                                             gt_labels, img_meta,
                                             gt_bboxes_ignore)
        sampling_result = self.sampler.sample(assign_result, bbox_pred,
                                              gt_bboxes)
        pos_inds = sampling_result.pos_inds
        neg_inds = sampling_result.neg_inds

        # label targets
        labels = gt_bboxes.new_full((num_bboxes, ),
                                    self.num_classes,
                                    dtype=torch.long)
        labels[pos_inds] = gt_labels[sampling_result.pos_assigned_gt_inds]
        label_weights = gt_bboxes.new_ones(num_bboxes)

        # bbox targets
        bbox_targets = torch.zeros_like(bbox_pred)
        bbox_weights = torch.zeros_like(bbox_pred)
        bbox_weights[pos_inds] = 1.0
        img_h, img_w, _ = img_meta['img_shape']

        # DETR regress the relative position of boxes (cxcywh) in the image.
        # Thus the learning target should be normalized by the image size, also
        # the box format should be converted from defaultly x1y1x2y2 to cxcywh.
        factor = bbox_pred.new_tensor([img_w, img_h, img_w,
                                       img_h]).unsqueeze(0)
        pos_gt_bboxes_normalized = sampling_result.pos_gt_bboxes / factor
        pos_gt_bboxes_targets = bbox_xyxy_to_cxcywh(pos_gt_bboxes_normalized)
        bbox_targets[pos_inds] = pos_gt_bboxes_targets
        return (labels, label_weights, bbox_targets, bbox_weights, pos_inds,
                neg_inds)
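The last few lines of _get_target_single turn pixel-space GT boxes into the normalized (cx, cy, w, h) targets that the sigmoid outputs are regressed against. A minimal plain-Python sketch of that conversion follows; the function normalize_gt_box and the example numbers are hypothetical, and bbox_xyxy_to_cxcywh here is a list-based stand-in for the tensor utility of the same name.

```python
def bbox_xyxy_to_cxcywh(box):
    """Convert (x1, y1, x2, y2) to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


def normalize_gt_box(box_xyxy, img_h, img_w):
    """Divide by the image size first, then switch to center format."""
    factor = [img_w, img_h, img_w, img_h]
    normalized = [v / f for v, f in zip(box_xyxy, factor)]
    return bbox_xyxy_to_cxcywh(normalized)


# A 200x240 pixel box inside a 640x480 (width x height) image.
target = normalize_gt_box([100, 120, 300, 360], img_h=480, img_w=640)
print(target)  # [0.3125, 0.5, 0.3125, 0.5]
```

Only the rows indexed by pos_inds receive such targets; all other rows of bbox_targets stay zero and are masked out by bbox_weights.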

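We now reach the core function assign. num_gts is the number of GT boxes, num_bboxes=100, and img_meta['img_shape'] is the current image size (after preprocessing, i.e. after augmentation). The main idea of DETR's label assignment is to recast object detection as a set prediction problem: a cost matrix is built from the predicted class scores and boxes (L1 distance and IoU), and the Hungarian algorithm finds the optimal matching. Because positive samples and GT boxes are in one-to-one correspondence, neither anchors nor NMS are needed.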

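One part of the cost matrix comes from ClassificationCost, shown below. First, softmax is applied over the last dimension of cls_pred (shape [100,92]) to obtain class scores. gt_labels holds the classes of the GT boxes in this image; cls_score[:, gt_labels] selects, for every query, the scores of the GT classes, and the negated result is assigned to cls_cost (shape [100,num_gts]).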

cls_score = cls_pred.softmax(-1)
cls_cost = -cls_score[:, gt_labels]
return cls_cost * self.weight
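The indexing trick cls_score[:, gt_labels] can be spelled out with explicit loops. The following is a plain-Python sketch under simplifying assumptions: lists instead of tensors, a tiny 3-class problem, and a weight of 1.0 chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_cost(cls_pred, gt_labels, weight=1.0):
    """cost[i][j] = -P(query i predicts the class of GT j) * weight."""
    cost = []
    for logits in cls_pred:
        probs = softmax(logits)
        cost.append([-probs[label] * weight for label in gt_labels])
    return cost


# 3 queries, 3 output channels (two object classes plus background).
cls_pred = [[2.0, 0.0, 0.0],
            [0.0, 2.0, 0.0],
            [0.0, 0.0, 2.0]]
gt_labels = [1, 0]  # GT 0 has class 1, GT 1 has class 0
cls_cost = classification_cost(cls_pred, gt_labels)
for row in cls_cost:
    print([round(v, 3) for v in row])
```

Query 1 is confident in class 1, so its cost against GT 0 is the lowest in that column, which is exactly what makes it attractive to the matcher.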

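Another part of the cost matrix is BBoxL1Cost. Because bbox_pred is normalized by a sigmoid, gt_bboxes must also be normalized (assign divides them by the image size before this call). As shown below, the GT boxes are first converted from xyxy to cxcywh format to match bbox_pred, and then the L1 distance between every prediction and every GT box is computed as bbox_cost (shape [100,num_gts]).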

gt_bboxes = bbox_xyxy_to_cxcywh(gt_bboxes)
bbox_cost = torch.cdist(bbox_pred, gt_bboxes, p=1)
return bbox_cost * self.weight
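torch.cdist with p=1 computes the pairwise L1 distance between two sets of vectors. A plain-Python equivalent, assuming both inputs are already normalized (cx, cy, w, h) lists and using an illustrative weight of 1.0, might look like this:

```python
def bbox_l1_cost(bbox_pred, gt_bboxes, weight=1.0):
    """cost[i][j] = weight * sum of absolute coordinate differences
    between prediction i and GT box j (both in normalized cxcywh)."""
    return [[weight * sum(abs(p - g) for p, g in zip(pred, gt))
             for gt in gt_bboxes]
            for pred in bbox_pred]


bbox_pred = [[0.5, 0.5, 0.25, 0.25],   # identical to the GT box
             [0.25, 0.25, 0.5, 0.5]]   # off by 0.25 in every coordinate
gt_cxcywh = [[0.5, 0.5, 0.25, 0.25]]
l1_cost = bbox_l1_cost(bbox_pred, gt_cxcywh)
print(l1_cost)  # [[0.0], [1.0]]
```

Because the distance is taken on normalized coordinates, the same pixel error costs more for a small image than for a large one, which is one reason a scale-invariant IoU term is used alongside it.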

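The remaining part of the cost matrix comes from IoUCost. bbox_pred is first converted to xyxy format and rescaled to the image size; then, as shown below, the IoU between every prediction and every GT box is computed.

overlaps = bbox_overlaps(
    bboxes, gt_bboxes, mode=self.iou_mode, is_aligned=False)
# The 1 is a constant that doesn't change the matching, so omitted.
iou_cost = -overlaps
return iou_cost * self.weight

Note that iou_cost is -iou, so a smaller value means the prediction is closer to the GT box. cls_cost is also negative, and a smaller value means higher confidence for that class. bbox_cost is the distance between the predicted and GT coordinates, so a smaller value again means a closer match. Summing the three (cost = cls_cost + reg_cost + iou_cost) therefore gives a cost where a smaller value means that prediction and that GT box are a better match.

matched_row_inds, matched_col_inds = linear_sum_assignment(cost): cost is the cost matrix of the bipartite matching problem, and passing it to linear_sum_assignment yields the optimal matching between object queries and GT boxes (each GT box is matched to exactly one query, and the unmatched queries become background). Finally, the assignment result is stored in an AssignResult object.

The loss computation is similar to that of conventional object detectors and fairly simple, so it is not detailed here. DETR uses three losses: GIoU loss for regression, cross-entropy loss for classification, and, unlike many detectors, an additional L1 loss to help coordinate regression. A likely reason is that DIoU loss did not exist at the time: GIoU does not emphasize the distance between the centers of the GT and predicted boxes, so the L1 loss was added to compensate.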
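Since GIoU is the default both for the IoU cost and for the IoU loss, a compact reference for how it is computed may help. This is a simplified plain-Python sketch for two boxes in (x1, y1, x2, y2) format; the eps handling is an assumption and may differ in detail from bbox_overlaps.

```python
def giou(box_a, box_b, eps=1e-7):
    """Generalized IoU of two xyxy boxes, in the range (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / max(union, eps)
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / max(enclose, eps)


def iou_cost(bboxes, gt_bboxes, weight=1.0):
    """cost[i][j] = -GIoU(prediction i, GT j) * weight."""
    return [[-giou(b, g) * weight for g in gt_bboxes] for b in bboxes]


print(giou([0, 0, 10, 10], [0, 0, 10, 10]))   # identical boxes
print(giou([0, 0, 10, 10], [20, 0, 30, 10]))  # disjoint boxes, negative value
```

Unlike plain IoU, GIoU still varies when two boxes do not overlap, so disjoint predictions receive different costs depending on how far they are from the GT box.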

def assign(self,
               bbox_pred,
               cls_pred,
               gt_bboxes,
               gt_labels,
               img_meta,
               gt_bboxes_ignore=None,
               eps=1e-7):
        """Computes one-to-one matching based on the weighted costs.

        This method assign each query prediction to a ground truth or
        background. The `assigned_gt_inds` with -1 means don't care,
        0 means negative sample, and positive number is the index (1-based)
        of assigned gt.
        The assignment is done in the following steps, the order matters.

        1. assign every prediction to -1
        2. compute the weighted costs
        3. do Hungarian matching on CPU based on the costs
        4. assign all to 0 (background) first, then for each matched pair
           between predictions and gts, treat this prediction as foreground
           and assign the corresponding gt index (plus 1) to it.

        Args:
            bbox_pred (Tensor): Predicted boxes with normalized coordinates
                (cx, cy, w, h), which are all in range [0, 1]. Shape
                [num_query, 4].
            cls_pred (Tensor): Predicted classification logits, shape
                [num_query, num_class].
            gt_bboxes (Tensor): Ground truth boxes with unnormalized
                coordinates (x1, y1, x2, y2). Shape [num_gt, 4].
            gt_labels (Tensor): Label of `gt_bboxes`, shape (num_gt,).
            img_meta (dict): Meta information for current image.
            gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
                labelled as `ignored`. Default None.
            eps (int | float, optional): A value added to the denominator for
                numerical stability. Default 1e-7.

        Returns:
            :obj:`AssignResult`: The assigned result.
        """
        assert gt_bboxes_ignore is None, \
            'Only case when gt_bboxes_ignore is None is supported.'
        num_gts, num_bboxes = gt_bboxes.size(0), bbox_pred.size(0)

        # 1. assign -1 by default
        assigned_gt_inds = bbox_pred.new_full((num_bboxes, ),
                                              -1,
                                              dtype=torch.long)
        assigned_labels = bbox_pred.new_full((num_bboxes, ),
                                             -1,
                                             dtype=torch.long)
        if num_gts == 0 or num_bboxes == 0:
            # No ground truth or boxes, return empty assignment
            if num_gts == 0:
                # No ground truth, assign all to background
                assigned_gt_inds[:] = 0
            return AssignResult(
                num_gts, assigned_gt_inds, None, labels=assigned_labels)
        img_h, img_w, _ = img_meta['img_shape']
        factor = gt_bboxes.new_tensor([img_w, img_h, img_w,
                                       img_h]).unsqueeze(0)

        # 2. compute the weighted costs
        # classification and bboxcost.
        cls_cost = self.cls_cost(cls_pred, gt_labels)
        # regression L1 cost
        normalize_gt_bboxes = gt_bboxes / factor
        reg_cost = self.reg_cost(bbox_pred, normalize_gt_bboxes)
        # regression iou cost, defaultly giou is used in official DETR.
        bboxes = bbox_cxcywh_to_xyxy(bbox_pred) * factor
        iou_cost = self.iou_cost(bboxes, gt_bboxes)
        # weighted sum of above three costs
        cost = cls_cost + reg_cost + iou_cost

        # 3. do Hungarian matching on CPU using linear_sum_assignment
        cost = cost.detach().cpu()
        if linear_sum_assignment is None:
            raise ImportError('Please run "pip install scipy" '
                              'to install scipy first.')
        matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
        matched_row_inds = torch.from_numpy(matched_row_inds).to(
            bbox_pred.device)
        matched_col_inds = torch.from_numpy(matched_col_inds).to(
            bbox_pred.device)

        # 4. assign backgrounds and foregrounds
        # assign all indices to backgrounds first
        assigned_gt_inds[:] = 0
        # assign foregrounds based on matching results
        assigned_gt_inds[matched_row_inds] = matched_col_inds + 1
        assigned_labels[matched_row_inds] = gt_labels[matched_col_inds]
        return AssignResult(
            num_gts, assigned_gt_inds, None, labels=assigned_labels)
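Steps 3 and 4 of assign can be reproduced on a toy cost matrix. The real code calls scipy's linear_sum_assignment; the sketch below deliberately swaps in a brute-force search over permutations so that it runs with the standard library alone. It is only practical for tiny matrices, and it assumes at least one GT box and at least as many queries as GT boxes. The function names are hypothetical.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Find the query for each GT that minimizes the total cost.

    cost[i][j] is the cost of matching query i to GT j.
    Returns (row_inds, col_inds) sorted by row index.
    """
    num_queries, num_gts = len(cost), len(cost[0])
    best_rows, best_total = None, float("inf")
    # rows[j] is the query tentatively matched to GT j.
    for rows in permutations(range(num_queries), num_gts):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best_total:
            best_rows, best_total = rows, total
    pairs = sorted(zip(best_rows, range(num_gts)))
    return [r for r, _ in pairs], [c for _, c in pairs]


def assign_gt_inds(cost):
    """0 means background, k > 0 means matched to GT k-1 (1-based index)."""
    rows, cols = brute_force_assignment(cost)
    assigned = [0] * len(cost)      # everything is background first
    for r, c in zip(rows, cols):
        assigned[r] = c + 1         # then mark the matched queries
    return assigned


# 4 queries, 2 GT boxes; lower cost means a better match.
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6],
        [0.3, 0.4]]
print(assign_gt_inds(cost))  # [2, 1, 0, 0]
```

Query 0 is matched to GT 1 and query 1 to GT 0 (total cost 0.3); the remaining two queries become negative samples, which is why no NMS is needed afterwards.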

Conclusion

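As the results comparison shows, DETR performs quite well: a ResNet50-based DETR is competitive with a heavily tuned Faster-RCNN. DETR is also the strongest on large objects, but somewhat weaker on small objects. This is easy to understand: DETR feeds 32x-downsampled features into the encoder-decoder, and compared with a CNN, self-attention has a global view and more freedom, so it detects large objects better. However, because fine detail is missing (only the 32x-downsampled features are used, and there is no FPN-like fusion mechanism), DETR cannot handle small objects well.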

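In addition, the matching-based loss makes training hard to converge. In conventional detectors, the loss can rely on hand-designed anchors and optimize the corresponding convolution outputs directly by anchor position (the anchors associated with each GT box are fixed). DETR has no notion of anchors; it introduces Object queries and lets the model learn the positional information of the GT boxes by itself (which is strongly tied to the data distribution of the training set). This makes DETR hard to converge and hungry for data. On the other hand, precisely because the model has so much freedom, visualizing its self-attention features shows that they can completely separate instances, which conventional detectors cannot do.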

Deformable DETR addresses both of these problems fairly well. Next, we will walk through Deformable DETR together.
