[Transformer] DETR loss, line by line (Part 4)

胡侃有料

Last modified 2024-04-26 14:54:25

First published 2024-04-26 14:52:27

This is an original article by the author; do not repost without the author's permission.

Article link: https://blog.csdn.net/weixin_39190382/article/details/138218863

every blog every motto: You can do more than you think.
https://blog.csdn.net/weixin_39190382?type=blog

0. Preface

A line-by-line walkthrough of the DETR loss.

1. Overview

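The loss comes from SetCriterion.

2. Recap

2.1 Decoder return values

The decoder has two possible return values:

A stack of every DecoderLayer's output, so that a loss can be computed on each layer's output
Only the last DecoderLayer's output, so that the loss is computed on that output alone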

class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        # clone the given layer num_layers times
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []  # outputs of every DecoderLayer, usable for deep supervision

        for layer in self.layers:
            # each layer's output is the next layer's input; output: (100,bs,512)
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))  # keep the normalized output of this DecoderLayer

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        # Return 1: a tensor stacking every DecoderLayer's output, (6,100,bs,512)
        if self.return_intermediate:
            return torch.stack(intermediate)

        # Return 2: the last DecoderLayer's output only, (100,bs,512) -> (1,100,bs,512)
        return output.unsqueeze(0)

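2.2 Transformer return values

Only the first return value of Transformer is transformed. Depending on what the decoder returns, hs holds either the last DecoderLayer's output or the outputs of all DecoderLayers, so the decoder output has one of these shapes:

(1,100,bs,512)
(6,100,bs,512)

After hs.transpose(1, 2) the final shape is:

(1,bs,100,512)
(6,bs,100,512)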

class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        # encoder
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed):
        # flatten bxCxHxW to HWxbxC
        bs, c, h, w = src.shape
        # (b,c,h,w) ->(b,c,hw) -> (hw,b,c) 
        src = src.flatten(2).permute(2, 0, 1)
        # (b,c,h,w) ->(b,c,hw) -> (hw,b,c) 
        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
        mask = mask.flatten(1)

        # (hw,b,c)
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)

        # (num_query,hidden_dim) -> (num_query,1,hidden_dim) -> (num_query,bs,hidden_dim)
        # (100,512) -> (100,1,512) -> (100,bs,512)
        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
        # (100,bs,512)
        tgt = torch.zeros_like(query_embed)

        # hs: (1,100,bs,512), or (6,100,bs,512) when return_intermediate_dec=True
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)

        # first return value: (1,bs,100,512)
        # second return value: (hw,b,c) -> (b,c,hw) -> (b,c,h,w)
        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

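2.3 DETR return values

DETR likewise has two forms of output. The first contains only the last DecoderLayer's predictions, with shapes:

class: pred_logits, (bs,100,num_class+1)
box: pred_boxes, (bs,100,4)

The second, used when aux_loss=True, additionally returns the predictions of the earlier DecoderLayers under aux_outputs, a list with one dict per layer:

aux_outputs:
   [
    {
    * pred_logits: (bs,100,num_class+1)  # class
    * pred_boxes: (bs,100,4)  # box
    },
    ...
   ]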

class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model

        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)  # class head; the +1 is the no-object (background) class
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)  # box head

        ... # omitted

    def forward(self, samples: NestedTensor):
        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)

        features, pos = self.backbone(samples)

        src, mask = features[-1].decompose()
        assert mask is not None
        # [0] takes the decoder output hs, (1,bs,100,512); (6,bs,100,512) when the decoder returns intermediate outputs
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        # class, (1,bs,100,512) -> (1,bs,100,num_class+1)
        outputs_class = self.class_embed(hs)
        # box, (1,bs,100,512) -> (1,bs,100,4)
        outputs_coord = self.bbox_embed(hs).sigmoid()

        # predictions of the last layer, class: (bs,100,num_class+1), box: (bs,100,4)
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out
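
forward calls self._set_aux_loss, which is not shown above. The following is a minimal sketch of what such a helper might do, assuming it pairs the class and box outputs of every layer except the last. set_aux_loss is a standalone illustrative function, not the method itself, and strings stand in for the per-layer tensors so the sketch runs without PyTorch.

```python
def set_aux_loss(outputs_class, outputs_coord):
    """Pair per-layer class/box outputs, skipping the last layer.

    Both arguments are indexed by decoder layer first, like the stacked
    tensors of shape (num_layers, bs, 100, ...) in DETR.forward.
    """
    return [
        {"pred_logits": cls, "pred_boxes": box}
        for cls, box in zip(outputs_class[:-1], outputs_coord[:-1])
    ]


# Stand-ins for the outputs of 6 decoder layers (strings instead of tensors).
outputs_class = [f"logits_layer{i}" for i in range(6)]
outputs_coord = [f"boxes_layer{i}" for i in range(6)]

out = {"pred_logits": outputs_class[-1], "pred_boxes": outputs_coord[-1]}
out["aux_outputs"] = set_aux_loss(outputs_class, outputs_coord)

print(len(out["aux_outputs"]))  # 5 intermediate layers
print(out["aux_outputs"][0])    # predictions of the first decoder layer
```

With 6 decoder layers this gives 5 auxiliary entries, which matches the "list: 5" noted for aux_outputs later in the post.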

3. SetCriterion loss computation

The loss function is SetCriterion.

3.1 Call site

The loss is computed in the training loop below:

def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
                    device: torch.device, epoch: int, max_norm: float = 0):
    model.train()
    criterion.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
    header = 'Epoch: [{}]'.format(epoch)
    print_freq = 10

    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
        samples = samples.to(device)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # forward pass
        outputs = model(samples)
        # compute the losses. loss_dict holds 'loss_ce', 'loss_bbox', 'loss_giou'; 'class_error' and 'cardinality_error' are for logging only
        loss_dict = criterion(outputs, targets)
        # weights, e.g. {'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2}
        weight_dict = criterion.weight_dict
        # total loss = box regression losses (loss_bbox, i.e. L1, and loss_giou) + classification loss (loss_ce), each scaled by its weight
        losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

        ... # omitted
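
The weighted sum can be tried out without a model. In this small sketch the loss values are made up, and the way weight_dict gets its 18 entries (one copy of the base weights per intermediate layer, using the same `_{i}` suffix that SetCriterion.forward adds to auxiliary losses) is an assumption for illustration.

```python
base_weights = {"loss_ce": 1, "loss_bbox": 5, "loss_giou": 2}
num_decoder_layers = 6

# Assumed construction: the base weights plus one suffixed copy per intermediate layer.
weight_dict = dict(base_weights)
for i in range(num_decoder_layers - 1):
    weight_dict.update({f"{k}_{i}": v for k, v in base_weights.items()})

# Made-up loss values; class_error is a logging-only entry.
loss_dict = {"loss_ce": 0.5, "loss_bbox": 0.1, "loss_giou": 0.25, "class_error": 40.0}

# Same expression as in train_one_epoch: keys absent from weight_dict are skipped.
total = sum(loss_dict[k] * weight_dict[k] for k in loss_dict if k in weight_dict)

print(len(weight_dict))  # 18
print(total)             # 0.5*1 + 0.1*5 + 0.25*2 = 1.5
```

Because class_error has no entry in weight_dict, it never contributes to the value that is backpropagated.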

3.2 Definition

It mainly does two things:

Matching: pair the predicted boxes with the gt boxes
Computing the losses

The full code is as follows:

class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
        super().__init__()
        self.num_classes = num_classes     # number of classes in the dataset
        self.matcher = matcher             # HungarianMatcher(), bipartite matching
        self.weight_dict = weight_dict     # dict of 18 entries: 6 decoder layers x (loss_ce + loss_giou + loss_bbox)
        self.eos_coef = eos_coef           # 0.1
        self.losses = losses               # list of 3: ['labels', 'boxes', 'cardinality']
        empty_weight = torch.ones(self.num_classes + 1)
        empty_weight[-1] = self.eos_coef   # tensor of 92 entries: the first 91 are 1, the last (no-object) is eos_coef=0.1
        self.register_buffer('empty_weight', empty_weight)

    def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
        ... # omitted
    def loss_boxes(self, outputs, targets, indices, num_boxes):
        ... # omitted
    def _get_src_permutation_idx(self, indices):
        ... # omitted
    def _get_tgt_permutation_idx(self, indices):
        ... # omitted
    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
        ... # omitted

    def forward(self, outputs, targets):
        """ This performs the loss computation.
        Parameters:
             outputs: dict of tensors, see the output specification of the model for the format
                      dict: 'pred_logits'=Tensor[bs, 100, 92 classes]  'pred_boxes'=Tensor[bs, 100, 4]  output of the last decoder layer
                             'aux_outputs'={list:5}  entries 0-4, each a dict of 2 (pred_logits + pred_boxes): outputs of the first 5 decoder layers
             targets: list of dicts, such that len(targets) == batch_size.   list: bs
                      each image carries: 'boxes', 'labels', 'image_id', 'area', 'iscrowd', 'orig_size', 'size'
                      The expected keys in each dict depends on the losses applied, see each loss' doc
        """
        # dict of 2: output of the last decoder layer, pred_logits[bs, 100, 92 classes] + pred_boxes[bs, 100, 4]
        outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}

        # --------------------------------------------------------------------------------------
        # 1. match
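        # Bipartite matching: out of the 100 predictions, find the ones paired one-to-one with the N gt boxes; the other 100-N are treated as background
        # Retrieve the matching between the outputs of the last layer and the targets  list:1
        # tuple of 2: 0=Tensor[5, 35, 63]  the 3 matched predictions; the other 97 predictions are background
        #             1=Tensor[1, 0, 2]    the corresponding 3 gt boxes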
        indices = self.matcher(outputs_without_aux, targets)

        # Compute the average number of target boxes accross all nodes, for normalization purposes
        num_boxes = sum(len(t["labels"]) for t in targets)   # int: total number of gt boxes over all images in the batch, e.g. 3
        num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
        if is_dist_avail_and_initialized():
            torch.distributed.all_reduce(num_boxes)
        num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()   # 3.0

        # --------------------------------------------------------------------------------------
        # 2.1 loss of the last decoder layer  Compute all the requested losses
        losses = {}
        for loss in self.losses:
            losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))

        # --------------------------------------------------------------------------------------
        # 2.2 losses of the first 5 decoder layers, stored in the same dict under suffixed keys; used for auxiliary supervision
        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
        if 'aux_outputs' in outputs:
            for i, aux_outputs in enumerate(outputs['aux_outputs']):
                indices = self.matcher(aux_outputs, targets)   # matched again for this layer
                for loss in self.losses:   # compute each loss
                    if loss == 'masks':
                        # Intermediate masks losses are too costly to compute, we ignore them.
                        continue
                    kwargs = {}
                    if loss == 'labels':
                        # Logging is enabled only for the last layer
                        kwargs = {'log': False}
                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
                    l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
                    losses.update(l_dict)
        # losses used for the gradient update: 'loss_ce' + 'loss_bbox' + 'loss_giou'    logging only: 'class_error' + 'cardinality_error'
        # --------------------------------------------------------------------------------------
        
        return losses
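
The num_boxes normalizer in forward is the average number of gt boxes per worker, clamped to at least 1. A small sketch of that arithmetic: normalize_num_boxes is a hypothetical helper, and a plain sum over a list stands in for torch.distributed.all_reduce.

```python
def normalize_num_boxes(local_counts):
    """local_counts[r] = number of gt boxes in the batch seen by worker r."""
    total = float(sum(local_counts))     # what a summing all_reduce leaves on every worker
    world_size = len(local_counts)
    return max(total / world_size, 1.0)  # clamp(min=1) avoids dividing by zero


print(normalize_num_boxes([3]))     # single worker with 3 gt boxes -> 3.0
print(normalize_num_boxes([3, 5]))  # two workers -> 4.0
print(normalize_num_boxes([0, 0]))  # no gt boxes anywhere -> 1.0
```

The clamp matters for batches that contain no objects at all: loss_boxes divides by num_boxes, so a value of 0 would produce a division by zero.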

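3.2.1 Matching

The class is named after the Hungarian algorithm, but the matching itself is done by SciPy's linear_sum_assignment, which implements a modified Jonker-Volgenant algorithm. For details, see: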
https://blog.csdn.net/weixin_39190382/article/details/138188580?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22138188580%22%2C%22source%22%3A%22weixin_39190382%22%7D

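The classification cost, the L1 box cost and the GIoU cost are combined into one cost matrix, which is then passed to linear_sum_assignment.
That is, the 100 predictions are matched against the ground-truth boxes (say an image has 3) so that the total cost of the matched pairs is minimal.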
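
To make "minimal total cost" concrete, here is a brute-force sketch of the same assignment problem on a tiny cost matrix. It is not how linear_sum_assignment works internally (it simply tries every pairing), and both brute_force_assignment and the cost values are made up for illustration.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Return (rows, cols) minimizing the sum of cost[r][c], one distinct row per column.

    cost has one row per prediction and one column per gt box,
    with at least as many rows as columns.
    """
    num_rows, num_cols = len(cost), len(cost[0])
    best_total, best_rows = float("inf"), None
    # Try every ordered choice of num_cols distinct rows: rows[j] is matched to column j.
    for rows in permutations(range(num_rows), num_cols):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best_total:
            best_total, best_rows = total, rows
    # Report the pairs sorted by row index.
    pairs = sorted(zip(best_rows, range(num_cols)))
    return [r for r, _ in pairs], [c for _, c in pairs]


# 4 predictions (rows) against 3 gt boxes (columns).
cost = [
    [0.9, 0.2, 0.8],
    [0.1, 0.7, 0.9],
    [0.8, 0.9, 0.3],
    [0.6, 0.6, 0.6],
]
rows, cols = brute_force_assignment(cost)
print(rows, cols)  # [0, 1, 2] [1, 0, 2]
```

Prediction 3 stays unmatched, which is what happens to the 97 remaining queries in the 100-vs-3 case: they become background. Brute force is only feasible for toy sizes, which is why a dedicated solver is used in practice.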

class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class  # class weight, 1
        self.cost_bbox = cost_bbox  # L1 box weight, 5
        self.cost_giou = cost_giou  # GIoU weight, 2
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    # no gradients are needed: this is only a matching step
    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes]=[bs,100,92] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4]=[bs,100,4] with the predicted box coordinates

            targets: list:bs This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes]=[3] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        # batch_size, num_queries (100)
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        # [batch_size * num_queries, num_classes]
        # [2,100,92] -> [200, 92], then softmax -> [200, 92] probabilities
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)
        # [2,100,4] -> [200, 4]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        # [3]  e.g. ids = 32, 1, 85  concat all labels
        tgt_ids = torch.cat([v["labels"] for v in targets])
        # [3, 4]  concat all boxes
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # compute the costs: classification + L1 box + GIoU box
        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost betwen boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # ----------------------------------------------------------------------------
        # 1. combine the three costs into one cost matrix
        # Final cost matrix, [bs*100, num_all_gt]: cost of each of the bs*100 predictions against each gt box in the batch
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()  # [bs, 100, num_all_gt]

        sizes = [len(v["boxes"]) for v in targets]   # number of gt boxes per image, e.g. [3]

        # ----------------------------------------------------------------------------
        # 2. match predictions to ground truth using the cost matrix and get the matched indices
        # For each image, pick the predictions (3 out of 100 here) whose one-to-one pairing with the gt boxes has the minimal total cost
        # 0: (3,)  5, 35, 63   indices of the matched predictions
        # 1: (3,)  1, 0, 2     indices of the corresponding gt boxes
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]

        # list of length bs: the matching result of each image
        # each image gives a tuple of 2
        # 0 = Tensor[gt_num,]  indices of the matched (positive) predictions       1 = Tensor[gt_num,]  gt indices
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]


def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)
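
The expression `c[i] for i, c in enumerate(C.split(sizes, -1))` is compact, so here is a sketch of what it selects. Nested Python lists stand in for the [bs, num_queries, num_all_gt] tensor, split_last_dim is a hypothetical stand-in for Tensor.split along the last dimension, and each entry is written as (image, query, gt column) so the slicing is visible.

```python
def split_last_dim(C, sizes):
    """Split a nested list C[b][q][t] along t into chunks of the given sizes."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append([[row[start:start + size] for row in image] for image in C])
        start += size
    return chunks


bs, num_queries = 2, 4
sizes = [2, 1]  # image 0 has 2 gt boxes, image 1 has 1

# Label each cost entry with (image, query, gt column) instead of a number.
C = [[[(b, q, t) for t in range(sum(sizes))] for q in range(num_queries)] for b in range(bs)]

# Chunk i holds the columns of image i's gt boxes; c[i] keeps only image i's predictions.
per_image = [c[i] for i, c in enumerate(split_last_dim(C, sizes))]

print(per_image[0][0])  # [(0, 0, 0), (0, 0, 1)]
print(per_image[1][0])  # [(1, 0, 2)]
```

Each per_image[i] is a [num_queries, sizes[i]] block pairing image i's predictions with image i's own gt boxes only. The cross-image entries of the batched cost matrix are computed but never used.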

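3.2.2 Loss computation

There are two cases: computing the loss on the final output only, or additionally computing it on the intermediate layers' outputs as auxiliary supervision. Both work the same way and go through get_loss.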

# --------------------------------------------------------------------------------------
# 2.1 loss of the last decoder layer  Compute all the requested losses
losses = {}
for loss in self.losses:
    losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))

# --------------------------------------------------------------------------------------
# 2.2 losses of the first 5 decoder layers, stored in the same dict under suffixed keys; used for auxiliary supervision
# In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
if 'aux_outputs' in outputs:
    for i, aux_outputs in enumerate(outputs['aux_outputs']):
        indices = self.matcher(aux_outputs, targets)   # matched again for this layer
        for loss in self.losses:   # compute each loss
            if loss == 'masks':
                # Intermediate masks losses are too costly to compute, we ignore them.
                continue
            kwargs = {}
            if loss == 'labels':
                # Logging is enabled only for the last layer
                kwargs = {'log': False}
            l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
            l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
            losses.update(l_dict)
# losses used for the gradient update: 'loss_ce' + 'loss_bbox' + 'loss_giou'    logging only: 'class_error' + 'cardinality_error'
# --------------------------------------------------------------------------------------

get_loss dispatches to four kinds of loss. We mainly look at:

labels: classification loss
boxes: box losses (L1 and GIoU)

def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
    loss_map = {
        'labels': self.loss_labels,
        'cardinality': self.loss_cardinality,
        'boxes': self.loss_boxes,
        'masks': self.loss_masks
    }
    assert loss in loss_map, f'do you really want to compute {loss} loss?'
    return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)

3.2.2.1 labels: classification loss

def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
    """Classification loss (NLL)
    targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
    outputs: 'pred_logits'=[bs, 100, 92] 'pred_boxes'=[bs, 100, 4] 'aux_outputs'=5*([bs, 100, 92]+[bs, 100, 4])
    targets: 'boxes'=[3,4] labels=[3] ...
    indices: [3], e.g. 5,35,63: indices of the 3 matched predictions
    num_boxes: number of gt boxes in the current batch
    """
    assert 'pred_logits' in outputs
    # -----------------------------------------------------------------------------------
    src_logits = outputs['pred_logits']  # classification logits: [bs, 100, 92 classes]

    # idx is a tuple of 2: 0=[num_all_gt] the image each matched prediction belongs to  1=[num_all_gt] the index of each matched prediction
    idx = self._get_src_permutation_idx(indices)

    target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)])
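    # (bs,100), filled with 91 (the no-object / background class)
    target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                dtype=torch.int64, device=src_logits.device)

    # write the gt class at the positions of the matched predictions
    # positives + negatives: matched predictions are positives and get their gt class; the unmatched ones among the 100 are negatives (class 91, background)
    target_classes[idx] = target_classes_o

    # -----------------------------------------------------------------------------------
    # classification loss over positives and negatives: src_logits is transposed to (bs,92,100) because F.cross_entropy expects the class dimension second; target_classes is (bs,100)
    # self.empty_weight (92,): weight 1 for object classes, 0.1 for background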
    loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
    losses = {'loss_ce': loss_ce}

    # -----------------------------------------------------------------------------------
    # logging: top-1 classification error on the matched predictions
    if log:
        # TODO this should probably be a separate loss, not hacked in this one here
        losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]

    # losses: 'loss_ce': classification loss
    #         'class_error': top-1 error, i.e. how often the highest-probability class differs from the assigned gt class; for logging only, not used in training
    return losses

def _get_src_permutation_idx(self, indices):
    # permute predictions following indices
    # [num_all_gt]  the image index of each matched prediction
    batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
    # the index of each matched prediction within its image
    src_idx = torch.cat([src for (src, _) in indices])
    return batch_idx, src_idx
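
A pure-Python sketch of what _get_src_permutation_idx returns and how the pair of index lists is used to fill target_classes. Lists stand in for tensors, the second image and all label values are made up, and the explicit loop mimics the tensor assignment target_classes[idx] = target_classes_o.

```python
def get_src_permutation_idx(indices):
    """indices: one (src, tgt) pair of index lists per image."""
    batch_idx = [i for i, (src, _) in enumerate(indices) for _ in src]
    src_idx = [s for (src, _) in indices for s in src]
    return batch_idx, src_idx


indices = [([5, 35, 63], [1, 0, 2]),  # image 0: prediction 5 <-> gt 1, 35 <-> gt 0, 63 <-> gt 2
           ([7, 20], [0, 1])]         # image 1
batch_idx, src_idx = get_src_permutation_idx(indices)
print(batch_idx)  # [0, 0, 0, 1, 1]
print(src_idx)    # [5, 35, 63, 7, 20]

# gt labels per image (made-up values), reordered to follow the matching.
labels = [[32, 1, 85], [17, 3]]
target_classes_o = [labels[b][j] for b, (_, tgt) in enumerate(indices) for j in tgt]

# (bs, 100) target filled with the no-object class 91, then overwritten at matched positions.
target_classes = [[91] * 100 for _ in indices]
for b, q, cls in zip(batch_idx, src_idx, target_classes_o):
    target_classes[b][q] = cls

print(target_classes[0][5], target_classes[0][35], target_classes[0][63])  # 1 32 85
```

Everything that is not overwritten keeps the value 91, so in this example 195 of the 200 query slots are trained towards the no-object class.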

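Notes:

The target is (bs,100): positions without an object hold 91, and positions with an object hold the class index (0~90)
Object classes have weight 1; the no-object class has weight 0.1

The loss is then computed with cross-entropy:

# classification loss over positives and negatives: src_logits is transposed to (bs,92,100) because F.cross_entropy expects the class dimension second; target_classes is (bs,100)
# self.empty_weight (92,): weight 1 for object classes, 0.1 for background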
loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
losses = {'loss_ce': loss_ce}
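
To show the effect of the 0.1 background weight, here is a pure-Python sketch of a class-weighted cross-entropy with mean reduction, the quantity F.cross_entropy computes when given a weight tensor. The function name, the 3-class setup and the logits are made up for illustration.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """logits: one list of class scores per query; targets: one class index per query."""
    weighted_nll, weight_sum = 0.0, 0.0
    for scores, target in zip(logits, targets):
        # negative log-softmax of the target class, computed stably
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[target]
        w = class_weights[target]
        weighted_nll += w * nll
        weight_sum += w
    # the weighted mean divides by the summed weights, not by the number of queries
    return weighted_nll / weight_sum


# 2 object classes plus the no-object class (index 2), weighted like empty_weight.
class_weights = [1.0, 1.0, 0.1]
logits = [[2.0, 0.0, 0.0],  # matched query, gt class 0
          [0.0, 0.0, 0.0],  # three unmatched queries, target = no-object
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
targets = [0, 2, 2, 2]

loss = weighted_cross_entropy(logits, targets, class_weights)
unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
print(loss, unweighted)
```

In the unweighted case the three background queries dominate the mean. With the 0.1 weight the single matched query carries most of the loss, which is the point of eos_coef when 97 of 100 queries are background.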

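3.2.2.2 boxes: box loss

The predicted and gt box coordinates are gathered using the matched indices, and the L1 loss and the GIoU loss are computed on them. The two are returned as separate entries (loss_bbox and loss_giou) and only combined later, through weight_dict, into the total loss.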

def loss_boxes(self, outputs, targets, indices, num_boxes):
    """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
        targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
        The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
    outputs: 'pred_logits'=[bs, 100, 92] 'pred_boxes'=[bs, 100, 4] 'aux_outputs'=5*([bs, 100, 92]+[bs, 100, 4])
    targets: 'boxes'=[3,4] labels=[3] ...
    indices: [3], e.g. 5,35,63: indices of the 3 matched predictions
    num_boxes: number of gt boxes in the current batch
    """
    assert 'pred_boxes' in outputs
    # idx is a tuple of 2: 0=[num_all_gt] the image each matched prediction belongs to  1=[num_all_gt] the index of each matched prediction
    idx = self._get_src_permutation_idx(indices)

    # [all_gt_num, 4]  predicted boxes of all positives in this batch
    src_boxes = outputs['pred_boxes'][idx]
    # [all_gt_num, 4]  all gt boxes in this batch, in matched order
    target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)

    # L1 loss
    loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')

    losses = {}
    losses['loss_bbox'] = loss_bbox.sum() / num_boxes

    # GIoU loss
    loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
        box_ops.box_cxcywh_to_xyxy(src_boxes),
        box_ops.box_cxcywh_to_xyxy(target_boxes)))
    losses['loss_giou'] = loss_giou.sum() / num_boxes

    # 'loss_bbox': L1 regression loss   'loss_giou': GIoU regression loss
    return losses

def _get_src_permutation_idx(self, indices):
    # permute predictions following indices
    batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
    src_idx = torch.cat([src for (src, _) in indices])
    return batch_idx, src_idx
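
The two box losses can be checked by hand on a single pair of boxes. This is a pure-Python sketch with made-up boxes: the helper names are illustrative and are not the box_ops functions, and giou assumes both boxes have positive area.

```python
def cxcywh_to_xyxy(box):
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = area_a + area_b - inter
    # area of the smallest box enclosing both
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def box_losses(src_boxes, target_boxes, num_boxes):
    """Matched (cx, cy, w, h) box pairs -> summed L1 and GIoU losses divided by num_boxes."""
    l1 = sum(abs(s - t) for sb, tb in zip(src_boxes, target_boxes) for s, t in zip(sb, tb))
    g = sum(1 - giou(cxcywh_to_xyxy(sb), cxcywh_to_xyxy(tb))
            for sb, tb in zip(src_boxes, target_boxes))
    return {"loss_bbox": l1 / num_boxes, "loss_giou": g / num_boxes}


# Two boxes that only touch at a corner: IoU is 0, but GIoU still reflects the distance.
src = [(0.25, 0.25, 0.5, 0.5)]  # covers (0, 0) to (0.5, 0.5)
tgt = [(0.75, 0.75, 0.5, 0.5)]  # covers (0.5, 0.5) to (1, 1)
print(box_losses(src, tgt, num_boxes=1))  # {'loss_bbox': 1.0, 'loss_giou': 1.5}
print(box_losses(src, src, num_boxes=1))  # {'loss_bbox': 0.0, 'loss_giou': 0.0}
```

For non-overlapping boxes plain IoU would be 0 regardless of how far apart they are. The enclosing-box term makes the GIoU loss grow with the gap, so it still gives a useful signal.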

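4. Box post-processing

At test time the predicted boxes are post-processed.

This simply collects the predictions: it drops the background class and, for each image, produces the scores (probability of the most likely class), labels (that class) and boxes (absolute coordinates) of the 100 predictions.

The result is then passed to coco_evaluator to compute the COCO metrics.

At prediction time an image usually contains far fewer than 100 objects. How is that handled? Typically a score threshold (0.7) is set: predictions whose score exceeds the threshold are kept and displayed, and those below it are discarded.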

class PostProcess(nn.Module):
    """ This module converts the model's output into the format expected by the coco api"""
    @torch.no_grad()
    def forward(self, outputs, target_sizes):
        """ Perform the computation
        Parameters:
            outputs: raw outputs of the model
                     0 pred_logits  classification head output [bs, 100, 92 (classes)]
                     1 pred_boxes  regression head output [bs, 100, 4]
                     2 aux_outputs list: 5  outputs of the first 5 decoder layers: 5 pred_logits [bs, 100, 92 (classes)] and 5 pred_boxes [bs, 100, 4]
            target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
                          For evaluation, this must be the original image size (before any data augmentation)
                          For visualization, this should be the image size after data augment, but before padding
        """
        # out_logits: [bs, 100, 92 (classes)]
        # out_bbox: [bs, 100, 4]
        out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']

        assert len(out_logits) == len(target_sizes)
        assert target_sizes.shape[1] == 2

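        # [bs, 100, 92]  softmax over the class scores of each prediction
        prob = F.softmax(out_logits, -1)
        # prob[..., :-1]: [bs, 100, 92] -> [bs, 100, 91]  drop the background class
        # .max(-1): scores=[bs, 100]  probability of the most likely class for each of the 100 predictions
        #           labels=[bs, 100]  class index of each of the 100 predictions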
        scores, labels = prob[..., :-1].max(-1)

        # cxcywh to xyxy  format   [bs, 100, 4]
        boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
        # and from relative [0, 1] to absolute [0, height] coordinates; img_h, img_w: heights and widths of the bs images
        img_h, img_w = target_sizes.unbind(1)
        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
        boxes = boxes * scale_fct[:, None, :]  # normalized coordinates -> absolute coordinates (relative to the original image)  [bs, 100, 4]

        results = [{'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]

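        # list of length bs; each entry is a dict with the fields 'scores', 'labels', 'boxes'
        # scores = Tensor[100,]  probability scores of this image's 100 predictions
        # labels = Tensor[100,]  class indices of this image's 100 predictions
        # boxes = Tensor[100, 4] absolute coordinates of this image's 100 predictions (relative to the original image size)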
        return results
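
PostProcess itself keeps all 100 predictions per image. The score thresholding described above happens afterwards. A minimal sketch of that filtering step: keep_confident is a hypothetical helper, lists stand in for tensors, and the detections are made up.

```python
def keep_confident(result, threshold=0.7):
    """Keep detections whose score exceeds the threshold.

    result: dict with parallel lists 'scores', 'labels', 'boxes',
    shaped like one entry of the list returned by PostProcess.
    """
    keep = [i for i, s in enumerate(result["scores"]) if s > threshold]
    return {k: [v[i] for i in keep] for k, v in result.items()}


# One image with 3 (instead of 100) predictions, for brevity.
result = {
    "scores": [0.95, 0.30, 0.72],
    "labels": [17, 3, 1],
    "boxes": [[10, 20, 110, 220], [0, 0, 50, 50], [300, 40, 380, 200]],
}
filtered = keep_confident(result)
print(filtered["labels"])  # [17, 1]
print(filtered["boxes"])   # [[10, 20, 110, 220], [300, 40, 380, 200]]
```

The threshold trades recall for precision and only affects what is displayed. The COCO evaluation mentioned above receives the full post-processed result.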