[Transformer] DETR Loss, Line by Line (Part 4)

every blog every motto: You can do more than you think.
https://blog.csdn.net/weixin_39190382?type=blog

0. Preface

A line-by-line walkthrough of the DETR loss.

1. Overview


The loss is computed by SetCriterion.

2. Recap

2.1 Decoder Return Values

The Decoder has two return modes:

  1. Return a list holding every DecoderLayer's output, so the loss can be computed on each layer's output
  2. Return only the last DecoderLayer's output, and compute the loss on it alone

class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        # clone the given decoder layer num_layers times
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []  # stores each DecoderLayer's output; enables deep supervision

        for layer in self.layers:
            # each layer's output becomes the next layer's input; output: (100, bs, 512)
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))  # save this DecoderLayer's output

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        # return mode 1: a stack of every DecoderLayer's output, (6, 100, bs, 512)
        if self.return_intermediate:
            return torch.stack(intermediate)

        # return mode 2: only the last DecoderLayer's output, (100, bs, 512) -> (1, 100, bs, 512)
        return output.unsqueeze(0)
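
A quick shape check of the two modes (a minimal sketch; the import path assumes the DETR repo layout and is otherwise an assumption):

import torch
import torch.nn as nn
from models.transformer import TransformerDecoder, TransformerDecoderLayer  # assumed repo layout

layer = TransformerDecoderLayer(d_model=512, nhead=8)
decoder = TransformerDecoder(layer, num_layers=6, norm=nn.LayerNorm(512),
                             return_intermediate=True)

tgt = torch.zeros(100, 2, 512)     # (num_queries, bs, d_model)
memory = torch.rand(600, 2, 512)   # (hw, bs, d_model)
out = decoder(tgt, memory)
print(out.shape)  # torch.Size([6, 100, 2, 512]); (1, 100, 2, 512) if return_intermediate=False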

2.2 Transformer Return Values

Only the first return value of the Transformer needs attention here: depending on the Decoder's return mode, it holds either just the last DecoderLayer's result or the results of all DecoderLayers, so hs has shape:

  • (1,100,bs,512)
  • (6,100,bs,512)

After the final transpose in Transformer.forward, the shape becomes:

  • (1,bs,100,512)
  • (6,bs,100,512)

class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        # encoder stack
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed):
        # flatten bxCxHxW to HWxbxC
        bs, c, h, w = src.shape
        # (b,c,h,w) ->(b,c,hw) -> (hw,b,c) 
        src = src.flatten(2).permute(2, 0, 1)
        # (b,c,h,w) ->(b,c,hw) -> (hw,b,c) 
        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
        mask = mask.flatten(1)

        # (hw,b,c)
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)

        # (num_query,hidden_dim) -> (num_query,1,hidden_dim) -> (num_query,bs,hidden_dim)
        # (100,512) -> (100,1,512) -> (100,bs,512)
        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
        # (100,bs,512)
        tgt = torch.zeros_like(query_embed)

        # hs: (1, 100, bs, 512), or (6, 100, bs, 512) with return_intermediate
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)
        
        # first return value: (1, bs, 100, 512)
        # second return value: (hw, b, c) -> (b, c, hw) -> (b, c, h, w)
        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

2.3 DETR Return Values

DETR likewise has two kinds of output. The first contains only the last DecoderLayer's results, with shapes:

  • class: pred_logits,(bs,100,num_class+1)
  • box: pred_boxes,(bs,100,4)

The other additionally returns the intermediate DecoderLayers' results under aux_outputs, with shapes:

aux_outputs:
   [
    {
     pred_logits: (bs,100,num_class+1)  # class
     pred_boxes: (bs,100,4)  # box
    },
    ...
   ]

class DETR(nn.Module):
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model

        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)  # classification head; +1 for the background class
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)  # box regression head

        ... # omitted

    def forward(self,samples:NestedTensor):
        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)

        features, pos = self.backbone(samples)

        src, mask = features[-1].decompose()
        assert mask is not None
        # take hs, the transformer's first return value: (1, bs, 100, 512)
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        # class: (1, bs, 100, 512) -> (1, bs, 100, num_class+1)
        outputs_class = self.class_embed(hs)
        # box: (1, bs, 100, 512) -> (1, bs, 100, 4)
        outputs_coord = self.bbox_embed(hs).sigmoid()

        # last decoder layer's output; class: (bs,100,num_class+1), box: (bs,100,4)
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out
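
For reference, the helper that packs the intermediate layers' outputs looks essentially like this in the DETR repository (the last layer is dropped, since it is already reported as pred_logits/pred_boxes):

@torch.jit.unused
def _set_aux_loss(self, outputs_class, outputs_coord):
    # one dict per intermediate decoder layer; the last layer is excluded
    return [{'pred_logits': a, 'pred_boxes': b}
            for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]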


3. Loss Computation: SetCriterion

The loss is implemented by the SetCriterion class.


3.1 Call Site

The loss is computed in train_one_epoch:

def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
                    device: torch.device, epoch: int, max_norm: float = 0):
    model.train()
    criterion.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
    header = 'Epoch: [{}]'.format(epoch)
    print_freq = 10

    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
        samples = samples.to(device)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # forward pass
        outputs = model(samples)
        # compute losses; loss_dict: 'loss_ce' + 'loss_bbox' + 'loss_giou'; logged only: 'class_error' + 'cardinality_error'
        loss_dict = criterion(outputs, targets)
        # loss weights {'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2}
        weight_dict = criterion.weight_dict   
        # total loss = regression losses loss_bbox (L1) + loss_giou  +  classification loss loss_ce
        losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

        ... # omitted
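
The weight dict itself is assembled at build time. A sketch of how it gains the per-layer suffixed entries when aux loss is on (mirroring the repo's build function; the base values are the ones quoted above):

weight_dict = {'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2}
dec_layers, aux_loss = 6, True
if aux_loss:
    aux_weight_dict = {}
    for i in range(dec_layers - 1):  # one suffixed copy per intermediate layer
        aux_weight_dict.update({k + f'_{i}': v for k, v in weight_dict.items()})
    weight_dict.update(aux_weight_dict)  # 3 + 3*5 = 18 entries: 'loss_ce_0' ... 'loss_giou_4'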

3.2 Definition

It does two main things:

  1. Matching: pair the predicted boxes with the gt boxes
  2. Compute the losses

The full code:

class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
        super().__init__()
        self.num_classes = num_classes     # number of object classes in the dataset
        self.matcher = matcher             # HungarianMatcher(): bipartite matching (Hungarian-style assignment)
        self.weight_dict = weight_dict     # dict of 18 entries (3x6): loss weights for the 6 decoder layers, 6*(loss_ce+loss_giou+loss_bbox)
        self.eos_coef = eos_coef           # 0.1
        self.losses = losses               # list: 3  ['labels', 'boxes', 'cardinality']
        empty_weight = torch.ones(self.num_classes + 1)
        empty_weight[-1] = self.eos_coef   # tensor of 92: first 91 entries = 1, last (background) = eos_coef = 0.1
        self.register_buffer('empty_weight', empty_weight)

    def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
        ... # omitted
    def loss_boxes(self, outputs, targets, indices, num_boxes):
        ... # omitted
    def _get_src_permutation_idx(self, indices):
        ... # omitted
    def _get_tgt_permutation_idx(self, indices):
        ... # omitted
    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
        ... # omitted

    def forward(self, outputs, targets):
        """ This performs the loss computation.
        Parameters:
             outputs: dict of tensors, see the output specification of the model for the format
                      dict: 'pred_logits'=Tensor[bs, 100, 92 classes], 'pred_boxes'=Tensor[bs, 100, 4] -- output of the last decoder layer
                            'aux_outputs'=list of 5 dicts, each with pred_logits + pred_boxes, the outputs of the 5 earlier decoder layers
             targets: list of dicts, such that len(targets) == batch_size.
                      each image's dict contains: 'boxes', 'labels', 'image_id', 'area', 'iscrowd', 'orig_size', 'size'
                      The expected keys in each dict depends on the losses applied, see each loss' doc
        """
        # dict of 2: last decoder layer's output, pred_logits [bs, 100, 92 classes] + pred_boxes [bs, 100, 4]
        outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}

        # --------------------------------------------------------------------------------------
        # 1. match
        # Hungarian-style bipartite matching: out of the 100 predictions, find the ones matched
        # one-to-one to the N gt boxes; the remaining 100-N are treated as background
        # Retrieve the matching between the outputs of the last layer and the targets. list: bs
        # each element is a tuple of 2:  0 = Tensor[5, 35, 63], the 3 matched predictions (the other 97 are background)
        #                                1 = Tensor[1, 0, 2], the corresponding gt boxes
        indices = self.matcher(outputs_without_aux, targets)

        # Compute the average number of target boxes across all nodes, for normalization purposes
        num_boxes = sum(len(t["labels"]) for t in targets)   # int: total number of gt boxes in the whole batch, e.g. 3
        num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
        if is_dist_avail_and_initialized():
            torch.distributed.all_reduce(num_boxes)
        num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()   # 3.0

        # --------------------------------------------------------------------------------------
        # 2.1 loss of the last decoder layer: compute all the requested losses
        losses = {}
        for loss in self.losses:
            losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))

        # --------------------------------------------------------------------------------------
        # 2.2 losses of the 5 earlier decoder layers, accumulated into the same dict (auxiliary supervision)
        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
        if 'aux_outputs' in outputs:
            for i, aux_outputs in enumerate(outputs['aux_outputs']):
                indices = self.matcher(aux_outputs, targets)   # same Hungarian-style matching
                for loss in self.losses:   # compute each loss
                    if loss == 'masks':
                        # Intermediate masks losses are too costly to compute, we ignore them.
                        continue
                    kwargs = {}
                    if loss == 'labels':
                        # Logging is enabled only for the last layer
                        kwargs = {'log': False}
                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
                    l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
                    losses.update(l_dict)
        # losses that drive weight updates: 'loss_ce' + 'loss_bbox' + 'loss_giou'; logged only: 'class_error' + 'cardinality_error'
        # --------------------------------------------------------------------------------------
        
        return losses

3.2.1 Matching

The matcher class is named after the Hungarian algorithm, but the actual assignment is computed by scipy's linear_sum_assignment, which implements a (modified) Jonker-Volgenant algorithm; see:
https://blog.csdn.net/weixin_39190382/article/details/138188580?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22138188580%22%2C%22source%22%3A%22weixin_39190382%22%7D

The class cost, box cost, and IoU cost are combined into one cost matrix, and linear_sum_assignment finds the assignment. In other words, the 100 predictions are matched against an image's real boxes (say, 3 of them) so that the total cost (loss) of the matching is minimal. A toy illustration follows.
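
A minimal sketch with made-up costs: 5 predictions (rows) against 3 gt boxes (columns); linear_sum_assignment returns the row/column indices of the cheapest one-to-one assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.9, 0.8, 0.7],
                 [0.1, 0.9, 0.8],   # prediction 1 is cheap for gt 0
                 [0.8, 0.2, 0.9],   # prediction 2 is cheap for gt 1
                 [0.9, 0.8, 0.1],   # prediction 3 is cheap for gt 2
                 [0.7, 0.9, 0.8]])
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)              # [1 2 3] [0 1 2]
print(cost[row_ind, col_ind].sum())  # ~0.4, the minimal total cost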

class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class  # class weight, 1
        self.cost_bbox = cost_bbox    # box weight, 5
        self.cost_giou = cost_giou    # giou weight, 2
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs can't be 0"
    
    # no gradients needed: matching is only an assignment step, not part of the computation graph
    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes]=[bs,100,92] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4]=[bs,100,4] with the predicted box coordinates

            targets: list:bs This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes]=[3] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        # batch_size  100
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        # [batch_size * num_queries, num_classes]
        # [2,100,92] -> [200,92] logits -> [200,92] probabilities
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  
        # [2,100,4] -> [200, 4]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        # [3], e.g. class ids 32, 1, 85; concat all target labels
        tgt_ids = torch.cat([v["labels"] for v in targets])
        # [3, 4]: concat all target boxes
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # compute the costs: class + L1 box + GIoU box
        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be omitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost between boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # ----------------------------------------------------------------------------
        # 1. combine the three costs into a single cost matrix
        # Final cost matrix: [bs*100, 3], cost of every prediction against each of the 3 gt boxes
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()  # [bs, 100, 3]

        sizes = [len(v["boxes"]) for v in targets]   # number of gt boxes per image, e.g. [3]

        # ----------------------------------------------------------------------------
        # 2. match predictions to ground truth via the cost matrix and get the matched indices
        # bipartite matching: pick, out of the 100 predictions, the ones whose total cost against the gts is minimal
        # 0: (3,)  5, 35, 63   indices of the matched predictions
        # 1: (3,)  1, 0, 2     indices of the corresponding gts
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        
        # list: bs, the matching result for each of the bs images
        # each image yields a tuple of 2:
        # 0 = Tensor[gt_num,]  indices of the matched (positive) predictions       1 = Tensor[gt_num,]  gt indices
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]


def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)

3.2.2 Loss Computation

There are two cases: computing the loss only on the final output, or also on the intermediate layers for auxiliary supervision. The mechanics are identical; both go through the get_loss function.

# --------------------------------------------------------------------------------------
# 2.1 loss of the last decoder layer: compute all the requested losses
losses = {}
for loss in self.losses:
    losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))

# --------------------------------------------------------------------------------------
# 2.2 losses of the 5 earlier decoder layers, accumulated into the same dict (auxiliary supervision)
# In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
if 'aux_outputs' in outputs:
    for i, aux_outputs in enumerate(outputs['aux_outputs']):
        indices = self.matcher(aux_outputs, targets)   # same Hungarian-style matching
        for loss in self.losses:   # compute each loss
            if loss == 'masks':
                # Intermediate masks losses are too costly to compute, we ignore them.
                continue
            kwargs = {}
            if loss == 'labels':
                # Logging is enabled only for the last layer
                kwargs = {'log': False}
            l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
            l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
            losses.update(l_dict)
# losses that drive weight updates: 'loss_ce' + 'loss_bbox' + 'loss_giou'; logged only: 'class_error' + 'cardinality_error'
# --------------------------------------------------------------------------------------

get_loss dispatches four kinds of losses; we focus on two of them:

  • labels: the classification loss
  • boxes: the L1 + GIoU box loss

def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
    loss_map = {
        'labels': self.loss_labels,
        'cardinality': self.loss_cardinality,
        'boxes': self.loss_boxes,
        'masks': self.loss_masks
    }
    assert loss in loss_map, f'do you really want to compute {loss} loss?'
    return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)

3.2.2.1 labels: Classification Loss

def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
    """Classification loss (NLL)
    targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
    outputs: 'pred_logits'=[bs, 100, 92], 'pred_boxes'=[bs, 100, 4], 'aux_outputs'=5*([bs, 100, 92]+[bs, 100, 4])
    targets: 'boxes'=[3,4], 'labels'=[3], ...
    indices: e.g. ([5, 35, 63], [1, 0, 2]) -- the 3 matched prediction indices and their gt indices
    num_boxes: total number of gt boxes in the current batch
    """
    assert 'pred_logits' in outputs
    # -----------------------------------------------------------------------------------
    src_logits = outputs['pred_logits']  # classification logits: [bs, 100, 92 classes]

    # idx tuple of 2:  0=[num_all_gt] which image each gt belongs to,  1=[num_all_gt] index of each matched prediction
    idx = self._get_src_permutation_idx(indices)

    target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)])
    # (bs,100), filled with 91 everywhere, i.e. background
    target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                dtype=torch.int64, device=src_logits.device)
    
    # fill the matched positions with their real object classes
    # positives + negatives: the matched predictions above become positives with real class ids;
    # the unmatched ones among the 100 remain negatives (class 91, background)
    target_classes[idx] = target_classes_o

    # -----------------------------------------------------------------------------------
    # classification loss over positives + negatives; src_logits (bs,92,100) against targets (bs,100)
    # self.empty_weight (92,): 1 for object classes, 0.1 for background
    loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
    losses = {'loss_ce': loss_ce}

    # -----------------------------------------------------------------------------------
    # logging: top-1 classification error
    if log:
        # TODO this should probably be a separate loss, not hacked in this one here
        losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]

    # losses: 'loss_ce': classification loss
    #         'class_error': top-1 error, i.e. whether the highest-probability class matches the assigned gt class;
    #                        shown in logs only, it does not contribute to training
    return losses

def _get_src_permutation_idx(self, indices):
    # permute predictions following indices
    # [num_all_gt]: image idx each matched gt comes from
    batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
    # indices of the matched predictions
    src_idx = torch.cat([src for (src, _) in indices])
    return batch_idx, src_idx
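
A worked toy example (hypothetical indices, following the running example of 3 gt boxes in one image):

import torch
# one image; matched prediction indices (5, 35, 63) paired with gt indices (1, 0, 2)
indices = [(torch.tensor([5, 35, 63]), torch.tensor([1, 0, 2]))]
batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
src_idx = torch.cat([src for (src, _) in indices])
print(batch_idx)  # tensor([0, 0, 0]) -- all three matches come from image 0
print(src_idx)    # tensor([ 5, 35, 63]) -- the matched query slots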

Notes:

The target tensor is (bs,100): positions without an object hold 91 (background), positions with an object hold the class index (0-90).
Object classes get weight 1; the background class gets weight 0.1.

The loss is then computed with weighted cross-entropy:

# classification loss over positives + negatives; src_logits (bs,92,100) against targets (bs,100)
# self.empty_weight (92,): 1 for object classes, 0.1 for background
loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
losses = {'loss_ce': loss_ce}
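
A minimal check of this weighted cross-entropy (a sketch assuming COCO's 91 object classes + background):

import torch
import torch.nn.functional as F

num_classes = 91
empty_weight = torch.ones(num_classes + 1)
empty_weight[-1] = 0.1                              # background (index 91) weighted 0.1

src_logits = torch.randn(2, 100, num_classes + 1)   # (bs, num_queries, 92)
target_classes = torch.full((2, 100), num_classes)  # start with everything = background
target_classes[0, torch.tensor([5, 35, 63])] = torch.tensor([1, 0, 2])  # matched queries get real labels

# cross_entropy expects (bs, C, num_queries), hence the transpose
loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, empty_weight)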

3.2.2.2 boxes: Box Loss

The matched indices select the coordinates of the predicted boxes and their gt boxes; an L1 loss and a GIoU loss are computed on them, and the two (with their weights) make up the final box loss.

def loss_boxes(self, outputs, targets, indices, num_boxes):
    """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
        targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
        The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
    outputs: 'pred_logits'=[bs, 100, 92], 'pred_boxes'=[bs, 100, 4], 'aux_outputs'=5*([bs, 100, 92]+[bs, 100, 4])
    targets: 'boxes'=[3,4], 'labels'=[3], ...
    indices: e.g. ([5, 35, 63], [1, 0, 2]) -- the 3 matched prediction indices and their gt indices
    num_boxes: total number of gt boxes in the current batch
    """
    assert 'pred_boxes' in outputs
    # idx tuple of 2:  0=[num_all_gt] which image each gt belongs to,  1=[num_all_gt] index of each matched prediction
    idx = self._get_src_permutation_idx(indices)

    # [all_gt_num, 4]: coordinates of all matched (positive) predicted boxes in this batch
    src_boxes = outputs['pred_boxes'][idx]
    # [all_gt_num, 4]: coordinates of all gt boxes in this batch
    target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)

    # L1 loss
    loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')

    losses = {}
    losses['loss_bbox'] = loss_bbox.sum() / num_boxes

    # GIoU loss
    loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
        box_ops.box_cxcywh_to_xyxy(src_boxes),
        box_ops.box_cxcywh_to_xyxy(target_boxes)))
    losses['loss_giou'] = loss_giou.sum() / num_boxes

    # 'loss_bbox': L1 regression loss   'loss_giou': GIoU regression loss
    return losses
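
A toy check of the GIoU term (a sketch; the import assumes the DETR repo's util/box_ops.py):

import torch
from util import box_ops  # assumed repo layout

src_boxes = torch.tensor([[0.50, 0.50, 0.20, 0.20]])     # (cx, cy, w, h), normalized
target_boxes = torch.tensor([[0.50, 0.50, 0.40, 0.40]])  # same center, twice the side length

giou = torch.diag(box_ops.generalized_box_iou(
    box_ops.box_cxcywh_to_xyxy(src_boxes),
    box_ops.box_cxcywh_to_xyxy(target_boxes)))
print(1 - giou)  # tensor([0.7500]): IoU = 0.25 here; identical boxes would give loss 0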


4. Box Post-Processing

At test time the predicted boxes go through a post-processing step.

Essentially, the raw predictions are collated: the background class is dropped, and for each image we obtain, for its 100 predicted boxes, the class probability scores (scores), the class labels (labels), and the absolute box coordinates (boxes).

These results are then passed to the coco_evaluator to compute the COCO metrics.

At inference time, however, an image rarely contains anything near 100 objects. How is that handled? The usual approach is a confidence threshold (e.g. 0.7): only boxes whose score exceeds the threshold are kept and displayed; the rest are discarded, as in the sketch below.
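
A hedged sketch of that filtering step (assuming `results` is the list returned by the PostProcess module below):

score_threshold = 0.7
for res in results:
    keep = res['scores'] > score_threshold   # boolean mask over the 100 predictions
    res['scores'] = res['scores'][keep]
    res['labels'] = res['labels'][keep]
    res['boxes'] = res['boxes'][keep]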

class PostProcess(nn.Module):
    """ This module converts the model's output into the format expected by the coco api"""
    @torch.no_grad()
    def forward(self, outputs, target_sizes):
        """ Perform the computation
        Parameters:
            outputs: raw outputs of the model
                     0 pred_logits: classification head output [bs, 100, 92 (num classes)]
                     1 pred_boxes: regression head output [bs, 100, 4]
                     2 aux_outputs: list of 5, the first 5 decoder layers' pred_logits [bs, 100, 92] and pred_boxes [bs, 100, 4]
            target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
                          For evaluation, this must be the original image size (before any data augmentation)
                          For visualization, this should be the image size after data augment, but before padding
        """
        # out_logits: [bs, 100, 92 (num classes)]
        # out_bbox: [bs, 100, 4]
        out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']

        assert len(out_logits) == len(target_sizes)
        assert target_sizes.shape[1] == 2

        # [bs, 100, 92]: softmax over each predicted box's class logits
        prob = F.softmax(out_logits, -1)
        # prob[..., :-1]: [bs, 100, 92] -> [bs, 100, 91], drop the background class
        # .max(-1): scores=[bs, 100]  probability of each box's most likely class
        #           labels=[bs, 100]  class of each of the 100 predicted boxes
        scores, labels = prob[..., :-1].max(-1)

        # cxcywh to xyxy  format   [bs, 100, 4]
        boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
        # and from relative [0, 1] to absolute [0, height] coordinates; widths and heights of the bs images
        img_h, img_w = target_sizes.unbind(1)
        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
        boxes = boxes * scale_fct[:, None, :]  # normalized coords -> absolute coords (w.r.t. the original image)  [bs, 100, 4]

        results = [{'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]

        # list: bs, one dict per image with the fields 'scores', 'labels', 'boxes'
        # scores = Tensor[100,]  probability scores of the image's 100 predicted boxes
        # labels = Tensor[100,]  class idx of each of the 100 predicted boxes
        # boxes  = Tensor[100, 4]  absolute coordinates of the 100 predicted boxes (w.r.t. the original image size)
        return results
