Object Detection Study Notes 1: DETR Source Code Analysis

Preface

I have been studying DETR recently. After reading the paper I tried to work through the source code; my own level being limited, I leaned on many excellent write-ups. Here is the approach I use for reading a paper together with its code, which I find quite helpful:

  • How to understand the code of a network
    • Find explanations online to form a rough picture of the network
    • Find the model code on GitHub and, together with the original paper, deepen that understanding
    • Set up the environment, debug, and trace train.py to quickly locate the model-related code; then run the code end to end while focusing on network construction, data preprocessing, and loss computation

The key points when studying the DETR source code are:

  1.  Backbone: positional encoding (PositionEmbeddingSine);
  2.  Transformer: TransformerEncoderLayer + TransformerDecoderLayer;
  3.  Loss: the Hungarian algorithm, bipartite matching (self.matcher)

Contents

Preface

1. Data Processing

dataset:

dataloader:

2. Overall DETR Architecture

2.1 The build function

2.1.1 Backbone

position_embedding

PositionEmbeddingSine

PositionEmbeddingLearned

Backbone_CNN

Joiner(backbone, position_embedding)

2.1.2 Transformer

TransformerEncoderLayer

TransformerEncoder

TransformerDecoderLayer

TransformerDecoder

2.1.3 Loss Computation: SetCriterion

Computing losses: self.get_loss

Hungarian matching: HungarianMatcher

2.1.4 Bbox Post-processing: PostProcess

2.2 Building DETR

Reference



1. Data Processing

dataset:

Starting from mine.py as the entry point:

    dataset_train = build_dataset(image_set='train', args=args)
    dataset_val = build_dataset(image_set='val', args=args)

datasets/__init__.py 

def build_dataset(image_set, args):
    if args.dataset_file == 'coco':   # 判断是否是coco
        return build_coco(image_set, args)
    if args.dataset_file == 'coco_panoptic':
        # to avoid making panopticapi required for coco
        from .coco_panoptic import build as build_coco_panoptic
        return build_coco_panoptic(image_set, args)
    raise ValueError(f'dataset {args.dataset_file} not supported')

datasets/coco.py

This is reached through build_coco:

from .coco import build as build_coco


def build(image_set, args):
    root = Path(args.coco_path)
    assert root.exists(), f'provided COCO path {root} does not exist'
    mode = 'instances'    # 注解文件
    PATHS = {
        "train": (root / "train2017", root / "annotations" / f'{mode}_train2017.json'),     # 前面图片,后面注解
        "val": (root / "val2017", root / "annotations" / f'{mode}_val2017.json'),
    }

    img_folder, ann_file = PATHS[image_set]   # 传入train或者val
    # 图片路径 注解路径 对图片进行数据增强处理
    dataset = CocoDetection(img_folder, ann_file, transforms=make_coco_transforms(image_set), return_masks=args.masks)
    return dataset

make_coco_transforms(): defines the data augmentation pipeline

def make_coco_transforms(image_set):

    normalize = T.Compose([
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) # 归一化
    ])

    scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

    if image_set == 'train':
        return T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomSelect(
                T.RandomResize(scales, max_size=1333),  # 随机缩放到指定尺寸范围内
                T.Compose([
                    T.RandomResize([400, 500, 600]),
                    T.RandomSizeCrop(384, 600),  # 随机尺寸裁剪
                    T.RandomResize(scales, max_size=1333),  # 再次随机缩放到指定尺寸
                ])
            ),
            normalize,
        ])

    if image_set == 'val':
        return T.Compose([
            T.RandomResize([800], max_size=1333),
            normalize,
        ])

    raise ValueError(f'unknown {image_set}')

 CocoDetection()

class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, img_folder, ann_file, transforms, return_masks):
        super(CocoDetection, self).__init__(img_folder, ann_file)
        self._transforms = transforms
        self.prepare = ConvertCocoPolysToMask(return_masks)

    def __getitem__(self, idx):
        img, target = super(CocoDetection, self).__getitem__(idx)
        image_id = self.ids[idx]
        target = {'image_id': image_id, 'annotations': target}
        img, target = self.prepare(img, target)  # 滤除有遮掩的物体以及一些bbox不合法的数据
        if self._transforms is not None:
            img, target = self._transforms(img, target)
        return img, target  # 返回数据和标签

prepare(img, target) delegates to ConvertCocoPolysToMask(return_masks),

which filters out crowd (occluded) objects and annotations with invalid bboxes. The core of the code is:

anno = target["annotations"]

anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0]  # 过滤掉iscrowd(成群/遮挡)的目标

boxes = [obj["bbox"] for obj in anno]
# guard against no boxes via resizing
# 将边界框信息转换为一个浮点型的 PyTorch 张量,并将其形状调整为 (N, 4),其中 N 是边界框的数量。
# COCO 的 bbox 格式为 [x, y, w, h]:左上角 x 坐标、左上角 y 坐标、宽度、高度。
boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
boxes[:, 2:] += boxes[:, :2]  # [x, y, w, h] -> [x0, y0, x1, y1]:右下角坐标 = 左上角坐标 + 宽高
boxes[:, 0::2].clamp_(min=0, max=w)  # 将边界框的 x 坐标限制在图像宽度范围内。
boxes[:, 1::2].clamp_(min=0, max=h)  # 将边界框的 y 坐标限制在图像高度范围内。

Finally, the processed dataset is returned to build_dataset.

dataloader:

    data_loader_train = DataLoader(dataset_train, batch_sampler=batch_sampler_train,
                                   collate_fn=utils.collate_fn, num_workers=args.num_workers)  #  collate_fn=utils.collate_fn:将不同大小的图像填充到相同的大小以进行批处理
    data_loader_val = DataLoader(dataset_val, args.batch_size, sampler=sampler_val,
                                 drop_last=False, collate_fn=utils.collate_fn, num_workers=args.num_workers)
collate_fn=utils.collate_fn (util/misc.py): take the shapes of all images in a batch, build an all-zeros tensor of the maximum [h, w] and copy each image into its top-left corner, and build a mask of the same spatial size in which each image's valid region is set to False and the padded remainder to True. The purpose is to pad images of different sizes to a common size so they can be batched (a small standalone sketch of this padding logic follows the function below):
def nested_tensor_from_tensor_list(tensor_list: List[Tensor]):
    # TODO make this more general
    if tensor_list[0].ndim == 3:
        if torchvision._is_tracing():
            # nested_tensor_from_tensor_list() does not export well to ONNX
            # call _onnx_nested_tensor_from_tensor_list() instead
            return _onnx_nested_tensor_from_tensor_list(tensor_list)

        # TODO make it support different-sized images
        max_size = _max_by_axis([list(img.shape) for img in tensor_list])
        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))
        batch_shape = [len(tensor_list)] + max_size
        b, c, h, w = batch_shape
        dtype = tensor_list[0].dtype
        device = tensor_list[0].device
        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)
        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)
        for img, pad_img, m in zip(tensor_list, tensor, mask):
            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)  # 将img张量的部分内容(根据其形状)复制到pad_img张量中
            m[: img.shape[1], :img.shape[2]] = False  # 将mask张量中与img张量相同大小的区域的值设置为False,标记出每个图像张量的有效区域。
    else:
        raise ValueError('not supported')
    return NestedTensor(tensor, mask)
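
To make the padding and mask logic concrete, here is a small self-contained sketch (not repo code; the image sizes are invented) that mimics what nested_tensor_from_tensor_list does for two images of different sizes:

import torch

# two "images" of different sizes (C, H, W); random values, illustration only
imgs = [torch.randn(3, 4, 6), torch.randn(3, 5, 3)]

# per-dimension maximum -> padded batch shape [B, C, H_max, W_max]
max_size = [max(s) for s in zip(*[img.shape for img in imgs])]
batch = torch.zeros([len(imgs)] + max_size)                                  # all-zero batch tensor
mask = torch.ones((len(imgs), max_size[1], max_size[2]), dtype=torch.bool)   # all-True mask

for img, pad_img, m in zip(imgs, batch, mask):
    pad_img[:, :img.shape[1], :img.shape[2]].copy_(img)   # copy the image into the top-left corner
    m[:img.shape[1], :img.shape[2]] = False               # mark the valid region as False

print(batch.shape)   # torch.Size([2, 3, 5, 6])
print(mask[0])       # first image: rows 0-3 are False (valid), the last row is True (padding)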

2. Overall DETR Architecture

Starting again from mine.py as the entry point, build_model does two things:

  1. Build DETR: Backbone + Transformer + MLP
  2. Initialize the loss function (criterion) and the post-processor (postprocessors)
    # 模型搭建返回model(Backbone + Transformer + MLP), criterion(初始化损失函数), postprocessors(初始化后处理,调用coco api后处理函数)
    model, criterion, postprocessors = build_model(args)
    model.to(device)

2.1 The build function

def build(args):
    # the `num_classes` naming here is somewhat misleading.
    # it indeed corresponds to `max_obj_id + 1`, where max_obj_id
    # is the maximum id for a class in your dataset. For example,
    # COCO has a max_obj_id of 90, so we pass `num_classes` to be 91.
    # As another example, for a dataset that has a single class with id 1,
    # you should pass `num_classes` to be 2 (max_obj_id + 1).
    # For more details on this, check the following discussion
    # https://github.com/facebookresearch/detr/issues/108#issuecomment-650269223
    num_classes = 20 if args.dataset_file != 'coco' else 91
    if args.dataset_file == "coco_panoptic":
        # for panoptic, we just add a num_classes that is large enough to hold
        # max_obj_id + 1, but the exact value doesn't really matter
        num_classes = 250
    device = torch.device(args.device)

    # 搭建backbone resnet + PositionEmbeddingSine
    backbone = build_backbone(args)

    # 搭建transformer
    transformer = build_transformer(args)

    # 搭建整个DETR模型
    model = DETR(
        backbone,
        transformer,
        num_classes=num_classes,
        num_queries=args.num_queries,
        aux_loss=args.aux_loss,
    )

    # 是否需要额外的分割任务
    if args.masks:
        model = DETRsegm(model, freeze_detr=(args.frozen_weights is not None))

    # HungarianMatcher()  二分图匹配 匈牙利算法
    matcher = build_matcher(args)

    # 损失权重
    weight_dict = {'loss_ce': 1, 'loss_bbox': args.bbox_loss_coef}
    weight_dict['loss_giou'] = args.giou_loss_coef
    if args.masks:   # 分割任务  False
        weight_dict["loss_mask"] = args.mask_loss_coef
        weight_dict["loss_dice"] = args.dice_loss_coef
    # TODO this is a hack
    if args.aux_loss:   # 辅助损失  每个decoder都参与计算损失  True
        aux_weight_dict = {}
        for i in range(args.dec_layers - 1):
            aux_weight_dict.update({k + f'_{i}': v for k, v in weight_dict.items()})
        weight_dict.update(aux_weight_dict)

    losses = ['labels', 'boxes', 'cardinality']
    if args.masks:
        losses += ["masks"]

    # 定义损失函数
    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
    criterion.to(device)

    # 定义后处理
    postprocessors = {'bbox': PostProcess()}

    # 分割
    if args.masks:
        postprocessors['segm'] = PostProcessSegm()
        if args.dataset_file == "coco_panoptic":
            is_thing_map = {i: i <= 90 for i in range(201)}
            postprocessors["panoptic"] = PostProcessPanoptic(is_thing_map, threshold=0.85)

    return model, criterion, postprocessors

2.1.1 Backbone 

def build_backbone(args):
    position_embedding = build_position_encoding(args)
    train_backbone = args.lr_backbone > 0
    return_interm_layers = args.masks
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    model = Joiner(backbone, position_embedding)
    model.num_channels = backbone.num_channels
    return model

The backbone is composed of position_embedding plus the CNN backbone (Backbone_CNN).

position_embedding
def build_position_encoding(args):
    N_steps = args.hidden_dim // 2  # 使用二维位置编码,前一半为x轴位置编码,后面为y轴位置编码
    if args.position_embedding in ('v2', 'sine'):   # 正余弦函数作为位置编码
        # TODO find a better way of exposing other arguments
        position_embedding = PositionEmbeddingSine(N_steps, normalize=True)
    elif args.position_embedding in ('v3', 'learned'):  # 可学习函数作为位置编码 
        position_embedding = PositionEmbeddingLearned(N_steps)
    else:
        raise ValueError(f"not supported {args.position_embedding}")

    return position_embedding

The positional encoding can be generated either with sine/cosine functions or as a learned embedding; DETR uses the former (the latter is only briefly introduced below).

PositionEmbeddingSine
class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask  # 在数据处理中得到的mask模板
        assert mask is not None
        not_mask = ~mask   # 取反:为True的位置是图片真实有效的位置
        y_embed = not_mask.cumsum(1, dtype=torch.float32)  # 行累加
        x_embed = not_mask.cumsum(2, dtype=torch.float32)  # 列累加
        if self.normalize:
            eps = 1e-6  # 确保x_max作分母不为0
            # 归一化:x/x_max,将其缩放到0-1之间,再乘以2π
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)  # 公式pos后面的常数
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)  # 2 * (dim_t // 2):确保是偶数

        pos_x = x_embed[:, :, :, None] / dim_t  # 公式
        pos_y = y_embed[:, :, :, None] / dim_t
        # 使用stack在最后一个维度拼接,即偶数函数后接着下一个奇数函数
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        # [bs,256,19,26] 为特征图上19x26个位置中的每一个生成256维的位置编码
        return pos

The formula is as follows:
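
(The original figure with the formula is not reproduced here. For reference, and as my own reconstruction rather than a copy from the repo, PositionEmbeddingSine applies the standard sine/cosine encoding from "Attention Is All You Need" separately to the normalized x and y coordinates, with temperature T = 10000 and d = num_pos_feats per axis:)

    \mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{T^{2i/d}}\right), \qquad
    \mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{T^{2i/d}}\right)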

About y_embed = not_mask.cumsum(1, dtype=torch.float32):

In the mask, False marks the real, valid part of the image. After inverting the mask, True marks the valid positions, and a cumulative sum along the y axis (respectively the x axis) assigns every position a coordinate. Padded positions (False in not_mask) may end up with duplicated coordinates, but that does not matter: in the attention layers those positions are filled with -inf, so after softmax their weights are essentially 0. Their positional encoding is therefore irrelevant; they never belong to the valid image region anyway.
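
A tiny self-contained example (mask values invented) of how cumsum over the inverted mask produces per-pixel coordinates:

import torch

# mask: True = padding, False = valid (as produced by nested_tensor_from_tensor_list)
mask = torch.tensor([[False, False, True],
                     [False, False, True],
                     [True,  True,  True]]).unsqueeze(0)   # [bs=1, h=3, w=3]

not_mask = ~mask
y_embed = not_mask.cumsum(1, dtype=torch.float32)   # accumulate along the height axis -> row coordinate
x_embed = not_mask.cumsum(2, dtype=torch.float32)   # accumulate along the width axis  -> column coordinate

print(y_embed[0])
# tensor([[1., 1., 0.],
#         [2., 2., 0.],
#         [2., 2., 0.]])   padded positions repeat coordinates, but they are masked out in attention anyway
print(x_embed[0])
# tensor([[1., 2., 2.],
#         [1., 2., 2.],
#         [0., 0., 0.]])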

PositionEmbeddingLearned
class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        self.row_embed = nn.Embedding(50, num_pos_feats)  # 50表示位置的维度,num_pos_feats表示每个位置上的特征数。
        self.col_embed = nn.Embedding(50, num_pos_feats)
        # nn.init.uniform_函数来对row_embed.weight和col_embed.weight进行均匀分布的随机初始化
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        h, w = x.shape[-2:]
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)  # (w, num_pos_feats)
        y_emb = self.row_embed(j)  # (h, num_pos_feats)
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(h, 1, 1),
            y_emb.unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return pos
Backbone_CNN

The CNN part is the Backbone class:

class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        # getattr返回torchvision.models对象的属性值
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
        num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
        super().__init__(backbone, train_backbone, num_channels, return_interm_layers)

It inherits from BackboneBase:

class BackboneBase(nn.Module):

    def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
        super().__init__()
        # layer0 layer1不需要训练 前层提取的信息有限
        for name, parameter in backbone.named_parameters():
            if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
                parameter.requires_grad_(False)
        # False 检测任务不需要返回中间层
        if return_interm_layers:
            return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
        else:
            return_layers = {'layer4': "0"}
        # 检测任务直接返回layer4即可  执行torchvision.models._utils.IntermediateLayerGetter函数直接返回对应层的输出结果
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.num_channels = num_channels

    def forward(self, tensor_list: NestedTensor):
        xs = self.body(tensor_list.tensors)   # 取出要求特征层的信息(包含整个特征与mask信息)
        out: Dict[str, NestedTensor] = {}  # 创建一个空字典out,用于存储处理后的特征张量
        # 目的是将mask padding填充的特征与图片特征分离
        for name, x in xs.items():   # name表示特征层的名称,x表示该特征层的输出
            m = tensor_list.mask  # 源输入的mask [bs, 608, 810] 知道图片哪些区域是有效的 哪些位置是pad之后的无效的
            assert m is not None
            #  对掩码m进行插值,使其与特征张量x具有相同的空间尺寸
            # m[None]表示将掩码m添加一个维度,形状变为(1, h, w),其中h和w分别表示原始掩码的高度和宽度
            # size=x.shape[-2:]表示插值后的结果应该具有和特征张量x的最后两个维度相同的高度和宽度
            mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0]
            out[name] = NestedTensor(x, mask)
        return out   # 每个元素是一个NestedTensor对象,其中包含了特征张量和对应的掩码
Joiner(backbone, position_embedding)

Joiner(backbone, position_embedding) chains the backbone and the position embedding together:

class Joiner(nn.Sequential):
    def __init__(self, backbone, position_embedding):
        super().__init__(backbone, position_embedding)

    def forward(self, tensor_list: NestedTensor):  # 输入参数tensor_list是一个NestedTensor类型的张量,具有两个属性:tensors和mask,分别表示输入张量和对应的掩码
        xs = self[0](tensor_list)  # self[0],即主干网络模块backbone
        out: List[NestedTensor] = []
        pos = []
        for name, x in xs.items():
            out.append(x)
            # position encoding
            pos.append(self[1](x).to(x.tensors.dtype))   # self[1]:position_embedding

        return out, pos   # 返回一个元组(out, pos),其中out包含了所有特征层的输出,pos包含了所有特征层对应的位置编码

2.1.2 Transformer

The transformer is created by build_transformer:

def build_transformer(args):
    """
    d_model: transformer的特征维度(args.hidden_dim),默认256
    nhead: 多头注意力头数 8
    num_encoder_layers: encoder的层数 6
    num_decoder_layers: decoder的层数 6
    dim_feedforward: 前馈神经网络的维度 2048
    dropout: 0.1
    activation: 激活函数类型 relu
    normalize_before: 是否使用前置LN
    return_intermediate_dec: 是否返回decoder中间层结果  这里传入True(辅助损失需要)
    """
    return Transformer(
        d_model=args.hidden_dim,
        dropout=args.dropout,
        nhead=args.nheads,
        dim_feedforward=args.dim_feedforward,
        num_encoder_layers=args.enc_layers,
        num_decoder_layers=args.dec_layers,
        normalize_before=args.pre_norm,
        return_intermediate_dec=True,
    )

It instantiates the Transformer class, whose sub-modules are described next:

TransformerEncoderLayer

Encoder layer structure:

An encoder layer consists of four parts: multi-head self-attention + Add&Norm + feed forward + Add&Norm.

Add&Norm is simply a residual connection followed by LayerNorm; a tiny illustration follows.
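
As a minimal illustration of Add&Norm (not repo code; the shapes are invented for the example), the residual connection plus LayerNorm boils down to:

import torch
import torch.nn as nn

d_model = 256
norm = nn.LayerNorm(d_model)

src = torch.randn(10, 2, d_model)        # (sequence length, batch, d_model)
sublayer_out = torch.randn_like(src)     # stands in for the output of self-attention or the FFN

src = norm(src + sublayer_out)           # Add & Norm: residual connection, then LayerNorm
print(src.shape)                         # torch.Size([10, 2, 256]) -- shape is unchanged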

class TransformerEncoderLayer(nn.Module):
    """
    d_model:mlp隐藏层输入dim(一个全连接层)
    nhead:多头注意力头数
    dim_feedforward:前馈神经网络的维度 2048
    dropout:0.1
    activate:激活函数类型relu
    normalize_before:是否使用前置LN
    """
    # 结构:multi-head Attention + add&Norm + feed forward + add&Norm
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos  # 将tensor与位置编码相加,参数经过backbone输出是features, pos分开的

    def forward_post(self,
                     src,   # backbone输出的最后一层特征图,映射到hidden_dim  shape(h*w,bs,hidden_dim)
                     src_mask: Optional[Tensor] = None,   # 判断是否存在transformer作弊,遮住当前预测位置之后的位置,忽略这些位置,不计算与其相关的注意力权重 None
                     src_key_padding_mask: Optional[Tensor] = None,  # 最后一层输出特征图对应的mask,即对特征图无效位置不进行encoder
                     pos: Optional[Tensor] = None):    # 位置编码
        # 每一层q、k都加上位置编码  信息不断加强  最终得到的特征全局相关性最强
        # 这里对query和key增加位置编码 是因为需要在图像特征中各个位置之间计算相似度/相关性 而value作为原图像的特征 和 相关性矩阵加权
        # key_padding_mask:对无效位置计算注意力会被填充为-inf,这样最终生成注意力经过softmax时输出就趋向于0,相当于忽略不计
        q = k = self.with_pos_embed(src, pos)  # 对特征与位置编码进行相加---q、k = src + pos
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,   # v = backbone输出的最后一层特征图
                              key_padding_mask=src_key_padding_mask)[0]  # nn.MultiheadAttention: 返回两个值  第一个是自注意力层的输出  第二个是自注意力权重  这里取0
        src = src + self.dropout1(src2)  # add
        src = self.norm1(src)  # Norm
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))  # feed forward
        src = src + self.dropout2(src2)  # add
        src = self.norm2(src)  # Norm
        return src  # 输出特征图shape不改变

    def forward_pre(self, src,
                    src_mask: Optional[Tensor] = None,
                    src_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None):
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        return src

    def forward(self, src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)
TransformerEncoder

TransformerEncoder calls _get_clones to replicate TransformerEncoderLayer six times; the forward pass then feeds the input through these six layers in sequence, repeatedly computing self-attention over the feature map and strengthening it, until the final, most informative feature map output [h*w, bs, 256] is obtained. The shape of the feature map never changes inside the TransformerEncoder. A minimal sketch of _get_clones is given right below.
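
For reference, _get_clones is essentially a deepcopy into an nn.ModuleList. A minimal sketch consistent with the behaviour described here (the helper in the repo may differ in minor details):

import copy
import torch.nn as nn

def _get_clones(module, N):
    # N independent deep copies of the same layer, registered as sub-modules
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# e.g. six encoder layers sharing the architecture but not the weights
layers = _get_clones(nn.TransformerEncoderLayer(d_model=256, nhead=8), 6)
print(len(layers))   # 6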

class TransformerEncoder(nn.Module):

    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        # 复制num_layers次encoder_layer
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src
        # 遍历,注意:特征图的大小不变
        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)

        if self.norm is not None:
            output = self.norm(output)

        return output  # 得到最终ENCODER的输出 [h*w, bs, 256]
TransformerDecoderLayer

Decoder layer structure:

decoder layer = masked multi-head self-attention + Add&Norm + multi-head cross-attention + Add&Norm + feed forward + Add&Norm

The key parts are the multi-head self-attention (it uses no encoder information; the object queries only interact with each other and learn how to divide the work among themselves) and the multi-head cross-attention (it uses the keys and values provided by the encoder together with the queries refined by the preceding self-attention, making the queries more precise).

class TransformerDecoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)      # 与encoder相关
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)   # 不与encoder相关
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self, tgt, memory,
                     tgt_mask: Optional[Tensor] = None,
                     memory_mask: Optional[Tensor] = None,
                     tgt_key_padding_mask: Optional[Tensor] = None,
                     memory_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None,
                     query_pos: Optional[Tensor] = None):
        """
        tgt: 需要预测的目标 query embedding  负责预测物体  用于建模图像当中的物体信息  在每层decoder层中不断的被refine
             [num_quires, bs, hidden_dim]  和 query_embed形状相同  设置为全0初始化
        query_pos: shape与tgt一致  query embedding/tgt的位置编码  负责建模物体与物体之间的位置关系  随机初始化
        memory: [h*w, bs, hidden_dim]  Encoder输出  具有全局相关性(增强后)的特征表示
        tgt_mask: None
        memory_mask: None
        tgt_key_padding_mask: None  tgt中每个位置都需要计算
        memory_key_padding_mask: [bs, h*w]  记录Encoder输出特征图的每个位置是否是被padding的(True无效   False有效)
        pos: [h*w, bs, 256]  encoder输出特征图的位置编码

        tgt_mask、memory_mask、tgt_key_padding_mask是防止作弊的 这里都没有使用
        """
        # 第一个self-attention的目的:找到图像中物体的信息,然后将Q,即从quire_position拿到物体之间的相互关系提供给第二个atten
        # 没用到decoder中的信息,自己学习,划分每个向量q的任务
        # 通过第一个self-attention 不断建模物体与物体之间的关系,知道图像当中哪些位置会存在物体  物体信息->tgt
        q = k = self.with_pos_embed(tgt, query_pos)
        # 计算query embedding的自注意力
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        # Add & Norm
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        # 第二个self-attention的目的:不断增强encoder的输出特征,将物体的信息不断加入encoder的输出特征中去,更好地表征了图像中的各个物体
        # 第二个多头注意力层,也叫Encoder-Decoder self attention:key和value来自Encoder层输出,从encoder拿到全图之间的相互关系(特征)   Query来自Decoder层输入(第一个self_atten的Q),拿到物体与物体之间的关系
        # 第二个self-attention 可以建模图像 与 物体之间的关系
        # 根据上一个self_atten的query 不断的去encoder输出的特征图中去问(q和k计算相似度)  问图像当中的物体在哪里呢?
        # 问完之后再将物体的位置信息融合encoder输出的特征图中(和v做运算)  这样我得到的v的特征就有 encoder增强后特征信息 + 物体的位置信息
        # query = query embedding  +  query_pos
        # key = encoder输出特征 + 特征位置编码
        # value = encoder输出特征
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        # Add & Norm
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        # FFN
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        # Add & Norm
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt  # 输出:tgt 依次融合了自注意力(tgt1)、交叉注意力(tgt2)和FFN的结果(均带残差连接)

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
        return self.forward_post(tgt, memory, tgt_mask, memory_mask,
                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
TransformerDecoder

TransformerDecoder likewise uses _get_clones to replicate TransformerDecoderLayer six times and feeds the input through them in sequence. The difference is that the decoder collects the output of every one of the six layers; all six outputs later take part in the loss computation (auxiliary losses).

class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        # 复制num_layers=decoder_layer=TransformerDecoderLayer
        # 每一层Decoder都是逐层解析,逐层加强的,所以前面层的解析效果对后面层的解析是有意义的,所以作者把前面5层的输出也加入损失计算
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []   # 保存6层decoder输出
        # 遍历6层decoder
        for layer in self.layers:
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:  # 是否保留中间层输出  True
                intermediate.append(self.norm(output))

        # 如果return_intermediate为True,这里先把最后一层的结果pop出来,再把norm后的output重新append进去,
        # 两者数值其实相同,这一段略显冗余;但return_intermediate在DETR中必须为True,
        # 因为需要返回每一层decoder的输出去参与损失计算(辅助损失)
        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)
                
        # 最后把  6x[100,bs,256] -> [6(6层decoder输出),100,bs,256]
        if self.return_intermediate:
            return torch.stack(intermediate)

        return output.unsqueeze(0)  # 不执行

Question 1
Some readers may wonder: the object content queries tgt are initialized to all zeros and the object position queries query_pos are randomly initialized, so how can they end up carrying such rich meaning? The model cannot "know" what they represent at initialization. The answer lies in the loss function: once the loss is defined, gradients flow back through it, the network keeps learning, and the learned tgt and query_pos end up encoding exactly the semantics described above. It is the same as box regression: we simply decide that four channels stand for x, y, w, h, and the network learns that meaning through the loss and backpropagation.

Question 2
Why does the decoder output effectively combine tgt1 and tgt2 (the self-attention result plus the cross-attention result) instead of using only one of them?

tgt1 (from self-attention) carries the object information plus the object position information, but very little image feature; on its own, the class predictions would be poor.

tgt2 (from cross-attention) carries the encoder-enhanced image features plus the object position information, but lacks the object information itself; on its own, the box predictions would be poor.

Adding the two (through the residual connections) yields features that predict both the class and the box well. https://hukai.blog.csdn.net/article/details/127616634


2.1.3 Loss Computation: SetCriterion

The loss function is defined in detr.py:

    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
Computing losses: self.get_loss
    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
        loss_map = {
            'labels': self.loss_labels,
            'cardinality': self.loss_cardinality,
            'boxes': self.loss_boxes,
            'masks': self.loss_masks
        }
        assert loss in loss_map, f'do you really want to compute {loss} loss?'
        return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)

get_loss dispatches to the classification loss (self.loss_labels), the box regression loss (self.loss_boxes) and the cardinality loss. The cardinality loss is only used for logging and does not contribute to gradient updates. The mask loss is only relevant for the segmentation task and is not covered here.

class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
        super().__init__()
        self.num_classes = num_classes  # 数据集类别数目
        self.matcher = matcher    # 匈牙利匹配算法,计算目标和预测之间的匹配
        self.weight_dict = weight_dict      # dict: 18  3x6  6个decoder的损失权重   6*(loss_ce+loss_giou+loss_bbox)
        self.eos_coef = eos_coef     # 0.1
        self.losses = losses          # list: 3  ['labels', 'boxes', 'cardinality']
        empty_weight = torch.ones(self.num_classes + 1)
        empty_weight[-1] = self.eos_coef   # tensor: 92   前91个类别=1  最后一个(背景类)=eos_coef=0.1
        # empty_weight 张量被注册为模型 self 的缓冲区参数,并被命名为 empty_weight
        # empty_weight 成为了模型的一部分,但不会参与模型的反向传播或梯度更新过程。而是作为固定的常量参数存在于模型中。
        self.register_buffer('empty_weight', empty_weight)

    def loss_labels(self, outputs, targets, indices, num_boxes, log=True):   # 分类损失
        """Classification loss (NLL)
        targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
        """
        assert 'pred_logits' in outputs
        src_logits = outputs['pred_logits']

        # idx tuple:2  0=[num_all_gt] 记录每个gt属于哪张图片  1=[num_all_gt] 记录每个匹配到的预测框的index
        idx = self._get_src_permutation_idx(indices)
        target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)])
        # 正样本+负样本  上面匹配到的预测框作为正样本 正常的idx  而100个中没有匹配到的预测框作为负样本(idx=91 背景类)
        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                    dtype=torch.int64, device=src_logits.device)
        target_classes[idx] = target_classes_o
        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
        losses = {'loss_ce': loss_ce}

        if log:   # 日志
            # TODO this should probably be a separate loss, not hacked in this one here
            losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]
        return losses

    @torch.no_grad()    # 不参与梯度计算
    def loss_cardinality(self, outputs, targets, indices, num_boxes):
        """ Compute the cardinality error, ie the absolute error in the number of predicted non-empty boxes
        This is not really a loss, it is intended for logging purposes only. It doesn't propagate gradients
        """
        pred_logits = outputs['pred_logits']
        device = pred_logits.device
        tgt_lengths = torch.as_tensor([len(v["labels"]) for v in targets], device=device)
        # Count the number of predictions that are NOT "no-object" (which is the last class)
        card_pred = (pred_logits.argmax(-1) != pred_logits.shape[-1] - 1).sum(1)
        card_err = F.l1_loss(card_pred.float(), tgt_lengths.float())
        losses = {'cardinality_error': card_err}
        return losses

    def loss_boxes(self, outputs, targets, indices, num_boxes):   # 回归损失
        """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
           targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
           The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
        """
        assert 'pred_boxes' in outputs
        # idx tuple:2  0=[num_all_gt] 记录每个gt属于哪张图片  1=[num_all_gt] 记录每个匹配到的预测框的index
        idx = self._get_src_permutation_idx(indices)
        # [all_gt_num, 4]  这个batch的所有正样本的预测框坐标
        src_boxes = outputs['pred_boxes'][idx]
        # [all_gt_num, 4]  这个batch的所有gt框坐标
        target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)
        # 计算L1损失
        loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')

        losses = {}
        losses['loss_bbox'] = loss_bbox.sum() / num_boxes
        # 计算GIOU损失
        loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
            box_ops.box_cxcywh_to_xyxy(src_boxes),
            box_ops.box_cxcywh_to_xyxy(target_boxes)))
        losses['loss_giou'] = loss_giou.sum() / num_boxes
        # 回归损失:只计算所有正样本的回归损失;
        # 回归损失 = L1Loss + GIOULoss
        return losses

    def loss_masks(self, outputs, targets, indices, num_boxes):   # 与分割任务相关,不做描述
        """Compute the losses related to the masks: the focal loss and the dice loss.
           targets dicts must contain the key "masks" containing a tensor of dim [nb_target_boxes, h, w]
        """
        assert "pred_masks" in outputs

        src_idx = self._get_src_permutation_idx(indices)
        tgt_idx = self._get_tgt_permutation_idx(indices)
        src_masks = outputs["pred_masks"]
        src_masks = src_masks[src_idx]
        masks = [t["masks"] for t in targets]
        # TODO use valid to mask invalid areas due to padding in loss
        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()
        target_masks = target_masks.to(src_masks)
        target_masks = target_masks[tgt_idx]

        # upsample predictions to the target size
        src_masks = interpolate(src_masks[:, None], size=target_masks.shape[-2:],
                                mode="bilinear", align_corners=False)
        src_masks = src_masks[:, 0].flatten(1)

        target_masks = target_masks.flatten(1)
        target_masks = target_masks.view(src_masks.shape)
        losses = {
            "loss_mask": sigmoid_focal_loss(src_masks, target_masks, num_boxes),
            "loss_dice": dice_loss(src_masks, target_masks, num_boxes),
        }
        return losses

    def _get_src_permutation_idx(self, indices):
        # permute predictions following indices
        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
        src_idx = torch.cat([src for (src, _) in indices])
        return batch_idx, src_idx

    def _get_tgt_permutation_idx(self, indices):
        # permute targets following indices
        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])
        tgt_idx = torch.cat([tgt for (_, tgt) in indices])
        return batch_idx, tgt_idx

    def forward(self, outputs, targets):
        """ This performs the loss computation.
        Parameters:
             outputs: dict of tensors, see the output specification of the model for the format
             targets: list of dicts, such that len(targets) == batch_size.
                      The expected keys in each dict depends on the losses applied, see each loss' doc
        """
        # dict: 2   最后一个decoder层输出  pred_logits[bs, 100, 92个class] + pred_boxes[bs, 100, 4](将前5层中间层筛选开)
        outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}

        # 匈牙利算法  解决二分图匹配问题  从100个预测框中找到和N个gt框一一对应的预测框  其他的100-N个都变为背景
        # Retrieve the matching between the outputs of the last layer and the targets  list:1
        # tuple: 2    0=Tensor3=Tensor[5, 35, 63]  匹配到的3个预测框  其他的97个预测框都是背景
        #             1=Tensor3=Tensor[1, 0, 2]    对应的三个gt框
        indices = self.matcher(outputs_without_aux, targets)

        # Compute the average number of target boxes accross all nodes, for normalization purposes
        num_boxes = sum(len(t["labels"]) for t in targets)
        num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
        if is_dist_avail_and_initialized():   # 分布式
            torch.distributed.all_reduce(num_boxes)
        num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()

        # Compute all the requested losses
        losses = {}
        # 计算最后一层decoder的loss
        for loss in self.losses:
            losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))

        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
        # 这段代码的目的是处理辅助输出和相关损失函数。根据不同的损失函数类型和辅助输出,对每个辅助输出计算相应的损失,并将其添加到总体的损失字典中。
        if 'aux_outputs' in outputs:
            for i, aux_outputs in enumerate(outputs['aux_outputs']):
                indices = self.matcher(aux_outputs, targets)
                for loss in self.losses:
                    if loss == 'masks':    # 对于 'masks' 损失,因为计算中间的掩码损失开销较大,所以跳过计算
                        # Intermediate masks losses are too costly to compute, we ignore them.
                        continue
                    kwargs = {}
                    if loss == 'labels':
                        # Logging is enabled only for the last layer
                        kwargs = {'log': False}
                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
                    l_dict = {k + f'_{i}': v for k, v in l_dict.items()}  # 通过使用字典推导式将损失字典中的键加上后缀 _i,并将其与对应的值组成新的键值对,以区分不同的辅助输出。
                    losses.update(l_dict)

        return losses
Hungarian matching: HungarianMatcher
class HungarianMatcher(nn.Module):

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    @torch.no_grad()
    def forward(self, outputs, targets):
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        # 将预测的分类 logits(outputs["pred_logits"])通过 softmax 函数进行归一化处理。
        # [batch_size, num_queries, num_classes]->>[batch_size * num_queries, num_classes],然后在最后一个维度上应用 softmax 函数。
        # softmax 函数将每个类别的 logits 转换为一个概率分布,使得每个类别的概率值都在 0 到 1 之间,并且所有类别的概率之和为 1。
        # 输出的 out_prob 张量包含了每个预测的类别概率分布
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        # 将目标的类别标签和边界框坐标连接起来,形成两个张量 tgt_ids 和 tgt_bbox
        #  [num_target_boxes]
        tgt_ids = torch.cat([v["labels"] for v in targets])
        #  [num_target_boxes, 4]
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.
        # [batch_size * num_queries, num_classes] -->> [batch_size * num_queries, num_target_boxes] 整个batch中的预测与目标
        # 第 i 个预测和第 j 个目标的匹配成本被定义为预测为目标 j 的负对数概率值(即 -log p(y_i=j|x_i))
        # 分类损失通常被定义为交叉熵损失,它是一个正数,并且随着模型的训练应该逐渐减小。因此,为了将其作为损失函数的一部分,我们需要将其取负数
        cost_class = -out_prob[:, tgt_ids]  # 张量包含了所有预测和目标之间的分类匹配成本

        # Compute the L1 cost between boxes
        # 使用 PyTorch 的 cdist 函数计算了预测边界框和目标边界框之间的L1距离。
        # 这个函数接受两个张量作为输入,返回一个形状为 [batch_size * num_queries, num_target_boxes] 的张量
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)  # 包含了所有预测和目标之间的边界框坐标回归损失

        # Compute the giou cost betwen boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))  # 包含GIOU损失

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()  # [batch_size * num_queries, -1] -->> [batch_size, num_queries, -1]

        sizes = [len(v["boxes"]) for v in targets]   # 一张图片的gt数目
        # 先计算每个预测框(100个)和每个gt框的总损失,形成损失矩阵C,
        # 调用匈牙利算法计算这个二分图的度量矩阵的最小权重分配方式,返回的是匹配方案对应的矩阵行索引(预测框idx)和列索引(gt框idx)得到loss最小的组合
        # 若一张图片中gt为3
        # 0: [3]  5, 35, 63   匹配到的gt个预测框idx
        # 1: [3]  1, 0, 2     对应的gt idx
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        # list: bs  返回bs张图片的匹配结果
        # 每张图片都是一个tuple:2
        # 0 = Tensor[gt_num,]  匹配到的正样本idx       1 = Tensor[gt_num,]  gt的idx
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]

To summarize: the matcher first computes the total matching cost between every predicted box (100 of them) and every gt box, forming the cost matrix C; the Hungarian algorithm then finds the minimum-total-cost assignment on this bipartite graph and returns the row indices (predicted box idx) and column indices (gt box idx) of the lowest-cost combination (see the toy example below).
Note that in DETR the Hungarian matching is performed per image, not over the whole batch. This keeps the correspondence between each image's predictions and its own ground-truth boxes correct for loss computation and parameter updates.
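
A toy example (cost values invented) of the per-image assignment done with scipy's linear_sum_assignment, which is what HungarianMatcher calls under the hood:

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost matrix C for one image: 5 predicted boxes (rows) x 3 gt boxes (columns)
C = np.array([[0.9, 0.2, 0.8],
              [0.1, 0.7, 0.6],
              [0.8, 0.9, 0.1],
              [0.4, 0.5, 0.7],
              [0.6, 0.3, 0.9]])

pred_idx, gt_idx = linear_sum_assignment(C)   # minimum-total-cost one-to-one assignment
print(pred_idx, gt_idx)                       # [0 1 2] [1 0 2]: pred 0 <-> gt 1, pred 1 <-> gt 0, pred 2 <-> gt 2
print(C[pred_idx, gt_idx].sum())              # total matching cost 0.2 + 0.1 + 0.1 = 0.4
# predictions 3 and 4 are left unmatched and are treated as background (no-object)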


2.1.4 Bbox Post-processing: PostProcess

The post-processing is defined in detr.py:

    postprocessors = {'bbox': PostProcess()}

The PostProcess class:

class PostProcess(nn.Module):
    """ This module converts the model's output into the format expected by the coco api"""
    @torch.no_grad()
    def forward(self, outputs, target_sizes):
        """ Perform the computation
        Parameters:
            outputs: raw outputs of the model
            target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
                          For evaluation, this must be the original image size (before any data augmentation)
                          For visualization, this should be the image size after data augment, but before padding
        """
        # output: 模型的原始输出,其中包括预测的 logits 和边界框
        # target_sizes 是一个张量,大小为 [batch_size x 2],包含了每个图像的尺寸信息。
        # 在评估时,应该是原始图像的尺寸(在任何数据增强之前)。在可视化时,应该是数据增强后的图像尺寸,但没有填充。
        out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']

        assert len(out_logits) == len(target_sizes)
        assert target_sizes.shape[1] == 2
        # [bs, 100, 92]  对每个预测框的类别概率取softmax  prob[..., :-1]: [bs, 100, 92] -> [bs, 100, 91]  删除背景
        # .max(-1): scores=[bs, 100]  100个预测框属于最大概率类别的概率
        #           labels=[bs, 100]  100个预测框的类别
        prob = F.softmax(out_logits, -1)
        scores, labels = prob[..., :-1].max(-1)

        # convert to [x0, y0, x1, y1] format
        # 将边界框从中心宽高格式转换为对角点坐标格式 [x0, y0, x1, y1]
        boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
        # and from relative [0, 1] to absolute [0, height] coordinates
        img_h, img_w = target_sizes.unbind(1)  # unbind(1) 沿着第二个维度(索引1)将张量拆分成 img_h 和 img_w 两个张量
        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
        # boxes * scale_fct[:, None, :]:这一步是对边界框张量 boxes 进行缩放操作。通过广播机制,将 scale_fct 张量扩展到与 boxes 张量相同的维度,
        # 然后对应元素相乘,实现对边界框坐标进行缩放。最终的结果是将边界框的坐标从相对坐标(0 到 1)转换为绝对坐标(0 到图像宽度/高度)
        boxes = boxes * scale_fct[:, None, :]

        results = [{'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]

        return results

The goal of post-processing is to organize the predictions: drop the background class and, for each image, obtain the class probability scores (scores), class labels (labels) and absolute box coordinates (boxes) of the 100 predicted boxes. Its role is to convert the model outputs into the format expected by the COCO API.

Note:
At inference time an image usually contains far fewer than 100 objects, so how are the 100 predictions handled? Typically a confidence threshold (e.g. 0.7) is applied: only predictions whose score exceeds the threshold are kept and displayed; the rest are discarded (a minimal sketch of this filtering follows). https://blog.csdn.net/qq_38253797/article/details/127618402
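
A minimal sketch of that confidence filtering applied to one image's PostProcess output (random numbers, threshold 0.7 chosen as an example):

import torch

# fake PostProcess output for one image: 100 predictions
result = {
    'scores': torch.rand(100),           # max class probability per prediction
    'labels': torch.randint(0, 91, (100,)),
    'boxes':  torch.rand(100, 4) * 800,  # absolute [x0, y0, x1, y1] coordinates
}

keep = result['scores'] > 0.7            # keep only confident predictions
boxes_to_draw  = result['boxes'][keep]
labels_to_draw = result['labels'][keep]
print(keep.sum().item(), 'boxes kept out of 100')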


2.2 Building DETR

class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries  # 每张图片中要查询的边界框数量的参数
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)   # 分类
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)  # 回归
        self.query_embed = nn.Embedding(num_queries, hidden_dim)   # self.query_embed 类似于传统目标检测里面的anchor 这里设置了100个  [100,256]
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss

    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                dictionnaries containing the two above keys for each decoder layer.
        """
        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)   # 将mask张量中与img张量相同大小的区域的值设置为False,标记出每个图像张量的有效区域。  True为无效(padding)区域 False为有效区域
        # out: list{0: tensor=[bs,2048,19,26] + mask=[bs,19,26]}  经过backbone resnet50 block5输出的结果
        # pos: list{0: [bs,256,19,26]}  位置编码
        features, pos = self.backbone(samples)
        # src: Tensor [bs,2048,19,26]
        # mask: Tensor [bs,19,26]
        src, mask = features[-1].decompose()
        assert mask is not None
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
        # 分类 [6个decoder, bs, 100, 256] -> [6, bs, 100, 92(类别)]
        outputs_class = self.class_embed(hs)
        # 回归 [6个decoder, bs, 100, 256] -> [6, bs, 100, 4]
        outputs_coord = self.bbox_embed(hs).sigmoid()
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        # dict: 3
        # 0 pred_logits 分类头输出[bs, 100, 92(类别数)]
        # 1 pred_boxes 回归头输出[bs, 100, 4]
        # 2 aux_outputs list: 5  前5个decoder层输出 5个pred_logits[bs, 100, 92(类别数)] 和 5个pred_boxes[bs, 100, 4]
        return out

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
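
To tie everything together, here is a hedged end-to-end inference sketch using the pretrained model the official repo publishes via torch.hub (entry name detr_resnet50; downloading the weights requires internet access, and 'demo.jpg' is a placeholder path). It mirrors the pipeline described above: normalize, forward, softmax, drop the background class, threshold, rescale the boxes:

import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True).eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open('demo.jpg').convert('RGB')          # any test image (hypothetical path)
with torch.no_grad():
    out = model(transform(img).unsqueeze(0))         # {'pred_logits': [1,100,92], 'pred_boxes': [1,100,4]}

probs = out['pred_logits'].softmax(-1)[0, :, :-1]    # drop the no-object class
keep = probs.max(-1).values > 0.7                    # confidence threshold

# cxcywh in [0,1] -> absolute xyxy on the original image
w, h = img.size
cx, cy, bw, bh = out['pred_boxes'][0, keep].unbind(-1)
boxes = torch.stack([(cx - bw / 2) * w, (cy - bh / 2) * h,
                     (cx + bw / 2) * w, (cy + bh / 2) * h], dim=-1)
print(probs[keep].max(-1).indices, boxes)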


Reference

Official source code: GitHub - facebookresearch/detr: End-to-End Object Detection with Transformers (https://github.com/facebookresearch/detr)

Bilibili lecture on the paper: 【唐博士带你学AI】Transformer在目标检测领域的开山之作DETR模型,2小时算法精讲+源码解读 (bilibili)

CSDN source-code walkthroughs by 满船清梦压星河HK:

https://hukai.blog.csdn.net/article/details/127618806

https://hukai.blog.csdn.net/article/details/127614228?spm=1001.2014.3001.5502

https://hukai.blog.csdn.net/article/details/127616634?spm=1001.2014.3001.5502
