Transformer系列-4丨DETR模型和代码解析

大耳朵爱学习

于 2024-08-21 11:31:28 发布

阅读量880

点赞数 19

文章标签： transformer 深度学习人工智能语言模型自然语言处理 DETR 大模型

本文链接：https://blog.csdn.net/2401_85379281/article/details/141389196

版权

1 前言

往期的文章中，笔者从网络结构和代码实现角度较为深入地和大家解析了Transformer模型、Vision Transformer模型（ViT）以及BERT模型，其具体的链接如下：

本期内容，笔者想和大家聊一聊2020年非常火热的一个目标检测模型，叫做DEtection TRansformer，缩写为DETR 。

之所以火热的原因，并非这个模型的性能有多好，或者运行速度有多快。相反，DETR 的性能仅能与2015年提出的Faster R-CNN媲美，如下表1所示。

图1

那么为什么这个模型能火起来呢？

通常来说，性能和速度是衡量检测模型的重要指标，为什么DETR性能一般，并且训练很困难（训练所需epoch非常长），也能受到如此大的关注？

这里直接抛出答案，该模型火的具体原因可以归纳为：

（1）真正做到了End-to-End检测。在我们熟知的基于CNN的检测模型中，不论是单阶段（如YOLO），双阶段（Faster R-CNN），亦或者是无锚框设计（Anchor-free）的CenterNet，其在进行训练或者测试时，都需要人为设置一些前处理或者后处理操作，比如设置锚框来提供参考，或者利用非极大值（NMS）抑制来筛除多余的框。然而，DETR抛弃了几乎所有的前处理和后处理操作，使模型做到了真正的End-to-End；
（2）第一个将Transformer拓展到目标检测领域中，基本没有对Transformer做什么结构上的改动，且性能仍有上升空间。这个性质对研究者来说就是一个福音，没有对模型改动且性能有上升空间，意味着后续的研究者可以紧跟这个方向，通过对网络模型结构或者训练目标的优化来展开自己的研究，进而产出自己的成果。

基于上述提到的原因，DETR一跃成为目标检测领域的热点范式。在提出工作的近一年内，涌现大量相关工作，比如Deformable DETR，Swin Transformer等。在后续的文章中，我们也会一一进行解析，欢迎大家关注。

这里我们推荐几个不错的视频课程和解析类的文章，笔者也是通过阅读原文+观看相应的文章/视频，加深了对DETR的理解：

“

【B站】李沐老师团队：DETR 论文精读【论文精读】，网址：https://www.bilibili.com/video/BV1GB4y1X72R/?spm_id_from=333.788&vd_source=beab624366b929b20152279cfa775ff6

“

【知乎】用Transformer做object detection：DETR，网址：https://zhuanlan.zhihu.com/p/267156624

建议大家在看本文前，可以自行复习一下Transformer的相关知识。本文的代码如下：

“

https://github.com/facebookresearch/detr

2 DETR解析

2.1 简介

DETR的简要结构图如下图2所示。其大致可以归纳为以下几部分：

（1）将输入图像利用卷积神经网络（CNN）映射为特征图；
（2）将特征图输入到Transformer模型中，输出个包含物体的区域集合（一个固定数量的集合）;
（3）对输出的区域集合和真实的标签，计算两者间的集合相似度（二分图匹配损失）；
（4）利用计算的损失，反向更新卷积神经网络（CNN）和Transformer模型的参数。

图2

这里，有两块内容最值得关注，分别为：

（1）DETR模型的结构以及一些其实现细节（如解码器的输入，即object queries）
（2）这个集合相似度如何计算

接下来我们就这两个点进行详细说明。

2.2 DETR模型

图3

上面提到， DETR模型主要分为两块，即卷积神经网络和Transformer模型，其中Transformer模型由分为编码器、解码器和预测头。

那么具体的DETR模型的结构如图3所示。其具体的运算流程可以归纳为：

（1）将图像（维度为）输入至卷积神经网络（比如说ResNet50），经过五次尺寸上的缩减（每次降为原来1/2）后，输出维度为的特征图；
（2）利用1x1卷积层将CNN输出的特征图的维度降低至256，记做输入特征图，其维度为；
（3）生成一个大小为的位置编码，将其与输入特征图按位相加，其相加后维度依旧是（仍记做输入特征图）；
（4）将输入特征图reshape成大小喂给Transformer编码器，输出同大小的特征图（记做编码器输出特征），其维度依旧是；
（5）构建N个（N=100），维度为256的object queries，其为可学习的embeddings，这里的100是希望模型产生至多100个物体区域。通过学习后，object queries和起到和anchor相似的作用，大致就是告诉解码器哪些区域可能会有物体；
（6）将object queries和编码器输出特征图喂给Transformer解码器，产生大小的特征输出。其中，object queries为第一层多头自注意力（MSA）的输入，而MSA的输出以及编码器输出的特征将作为解码器中编码-解码多头自注意力的输入，如下图4所示；
（7）利用两个并行、不共享权重的全连接层，将Transformer解码器的输出（维度为）映射成两个输出，一个用以分类（维度为），一个用于位置回归（维度为）。其中，回归的目标就是检测框归一化的中心坐标和宽高。

图4

说到这儿，基本上DETR的模型结构就解析的差不多啦。

这里我们对两块内容展开来说，其一是这个位置编码，其二是object queries，以加深大家的理解。

（1）位置编码

大家在学习Transformer的时候应该了解过Transformer对各个输入的位置是不敏感的。

一般来说，为了让Transformer了解各输入间的位置关系，在输入Transformer之前需要为每个特征嵌入相应的位置信息。

这种位置信息嵌入方式主要有两种，其一是为每个输入手动计算相应的位置编码，其二是设置可学习的位置编码。

在DETR的代码实现中，作者提供了两种选择。下面我们先展示手动计算的位置编码的代码实现：

class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask
        assert mask is not None
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

        pos_x = x_embed[:, :, :, None] / dim_t
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        return pos

事实上，这种设置方法和《Attention is all your need》那篇的设置方式完全一致，感兴趣的小伙伴们可以看我之前的文章。

另外，作者还设置了一种自动学习的位置编码方式，其代码实现如下所示：

class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        self.row_embed = nn.Embedding(50, num_pos_feats)
        self.col_embed = nn.Embedding(50, num_pos_feats)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        h, w = x.shape[-2:]
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)
        y_emb = self.row_embed(j)
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(h, 1, 1),
            y_emb.unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return pos

笔者感觉这里的位置编码方式好像与ViT中设置的位置编码方式有些差异。

在ViT那篇文章中，作者设置的可学习的位置编码仅一行代码实现，如下：

self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim)) # 可学习的参数，长度为图像块数量+1，这里的1是class token

（2）object queries

这个不得不再提一下这个object queries，因为在DETR这篇文章中，笔者认为其和**集合相似度计算（二分图匹配损失）**都是DETR的核心所在。

前面提到，这个object queries的维度为，其为可学习的embeddings，训练刚开始的时候可以是随机初始化的。

这里的100是希望模型产生至多100个物体区域，在大多数检测任务中，图像中的目标数量远远小于这个值。通过学习后，object queries和起到和anchor相似的作用，大致就是告诉解码器哪些区域可能会有物体。

如下图5所示，作者在100个object queries中选了20个object queries，展示了这20个object queries在COCO(val)数据集上的可视化，也就是将这个数据集上所有的框在其对应的object query 的方块中分别进行了可视化。图中每个点表示一个框，其中点的位置为框的中心点，颜色表示大小（绿色为小块，红色为大的水平框，蓝色为大的竖直框）。

从图中可以看出，每个object query负责检测的框的位置和大小都有一些区别，确实有点anchor的味道，只不过anchor是预先设置好的，而object queries是学习出来的。

图5

另外一点需要注意的是，这里的100个物体区域是Transformer解码器同时输出的，而非像应用于机器翻译的Transformer那样自回归地、逐个输出，这样也极大提高了输出的效率。

2.3 集合相似度计算（二分图匹配损失）

在对DETR的模型结构讨论完毕后，大家可能会有两个疑惑。

DETR最后输出个物体区域，但是一般图像中哪有这么多物体呢？
还有就是，网络输出为固定数量的检测框，而其对应的标签里面的物体数量是变化（假设数量为m<<N）的，这个损失如何进行计算呢？

本文中，作者人为构造一个新的物体类别 ϕ （即非物体的背景类），将其补充到图像的标签中去，使得。这就构建了两个等容量的集合了，一个记做为预测集，一个为标签集。

有了预测集和标签集后，如何计算两者之间的相似度，进而求解两者之间的损失呢？

这里，作者在计算相似度之前执行了最优二分图匹配，来找到预测集和标签集间各要素的最佳匹配。让我们能够耳熟能详的最优二分图匹配就是匈牙利算法啦。

匈牙利算法的计算和实现其实很简单，就是先计算两个集合间（二分图）的代价矩阵（cost matrix）。然后调用from scipy.optimize import linear_sum_assignment就可以找到最佳的二分图匹配。

总言之，要实现最终用于DETR训练的损失计算，需要的步骤就是1. 求解代价矩阵—>2. 匈牙利算法求最优匹配—>3. 根据匹配计算损失。

（1）求解代价矩阵

匹配的核心就在于如何去计算两个集合间各要素的代价（cost）。具体地，本文作者利用以下公式计算预测集和标签集中各要素间的代价（cost），具体如下：

cost计算

这里的第一项为类别上的匹配代价，也就是类别相同，代价越小，反之越大；第二项为位置上的匹配代价，如果两个框越重合，该代价就越小。

（2）匈牙利算法求最优匹配

这个最简单，利用from scipy.optimize import linear_sum_assignment就可以找到最佳的二分图匹配。其代码实现如下所示：

class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost betwen boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]