[Deep Learning] [DETR] End-to-End Object Detection with Transformers

Latest recommended article published on 2024-03-19 10:31:10

VIP article · Hanawh

Category: Deep Learning · Tags: Deep Learning

Article link: https://blog.csdn.net/qq_36530992/article/details/106519492

[Paper Reading] End-to-End Object Detection with Transformers

Loss function
Network architecture
- Transformer
Running the code

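Contributions:

Brings the Transformer into object detection and proposes the DETR (DEtection TRansformer) model.
Removes the hand-designed prior knowledge that earlier detectors rely on (anchors, NMS).
Achieves high accuracy on large objects, thanks to the Transformer's non-local computations, but lower accuracy on small objects.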

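Loss function

The network outputs N detection boxes, each with a class label. Because the quality of such a set of detections is hard to measure directly, the paper proposes a set-based loss. The ground-truth boxes are first turned into a sequence of length N so that they can be matched against the network outputs, padded with $\emptyset$ (no object) when there are fewer than N of them. The loss then searches over the permutations of this sequence for the one with the lowest total cost against the N predictions, and that assignment is the one that gets optimized. The result is a one-to-one correspondence between predictions and ground truth, so no NMS post-processing is needed.
Open question: how exactly is the loss function implemented?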
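
The matching step described above can be sketched as a brute-force search over permutations. This is only an illustration: the names `pair_cost`, `best_matching` and `NO_OBJECT` are hypothetical, and the cost (class mismatch plus L1 box distance) is a toy stand-in for the paper's matching cost. Enumerating permutations is only feasible for very small N; the paper computes the same optimal assignment with the Hungarian algorithm, which the official repository appears to delegate to `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

NO_OBJECT = None  # stands in for the "no object" padding symbol


def pair_cost(pred, gt):
    """Toy cost of assigning one prediction to one ground-truth slot."""
    if gt is NO_OBJECT:
        return 0.0  # padding slots contribute nothing to the matching cost
    pred_label, pred_box = pred
    gt_label, gt_box = gt
    class_cost = 0.0 if pred_label == gt_label else 1.0
    box_cost = sum(abs(p - g) for p, g in zip(pred_box, gt_box))  # L1 distance
    return class_cost + box_cost


def best_matching(preds, gts):
    """Return (perm, cost) where perm[i] is the ground-truth slot given to prediction i."""
    n = len(preds)
    padded = list(gts) + [NO_OBJECT] * (n - len(gts))  # pad ground truth to length N
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(pair_cost(preds[i], padded[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


preds = [
    ("cat", (0.1, 0.1, 0.2, 0.2)),
    ("dog", (0.5, 0.5, 0.2, 0.2)),
    ("cat", (0.9, 0.9, 0.1, 0.1)),
]
gts = [("dog", (0.5, 0.5, 0.2, 0.2)), ("cat", (0.1, 0.1, 0.2, 0.2))]
perm, cost = best_matching(preds, gts)
```

Each prediction ends up with exactly one target slot, which is why duplicate detections are penalized during training instead of being filtered afterwards.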

Network architecture

The architecture has three parts:

backbone: maps [3, H, W] to [2048, H/32, W/32]
encoder-decoder transformer
FFN (feed forward network)
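
As a quick check of the backbone shape arithmetic, the sketch below computes the feature-map size for a given input size. The helper name `backbone_output_shape` is hypothetical, and rounding up with `ceil` is an assumption about how a stride-32 ResNet with padded convolutions treats sizes that are not multiples of 32.

```python
import math


def backbone_output_shape(height, width, channels=2048, stride=32):
    """Approximate [C, H/32, W/32] output shape of a stride-32 backbone."""
    return (channels, math.ceil(height / stride), math.ceil(width / stride))


shape = backbone_output_shape(768, 1152)
```

A 768 x 1152 input gives a 24 x 36 feature map, which agrees with the `torch.Size([2, 24, 36, 128])` shape noted in the position-encoding code later in this post.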

Transformer

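If this part is unfamiliar, I recommend first watching Prof. Hung-yi Lee's (李宏毅) Transformer lecture (search for it on Bilibili) and reading https://jalammar.github.io/illustrated-transformer/

Something I did not fully understand while reading the paper:
How is the positional encoding expressed?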
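
For reference, the fixed sinusoidal encoding from the original Transformer paper ("Attention Is All You Need") is written below, where $pos$ is the position, $i$ indexes a pair of channels and $d_{\text{model}}$ is the embedding width. Treat it as background: the DETR variant, as far as the code later in this post shows, applies the same idea separately to the x and y coordinates of each feature-map location.

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```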

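Running the code

Data processing (only part of the code is explained here; basic data loading and transforms are not covered).
When a batch is built, collate_fn is passed to the DataLoader to gather and process the samples of that batch. Image pixels are handled mainly by the NestedTensor class, which makes it possible to handle images of different sizes. After processing, the tensor size may differ from one batch to the next, but all images within a single batch have the same size.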
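
The padding scheme described above can be sketched without PyTorch, using nested lists for single-channel images. The function name `pad_batch` is hypothetical. It mirrors the idea in `NestedTensor.from_tensor_list` (zero-pad every image to the largest height and width in the batch, and mark padded pixels as `True` in a mask), not its exact implementation.

```python
def pad_batch(images):
    """Pad 2D images (lists of rows) to a common size; return (padded, masks)."""
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Copy real pixels into the top-left corner, fill the rest with zeros.
        padded.append([
            [img[r][c] if r < h and c < w else 0 for c in range(max_w)]
            for r in range(max_h)
        ])
        # Mask is False on real pixels and True on padding.
        masks.append([
            [not (r < h and c < w) for c in range(max_w)]
            for r in range(max_h)
        ])
    return padded, masks


images = [[[1, 2, 3], [4, 5, 6]], [[7], [8], [9]]]  # a 2x3 and a 3x1 image
padded, masks = pad_batch(images)
```

Both outputs are 3 x 3 here, so they can be stacked into one batch while the mask still records where each original image ends.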

# DETR/util/misc.py
import torch

def collate_fn(batch):
    # print(len(batch))  # 2
    # print(batch[0][0].shape)  # same as 'size' in batch[0][1]
    # print(batch[0][1])  # dict with the annotation info of image 0
    # batch is a list of samples (img, boxes, labels, ...); zip(*batch) takes the
    # matching element from each sample and regroups them into new lists
    batch = list(zip(*batch))
    batch[0] = NestedTensor.from_tensor_list(batch[0])  # batch[0] now holds only the padded image batch
    return tuple(batch)

class NestedTensor(object):
    def __init__(self, tensors, mask):
        self.tensors = tensors
        self.mask = mask

    def to(self, *args, **kwargs):
        cast_tensor = self.tensors.to(*args, **kwargs)
        cast_mask = self.mask.to(*args, **kwargs) if self.mask is not None else None
        return type(self)(cast_tensor, cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    @classmethod  # can be called on the class without creating an instance
    def from_tensor_list(cls, tensor_list):  # cls refers to the class itself, as self refers to an instance
        # TODO make this more general
        if tensor_list[0].ndim == 3:
            # TODO make it support different-sized images
            max_size = tuple(max(s) for s in zip(*[img.shape for img in tensor_list]))
            # print(max_size)  # largest height and width in this batch
            # print(len(tensor_list))  # 2
            # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))
            batch_shape = (len(tensor_list),) + max_size
            #print(batch_shape)
            b, c, h, w = batch_shape
            dtype = tensor_list[0].dtype
            device = tensor_list[0].device
            tensor = torch.zeros(batch_shape, dtype=dtype, device=device)
            mask = torch.ones((b, h, w), dtype=torch.bool, device=device)
            for img, pad_img, m in zip(tensor_list, tensor, mask):
                pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
                m[: img.shape[1], :img.shape[2]] = False
        else:
            raise ValueError('not supported')
        return cls(tensor, mask)

    def __repr__(self):
        return repr(self.tensors)

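Network structure

The backbone consists mainly of two layers, resnet50 and the position encoding (PositionEmbeddingSine): the output of the last resnet50 layer is passed to the position encoding to encode locations. For every pixel of the feature map, the x coordinate and the y coordinate are each turned into a 128-dimensional vector, with sin applied on the even channels and cos on the odd channels. The encoding vectors for x and y are then concatenated.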
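
The per-pixel computation can be sketched in plain Python as follows. The function names are hypothetical. The frequency term follows the `dim_t` expression in the class below, while the even/odd interleaving of sin and cos and the y-then-x concatenation order are assumptions, since the listing in this post is cut off before that step.

```python
import math


def sine_encoding_1d(position, num_pos_feats=128, temperature=10000.0):
    """Encode one scalar coordinate as num_pos_feats sin/cos values."""
    values = []
    for i in range(num_pos_feats):
        # Channels 2k and 2k+1 share one frequency: temperature ** (2k / num_pos_feats)
        freq = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = position / freq
        # Assumed layout: sin on even channels, cos on odd channels
        values.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return values


def sine_encoding_2d(x, y, num_pos_feats=128):
    """Concatenate the y and x encodings into one 2 * num_pos_feats vector."""
    return sine_encoding_1d(y, num_pos_feats) + sine_encoding_1d(x, num_pos_feats)


encoding = sine_encoding_2d(x=3, y=5)
```

With 128 features per axis, the concatenated vector has 256 entries, one per channel of the transformer's hidden dimension in the configuration this post describes.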

import math

import torch
from torch import nn

class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats  # half of the embedding dimension
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list):
        x = tensor_list.tensors
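        mask = tensor_list.mask  # False over the resized original image (anchored at the top-left corner), True over padding
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)  # cumulative sum along dim 1 (height): 1-based row index of valid pixels
        x_embed = not_mask.cumsum(2, dtype=torch.float32)  # cumulative sum along dim 2 (width): 1-based column index of valid pixels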
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale  # y_embed[:, -1:, :] is the largest value (last row)
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device) 
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)  # length num_pos_feats (128 in this run)

        pos_x = x_embed[:, :, :, None] / dim_t # torch.Size([2, 24, 36, 128])
        pos_y = y_embed[:, :, :