5. About Deformable DETR

安逸sgr

Last modified: 2024-09-05 11:05:47

Category: Transformer. Tags: computer vision, artificial intelligence, image processing, visual inspection, neural networks

First published: 2024-09-05 10:41:51

Article link: https://blog.csdn.net/sgr011215/article/details/141924841

Model architecture

In the source code used as the example here, the multi-scale setting always uses four feature levels.

[Figure: Deformable DETR model architecture]

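Drawbacks of DETR

In self-attention every query is compared with every key, so the cost grows quadratically with the sequence length. For a large input image the flattened feature sequence becomes very long and the computation takes too much time.
DETR also performs poorly on small objects.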
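To make the quadratic growth concrete, the small sketch below counts query-key pairs for a square feature map. The function name `full_attention_pairs` is made up for this illustration and is not part of the DETR code base.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Number of query-key pairs when every position attends to every position."""
    tokens = height * width      # one token per feature-map position
    return tokens * tokens       # quadratic in the sequence length


# Doubling the side length of the feature map multiplies the cost by 16.
small = full_attention_pairs(32, 32)
large = full_attention_pairs(64, 64)
ratio = large // small
```

This is also why high-resolution feature maps, which small objects need, are expensive for plain DETR.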

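Deformable DETR

The attention used by Deformable DETR differs from the self-attention of a standard transformer, where every q is computed against every k and v. Here each query attends only to a few sampled points near its reference point, so q is no longer compared with all positions and the amount of computation drops sharply.
It uses a multi-scale (multi-level) mechanism to extract features at several resolutions, and uses position information to align the sampling around the same point on the different levels (in practice 4 nearby points are sampled). A sampling position computed on a given level is usually not an integer; this is handled by bilinear interpolation over the four pixels surrounding that position.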
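The interpolation over four neighbouring pixels can be sketched in plain Python. This is a minimal illustration on a nested list, assuming zero padding outside the map; the real model does the same thing on tensors (via a CUDA kernel or `F.grid_sample`), and `bilinear_sample` is a name invented for this example.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate a 2D list feature[row][col] at fractional pixel coords (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)   # top-left neighbour
    x1, y1 = x0 + 1, y0 + 1                 # bottom-right neighbour
    wx, wy = x - x0, y - y0                 # fractional parts act as weights

    def at(col, row):
        # Positions outside the map contribute zero (zero padding).
        if 0 <= row < len(feature) and 0 <= col < len(feature[0]):
            return feature[row][col]
        return 0.0

    top = (1 - wx) * at(x0, y0) + wx * at(x1, y0)
    bottom = (1 - wx) * at(x0, y1) + wx * at(x1, y1)
    return (1 - wy) * top + wy * bottom


feature = [[0.0, 10.0],
           [20.0, 30.0]]
center = bilinear_sample(feature, 0.5, 0.5)   # halfway between all four pixels
```

At the exact midpoint the four pixels are weighted equally, so the result is their mean.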

BlockOne

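The build_backbone function builds the model's backbone network, i.e. the feature extractor. Commonly used backbones include ResNet and EfficientNet.

Here a ResNet is used to extract features at several levels.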

# Call chain: build_backbone -> Backbone
def build_backbone(args):
    position_embedding = build_position_encoding(args)  # build the positional-encoding module
    train_backbone = args.lr_backbone > 0
    return_interm_layers = args.masks or (args.num_feature_levels > 1)
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    model = Joiner(backbone, position_embedding)  # Joiner returns the multi-level feature maps together with their positional encodings
    return model

class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        norm_layer = FrozenBatchNorm2d
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=norm_layer)
        assert name not in ('resnet18', 'resnet34'), "number of channels are hard coded"
        super().__init__(backbone, train_backbone, return_interm_layers)
        if dilation:
            self.strides[-1] = self.strides[-1] // 2
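As a rough picture of what the four levels look like, the sketch below computes approximate feature-map sizes for a given input size. The strides (8, 16, 32, 64) are an assumption not shown in the excerpt above: three ResNet stages plus one extra stride-2 level. The helper `level_shapes` is hypothetical, and exact sizes in the real model depend on padding and rounding.

```python
import math

# Assumed strides for the four levels: three ResNet stages plus one extra
# stride-2 level. With dilation enabled the last backbone stride is halved.
LEVEL_STRIDES = (8, 16, 32, 64)


def level_shapes(image_h, image_w, strides=LEVEL_STRIDES):
    """Approximate (H_l, W_l) of each feature level, rounding sizes up."""
    return [(math.ceil(image_h / s), math.ceil(image_w / s)) for s in strides]


shapes = level_shapes(512, 768)
```

Each level has a quarter of the positions of the previous one, so the finest level dominates the total sequence length.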

DeformableTransformerEncoder

Formula

[Figure: deformable attention formula]

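M is the number of attention heads, A_mqk is the attention weight of the k-th sampling point of head m for query q, and Δp_mqk (written Pmqk here) is the corresponding sampling offset.

The encoder of Deformable DETR is the key part of the model; feature extraction is completed here. The computation in the source code differs from what the figure above shows: the source code flattens the feature maps of the four levels into sequences, concatenates them, and records the start index of each level, which yields one very long sequence.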
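Since the formula image is missing from this copy, here is the deformable attention formula as it is usually written in the Deformable DETR paper, reproduced from memory so treat it as a reference sketch. $z_q$ is the query feature, $p_q$ its reference point, $x$ the input feature map, $K$ the number of sampling points, and $W_m$, $W'_m$ the per-head projection matrices. The second line is the multi-scale version with $L$ levels, where $\phi_l$ rescales the normalized reference point $\hat{p}_q$ to level $l$.

```latex
\mathrm{DeformAttn}(z_q, p_q, x)
  = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big]

\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L})
  = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big]
```

The attention weights are normalized so that they sum to 1 over all sampling points of a head (over $k$ in the first line, over $l$ and $k$ in the second).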
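The bookkeeping for the flattened sequence can be sketched without tensors. The helper `flatten_levels` and the example shapes are illustrative assumptions; the source code computes the same start indices with tensor operations.

```python
from itertools import accumulate


def flatten_levels(spatial_shapes):
    """Return (level_start_index, total_length) for levels flattened in order."""
    sizes = [h * w for h, w in spatial_shapes]
    # The start index of level l is the number of tokens in all earlier levels.
    starts = [0] + list(accumulate(sizes))[:-1]
    return starts, sum(sizes)


spatial_shapes = [(64, 96), (32, 48), (16, 24), (8, 12)]  # illustrative shapes
level_start_index, total = flatten_levels(spatial_shapes)
```

With these shapes, the tokens of level `l` occupy the slice starting at `level_start_index[l]` and spanning `H_l * W_l` entries of the long sequence.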

class DeformableTransformerEncoder(nn.Module):
    ...
    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None):
        output = src
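        # Compute the reference points: every feature point gets one normalized reference
        # location per feature level (4 levels here); they are derived from the grid, not learned
        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)
        for _, layer in enumerate(self.layers):  # pass through the stacked encoder layers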
            output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
        return output

get_reference_points

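Initializes the reference points of each level. In the source code every feature point has four reference points, one per feature level. In the encoder their coordinates are computed from the grid of pixel centers, as the code below shows, rather than learned; what is learned are the sampling offsets that MSDeformAttn adds to them.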

def get_reference_points(spatial_shapes, valid_ratios, device):
    reference_points_list = []  # reference points of each level

    # Iterate over the spatial shape of every level
    for lvl, (H_, W_) in enumerate(spatial_shapes):
        # Build a grid of pixel-center coordinates
        ref_y, ref_x = torch.meshgrid(
            torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),  # y coordinates
            torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device)   # x coordinates
        )

        # Flatten y and x and normalize them
        ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)  # divide y by (valid ratio * level height)
        ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)  # divide x by (valid ratio * level width)

        # Combine into (x, y) pairs
        ref = torch.stack((ref_x, ref_y), -1)

        # Collect the reference points of this level
        reference_points_list.append(ref)

    # Concatenate the reference points of all levels along the sequence dimension
    reference_points = torch.cat(reference_points_list, 1)

    # Scale by the valid ratios of every level: one reference location per level for each point
    reference_points = reference_points[:, :, None] * valid_ratios[:, None]

    return reference_points
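For a single level without padding, the normalization above reduces to taking the center of each cell and dividing by the map size. The pure-Python sketch below shows only that first step (it omits the final broadcast over levels); `reference_points_for_level` is a hypothetical helper, not a function from the repository.

```python
def reference_points_for_level(height, width, valid_ratio=(1.0, 1.0)):
    """Normalized (x, y) centers of every cell of one level, in row-major order.

    valid_ratio is (ratio_w, ratio_h): the unpadded fraction of the map.
    """
    ratio_w, ratio_h = valid_ratio
    points = []
    for row in range(height):
        for col in range(width):
            x = (col + 0.5) / (ratio_w * width)
            y = (row + 0.5) / (ratio_h * height)
            points.append((x, y))
    return points


points = reference_points_for_level(2, 2)
```

For a 2x2 map the four centers land at 0.25 and 0.75 along each axis, which matches the `linspace(0.5, H_ - 0.5, H_) / H_` pattern in the code.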

MSDeformAttn

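The most critical part of the encoder is how attention is computed. Compared with the self-attention of a standard transformer, the attention here is neither like conventional self-attention nor like convolution.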

class MSDeformAttn(nn.Module):
    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):
        """
        Forward pass.
        :param query:                      query tensor, shape (N, Length_{query}, C)
        :param reference_points:           reference points, shape (N, Length_{query}, n_levels, 2), range [0, 1], top-left (0,0), bottom-right (1, 1), including the padding area
                                          or (N, Length_{query}, n_levels, 4), adding an extra (w, h) to form reference boxes
        :param input_flatten:              flattened input feature maps, shape (N, \sum_{l=0}^{L-1} H_l \cdot W_l, C)
        :param input_spatial_shapes:       spatial shapes of the inputs, shape (n_levels, 2), e.g. [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]
        :param input_level_start_index:    start index of each level, shape (n_levels, ), e.g. [0, H_0*W_0, H_0*W_0+H_1*W_1, ...]
        :param input_padding_mask:         padding mask of the input, shape (N, \sum_{l=0}^{L-1} H_l \cdot W_l), True for padding elements, False for non-padding elements

        :return output:                    output features, shape (N, Length_{query}, C)
        """
        N, Len_q, _ = query.shape  # batch size and number of queries
        N, Len_in, _ = input_flatten.shape  # length of the flattened input feature maps
        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in  # flattened length must match the spatial shapes

        # Linear projection (fully connected layer) of the input gives the value tensor
        value = self.value_proj(input_flatten)

        # If a padding mask is given, set the values at padded positions to 0
        if input_padding_mask is not None:
            value = value.masked_fill(input_padding_mask[..., None], float(0))

        # Split the channel dimension into attention heads
        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)

        # Sampling offsets are predicted from the query by a fully connected layer
        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)

        # Attention weights are also predicted from the query by a fully connected layer
        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)

        # Softmax jointly over all levels and points of each head, then restore the (levels, points) layout
        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)

        # Compute the sampling locations
        if reference_points.shape[-1] == 2:
            # Reference points are (x, y): normalize the offsets by (W_l, H_l) of each level
            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)
            sampling_locations = reference_points[:, :, None, :, None, :] \
                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
        elif reference_points.shape[-1] == 4:
            # Reference points are boxes (x, y, w, h): scale the offsets by the box size
            sampling_locations = reference_points[:, :, None, :, None, :2] \
                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
        else:
            raise ValueError(
                'Last dim of reference_points must be 2 or 4, but get {} instead.'.format(reference_points.shape[-1])
            )

        # Call the custom MSDeformAttnFunction to run the deformable attention computation
        output = MSDeformAttnFunction.apply(
            value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step
        )

        # Final linear projection of the output
        output = self.output_proj(output)

        return output
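The two-dimensional branch above adds a pixel-unit offset to a normalized reference point, dividing x by the level width and y by the level height. A scalar sketch of that one step follows; `sampling_location` is a name invented for this illustration.

```python
def sampling_location(reference_xy, offset_xy, level_hw):
    """Add a pixel-unit offset to a normalized reference point on one level."""
    height, width = level_hw
    ref_x, ref_y = reference_xy
    off_x, off_y = offset_xy
    # x is normalized by the level width and y by its height, matching
    # offset_normalizer = (W_l, H_l) in the forward pass.
    return (ref_x + off_x / width, ref_y + off_y / height)


# Two pixels right and one pixel up from the center of an 8 x 16 (H x W) map.
loc = sampling_location((0.5, 0.5), (2.0, -1.0), (8, 16))
```

Because the offset is divided by the size of each level, the same predicted offset moves a different normalized distance on a coarse level than on a fine one.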

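MSDeformAttnFunction attention computation

In the encoder, q and v come from the same source (the flattened multi-scale features), and features are sampled from every level. The steps are: v is obtained by a linear projection of the input features; q passes through a fully connected (fc) layer to give the sampling offsets, which determine the sampling locations on each level; v is sampled at those locations by bilinear interpolation, giving the interpolated values; q passes through another fully connected layer, followed by softmax, to give attention_weights; finally the sampled values are weighted by attention_weights and summed to produce the output feature.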
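The steps above can be sketched for one query and one head in plain Python. This is a simplified illustration, assuming single-channel feature maps and leaving out the value and output projections; all function names are made up for the example.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_bilinear(level, loc_x, loc_y):
    """Sample a 2D list at a normalized location, zero outside the map."""
    height, width = len(level), len(level[0])
    # Pixel centers sit at (i + 0.5) / size, the align_corners=False convention.
    px, py = loc_x * width - 0.5, loc_y * height - 0.5
    x0, y0 = math.floor(px), math.floor(py)
    result = 0.0
    for row, wy in ((y0, 1 - (py - y0)), (y0 + 1, py - y0)):
        for col, wx in ((x0, 1 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= row < height and 0 <= col < width:
                result += wy * wx * level[row][col]
    return result


def deform_attention_one_query(levels, locations, logits):
    """levels: list of 2D maps; locations[l][k] = (x, y); logits[l][k] = score."""
    flat_logits = [s for per_level in logits for s in per_level]
    weights = softmax(flat_logits)      # softmax over all levels and points
    output, i = 0.0, 0
    for level, per_level_locs in zip(levels, locations):
        for loc_x, loc_y in per_level_locs:
            output += weights[i] * sample_bilinear(level, loc_x, loc_y)
            i += 1
    return output


levels = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]                      # two tiny levels
locations = [[(0.25, 0.25), (0.75, 0.75)], [(0.5, 0.5), (0.5, 0.5)]]
logits = [[0.0, 0.0], [0.0, 0.0]]                                  # equal scores
out = deform_attention_one_query(levels, locations, logits)
```

With equal scores every sampled value gets weight 0.25, so the output is the plain average of the four sampled values (1, 4, 10 and 10).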

def ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_locations, attention_weights):
    """
    Core computation of deformable attention (for debugging and testing only; use the CUDA version in practice)

    :param value:                     input values, shape (N_, S_, M_, D_)
    :param value_spatial_shapes:      spatial shapes of the input feature maps, shape (n_levels, 2), e.g. [(H_0, W_0), (H_1, W_1), ...]
    :param sampling_locations:        sampling locations, shape (N_, Lq_, M_, L_, P_, 2)
    :param attention_weights:         attention weights, shape (N_, Lq_, M_, L_, P_)

    :return:                         output features, shape (N_, Lq_, M_*D_)
    """
    N_, S_, M_, D_ = value.shape  # batch, sequence length, heads, channels per head
    _, Lq_, M_, L_, P_, _ = sampling_locations.shape  # queries, heads, levels, points

    # Split the flattened values back into one chunk per level
    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)

    # Map the sampling locations from [0, 1] to [-1, 1], as grid_sample expects
    sampling_grids = 2 * sampling_locations - 1
    sampling_value_list = []

    for lid_, (H_, W_) in enumerate(value_spatial_shapes):  # loop over every feature level (four here)
        # Reshape the values of this level to (N_*M_, D_, H_, W_) for sampling
        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_*M_, D_, H_, W_)

        # Reshape the sampling grid of this level to (N_*M_, Lq_, P_, 2)
        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)

        # Sample with bilinear interpolation; result has shape (N_*M_, D_, Lq_, P_)
        sampling_value_l_ = F.grid_sample(value_l_, sampling_grid_l_,
                                          mode='bilinear', padding_mode='zeros', align_corners=False)
        sampling_value_list.append(sampling_value_l_)

    # Reshape the attention weights to (N_*M_, 1, Lq_, L_*P_)
    attention_weights = attention_weights.transpose(1, 2).reshape(N_*M_, 1, Lq_, L_*P_)

    # Weighted sum over all levels and points gives the output features
    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights).sum(-1).view(N_, M_*D_, Lq_)

    return output.transpose(1, 2).contiguous()
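The reshapes in this function are easier to follow with the shapes written out. The sketch below only does the shape arithmetic for one level, with no tensors involved; `trace_shapes` and the example sizes are assumptions chosen for illustration.

```python
def trace_shapes(n, m, d, lq, level_hw, points):
    """Shapes of the main tensors in ms_deform_attn_core_pytorch for one level."""
    h, w = level_hw
    return {
        "value_l_": (n * m, d, h, w),                 # heads folded into the batch
        "sampling_grid_l_": (n * m, lq, points, 2),   # one (x, y) per query and point
        "sampling_value_l_": (n * m, d, lq, points),  # grid_sample output
        "output": (n, lq, m * d),                     # heads concatenated again
    }


# Example: batch 2, 8 heads of 32 channels, 8160 queries, 4 points per level.
shapes = trace_shapes(n=2, m=8, d=32, lq=8160, level_hw=(64, 96), points=4)
```

Folding the heads into the batch dimension lets a single `F.grid_sample` call per level handle all heads at once.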