[CV | TAL] Paper Notes: VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

szu_ljm

Tags: artificial intelligence, deep learning, python

Article link: https://blog.csdn.net/m0_55202222/article/details/133280064

Contents

Preface
I. VideoMAE V2 Paper Notes
- 1. Background and Motivation
- 2. Analysis of the Improvements
II. A Brief Look at the VideoMAE V2 Source Code
Summary

Preface

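The Masked Autoencoder (MAE) is a self-supervised learning method built on the autoencoder and used for representation learning. During training it masks part of the input, so the model can only use the remaining visible information to reconstruct what is missing, which pushes it to learn more meaningful representations.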

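An autoencoder is a neural network made up of an encoder and a decoder. The encoder maps the input to a low-dimensional representation, and the decoder maps that representation back to a reconstruction of the input. The training objective is to reconstruct the input as accurately as possible, so that the representation captures its key features.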

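Because the masked content has to be inferred from context, the model cannot simply copy its input and must rely on the features that actually matter for reconstruction. This tends to improve generalization, and masked or denoising autoencoders have been used for anomaly detection, denoising, feature extraction, and similar applications.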

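MAE is trained without labels by minimizing the reconstruction error between the input and the reconstructed output. After training, the encoder alone is used to extract representations of the input, which carry higher-level, more abstract features.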

I. VideoMAE V2 Paper Notes

Paper:

https://arxiv.org/abs/2303.16727

Source code:

https://github.com/OpenGVLab/VideoMAEv2

1. Background and Motivation

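In computer vision, a great deal of work goes into developing pre-trained models. Transformers pre-trained with masked autoencoding have drawn growing attention: as self-supervised visual learners they are conceptually simple yet very effective. Researchers applied this self-supervised paradigm to train video Transformers on video datasets and proposed VideoMAE (Video Masked Autoencoder), a self-supervised video pre-training method built on the masking-and-reconstruction proxy task. ViT models pre-trained with VideoMAE substantially outperform other methods on relatively large video datasets such as Kinetics-400 and Something-Something V2, as well as on smaller ones such as UCF101 and HMDB51.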

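Video differs from images: it has an extra temporal dimension and very high temporal redundancy. For example, a clip may contain many frames of static background that carry little temporal information. Video inputs are also higher-dimensional, so computation costs more than for images, which has held back research on scaling masked autoencoding pre-training to video. Scaling the MAE approach therefore still faces several problems: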

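1) Scaling up brings computation and memory costs that are hard to afford on existing hardware;

2) Masked autoencoding pre-training still needs large amounts of data to reduce the risk of overfitting when training large models, but existing public video datasets are comparatively small;

3) It remains unclear how to fully unlock the performance of pre-trained models at the billion-parameter scale.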

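VideoMAE gave a pioneering answer to the problem that video datasets are relatively small, which makes it hard to train strong large vision models, and it also lowered training cost and improved training efficiency. Scaling VideoMAE further, however, still runs into high computation and memory costs. The authors exploit the redundancy of video data: the dual masking strategy of VideoMAE V2 reduces resource usage while preserving VideoMAE's strong ability to reconstruct video content.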

2. Analysis of the Improvements

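VideoMAE V2 uses a dual masking scheme. The original VideoMAE masks only the encoder input during training; V2 masks both the encoder and the decoder. It generates two masking maps, $\mathbb{M}_e = \mathcal{M}_e(\rho^e)$ and $\mathbb{M}_d = \mathcal{M}_d(\rho^d)$, using two different mask generation strategies and masking ratios. Masking the decoder shortens the decoder's input sequence, which improves efficiency while providing a supervision signal similar to full reconstruction.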

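The encoder uses tube masking: a spatial mask is drawn for one frame and extended along the temporal dimension $T$, so that the masked positions form cubes (tubes) through time. The decoder uses running cell masking. Its input is the combined token sequence $\mathbf{Z}^c$ below, where $\mathbf{Z}$ is the latent representation from the encoder and $\mathbf{M}_i$ is a learnable mask token with the corresponding positional embedding, added for each remaining token that is visible under the decoder mask $\mathbb{M}_d$:
$\mathbf{Z}^c = \mathbf{Z} \cup \{\mathbf{M}_i\}_{i \in \mathbb{M}_d}$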
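
The two masks can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's implementation: `tube_mask` repeats one random spatial pattern along time, while `random_decoder_keep` is a hypothetical stand-in that picks decoder-visible tokens uniformly at random instead of following the running cell pattern. The token grid (8 x 14 x 14) and the ratios 0.9 and 0.5 are illustrative values. Mask tokens are counted only at positions hidden from the encoder, which is one reading of the index set that the loss sums over.

```python
import numpy as np


def tube_mask(t, h, w, ratio, rng):
    """Encoder mask (True = hidden from the encoder), identical in every time slice."""
    n_spatial = h * w
    n_masked = int(round(ratio * n_spatial))
    frame = np.zeros(n_spatial, dtype=bool)
    frame[rng.choice(n_spatial, size=n_masked, replace=False)] = True
    return np.tile(frame, t)  # shape (t * h * w,)


def random_decoder_keep(n_tokens, ratio, rng):
    """Hypothetical stand-in for the decoder mask (True = visible to the decoder)."""
    n_keep = int(round((1.0 - ratio) * n_tokens))
    keep = np.zeros(n_tokens, dtype=bool)
    keep[rng.choice(n_tokens, size=n_keep, replace=False)] = True
    return keep


rng = np.random.default_rng(0)
t, h, w = 8, 14, 14
enc_masked = tube_mask(t, h, w, ratio=0.9, rng=rng)
dec_keep = random_decoder_keep(t * h * w, ratio=0.5, rng=rng)

n_encoder_tokens = int((~enc_masked).sum())          # tokens the encoder processes (Z)
n_mask_tokens = int((enc_masked & dec_keep).sum())   # learnable mask tokens M_i
n_decoder_tokens = n_encoder_tokens + n_mask_tokens  # length of Z^c

print(n_encoder_tokens, n_decoder_tokens, t * h * w)
```

With these values the encoder sees 160 of the 1568 tokens, and the decoder input is also shorter than the full sequence, which is where the savings of dual masking come from.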

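The loss of VideoMAE V2 is the mean squared error (MSE) between the normalized masked pixels $\mathbf{I}_i$ and the reconstructed pixels $\hat{\mathbf{I}}_i$, computed over the tokens that are masked for the encoder and visible under the decoder mask. Here $\rho^d$ is the decoder masking ratio and $N$ is the total number of tokens:

$\ell = \frac{1}{(1 - \rho^d) N} \sum_{i \in \mathbb{M}_d \cap \mathbb{M}_e} |\mathbf{I}_i - \hat{\mathbf{I}}_i|^2$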
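
As a worked example, the loss formula can be evaluated on a toy input. The sketch below follows the formula as written rather than the repository's training loop; the function name `dual_mask_mse` and the boolean-array representation of the two masks are assumptions made for illustration.

```python
import numpy as np


def dual_mask_mse(target, pred, enc_masked, dec_keep, rho_d):
    """Squared error over tokens hidden from the encoder and visible to the decoder.

    target, pred: (N, D) arrays of per-token pixel values.
    enc_masked, dec_keep: (N,) boolean arrays.
    """
    n_tokens = target.shape[0]
    selected = enc_masked & dec_keep
    squared_error = ((target[selected] - pred[selected]) ** 2).sum()
    return squared_error / ((1.0 - rho_d) * n_tokens)


target = np.zeros((4, 2))
pred = np.ones((4, 2))
enc_masked = np.array([True, True, True, False])
dec_keep = np.array([True, False, True, True])
loss = dual_mask_mse(target, pred, enc_masked, dec_keep, rho_d=0.5)
print(loss)  # 2.0
```

Two tokens fall in the intersection, each contributing a squared error of 2, and the normalizer is (1 - 0.5) * 4 = 2, which gives 2.0.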

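VideoMAE V2 uses ViT-g/14 as the encoder backbone. Both the encoder and the decoder are built from multi-head attention layers and fully connected layers, and the decoder's depth and width are kept at 4 and 512 respectively to limit computation. The overall structure is shown in the figure:

[Figure: VideoMAE V2 network structure]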

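To help the pre-trained video Transformer adapt to downstream tasks, a suitable transfer scheme is needed. Masked autoencoding pre-training lets the model learn certain invariant features and gives its parameters a favorable initialization, but the model also needs supervision with higher-level semantics to reach its potential. The original VideoMAE fine-tunes the model only on the target dataset, so the supervision signal is limited, and this direct transfer strategy may not fully unlock a large pre-trained video Transformer. The authors therefore follow the intermediate fine-tuning approach and adopt a progressive training paradigm that lowers the risk of overfitting while bringing out more of the model's potential.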

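First, the model is pre-trained with self-supervision on an unlabeled multi-source video dataset (Pre-training). Next, several public labeled video datasets are merged into a labeled hybrid dataset, on which the model goes through supervised post-pre-training (Post-pre-training) to absorb semantics from multiple sources. Finally, the model is fine-tuned on the target dataset (Specific Fine-tuning), which transfers it to the knowledge domain of that task.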

II. A Brief Look at the VideoMAE V2 Source Code

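VideoMAE V2 uses Vision Transformers as the encoder backbone. Let's take a look at the core ViT code, which assembles the network from a set of nested modules, starting with a short walk-through of the building-block classes.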

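Below is the multilayer perceptron (MLP) module. The constructor sets the input and output dimensions of the linear layers and uses the Gaussian Error Linear Unit (GELU) as the activation. The MLP consists of two linear layers, and dropout is applied after the second one as regularization (the dropout after the activation is commented out).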

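# Imports needed by the excerpts in this section. The upstream file defines its
# own DropPath; timm's implementation is used here as a stand-in (assumption).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp
from timm.models.layers import DropPath, to_2tuple, trunc_normal_


class Mlp(nn.Module):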

    def __init__(self,
                 in_features,
                 hidden_features=None,
                 out_features=None,
                 act_layer=nn.GELU,
                 drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        # x = self.drop(x)
        # commented out to match the original BERT implementation
        x = self.fc2(x)
        x = self.drop(x)
        return x

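Below is the CosAttention module. The input x is a tensor of input features. In the forward pass, the weights of the linear layer self.qkv first project the input into queries, keys, and values (only the queries and values carry a learnable bias). The queries and keys are then L2-normalized, so their product is a cosine similarity. The similarities are multiplied by a learnable per-head scale, stored in log form in self.scale and clamped so that the multiplier does not exceed about 100, and then normalized with softmax to give the attention weights. The attention weights are multiplied with the values to get the attention-weighted values. Finally, the linear layer self.proj maps the result back to the original feature dimension, and dropout is applied to the output.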

class CosAttention(nn.Module):

    def __init__(self,
                 dim,
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop=0.,
                 proj_drop=0.,
                 attn_head_dim=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        # self.scale = qk_scale or head_dim**-0.5
        # DO NOT RENAME [self.scale] (for no weight decay)
        if qk_scale is None:
            self.scale = nn.Parameter(
                torch.log(10 * torch.ones((num_heads, 1, 1))),
                requires_grad=True)
        else:
            self.scale = qk_scale

        self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
        else:
            self.q_bias = None
            self.v_bias = None

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(all_head_dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            qkv_bias = torch.cat(
                (self.q_bias,
                 torch.zeros_like(self.v_bias,
                                  requires_grad=False), self.v_bias))
        qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
        qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        attn = (
            F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1))

        # torch.log(torch.tensor(1. / 0.01)) = 4.6052
        logit_scale = torch.clamp(self.scale, max=4.6052).exp()

        attn = attn * logit_scale

        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        x = self.proj(x)
        x = self.proj_drop(x)
        return x
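
To make the scaling step concrete, here is a single-head NumPy sketch of cosine attention. It is an illustration rather than a port of the class: the function name `cosine_attention` and its arguments are hypothetical, heads and dropout are omitted, and the zero-norm guard of 1e-12 is assumed to mirror the default `eps` of `F.normalize`.

```python
import numpy as np


def cosine_attention(q, k, v, log_scale=np.log(10.0), max_log_scale=4.6052):
    """Single-head cosine attention for q (N, d), k (M, d), v (M, d_v)."""
    # L2-normalize along the feature axis, guarding against zero vectors.
    qn = q / np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), 1e-12)
    kn = k / np.maximum(np.linalg.norm(k, axis=-1, keepdims=True), 1e-12)
    # exp(log 10) = 10 at initialization; the clamp caps the multiplier near 100.
    scale = np.exp(min(log_scale, max_log_scale))
    logits = (qn @ kn.T) * scale  # every entry lies in [-scale, scale]
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 5, 8))
out, weights = cosine_attention(q, k, v)
out_rescaled, _ = cosine_attention(1000.0 * q, k, v)
print(out.shape, np.allclose(out, out_rescaled))
```

Because the queries and keys are normalized, multiplying the queries by 1000 leaves the output unchanged. Bounded logits of this kind are commonly used to keep attention stable in very large models.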

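Below is the Attention module. Its forward pass resembles CosAttention but computes the attention weights differently. The queries are first multiplied by the scale factor self.scale (by default head_dim ** -0.5), and the attention logits are obtained by matrix-multiplying the queries with the transposed keys. The logits are then normalized with softmax and passed through dropout.
The rest matches CosAttention: the attention weights are multiplied with the values to get the attention-weighted values, the linear layer self.proj maps the result back to the original feature dimension, and dropout is applied to the output.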

class Attention(nn.Module):

    def __init__(self,
                 dim,
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop=0.,
                 proj_drop=0.,
                 attn_head_dim=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        self.scale = qk_scale or head_dim**-0.5

        self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
        else:
            self.q_bias = None
            self.v_bias = None

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(all_head_dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            qkv_bias = torch.cat(
                (self.q_bias,
                 torch.zeros_like(self.v_bias,
                                  requires_grad=False), self.v_bias))
        qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
        qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))

        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        x = self.proj(x)
        x = self.proj_drop(x)
        return x
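
In the forward pass above, the bias passed to F.linear is the concatenation of q_bias, zeros, and v_bias, so the keys get no bias. The source does not say why; one plausible motivation, for plain dot-product attention, is that a key bias shifts every logit in a row by the same amount, and softmax ignores such shifts. The NumPy check below illustrates that property with hypothetical toy tensors.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 queries
k = rng.normal(size=(6, 8))    # 6 keys
key_bias = rng.normal(size=(8,))

# q . (k_j + b) = q . k_j + q . b, and q . b is the same for every key j.
attn_plain = softmax(q @ k.T)
attn_biased = softmax(q @ (k + key_bias).T)
print(np.allclose(attn_plain, attn_biased))
```

The argument does not carry over unchanged to CosAttention, where the keys are normalized after the projection.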

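Below, the attention module and the MLP module are wrapped into a Block. The cos_attn flag decides whether the Block uses Attention or CosAttention. The constructor sets up the parameters and hyperparameters of each sub-module, including dropout, stochastic depth (DropPath), and LayerNorm, plus optional learnable per-channel scales (gamma_1, gamma_2) when init_values is positive. In the forward pass each sub-module is wrapped in a residual connection, which makes deep networks easier to train.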

class Block(nn.Module):

    def __init__(self,
                 dim,
                 num_heads,
                 mlp_ratio=4.,
                 qkv_bias=False,
                 qk_scale=None,
                 drop=0.,
                 attn_drop=0.,
                 drop_path=0.,
                 init_values=None,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm,
                 attn_head_dim=None,
                 cos_attn=False):
        super().__init__()
        self.norm1 = norm_layer(dim)
        if cos_attn:
            self.attn = CosAttention(
                dim,
                num_heads=num_heads,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                attn_drop=attn_drop,
                proj_drop=drop,
                attn_head_dim=attn_head_dim)
        else:
            self.attn = Attention(
                dim,
                num_heads=num_heads,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                attn_drop=attn_drop,
                proj_drop=drop,
                attn_head_dim=attn_head_dim)
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = DropPath(
            drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(
            in_features=dim,
            hidden_features=mlp_hidden_dim,
            act_layer=act_layer,
            drop=drop)

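        # init_values defaults to None, so guard the comparison to avoid a TypeError
        if init_values is not None and init_values > 0: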
            self.gamma_1 = nn.Parameter(
                init_values * torch.ones((dim)), requires_grad=True)
            self.gamma_2 = nn.Parameter(
                init_values * torch.ones((dim)), requires_grad=True)
        else:
            self.gamma_1, self.gamma_2 = None, None

    def forward(self, x):
        if self.gamma_1 is None:
            x = x + self.drop_path(self.attn(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
        return x
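
Each residual branch in Block passes through DropPath (stochastic depth). The sketch below shows the common formulation of that operation in NumPy, assuming one Bernoulli draw per sample and rescaling by the keep probability. The function `drop_path` and its signature are written for illustration and are not the class used in the repository.

```python
import numpy as np


def drop_path(x, drop_prob, training, rng):
    """Zero the whole residual branch for some samples and rescale the rest."""
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining axes.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    keep = rng.random(shape) < keep_prob
    return x * keep / keep_prob


rng = np.random.default_rng(0)
x = np.ones((8, 3, 4))  # (batch, tokens, channels)
out = drop_path(x, drop_prob=0.5, training=True, rng=rng)
per_sample = out.reshape(8, -1)
print(np.unique(per_sample))
```

Every sample is either dropped entirely or scaled by 1 / keep_prob, so the expected value of the branch stays the same. At inference time the input is returned unchanged.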

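Below is the module that converts a video clip into patch (token) embeddings. It takes a clip as input and splits it into tubelets of tubelet_size frames by patch_size x patch_size pixels. A 3D convolution whose kernel size equals its stride then maps each tubelet to the embedding dimension embed_dim.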

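In the forward pass, the 3D convolution proj first maps the input x to the embedding dimension. The result is then flattened and transposed so that each tubelet becomes one position in the token sequence. The module returns a tensor of shape (B, num_patches, embed_dim), where num_patches is the number of space-time patches in the clip.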

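The get_sinusoid_encoding_table function builds the positional encoding table. It takes n_position and d_hid as arguments and fills the table with sine values in the even dimensions and cosine values in the odd dimensions. Each position is encoded as a vector of length d_hid, and the function returns a tensor of shape (1, n_position, d_hid).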

class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(self,
                 img_size=224,
                 patch_size=16,
                 in_chans=3,
                 embed_dim=768,
                 num_frames=16,
                 tubelet_size=2):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        num_spatial_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        num_patches = num_spatial_patches * (num_frames // tubelet_size)

        self.img_size = img_size
        self.tubelet_size = tubelet_size
        self.patch_size = patch_size
        self.num_patches = num_patches
        self.proj = nn.Conv3d(
            in_channels=in_chans,
            out_channels=embed_dim,
            kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
            stride=(self.tubelet_size, patch_size[0], patch_size[1]))

    def forward(self, x, **kwargs):
        B, C, T, H, W = x.shape
        assert H == self.img_size[0] and W == self.img_size[1], (
            f"Input image size ({H}*{W}) doesn't match model "
            f"({self.img_size[0]}*{self.img_size[1]}).")
        # b, c, l -> b, l, c
        x = self.proj(x).flatten(2).transpose(1, 2)
        return x


# sin-cos position encoding
# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
def get_sinusoid_encoding_table(n_position, d_hid):
    ''' Sinusoid position encoding table '''

    # TODO: make it with torch instead of numpy
    def get_position_angle_vec(position):
        return [
            position / np.power(10000, 2 * (hid_j // 2) / d_hid)
            for hid_j in range(d_hid)
        ]

    sinusoid_table = np.array(
        [get_position_angle_vec(pos_i) for pos_i in range(n_position)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

    return torch.tensor(
        sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
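
For a sense of scale, the token count produced by PatchEmbed with its default arguments can be worked out by hand, and the encoding table can be built without the Python loops. The NumPy sketch below does both. The helper names `num_tokens` and `sinusoid_table` are hypothetical, and the vectorized version is meant to follow the same formula as the function above, not to replace it.

```python
import numpy as np


def num_tokens(img_size=224, patch_size=16, num_frames=16, tubelet_size=2):
    """Number of space-time patches for a square input clip."""
    spatial = (img_size // patch_size) ** 2        # 14 * 14 = 196
    return spatial * (num_frames // tubelet_size)  # 196 * 8 = 1568


def sinusoid_table(n_position, d_hid):
    """Vectorized sine-cosine table of shape (1, n_position, d_hid)."""
    pos = np.arange(n_position, dtype=np.float64)[:, None]  # (n, 1)
    j = np.arange(d_hid)[None, :]                            # (1, d)
    angle = pos / np.power(10000.0, 2 * (j // 2) / d_hid)    # (n, d)
    table = np.empty_like(angle)
    table[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions
    table[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions
    return table[None]


table = sinusoid_table(num_tokens(), 768)
print(num_tokens(), table.shape)
```

A 16-frame 224 x 224 clip therefore becomes a sequence of 1568 tokens, each of which gets its own row of the table.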

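Below is the core ViT code, which brings the modules above together. The _init_weights function initializes the weights of linear layers from a truncated normal distribution and fills their biases with 0; LayerNorm weights are set to 1 and biases to 0.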

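Feature extraction (forward_features) first maps the input clip to patch embeddings. If a positional embedding (pos_embed) exists, it is added to the input: expand broadcasts it over the batch dimension, type_as casts it to the input's dtype, to moves it to the input's device, and clone().detach() keeps it out of the gradient graph. Dropout (pos_drop) is then applied. Next, the input passes through each block in self.blocks in turn; if with_cp is enabled, the block is run through cp.checkpoint (gradient checkpointing, which trades compute for memory), otherwise the block is called directly. Finally, if fc_norm exists, the token features are mean-pooled and normalized by fc_norm, and the result is returned. Otherwise the feature of the first token, passed through norm, is returned.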

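The forward method first calls forward_features to get the feature representation, then applies head_dropout to it, and finally feeds it to head, a linear layer for classification or similar tasks (or an identity mapping when num_classes is 0), and returns the result.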

class VisionTransformer(nn.Module):
    """ Vision Transformer with support for patch or hybrid CNN input stage
    """

    def __init__(self,
                 img_size=224,
                 patch_size=16,
                 in_chans=3,
                 num_classes=1000,
                 embed_dim=768,
                 depth=12,
                 num_heads=12,
                 mlp_ratio=4.,
                 qkv_bias=False,
                 qk_scale=None,
                 drop_rate=0.,
                 attn_drop_rate=0.,
                 drop_path_rate=0.,
                 head_drop_rate=0.,
                 norm_layer=nn.LayerNorm,
                 init_values=0.,
                 use_learnable_pos_emb=False,
                 init_scale=0.,
                 all_frames=16,
                 tubelet_size=2,
                 use_mean_pooling=True,
                 with_cp=False,
                 cos_attn=False):
        super().__init__()
        self.num_classes = num_classes
        # num_features for consistency with other models
        self.num_features = self.embed_dim = embed_dim
        self.tubelet_size = tubelet_size
        self.patch_embed = PatchEmbed(
            img_size=img_size,
            patch_size=patch_size,
            in_chans=in_chans,
            embed_dim=embed_dim,
            num_frames=all_frames,
            tubelet_size=tubelet_size)
        num_patches = self.patch_embed.num_patches
        self.with_cp = with_cp

        if use_learnable_pos_emb:
            self.pos_embed = nn.Parameter(
                torch.zeros(1, num_patches, embed_dim))
        else:
            # sine-cosine positional embeddings is on the way
            self.pos_embed = get_sinusoid_encoding_table(
                num_patches, embed_dim)

        self.pos_drop = nn.Dropout(p=drop_rate)

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)
               ]  # stochastic depth decay rule
        self.blocks = nn.ModuleList([
            Block(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                qk_scale=qk_scale,
                drop=drop_rate,
                attn_drop=attn_drop_rate,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                init_values=init_values,
                cos_attn=cos_attn) for i in range(depth)
        ])
        self.norm = nn.Identity() if use_mean_pooling else norm_layer(
            embed_dim)
        self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
        self.head_dropout = nn.Dropout(head_drop_rate)
        self.head = nn.Linear(
            embed_dim, num_classes) if num_classes > 0 else nn.Identity()

        if use_learnable_pos_emb:
            trunc_normal_(self.pos_embed, std=.02)

        self.apply(self._init_weights)

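        # head is nn.Identity when num_classes == 0, which has no weight or bias
        if isinstance(self.head, nn.Linear):
            self.head.weight.data.mul_(init_scale)
            self.head.bias.data.mul_(init_scale)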

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def get_num_layers(self):
        return len(self.blocks)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(
            self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward_features(self, x):
        B = x.size(0)

        x = self.patch_embed(x)

        if self.pos_embed is not None:
            x = x + self.pos_embed.expand(B, -1, -1).type_as(x).to(
                x.device).clone().detach()
        x = self.pos_drop(x)

        for blk in self.blocks:
            if self.with_cp:
                x = cp.checkpoint(blk, x)
            else:
                x = blk(x)

        if self.fc_norm is not None:
            return self.fc_norm(x.mean(1))
        else:
            return self.norm(x[:, 0])

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head_dropout(x)
        x = self.head(x)
        return x

Summary

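That concludes these reading notes on VideoMAE V2. The post briefly covered the motivation behind VideoMAE V2 and its method improvements, and walked through the code of its ViT backbone. Masked autoencoding of this kind has broad prospects for training large vision models. VideoMAE's tube masking strengthens the network's spatial modeling by having it reconstruct masked cubes from the features of neighboring cubes, which improves training on small datasets and also reduces computation.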