Paper Reading 1: "VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts"

UncleDrew_lsy

Tags: paper reading

Article link: https://blog.csdn.net/PETERPARKERRR/article/details/140268049

[Figure from the paper]

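Pre-training is staged: first train the vision expert (V-FFN); once it is trained, freeze it and train the language expert (L-FFN); finally train the whole model jointly.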
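The staged schedule can be sketched as a rule for which parameters receive gradients in each stage. This is a minimal illustration, not the repository's training code. The stage names and the helpers `is_trainable` and `trainable_params` are hypothetical, and the sketch assumes that the shared self-attention is trained together with the vision expert in the first stage and stays frozen in the second. The substrings `attn`, `mlp_imag`, `mlp_text`, `norm2_imag`, and `norm2_text` follow the attribute names used in the `Block` code quoted later in this post.

```python
# Hypothetical stage names; the substrings mirror attribute names in Block.
STAGE_TRAINABLE = {
    # Stage 1 (assumption): shared attention + vision expert are trained.
    "vision": ("attn", "norm1", "mlp_imag", "norm2_imag"),
    # Stage 2: everything from stage 1 is frozen, only the language expert is trained.
    "language": ("mlp_text", "norm2_text"),
    # Stage 3: None means every parameter is trained.
    "vision_language": None,
}


def is_trainable(stage, param_name):
    """Return True if a parameter with this name should receive gradients in this stage."""
    keys = STAGE_TRAINABLE[stage]
    if keys is None:
        return True
    return any(key in param_name for key in keys)


def trainable_params(stage, param_names):
    """Filter a list of parameter names down to the ones trained in this stage."""
    return [name for name in param_names if is_trainable(stage, name)]
```

With a PyTorch model, the same rule could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(stage, name)`.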

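Step 1: Modality representations

Given an image-text pair, the model builds three kinds of input representations: text only, image only, and joint image-text.

Image representation

An image of size h × w × c is split into patches of size p × p, and each patch is flattened. This gives a sequence of shape n × (p² · c), where n = hw / p² is the number of patches. A linear projection then maps each flattened patch to an embedding, producing the image representation sequence.
A learnable [I_CLS] token is prepended to the image sequence, and position embeddings are added.
Code: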

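# https://github.com/microsoft/unilm/blob/master/vlmo/vlmo/modules/multiway_transformer.py#L167
    def visual_embed(self, _x):
        # Project the image into patch embeddings
        x = self.patch_embed(_x)
        # Flatten the patch grid into a sequence and move channels last: (B, N, D)
        x = x.flatten(2).transpose(1, 2)
        B, L, _ = x.shape

        # Prepend the learnable [I_CLS] token to every sequence in the batch
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)

        # Add position embeddings (if present), then apply dropout
        if self.pos_embed is not None:
            x = x + self.pos_embed
        x = self.pos_drop(x)

        # Every image token is valid, so the mask is all ones: (B, N + 1)
        x_mask = torch.ones(x.shape[0], x.shape[1])

        return x, x_mask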
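To make the patch arithmetic concrete, here is a small shape calculation. The helper `image_sequence_shape` is hypothetical, and the 224 × 224 × 3 image with 16 × 16 patches is only an illustrative setting, not a value taken from this post. The second number is the flattened patch size before the linear projection.

```python
def image_sequence_shape(h, w, c, p, add_cls=True):
    """Return (sequence_length, flattened_patch_dim) for an h x w x c image
    split into non-overlapping p x p patches."""
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by the patch size")
    n = (h // p) * (w // p)   # number of patches, n = h * w / p^2
    patch_dim = p * p * c     # size of one flattened patch
    seq_len = n + 1 if add_cls else n  # +1 for the prepended [I_CLS] token
    return seq_len, patch_dim


print(image_sequence_shape(224, 224, 3, 16, add_cls=False))  # (196, 768)
print(image_sequence_shape(224, 224, 3, 16))                 # (197, 768)
```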

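Text representation

The text is tokenized as in BERT, and two special tokens are added to the text sequence: a start-of-sequence token ([T_CLS]) and a boundary token ([T_SEP]).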
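A minimal sketch of how the special tokens could be added to an already tokenized text. The helper `build_text_sequence`, the `[PAD]` token, and the padding to a fixed `max_text_len` are assumptions for illustration. The fixed length is motivated by the `Block` code quoted later, which slices the text part as the first `max_text_len` positions.

```python
def build_text_sequence(tokens, max_text_len, cls="[T_CLS]", sep="[T_SEP]", pad="[PAD]"):
    """Wrap tokens with [T_CLS] ... [T_SEP], then truncate or pad to max_text_len.

    Returns (sequence, mask), where mask is 1 for real tokens and 0 for padding.
    """
    if max_text_len < 2:
        raise ValueError("max_text_len must leave room for the two special tokens")
    body = list(tokens)[: max_text_len - 2]  # keep room for [T_CLS] and [T_SEP]
    seq = [cls] + body + [sep]
    mask = [1] * len(seq)
    n_pad = max_text_len - len(seq)
    return seq + [pad] * n_pad, mask + [0] * n_pad


seq, mask = build_text_sequence(["a", "dog"], max_text_len=6)
print(seq)   # ['[T_CLS]', 'a', 'dog', '[T_SEP]', '[PAD]', '[PAD]']
print(mask)  # [1, 1, 1, 1, 0, 0]
```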

Multimodal representation

Finally, the text and image representations are concatenated to form the joint input H.
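The concatenation can be sketched with plain lists. The helpers `concat_modalities` and `split_modalities` are hypothetical. The order (text first, then image) is taken from the `Block` code quoted later in this post, which recovers the text part with `x[:, : self.max_text_len]`.

```python
def concat_modalities(text_seq, text_mask, image_seq, image_mask):
    """Concatenate text and image sequences (text first) together with their masks."""
    return text_seq + image_seq, text_mask + image_mask


def split_modalities(h, max_text_len):
    """Undo the concatenation: the first max_text_len positions are text, the rest image."""
    return h[:max_text_len], h[max_text_len:]


h, h_mask = concat_modalities(["t0", "t1", "t2"], [1, 1, 0], ["i_cls", "i0"], [1, 1])
print(h)                       # ['t0', 't1', 't2', 'i_cls', 'i0']
print(split_modalities(h, 3))  # (['t0', 't1', 't2'], ['i_cls', 'i0'])
```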

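Step 2: Mixture-of-Modality-Experts Transformer

Given the output of the previous layer H_{l-1}, l ∈ [1, L], each MoME Transformer block captures modality-specific information by switching to a different modality expert, and uses multi-head self-attention (MSA) shared across modalities to align visual and linguistic content.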
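In equation form, one block can be written as a shared attention sub-layer followed by the selected expert, each with a pre-norm residual connection. Here LN denotes layer normalization and H'_l is the intermediate output after attention. The layer-scale factors and drop-path that appear in the code are omitted for readability.

```latex
\[
\begin{aligned}
H'_l &= \mathrm{MSA}\big(\mathrm{LN}(H_{l-1})\big) + H_{l-1} \\
H_l  &= \text{MoME-FFN}\big(\mathrm{LN}(H'_l)\big) + H'_l
\end{aligned}
\]
```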

Core part of the code:

class Block(nn.Module):
    ....
        self.gamma_1 = \
            nn.Parameter(layer_scale_init_values * torch.ones((dim)), requires_grad=True) \
            if layer_scale_init_values is not None else 1.0
        self.gamma_2 = \
            nn.Parameter(layer_scale_init_values * torch.ones((dim)), requires_grad=True) \
            if layer_scale_init_values is not None else 1.0

        self.max_text_len = max_text_len

    def forward(self, x, mask=None, modality_type=None, relative_position_bias=None):
        # Shared multi-head self-attention with a residual connection
        x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x), mask=mask, relative_position_bias=relative_position_bias))

        if modality_type == "image":
            # Image-only input: vision expert
            x = x + self.drop_path(self.gamma_2 * self.mlp_imag(self.norm2_imag(x)))
        elif modality_type == "text":
            # Text-only input: language expert
            x = x + self.drop_path(self.gamma_2 * self.mlp_text(self.norm2_text(x)))
        else:
            if self.mlp_vl is None:
                # No vision-language expert in this block: split the joint sequence
                # and send each part to its own modality expert
                x_text = x[:, : self.max_text_len]
                x_imag = x[:, self.max_text_len :]
                x_text = x_text + self.drop_path(self.gamma_2 * self.mlp_text(self.norm2_text(x_text)))
                x_imag = x_imag + self.drop_path(self.gamma_2 * self.mlp_imag(self.norm2_imag(x_imag)))
                x = torch.cat([x_text, x_imag], dim=1)
            else:
                # Joint input: vision-language expert
                x = x + self.drop_path(self.gamma_2 * self.mlp_vl(self.norm2_vl(x)))

        return x
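The routing logic in `forward` can be isolated in a toy version that runs without PyTorch. This is only a sketch of the expert selection: the function `mome_ffn` and the `experts` dictionary are hypothetical, tokens are plain numbers, and the residual connection, layer norm, layer scale, and drop-path are left out.

```python
def mome_ffn(tokens, modality_type, experts, max_text_len):
    """Apply the expert(s) selected by modality_type to a list of token values.

    experts maps "image" and "text" to callables, and "vl" to a callable or None.
    """
    if modality_type == "image":
        return [experts["image"](t) for t in tokens]
    if modality_type == "text":
        return [experts["text"](t) for t in tokens]
    if experts.get("vl") is None:
        # No vision-language expert: text positions go to the text expert,
        # the remaining positions go to the image expert.
        text, image = tokens[:max_text_len], tokens[max_text_len:]
        return [experts["text"](t) for t in text] + [experts["image"](t) for t in image]
    return [experts["vl"](t) for t in tokens]


# Toy experts that make the routing visible in the output.
experts = {"text": lambda t: t + 1, "image": lambda t: t * 10, "vl": None}
print(mome_ffn([1, 2, 3, 4], "vl", experts, max_text_len=2))  # [2, 3, 30, 40]

experts["vl"] = lambda t: -t
print(mome_ffn([1, 2, 3, 4], "vl", experts, max_text_len=2))  # [-1, -2, -3, -4]
```

The first call mirrors the `self.mlp_vl is None` branch, where the joint sequence is split at `max_text_len`; the second mirrors the branch where a vision-language expert handles the whole sequence.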