LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Article link: https://blog.csdn.net/weixin_42437114/article/details/122285279

Model Architecture


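Input Embeddings: the input embedding layers convert the sentence and the image into word-level sentence embeddings and object-level image embeddings, respectively.
- Word-Level Sentence Embeddings: the sentence is first split into tokens by the WordPiece tokenizer; the word embedding and the index (positional) embedding of each token are then summed to give the index-aware word embedding:
  $$\hat{w}_i = \text{WordEmbed}(w_i), \quad \hat{u}_i = \text{IdxEmbed}(i), \quad h_i = \text{LayerNorm}(\hat{w}_i + \hat{u}_i)$$
- Object-Level Image Embeddings: Faster R-CNN first detects $m$ objects and returns, for each object, a position feature $p_j$ (i.e., bounding box coordinates) and a 2048-$d$ RoI feature $f_j$; FC layers then map these to the position-aware embedding:
  $$\hat{f}_j = \text{LayerNorm}(W_F f_j + b_F), \quad \hat{p}_j = \text{LayerNorm}(W_P p_j + b_P), \quad v_j = (\hat{f}_j + \hat{p}_j) / 2$$
  The layer normalization is applied to the projected features before summation so as to balance the energy of the two different types of features.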
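The two embedding steps above can be sketched in plain Python. This is only a toy illustration: the helper names (`layer_norm`, `linear`, `word_embedding`, `object_embedding`), the tiny dimensions, and the random weights are all assumptions made for the example, the layer normalization has no learned gain or bias, and averaging the two normalized projections follows the formulation in the LXMERT paper.

```python
import math
import random

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def linear(W, b, x):
    """Fully connected layer W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def word_embedding(word_vec, idx_vec):
    """h_i = LayerNorm(word embedding + index embedding)."""
    return layer_norm([w + u for w, u in zip(word_vec, idx_vec)])

def object_embedding(f, p, W_F, b_F, W_P, b_P):
    """v_j = (LayerNorm(W_F f + b_F) + LayerNorm(W_P p + b_P)) / 2."""
    f_hat = layer_norm(linear(W_F, b_F, f))
    p_hat = layer_norm(linear(W_P, b_P, p))
    return [(a + c) / 2 for a, c in zip(f_hat, p_hat)]

def random_matrix(rows, cols, rng):
    """Hypothetical weight initializer used only for this demo."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden, roi_dim, pos_dim = 8, 16, 4   # toy sizes; the model uses 768 and 2048
W_F, b_F = random_matrix(hidden, roi_dim, rng), [0.0] * hidden
W_P, b_P = random_matrix(hidden, pos_dim, rng), [0.0] * hidden

f_j = [rng.gauss(0.0, 10.0) for _ in range(roi_dim)]  # RoI feature on a large scale
p_j = [0.1, 0.2, 0.6, 0.9]                            # bounding box coordinates
v_j = object_embedding(f_j, p_j, W_F, b_F, W_P, b_P)
```

Because both projections are normalized before they are combined, the RoI feature cannot drown out the position feature even though its raw values are much larger.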
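Encoders: single-modality encoders + cross-modality encoder, built mainly from self-attention layers and cross-attention layers (multi-head attention).
- Single-Modality Encoders: language encoder + object-relationship encoder. Each layer has the same structure as the basic block of the Transformer encoder.
- Cross-Modality Encoder: each layer consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two feed-forward sub-layers. The bi-directional cross-attention sub-layer is made up of two uni-directional cross-attention sub-layers (one from language to vision and one from vision to language). Let the language features of layer $k - 1$ be $\{h_i^{k-1}\}$ ($i = 1, \ldots, n$) and the vision features be $\{v_j^{k-1}\}$ ($j = 1, \ldots, m$); the cross-attention can then be written as (the first argument is the query, the second is the set of keys and values):
  $$\hat{h}_i^k = \text{CrossAtt}_{L \to R}\left(h_i^{k-1}, \{v_1^{k-1}, \ldots, v_m^{k-1}\}\right), \quad \hat{v}_j^k = \text{CrossAtt}_{R \to L}\left(v_j^{k-1}, \{h_1^{k-1}, \ldots, h_n^{k-1}\}\right)$$
  and the self-attention as
  $$\tilde{h}_i^k = \text{SelfAtt}_{L \to L}\left(\hat{h}_i^k, \{\hat{h}_1^k, \ldots, \hat{h}_n^k\}\right), \quad \tilde{v}_j^k = \text{SelfAtt}_{R \to R}\left(\hat{v}_j^k, \{\hat{v}_1^k, \ldots, \hat{v}_m^k\}\right)$$
Output Representations: the feature sequences produced by the cross-modality encoder are the language and vision outputs. The output at the special token [CLS], which is placed at the very beginning of the input, serves as the cross-modality output.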
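To make the query / key-value roles in the cross-modality encoder concrete, here is a minimal single-head sketch in plain Python. It is a simplification, not the actual model: there are no learned projections, no multi-head split, no residual connections, no layer normalization, and no feed-forward sub-layers, and the names `attend` and `cross_modality_attention` are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, context):
    """Scaled dot-product attention for one query vector.
    The context vectors act as both keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    dim = len(context[0])
    return [sum(w * ctx[d] for w, ctx in zip(weights, context)) for d in range(dim)]

def cross_modality_attention(lang, vision):
    """One cross-attention step followed by one self-attention step."""
    # Bi-directional cross-attention: each side queries the other modality.
    h_hat = [attend(h, vision) for h in lang]    # language -> vision
    v_hat = [attend(v, lang) for v in vision]    # vision -> language
    # Self-attention: each side queries its own cross-attended sequence.
    h_tilde = [attend(h, h_hat) for h in h_hat]
    v_tilde = [attend(v, v_hat) for v in v_hat]
    return h_tilde, v_tilde

lang = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]  # n = 3 words
vision = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]                      # m = 2 objects
h_out, v_out = cross_modality_attention(lang, vision)
```

The sequence lengths are preserved: the language side still has one vector per word and the vision side one vector per object, but each vector now mixes in information from the other modality.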

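$N_L$, $N_X$, $N_R$ are set to 9, 5, and 5 respectively, and the hidden size is 768, the same as $\text{BERT}_{\text{BASE}}$. The language encoder uses more layers in order to balance the visual features extracted by Faster R-CNN. If a single-modality layer is counted as half of a cross-modality layer, this configuration is equivalent to $(9 + 5) / 2 + 5 = 12$ cross-modality layers, which matches the number of layers in $\text{BERT}_{\text{BASE}}$.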
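The layer-count arithmetic can be checked in a couple of lines. The variable names are just for illustration; in the formula, the 9 and the first 5 are the two single-modality encoders ($N_L$ and $N_R$), and the last 5 is the cross-modality encoder ($N_X$).

```python
N_L, N_X, N_R = 9, 5, 5   # language, cross-modality, object-relationship layers

# Each single-modality layer counts as half of a cross-modality layer.
single_modality_layers = N_L + N_R
equivalent_cross_layers = single_modality_layers / 2 + N_X
print(equivalent_cross_layers)  # 12.0
```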

Pre-Training Strategies

Pre-Training Tasks


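Language Task: Masked Cross-Modality language model (LM): as in BERT, words are randomly masked with a probability of 15%, and the model has to predict the masked words from the remaining words and the image information.
Vision Task: Masked Object Prediction: objects are randomly masked with a probability of 15% (i.e., masking RoI features with zeros), and the model has to predict the masked objects from the remaining objects and the text. Depending on the prediction target there are two sub-tasks: (1) RoI-Feature Regression: regress the object RoI feature $f_j$ with an L2 loss; (2) Detected-Label Classification: predict the object's class with a cross-entropy loss (Although most of our pre-training images have object-level annotations, the ground truth labels of the annotated objects are inconsistent in different datasets (e.g., different number of label classes). For these reasons, we take detected labels output by Faster R-CNN).
Cross-Modality Tasks: (1) Cross-Modality Matching: given an image-text pair, the sentence is replaced by a random other sentence with a probability of 50%, and an additional classifier is trained to decide whether the image and the sentence match; (2) Image Question Answering (QA): the model is asked to answer questions about the image (an answer table with 9500 candidate answers is used, which covers roughly 90% of the questions in the image QA datasets).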
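A rough sketch of how the inputs and losses for these tasks could be built, in plain Python. All function names here are hypothetical, and the sketch makes simplifying assumptions: a masked word is always replaced by `[MASK]` (BERT's random-token / keep-original variants are not modeled), and the losses are written for a single object rather than a batch.

```python
import math
import random

MASK_TOKEN = "[MASK]"

def mask_words(tokens, rng, prob=0.15):
    """Replace each token with [MASK] with probability `prob`.
    Returns the masked tokens and the labels (None where nothing was masked)."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def mask_objects(roi_feats, rng, prob=0.15):
    """Zero out each object's RoI feature with probability `prob`.
    Returns the masked features and a 0/1 flag per object."""
    masked, flags = [], []
    for feat in roi_feats:
        if rng.random() < prob:
            masked.append([0.0] * len(feat))
            flags.append(1)
        else:
            masked.append(list(feat))
            flags.append(0)
    return masked, flags

def make_matching_example(sentence, other_sentences, rng, prob=0.5):
    """With probability `prob`, swap in a sentence from another image.
    Returns (sentence, is_matched)."""
    if rng.random() < prob:
        return rng.choice(other_sentences), 0
    return sentence, 1

def feature_regression_loss(pred, target):
    """Squared L2 distance between the predicted and original RoI feature."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))

def label_classification_loss(logits, label):
    """Cross-entropy of the detected label under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]

rng = random.Random(0)
tokens = ["a", "dog", "chasing", "a", "ball"]
masked_tokens, word_labels = mask_words(tokens, rng)
feats = [[0.3, 1.2], [0.8, 0.1], [0.5, 0.5]]
masked_feats, obj_flags = mask_objects(feats, rng)
```

Only positions whose label is not `None` (or whose flag is 1) would contribute to the masked-LM and masked-object losses.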

The losses of all these pre-training tasks are added together for training (We train the model for 20 epochs with a batch size of 256. We only pre-train with image QA task for the last 10 epochs, because this task converges faster and empirically needs a smaller learning rate.)
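The schedule described above, with the image QA loss joining only in the second half of training, might look like this. The task names, the 0-based epoch index, and the plain unweighted sum are assumptions made for the sketch.

```python
TOTAL_EPOCHS = 20
QA_EPOCHS = 10  # image QA is only trained during the last 10 epochs

def active_tasks(epoch):
    """Return the pre-training tasks used at a 0-based epoch index."""
    tasks = [
        "masked_lm",
        "roi_feature_regression",
        "detected_label_classification",
        "cross_modality_matching",
    ]
    if epoch >= TOTAL_EPOCHS - QA_EPOCHS:
        tasks.append("image_qa")
    return tasks

def total_loss(losses, epoch):
    """Sum the losses of all tasks active at this epoch."""
    return sum(losses[name] for name in active_tasks(epoch))

# Dummy per-task loss values, just to show how the sum changes.
example_losses = {
    "masked_lm": 1.0,
    "roi_feature_regression": 1.0,
    "detected_label_classification": 1.0,
    "cross_modality_matching": 1.0,
    "image_qa": 1.0,
}
early = total_loss(example_losses, epoch=0)   # four tasks active
late = total_loss(example_losses, epoch=19)   # image QA added
```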

Pre-Training Data

Five vision-and-language datasets whose images come from MS COCO or Visual Genome are aggregated (only the train and dev splits are collected), giving 9.18M image-and-sentence pairs on 180K images.

Pre-Training Procedure

(1) We consistently keep 36 objects for each image to maximize the pre-training compute utilization by avoiding padding.
(2) During pre-training, all parameters in the encoders and embedding layers are trained from scratch. Initializing them from pre-trained BERT parameters gives worse results.