Distilled Dual-Encoder Model for Vision-Language Understanding

Latest recommended article published on 2024-05-27 09:51:21

黑子小明

Zekun Wang † ∗ , Wenhui Wang ‡ , Haichao Zhu † , Ming Liu † , Bing Qin † , Furu Wei ‡ † Harbin Institute of Technology, Harbin, China ‡ Microsoft Research, Beijing, China {zkwang,hczhu,mliu,qinb}@ir.hit.edu.cn {wenwan,fuwei}@microsoft.com

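Abstract

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and allow image and text representations to be pre-computed before inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. To learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be available at https://github.com/kugwzk/Distilled-DualEncoder.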

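1 Introduction

    Vision-language (VL) pre-trained models (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Chen et al., 2020; Li et al., 2020, 2021b; Zhang et al., 2021; Kim et al., 2021; Radford et al., 2021; Li et al., 2021a) learn cross-modal representations on large-scale image-text pairs and can be directly fine-tuned for various downstream VL tasks, such as vision-language understanding/classification (visual reasoning (Suhr et al., 2019), visual question answering (Goyal et al., 2017)) and image-text retrieval (Young et al., 2014). Based on how they model cross-modal interactions, these models can be divided into two categories.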
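    The first category is fusion-encoder models (Lu et al., 2019; Chen et al., 2020; Li et al., 2020; Kim et al., 2021; Li et al., 2021a), which use an effective but less efficient Transformer (Vaswani et al., 2017) encoder to capture the interactions of images and text with cross-modal attention. Most models in this category (Li et al., 2019; Lu et al., 2019; Chen et al., 2020; Li et al., 2020; Zhang et al., 2021) rely on an off-the-shelf object detector to extract image region features, which further impedes their efficiency. Recently, ViLT (Kim et al., 2021) discards the detector and directly encodes image patches with a Vision Transformer (Dosovitskiy et al., 2021), in the same way as text tokens. It achieves competitive performance on VL understanding and retrieval tasks while improving efficiency. However, because images and text must be encoded jointly, the Transformer-based cross-modal interaction remains an efficiency bottleneck, which limits its application to tasks with massive numbers of image or text candidates.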
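    The second category, including CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), employs a dual-encoder architecture to encode images and text separately. Cross-modal interactions are modeled by a shallow fusion module, usually a multi-layer perceptron (MLP) network or a dot product, which is extremely light compared with the Transformer encoder in fusion-encoder models. In addition, the disentangled encoding enables offline computation and caching of image and text candidates, which scales well to massive numbers of candidates. These changes reduce inference time in understanding and retrieval tasks, making the models practical in real-life scenarios. Dual-encoder models achieve promising performance on image-text retrieval tasks. However, on vision-language understanding tasks that require complex cross-modal reasoning, such as NLVR2, they fall far behind fusion-encoder models.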
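    In this work, we propose a cross-modal attention distillation framework to train a dual-encoder vision-language model. The distilled dual-encoder model achieves competitive performance on vision-language understanding tasks with a much faster inference speed than fusion-encoder models. In addition to soft label distillation (Hinton et al., 2015), we introduce cross-modal attention distillation as fine-grained supervision for the dual-encoder model (student) to better learn cross-modal reasoning. Specifically, we use both the image-to-text and text-to-image attention distributions of a fusion-encoder model (teacher) for distillation.
    Our distillation framework can be applied to both the pre-training and fine-tuning stages. During pre-training, we apply the distillation objectives to the image-text contrastive learning and image-text matching tasks. During fine-tuning, the task-specific knowledge of the fine-tuned teacher model is transferred to the student model.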
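    We evaluate our model on vision-language understanding tasks and an image-text retrieval task.
    Experimental results show that our distilled dual-encoder model is competitive on visual entailment, visual reasoning and visual question answering, retaining 99.9%, 97.8% and 95.5% of the teacher's performance respectively, while its inference is more than 3 times faster than that of the fusion-encoder teacher model.
    Moreover, the proposed cross-modal attention distillation also improves performance on the retrieval task, and the student even outperforms the teacher model on image retrieval. Compared with other latent features, cross-modal attention helps the dual-encoder model learn better cross-modal reasoning ability and brings significant gains on VL understanding tasks. Finally, the model obtained with two-stage distillation performs better than the model obtained with single-stage distillation.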

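2 Related Work

2.1 Vision-Language Pre-Training

    Language and vision pre-training have advanced the state of the art on downstream natural language processing tasks (Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019; Liu et al., 2019; Bao et al., 2020; Lewis et al., 2020; Raffel et al., 2020; Conneau and Lample, 2019; Chi et al., 2020; Conneau et al., 2020; Chi et al., 2021a,b; Ma et al., 2021) and computer vision tasks (Dosovitskiy et al., 2021; Touvron et al., 2021; Bao et al., 2021). Vision-language pre-training (Lu et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Gan et al., 2020; Li et al., 2020, 2021b; Wang et al., 2021a,c) has also proven effective for learning cross-modal representations. The architectures of these VL models fall into two lines.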
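    The first line of work (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020; Zhou et al., 2020; Zhang et al., 2021; Kim et al., 2021; Li et al., 2021a) uses a fusion encoder to learn cross-modal interactions. These models first encode an image-text pair into vectors and then fuse the visual and textual representations with a multi-layer Transformer (Vaswani et al., 2017) network. Most previous models extract visual features with an object detector (e.g., Faster R-CNN (Ren et al., 2015)), which must be pre-trained on an expensively annotated dataset with a fixed set of object classes, such as Visual Genome (Krishna et al., 2017). Moreover, object detectors require high-resolution input images and add further computation cost.
    Recently, Huang et al. (2020) and Xu et al. (2021) directly take image pixels as input and feed them into convolutional neural networks to obtain visual grid features instead of the previous region features. Li et al. (2021a) use a Vision Transformer (Dosovitskiy et al., 2021) to extract image features for the multimodal fusion encoder.
    ViLT (Kim et al., 2021) directly encodes image patches with a simple embedding layer. A multimodal Transformer then jointly encodes the visual and textual embeddings. It achieves competitive performance on VL tasks with less overhead. Fusion-encoder models show strong cross-modal modeling ability and achieve superior results on VL understanding tasks that require complex cross-modal reasoning, such as NLVR2 (Suhr et al., 2019).
    However, fusion-encoder models still rely on a cross-modal Transformer to encode and fuse visual and textual representations simultaneously across layers, which demands a heavy computation budget and leads to low inference speed.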
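    Another line of work (Radford et al., 2021; Jia et al., 2021; Sun et al., 2021) employs a dual-encoder architecture to encode images and text separately, and uses a dot product or an MLP network to model the interactions between them. Compared with fusion-encoder models, dual-encoder models have an advantage in computational efficiency. The multi-head attention mechanism is applied only to tokens of the same modality, which reduces the $O((N+M)^2)$ complexity of fusion-encoder models to $O(N^2+M^2)$, where N and M are the lengths of the visual and textual features respectively. In addition, thanks to the independent encoders, visual or textual representations can be pre-computed and cached in real-life applications. Dual-encoder models achieve promising performance on image-text retrieval. However, the shallow interaction module is insufficient for complex VL understanding tasks that require deeper cross-modal interactions, which results in a significant performance drop. To improve dual-encoder models on complex VL understanding tasks, we introduce a cross-modal attention distillation framework that helps the model learn deeper interactions.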
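To make the complexity comparison concrete, the following minimal sketch counts the query-key pairs scored per attention head and layer under each architecture. The sizes are assumptions taken from the implementation details later in the text (a 384×640 image with 32×32 patches and at most 40 text tokens), and the special [I_CLS] and [T_CLS] tokens are ignored for simplicity.

```python
def attention_pairs_fusion(n_visual: int, n_text: int) -> int:
    """Query-key pairs when both modalities share one sequence: (N + M)^2."""
    total = n_visual + n_text
    return total * total


def attention_pairs_dual(n_visual: int, n_text: int) -> int:
    """Query-key pairs when each modality attends only to itself: N^2 + M^2."""
    return n_visual * n_visual + n_text * n_text


# Assumed sizes: 384x640 image, 32x32 patches, 40 text tokens.
N = (384 // 32) * (640 // 32)  # number of image patches
M = 40                         # number of text tokens

fusion = attention_pairs_fusion(N, M)
dual = attention_pairs_dual(N, M)
print(N, fusion, dual, fusion - dual)
```

The difference between the two counts is exactly 2NM, the image-to-text and text-to-image blocks that the dual encoder never computes.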

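2.2 Knowledge Distillation

Knowledge distillation (KD) aims to transfer the knowledge learned by a strong teacher model to a student model so that the student becomes competitive. Hinton et al. (2015) use the soft label distribution of the teacher model to train the student model. More recent work further improves the student by having it mimic the teacher's intermediate representations, such as hidden states (Romero et al., 2015) and attention distributions (Zagoruyko and Komodakis, 2017).
Knowledge distillation has also been widely used to compress and improve Transformer-based models across different fields (Sun et al., 2019; Jiao et al., 2020; Wang et al., 2020a,b, 2021b; Touvron et al., 2021). In this work, we use the cross-modal attention knowledge of a fusion-encoder teacher model to guide the training of a dual-encoder model. Our distillation framework improves the dual-encoder model on complex VL understanding tasks.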

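3 Method

In this section, we describe the cross-modal attention distillation framework used to train the dual-encoder model. Figure 1 gives an overview of our method. We adopt a fusion-encoder model as the teacher and introduce cross-modal attention knowledge together with soft labels to train the dual-encoder student model. The distillation objectives are applied in both the pre-training and fine-tuning stages and help the dual-encoder model learn the interactions between modalities.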

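3.1 Model Overview

    Our distillation framework can use different fusion-encoder models as the teacher. In this work, we adopt ViLT (Kim et al., 2021) as the teacher model in our experiments because it is simple and effective.
    Input Representations. Given an image-text pair $(v, t)$ as input, we slice the image $v \in \mathbb{R}^{H \times W \times C}$ into patches $v^p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$ is the number of patches, $(H, W)$ is the input image resolution, $(P, P)$ is the resolution of each patch, and $C$ is the number of channels.
    The input text is tokenized into a sequence of M subword tokens with WordPiece (Wu et al., 2016), as in BERT (Devlin et al., 2019). We then prepend the special tokens [I_CLS] and [T_CLS] to the sequence of image patches and the sequence of text subword tokens, respectively.
    We linearly project the image patches $v^p$ to obtain patch embeddings, and the final visual input embeddings $H^v_0 \in \mathbb{R}^{(N+1) \times D}$ are computed via:
$$H^v_0 = [v_{[I\_CLS]}, v^p_1 V, \dots, v^p_N V] + V_{pos} + V_{type} \tag{1}$$
where $V \in \mathbb{R}^{(P^2 C) \times D}$ is the linear projection, $V_{pos} \in \mathbb{R}^{(N+1) \times D}$ is a learnable 1D position embedding, and $V_{type} \in \mathbb{R}^{D}$ is the visual type embedding.
    The textual input embeddings $H^t_0 \in \mathbb{R}^{(M+1) \times D}$ are obtained by summing the word embeddings, the textual position embedding and the textual type embedding:
$$H^t_0 = [w_{[T\_CLS]}, w_1, \dots, w_M] + T_{pos} + T_{type} \tag{2}$$
We take $H^v_0$ and $H^t_0$ as the visual and textual inputs for both the teacher and the student model.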
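The patch slicing step can be illustrated with a small pure-Python sketch. The nested-list image layout, the function name `patchify` and the toy sizes are assumptions made for illustration; a real implementation would operate on tensors and follow the result with the learned linear projection.

```python
def patchify(image, patch):
    """Slice an H x W x C nested-list image into N = HW / P^2 flattened patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "resolution must be divisible by P"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy example (assumed sizes): H=4, W=6, C=3, P=2 gives N=6 patches of length P*P*C=12.
H, W, C, P = 4, 6, 3, 2
image = [[[r * 100 + c * 10 + ch for ch in range(C)] for c in range(W)] for r in range(H)]
patches = patchify(image, P)
print(len(patches), len(patches[0]))
```

Each flattened patch plays the role of one row of $v^p$; prepending the [I_CLS] embedding and adding the position and type embeddings then yields the visual input embeddings.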
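    Teacher: Fusion-Encoder Model. The input representations $H^v_0$ and $H^t_0$ are concatenated as $H^{vl}_0 = [H^v_0; H^t_0]$, and the vectors are then fed into an L-layer cross-modal Transformer encoder to obtain contextual representations:
$$H^{vl}_l = \mathrm{Transformer}^{vl}(H^{vl}_{l-1}) \tag{3}$$
where $l \in [1, L]$. The cross-modal Transformer encoder fuses the representations of the different modalities through the multi-head attention mechanism. Specifically, for each head $a$, $a \in [1, A_h]$, in layer $l$, the attention distribution $A^{vl}_a$ is computed via:
$$A^{vl}_a = \mathrm{softmax}\left(\frac{Q^{vl}_a {K^{vl}_a}^{\top}}{\sqrt{d_k}}\right) \tag{4}$$
where the query $Q^{vl}_a$ and key $K^{vl}_a$ are obtained by linearly projecting the hidden states of the previous layer using the parameters $W^Q_{l,a}, W^K_{l,a} \in \mathbb{R}^{D \times d_k}$ respectively, and $d_k$ is the attention head size. The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are fed into a task-specific layer to obtain predictions.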
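    Student: Dual-Encoder Model. The dual-encoder model encodes the visual embeddings ($H^v_0$) and the textual embeddings ($H^t_0$) separately with a visual and a textual Transformer-based encoder:
$$H^v_l = \mathrm{Transformer}^{v}(H^v_{l-1}) \tag{5}$$

$$H^t_l = \mathrm{Transformer}^{t}(H^t_{l-1}) \tag{6}$$
The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are used as the final representations of the image and the text. We adopt a shallow module f to fuse the two representations. For vision-language understanding tasks such as VQA, the module f is an MLP network. For image-text retrieval, we use a dot product function to obtain the similarity scores of image-text pairs.

Figure 1: Overview of our cross-modal attention distillation framework. In addition to soft labels, we introduce the cross-modal attention knowledge of a fusion-encoder model (teacher), including the image-to-text and text-to-image attention distributions, to guide the training of the dual-encoder model (student).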
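The two forms of the shallow fusion module f described for the student can be sketched as follows. The text only states that f is an MLP for understanding tasks and a dot product for retrieval; concatenating the two vectors, using two layers and using a ReLU activation are assumptions of this sketch, and all weights are toy values.

```python
def dot_product_score(image_vec, text_vec):
    """Retrieval: similarity is the inner product of the two final representations."""
    return sum(a * b for a, b in zip(image_vec, text_vec))


def linear(vector, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def mlp_fusion(image_vec, text_vec, w1, b1, w2, b2):
    """Understanding tasks: a small MLP over the concatenated [I_CLS] and [T_CLS] vectors.

    Concatenation, depth and ReLU are assumptions, not details given in the text.
    """
    joint = list(image_vec) + list(text_vec)
    hidden = [max(0.0, h) for h in linear(joint, w1, b1)]
    return linear(hidden, w2, b2)


# Toy 2-dimensional representations and hand-picked weights (hypothetical values).
image_cls = [1.0, 2.0]
text_cls = [0.5, -1.0]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0], [1.0, -1.0]]
b2 = [0.0, 0.5]

score = dot_product_score(image_cls, text_cls)
logits = mlp_fusion(image_cls, text_cls, w1, b1, w2, b2)
print(score, logits)
```

Because both variants consume only the final image and text vectors, either side can be encoded ahead of time and reused across many pairings.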

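3.2 Distillation Objectives

Cross-Modal Attention Distillation. To improve the dual-encoder model's ability to capture deep interactions of images and text, we use the cross-modal attention knowledge of a fusion-encoder model to guide its training. Specifically, we use the image-to-text and text-to-image attention distributions to train the dual-encoder model.
The fusion-encoder teacher model captures cross-modal interactions with the multi-head attention mechanism, as in Equation 4. The whole attention distribution $A^{vl} \in \mathbb{R}^{A_h \times (N+M) \times (N+M)}$ can be split into two parts, where N and M denote the lengths of the image and text inputs. The first part is the uni-modal attention ($A^{vl}_{v2v} \in \mathbb{R}^{A_h \times N \times N}$ and $A^{vl}_{t2t} \in \mathbb{R}^{A_h \times M \times M}$), which models the interactions among tokens of the same modality. The second part is the cross-modal attention, which consists of the image-to-text attention distribution $A^{vl}_{v2t} \in \mathbb{R}^{A_h \times N \times M}$ and the text-to-image attention distribution $A^{vl}_{t2v} \in \mathbb{R}^{A_h \times M \times N}$. The cross-modal attention distributions capture the interactions between visual and textual feature vectors. Since the separate encoding of the dual encoder models only the interactions among tokens of the same modality, we introduce cross-modal attention distillation to encourage the dual-encoder model to mimic the image-text alignment of the fusion-encoder model. The cross-modal (image-to-text and text-to-image) attention distributions of the dual-encoder model, $A^{d}_{v2t}$ and $A^{d}_{t2v}$, are computed as:
$$A^{d}_{v2t} = \mathrm{softmax}\left(\frac{Q^{v}_a {K^{t}_a}^{\top}}{\sqrt{d_k}}\right) \tag{7}$$

$$A^{d}_{t2v} = \mathrm{softmax}\left(\frac{Q^{t}_a {K^{v}_a}^{\top}}{\sqrt{d_k}}\right) \tag{8}$$
where $Q^{v}_a$ and $K^{v}_a$ are the visual queries and keys of the self-attention module, and $Q^{t}_a$ and $K^{t}_a$ are the queries and keys of the textual input. We recompute the cross-modal attention distributions of the teacher, $A^{vl}_{v2t}$ and $A^{vl}_{t2v}$, in the same way instead of directly splitting the original attention distribution $A^{vl}$, so that each row is normalized over the tokens of the other modality only. The cross-modal attention distillation loss is computed via:
$$\mathcal{L}_{CA} = D_{KL}(A^{vl}_{v2t} \,\|\, A^{d}_{v2t}) + D_{KL}(A^{vl}_{t2v} \,\|\, A^{d}_{t2v}) \tag{9}$$
where $D_{KL}$ is the Kullback-Leibler divergence. Inspired by Wang et al. (2020b), we transfer only the cross-modal attention knowledge of the last layer of the teacher model.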
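The following pure-Python sketch shows one way to compute the cross-modal attention distributions and the distillation loss for a single attention head. It is an illustration under assumptions rather than the authors' implementation: queries and keys are given as plain lists of vectors, the teacher distribution is used as the first argument of the KL divergence, and the per-row divergences are summed without any averaging over heads or tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """softmax(Q K^T / sqrt(d_k)): rows are queries of one modality, columns keys of the other."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def kl_rows(target, approx):
    """Sum over rows of KL(target_row || approx_row)."""
    return sum(
        t * math.log(t / a)
        for t_row, a_row in zip(target, approx)
        for t, a in zip(t_row, a_row)
        if t > 0.0
    )


def cross_modal_attention_loss(teacher, student):
    """Both arguments are dicts holding 'q_v', 'k_v', 'q_t', 'k_t' for one head (assumed layout)."""
    loss = 0.0
    # Image-to-text uses visual queries with textual keys; text-to-image is the reverse.
    for q_name, k_name in (("q_v", "k_t"), ("q_t", "k_v")):
        a_teacher = cross_attention(teacher[q_name], teacher[k_name])
        a_student = cross_attention(student[q_name], student[k_name])
        loss += kl_rows(a_teacher, a_student)
    return loss


# Toy head with N=2 visual tokens, M=3 text tokens and d_k=2 (hypothetical values).
teacher = {
    "q_v": [[1.0, 0.0], [0.0, 1.0]],
    "k_v": [[1.0, 0.0], [0.0, 1.0]],
    "q_t": [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]],
    "k_t": [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]],
}
student = {
    "q_v": [[0.5, 0.5], [1.0, -1.0]],
    "k_v": [[0.0, 1.0], [1.0, 0.0]],
    "q_t": [[0.2, 0.1], [1.0, 0.0], [0.3, 0.3]],
    "k_t": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
print(cross_modal_attention_loss(teacher, student))
```

Note that the teacher's cross-modal distributions are recomputed from its queries and keys with the same function, which matches the recomputation described above rather than slicing the joint attention matrix.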
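Soft Label Distillation. In addition to mimicking the cross-modal attention distributions, we use the predictions of the teacher model as soft labels to improve the student. The soft label loss is computed as:
$$\mathcal{L}_{SL} = \mathrm{CrossEntropy}(z^S, z^T) \tag{10}$$
where $z^S$ and $z^T$ are the predicted logits of the student and the teacher, respectively.

Table 1: Training objectives used for different tasks during pre-training and fine-tuning.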
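A minimal sketch of the soft label loss is given below. It assumes the usual reading of soft-label cross-entropy, in which the teacher logits are turned into a probability distribution with softmax and compared against the student's log-probabilities; no temperature is applied, since the text does not mention one.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]


def soft_label_loss(student_logits, teacher_logits):
    """Cross-entropy between the teacher's soft labels and the student's prediction."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


teacher_logits = [2.0, 0.0, -1.0]  # hypothetical 3-way prediction
matched = soft_label_loss(teacher_logits, teacher_logits)
uniform = soft_label_loss([0.0, 0.0, 0.0], teacher_logits)
print(matched, uniform)
```

The loss is smallest when the student reproduces the teacher's distribution, in which case it equals the entropy of the teacher's prediction.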

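3.3 Two-Stage Distillation Framework

We train the dual-encoder student model with the proposed knowledge distillation objectives under a two-stage framework that consists of pre-training distillation and fine-tuning distillation. In both stages, the fusion-encoder model helps the dual-encoder model learn cross-modal interactions.
As shown in Table 1, we train the model with different objectives according to the characteristics of each task.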

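3.3.1 Pre-Training Distillation

    During pre-training, the dual-encoder student model is trained on large-scale image-text pairs to learn generic cross-modal representations, using image-text matching, image-text contrastive learning and masked language modeling tasks. The pre-trained fusion-encoder model ViLT (Kim et al., 2021) is used as the teacher model.
    Image-Text Matching (ITM). The goal of image-text matching is to predict whether the input image and text are matched. Following ViLT (Kim et al., 2021), we replace the matched image with a probability of 0.5 to construct negative pairs.
    We train the dual-encoder model with the cross-modal attention distillation loss and the soft label loss on the ITM input pairs.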
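The negative-pair construction for image-text matching can be sketched as follows. Drawing the replacement image uniformly from the other images in the same list is an assumption of this sketch; the text only states that the matched image is replaced with probability 0.5.

```python
import random


def build_itm_pairs(pairs, replace_prob=0.5, seed=None):
    """Return (image, text, label) triples; label 1 = matched, 0 = image was replaced."""
    rng = random.Random(seed)
    images = [image for image, _ in pairs]
    examples = []
    for index, (image, text) in enumerate(pairs):
        if len(images) > 1 and rng.random() < replace_prob:
            # Assumption: sample the replacement uniformly from the other images.
            other = rng.choice([i for i in range(len(images)) if i != index])
            examples.append((images[other], text, 0))
        else:
            examples.append((image, text, 1))
    return examples


pairs = [("img_a", "a cat"), ("img_b", "a dog"), ("img_c", "a bird")]  # hypothetical data
print(build_itm_pairs(pairs, seed=0))
```

Both the matched and the replaced pairs are then passed through teacher and student, so the soft labels and attention distributions cover positive and negative examples alike.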
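    Image-Text Contrastive Learning (ITC). We introduce a contrastive loss with in-batch negative sampling to optimize a shared space of visual and textual representations. Given a batch of N image-text pairs, we obtain N matched pairs and $N^2 - N$ negative pairs. Image-text contrastive learning aims to predict the matched pairs among all possible pairs. A fusion-encoder model would need to jointly encode every pair to obtain soft labels, which results in quadratic time complexity. Therefore, to keep training efficient, we use cross-modal attention distillation together with ground truth labels. Specifically, we only consider the cross-modal attention distributions computed on the N matched pairs.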
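A contrastive loss with in-batch negatives can be sketched in a few lines. This is a generic symmetric formulation shown under assumptions: similarity is a plain dot product, the image-to-text and text-to-image directions are averaged, and the `temperature` parameter is hypothetical since the text does not specify one.

```python
import math


def contrastive_loss(image_vecs, text_vecs, temperature=1.0):
    """Symmetric cross-entropy over an N x N similarity matrix with diagonal targets."""
    n = len(image_vecs)
    sims = [
        [sum(a * b for a, b in zip(iv, tv)) / temperature for tv in text_vecs]
        for iv in image_vecs
    ]

    def diagonal_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += log_z - row[i]  # the matched pair sits on the diagonal
        return total / n

    text_to_image = [list(col) for col in zip(*sims)]
    return 0.5 * (diagonal_cross_entropy(sims) + diagonal_cross_entropy(text_to_image))


# Toy batch of N=2 pairs: each row competes against N-1 in-batch negatives.
image_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_vecs, [[5.0, 0.0], [0.0, 5.0]])
swapped = contrastive_loss(image_vecs, [[0.0, 5.0], [5.0, 0.0]])
print(aligned, swapped)
```

Only the N diagonal entries correspond to matched pairs, which is why the attention distillation term can be restricted to those pairs without running the teacher on all $N^2$ combinations.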
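    Masked Language Modeling (MLM). The goal of masked language modeling is to recover the masked tokens from all the other unmasked tokens. We use a masking probability of 15%, as in BERT (Devlin et al., 2019). To improve training speed, we train the model on the MLM task with ground truth labels.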
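The masking step can be illustrated with a simplified sketch. Every selected token is replaced by a `[MASK]` symbol here; BERT's additional rule of sometimes keeping or randomly replacing a selected token is omitted, and the function name and label convention are assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # ground truth label to recover
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the loss
    return masked, labels


tokens = "two dogs are playing in the snow".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked, labels)
```

The loss is then computed only at positions whose label is not `None`, using the original tokens as ground truth.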

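3.3.2 Fine-Tuning Distillation

    During fine-tuning, we use the fine-tuned ViLT as the teacher model and perform cross-modal attention distillation on the data of the downstream task.
    Vision-Language Understanding. For vision-language understanding tasks, such as visual reasoning and VQA, we fine-tune the student model with the cross-modal attention distillation loss and the soft label loss.
    Image-Text Retrieval. For the retrieval task, we train the student under the supervision of the teacher's cross-modal attention distributions and the ground truth labels, which keeps training efficient.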

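4 Experiments

4.1 Datasets

    Following previous work (Chen et al., 2020; Kim et al., 2021), we use four datasets for pre-training: COCO (Lin et al., 2014), Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011) and Visual Genome (Krishna et al., 2017).
    We evaluate our dual-encoder model on three vision-language understanding/classification datasets and one image-text retrieval dataset. Table 2 shows the statistics of the four datasets.
    Visual Reasoning. The NLVR2 (Suhr et al., 2019) dataset is a visual reasoning task that aims to determine whether a textual statement describes a pair of images. Following previous work (Li et al., 2020; Kim et al., 2021), we construct two image-text pairs as input, each consisting of one image and the textual statement. The final representations of the two pairs are fed into a classifier layer to obtain the prediction.
    Visual Entailment. The SNLI-VE (Xie et al., 2019) dataset aims to predict the relationship between an image and a text description. As in previous work (Chen et al., 2020; Li et al., 2021a), we treat SNLI-VE as a three-way classification task.
    Visual Question Answering. This task requires the model to answer questions based on an image. We evaluate on the widely used VQAv2 (Goyal et al., 2017) dataset. Following Anderson et al. (2018), we formulate the problem as a classification task with 3,129 answer candidates.
    Image-Text Retrieval. This task consists of two sub-tasks: image retrieval and text retrieval. We evaluate on the Flickr30K (Plummer et al., 2015) dataset and follow the split of Karpathy and Fei-Fei (2015).

Table 2: Statistics of the different downstream vision-language datasets.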

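4.2 Implementation Details

    The Transformer architecture of our dual-encoder model is the same as that of ViLT (Kim et al., 2021). The visual and textual Transformers both consist of 12 layers with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072. Following Kim et al. (2021), images are resized to 384 × 640 and the patch size is 32 × 32. The maximum length of the text sequence is set to 40.
    For pre-training, we train the model for 200k steps with a batch size of 1024. We use the pre-trained weights of ViLT (Kim et al., 2021) to initialize the visual and textual encoders of the dual-encoder model. For fine-tuning, we train the model for 10 epochs with a batch size of 256 on VQA and SNLI-VE. For NLVR2, we train the model for 20 epochs with a batch size of 128. For Flickr30K, the model is trained for 20 epochs with a batch size of 512. We apply RandAugment (Cubuk et al., 2020) without color inversion and cutout. For both stages, we use Adam (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999 for optimization. The learning rate is set to 1e-4, with a warmup ratio of 0.1 and linear decay. The weight decay is set to 0.01.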
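The learning rate schedule described above can be sketched as a small function. Linear warmup from zero and decay down to exactly zero at the final step are assumptions; the text only gives the peak learning rate (1e-4), the warmup ratio (0.1) and the fact that the decay is linear.

```python
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of training, then linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Pre-training setting from the text: 200k steps, so warmup covers the first 20k steps.
for step in (0, 10_000, 20_000, 110_000, 200_000):
    print(step, learning_rate(step, 200_000))
```

For fine-tuning, `total_steps` would be the number of epochs multiplied by the number of batches per epoch for the task at hand.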

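4.3 Results

    Vision-Language Understanding Results. We evaluate our model on vision-language understanding tasks, including NLVR2, SNLI-VE and VQA.
    Table 3 presents the fine-tuning results on the three tasks. Compared with a previous dual-encoder model, CLIP (Radford et al., 2021), our model achieves much better performance on the three vision-language understanding tasks, improving the average score from 57.83 to 73.85. Our dual-encoder model is also competitive with fusion-encoder models. It retains 99.9% of the teacher's accuracy on SNLI-VE, 97.8% of its accuracy on NLVR2 and 95.5% of its performance on VQA, while being more than 3 times faster than the teacher model (ViLT). Our model even outperforms PixelBERT-R50 (Huang et al., 2020) on the NLVR2 task. The dual-encoder architecture requires less computation than fusion-encoder models and achieves a faster inference speed. Moreover, separate encoding enables pre-computation and caching of image or text representations, which is more efficient for large numbers of images and texts.
    Ablation Studies. Table 3 also shows the ablation results of our method. Distillation in both the pre-training and fine-tuning stages contributes positively to our dual-encoder model. Compared with directly fine-tuning the dual-encoder model initialized from ViLT, using cross-modal attention distillation during fine-tuning brings significant improvements. Introducing pre-training distillation improves the model further.
    Image-Text Retrieval Results. In addition to vision-language understanding tasks, we evaluate our method on the image-text retrieval task. Our dual-encoder student model is trained with cross-modal attention distillation and the contrastive loss.
    Table 4 reports the results of the models fine-tuned on Flickr30K. Our dual-encoder model achieves competitive performance with a much faster inference speed. It even outperforms the fusion-encoder teacher model (ViLT) on image retrieval. The results also show that cross-modal attention distillation improves the model on the retrieval task.
    Inference Speed. We evaluate the inference speed of our dual-encoder model and ViLT on the vision-language understanding tasks. Both models are evaluated on a single P100 GPU with the same hyperparameters. Thanks to the dual-encoder architecture, our model can cache image representations to reduce redundant computation.
    The average inference time and cache time on the different tasks are shown in Table 5. Our dual-encoder model achieves a much faster inference speed on all three tasks. Pre-computing image representations further improves the inference speed, which is effective for the large numbers of images and texts found in real-life applications.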
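The caching idea can be illustrated with a small wrapper that stores image representations by identifier. The class name, the encoder callables and the toy encoders below are all hypothetical stand-ins; the sketch only shows why separate encoding lets an image be encoded once and reused across many texts.

```python
class CachedDualEncoder:
    """Wraps separate image and text encoders and reuses cached image representations."""

    def __init__(self, encode_image, encode_text, fuse):
        self._encode_image = encode_image
        self._encode_text = encode_text
        self._fuse = fuse
        self._image_cache = {}
        self.image_encoder_calls = 0

    def score(self, image_id, image, text):
        # Encode the image only on the first request for this identifier.
        if image_id not in self._image_cache:
            self._image_cache[image_id] = self._encode_image(image)
            self.image_encoder_calls += 1
        return self._fuse(self._image_cache[image_id], self._encode_text(text))


# Toy stand-ins for the real encoders and fusion module (hypothetical).
encoder = CachedDualEncoder(
    encode_image=lambda pixels: [float(sum(pixels)), float(len(pixels))],
    encode_text=lambda text: [float(len(text.split())), 1.0],
    fuse=lambda a, b: sum(x * y for x, y in zip(a, b)),
)
image = [1, 2, 3]
first = encoder.score("img-0", image, "two words")
second = encoder.score("img-0", image, "three more words")
print(first, second, encoder.image_encoder_calls)
```

A fusion encoder cannot use this shortcut, because every new image-text combination has to pass through the joint Transformer together.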
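Table 3: Results on vision-language understanding tasks. "Std" denotes training with the original ground truth labels. "KD" denotes models trained with our distillation objectives. We report accuracy on the NLVR2 development and public test (test-P) sets and on the SNLI-VE validation and test splits. We report the VQA score on the VQA test-dev split. † denotes our fine-tuned results of CLIP (Radford et al., 2021). Results are averaged over 3 runs for each task. Inference speed is evaluated on the NLVR2 dataset. We evaluate our model and ViLT on a single P100 GPU with the same hyperparameters. The inference speedup of the other models is taken from Kim et al. (2021).
Table 4: Retrieval results on Flickr30K. ViLT is the fusion-encoder teacher model. Our model is fine-tuned with cross-modal attention distillation and the contrastive objective. "− cross-modal attention" is the model trained without the cross-modal attention distillation objective. The inference speed of our model and ViLT is evaluated on a single P100 GPU under the same setting.
Table 5: Average inference time of ViLT and our model on the three vision-language understanding tasks. Inference time and cache time are evaluated on a single P100 GPU.
Table 6: Effects of using different distilled knowledge. "Attn" is short for attention distributions. "Whole Attn" is the combination of "Uni-modal Attn" and "Cross-modal Attn". Results are averaged over 3 runs for each task.
Table 7: Effects of different layer mapping strategies for our distillation method. Results are averaged over 3 runs.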

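4.4 Discussion

    Effects of Different Distilled Knowledge. We investigate the effects of the different kinds of knowledge used in distillation. We conduct experiments on the vision-language understanding tasks with different distillation losses during fine-tuning. The dual-encoder student model is directly initialized from ViLT. Table 6 shows the results across tasks. First, we find that soft label distillation performs better than training with ground truth labels. However, the model trained with soft labels still has relatively low accuracy on the NLVR2 task. We therefore incorporate the intermediate representations of the fusion-encoder model to improve the dual-encoder model, and compare the use of hidden states with the use of different attention distributions. Using attention distributions brings more improvement than hidden states on all three tasks. We further explore which part of the attention distributions is more critical, the cross-modal attention or the uni-modal attention. As shown in Table 6, mimicking the teacher's cross-modal attention distributions yields more improvement than mimicking the uni-modal part, which confirms that cross-modal interactions are more important for vision-language understanding tasks. We also find that using only the cross-modal attention distributions performs better than using the whole attention distributions (cross-modal + uni-modal).
    Inspired by Wang et al. (2020b), we perform the proposed knowledge distillation on the last layer of the teacher and the student. To validate the effectiveness of distilling only the last layer, we compare it with a layer-wise strategy. The results are shown in Table 7.
    The last-layer distillation strategy achieves better performance on the NLVR2 and SNLI-VE tasks. Moreover, using only the attention knowledge of the last layer requires less computation. Distilling only the last layer is therefore the more practical way to perform cross-modal attention distillation.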

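5 Conclusion

In this work, we introduce a cross-modal attention distillation framework to improve the performance of dual-encoder models on vision-language understanding tasks. We use the cross-modal attention knowledge of a fusion-encoder model, including the image-to-text and text-to-image attention distributions, to guide the training of the dual-encoder model. Experimental results show that the distilled dual-encoder model achieves competitive performance on NLVR2, SNLI-VE and VQA while having a much faster inference speed than fusion-encoder models.

References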

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6077–6086. Computer Vision Foundation / IEEE Computer Society.

Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT pre-training of image transformers. CoRR, abs/2106.08254.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 642–652. PMLR.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: universal image-text representation learning. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 104–120. Springer.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7570–7577. AAAI Press.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021a. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Linguistics.

Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. 2021b. XLM-E: cross-lingual language model pre-training via ELECTRA. CoRR, abs/2106.16138.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.

Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. CoRR, abs/2004.00849.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. Tinybert: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4163–4174. Association for Computational Linguistics.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128–3137. IEEE Computer Society.

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. CoRR, abs/2107.07651.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. CoRR, abs/1908.03557.

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2021b. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2592–2607. Association for Computational Linguistics.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 121–137. Springer.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13–23.

Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. CoRR, abs/2106.13736.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1143–1151.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2641–2649. IEEE Computer Society.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. Fitnets: Hints for thin deep nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2556–2565. Association for Computational Linguistics.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 6418–6428. Association for Computational Linguistics.

Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, and Jingjing Liu. 2021. Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 982–997. Association for Computational Linguistics.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4322–4331. Association for Computational Linguistics.

Hao Tan and Mohit Bansal. 2019. LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5099–5110. Association for Computational Linguistics.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020a. Minivlm: A smaller and faster vision-language model. CoRR, abs/2012.06946.

Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. CoRR, abs/2111.02358.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021b. Minilmv2: Multi-head self-attention relation distillation for compressing pre-trained transformers. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 2140–2151. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021c. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. CoRR, abs/1901.06706.

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 503–513. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics, 2:67–78.

Sergey Zagoruyko and Nikos Komodakis. 2017. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 5579–5588. Computer Vision Foundation / IEEE.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13041–13049. AAAI Press.