Paper notes: LXMERT: Learning Cross-Modality Encoder Representations from Transformers

LXMERT is a Transformer-based framework for learning the connections between vision and language. The model consists of an object-relationship encoder, a language encoder, and a cross-modality encoder, and learns the relationships between the modalities through pre-training tasks such as masked language modeling, masked object prediction, and image question answering. After fine-tuning, LXMERT achieves state-of-the-art results on datasets such as VQA and GQA, and improves the previous best result on NLVR2 by 22% absolute.

Contents

I. Abstract

II. Network Framework

III. Experimental Analysis

IV. Conclusion


This is one of a series of paper-reading notes on visual question answering. The post is a bit long; read it patiently and you will surely gain something. If anything is lacking, comments and discussion are always welcome.

I. Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pretraining strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders.

The authors argue that vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between the two modalities. They therefore propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. LXMERT is a large-scale Transformer model built from three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. To give the model the ability to connect vision and language semantics, it is pre-trained on large amounts of image-and-sentence pairs via five diverse, representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships. After fine-tuning from the pre-trained parameters, the model achieves state-of-the-art results on two visual question answering datasets (VQA and GQA). The authors also demonstrate the generalizability of the pre-trained cross-modality model by adapting it to the challenging visual-reasoning task NLVR2, improving the previous best result by 22% absolute (54% to 76%), and present ablation studies and attention visualizations showing that both the novel model components and the pre-training strategies contribute to the strong results.
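To make the three-encoder layout concrete, here is a minimal sketch in PyTorch. This is not the authors' released code: all class names, the 4-d box encoding, and the toy dimensions are illustrative assumptions. The layer counts (9 language layers, 5 object-relationship layers, 5 cross-modality layers) follow the configuration reported in the paper, and each cross-modality layer applies bidirectional cross-attention first, then per-modality self-attention and feed-forward blocks, as the paper describes.

```python
# Minimal sketch (assumed PyTorch) of LXMERT's three-encoder layout.
# Illustrative only; names and sizes are assumptions, not the official code.
import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """One cross-modality layer: cross-attention in both directions,
    followed by per-modality self-attention + feed-forward blocks."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        # Language queries attend to visual keys/values, and vice versa.
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _ = self.vis_cross(vis, lang, lang)
        # Residual connection, then intra-modality self-attention + FFN.
        return self.lang_self(lang + lang2), self.vis_self(vis + vis2)


class LXMERTSketch(nn.Module):
    def __init__(self, vocab=30522, dim=768, heads=12,
                 n_lang=9, n_vis=5, n_cross=5, feat_dim=2048):
        super().__init__()
        # Language input: word embeddings (positional embeddings omitted).
        self.word_emb = nn.Embedding(vocab, dim)
        # Object-relationship input: RoI features + box positions
        # (boxes assumed here to be 4 normalized coordinates).
        self.feat_proj = nn.Linear(feat_dim, dim)
        self.pos_proj = nn.Linear(4, dim)
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.lang_enc = nn.ModuleList(layer() for _ in range(n_lang))
        self.vis_enc = nn.ModuleList(layer() for _ in range(n_vis))
        self.cross_enc = nn.ModuleList(
            CrossModalityLayer(dim, heads) for _ in range(n_cross))

    def forward(self, token_ids, roi_feats, roi_boxes):
        lang = self.word_emb(token_ids)
        vis = self.feat_proj(roi_feats) + self.pos_proj(roi_boxes)
        for l in self.lang_enc:           # intra-modality: language
            lang = l(lang)
        for l in self.vis_enc:            # intra-modality: objects
            vis = l(vis)
        for l in self.cross_enc:          # cross-modality exchange
            lang, vis = l(lang, vis)
        return lang, vis                  # fed to pre-training/task heads


# Toy forward pass: a 12-token sentence and 36 detected objects.
model = LXMERTSketch()
lang, vis = model(torch.randint(0, 30522, (1, 12)),
                  torch.randn(1, 36, 2048), torch.rand(1, 36, 4))
print(lang.shape, vis.shape)  # (1, 12, 768) and (1, 36, 768)
```

The sketch stops at the encoder outputs; in the paper these feed the heads for the five pre-training tasks (masked LM, feature regression, label classification, cross-modality matching, and image QA), which are omitted here.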
