Paper notes: LXMERT: Learning Cross-Modality Encoder Representations from Transformers

LXMERT is a Transformer-based framework for learning the connections between vision and language. The model consists of an object-relationship encoder, a language encoder, and a cross-modality encoder, and learns the relationships between the modalities through pre-training tasks such as masked language modeling, masked object prediction, and image question answering. After fine-tuning, LXMERT achieves state-of-the-art results on datasets such as VQA and GQA, and improves the previous best result on NLVR2 by 22% absolute.

Contents

I. Abstract

II. Network Framework

III. Experimental Analysis

IV. Conclusion


This is one of a series of paper-reading notes on visual question answering. The post is a bit long; read it patiently and you will surely gain something. If anything is lacking, comments and discussion are always welcome.

I. Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pretraining strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders.

The authors argue that vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between the two modalities. They therefore propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. LXMERT is a large-scale Transformer model built from three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. To give the model the ability to connect vision and language semantics, it is pre-trained on large amounts of image-and-sentence pairs via five diverse, representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships. After fine-tuning from the pre-trained parameters, the model achieves state-of-the-art results on two visual question answering datasets (VQA and GQA). The authors also demonstrate the generalizability of the pre-trained cross-modality model by adapting it to the challenging visual-reasoning task NLVR2, improving the previous best result by 22% absolute (54% to 76%), and present ablation studies and attention visualizations showing that both the novel model components and the pre-training strategies contribute to the strong results.
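To make the three-encoder layout concrete, here is a minimal sketch in PyTorch. This is not the authors' released code: all class names, the 4-d box encoding, and the toy dimensions are illustrative assumptions. The layer counts (9 language layers, 5 object-relationship layers, 5 cross-modality layers) follow the configuration reported in the paper, and each cross-modality layer applies bidirectional cross-attention first, then per-modality self-attention and feed-forward blocks, as the paper describes.

```python
# Minimal sketch (assumed PyTorch) of LXMERT's three-encoder layout.
# Illustrative only; names and sizes are assumptions, not the official code.
import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """One cross-modality layer: cross-attention in both directions,
    followed by per-modality self-attention + feed-forward blocks."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        # Language queries attend to visual keys/values, and vice versa.
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _ = self.vis_cross(vis, lang, lang)
        # Residual connection, then intra-modality self-attention + FFN.
        return self.lang_self(lang + lang2), self.vis_self(vis + vis2)


class LXMERTSketch(nn.Module):
    def __init__(self, vocab=30522, dim=768, heads=12,
                 n_lang=9, n_vis=5, n_cross=5, feat_dim=2048):
        super().__init__()
        # Language input: word embeddings (positional embeddings omitted).
        self.word_emb = nn.Embedding(vocab, dim)
        # Object-relationship input: RoI features + box positions
        # (boxes assumed here to be 4 normalized coordinates).
        self.feat_proj = nn.Linear(feat_dim, dim)
        self.pos_proj = nn.Linear(4, dim)
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.lang_enc = nn.ModuleList(layer() for _ in range(n_lang))
        self.vis_enc = nn.ModuleList(layer() for _ in range(n_vis))
        self.cross_enc = nn.ModuleList(
            CrossModalityLayer(dim, heads) for _ in range(n_cross))

    def forward(self, token_ids, roi_feats, roi_boxes):
        lang = self.word_emb(token_ids)
        vis = self.feat_proj(roi_feats) + self.pos_proj(roi_boxes)
        for l in self.lang_enc:           # intra-modality: language
            lang = l(lang)
        for l in self.vis_enc:            # intra-modality: objects
            vis = l(vis)
        for l in self.cross_enc:          # cross-modality exchange
            lang, vis = l(lang, vis)
        return lang, vis                  # fed to pre-training/task heads


# Toy forward pass: a 12-token sentence and 36 detected objects.
model = LXMERTSketch()
lang, vis = model(torch.randint(0, 30522, (1, 12)),
                  torch.randn(1, 36, 2048), torch.rand(1, 36, 4))
print(lang.shape, vis.shape)  # (1, 12, 768) and (1, 36, 768)
```

The sketch stops at the encoder outputs; in the paper these feed the heads for the five pre-training tasks (masked LM, feature regression, label classification, cross-modality matching, and image QA), which are omitted here.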
