Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities.
Vision-and-text understanding tasks require a model to grasp both visual concepts and textual semantics, but the most important problem is aligning the two modalities.
Datasets: VQA, GQA, NLVR
Introduction
We present one of the first works in building a pre-trained vision-and-language cross-modality framework and show its strong performance on several datasets.
The authors observe that highly capable pre-trained models already exist separately in the language and vision domains, but no pre-trained model yet exists for cross-modal tasks spanning the two, so they propose a vision-and-language cross-modality pre-trained model.
Our new cross-modality model focuses on learning vision-and-language interactions, especially for representations of a single image and its descriptive sentence.
It consists of three Transformer (Vaswani et al., 2017) encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
In order to better learn the cross-modal alignments between vision and language, we next pre-train our model with five diverse representative tasks:
- masked cross-modality language modeling,
- masked object prediction via RoI-feature regression,
- masked object prediction via detected-label classification,
- cross-modality matching,
- image question answering.
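The first three tasks all share the same corruption step: randomly mask parts of the input and train the model to recover them (in the cross-modality setting, the model can also use the other modality to do so). A minimal sketch of that masking step, assuming a BERT-style 15% mask rate and a `[MASK]` placeholder token (both illustrative conventions, not the paper's exact recipe):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Corrupt a token sequence for masked language modeling:
    each token is independently replaced by `mask_token` with
    probability `mask_prob`. `targets` records the original token
    at masked positions and None elsewhere (no loss there)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            targets.append(tok)         # ...but keep it as the label
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets
```

The same idea applies to the visual side: mask an object's RoI feature and predict either the feature itself (regression) or its detected label (classification).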
Model inputs: an image and its descriptive text
Model structure: 3 Transformer encoders
- object relationship encoder (visual relations)
- language encoder (text)
- cross-modality encoder (fusing the two)
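The data flow implied by this three-encoder structure can be wired up as below; the encoders here are placeholder callables standing in for real Transformer stacks, so only the composition (encode each modality separately, then fuse) is meaningful:

```python
def build_model(lang_encoder, obj_encoder, cross_encoder):
    """Compose the three encoders: each modality is first encoded on
    its own, then the cross-modality encoder attends across both
    hidden-state sequences."""
    def forward(words, objects):
        h_lang = lang_encoder(words)         # language encoder
        h_vis = obj_encoder(objects)         # object relationship encoder
        return cross_encoder(h_lang, h_vis)  # cross-modality encoder
    return forward

# Toy stand-ins so the wiring can be exercised without a real
# Transformer implementation:
toy_lang = lambda ws: [("lang", w) for w in ws]
toy_obj = lambda os: [("vis", o) for o in os]
toy_cross = lambda hl, hv: {"lang": hl, "vis": hv}

model = build_model(toy_lang, toy_obj, toy_cross)
out = model(["a", "dog"], ["obj1"])
```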
Model pre-training: 5 training tasks
- masked cross-modality language modeling
- masked object prediction via regression
- masked object prediction via classification
- cross-modality matching
- image question answering
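For the cross-modality matching task, training pairs are built by sometimes replacing an image's sentence with one from another image and asking the model to predict whether the pair matches. A sketch of that data construction, assuming the common 50% replacement probability (the exact rate here is an assumption, not taken from the notes):

```python
import random

def matching_pairs(images, sentences, replace_prob=0.5, seed=0):
    """Build (image, sentence, label) triples for cross-modality
    matching: with probability `replace_prob` the aligned sentence is
    swapped for one belonging to a different image (label 0,
    mismatched); otherwise the true pair is kept (label 1, matched)."""
    assert len(images) == len(sentences) >= 2
    rng = random.Random(seed)
    triples = []
    for i, (img, sent) in enumerate(zip(images, sentences)):
        if rng.random() < replace_prob:
            j = rng.randrange(len(sentences))
            while j == i:                # pick a genuinely different index
                j = rng.randrange(len(sentences))
            triples.append((img, sentences[j], 0))
        else:
            triples.append((img, sent, 1))
    return triples
```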
Model architecture
Our model takes two inputs: an image and its related sentence (e.g., a caption or a question). Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words.
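These two input sequences can be sketched as follows. Everything here is a simplified stand-in: the real model uses detector-produced RoI features with learned projections for feature and box position, while this toy version just pads the box coordinates and averages, to show that each object embedding combines appearance and position:

```python
def object_embedding(roi_feat, box):
    """Toy position-aware object embedding: pad the normalized box
    [x1, y1, x2, y2] to the RoI-feature length and average the two
    vectors (a stand-in for learned linear projections)."""
    pos = box + [0.0] * (len(roi_feat) - len(box))
    return [(f + p) / 2.0 for f, p in zip(roi_feat, pos)]

def encode_inputs(sentence, roi_features, boxes):
    """Turn the raw inputs into the two sequences the model consumes:
    the sentence becomes a word sequence, and the image becomes a
    sequence of position-aware object embeddings."""
    words = sentence.lower().split()
    objects = [object_embedding(f, b) for f, b in zip(roi_features, boxes)]
    return words, objects
```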