Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities.
Vision-and-text understanding tasks require a model to grasp both visual concepts and textual semantics, but the most important problem is aligning the two modalities.
Datasets: VQA, GQA, NLVR
Introduction
We present one of the first works in building a pre-trained vision-and-language cross-modality framework and show its strong performance on several datasets.
The authors observe that highly capable pre-trained models already exist separately in the language and vision domains, but no pre-trained model yet exists for cross-modal tasks spanning the two, so they propose a vision-and-language cross-modality pre-trained model.
Our new cross-modality model focuses on learning vision-and-language interactions, especially for representations of a single image and its descriptive sentence.
It consists of three Transformer (Vaswani et al., 2017) encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
In order to better learn the cross-modal alignments between vision and language, we next pre-train our model with five diverse representative tasks:
- masked cross-modality language modeling,
- masked object prediction via RoI-feature regression,
- masked object prediction via detected-label classification,
- cross-modality matching,
- image question answering.
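The first three tasks all share the same corruption step: randomly mask parts of the input and train the model to recover them (in the cross-modality setting, the model can also use the other modality to do so). A minimal sketch of that masking step, assuming a BERT-style 15% mask rate and a `[MASK]` placeholder token (both illustrative conventions, not the paper's exact recipe):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Corrupt a token sequence for masked language modeling:
    each token is independently replaced by `mask_token` with
    probability `mask_prob`. `targets` records the original token
    at masked positions and None elsewhere (no loss there)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            targets.append(tok)         # ...but keep it as the label
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets
```

The same idea applies to the visual side: mask an object's RoI feature and predict either the feature itself (regression) or its detected label (classification).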
Model inputs: an image and its descriptive text
Model structure: 3 Transformer encoders
- object relationship encoder (visual relations)
- language encoder (text)
- cross-modality encoder (fusing the two)
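The data flow implied by this three-encoder structure can be wired up as below; the encoders here are placeholder callables standing in for real Transformer stacks, so only the composition (encode each modality separately, then fuse) is meaningful:

```python
def build_model(lang_encoder, obj_encoder, cross_encoder):
    """Compose the three encoders: each modality is first encoded on
    its own, then the cross-modality encoder attends across both
    hidden-state sequences."""
    def forward(words, objects):
        h_lang = lang_encoder(words)         # language encoder
        h_vis = obj_encoder(objects)         # object relationship encoder
        return cross_encoder(h_lang, h_vis)  # cross-modality encoder
    return forward

# Toy stand-ins so the wiring can be exercised without a real
# Transformer implementation:
toy_lang = lambda ws: [("lang", w) for w in ws]
toy_obj = lambda os: [("vis", o) for o in os]
toy_cross = lambda hl, hv: {"lang": hl, "vis": hv}

model = build_model(toy_lang, toy_obj, toy_cross)
out = model(["a", "dog"], ["obj1"])
```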
Model pre-training: 5 training tasks
- masked cross-modality language modeling
- masked object prediction via regression
- masked object prediction via classification
- cross-modality matching
- image question answering
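For the cross-modality matching task, training pairs are built by sometimes replacing an image's sentence with one from another image and asking the model to predict whether the pair matches. A sketch of that data construction, assuming the common 50% replacement probability (the exact rate here is an assumption, not taken from the notes):

```python
import random

def matching_pairs(images, sentences, replace_prob=0.5, seed=0):
    """Build (image, sentence, label) triples for cross-modality
    matching: with probability `replace_prob` the aligned sentence is
    swapped for one belonging to a different image (label 0,
    mismatched); otherwise the true pair is kept (label 1, matched)."""
    assert len(images) == len(sentences) >= 2
    rng = random.Random(seed)
    triples = []
    for i, (img, sent) in enumerate(zip(images, sentences)):
        if rng.random() < replace_prob:
            j = rng.randrange(len(sentences))
            while j == i:                # pick a genuinely different index
                j = rng.randrange(len(sentences))
            triples.append((img, sentences[j], 0))
        else:
            triples.append((img, sent, 1))
    return triples
```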
Model architecture
Our model takes two inputs: an image and its related sentence (e.g., a caption or a question). Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words.
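These two input sequences can be sketched as follows. Everything here is a simplified stand-in: the real model uses detector-produced RoI features with learned projections for feature and box position, while this toy version just pads the box coordinates and averages, to show that each object embedding combines appearance and position:

```python
def object_embedding(roi_feat, box):
    """Toy position-aware object embedding: pad the normalized box
    [x1, y1, x2, y2] to the RoI-feature length and average the two
    vectors (a stand-in for learned linear projections)."""
    pos = box + [0.0] * (len(roi_feat) - len(box))
    return [(f + p) / 2.0 for f, p in zip(roi_feat, pos)]

def encode_inputs(sentence, roi_features, boxes):
    """Turn the raw inputs into the two sequences the model consumes:
    the sentence becomes a word sequence, and the image becomes a
    sequence of position-aware object embeddings."""
    words = sentence.lower().split()
    objects = [object_embedding(f, b) for f, b in zip(roi_features, boxes)]
    return words, objects
```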