[Paper Notes] LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities.


Vision-and-language understanding tasks require a model to grasp both visual concepts and the semantics of the text, but the most important challenge is aligning the two modalities.

Datasets: VQA, GQA, NLVR²

Introduction

we present one of the first works in building a pre-trained vision-and-language cross-modality framework and show its strong performance on several datasets.


The authors observe that both the language and vision communities already have many strong pre-trained models, but no pre-trained model yet exists for cross-modality tasks spanning the two. They therefore propose a vision-and-language cross-modality pre-training framework.

Our new cross-modality model focuses on learning vision-and-language interactions, especially for representations of a single image and its descriptive sentence.
It consists of three Transformer (Vaswani et al., 2017) encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
In order to better learn the cross-modal alignments between vision and language, we next pre-train our model with five diverse representative tasks:

  1. masked cross-modality language modeling,
  2. masked object prediction via RoI-feature regression,
  3. masked object prediction via detected-label classification,
  4. cross-modality matching,
  5. image question answering.

Model input: a single image and its descriptive sentence.

Model structure: three Transformer encoders (a minimal sketch follows the list below)

  • object relationship encoder
  • language encoder
  • cross-modality encoder
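
As a rough illustration of how the three encoders fit together, here is a minimal PyTorch-style sketch. It assumes standard `nn.TransformerEncoderLayer` blocks and the layer counts reported in the paper (9 language layers, 5 object-relationship layers, 5 cross-modality layers); class names such as `LXMERTSkeleton` and `CrossModalityLayer` are hypothetical and this is not the authors' released implementation.

```python
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """One cross-modality layer (simplified): cross-attention in both
    directions, then per-modality self-attention + feed-forward."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        lang2, _ = self.lang_cross(lang, vis, vis)   # language attends to vision
        vis2, _ = self.vis_cross(vis, lang, lang)    # vision attends to language
        return self.lang_self(lang + lang2), self.vis_self(vis + vis2)

class LXMERTSkeleton(nn.Module):
    """Hypothetical skeleton of the three-encoder layout."""
    def __init__(self, dim=768, heads=12, n_lang=9, n_vis=5, n_cross=5):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.language_encoder = nn.ModuleList(layer() for _ in range(n_lang))
        self.object_rel_encoder = nn.ModuleList(layer() for _ in range(n_vis))
        self.cross_encoder = nn.ModuleList(
            CrossModalityLayer(dim, heads) for _ in range(n_cross))

    def forward(self, word_embs, obj_embs):
        lang, vis = word_embs, obj_embs
        for blk in self.language_encoder:     # single-modality language encoder
            lang = blk(lang)
        for blk in self.object_rel_encoder:   # single-modality object encoder
            vis = blk(vis)
        for blk in self.cross_encoder:        # cross-modality encoder
            lang, vis = blk(lang, vis)
        return lang, vis
```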

Model pre-training: five pre-training tasks (see the sketch after this list)

  • masked cross-modality language modeling
  • masked object prediction via RoI-feature regression
  • masked object prediction via detected-label classification
  • cross-modality matching
  • image question answering
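
The five objectives are applied jointly during pre-training. Below is a hedged sketch of how the corresponding losses could be combined on one batch; the `heads` and `targets` dictionaries and the equal weighting are illustrative assumptions, not taken from the released code.

```python
import torch.nn.functional as F

def pretraining_loss(lang_out, vis_out, heads, targets):
    """lang_out / vis_out: cross-modality encoder outputs.
    heads: dict of task-specific prediction modules (hypothetical).
    targets: dict of labels for each task (hypothetical)."""
    losses = {}
    # 1. masked cross-modality language modeling (cross-entropy over vocab)
    losses["mlm"] = F.cross_entropy(
        heads["mlm"](lang_out).transpose(1, 2), targets["masked_tokens"],
        ignore_index=-100)
    # 2. masked object prediction via RoI-feature regression (L2-style loss)
    losses["feat"] = F.mse_loss(heads["feat"](vis_out), targets["roi_features"])
    # 3. masked object prediction via detected-label classification
    losses["label"] = F.cross_entropy(
        heads["label"](vis_out).transpose(1, 2), targets["detected_labels"],
        ignore_index=-100)
    # 4. cross-modality matching (does the sentence describe the image?)
    losses["match"] = F.binary_cross_entropy_with_logits(
        heads["match"](lang_out[:, 0]), targets["is_matched"])
    # 5. image question answering (classification over an answer vocabulary)
    losses["qa"] = F.cross_entropy(heads["qa"](lang_out[:, 0]), targets["answer"])
    return sum(losses.values()), losses
```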

Model Architecture

our model takes two inputs: an image and its related sentence (e.g., a caption or a question). Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words.
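
A small sketch of the two input embeddings, assuming Faster R-CNN RoI features (2048-d) plus 4-d bounding-box coordinates for each object, and word + index embeddings for the sentence; module names and default sizes are illustrative.

```python
import torch
import torch.nn as nn

class WordLevelEmbedding(nn.Module):
    """Word embedding + absolute index embedding, LayerNorm-ed."""
    def __init__(self, vocab_size=30522, dim=768, max_len=20):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.index = nn.Embedding(max_len, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.word(token_ids) + self.index(pos))

class ObjectLevelEmbedding(nn.Module):
    """RoI feature + bounding-box position feature, projected and averaged."""
    def __init__(self, feat_dim=2048, pos_dim=4, dim=768):
        super().__init__()
        self.feat_proj = nn.Sequential(nn.Linear(feat_dim, dim), nn.LayerNorm(dim))
        self.pos_proj = nn.Sequential(nn.Linear(pos_dim, dim), nn.LayerNorm(dim))

    def forward(self, roi_feats, boxes):
        return (self.feat_proj(roi_feats) + self.pos_proj(boxes)) / 2
```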
