A Unified Vision-Language Pre-Training Model for Visual Question Answering ("Unified Vision-Language Pre-Training for VQA")

Tiám青年

Article link: https://blog.csdn.net/xiasli123/article/details/104159395

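This post introduces a unified Vision-Language Pre-training (VLP) model. The model applies to both vision-language generation and understanding tasks, and it is pre-trained with bidirectional and seq2seq prediction objectives. Experiments show that large-scale unsupervised pre-training improves both the efficiency and the accuracy of downstream tasks.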

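This is one of a series of reading notes on visual question answering papers. The post is fairly long, so please read it patiently; you should come away with something useful. If anything is lacking, discussion and feedback are always welcome.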

1. Abstract of the Paper

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.

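In short, the paper proposes a unified Vision-Language Pre-training (VLP) model. It is unified in two respects: (1) it can be fine-tuned for either vision-language generation (for example, image captioning) or understanding (for example, visual question answering), and (2) it uses a single shared multi-layer transformer network for both encoding and decoding, unlike many existing methods that implement the encoder and the decoder as separate models. The model is pre-trained on a large number of image-text pairs with the unsupervised objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ only in the context that a prediction is conditioned on, and this is controlled by applying task-specific self-attention masks to the shared transformer network. The accompanying figure shows the authors' unified encoder-decoder model for general vision-language pre-training.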
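To make the role of the self-attention masks more concrete, here is a minimal sketch of how the two objectives could differ only in their masks. It rests on assumptions that are not stated in the abstract: the input is flattened as image regions followed by words, special tokens are omitted, and the seq2seq mask follows the common convention in which each word sees all regions plus the words to its left. The function names are hypothetical and do not come from the authors' code.

```python
def bidirectional_mask(num_regions, num_words):
    """Return a mask where 1 means 'may attend' and 0 means 'blocked'.

    In the bidirectional objective every position sees every position.
    """
    size = num_regions + num_words
    return [[1] * size for _ in range(size)]


def seq2seq_mask(num_regions, num_words):
    """Return a seq2seq mask (assumed layout: regions first, then words).

    Regions attend only to regions; a word attends to all regions,
    to itself, and to the words on its left.
    """
    size = num_regions + num_words
    mask = [[0] * size for _ in range(size)]
    for query in range(size):
        for key in range(size):
            if key < num_regions:
                # Image regions are visible to every position.
                mask[query][key] = 1
            elif query >= num_regions and key <= query:
                # A word sees itself and earlier words, never later ones.
                mask[query][key] = 1
    return mask


if __name__ == "__main__":
    # Toy sizes chosen for readability: 2 regions, 3 words.
    for row in seq2seq_mask(2, 3):
        print(row)
```

With these toy sizes, the first two rows (regions) block all word columns, and the last three rows (words) form a lower-triangular pattern over the word columns. Swapping one mask for the other is the only change between the two objectives in this sketch; the transformer weights stay shared.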

2. Network Architecture

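We denote the input image as $I$ and the associated/target sentence description (words) as $S$. We use an off-the-shelf object detector to extract a fixed number $N$ of object regions from the image, denoted as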
