For tasks at the intersection of vision and language, such pre-trained generic feature representations are lacking.
Motivation: this paper is very close in spirit to the idea of unified models; it aims to train a generic representation model that can adapt to all kinds of downstream tasks.
Introduction
To better exploit the generic representation, VL-BERT is pre-trained on both a large visual-linguistic corpus and text-only datasets. The pre-training loss on the visual-linguistic corpus comes from predicting randomly masked words or RoIs. Such pre-training sharpens VL-BERT's ability to aggregate and align visual-linguistic clues. The loss on the text-only corpus is the standard MLM loss from BERT, which improves generalization on long and complex sentences.
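As a rough illustration of the two losses on the visual-linguistic corpus (masked word prediction and masked RoI classification), here is a minimal PyTorch-style sketch. All class, tensor, and parameter names below are hypothetical and are not taken from the official VL-BERT code.

```python
# Minimal sketch of the two visual-linguistic pre-training losses described above:
# predicting randomly masked words and randomly masked RoIs.
# Names (VLPretrainingHead, text_states, roi_states, ...) are hypothetical.
import torch
import torch.nn as nn


class VLPretrainingHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_object_classes: int):
        super().__init__()
        self.word_head = nn.Linear(hidden_size, vocab_size)          # predicts masked words
        self.roi_head = nn.Linear(hidden_size, num_object_classes)   # predicts masked RoI categories

    def forward(self, text_states, roi_states, masked_word_labels, masked_roi_labels):
        # text_states: (batch, num_tokens, hidden) backbone outputs at word positions
        # roi_states:  (batch, num_rois,   hidden) backbone outputs at RoI positions
        # labels carry -100 at unmasked positions so they are ignored by the loss
        ce = nn.CrossEntropyLoss(ignore_index=-100)
        word_loss = ce(self.word_head(text_states).flatten(0, 1),
                       masked_word_labels.flatten())
        roi_loss = ce(self.roi_head(roi_states).flatten(0, 1),
                      masked_roi_labels.flatten())
        # On text-only batches, only the word (standard MLM) term would apply.
        return word_loss + roi_loss
```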
This paper is extremely similar to the original BERT, and there are many related works, so there is quite a lot of content I did not record.
- It is worth noting that the pre-training corpus contains not only bimodal (vision-language) data but also pure text data. The pure text data is included to improve the model's ability to handle long and difficult sentences.