For tasks at the intersection of vision and language, such pre-trained generic feature representations are lacking.
Motivation: this paper is very close in spirit to the idea of unified models; it aims to train a generic representation model that can adapt to all kinds of downstream tasks.
Introduction
To better exploit the generic representation, VL-BERT is pre-trained on both a large visual-linguistic corpus and text-only datasets. The pre-training loss on the visual-linguistic corpus comes from predicting randomly masked words or RoIs. Such pre-training sharpens VL-BERT's ability to aggregate and align visual-linguistic clues. The loss on the text-only corpus is the standard MLM loss from BERT, which improves generalization on long and complex sentences.
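As a rough illustration of the two losses on the visual-linguistic corpus (masked word prediction and masked RoI classification), here is a minimal PyTorch-style sketch. All class, tensor, and parameter names below are hypothetical and are not taken from the official VL-BERT code.

```python
# Minimal sketch of the two visual-linguistic pre-training losses described above:
# predicting randomly masked words and randomly masked RoIs.
# Names (VLPretrainingHead, text_states, roi_states, ...) are hypothetical.
import torch
import torch.nn as nn


class VLPretrainingHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_object_classes: int):
        super().__init__()
        self.word_head = nn.Linear(hidden_size, vocab_size)          # predicts masked words
        self.roi_head = nn.Linear(hidden_size, num_object_classes)   # predicts masked RoI categories

    def forward(self, text_states, roi_states, masked_word_labels, masked_roi_labels):
        # text_states: (batch, num_tokens, hidden) backbone outputs at word positions
        # roi_states:  (batch, num_rois,   hidden) backbone outputs at RoI positions
        # labels carry -100 at unmasked positions so they are ignored by the loss
        ce = nn.CrossEntropyLoss(ignore_index=-100)
        word_loss = ce(self.word_head(text_states).flatten(0, 1),
                       masked_word_labels.flatten())
        roi_loss = ce(self.roi_head(roi_states).flatten(0, 1),
                      masked_roi_labels.flatten())
        # On text-only batches, only the word (standard MLM) term would apply.
        return word_loss + roi_loss
```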
This paper is extremely similar to the original BERT, and there are many related works, so there is quite a lot of content I did not record.
- It is worth noting that the pre-training corpus contains not only bimodal (vision-language) data but also pure text data. The pure text data is included to improve the model's ability to handle long and difficult sentences.