VisualBERT: applicable to many kinds of tasks, with a simple architecture.
Differences from VL-BERT:
- The embedding layer is built jointly from token embeddings and visual feature embeddings.
- Position embeddings are used to align text tokens with image regions.
- VL-BERT was published after this paper.
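The joint embedding layer above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all sizes, the random lookup tables, and the linear projection of detector features are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
vocab_size, hidden, n_tokens, n_regions, feat_dim = 100, 16, 6, 3, 32

# Randomly initialised stand-ins for learned tables / projections.
token_emb = rng.normal(size=(vocab_size, hidden))
pos_emb = rng.normal(size=(n_tokens + n_regions, hidden))
seg_emb = rng.normal(size=(2, hidden))            # 0 = text, 1 = image
visual_proj = rng.normal(size=(feat_dim, hidden))  # projects region features

token_ids = rng.integers(0, vocab_size, size=n_tokens)
region_feats = rng.normal(size=(n_regions, feat_dim))  # e.g. detector outputs

# Text side: token + position + segment embeddings.
text_part = token_emb[token_ids] + pos_emb[:n_tokens] + seg_emb[0]
# Visual side: projected region feature + position + segment embeddings.
vis_part = region_feats @ visual_proj + pos_emb[n_tokens:] + seg_emb[1]

# One input sequence for the Transformer: text tokens then image regions.
inputs = np.concatenate([text_part, vis_part], axis=0)
print(inputs.shape)  # (9, 16)
```

Both modalities end up as vectors of the same width, so a standard BERT encoder can attend over the concatenated sequence.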
The three training stages:
Task-Agnostic Pre-Training: VisualBERT is trained on COCO using two visually-grounded language model objectives.
(1) Masked language modeling with the image. Some elements of the text input are masked and must be predicted, but the vectors corresponding to image regions are not masked.
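A minimal sketch of objective (1): only text tokens are candidates for masking, while the appended image-region vectors never receive a mask or a prediction label. The 15% rate follows BERT's convention; the tokens and region count are made up for illustration.

```python
import random

random.seed(0)
MASK = "[MASK]"
tokens = ["a", "dog", "on", "the", "grass"]   # toy text input
n_regions = 3  # image region vectors appended after the text; never masked

masked, labels = [], []
for tok in tokens:
    if random.random() < 0.15:   # BERT-style 15% masking rate (sketch)
        masked.append(MASK)
        labels.append(tok)       # model must predict the original token
    else:
        masked.append(tok)
        labels.append(None)      # no loss on unmasked positions
# Image regions are concatenated unmasked, and carry no MLM labels.
labels += [None] * n_regions
```

The model then predicts each masked token using both the surrounding text and the (always visible) image regions as context.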
(2) Sentence-image prediction. For COCO, where multiple captions correspond to one image, we provide a text segment consisting of two captions. One caption describes the image, while the other has a 50% chance of being another caption of the same image and a 50% chance of being a randomly drawn caption. The model is trained to distinguish these two cases.
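The pair construction for objective (2) can be sketched like this. The toy caption dictionary and the `make_pair` helper are hypothetical; they only illustrate the 50/50 sampling of a matching vs. random second caption.

```python
import random

random.seed(0)

# Toy COCO-style data: each image id maps to several captions (made up).
captions = {
    "img1": ["a dog runs in a field", "a brown dog outside"],
    "img2": ["a red car on a street", "a parked car near a curb"],
}

def make_pair(img_id):
    """Return (caption_a, caption_b, label) for sentence-image prediction.

    label = 1 if caption_b also describes img_id, 0 if it was drawn from
    a different image.
    """
    cap_a, cap_b_match = random.sample(captions[img_id], 2)
    if random.random() < 0.5:
        return cap_a, cap_b_match, 1   # second caption matches the image
    other = random.choice([i for i in captions if i != img_id])
    return cap_a, random.choice(captions[other]), 0  # mismatched caption

pair = make_pair("img1")
```

During pre-training the two captions are fed together with the image's region features, and a classifier on top of the encoder predicts the binary label.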