Reading notes: VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Contribution
- The paper proposes VL-BERT (a single-stream model) that is pre-trained end-to-end on text and images jointly, yielding clear gains on a range of downstream image-text tasks (visual commonsense reasoning, visual question answering, referring expression comprehension).
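The "single stream" design means word tokens and image region features are concatenated into one sequence and fed through a single Transformer. A minimal sketch of that input construction (function name, special-token handling, and dimensions are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def build_single_stream_input(text_token_embs, region_feats, cls, sep, end):
    """Concatenate [CLS], text token embeddings, [SEP], image region
    (RoI) features, and [END] into one sequence for a single Transformer.

    Hypothetical helper illustrating the single-stream idea; VL-BERT's
    actual input additionally adds segment and position embeddings.
    """
    return np.concatenate(
        [cls[None], text_token_embs, sep[None], region_feats, end[None]],
        axis=0,
    )

d = 8                                    # toy embedding dimension
text = np.random.rand(3, d)              # 3 word-token embeddings
regions = np.random.rand(2, d)           # 2 image-region (RoI) features
cls, sep, end = (np.random.rand(d) for _ in range(3))

seq = build_single_stream_input(text, regions, cls, sep, end)
print(seq.shape)                         # 1 + 3 + 1 + 2 + 1 = 8 tokens
```

Because both modalities share one sequence, every Transformer layer can attend across text and image elements, in contrast to two-stream designs that keep separate encoders per modality.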
- Previous approaches to text-and-image tasks all followed the same recipe: combine base networks pre-trained for image recognition and NLP respectively in a task-specific way. The task-specific model is directly fine-tuned for the specific target task, without any generic visual-linguistic pre-training.
  The problem with skipping joint image-text pre-training: the task-specific model may well suffer from overfitting when the data for the target task is scarce. Also, due to the ta