This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that
(1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and
(2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.
Meaning of "unified":
- after fine-tuning on a downstream task, the model can perform both generation and understanding tasks (broad coverage)
- a single transformer network serves as both the encoder and the decoder
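The shared encoder/decoder works by switching the self-attention mask on the same transformer: a fully bidirectional mask for understanding tasks and a seq2seq mask for generation, where target tokens only attend to the source and to earlier targets. A minimal NumPy sketch of the two mask patterns (function names are illustrative, not from the paper):

```python
import numpy as np

def bidirectional_mask(n):
    # Understanding mode: every token attends to every other token.
    return np.ones((n, n), dtype=int)

def seq2seq_mask(n_src, n_tgt):
    # Generation mode: source tokens attend only within the source;
    # target tokens attend to the source and to earlier target tokens.
    n = n_src + n_tgt
    mask = np.zeros((n, n), dtype=int)
    mask[:n_src, :n_src] = 1                                  # source <-> source
    mask[n_src:, :n_src] = 1                                  # target -> source
    mask[n_src:, n_src:] = np.tril(np.ones((n_tgt, n_tgt),   # causal within target
                                           dtype=int))
    return mask
```

The same transformer weights are reused in both modes; only this mask changes, which is what makes one model usable for captioning and VQA alike.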
Introduction
Existing methods
Although significant improvements have been reported on individual downstream tasks using different pre-trained models, it remains challenging to pre-train a single unified model that works well for both generation and understanding tasks.