XGPT: Cross-modal Generative Pre-Training for Image Captioning
Contribution
- Most existing VL pre-trained models are Transformer-encoder architectures, which makes them unsuitable for vision-and-language generation tasks, for two reasons:
  - Pre-trained models developed for understanding tasks only provide an encoder; to support generation tasks, a separate decoder has to be trained, as in VideoBERT and CBT (see the sketch after this list).
  - Almost all existing VL pre-training objectives involve masked region or span prediction (VLP included); none of them directly trains the model for autoregressive text generation, so the pre-trained weights do not align with generation-style decoding.
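
A minimal PyTorch sketch (my illustration, not the paper's code) of what "separate decoder" means in practice: the encoder stands in for a pre-trained understanding model whose weights would be loaded, while the decoder is initialized from scratch and must be trained during fine-tuning. All layer sizes, sequence lengths, and feature shapes here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

d_model = 512

# Stand-in for a pre-trained, encoder-only VL model (BERT-style);
# in practice its weights would be loaded from pre-training, not random.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

# Separate decoder, initialized from scratch -- this is exactly the part
# that encoder-only pre-training leaves untrained.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

region_feats = torch.randn(2, 36, d_model)   # e.g., 36 image-region features
caption_embs = torch.randn(2, 20, d_model)   # embedded caption tokens

memory = encoder(region_feats)               # reusable pre-trained representations
causal_mask = nn.Transformer.generate_square_subsequent_mask(20)
hidden = decoder(caption_embs, memory, tgt_mask=causal_mask)
print(hidden.shape)                          # torch.Size([2, 20, 512])
```

The mismatch is visible in the code: only `encoder` benefits from pre-training, while `decoder` (and its cross-attention into `memory`) starts from random weights, which is what motivates an encoder-decoder pre-training scheme like XGPT.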