Reading notes: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Contribution
- Proposes the ViLBERT model, a two-stream architecture: two BERT-style streams encode the text and the image separately and exchange information through cross-attention (co-attentional transformer layers). The model is pretrained on two proxy tasks and then fine-tuned on four downstream tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval.
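A minimal numpy sketch of the co-attention exchange described above, to make the "two streams querying each other" idea concrete. The projection matrices here are random stand-ins for learned weights, and the real ViLBERT block additionally uses multi-head attention, feed-forward layers, and residual connections; this only illustrates the key/query/value swap between streams.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q_in, kv_in, Wq, Wk, Wv):
    """Scaled dot-product attention: queries come from one stream,
    keys and values come from the other stream."""
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

def co_attention(h_text, h_img, d_k=64, seed=0):
    """One co-attentional exchange between the two streams.
    h_text: (T, d) token features; h_img: (V, d) image-region features.
    The six projection matrices are random placeholders for learned
    parameters (hypothetical, for illustration only)."""
    d = h_text.shape[1]
    rng = np.random.default_rng(seed)
    W = [rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(6)]
    text_out = attend(h_text, h_img, W[0], W[1], W[2])  # text attends to image
    img_out = attend(h_img, h_text, W[3], W[4], W[5])   # image attends to text
    return text_out, img_out
```

Each stream's updated representation is thus conditioned on the other modality, which is the mechanism the paper relies on for learning visiolinguistic grounding during pretraining rather than during task training.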
- Points out the key problem with mainstream vision-and-language models:
"the dominant strategy is to start with separate language and vision models pretrained for other large-scale tasks and then learn grounding as part of task training – often resulting in myopic groundings that generalize poorly when paired visiolinguistic data is limited or biased"