Multimodal Pre-training Papers
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, plus an additional zero-shot caption-based image retrieval setting.
![](https://img-blog.csdnimg.cn/41508b6a679645d7a538dabb232ebf6d.png)
key technical innovation:
introducing separate streams for vision and language processing that communicate through co-attentional transformer layers.
Why two streams?
![](https://img-blog.csdnimg.cn/bb19535a14c640f6857370c3c791065b.png)
![](https://img-blog.csdnimg.cn/1610e5df35b049d09bd98f6f7d13a585.png)
notes:
Given an image $I$ represented as a set of region features $v_1, \ldots, v_T$ and a text input $w_0, \ldots, w_T$, our model outputs final representations $h_{v_0}, \ldots, h_{v_T}$ and $h_{w_0}, \ldots, h_{w_T}$. Notice that exchange between the two streams is restricted to be between specific layers and that the text stream has significantly more processing before interacting with visual features, matching our intuitions that our chosen visual features are already fairly high-level and require limited context-aggregation compared to words in a sentence.
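To make the co-attentional layer concrete, here is a minimal PyTorch-style sketch (the class name `CoAttentionBlock`, the dimensions, and the layer ordering are my assumptions rather than the released ViLBERT code): visual queries attend over language keys/values while language queries attend over visual keys/values, and each stream then passes through its own residual feed-forward sub-layer.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Minimal sketch of a ViLBERT-style co-attentional layer (illustrative only).

    Each stream's queries attend over the other stream's keys/values, followed by
    the usual residual + feed-forward sub-layers per stream.
    """
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_v = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.ffn_w = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.norm_v1, self.norm_v2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.norm_w1, self.norm_w2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h_v, h_w):
        # h_v: (batch, num_regions, d_model) image-region features
        # h_w: (batch, num_tokens,  d_model) word features
        v_att, _ = self.attn_v(query=h_v, key=h_w, value=h_w)  # vision attends to language
        w_att, _ = self.attn_w(query=h_w, key=h_v, value=h_v)  # language attends to vision
        h_v = self.norm_v1(h_v + v_att)
        h_w = self.norm_w1(h_w + w_att)
        h_v = self.norm_v2(h_v + self.ffn_v(h_v))
        h_w = self.norm_w2(h_w + self.ffn_w(h_w))
        return h_v, h_w
```

In the full model such co-attention blocks are interleaved with ordinary within-stream transformer blocks, and, as noted above, the text stream runs through more BERT-style layers before the first exchange.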
![](https://img-blog.csdnimg.cn/7ae72a7424a64071b537d40a6eddb092.png)
![](https://img-blog.csdnimg.cn/57f80734205e40fcb4f8d5a5b0f4198e.png)
![](https://img-blog.csdnimg.cn/a50ac51e20ed43b8a84cb12740978072.png)
![](https://img-blog.csdnimg.cn/c4d02e0c5ec84f27a1ca0c7389184751.png)
![](https://img-blog.csdnimg.cn/0b021efc5bd0439c9438fd448e289431.png)
![](https://img-blog.csdnimg.cn/87fdaecf0f3046f4a307940c28406ccd.png)
![](https://img-blog.csdnimg.cn/beb791554296446b8622be69610704b4.png)
That concludes the first paper.
VisualBERT: A Simple and Performant Baseline for Vision and Language
Two visually-grounded language model objectives are used for pre-training: (1) part of the text is masked and the model learns to predict the masked words based on the remaining text and visual context; (2) the model is trained to determine whether the provided text matches the image. We show that such pre-training on image caption data is important for VisualBERT to learn transferable text and visual representations.
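As a concrete illustration of the two objectives, here is a hedged single-stream sketch in PyTorch (the class name `VisualBERTPretrainSketch`, the shapes, and the use of the first token for the matching decision are assumptions for illustration, not the released code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualBERTPretrainSketch(nn.Module):
    """Illustrative sketch of VisualBERT's two pre-training objectives."""
    def __init__(self, vocab_size=30522, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # single stream
        self.mlm_head = nn.Linear(d_model, vocab_size)  # (1) masked language modeling
        self.match_head = nn.Linear(d_model, 2)         # (2) text-image matching

    def forward(self, text_emb, region_emb, mlm_labels, match_labels):
        # text_emb:   (B, T, d) embedded (partially masked) caption tokens
        # region_emb: (B, R, d) projected detector region features
        x = torch.cat([text_emb, region_emb], dim=1)
        h = self.encoder(x)
        T = text_emb.size(1)
        # (1) predict masked words from the remaining text + visual context
        mlm_logits = self.mlm_head(h[:, :T])
        mlm_loss = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                                   mlm_labels.reshape(-1), ignore_index=-100)
        # (2) decide whether the caption matches the image (first-token representation)
        match_loss = F.cross_entropy(self.match_head(h[:, 0]), match_labels)
        return mlm_loss + match_loss
```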
Comprehensive experiments are conducted on four vision-and-language tasks: VQA, VCR, NLVR2, and region-to-phrase grounding.
![](https://img-blog.csdnimg.cn/d9f386d4dfe94677b950d69bad4414ab.png)
![](https://img-blog.csdnimg.cn/0bae11b1a4834750bd77c7c9565debc1.png)
![](https://img-blog.csdnimg.cn/a322a1b78fb14948a1c5b5b6e483f9a2.png)
![](https://img-blog.csdnimg.cn/4115f5d6dd6344b6be3c3a828cd7e0e7.png)
![](https://img-blog.csdnimg.cn/ebd147866814432ba77afeb688198ff8.png)
That concludes the second paper.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
![](https://img-blog.csdnimg.cn/075e3e2a1a5b49b7b426b7dfbb4c9efd.png)
![](https://img-blog.csdnimg.cn/4669e4515e844db88373c815e0cfd1f4.png)
Approach
Pre-training tasks: masked language modeling (MLM), masked object classification (MOC), and visual-linguistic matching (VLM).
Fine-tuning on downstream tasks: image-text retrieval and visual commonsense reasoning.
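To illustrate the three pre-training tasks, here is a minimal sketch of the corresponding heads and their combined loss (the class name, the label conventions such as `ignore_index=-100`, and the object-class count are illustrative assumptions, not Unicoder-VL's released code). MOC here means recovering the detector's object category for a masked region; VLM is a binary image-caption matching decision on a pooled representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnicoderVLHeadsSketch(nn.Module):
    """Illustrative heads for the three pre-training tasks (assumed shapes)."""
    def __init__(self, d_model=768, vocab_size=30522, n_obj_classes=1600):
        super().__init__()
        self.mlm_head = nn.Linear(d_model, vocab_size)     # masked language modeling
        self.moc_head = nn.Linear(d_model, n_obj_classes)  # masked object classification
        self.vlm_head = nn.Linear(d_model, 2)               # visual-linguistic matching

    def forward(self, h_text, h_region, h_cls, mlm_labels, moc_labels, vlm_labels):
        # h_text:   (B, T, d) hidden states of (masked) caption tokens
        # h_region: (B, R, d) hidden states of (masked) image regions
        # h_cls:    (B, d)    pooled [CLS]-style representation
        mlm_loss = F.cross_entropy(self.mlm_head(h_text).flatten(0, 1),
                                   mlm_labels.flatten(), ignore_index=-100)
        # MOC: recover the detector's object category for masked regions
        moc_loss = F.cross_entropy(self.moc_head(h_region).flatten(0, 1),
                                   moc_labels.flatten(), ignore_index=-100)
        # VLM: binary decision on whether the caption and image are a true pair
        vlm_loss = F.cross_entropy(self.vlm_head(h_cls), vlm_labels)
        return mlm_loss + moc_loss + vlm_loss
```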
![](https://img-blog.csdnimg.cn/68f4a11eadda45859d095a1063ba5471.png)
![](https://img-blog.csdnimg.cn/451c08026c71425484027369215da46e.png)
That concludes the third paper.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
It consists of three Transformer encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder.
The model is pre-trained with five diverse representative tasks:
(1) masked cross-modality language modeling;
(2) masked object prediction via RoI-feature regression;
(3) masked object prediction via detected-label classification;
(4) cross-modality matching;
(5) image question answering.
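For concreteness, the following is a hedged sketch of heads for these five losses (the names, shapes, answer-vocabulary size, and the choice of smooth-L1 for the RoI-feature regression are my assumptions for illustration, not the exact LXMERT implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LXMERTPretrainHeadsSketch(nn.Module):
    """Illustrative sketch of the five pre-training losses (assumed names/shapes)."""
    def __init__(self, d_model=768, vocab_size=30522,
                 n_obj_classes=1600, roi_dim=2048, n_answers=9500):
        super().__init__()
        self.mlm_head = nn.Linear(d_model, vocab_size)      # (1) masked cross-modality LM
        self.feat_head = nn.Linear(d_model, roi_dim)         # (2) RoI-feature regression
        self.label_head = nn.Linear(d_model, n_obj_classes)  # (3) detected-label classification
        self.match_head = nn.Linear(d_model, 2)               # (4) cross-modality matching
        self.qa_head = nn.Linear(d_model, n_answers)          # (5) image question answering

    def forward(self, h_lang, h_vis, h_cls, targets):
        losses = {}
        losses["mlm"] = F.cross_entropy(self.mlm_head(h_lang).flatten(0, 1),
                                        targets["mlm_labels"].flatten(),
                                        ignore_index=-100)
        # masked regions: regress the original RoI feature (smooth-L1 assumed here)
        losses["feat"] = F.smooth_l1_loss(self.feat_head(h_vis), targets["roi_feats"])
        losses["label"] = F.cross_entropy(self.label_head(h_vis).flatten(0, 1),
                                          targets["obj_labels"].flatten(),
                                          ignore_index=-100)
        losses["match"] = F.cross_entropy(self.match_head(h_cls), targets["is_matched"])
        losses["qa"] = F.cross_entropy(self.qa_head(h_cls), targets["answer"])
        return sum(losses.values()), losses
```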
![](https://img-blog.csdnimg.cn/8b1666f2fa80420eaee8be1a7f2338f6.png)
![](https://img-blog.csdnimg.cn/5d73c7095f694cf18c4720fa9e9022bd.png)
![](https://img-blog.csdnimg.cn/bdea3d4c27a343c49fbbec8bfe4d4d74.png)
![](https://img-blog.csdnimg.cn/d8abf36ebc8d4824a8f905b1406b29e3.png)
That concludes the fourth paper.