VisualBERT: A Simple and Performant Baseline for Vision and Language

连理o

Last modified: 2022-03-18 19:51:30


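First published: 2022-01-01 20:38:44

Copyright notice: this is an original article by the blog author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_42437114/article/details/122269132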


Contents

VisualBERT
Experiment
References

VisualBERT

Architecture

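The network architecture is the same as BERT, and the text input is processed exactly as in BERT. What follows covers the visual side. Let $F$ be the set of visual embeddings. Each $f\in F$ is the feature for one bounding region and is the sum of three embeddings:
- (1) $f_o$: the visual feature representation of the bounding region, computed by Faster-RCNN.
- (2) $f_s$: a segment embedding that indicates whether an embedding is an image embedding or a text embedding.
- (3) $f_p$: a position embedding. $f_p$ is used only when the alignment between words and bounding regions is given as part of the input (e.g. in VCR, where the dataset provides alignments between words and bounding regions). It is the sum of the position embeddings of the words aligned with the bounding region.

If the visual embedding has a different dimension from the text embedding, a fully connected layer is also needed to project the visual embedding to the same dimension as the text embedding.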
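As a rough illustration of how $f = f_o + f_s + f_p$ could be assembled for a single bounding region, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the toy dimensions, the random values, and the choice to apply the projection to the detector feature before summing are all assumptions made for this example.

```python
import random

def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def project(W, x):
    """Fully connected projection W @ x (bias omitted for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def visual_embedding(region_feat, segment_emb, pos_table,
                     aligned_positions=None, W=None):
    """Build f = f_o + f_s (+ f_p) for one bounding region (hypothetical helper)."""
    # f_o: detector feature, projected only when its size differs from the text size.
    f_o = project(W, region_feat) if W is not None else list(region_feat)
    # f_s: segment embedding marking this input as an image embedding.
    f = vec_add(f_o, segment_emb)
    # f_p: sum of the position embeddings of the aligned words,
    # added only when word-region alignments are available.
    if aligned_positions:
        for p in aligned_positions:
            f = vec_add(f, pos_table[p])
    return f

# Toy setup; every number below is made up for illustration.
rng = random.Random(0)
hidden, det_dim = 4, 6
W = [[rng.uniform(-1, 1) for _ in range(det_dim)] for _ in range(hidden)]
pos_table = [[float(i)] * hidden for i in range(8)]  # stand-in position embeddings
segment_emb = [0.5] * hidden
region_feat = [rng.uniform(-1, 1) for _ in range(det_dim)]

f_no_align = visual_embedding(region_feat, segment_emb, pos_table, W=W)
f_aligned = visual_embedding(region_feat, segment_emb, pos_table,
                             aligned_positions=[2, 3], W=W)
```

In this toy setup the region is aligned with the words at positions 2 and 3, so `f_aligned` differs from `f_no_align` by exactly the sum of those two position embeddings.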

Pre-training

Task-Agnostic Pre-Training: each image in the COCO dataset has 5 independent captions. VisualBERT is pre-trained on COCO with two visually-grounded language model objectives:
- (1) Masked language modeling with the image: some word embeddings are masked, and the model must predict the masked content from the visual information and the remaining text.
- (2) Sentence-image prediction: for COCO, where multiple captions correspond to one image, the model is given a text segment consisting of two captions. One of the captions describes the image, while the other has a 50% chance of being another caption of the same image and a 50% chance of being a randomly drawn caption. The model is trained to distinguish these two situations.

Task-Specific Pre-Training: before fine-tuning, the model is first pre-trained on the downstream task's dataset with the masked language modeling with the image objective, which helps it adapt to the new target domain.
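To make the two task-agnostic objectives more concrete, the sketch below shows one way the training inputs might be built. It is only an illustration under stated assumptions: the helper names, the toy captions, and the `mask_prob=0.15` default (borrowed from common BERT practice, not stated in this post) are hypothetical, and the replacement strategy is simplified to always inserting `[MASK]`.

```python
import random

def mask_text_tokens(tokens, rng, mask_prob=0.15):
    """Mask text tokens only; image regions are never passed in, so never masked.

    Returns (masked_tokens, labels), where labels[i] is the original token at
    masked positions and None elsewhere. mask_prob is an assumed value.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def make_sentence_image_example(captions_by_image, image_id, rng):
    """Return (caption_a, caption_b, label) for one image.

    label 1: caption_b is another caption of the same image.
    label 0: caption_b is drawn from a different image.
    """
    own = captions_by_image[image_id]
    caption_a = rng.choice(own)
    if rng.random() < 0.5:
        others = [c for c in own if c != caption_a]
        return caption_a, rng.choice(others), 1
    other_ids = [i for i in captions_by_image if i != image_id]
    caption_b = rng.choice(captions_by_image[rng.choice(other_ids)])
    return caption_a, caption_b, 0

# Toy data (COCO has 5 captions per image; 3 are used here for brevity).
captions_by_image = {
    "img1": ["a dog runs on grass", "a brown dog outdoors", "a dog playing in a park"],
    "img2": ["a plate of pasta", "pasta with tomato sauce", "a dinner dish on a table"],
}

rng = random.Random(0)
cap_a, cap_b, label = make_sentence_image_example(captions_by_image, "img1", rng)
masked, labels = mask_text_tokens((cap_a + " [SEP] " + cap_b).split(), rng)
```

In a real pipeline the masked token sequence would be fed to the model together with the region embeddings of the image, and `label` would supervise the sentence-image prediction head.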

Experiment

Experiments on four vision-and-language tasks (VQA, VCR, NLVR2, and Flickr30K) show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler.

References

VisualBERT: A Simple and Performant Baseline for Vision and Language
