Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective


辉辉小学生


Category column: 多模态paper. Tags: transformer, deep learning, natural language processing

Article link: https://blog.csdn.net/huihuixiaoxue/article/details/125337808


Abstract:

background: the development of transformer-based Vision-Language (VL) models.

purpose: better understand the representations produced by those models.

details: compare pre-trained and fine-tuned representations at the vision, language and multimodal levels, using a set of probing tasks to evaluate state-of-the-art Vision-Language models. The authors also introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias.

results:

Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size.

On semantically adversarial examples, the models are able to pinpoint fine-grained multimodal differences.

Fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability.

Introduction:

VL tasks, such as visual question answering, cross-modal retrieval or generation, are notoriously difficult because models must build sensible multimodal representations that can relate fine-grained elements of the text and the picture.

How is multimodal information encoded in the representations learned by those models?

How affected are they by the various biases and properties of their training data?

Although previous studies have shed light on some particular aspects of transformer-based VL models, they lack a more systematic analysis of the monomodal biases that impede the nature of the learned representations.

work:

studying the multimodal capacity of VL representations

exploring what information is learned and forgotten between pre-training and fine-tuning (this could show the current limits of the pre-training process)

we probe three VL models: UNITER, LXMERT and ViLT (both pre-trained and fine-tuned models)

findings:

UNITER reaches better overall results on the language modality.

ViLT reaches better results on the vision modality.

While the models show an ability to identify colors, they do not yet have the multimodal capacity to distinguish object size and position.

Related Work (omitted)

Methodology

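VLMpre -> pres1 (representations from the pre-trained model)

VLMpre, fine-tuned on task T -> pres2 (representations from the fine-tuned model)

pres1 (or pres2) -> probing task P -> how much information related to P the representation contains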
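
The pipeline above can be illustrated with a minimal sketch. It assumes the probe is a simple linear classifier trained on frozen representations, which is a common choice in probing work; the notes here do not state which probe the paper uses. The vectors are synthetic stand-ins, not outputs of UNITER, LXMERT or ViLT, and every name in the code (`train_linear_probe`, `probe_accuracy`, `toy_representations`, `run_probe`) is hypothetical. The toy data is built so that pres1 encodes the probed property and pres2 does not, purely to show how a probe would surface "forgotten" information; it says nothing about the paper's actual results.

```python
import math
import random


def train_linear_probe(features, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression probe with plain SGD.

    Only the probe's weights are updated; the representations stay frozen.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Share of examples whose probing label the probe predicts correctly."""
    hits = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((z > 0.0) == (y == 1))
    return hits / len(labels)


def toy_representations(labels, informative, rng, dim=4):
    """Synthetic stand-in for model outputs (assumption, not real VL features)."""
    reps = []
    for y in labels:
        vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if informative:
            # Dimension 0 carries the probed property, plus a little noise.
            vec[0] = (1.0 if y == 1 else -1.0) + rng.uniform(-0.3, 0.3)
        reps.append(vec)
    return reps


def run_probe(reps, labels, n_train=200):
    """Train the probe on the first n_train examples, evaluate on the rest."""
    w, b = train_linear_probe(reps[:n_train], labels[:n_train])
    return probe_accuracy(w, b, reps[n_train:], labels[n_train:])


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(300)]  # labels of probing task P
pres1 = toy_representations(labels, informative=True, rng=rng)   # "pre-trained"
pres2 = toy_representations(labels, informative=False, rng=rng)  # "fine-tuned"

acc_pres1 = run_probe(pres1, labels)
acc_pres2 = run_probe(pres2, labels)
print(f"probe accuracy on pres1: {acc_pres1:.2f}")
print(f"probe accuracy on pres2: {acc_pres2:.2f}")
```

A high held-out accuracy suggests that the information needed for P is linearly recoverable from the representation; accuracy near chance suggests it is not. Comparing the two numbers for pres1 and pres2 is the kind of comparison the methodology describes.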


(Author's note: I got too tired to write up the rest, so the remaining sections are just screenshots of the slides.)