【Visual Question Answering】Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem

Paper title: Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem
Code: not available
Year: 2022
Venue: TPAMI


Abstract

Several studies have recently pointed out that existing Visual Question Answering (VQA) models heavily suffer from the language prior problem, which refers to capturing superficial statistical correlations between the question type and the answer while ignoring the image contents. Numerous efforts have been dedicated to strengthening the image dependency by creating delicate models or introducing extra visual annotations. However, these methods cannot sufficiently explore how the visual cues explicitly affect the learned answer representation, which is vital for alleviating language reliance. Moreover, they generally emphasize the class-level discrimination of the learned answer representation, which overlooks the more fine-grained instance-level patterns and demands further optimization. In this paper, we propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration, which can better investigate the fine-grained visual effects and mitigate the language prior problem by learning the instance-level characteristics. Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents, based on which the collaborative learning of intra-instance invariance and inter-instance discrimination is implemented by two well-designed discriminators. Besides, we implement an information bottleneck modulator on the latent space for further bias alleviation and representation calibration. We apply our visual perturbation-aware framework to three orthodox baselines, and the experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also justify its robustness on the balanced VQA benchmark.


Background

Existing VQA models tend to exploit superficial shortcuts between the question (i.e., the question type) and the answer. As a result, the role of the visual modality is weakened and VQA degenerates into a pure language-matching problem. Current mainstream approaches alleviate the language prior problem by strengthening the image dependency, and they fall into two categories: methods with visual annotations and annotation-free methods. The former explicitly exploit external visual annotations to guide the learning of visual content, but collecting human annotations is expensive and time-consuming. Consequently, annotation-free methods have become the dominant paradigm.
However, the following limitations remain:
1) Visual augmentation strategies are used to reduce language reliance, but they cannot sufficiently determine how these visual cues affect the learned answer representation.
2) Existing methods usually emphasize class-level discrimination of the answer representation while neglecting finer-grained internal structure, such as inter-instance discrimination and intra-instance invariance, which may lead to inferior performance.

Contributions

This paper proposes a simple yet effective visual perturbation-aware calibration framework to mitigate language reliance in VQA. It is the first attempt to overcome the language prior problem from the perspective of instance-level discriminative feature representation.

  • The framework comprises four components: a mask-based perturbation controller, an information bottleneck modulator, a class-aware discriminator, and a relation-aware discriminator. On top of the original image features, the visual perturbation controller automatically constructs two kinds of curated image features with different perturbation extents. Given the hard-perturbed image features together with the original ones, the class-aware discriminator captures inter-instance discrimination by distinguishing their semantic differences; given the soft-perturbed image features, the relation-aware discriminator learns intra-instance invariant correlations. To make the learned latent representation carry minimal sufficient information and remain free from input bias, a variational information bottleneck modulator is further applied to better facilitate the learning of the two discriminators (see the sketch after this list).
  • The visual perturbation-aware learning strategy is model-agnostic and can easily be plugged into existing state-of-the-art VQA models to reduce language reliance and improve their reasoning performance.
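Since no official code is released, the sketch below illustrates, in PyTorch, one plausible way to realize the four components named above. The masking ratios, network shapes, and loss forms are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerturbationController(nn.Module):
    """Builds hard- and soft-perturbed copies of the region features by random masking.

    The masking ratios are illustrative assumptions, not the paper's values.
    """
    def __init__(self, hard_ratio: float = 0.7, soft_ratio: float = 0.2):
        super().__init__()
        self.hard_ratio, self.soft_ratio = hard_ratio, soft_ratio

    @staticmethod
    def _mask(v: torch.Tensor, ratio: float) -> torch.Tensor:
        # v: (B, K, D) region features; randomly zero out roughly `ratio` of the K regions
        keep = (torch.rand(v.shape[:2], device=v.device) > ratio).float().unsqueeze(-1)
        return v * keep

    def forward(self, v):
        return self._mask(v, self.hard_ratio), self._mask(v, self.soft_ratio)


class IBModulator(nn.Module):
    """Variational information-bottleneck head applied to the fused latent representation."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu, self.logvar = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)            # reparameterisation
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()  # KL to N(0, I)
        return z, kl


class ClassAwareDiscriminator(nn.Module):
    """Separates original from hard-perturbed representations (inter-instance discrimination)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z_orig, z_hard):
        logits = torch.cat([self.net(z_orig), self.net(z_hard)]).squeeze(-1)
        labels = torch.cat([torch.ones(len(z_orig)), torch.zeros(len(z_hard))]).to(logits.device)
        return F.binary_cross_entropy_with_logits(logits, labels)


def relation_aware_loss(z_orig, z_soft):
    """Intra-instance invariance: a soft perturbation should barely change the representation."""
    return 1.0 - F.cosine_similarity(z_orig, z_soft, dim=-1).mean()
```

During training, the base VQA model would produce fused answer representations for the original, hard-perturbed, and soft-perturbed visual features; the overall objective would then combine the standard VQA classification loss with the class-aware loss, the relation-aware loss, and the KL term of the bottleneck, weighted by trade-off hyperparameters.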

VQA task definition:

Given a batch of data samples consisting of B triplets of an image $V_i$, a question $Q_i$, and a ground-truth answer set $A_i$, denoted as $B = \{V_i, Q_i, A_i\}_{i=1}^{B}$, a VQA model aims to learn a mapping function $H_{vqa}$ that produces accurate answers. It typically involves three parts: the visual and textual encoders, the base VQA model, and the answer classifier.

As the visual and textual encoders, a pre-trained Faster R-CNN model $U_v$ encodes each image $v_i$ into a visual embedding matrix $V_i = U_v(v_i)$, and the textual encoder maps each question into its question embedding in an analogous way.
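To make the three-part pipeline above concrete, here is a minimal, self-contained PyTorch sketch of a generic $H_{vqa}$ with a simple question-guided attention fusion (in the spirit of UpDn-style backbones). The encoder choices, hidden sizes, and answer-vocabulary size are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn


class GenericVQA(nn.Module):
    """Schematic H_vqa: fuse pre-extracted region features with an encoded question,
    then classify over the answer vocabulary. All dimensions are placeholders."""
    def __init__(self, v_dim=2048, q_dim=1024, hid=1024, num_answers=3129, vocab=20000):
        super().__init__()
        self.embed = nn.Embedding(vocab, 300)                     # word embeddings
        self.q_enc = nn.GRU(300, q_dim, batch_first=True)         # textual encoder
        self.v_proj = nn.Linear(v_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.classifier = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                        nn.Linear(hid, num_answers))

    def forward(self, V, q_tokens):
        # V: (B, K, v_dim) Faster R-CNN region features, i.e. V_i = U_v(v_i)
        _, q = self.q_enc(self.embed(q_tokens))                   # q: (1, B, q_dim)
        q = self.q_proj(q.squeeze(0))                             # (B, hid)
        v = self.v_proj(V)                                        # (B, K, hid)
        att = torch.softmax((v * q.unsqueeze(1)).sum(-1), dim=1)  # question-guided attention
        fused = (att.unsqueeze(-1) * v).sum(1) * q                # joint representation
        return self.classifier(fused)                             # answer logits
```

Calling `GenericVQA()(V, q_tokens)` with `V` of shape (B, 36, 2048) and `q_tokens` of shape (B, T) returns a (B, num_answers) score matrix; the perturbation-aware components sketched earlier would then be attached on top of the fused representation.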
