[Paper Reading] GQA-OOD: a dataset and evaluation method for question answering on rare samples (Corentin Kervadec et al., arXiv, 2020, code available)

I. Background

Paper title: "Roses are red, violets are blue ... but should VQA expect them to?"

The authors are all from Orange Labs, France. I am writing up this paper mainly because the title is unusually artistic, unlike the usual "A ...-based method for ..." naming convention.

Paper download link: https://arxiv.org/pdf/2006.05121.pdf

Citation: Corentin Kervadec, Grigory Antipov, Moez Baccouche and Christian Wolf. "Roses are red, violets are blue ... but should VQA expect them to?". arXiv preprint arXiv:2006.05121, 2020.

Project page: so far the authors have released only the dataset, at https://github.com/gqaood/GQA-OOD

II. Overview

First, the abstract of the paper:

To be reliable on rare events is an important requirement for systems based on machine learning. In this work we focus on Visual Question Answering (VQA), where, in spite of recent efforts, datasets remain imbalanced, causing shortcomings of current models: tendencies to overly exploit dataset biases and struggles to generalise to unseen associations of concepts. We focus on a systemic evaluation of model error distributions and address fundamental questions: How is the prediction error distributed? What is the prediction accuracy on infrequent vs. frequent concepts? In this work, we design a new benchmark based on a fine-grained reorganization of the GQA dataset [1], which allows to precisely answer these questions. It introduces distributions shifts in both validation and test splits, which are defined on question groups and are thus tailored to each question. We performed a large-scale study and we experimentally demonstrate that several state-of-the-art VQA models, even those specifically designed for bias reduction, fail to address questions involving infrequent concepts. Furthermore, we show that the high accuracy obtained on the frequent concepts alone is mechanically increasing overall accuracy, covering up the true behavior of current VQA models.

In short: being reliable on rare events matters, yet VQA datasets remain imbalanced, so models tend to over-exploit dataset biases and struggle to generalise to unseen associations of concepts. The authors study how the prediction error is distributed and how accuracy differs between infrequent and frequent concepts. They build a benchmark from a fine-grained reorganization of GQA that introduces distribution shifts in both the validation and test splits, defined per question group and therefore tailored to each question. In a large-scale study they show that several state-of-the-art VQA models, even those specifically designed for bias reduction, fail on questions involving infrequent concepts, and that high accuracy on frequent concepts mechanically inflates overall accuracy, masking the true behavior of current VQA models.

III. Detailed Review

The biggest open problem in VQA is still dataset imbalance. Frequent phrasings induce language biases, e.g. "red rose" (must a rose be red?), as do out-of-context associations, e.g. "zebra in a city" (what does a zebra have to do with a city?). Such regularities let models over-rely on biases and hurt generalisation. Although this problem has been studied, "despite a general consensus on this diagnostic, systemic evaluations of error distributions are rare". Several questions remain open: How is the error distributed? Are correct predictions due to genuine reasoning or to language bias? How does prediction accuracy differ between frequent and infrequent samples? And how can models be validated on out-of-distribution data, i.e. rare samples (Out-Of-Distribution (OOD) settings)?

To address these questions, this work designs a new benchmark, which consists of:

(i) a new fine-grained reorganization of the GQA dataset [1], introducing distribution shifts in both validation and test sets;

(ii) a set of evaluation metrics;

(iii) new evaluation plots illustrating the reasoning behavior of VQA models at different operating points.

GQA is chosen because its question groups carry structured information that can precisely capture language biases; the authors use it to select strongly biased groups and to create distribution shifts with constraints tailored to each question. The construction of this reorganized GQA dataset is illustrated in the figure below:

The main contributions of the paper are:

1. We propose and make public a new fine-grained re-organization of the GQA dataset and a set of the respective evaluation metrics, allowing to precisely evaluate the reasoning behavior of VQA models and to characterise and visualise their generalisation behavior at different operating points w.r.t. distribution shifts.

2. Compared to competing benchmarks, our dataset features distribution shifts for both validation and test, allowing to validate models under OOD conditions.

3. In a large study, we evaluate several recent VQA models and show that they struggle to generalise in OOD conditions; we also test several SOTA bias-reduction methods and show that there is still room for improvement in addressing bias in VQA.

1. Related work

VQA corpuses: the first large-scale dataset was VQA 1.0. After strong language biases were found in it, VQA 2.0 was proposed; the synthetic CLEVR dataset appeared around the same time. Later, the CLEVR methodology was transferred to real-world images, yielding the GQA dataset, which contains about 1.7 million questions.

Attempts to reduce the bias-dependency: the core problem of VQA remains weak generalisation; experiments have shown that a question-only model, with no image at all, can already answer many questions. This motivated a line of work on language bias: adversarial regularization of training [11]; RUBi, which reduces the language bias; [4], which regularizes the model with a similar strategy; and other works [12], [13], which instead emphasize visual regions. These methods perform well, but their evaluation is carried out on unseen distributions.

  • [11] Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pages 1541–1551, 2018.
  • [4] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4060–4073, 2019.
  • [12] Jialin Wu and Raymond Mooney. Self-critical reasoning for robust visual question answering. In Advances in Neural Information Processing Systems, pages 8601–8611, 2019.
  • [13] Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, pages 2591–2600, 2019.

Reinventing VQA evaluation: VQA models keep growing more complex, and their biases grow more severe, so new ways of evaluating them are urgently needed. Early evaluation computed soft scores of the predictions against a lexical database; this was later replaced by hard scores, which are more prone to bias but easier to use in practice. The GQA authors proposed new metrics tailored to their dataset: consistency, plausibility, validity and distribution. These metrics are more multi-faceted, but they cannot assess the ability to predict answers out of distribution (on rare samples).

2. GQA-OOD: a benchmark for OOD settings

The new dataset is named GQA-OOD. Here, OOD essentially means rare events. Note that such rare samples may still appear in the training set: for "What color is this rose?", the usual answer is red, but in a rare sample it may be blue, and the model has to reason to get the correct answer. Moreover, this distribution shift is not global but context-dependent: if the question asks about a violet instead, the frequent answer is blue and the rare one is red.

The GQA-OOD benchmark consists of a dataset and a set of evaluation metrics.

Question groups: to construct the distribution shifts, the authors use the local groups of the GQA dataset. A local group precisely defines the question type, e.g. questions asked with "what color" or "where is", together with the concept involved, e.g. "zebra" or "violet". In total there are 37,000 local groups covering 132,000 questions.
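The local-group idea above can be sketched as follows. The field names and sample questions are hypothetical illustrations; the real GQA annotations carry much richer structure:

```python
from collections import defaultdict

# Hypothetical records: each GQA question belongs to a "local group"
# keyed by its question type and the concept it asks about.
questions = [
    {"qid": "q1", "type": "what color", "concept": "rose",   "answer": "red"},
    {"qid": "q2", "type": "what color", "concept": "rose",   "answer": "blue"},
    {"qid": "q3", "type": "what color", "concept": "violet", "answer": "blue"},
    {"qid": "q4", "type": "where is",   "concept": "zebra",  "answer": "field"},
]

local_groups = defaultdict(list)
for q in questions:
    # Group by (question type, concept), e.g. ("what color", "rose").
    local_groups[(q["type"], q["concept"])].append(q)
```

Each key such as `("what color", "rose")` then holds all questions of that type about that concept, which is the unit over which the paper measures imbalance.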

Measuring group imbalance: the imbalance of a group is measured with the Shannon entropy of its answer distribution.
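A minimal sketch of this measurement, assuming the entropy is taken over the empirical distribution of ground-truth answers within one group:

```python
from collections import Counter
import math

def shannon_entropy(answers):
    """Empirical Shannon entropy (in bits) of a group's answer distribution.
    High entropy = balanced group; low entropy = biased group."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly balanced 4-answer group has entropy 2.0 bits;
# a group dominated by "red" has much lower entropy.
balanced = shannon_entropy(["red", "blue", "white", "yellow"])  # 2.0
biased = shannon_entropy(["red", "red", "red", "blue"])         # ~0.81
```

Groups with low entropy are the strongly biased ones, which is what the benchmark uses to pick where distribution shifts matter.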

OOD setting and metrics: the distribution shift is introduced by selecting a subset of answer classes according to their frequency within each question group, and three metrics are defined depending on which classes are used for evaluation:

• Acc-tail: accuracy on rare samples (the tail of each local group's answer distribution)

• Acc-head: accuracy on the frequent answers within each local group

• Acc-all: overall accuracy on the whole GQA-OOD dataset
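The three metrics can be sketched as below. The official head/tail split comes from the released benchmark files; for illustration only, I assume here that an answer counts as "tail" when its in-group frequency falls below the group's mean answer frequency:

```python
from collections import Counter, defaultdict

def gqa_ood_metrics(records):
    """records: list of dicts with keys 'group', 'answer', 'correct' (bool).
    Returns acc-tail, acc-head and acc-all, with a simplified per-group
    tail rule (in-group answer frequency below the group mean)."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r)

    head_hits = head_n = tail_hits = tail_n = all_hits = 0
    for rs in by_group.values():
        counts = Counter(r["answer"] for r in rs)
        mean_freq = sum(counts.values()) / len(counts)
        for r in rs:
            all_hits += r["correct"]
            if counts[r["answer"]] < mean_freq:   # rare answer in this group
                tail_n += 1
                tail_hits += r["correct"]
            else:                                  # frequent answer
                head_n += 1
                head_hits += r["correct"]

    acc = lambda h, n: h / n if n else 0.0
    return {"acc-tail": acc(tail_hits, tail_n),
            "acc-head": acc(head_hits, head_n),
            "acc-all": acc(all_hits, len(records))}
```

Because the split is computed per group, an answer like "blue" can be head for violets and tail for roses, which is exactly the context-dependence the benchmark is after.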

3. Experiments

The compared models are:

Question Prior — this blind baseline returns the most probable answer estimated from training set statistics. Following [10, 2], we use the question type priors when evaluating on VQA-CP and VQA2. For GQA, we use the training global group priors.

LSTM [5] — this blind baseline takes GloVe embeddings [19] and encodes them using an LSTM [20] followed by a feed-forward network.

BUTD [21] — a strong VQA baseline based on object-level attention, in particular, bounding boxes with dense visual feature vectors extracted from the image using an object detector.

MCAN [22] — this SOTA approach is based on a Transformer [18] architecture and designed to model both intra-modality interactions and the inter-modality ones. It allows complex multi-hop reasoning through several stacked self-attention blocks. In our experiments, we use the 6-layers version of MCAN.

RUBi [3] — adds a question-only branch to the base model during training to prevent it from learning question biases. This branch is omitted during evaluation. To better analyze bias dependencies, we also study a modified version of RUBi, which we refer to as RUBi+QB below. In this variant, the question-only branch is kept during evaluation.

BP [4] — is similar to RUBi but differs by directly taking training set statistics to infer question type biases during training. The question type biases are fused with the base model predictions using a product of experts [4], and removed during testing.

LMH [4] — is an improved version of BP [4]. In this version, the question bias is dynamically weighted by the base model in order to control its influence. In the original setup, an entropy penalty is added to the loss to prevent the model from ignoring the bias. Nevertheless, when training on GQA, we obtain better results without this penalty (see supp. material section B for details).

(1) Analysis of the prediction error distribution

The three metrics for the different models on GQA-OOD are shown in the figure below:

Models fail on rare question-answer pairs: clearly, the two blind models, Question Prior and LSTM, show a very large gap ∆ between acc-tail and acc-head, indicating heavy reliance on language bias. BUTD and MCAN also show sizeable gaps, but MCAN does slightly better than BUTD, suggesting the Transformer architecture performs somewhat better.
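As I read the paper, the gap ∆ reported alongside the accuracies is the relative difference between acc-head and acc-tail; a hedged one-line sketch under that assumption:

```python
def delta(acc_head, acc_tail):
    """Relative head/tail gap, in percent (assumed definition).
    A large value suggests heavy reliance on frequent (biased) answers."""
    return 100.0 * (acc_head - acc_tail) / acc_tail

# e.g. acc-head = 50%, acc-tail = 20%  ->  delta = 150%
```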

Overall accuracy is dominated by frequent question-answer pairs: the overall metric acc-all does not reflect a model's true ability, because its growth is driven by correct predictions on frequent samples.

Bias-reduction methods are not efficient on rare samples: surprisingly, the three bias-reduction methods fail to improve the acc-tail score, and even degrade acc-head. Moreover, RUBi+QB outperforms RUBi, showing that preventing the model from learning frequent patterns does not necessarily improve performance on rare samples.

Visualising the generalisation behavior: the generalisation behavior of the different VQA models is shown in the figure below:

(2) Comparison with other datasets

The datasets compared here are GQA, VQA 2.0 and VQA-CP. The experimental results are given below:

IV. Summary
