[Paper Reading] EST-VQA: Evidence-Based Scene Text VQA (X. Wang et al., CVPR 2020)

全部梭哈迟早暴富

Article link: https://blog.csdn.net/z704630835/article/details/107319894

I. Background

Paper title: "On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering"

This paper studies visual question answering over scene text.

Paper download: https://openaccess.thecvf.com/content_CVPR_2020/papers/Wang_On_the_General_Value_of_Evidence_and_Bilingual_Scene-Text_Visual_CVPR_2020_paper.pdf

Citation: Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, Liangwei Wang. "On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering." In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Project: the code has not been released yet; only the project homepage www.est-vqa.org is given.

II. Overview

First, the abstract:

Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method’s ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analyses are provided that show the value of the dataset. The dataset is available at www.est-vqa.org.

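The main problem with VQA today is still generalization: models tend to learn coincidental correlations between the image data and the text data rather than the deeper relations between them. The paper proposes a new dataset whose questions are expressed in two languages and which focuses on scene-text questions. It also proposes an evaluation process that checks whether a correct answer is actually supported by reasoning, so that coincidentally correct answers are penalized.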

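III. Details

The poor generalization of VQA ultimately comes from dataset bias. To generalize better, a model needs to actually reason instead of being driven by dataset bias alone. The authors' design follows from this idea, and the paper illustrates it with a figure (not reproduced here).

There are two motivations for building on scene-text VQA. First, VQA currently performs poorly on text-related questions, and existing methods are still far from what practical applications require. Second, questions and answers in scene-text VQA are rarely ambiguous, because a specific piece of scene text usually indicates the answer.

Text-based VQA still faces many challenges. The paper gives four examples in a figure (not reproduced here):

(a) the question can be answered without reading any text; (b) there is more than one correct answer; (c) prior knowledge is needed to answer; (d) the answer cannot be read directly from the text in the image.

The generalization of current VQA methods also depends heavily on how the answer space is built from the training set. The paper shows this with another figure (not reproduced here):

(b) shows that conventional VQA methods are very sensitive to changes in image features even when the semantics of the image are unchanged; (c) and (d) show that when no text is present, conventional VQA methods tend to produce an answer from language bias.

The paper proposes EST-VQA (Evidence-based Scene Text Visual Question Answering). Its main contributions are the following three: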

• Dataset: The EST-VQA dataset provides questions, images and answers, but also a bounding box for each question that indicates the area of the image that informs the answer. We refer to such bounding boxes as evidence. The dataset is intended to enable the development of text VQA methods that are closer to the levels of performance required by practical applications, but also to encourage the development of general VQA methods that generalize. In short: besides questions, images and answers, every question is tied to an image region marked with a bounding box, and the authors call these bounding boxes evidence.

• Evaluation Metric: We introduce an Evidence-based Evaluation (EvE) metric, which will require a VQA model to provide evidence to support the predicted answer. For this purpose, a new VQA model is also proposed. Under this new metric, it is anticipated that it will be much more difficult for naive classification models to achieve inflated performance. In short: an evidence-based evaluation metric is introduced.

• Bilingual: To the best of our knowledge, the proposed EST-VQA is the first bilingual scene text VQA dataset that includes both English and Chinese question and answer pairs. The fact that the proposed dataset embodies questions in two languages further rewards methods that generalize well. It is more difficult for a method to exploit superficial correlations in questions expressed in multiple languages. The languages chosen are also particularly grammatically distinct, and reflect culturally distinct populations, which leads to different question statistics, and further encourages generalization. In short: according to the authors, EST-VQA is the first scene text VQA dataset that contains both English and Chinese question-answer pairs.

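1. Related work

Text-based VQA: there is already some work in this direction, including ST-VQA, OCR-VQA, and LoRRA (the post showed LoRRA's network structure in a figure, not reproduced here).

Dataset size: each of the three works above also introduced a dataset: [4] is ST-VQA, [24] is OCR-VQA, and [29] is TextVQA. The paper compares their sizes in a table (not reproduced here).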

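2. The EST-VQA dataset

Images: 20,757 in total. The English scene-text images come from Total-Text, ICDAR 2013, ICDAR 2015, CTW1500, MLT, and COCO Text. The Chinese scene-text images come from LSVT.

Questions and answers: there are 15,056 English questions and 13,006 Chinese questions. A question and its answer may be in different languages. For example, if a question asks in English for the name of a shop and the shop's name is written in Chinese, the answer can be given in Chinese. Every question and answer must be related to the scene text. Annotators also mark the bounding box of the region that supports the answer, and this box serves as the evidence. The paper plots the distribution of common English and Chinese questions (figure not reproduced here).

The paper also plots the length distributions of questions and answers (figure not reproduced here).

Finally, the numbers of questions and answers are summarized in a table in the paper (not reproduced here).

Evaluation: conventional VQA is treated as a classification problem, so an answer can only be one that appeared in the training set, and there is no way to tell whether the model has learned anything useful. The authors therefore propose an evidence-based evaluation, which requires the model to provide evidence supporting its answer. The evaluation has two parts: the answer is checked with a normalized Levenshtein similarity score, and the evidence is checked with IoU (intersection over union) between the predicted box and the annotated box.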
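To make the two checks concrete, here is a minimal sketch of an evidence-based correctness test. It is only an illustration under stated assumptions, not the authors' evaluation code: the function names are made up for this post, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and both thresholds (0.5) are placeholder values, since the exact thresholds are not given in these notes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, gt: str) -> float:
    """1 - edit_distance / max_length, so 1.0 means identical strings."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, gt) / longest


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_based_correct(pred_answer: str, gt_answer: str,
                           pred_box: Box, gt_box: Box,
                           sim_threshold: float = 0.5,   # assumed value
                           iou_threshold: float = 0.5    # assumed value
                           ) -> bool:
    """An answer counts only if both the text and the evidence box match."""
    answer_ok = normalized_similarity(pred_answer, gt_answer) >= sim_threshold
    evidence_ok = iou(pred_box, gt_box) >= iou_threshold
    return answer_ok and evidence_ok
```

With this kind of check, a model that guesses the right string from language bias but points at the wrong region is scored as wrong, which is the behavior the evidence-based evaluation is meant to enforce.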

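3. Model

The model proposed by the authors is called QA R-CNN (the architecture figure is not reproduced here).

The model has two parts: a Focusing Module (FM) and a Reasoning Module (RM).

The FM is built mainly on Faster R-CNN. A conventional Faster R-CNN outputs boxes and features; here the authors additionally output a focus score for each box. English questions are embedded with GloVe and Chinese questions with Word2Vec, and the embeddings are fed into an LSTM to obtain the question feature. The visual feature and the question feature are then concatenated and used to separate box regions from non-box regions.

The RM is the refinement stage, and its structure is similar to LoRRA. OCR token embeddings are extracted with a FastText model and then fused with the image features and the question embedding for the final classification.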
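The per-box focus score can be pictured with a toy example. The sketch below is an assumption-heavy illustration, not QA R-CNN itself: it concatenates each box feature with the question feature, applies a single linear layer followed by a sigmoid, and picks the highest-scoring box as the evidence. The linear-plus-sigmoid scorer, the function names, and the tiny hand-written weights are all hypothetical; the notes above do not specify the actual layers.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def focus_scores(box_features: Sequence[Sequence[float]],
                 question_feature: Sequence[float],
                 weights: Sequence[float],
                 bias: float = 0.0) -> List[float]:
    """Score every box from the concatenation [box_feature ; question_feature].

    Hypothetical scorer: one linear layer plus a sigmoid, so each score
    lies in (0, 1) and can be read as "how likely this box is the evidence".
    """
    scores = []
    for box_feature in box_features:
        fused = list(box_feature) + list(question_feature)  # concatenation
        if len(fused) != len(weights):
            raise ValueError("weights must match the fused feature length")
        logit = sum(w * x for w, x in zip(weights, fused)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def pick_evidence_box(boxes: Sequence[Box], scores: Sequence[float]) -> Box:
    """Return the box with the highest focus score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return boxes[best]


if __name__ == "__main__":
    # Two candidate boxes with 2-d features and a 1-d question feature (toy sizes).
    boxes = [(0, 0, 10, 10), (20, 20, 40, 40)]
    scores = focus_scores([[1.0, 0.0], [0.0, 1.0]], [1.0], [2.0, -2.0, 0.5])
    print(scores, pick_evidence_box(boxes, scores))
```

In a real system the box features would come from the detector and the question feature from the LSTM, and the scorer would be trained against the annotated evidence boxes.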

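The authors then run experiments on the three tasks they design, which I will not go through here. The paper closes with some sample results (figure not reproduced here).

IV. Summary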
