Abstract:
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).
To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
For humans, a single glance at an image yields far more than what the pixels show: we effortlessly infer knowledge that lies beyond the image itself. For machines, however, this task is very hard. The authors formalize it as Visual Commonsense Reasoning, which requires a machine not only to answer a question correctly but also to provide a rationale justifying that answer.
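To make the task format concrete, here is a minimal sketch of how one VCR problem and its combined metric could be represented. The field and function names are illustrative assumptions, not the dataset's actual schema; only the Q→A, QA→R, and combined Q→AR structure comes from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VCRProblem:
    """One Visual Commonsense Reasoning problem (illustrative schema).

    The model must first pick the correct answer (Q -> A), and then,
    given the question plus the correct answer, pick the rationale
    that justifies it (QA -> R).
    """
    image_path: str
    question: List[str]                 # tokenized question
    answer_choices: List[List[str]]     # 4 candidate answers
    rationale_choices: List[List[str]]  # 4 candidate rationales
    answer_label: int                   # index of the correct answer
    rationale_label: int                # index of the correct rationale

def is_fully_correct(pred_answer: int, pred_rationale: int,
                     ex: VCRProblem) -> bool:
    # Under the combined Q -> AR metric, a prediction counts only if
    # both the chosen answer and the chosen rationale are right.
    return (pred_answer == ex.answer_label
            and pred_rationale == ex.rationale_label)
```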
The authors introduce a new dataset, VCR, containing 290k multiple-choice QA problems derived from 110k movie scenes. The key to generating non-trivial, high-quality problems at scale is Adversarial Matching, a method that transforms rich annotations into multiple-choice questions with minimal bias. VCR is easy for humans, who exceed 90% accuracy, but hard for state-of-the-art vision models, which reach only about 45%.
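Adversarial Matching selects each question's distractors from other questions' correct answers by solving a maximum-weight bipartite matching that trades off relevance (a hard distractor should look plausible for the question) against similarity (it must not be a paraphrase of the true answer). Below is a minimal sketch of that idea, assuming both score matrices are precomputed probabilities; the function name, the λ weighting, and the one-distractor-per-pass setup are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_matching(relevance, similarity, lam=0.5):
    """Pick one distractor per question via max-weight bipartite matching.

    relevance[i, j]:  how plausible answer j looks for question i.
    similarity[i, j]: how similar answer j is to question i's correct
                      answer (near-paraphrases would be accidentally correct).
    Both are assumed to be probabilities in [0, 1].
    """
    eps = 1e-8
    # Reward relevance (difficulty), penalize similarity to the true answer.
    weight = np.log(relevance + eps) + lam * np.log(1.0 - similarity + eps)
    np.fill_diagonal(weight, -1e9)  # an answer can't be its own distractor
    rows, cols = linear_sum_assignment(weight, maximize=True)
    return cols  # cols[i] = index of the answer assigned to question i
```

λ controls the trade-off between making distractors hard and the risk that one is effectively correct; the paper obtains several distractors per question by repeating the matching under constraints that prevent reuse.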
To push machines toward cognition-level understanding, the authors propose a new model, Recognition to Cognition Networks (R2C), which performs the layered inferences needed for grounding, contextualization, and reasoning, narrowing the gap between humans and machines on VCR (to roughly 65%).
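The following is a highly simplified sketch of R2C's three-stage layered structure; the module choices and dimensions are illustrative assumptions, not the authors' implementation, but the grounding → contextualization → reasoning pipeline mirrors the paper's description.

```python
import torch
import torch.nn as nn

class R2CSketch(nn.Module):
    """Illustrative grounding -> contextualization -> reasoning pipeline.

    Inputs are assumed to be pre-embedded: question (B, Lq, D),
    candidate response (B, Lr, D), detected object features (B, No, D).
    """
    def __init__(self, dim=512):
        super().__init__()
        self.ground = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.reason = nn.LSTM(2 * dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question, response, objects):
        # 1) Grounding: encode the question and the candidate response
        #    so their tokens can be linked to the image.
        q, _ = self.ground(question)
        r, _ = self.ground(response)
        # 2) Contextualization: attend from the response to the question
        #    and the detected objects jointly.
        ctx = torch.cat([q, objects], dim=1)
        r_ctx, _ = self.attn(r, ctx, ctx)
        # 3) Reasoning: run a recurrent network over the contextualized
        #    response and pool to a single plausibility score.
        h, _ = self.reason(torch.cat([r, r_ctx], dim=-1))
        return self.score(h.max(dim=1).values)  # higher = more plausible
```

At inference time, each of the four candidate answers is scored and the argmax is chosen; the same machinery is reused to score the four rationales.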
Introduction:
Visual understanding requires the seamless integration of recognition and cognition. Beyond recognition-level perception (e.g., detecting objects and their attributes)