Abstract:
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).
To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
For humans, a single glance at an image yields far more than what the pixels show: we effortlessly infer knowledge that lies beyond the image itself. For machines, however, this task is very hard. The authors formalize it as Visual Commonsense Reasoning, which requires a machine not only to answer a question correctly but also to provide a rationale justifying that answer.
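To make the task format concrete, here is a minimal sketch of how one VCR problem and its combined metric could be represented. The field and function names are illustrative assumptions, not the dataset's actual schema; only the Q→A, QA→R, and combined Q→AR structure comes from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VCRProblem:
    """One Visual Commonsense Reasoning problem (illustrative schema).

    The model must first pick the correct answer (Q -> A), and then,
    given the question plus the correct answer, pick the rationale
    that justifies it (QA -> R).
    """
    image_path: str
    question: List[str]                 # tokenized question
    answer_choices: List[List[str]]     # 4 candidate answers
    rationale_choices: List[List[str]]  # 4 candidate rationales
    answer_label: int                   # index of the correct answer
    rationale_label: int                # index of the correct rationale

def is_fully_correct(pred_answer: int, pred_rationale: int,
                     ex: VCRProblem) -> bool:
    # Under the combined Q -> AR metric, a prediction counts only if
    # both the chosen answer and the chosen rationale are right.
    return (pred_answer == ex.answer_label
            and pred_rationale == ex.rationale_label)
```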
The authors introduce a new dataset, VCR, containing 290k multiple-choice QA problems derived from 110k movie scenes. The key to generating non-trivial, high-quality problems at scale is Adversarial Matching, a method that transforms rich annotations into multiple-choice questions with minimal bias. VCR is easy for humans, who exceed 90% accuracy, but hard for state-of-the-art vision models, which reach only about 45%.
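Adversarial Matching selects each question's distractors from other questions' correct answers by solving a maximum-weight bipartite matching that trades off relevance (a hard distractor should look plausible for the question) against similarity (it must not be a paraphrase of the true answer). Below is a minimal sketch of that idea, assuming both score matrices are precomputed probabilities; the function name, the λ weighting, and the one-distractor-per-pass setup are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_matching(relevance, similarity, lam=0.5):
    """Pick one distractor per question via max-weight bipartite matching.

    relevance[i, j]:  how plausible answer j looks for question i.
    similarity[i, j]: how similar answer j is to question i's correct
                      answer (near-paraphrases would be accidentally correct).
    Both are assumed to be probabilities in [0, 1].
    """
    eps = 1e-8
    # Reward relevance (difficulty), penalize similarity to the true answer.
    weight = np.log(relevance + eps) + lam * np.log(1.0 - similarity + eps)
    np.fill_diagonal(weight, -1e9)  # an answer can't be its own distractor
    rows, cols = linear_sum_assignment(weight, maximize=True)
    return cols  # cols[i] = index of the answer assigned to question i
```

λ controls the trade-off between making distractors hard and the risk that one is effectively correct; the paper obtains several distractors per question by repeating the matching under constraints that prevent reuse.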
To push machines toward cognition-level understanding, the authors propose a new model, Recognition to Cognition Networks (R2C), which performs the layered inferences needed for grounding, contextualization, and reasoning, narrowing the gap between humans and machines on VCR (to roughly 65%).
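The following is a highly simplified sketch of R2C's three-stage layered structure; the module choices and dimensions are illustrative assumptions, not the authors' implementation, but the grounding → contextualization → reasoning pipeline mirrors the paper's description.

```python
import torch
import torch.nn as nn

class R2CSketch(nn.Module):
    """Illustrative grounding -> contextualization -> reasoning pipeline.

    Inputs are assumed to be pre-embedded: question (B, Lq, D),
    candidate response (B, Lr, D), detected object features (B, No, D).
    """
    def __init__(self, dim=512):
        super().__init__()
        self.ground = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.reason = nn.LSTM(2 * dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question, response, objects):
        # 1) Grounding: encode the question and the candidate response
        #    so their tokens can be linked to the image.
        q, _ = self.ground(question)
        r, _ = self.ground(response)
        # 2) Contextualization: attend from the response to the question
        #    and the detected objects jointly.
        ctx = torch.cat([q, objects], dim=1)
        r_ctx, _ = self.attn(r, ctx, ctx)
        # 3) Reasoning: run a recurrent network over the contextualized
        #    response and pool to a single plausibility score.
        h, _ = self.reason(torch.cat([r, r_ctx], dim=-1))
        return self.score(h.max(dim=1).values)  # higher = more plausible
```

At inference time, each of the four candidate answers is scored and the argmax is chosen; the same machinery is reused to score the four rationales.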
Introduction:
Visual understanding requires the seamless integration of recognition and cognition. Beyond recognition-level perception (e.g., detecting objects and their attributes)