Paper Notes: "From Recognition to Cognition: Visual Commonsense Reasoning"

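The paper proposes the Visual Commonsense Reasoning task, which requires a machine not only to answer a question but also to give a rationale for its answer. The VCR dataset contains 290k questions about movie scenes, and the Adversarial Matching algorithm is used to address dataset bias. The R2C model attempts to model the layered inferences that lead from recognition to cognition, but a gap of 30%-40% relative to human performance remains.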


Abstract:

    Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.

    Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).

    To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

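       A single glance at an image gives a human a great deal of information: not only what the pixels show, but also knowledge that lies beyond the image, such as people's actions, goals, and mental states. This is very hard for machines. The authors define the task as Visual Commonsense Reasoning: the machine must choose the correct answer and must also give a rationale that justifies it.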
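The two-step requirement (answer first, then justify) can be made concrete with a small scoring sketch. This is an illustration under stated assumptions, not the authors' evaluation code: the function name `vcr_accuracy` and the `(answer_index, rationale_index)` tuple layout are hypothetical, and the sketch assumes a question counts as fully solved only when both the chosen answer and the chosen rationale match the labels.

```python
from typing import List, Tuple

# Hypothetical layout: (answer_index, rationale_index) for one question.
Choice = Tuple[int, int]


def vcr_accuracy(preds: List[Choice], labels: List[Choice]) -> Tuple[float, float, float]:
    """Return (answer accuracy, rationale accuracy, joint accuracy).

    Assumption: joint accuracy credits a question only when both the
    answer and the rationale are correct.
    """
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equally long")
    n = len(labels)
    answer_hits = sum(p[0] == g[0] for p, g in zip(preds, labels))
    rationale_hits = sum(p[1] == g[1] for p, g in zip(preds, labels))
    joint_hits = sum(p == g for p, g in zip(preds, labels))
    return answer_hits / n, rationale_hits / n, joint_hits / n


if __name__ == "__main__":
    # Toy predictions and labels, invented for illustration only.
    example_preds = [(0, 1), (2, 2), (1, 3), (3, 0)]
    example_labels = [(0, 1), (2, 0), (0, 3), (3, 0)]
    print(vcr_accuracy(example_preds, example_labels))
```

Because the joint score needs both choices to be right, it can never exceed either of the two individual accuracies, which is one reason a combined answer-plus-rationale task is harder than answering alone.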

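       The authors introduce a new dataset, VCR, containing 290k multiple-choice QA problems derived from 110k movie scenes. The key to generating non-trivial, high-quality problems at scale is Adversarial Matching, an approach that turns rich annotations into multiple-choice questions with minimal bias. Humans find VCR fairly easy, with accuracy above 90%, while machines struggle, with accuracy of about 45%.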
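These notes do not describe how Adversarial Matching works internally. As a rough illustration only, the sketch below assumes the idea is to reuse the correct answers of other questions as distractors, preferring candidates that look relevant to the question but differ from its true answer. Everything here is hypothetical: `jaccard` is a toy word-overlap score standing in for learned relevance and similarity models, `lam` is an illustrative trade-off weight, and the brute-force search over permutations is only practical for a handful of items.

```python
from itertools import permutations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap score in [0, 1]; a stand-in for a learned text model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def match_distractors(questions: List[str], answers: List[str], lam: float = 0.5) -> List[int]:
    """For each question i, return the index of another question's answer
    to use as a distractor. Each answer is used exactly once."""
    n = len(questions)
    if n != len(answers) or n < 2:
        raise ValueError("need at least two question/answer pairs of equal count")
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        if any(i == j for i, j in enumerate(perm)):
            continue  # a question must not receive its own correct answer
        score = sum(
            jaccard(questions[i], answers[j])        # reward relevance to the question
            - lam * jaccard(answers[i], answers[j])  # penalize similarity to the true answer
            for i, j in enumerate(perm)
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)


if __name__ == "__main__":
    # Invented toy data, not taken from the VCR dataset.
    qs = ["why is the man smiling", "what is the woman holding", "where are they going"]
    ans = ["the man just heard a joke", "the woman is holding a cup", "they are going to a party"]
    print(match_distractors(qs, ans))
```

Raising `lam` pushes the search toward distractors that are clearly different from the true answer, at the cost of relevance; lowering it makes distractors more tempting but risks picking one that is also arguably correct.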

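       To move machines toward cognition-level understanding, the authors propose a new reasoning engine, Recognition to Cognition Networks (R2C), which models the layered inferences needed for grounding, contextualization, and reasoning. R2C narrows the gap between humans and machines on VCR (reaching about 65% accuracy), although the task is still far from solved.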

 

Introduction:

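       Visual understanding requires the seamless integration of recognition and cognition. Beyond recognition-level perception (for example, detecting objects and their attributes…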
