I recently read a CVPR 2018 paper, "IQA: Visual Question Answering in Interactive Environments", which describes an agent that interacts with a visual environment in order to answer questions grounded in that environment. I had not read work in this area before, and I could not find any online write-ups of this paper, so I am recording some personal notes here. If there are any mistakes, I welcome corrections!
A long-standing goal of the artificial intelligence community is to create agents that can perform manual tasks in the real world and communicate with humans through natural language. For example, a household robot might be asked: "Do we need to buy more milk?" Answering this would require it to navigate to the kitchen, open the refrigerator, and look at the milk carton. Or: "How many boxes of cookies do we have?" This would require the agent to navigate to the cupboards, open several of them, and count the cookie boxes. Toward this goal, Visual Question Answering (VQA), the task of answering questions about visual content, has received great attention from the computer vision and natural language processing communities. Although VQA has made substantial progress, research has focused mainly on passively answering questions about visual content, without the ability to interact with the environment that generates that content; an agent that can only passively answer questions is limited in its ability to help humans complete tasks.
1. Paper Overview
Abstract:
We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR [35], a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1.
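The core idea in the abstract — a factorized set of controllers operating at multiple levels of temporal abstraction — can be illustrated with a minimal two-level control loop. This is only a hedged sketch of the general hierarchical-controller pattern; the class names (`Planner`, `SubController`) and subtasks are hypothetical and are not the paper's actual HIMN implementation.

```python
from typing import List


class SubController:
    """Low-level controller: executes primitive actions for one subtask."""

    def __init__(self, name: str, actions: List[str]):
        self.name = name
        self.actions = actions

    def run(self) -> List[str]:
        # Run this subtask's primitive actions until the subtask terminates.
        return [f"{self.name}:{a}" for a in self.actions]


class Planner:
    """High-level controller: picks subtasks at a coarser time scale."""

    def __init__(self, plan: List[SubController]):
        self.plan = plan  # ordered subtasks chosen by the planner

    def run_episode(self) -> List[str]:
        trace = []
        for sub in self.plan:        # one planner step spans many env steps
            trace.extend(sub.run())  # temporal abstraction
        return trace


# Hypothetical episode for a question like "Are there any apples in the fridge?"
plan = [
    SubController("navigate", ["forward", "forward", "turn_left"]),
    SubController("manipulate", ["open_fridge"]),
    SubController("answer", ["say_yes"]),
]
trace = Planner(plan).run_episode()
print(trace)
```

The key property is that the planner reasons over a handful of subtask decisions while each sub-controller handles the long, diverse sequences of primitive actions, which is what makes the state space tractable compared to a single flat controller.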