强化学习@AAAI2019

Fully Convolutional Network with Multi-Step Reinforcement Learning for Image Processing
具有多步强化学习的全卷积网络用于图像处理
Ryosuke Furuta@The University of TokyoNaoto Inoue@The University of TokyoToshihiko Yamasaki@The University of Tokyo
古田凉介@东京大学井上直人@东京大学山崎俊彦@东京大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
This paper tackles a new problem setting: reinforcement learning with pixel-wise rewards (pixelRL) for image processing. After the introduction of the deep Q-network, deep RL has been achieving great success. However, the applications of deep RL for image processing are still limited. Therefore, we extend deep RL to pixelRL for various image processing applications. In pixelRL, each pixel has an agent, and the agent changes the pixel value by taking an action. We also propose an effective learning method for pixelRL that significantly improves the performance by considering not only the future states of its own pixel but also those of the neighbor pixels. The proposed method can be applied to some image processing tasks that require pixel-wise manipulations, where deep RL has never been applied.
本文解决了一个新的问题设置:使用像素级奖励(pixelRL)进行图像处理的强化学习。在引入深层Q网络之后,深层RL取得了巨大的成功。但是,深度RL在图像处理中的应用仍然受到限制。因此,我们针对各种图像处理应用将深层RL扩展到pixelRL。在pixelRL中,每个像素都有一个代理,代理通过采取行动来更改像素值。我们还提出了一种有效的PixelRL学习方法,它不仅考虑了自己像素的未来状态,而且还考虑了相邻像素的未来状态,从而显着提高了性能。所提出的方法可以应用于一些图像处理任务,这些任务需要对像素进行逐个操作,而深度RL从未应用。
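To make the per-pixel agent idea concrete, here is a tiny, hedged sketch (not the authors' FCN-based method): a single tabular Q-function shared by every pixel agent, trained with a pixel-wise reward defined as the squared-error reduction toward a clean reference image. The action set, reward, constants, and the use of raw intensity as the state are simplifying assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = np.array([-1, 0, 1])     # assumed per-pixel action set: decrement / keep / increment
Q = np.zeros((256, 3))             # one tabular Q-function shared by all pixel agents
alpha, gamma, eps = 0.1, 0.9, 0.1

def train_episode(noisy, clean, steps=10):
    """Every pixel is an agent: it observes its own intensity, picks an action,
    and receives a pixel-wise reward (error reduction w.r.t. the clean image)."""
    state, target = noisy.ravel().copy(), clean.ravel()
    for _ in range(steps):
        greedy = Q[state].argmax(axis=1)
        explore = rng.integers(0, 3, size=state.size)
        a = np.where(rng.random(state.size) < eps, explore, greedy)
        nxt = np.clip(state + ACTIONS[a], 0, 255)
        r = (state - target) ** 2 - (nxt - target) ** 2          # pixel-wise reward
        td = r + gamma * Q[nxt].max(axis=1) - Q[state, a]
        np.add.at(Q, (state, a), alpha * td)                     # one update per pixel agent
        state = nxt

clean = rng.integers(0, 256, size=(16, 16))
noisy = np.clip(clean + rng.integers(-3, 4, size=clean.shape), 0, 255)
for _ in range(100):
    train_episode(noisy, clean)
print("greedy action learned for intensity 100:", ACTIONS[Q[100].argmax()])
```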

Communication-Efficient Stochastic Gradient MCMC for Neural Networks
用于神经网络的通信有效的随机梯度MCMC
Chunyuan Li@Microsoft ResearchChangyou Chen@State University of New York at BuffaloYunchen Pu@FacebookRicardo Henao@Duke UniversityLawrence Carin@Duke University
李春元@微软研究院陈昌友@纽约州立大学布法罗分校Yunchen Pu @ Facebook里卡多·贺瑙@杜克大学劳伦斯·卡林@杜克大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Learning probability distributions on the weights of neural networks has recently proven beneficial in many applications. Bayesian methods such as Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) offer an elegant framework to reason about model uncertainty in neural networks. However, these advantages usually come with a high computational cost. We propose accelerating SG-MCMC under the master-worker framework: workers asynchronously and in parallel share responsibility for gradient computations, while the master collects the final samples. To reduce communication overhead, two protocols (downpour and elastic) are developed to allow periodic interaction between the master and workers. We provide a theoretical analysis on the finite-time estimation consistency of posterior expectations and establish connections to sample thinning. Our experiments on various neural networks demonstrate that the proposed algorithms can greatly reduce training time while achieving comparable (or better) test accuracy/log-likelihood levels relative to traditional SG-MCMC. When applied to reinforcement learning, it naturally provides exploration for asynchronous policy optimization with encouraging performance improvement.
最近的研究证明,学习神经网络权重上的概率分布在许多应用中是有益的。贝叶斯方法,例如随机梯度马尔可夫链蒙特卡洛(SG-MCMC),提供了一个推理神经网络模型不确定性的优雅框架。然而,这些优点通常伴随着高计算成本。我们建议在主-从(master-worker)框架下加速SG-MCMC:各个worker异步并行地分担梯度计算,而master收集最终样本。为了减少通信开销,我们开发了两种协议(downpour和elastic),以允许master和worker之间进行定期交互。我们对后验期望的有限时间估计一致性给出了理论分析,并建立了与样本稀疏化(thinning)的联系。我们在多种神经网络上的实验表明,与传统的SG-MCMC相比,所提出的算法可以大大减少训练时间,同时达到可比(或更好)的测试准确率/对数似然水平。当将其应用于强化学习时,它自然地为异步策略优化提供了探索,并带来了令人鼓舞的性能提升。

Reinforcement Learning for Improved Low Resource Dialogue Generation
强化学习以改善低资源对话的产生
Ana V. González-Garduño@University of Copenhagen
Ana V.González-Garduño@哥本哈根大学
Doctoral Consortium Track Abstracts
博士联合会文摘
In this thesis I focus on language independent methods of improving utterance understanding and response generation and attempt to tackle some of the issues surrounding current systems. The aim is to create a unified approach to dialogue generation inspired by developments in both goal oriented and open ended dialogue systems. The main contributions in this thesis are: 1) Introducing hybrid approaches to dialogue generation using retrieval and encoder-decoder architectures to produce fluent but precise utterances in dialogues 2) Proposing supervised semi-supervised and Reinforcement Learning methods for domain adaptation in goal oriented dialogue and 3) Introducing models that can adapt cross lingually.
在这篇论文中,我关注与语言无关的方法,以改进话语理解和回复生成,并尝试解决当前系统存在的一些问题。目标是受目标导向和开放式对话系统的发展启发,创建一种统一的对话生成方法。本论文的主要贡献是:1)引入使用检索和编码器-解码器架构的混合式对话生成方法,以在对话中产生流畅而精确的话语;2)提出用于目标导向对话中领域自适应的有监督、半监督和强化学习方法;3)提出能够跨语言自适应的模型。

Generating Multiple Diverse Responses for Short-Text Conversation
为短文本对话生成多个多样化的回复
Jun Gao@Soochow UniversityWei Bi@Tencent AI LabXiaojiang Liu@Tencent AI LabJunhui Li@Soochow UniversityShuming Shi@Tencent AI Lab
高军@苏州大学魏毕@腾讯AI实验室刘小江@腾讯AI实验室李俊慧@苏州大学Shuming Shi @腾讯AI实验室
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Neural generative models have become popular and achieved promising performance on short-text conversation tasks. They are generally trained to build a 1-to-1 mapping from the input post to its output response. However a given post is often associated with multiple replies simultaneously in real applications. Previous research on this task mainly focuses on improving the relevance and informativeness of the top one generated response for each post. Very few works study generating multiple accurate and diverse responses for the same post. In this paper we propose a novel response generation model which considers a set of responses jointly and generates multiple diverse responses simultaneously. A reinforcement learning algorithm is designed to solve our model. Experiments on two short-text conversation tasks validate that the multiple responses generated by our model obtain higher quality and larger diversity compared with various state-ofthe-art generative models.
神经生成模型已经变得很流行,并且在短文本对话任务上取得了可喜的表现。它们通常被训练为建立从输入帖子到其输出回复的一对一映射。但是,在实际应用中,一个给定的帖子往往同时对应多个回复。以前针对此任务的研究主要集中在提高为每个帖子生成的最佳回复的相关性和信息量。很少有工作研究为同一帖子生成多个既准确又多样的回复。在本文中,我们提出了一种新颖的回复生成模型,该模型联合考虑一组回复并同时生成多个多样化的回复。我们设计了一种强化学习算法来求解我们的模型。在两个短文本对话任务上的实验证明,与各种最先进的生成模型相比,我们的模型生成的多个回复具有更高的质量和更大的多样性。

End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks
通过屏障函数实现端到端安全强化学习,用于安全攸关的连续控制任务
Richard Cheng@California Institute of TechnologyGábor Orosz@University of MichiganRichard M. Murray@California Institute of TechnologyJoel W. Burdick@California Institute of Technology
理查德·郑(Richard Cheng)@加州理工学院加博尔·奥罗斯(GáborOrosz)@密歇根大学理查德·默里(Richard M.Murray)@加州理工学院乔尔·伯迪克(Joel W. Burdick)@加州理工学院
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Reinforcement Learning (RL) algorithms have found limited success beyond simulated applications, and one main reason is the absence of safety guarantees during the learning process. Real world systems would realistically fail or break before an optimal controller can be learned. To address this issue, we propose a controller architecture that combines (1) a model-free RL-based controller with (2) model-based controllers utilizing control barrier functions (CBFs) and (3) online learning of the unknown system dynamics, in order to ensure safety during learning. Our general framework leverages the success of RL algorithms to learn high-performance controllers, while the CBF-based controllers both guarantee safety and guide the learning process by constraining the set of explorable policies. We utilize Gaussian Processes (GPs) to model the system dynamics and its uncertainties.
强化学习(RL)算法在模拟应用之外的成功有限,主要原因之一是学习过程中缺乏安全保证。在学到最优控制器之前,现实世界的系统实际上早已失效或损坏。为了解决此问题,我们提出了一种控制器架构,它将(1)基于无模型RL的控制器、(2)利用控制屏障函数(CBF)的基于模型的控制器,以及(3)对未知系统动力学的在线学习相结合,以确保学习期间的安全性。我们的通用框架利用RL算法的优势来学习高性能控制器,而基于CBF的控制器既能保证安全性,又能通过限制可探索的策略集来引导学习过程。我们利用高斯过程(GP)对系统动力学及其不确定性进行建模。

On Reinforcement Learning for Full-Length Game of StarCraft
关于《星际争霸》全长游戏的强化学习
Zhen-Jia Pang@Nanjing UniversityRuo-Ze Liu@Nanjing UniversityZhou-Yu Meng@Nanjing UniversityYi Zhang@Nanjing UniversityYang Yu@Nanjing UniversityTong Lu@Nanjing University
庞振佳@南京大学刘若泽@南京大学孟梦玉@南京大学张艺@南京大学杨宇@南京大学童璐@南京大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
StarCraft II poses a grand challenge for reinforcement learning. The main difficulties include huge state space varying action space long horizon etc. In this paper we investigate a set of techniques of reinforcement learning for the full-length game of StarCraft II. We investigate a hierarchical approach where the hierarchy involves two levels of abstraction. One is the macro-actions extracted from expert’s demonstration trajectories which can reduce the action space in an order of magnitude yet remain effective. The other is a two-layer hierarchical architecture which is modular and easy to scale. We also investigate a curriculum transfer learning approach that trains the agent from the simplest opponent to harder ones. On a 64×64 map and using restrictive units we train the agent on a single machine with 4 GPUs and 48 CPU threads. We achieve a winning rate of more than 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat model we can achieve over 93% winning rate against the most difficult noncheating built-in AI (level-7) within days. We hope this study could shed some light on the future research of large-scale reinforcement learning.
《星际争霸2》对强化学习提出了巨大挑战。主要困难包括巨大的状态空间、变化的动作空间、很长的时间跨度等。在本文中,我们研究了一套用于《星际争霸2》完整对局的强化学习技术。我们研究了一种分层方法,其中的层级涉及两个抽象级别。一个是从专家演示轨迹中提取的宏动作,可以将动作空间减小一个数量级,同时仍然有效。另一个是两层的分层体系结构,它是模块化的并且易于扩展。我们还研究了一种课程迁移学习方法,让智能体从最简单的对手逐步训练到更难的对手。在64×64的地图上并使用受限单位,我们在一台具有4个GPU和48个CPU线程的机器上训练智能体。对难度为1级的内置AI,我们的胜率超过99%。通过课程迁移学习算法和战斗模型的混合,我们可以在几天之内对最难的非作弊内置AI(第7级)取得超过93%的胜率。我们希望这项研究可以为未来的大规模强化学习研究提供一些启示。

A Comparative Analysis of Expected and Distributional Reinforcement Learning
预期强化学习与分布强化学习的比较分析
Clare Lyle@McGill UniversityMarc G. Bellemare@Google BrainPablo Samuel Castro@Google
克莱尔·莱尔(Clare Lyle)@麦吉尔大学马克·G·贝勒马尔(Marc G. Bellemare)@Google Brain帕布罗·塞缪尔·卡斯特罗(Pablo Samuel Castro)@Google
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Since their introduction a year ago, distributional approaches to reinforcement learning (distributional RL) have produced strong results relative to the standard approach, which models expected values (expected RL). However, aside from convergence guarantees, there have been few theoretical results investigating the reasons behind the improvements distributional RL provides. In this paper, we begin the investigation into this fundamental question by analyzing the differences in the tabular, linear approximation, and non-linear approximation settings. We prove that in many realizations of the tabular and linear approximation settings, distributional RL behaves exactly the same as expected RL. In cases where the two methods behave differently, distributional RL can in fact hurt performance when it does not induce identical behaviour. We then continue with an empirical analysis comparing distributional and expected RL methods in control settings with non-linear approximators to tease apart where the improvements from distributional RL methods are coming from.
自从一年前被提出以来,强化学习的分布式方法(分布RL)相对于建模期望值的标准方法(期望RL)取得了很强的结果。但是,除了收敛性保证外,很少有理论结果研究分布RL所带来改进背后的原因。在本文中,我们通过分析表格设置、线性逼近设置和非线性逼近设置之间的差异来开始研究这一基本问题。我们证明,在表格和线性逼近设置的许多情形中,分布RL的行为与期望RL完全相同。在两种方法行为不同的情况下,当分布RL没有诱导出相同的行为时,它实际上可能损害性能。然后,我们进行实证分析,在使用非线性逼近器的控制设置中比较分布RL与期望RL方法,以厘清分布RL方法的改进究竟来自何处。
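For readers who want to see the two backup rules being compared side by side, below is a minimal tabular sketch (toy numbers, not the paper's experiments): a standard expected Q-learning backup next to a categorical (C51-style) distributional backup that shifts and scales a fixed support and projects the result back onto it. The support size, mixture-style update, and all constants are assumptions made for illustration.

```python
import numpy as np

# Fixed support for the categorical value distribution (C51-style).
N, VMIN, VMAX = 21, 0.0, 10.0
z = np.linspace(VMIN, VMAX, N)
dz = (VMAX - VMIN) / (N - 1)

def expected_backup(q, r, q_next, gamma=0.9, alpha=0.1):
    """Standard expected Q-learning backup for a single (s, a, r, s') sample."""
    return q + alpha * (r + gamma * q_next.max() - q)

def categorical_backup(p, r, p_next_greedy, gamma=0.9, alpha=0.1):
    """Distributional backup: shift/scale the support, project onto z, then mix."""
    tz = np.clip(r + gamma * z, VMIN, VMAX)            # Bellman-updated atoms
    b = (tz - VMIN) / dz                               # fractional index of each atom
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    target = np.zeros(N)
    np.add.at(target, lo, p_next_greedy * (hi - b))
    np.add.at(target, hi, p_next_greedy * (b - lo))
    np.add.at(target, lo, p_next_greedy * (lo == hi))  # atom lands exactly on a grid point
    return (1 - alpha) * p + alpha * target            # simple mixture update (tabular stand-in)

if __name__ == "__main__":
    q = 0.0
    q_next = np.array([3.0, 5.0])
    p = np.full(N, 1.0 / N)                            # uniform initial distribution
    p_next_greedy = np.zeros(N); p_next_greedy[10] = 1.0   # next-state greedy action's distribution
    q = expected_backup(q, r=1.0, q_next=q_next)
    p = categorical_backup(p, r=1.0, p_next_greedy=p_next_greedy)
    print("expected value estimate:", q)
    print("distributional mean estimate:", (p * z).sum())
```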

Towards Sequence-to-Sequence Reinforcement Learning for Constraint Solving with Constraint-Based Local Search
面向约束求解的序列到序列强化学习:结合基于约束的局部搜索
Helge Spieker@Simula Research Laboratory
Helge Spieker @ Simula研究实验室
Student Abstracts
学生文摘
This paper proposes a framework for solving constraint problems with reinforcement learning (RL) and sequence-to-sequence recurrent neural networks. We approach constraint solving as a declarative machine learning problem where, for a variable-length input sequence, a variable-length output sequence has to be predicted. Using randomly generated instances and the number of constraint violations as a reward function, a problem-specific RL agent is trained to solve the problem. The predicted solution candidate of the RL agent is verified and repaired by constraint-based local search (CBLS) to ensure solutions that satisfy the constraint model. We introduce the framework and its components and discuss early results and future applications.
本文提出了一种使用强化学习(RL)和序列到序列循环神经网络求解约束问题的框架。我们将约束求解视为一个声明式机器学习问题:对于可变长度的输入序列,需要预测可变长度的输出序列。使用随机生成的实例并以约束违反次数作为奖励函数,训练针对特定问题的RL智能体来求解该问题。RL智能体预测的候选解由基于约束的局部搜索(CBLS)进行验证和修复,以确保解满足约束模型。我们介绍了该框架及其组件,并讨论了早期结果和未来的应用。

Learning to Communicate and Solve Visual Blocks-World Tasks
学习交流并解决视觉积木世界任务
Qi Zhang@University of MichiganRichard Lewis@University of MichiganSatinder Singh@University of MichiganEdmund Durfee@University of Michigan
张琦@密歇根大学理查德·刘易斯@密歇根大学Satinder Singh @密歇根大学埃德蒙·杜尔费@密歇根大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We study emergent communication between speaker and listener recurrent neural-network agents that are tasked to cooperatively construct a blocks-world target image sampled from a generative grammar of blocks configurations. The speaker receives the target image and learns to emit a sequence of discrete symbols from a fixed vocabulary. The listener learns to construct a blocks-world image by choosing block placement actions as a function of the speaker's full utterance and the image of the ongoing construction. Our contributions are (a) the introduction of a task domain for studying emergent communication that is both challenging and affords useful analyses of the emergent protocols; (b) an empirical comparison of the interpolation and extrapolation performance of training via supervised, (contextual) bandit, and reinforcement learning; and (c) evidence for the emergence of interesting linguistic properties in the RL agent protocol that are distinct from the other two.
我们研究说话者与聆听者两个循环神经网络智能体之间涌现的通信,它们的任务是协作构建一个积木世界目标图像,该图像采样自积木配置的生成语法。说话者接收目标图像,并学习从固定词汇表中发出一串离散符号。聆听者根据说话者的完整话语和当前构建的图像选择积木放置动作,从而学习构建积木世界图像。我们的贡献是:(a)引入了一个研究涌现通信的任务域,它既具有挑战性,又便于对涌现的协议进行有用的分析;(b)对通过监督学习、上下文老虎机(contextual bandit)和强化学习进行训练的内插与外推性能进行了实证比较;(c)给出了证据,表明RL智能体的协议中涌现出有趣的语言性质,且不同于另外两种方法。

Model Learning for Look-Ahead Exploration in Continuous Control
连续控制中的超前探索模型学习
Arpit Agarwal@Carnegie Mellon UniversityKatharina Muelling@Carnegie Mellon UniversityKaterina Fragkiadaki@Carnegie Mellon University
Arpit Agarwal@卡耐基梅隆大学卡塔琳娜·穆林(Katharina Muelling)@卡耐基梅隆大学卡特琳娜·弗拉基亚达基(Katerina Fragkiadaki)@卡耐基梅隆大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We propose an exploration method that incorporates look-ahead search over basic learnt skills and their dynamics, and use it for reinforcement learning (RL) of manipulation policies. Our skills are multi-goal policies learned in isolation in simpler environments, using existing multi-goal RL formulations, analogous to options or macro-actions. Coarse skill dynamics, i.e., the state transition caused by a (complete) skill execution, are learnt and are unrolled forward during look-ahead search. Policy search benefits from temporal abstraction during exploration, though it itself operates over low-level primitive actions, and thus the resulting policies do not suffer from the suboptimality and inflexibility caused by coarse skill chaining. We show that the proposed exploration strategy results in effective learning of complex manipulation policies faster than current state-of-the-art RL methods, and converges to better policies than methods that use options or parameterized skills as building blocks of the policy itself, as opposed to guiding exploration.
我们提出了一种探索方法,它在已学得的基本技能及其动态之上进行前瞻搜索,并将其用于操纵策略的强化学习(RL)。我们的技能是在更简单的环境中独立学得的多目标策略,使用的是类似于option或宏动作的现有多目标RL形式。粗粒度的技能动态,即一次(完整)技能执行所引起的状态转移,会被学习出来,并在前瞻搜索中向前展开。策略搜索在探索过程中受益于时间抽象,而其本身仍在低层原子动作上运行,因此得到的策略不会因粗糙的技能串联而出现次优和缺乏灵活性的问题。我们表明,所提出的探索策略比当前最先进的RL方法更快地有效学习复杂的操纵策略,并且与那些将option或参数化技能用作策略本身构建模块(而非用于指导探索)的方法相比,能收敛到更好的策略。

Interactive Semantic Parsing for If-Then Recipes via Hierarchical Reinforcement Learning
通过分层强化学习对If-Then配方进行交互式语义解析
Ziyu Yao@The Ohio State UniversityXiujun Li@Microsoft ResearchJianfeng Gao@Microsoft ResearchBrian Sadler@Army Research LaboratoryHuan Sun@The Ohio State University
姚子瑜@俄亥俄州立大学李秀军@微软研究院高建峰@微软研究院布莱恩·萨德勒@陆军研究实验室孙欢@俄亥俄州立大学
AAAI Technical Track: Human-AI Collaboration
AAAI技术专栏:人与人工智能的协作
Given a text description, most existing semantic parsers synthesize a program in one shot. However, it is quite challenging to produce a correct program solely based on the description, which in reality is often ambiguous or incomplete. In this paper, we investigate interactive semantic parsing, where the agent can ask the user clarification questions to resolve ambiguities via a multi-turn dialogue, on an important type of programs called “If-Then recipes.” We develop a hierarchical reinforcement learning (HRL) based agent that significantly improves the parsing performance with minimal questions to the user. Results under both simulation and human evaluation show that our agent substantially outperforms non-interactive semantic parsers and rule-based agents.
给定文本描述,大多数现有的语义解析器会一次性合成一个程序。但是,仅根据描述来生成正确的程序是非常具有挑战性的,因为描述在现实中往往是模棱两可或不完整的。在本文中,我们针对一类称为"If-Then配方"的重要程序研究交互式语义解析,其中智能体可以通过多轮对话向用户提出澄清问题以消除歧义。我们开发了一种基于分层强化学习(HRL)的智能体,它在向用户提出尽可能少的问题的情况下显著提高了解析性能。仿真和人工评估的结果均表明,我们的智能体大大优于非交互式语义解析器和基于规则的智能体。

Learning Representations in Model-Free Hierarchical Reinforcement Learning
无模型层次强化学习中的学习表示
Jacob Rafati@University of California MercedDavid C. Noelle@University of California Merced
雅各布·拉法蒂(Jacob Rafati)@加州大学默塞德大学戴维·诺埃尔@加州大学默塞德大学
Student Abstracts
学生文摘
Common approaches to Reinforcement Learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse delayed reward feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals in concert with the learning of corresponding skill policies to achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment but the need to learn such a model introduces its own problems of scale. Once subgoals are identified skills may be learned through intrinsic motivation introducing an internal reward signal marking subgoal attainment. We present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the most recent experiences of the agent. When combined with an intrinsic motivation learning mechanism this method learns subgoals and skills together based on experiences in the environment. Thus we offer an original approach to HRL that does not require the acquisition of a model of the environment suitable for large-scale applications. We demonstrate the efficiency of our method on a variant of the rooms environment.
强化学习(RL)的常用方法在涉及巨大状态空间和稀疏、延迟奖励反馈的大规模应用中面临严峻挑战。分层强化学习(HRL)方法试图通过在多个时间抽象级别上学习动作选择策略来解决这一可扩展性问题。实现抽象的一种途径是识别出相对较小的一组可能适合作为子目标的状态,并同时学习实现这些子目标的相应技能策略。HRL中许多子目标发现方法都依赖于对环境模型的分析,但学习这样的模型又会带来其自身的规模问题。一旦确定了子目标,就可以通过内在动机来学习技能,即引入标记子目标达成的内部奖励信号。我们提出了一种新颖的无模型子目标发现方法,它在存储智能体最近经验的小型记忆上使用增量式无监督学习。当与内在动机学习机制结合时,该方法可根据环境中的经验同时学习子目标和技能。因此,我们为HRL提供了一种不需要获取环境模型、适用于大规模应用的原创方法。我们在rooms环境的一个变体上展示了该方法的效率。

DBA: Dynamic Multi-Armed Bandit Algorithm
DBA:动态多臂老虎机算法
Sadegh Nobari@Rakuten Inc.
萨迪·诺巴里(Sadegh Nobari)@ Rakuten Inc.
Demonstration Track Abstracts
演示专题摘要
We introduce Dynamic Bandit Algorithm (DBA), a practical solution to improve the shortcoming of the pervasively employed reinforcement learning algorithm called Multi-Arm Bandit, aka Bandit. Bandit makes real-time decisions based on the prior observations. However, Bandit is heavily biased to the priors, such that it cannot quickly adapt itself to a trend that is interchanging. As a result, Bandit cannot quickly enough make profitable decisions when the trend is changing. Unlike Bandit, DBA focuses on quickly adapting itself to detect these trends early enough. Furthermore, DBA remains almost as light as Bandit in terms of computations. Therefore, DBA can be easily deployed in production as a light process, similar to Bandit. We demonstrate how critical and beneficial is the main focus of DBA, i.e., the ability to quickly find the most profitable option in real-time, over its state-of-the-art competitors. Our experiments are augmented with a visualization mechanism that explains the profitability of the decisions made by each algorithm in each step by animations. Finally, we observe that DBA can substantially outperform the original Bandit by close to 3 times for a set Key Performance Indicator (KPI) in a case of having 3 arms.
我们引入了动态老虎机算法(DBA),这是一种实用的解决方案,用于改进被广泛采用的强化学习算法——多臂老虎机(Multi-Armed Bandit,简称Bandit)的不足。Bandit根据先前的观察做出实时决策。然而,Bandit严重偏向先验,无法迅速适应正在交替变化的趋势。结果,当趋势发生变化时,Bandit无法足够快地做出有利可图的决策。与Bandit不同,DBA专注于快速自我调整,从而足够早地发现这些趋势。此外,在计算量方面,DBA几乎与Bandit一样轻量。因此,与Bandit类似,DBA可以作为一个轻量级进程轻松部署到生产环境中。我们展示了DBA的核心关注点,即比最先进的竞争方法更快地实时找到最有利可图的选项,是多么关键和有益。我们的实验还配有可视化机制,通过动画解释每种算法在每一步所做决策的收益。最后,我们观察到,在3个臂的情况下,对于设定的关键绩效指标(KPI),DBA可以比原始Bandit高出近3倍。
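The abstract does not spell out DBA's update rule, so the sketch below only illustrates the underlying non-stationarity problem it targets: a sample-average epsilon-greedy bandit adapts slowly when the best arm flips mid-stream, whereas a constant-step-size (recency-weighted) estimate tracks the change. The environment, constants, and the recency-weighting remedy are generic assumptions, not Rakuten's DBA.

```python
import numpy as np

def run_bandit(step_size=None, T=4000, eps=0.1, seed=1):
    """Two-armed Bernoulli bandit whose best arm flips at T/2.
    step_size=None -> sample averages (slow to adapt); a constant -> recency-weighted."""
    rng = np.random.default_rng(seed)
    q, n = np.zeros(2), np.zeros(2)
    reward_total = 0.0
    for t in range(T):
        p = (0.7, 0.3) if t < T // 2 else (0.3, 0.7)     # the trend interchanges here
        a = rng.integers(2) if rng.random() < eps else int(q.argmax())
        r = float(rng.random() < p[a])
        n[a] += 1
        alpha = (1.0 / n[a]) if step_size is None else step_size
        q[a] += alpha * (r - q[a])
        reward_total += r
    return reward_total / T

if __name__ == "__main__":
    print("sample-average bandit  :", round(run_bandit(step_size=None), 3))
    print("recency-weighted bandit:", round(run_bandit(step_size=0.1), 3))
```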

Learning to Teach in Cooperative Multiagent Reinforcement Learning
在协作式多主体强化学习中学习教学
Shayegan Omidshafiei@Massachusetts Institute of TechnologyDong-Ki Kim@Massachusetts Institute of TechnologyMiao Liu@IBMGerald Tesauro@IBM ResearchMatthew Riemer@IBM ResearchChristopher Amato@Northeastern UniversityMurray Campbell@IBM ResearchJonathan P. How@Massachusetts Institute of Technology
Shayegan Omidshafiei@麻省理工学院、金东基@麻省理工学院、Miao Liu@IBM、Gerald Tesauro@IBM Research、Matthew Riemer@IBM Research、克里斯托弗·阿马托@东北大学、Murray Campbell@IBM Research、乔纳森·P·豪(Jonathan P. How)@麻省理工学院
AAAI Technical Track: Multiagent Systems
AAAI技术专题:多代理系统
Collective human knowledge has clearly benefited from the fact that innovations by individuals are taught to others through communication. Similar to human social groups agents in distributed learning systems would likely benefit from communication to share knowledge and teach skills. The problem of teaching to improve agent learning has been investigated by prior works but these approaches make assumptions that prevent application of teaching to general multiagent problems or require domain expertise for problems they can apply to. This learning to teach problem has inherent complexities related to measuring long-term impacts of teaching that compound the standard multiagent coordination challenges. In contrast to existing works this paper presents the first general framework and algorithm for intelligent agents to learn to teach in a multiagent environment. Our algorithm Learning to Coordinate and Teach Reinforcement (LeCTR) addresses peer-to-peer teaching in cooperative multiagent reinforcement learning. Each agent in our approach learns both when and what to advise then uses the received advice to improve local learning. Importantly these roles are not fixed; these agents learn to assume the role of student and/or teacher at the appropriate moments requesting and providing advice in order to improve teamwide performance and learning. Empirical comparisons against state-of-the-art teaching methods show that our teaching agents not only learn significantly faster but also learn to coordinate in tasks where existing methods fail.
人类的集体知识显然得益于这样一个事实,即通过交流将个人的创新教给他人。与人类社会团体类似,分布式学习系统中的代理人可能会从交流中受益,以分享知识和教授技能。先前的工作已经研究了改善代理学习的教学问题,但是这些方法做出的假设妨碍了将教学应用于一般的多代理问题,或者需要针对这些问题的领域专业知识。这种学习教学问题具有与衡量教学的长期影响相关的固有复杂性,这些影响使标准的多主体协调挑战变得复杂。与现有工作相反,本文介绍了第一个通用框架和算法,用于智能代理在多代理环境中学习教学。我们的协调学习和强化教学算法(LeCTR)解决了协作式多主体强化学习中的点对点教学。我们方法中的每个代理都学习建议的时间和内容,然后使用收到的建议来改进本地学习。重要的是,这些角色不是固定的。这些代理人学会在要求和提供建议的适当时候承担学生和/或老师的角色,以改善整个团队的表现和学习。与最先进的教学方法进行的经验比较表明,我们的教学人员不仅学习速度显着提高,而且还学会在现有方法失败的情况下进行协调。

Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications
逆向强化学习的机器教学:算法与应用
Daniel S. Brown@University of Texas at AustinScott Niekum@University of Texas at Austin
丹尼尔·布朗(Daniel S.Brown)@德克萨斯大学奥斯汀分校斯科特·尼克(Scott Niekum)@德克萨斯大学奥斯汀分校
AAAI Technical Track: Reasoning under Uncertainty
AAAI技术专题:不确定性下的推理
Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem, where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem, which enables an efficient approximation algorithm for determining the set of maximally informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL, and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.
逆向强化学习(IRL)从演示中推断奖励函数,从而实现策略改进与泛化。然而,尽管近来对IRL的兴趣很高,理解教授某个特定顺序决策任务所需的最小演示集的工作仍然很少。我们将为IRL寻找信息量最大的演示这一问题形式化为机器教学问题,其目标是找到为确定演示者的奖励等价类所需的最少演示数量。我们通过给出到集合覆盖问题的归约,扩展了先前关于顺序决策任务算法化教学的工作,从而得到一种用于确定信息量最大的演示集的高效近似算法。我们将所提出的机器教学算法应用于两个新颖的应用:为使用主动IRL学习策略所需的查询数量提供下界,以及开发一种新颖的IRL算法,它能比标准IRL方法更高效地从信息丰富的演示中学习。
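Since the abstract leans on a reduction to set cover with an efficient approximation, the textbook greedy set-cover routine is sketched below for reference; the universe and sets are placeholders, and the actual mapping from candidate demonstrations to covering sets is the paper's contribution and is not reproduced.

```python
def greedy_set_cover(universe, sets):
    """Classic greedy set cover with a (1 + ln|U|) approximation guarantee.
    universe: set of elements to cover; sets: dict name -> set of elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set covering the most still-uncovered elements.
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given sets")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

if __name__ == "__main__":
    # Placeholder instance: elements could stand for constraints that candidate
    # demonstrations (the sets) each resolve; names are hypothetical.
    universe = {1, 2, 3, 4, 5}
    sets = {"demo_a": {1, 2, 3}, "demo_b": {2, 4}, "demo_c": {4, 5}, "demo_d": {5}}
    print(greedy_set_cover(universe, sets))   # e.g. ['demo_a', 'demo_c']
```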

Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
重新思考强化学习中的折扣因子:一种决策论方法
Silviu Pitis@University of Toronto
Silviu Pitis @多伦多大学
AAAI Technical Track: Reasoning under Uncertainty
AAAI技术专题:不确定性下的推理
Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings with fixed discount factor γ < 1 or in episodic settings with γ = 1. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent “discount” factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique “optimizing MDP” with fixed γ < 1 whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from the traditional RL, and is relevant to task specification in RL, inverse RL, and preference-based RL.
传统上,强化学习(RL)智能体的任务是最大化马尔可夫决策过程(MDP)的价值函数,要么是在折扣因子固定为γ<1的连续设置下,要么是在γ=1的回合制(episodic)设置下。虽然这已被证明对具有明确目标的特定任务(例如游戏)是有效的,但从未有人证明固定折扣适用于通用目的(例如,作为人类偏好的模型)。本文用一组七条公理刻画了顺序决策中的理性,并得出一种推广了传统固定折扣的折扣形式。特别地,我们的框架允许一个依赖于状态-动作的"折扣"因子,它不必被限制为小于1,只要最终存在长期折扣即可。尽管这扩大了连续设置中可能的偏好结构范围,我们证明存在一个唯一的、折扣因子固定为γ<1的"优化MDP",其最优价值函数与最优策略的真实效用相匹配,并且我们量化了次优策略的价值与效用之间的差异。我们的工作可以被视为为Martha White的RL任务形式化(2017)(的一个轻微推广)以及其他近期偏离传统RL的工作提供了规范性论证,并且与RL、逆向RL和基于偏好的RL中的任务规范相关。
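A compact way to state the generalization described above (notation assumed here, not copied from the paper): the usual Bellman equation with a fixed discount versus one with a state-action dependent discount, which need not be below 1 pointwise as long as there is eventual long-run discounting.

```latex
% Fixed discount:
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi,\, s' \sim P}\!\left[ r(s,a) + \gamma\, V^{\pi}(s') \right], \qquad \gamma < 1 .
% State-action dependent discount (with eventual long-run discounting assumed):
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi,\, s' \sim P}\!\left[ r(s,a) + \gamma(s,a)\, V^{\pi}(s') \right].
```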

Deep Reinforcement Learning via Past-Success Directed Exploration
通过过去的成功探索进行深度强化学习
Xiaoming Liu@Army Engineering UniversityZhixiong Xu@Army Engineering UniversityLei Cao@Army Engineering UniversityXiliang Chen@Army Engineering UniversityKai Kang@Army Engineering University
刘晓明@陆军工程大学许志雄@陆军工程大学曹雷@陆军工程大学陈锡良@陆军工程大学康凯@陆军工程大学
Student Abstracts
学生文摘
The balance between exploration and exploitation has always been a core challenge in reinforcement learning. This paper proposes “past-success exploration strategy combined with Softmax action selection”(PSE-Softmax) as an adaptive control method for taking advantage of the characteristics of the online learning process of the agent to adapt exploration parameters dynamically. The proposed strategy is tested on OpenAI Gym with discrete and continuous control tasks and the experimental results show that PSE-Softmax strategy delivers better performance than deep reinforcement learning algorithms with basic exploration strategies.
探索与开发之间的平衡一直是强化学习中的核心挑战。提出了“结合Softmax动作选择的过去成功探索策略”(PSE-Softmax)作为一种自适应控制方法,以利用Agent在线学习过程的特点动态地适应探索参数。所提出的策略在具有离散和连续控制任务的OpenAI Gym上进行了测试,实验结果表明,与采用基本探索策略的深度强化学习算法相比,PSE-Softmax策略具有更好的性能。
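As context for the exploration strategy above, here is a minimal Boltzmann (softmax) action-selection sketch in which the temperature is annealed by a running success signal; the class name and the specific adaptation rule are hypothetical illustrations, not the PSE-Softmax rule from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(q_values, temperature):
    """Boltzmann exploration: higher temperature -> closer to uniform action choice."""
    logits = np.asarray(q_values) / max(temperature, 1e-8)
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_values), p=probs)

class SuccessAnnealedTemperature:
    """Hypothetical schedule: cool down as recent episodes keep succeeding."""
    def __init__(self, t_max=1.0, t_min=0.05, decay=0.99):
        self.t, self.t_min, self.decay = t_max, t_min, decay
    def update(self, episode_succeeded):
        if episode_succeeded:
            self.t = max(self.t_min, self.t * self.decay)
        return self.t

if __name__ == "__main__":
    temp = SuccessAnnealedTemperature()
    q = [0.1, 0.5, 0.2]
    for episode in range(5):
        a = softmax_action(q, temp.t)
        temp.update(episode_succeeded=(a == 1))   # pretend action 1 is the "successful" one
        print(f"episode {episode}: action={a}, temperature={temp.t:.3f}")
```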

Switch-Based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning
基于交换机的主动Deep Dyna-Q:针对任务完成对话策略学习的高效自适应计划
Yuexin Wu@Carnegie Mellon UniversityXiujun Li@Microsoft ResearchJingjing Liu@MicrosoftJianfeng Gao@Microsoft ResearchYiming Yang@Carnegie Mellon University
吴跃新@卡内基梅隆大学、Xiujun Li@微软研究院、刘静静@微软、Jianfeng Gao@微软研究院、Yiming Yang@卡内基梅隆大学
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences. The Dyna-Q algorithm extends Q-learning by integrating a world model, and thus can effectively boost training efficiency using simulated experiences generated by the world model. The effectiveness of Dyna-Q, however, depends on the quality of the world model - or, implicitly, the pre-specified ratio of real vs. simulated experiences used for Q-learning. To this end, we extend the recently proposed Deep Dyna-Q (DDQ) framework by integrating a switcher that automatically determines whether to use a real or simulated experience for Q-learning. Furthermore, we explore the use of active learning for improving sample efficiency, by encouraging the world model to generate simulated experiences in the state-action space where the agent has not (fully) explored. Our results show that by combining switcher and active learning, the new framework, named Switch-based Active Deep Dyna-Q (Switch-DDQ), leads to significant improvement over DDQ and Q-learning baselines in both simulation and human evaluations.
通过强化学习来训练任务完成型对话智能体通常需要大量真实的用户体验。Dyna-Q算法通过集成世界模型扩展了Q学习,因此可以利用世界模型生成的模拟经验有效提高训练效率。然而,Dyna-Q的有效性取决于世界模型的质量,或者隐含地取决于预先指定的、用于Q学习的真实经验与模拟经验之比。为此,我们扩展了最近提出的Deep Dyna-Q(DDQ)框架,集成了一个切换器(switcher),它能自动决定在Q学习中使用真实经验还是模拟经验。此外,我们探索使用主动学习来提高样本效率,鼓励世界模型在智能体尚未(充分)探索的状态-动作空间中生成模拟经验。我们的结果表明,通过结合切换器和主动学习,这个名为基于切换器的主动Deep Dyna-Q(Switch-DDQ)的新框架,在仿真和人工评估中均显著优于DDQ和Q学习基线。
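Switch-DDQ builds on Dyna-Q, so a compact tabular Dyna-Q loop may help place the contribution: real transitions update Q directly and also feed a learned model, which is then replayed for planning. The toy chain environment and all constants below are placeholders; the switcher and the active-learning-driven user simulator that the paper adds are not sketched.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy 5-state chain: walking right reaches the goal (placeholder environment)."""
    n_actions = 2                                   # 0 = left, 1 = right
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, a):
        self.pos = min(4, self.pos + 1) if a == 1 else max(0, self.pos - 1)
        done = self.pos == 4
        return (1.0 if done else 0.0), self.pos, done

def dyna_q(env, episodes=100, planning_steps=10, alpha=0.1, gamma=0.95, eps=0.3):
    """Tabular Dyna-Q: learn from real transitions, then replay simulated transitions
    drawn from a learned deterministic model."""
    Q = defaultdict(float)                          # (state, action) -> value
    model = {}                                      # (state, action) -> (reward, next_state, done)
    def greedy(s):
        return max(range(env.n_actions), key=lambda a: Q[(s, a)])
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(env.n_actions) if random.random() < eps else greedy(s)
            r, s2, done = env.step(a)
            target = r if done else r + gamma * Q[(s2, greedy(s2))]
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # direct RL from real experience
            model[(s, a)] = (r, s2, done)                  # update the world model
            for _ in range(planning_steps):                # planning with simulated experience
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * Q[(ps2, greedy(ps2))]
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

if __name__ == "__main__":
    Q = dyna_q(ChainEnv())
    print("greedy action in state 0:", max(range(2), key=lambda a: Q[(0, a)]))
```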

Asynchronous Proximal Stochastic Gradient Algorithm for Composition Optimization Problems
异步近邻随机梯度算法求解成分优化问题
Pengfei Wang@Zhejiang UniversityRisheng Liu@Dalian University of TechnologyNenggan Zheng@Zhejiang UniversityZhefeng Gong@Zhejiang University
王鹏飞@浙江大学刘瑞生@大连理工大学郑能干@浙江大学Zhefeng Gong@浙江大学
AAAI Technical Track: Constraint Satisfaction and Optimization
AAAI技术专栏:约束满足与优化
In machine learning research many emerging applications can be (re)formulated as the composition optimization problem with nonsmooth regularization penalty. To solve this problem traditional stochastic gradient descent (SGD) algorithm and its variants either have low convergence rate or are computationally expensive. Recently several stochastic composition gradient algorithms have been proposed however these methods are still inefficient and not scalable to large-scale composition optimization problem instances. To address these challenges we propose an asynchronous parallel algorithm named Async-ProxSCVR which effectively combines asynchronous parallel implementation and variance reduction method. We prove that the algorithm admits the fastest convergence rate for both strongly convex and general nonconvex cases. Furthermore we analyze the query complexity of the proposed algorithm and prove that linear speedup is accessible when we increase the number of processors. Finally we evaluate our algorithm Async-ProxSCVR on two representative composition optimization problems including value function evaluation in reinforcement learning and sparse mean-variance optimization problem. Experimental results show that the algorithm achieves significant speedups and is much faster than existing compared methods.
在机器学习研究中,可以将许多新兴应用(重新)公式化为具有不平滑正则损失的成分优化问题。为了解决这个问题,传统的随机梯度下降(SGD)算法及其变体收敛速度低或计算量大。最近,已经提出了几种随机的成分梯度算法,但是这些方法仍然效率低下并且不能扩展到大规模的成分优化问题实例。为了解决这些挑战,我们提出了一种名为Async-ProxSCVR的异步并行算法,该算法有效地结合了异步并行实现和方差减少方法。我们证明,对于强凸和一般非凸情况,该算法均能以最快的速度收敛。此外,我们分析了所提出算法的查询复杂度,并证明了当我们增加处理器数量时,线性加速是可访问的。最后,我们针对两个代表性的构图优化问题(包括强化学习中的值函数评估和稀疏均方差优化问题)对算法Async-ProxSCVR进行了评估。实验结果表明,该算法具有明显的加速效果,并且比现有的比较方法要快得多。

Attention-Aware Sampling via Deep Reinforcement Learning for Action Recognition
通过深度强化学习进行动作识别的注意感知采样
Wenkai Dong@National Laboratory of Pattern RecognitionZhaoxiang Zhang@National Laboratory of Pattern RecognitionTieniu Tan@National Laboratory of Pattern Recognition
董文凯@模式识别国家重点实验室张兆翔@模式识别国家重点实验室谭铁牛@模式识别国家重点实验室
AAAI Technical Track: Vision
AAAI技术专题:视觉
Deep learning based methods have achieved remarkable progress in action recognition. Existing works mainly focus on designing novel deep architectures to achieve video representations learning for action recognition. Most methods treat sampled frames equally and average all the frame-level predictions at the testing stage. However within a video discriminative actions may occur sparsely in a few frames and most other frames are irrelevant to the ground truth and may even lead to a wrong prediction. As a result we think that the strategy of selecting relevant frames would be a further important key to enhance the existing deep learning based action recognition. In this paper we propose an attentionaware sampling method for action recognition which aims to discard the irrelevant and misleading frames and preserve the most discriminative frames. We formulate the process of mining key frames from videos as a Markov decision process and train the attention agent through deep reinforcement learning without extra labels. The agent takes features and predictions from the baseline model as input and generates importance scores for all frames. Moreover our approach is extensible which can be applied to different existing deep learning based action recognition models. We achieve very competitive action recognition performance on two widely used action recognition datasets.
基于深度学习的方法在动作识别方面取得了显着进步。现有作品主要集中在设计新颖的深度架构上,以实现学习动作识别的视频表示。大多数方法均等地对待采样的帧,并在测试阶段平均所有帧级别的预测。但是,在视频中,判别动作可能很少出现在几帧中,而其他大多数帧与地面真实情况无关,甚至可能导致错误的预测。因此,我们认为选择相关框架的策略将是增强现有基于深度学习的动作识别的另一个重要关键。在本文中,我们提出了一种用于动作识别的注意感知采样方法,该方法旨在丢弃无关紧要的框架,并保留最具区分性的框架。我们将从视频中挖掘关键帧的过程公式化为马尔可夫决策过程,并通过无需额外标签的深度强化学习来训练注意主体。该代理将来自基线模型的特征和预测作为输入,并为所有框架生成重要性得分。此外,我们的方法是可扩展的,可以应用于现有的基于深度学习的不同动作识别模型。我们在两个广泛使用的动作识别数据集上获得了非常有竞争力的动作识别性能。

Generation of Policy-Level Explanations for Reinforcement Learning
强化学习的策略级解释的生成
Nicholay Topin@Carnegie Mellon UniversityManuela Veloso@Carnegie Mellon University
Nicholay Topin @卡内基梅隆大学Manuela Veloso @卡内基梅隆大学
AAAI Technical Track: Human-AI Collaboration
AAAI技术专栏:人与人工智能的协作
Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies, given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, O(|F|^2 |tr samples|). By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.
尽管强化学习已从神经网络的引入中受益匪浅,但无法验证此类系统的正确性限制了它们的使用。可解释深度学习的现有工作着眼于仅根据输入特征来解释单个决策,因而不适合解释一系列决策。为了满足这一需求,我们引入了抽象策略图(Abstracted Policy Graphs),即抽象状态上的马尔可夫链。这种表示简明地概括了一个策略,使得各个决策可以在预期的未来转移的背景下得到解释。此外,我们提出了一种方法,在给定已学得的价值函数和一组观察到的转移(可能是训练期间使用的离策略转移)的情况下,为确定性策略生成这些抽象策略图。由于对价值函数如何产生没有任何限制,我们的方法与许多现有的强化学习方法兼容。我们证明,该方法在最坏情况下的时间复杂度关于特征数量是二次的,关于所提供的转移数量是线性的,即O(|F|^2 |tr samples|)。通过将我们的方法应用于一系列领域,我们表明该方法在实践中具有良好的可扩展性,并能生成可靠捕获这些领域内关系的抽象策略图。

Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning
对话的产生:从模仿学习到反强化学习
Ziming Li@University of AmsterdamJulia Kiseleva@University of AmsterdamMaarten de Rijke@University of Amsterdam
李自明@阿姆斯特丹大学朱莉娅·基瑟列娃@阿姆斯特丹大学马滕·德·里克@阿姆斯特丹大学
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
The performance of adversarial dialogue generation models relies on the quality of the reward signal produced by the discriminator. The reward signal from a poor discriminator can be very sparse and unstable which may lead the generator to fall into a local optimum or to produce nonsense replies. To alleviate the first problem we first extend a recently proposed adversarial dialogue generation method to an adversarial imitation learning solution. Then in the framework of adversarial inverse reinforcement learning we propose a new reward model for dialogue generation that can provide a more accurate and precise reward signal for generator training. We evaluate the performance of the resulting model with automatic metrics and human evaluations in two annotation settings. Our experimental results demonstrate that our model can generate more high-quality responses and achieve higher overall performance than the state-of-the-art.
对抗式对话生成模型的性能依赖于鉴别器产生的奖励信号的质量。来自较差鉴别器的奖励信号可能非常稀疏和不稳定,这可能导致生成器陷入局部最优或产生无意义的回复。为了缓解第一个问题,我们首先将最近提出的一种对抗式对话生成方法扩展为对抗式模仿学习解决方案。然后,在对抗式逆强化学习的框架下,我们为对话生成提出了一种新的奖励模型,它可以为生成器的训练提供更准确、更精细的奖励信号。我们在两种标注设置下,通过自动指标和人工评估来评价所得模型的性能。实验结果表明,与最新技术相比,我们的模型可以生成更多高质量的回复,并获得更高的整体性能。

Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
通过引导协变量转变进行非策略深度强化学习
Carles Gelada@Google BrainMarc G. Bellemare@Google Brain
Carles Gelada @ Google BrainMarc G. Bellemare @ Google Brain
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing it cannot easily be transferred to nonlinear function approximation. First it requires a projection step onto the probability simplex; second even though the operator describing the expected behavior of the off-policy learning algorithm is convergent it is not known to be a contraction mapping and hence may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain where we find performance gains for our approach.
在本文中,我们将回顾由Hallak等人率先提出的强化学习的非政策修正方法(COP-TD)。 (2017)。在这种方法下,对价值功能的在线更新将进行加权,以避免非政策性学习中常见的分歧问题。尽管Hallak等人的解决方案颇具吸引力,但仍无法轻松地将其转换为非线性函数逼近。首先,它要求在概率单纯形上进行投影;第二,即使描述偏离策略学习算法的预期行为的运算符是收敛的,也不知道它是收缩映射,因此在实践中可能更加不稳定。我们通过在COP-TD中引入折扣因素来解决这两个问题。我们分析了贴现COP-TD的行为,并从理论角度发现了更好的行为。我们还提出了一种替代的软规范化代价,该代价可以在线上最小化,并且不需要明确的投影步骤。我们通过对两种技术的经验评估来补充我们的分析,该评估是在Atari域的Pong游戏的非政策环境中进行的,我们发现折现的COP-TD在实践中比软归一化惩罚更好。最后,我们在Atari域的5个游戏中对折价的COP-TD进行了更广泛的评估,从中我们发现我们的方法可以提高性能。

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient
通过Minimax深度确定性策略梯度进行稳健的多Agent强化学习
Shihui Li@Carnegie MellonYi Wu@University of California BerkeleyXinyue Cui@Tsinghua UniversityHonghua Dong@Tsinghua UniversityFei Fang@Carnegie MellonStuart Russell@University of California Berkeley
李世辉@卡内基梅隆大学、吴翼@加州大学伯克利分校、崔新月@清华大学、Honghua Dong@清华大学、方飞@卡内基梅隆大学、斯图尔特·拉塞尔@加州大学伯克利分校
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Despite the recent advances of deep reinforcement learning (DRL), agents trained by DRL tend to be brittle and sensitive to the training environment, especially in the multi-agent scenarios. In the multi-agent setting, a DRL agent's policy can easily get stuck in a poor local optimum w.r.t. its training partners – the learned policy may be only locally optimal to other agents' current policies. In this paper, we focus on the problem of training robust DRL agents with continuous actions in the multi-agent learning setting, so that the trained agents can still generalize when its opponents' policies alter. To tackle this problem, we proposed a new algorithm, MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG), with the following contributions: (1) we introduce a minimax extension of the popular multi-agent deep deterministic policy gradient algorithm (MADDPG) for robust policy learning; (2) since the continuous action space leads to computational intractability in our minimax learning objective, we propose Multi-Agent Adversarial Learning (MAAL) to efficiently solve our proposed formulation. We empirically evaluate our M3DDPG algorithm in four mixed cooperative and competitive multi-agent environments, and the agents trained by our method significantly outperform existing baselines.
尽管深度强化学习(DRL)最近取得了进展,但由DRL训练出的智能体往往比较脆弱,并且对训练环境敏感,尤其是在多智能体场景中。在多智能体设置中,DRL智能体的策略很容易相对于其训练伙伴陷入糟糕的局部最优,即学到的策略可能只是针对其他智能体当前策略的局部最优。在本文中,我们关注在多智能体学习设置中训练具有连续动作的鲁棒DRL智能体的问题,使训练出的智能体在对手策略改变时仍能泛化。为了解决此问题,我们提出了一种新算法,即极小极大多智能体深度确定性策略梯度(M3DDPG),其贡献如下:(1)我们为流行的多智能体深度确定性策略梯度算法(MADDPG)引入了极小极大扩展,以实现鲁棒的策略学习;(2)由于连续动作空间使我们的极小极大学习目标在计算上难以处理,我们提出了多智能体对抗学习(MAAL)来高效求解所提出的形式。我们在四个混合合作与竞争的多智能体环境中对M3DDPG算法进行了实证评估,用我们的方法训练出的智能体显著优于现有基线。

Attention Guided Imitation Learning and Reinforcement Learning
注意导向的模仿学习与强化学习
Ruohan Zhang@University of Texas at Austin
张若涵@德克萨斯大学奥斯汀分校
Doctoral Consortium Track Abstracts
博士联合会文摘
We propose a framework that uses learned human visual attention model to guide the learning process of an imitation learning or reinforcement learning agent. We have collected high-quality human action and eye-tracking data while playing Atari games in a carefully controlled experimental setting. We have shown that incorporating a learned human gaze model into deep imitation learning yields promising results.
我们提出了一个框架,利用学得的人类视觉注意力模型来指导模仿学习或强化学习智能体的学习过程。我们在精心控制的实验环境中收集了人类玩Atari游戏时的高质量动作和眼动追踪数据。我们已经表明,将学得的人类注视(gaze)模型融入深度模仿学习会产生可喜的结果。

A Theory of State Abstraction for Reinforcement Learning
强化学习的状态抽象理论
David Abel@Brown University
大卫·阿贝尔@布朗大学
Doctoral Consortium Track Abstracts
博士联合会文摘
Reinforcement learning presents a challenging problem: agents must generalize experiences efficiently explore the world and learn from feedback that is delayed and often sparse all while making use of a limited computational budget. Abstraction is essential to all of these endeavors. Through abstraction agents can form concise models of both their surroundings and behavior supporting effective decision making in diverse and complex environments. To this end the goal of my doctoral research is to characterize the role abstraction plays in reinforcement learning with a focus on state abstraction. I offer three desiderata articulating what it means for a state abstraction to be useful and introduce classes of state abstractions that provide a partial path toward satisfying these desiderata. Collectively I develop theory for state abstractions that can 1) preserve near-optimal behavior 2) be learned and computed efficiently and 3) can lower the time or data needed to make effective decisions. I close by discussing extensions of these results to an information theoretic paradigm of abstraction and an extension to hierarchical abstraction that enjoys the same desirable properties.
强化学习提出了一个具有挑战性的问题:代理商必须有效地概括经验,探索世界,并在有限的计算预算下,从被延迟且经常稀疏的反馈中学习。对于所有这些努力,抽象都是必不可少的。通过抽象,代理可以形成其周围环境和行为的简洁模型,从而支持在复杂多样的环境中进行有效的决策。为此,我的博士研究的目标是表征抽象在强化学习中的作用,并着重于状态抽象。我提供了三个desiderata,阐明了状态抽象有用的含义,并介绍了状态抽象类,这些类提供了满足这些desiderata的部分路径。我集体开发了状态抽象的理论,该理论可以:1)保持接近最优的行为2)被有效学习和计算,以及3)可以减少做出有效决策所需的时间或数据。最后,我将讨论这些结果的扩展到抽象的信息理论范式,以及对具有相同期望属性的分层抽象的扩展。

QUOTA: The Quantile Option Architecture for Reinforcement Learning
QUOTA:用于强化学习的分位数选项体系结构
Shangtong Zhang@University of AlbertaHengshuai Yao@Huawei Technologies
张尚通@阿尔伯塔大学姚恒帅@华为技术
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In this paper we propose the Quantile Option Architecture (QUOTA) for exploration based on recent advances in distributional reinforcement learning (RL). In QUOTA decision making is based on quantiles of a value distribution not only the mean. QUOTA provides a new dimension for exploration via making use of both optimism and pessimism of a value distribution. We demonstrate the performance advantage of QUOTA in both challenging video games and physical robot simulators.
在本文中,我们基于分布强化学习(RL)的最新进展,提出了分位数期权体系结构(QUOTA)进行探索。在QUOTA中,决策不仅基于均值,而且还基于值分布的分位数。 QUOTA通过利用价值分配的乐观和悲观主义为探索提供了新的维度。我们在具有挑战性的视频游戏和物理机器人模拟器中展示了QUOTA的性能优势。
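To make "decision making based on quantiles of a value distribution" concrete, the snippet below contrasts mean-based, optimistic (high-quantile), and pessimistic (low-quantile) action selection given per-action quantile estimates; the numbers are invented, and the option mechanism QUOTA learns on top of such choices is not reproduced.

```python
import numpy as np

# Assumed quantile estimates (rows: actions, columns: quantile levels 0.1 ... 0.9).
quantiles = np.array([
    [ 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],   # action 0: modest return, low spread
    [-3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 9.0],  # action 1: risky, heavy upper tail
])

mean_action        = quantiles.mean(axis=1).argmax()         # expected-value (standard) choice
optimistic_action  = quantiles[:, -2:].mean(axis=1).argmax() # act on high quantiles -> chase upside
pessimistic_action = quantiles[:, :2].mean(axis=1).argmax()  # act on low quantiles -> avoid downside

print("mean:", mean_action, "optimistic:", optimistic_action, "pessimistic:", pessimistic_action)
```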

Surveys without Questions: A Reinforcement Learning Approach
无问题的调查:强化学习方法
Atanu R Sinha@Adobe ResearchDeepali Jain@GoogleNikhil Sheoran@Adobe ResearchSopan Khosla@Adobe ResearchReshmi Sasidharan@Adobe Research
Atanu R Sinha@Adobe ResearchDeepali Jain@GoogleNikhil Sheoran@Adobe ResearchSopan Khosla@Adobe ResearchReshmi Sasidharan@Adobe Research
AAAI Technical Track: AI and the Web
AAAI技术专题:AI和Web
The ‘old world’ instrument survey remains a tool of choice for firms to obtain ratings of satisfaction and experience that customers realize while interacting online with firms. While avenues for survey have evolved from emails and links to pop-ups while browsing the deficiencies persist. These include - reliance on ratings of very few respondents to infer about all customers’ online interactions; failing to capture a customer’s interactions over time since the rating is a one-time snapshot; and inability to tie back customers’ ratings to specific interactions because ratings provided relate to all interactions. To overcome these deficiencies we extract proxy ratings from clickstream data typically collected for every customer’s online interactions by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret values generated by the value function of RL as proxy ratings. Our approach does not need any survey data for training. Yet on validation against actual survey data proxy ratings yield reasonable performance results. Additionally we offer a new way to draw insights from values of the value function which allow associating specific interactions to their proxy ratings. We introduce two new metrics to represent ratings - one customer-level and the other aggregate-level for click actions across customers. Both are defined around proportion of all pairwise successive actions that show increase in proxy ratings. This intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from survey. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum proxy ratings computed unobtrusively from clickstream for every action for each customer and for every session can offer interpretable and more insightful alternative to surveys.
作为"旧世界"工具的问卷调查,仍然是企业获取客户在线互动中满意度和体验评分的一种常用手段。尽管调查的途径已从电子邮件和链接演变为浏览时的弹出窗口,但其缺陷依然存在,包括:依赖极少数受访者的评分来推断所有客户的在线互动;由于评分是一次性快照,无法捕捉客户随时间推移的互动;以及由于所给评分针对的是全部互动,无法将客户评分关联到具体的某次互动。为了克服这些缺陷,我们开发了一种基于强化学习(RL)的方法,从通常为每个客户的在线互动收集的点击流数据中提取代理评分。我们引入了一种新的方式,将RL价值函数产生的值解释为代理评分。我们的方法不需要任何调查数据来训练,但在与实际调查数据的对照验证中,代理评分仍取得了合理的效果。此外,我们提供了一种从价值函数的取值中获得洞见的新方式,可以将具体互动与其代理评分关联起来。我们引入了两个表示评分的新指标:一个是针对跨客户点击行为的客户级指标,另一个是汇总级指标,二者都围绕代理评分上升的成对相邻动作所占比例来定义。这一直观的客户级指标能够衡量评分随时间的动态变化,并且比问卷得到的客户评分更能预测购买行为。汇总级指标则可以精确定位有助于或损害体验的操作。总之,从点击流中为每个客户、每次会话的每个动作不加干扰地计算出的代理评分,可以提供比问卷调查更具可解释性、更有洞察力的替代方案。

MaMiC: Macro and Micro Curriculum for Robotic Reinforcement Learning
MaMiC:机器人强化学习的宏观和微观课程
Manan Tomar@Indian Institute of Technology MadrasAkhil Sathuluri@Indian Institute of Technology MadrasBalaraman Ravindran@Indian Institute of Technology Madras
Manan Tomar @印度理工学院MadrasAkhil Sathuluri @印度理工学院Madras Balaraman Ravindran @印度理工学院Madras
Student Abstracts
学生文摘
Generating a curriculum for guided learning involves subjecting the agent to easier goals first and then gradually increasing their difficulty. This work takes a similar direction and proposes a dual curriculum scheme for solving robotic manipulation tasks with sparse rewards called MaMiC. It includes a macro curriculum scheme which divides the task into multiple subtasks followed by a micro curriculum scheme which enables the agent to learn between such discovered subtasks. We show how combining macro and micro curriculum strategies help in overcoming major exploratory constraints considered in robot manipulation tasks without having to engineer any complex rewards and also illustrate the meaning and usage of the individual curricula. The performance of such a scheme is analysed on the Fetch environments.
生成用于指导学习的课程包括先使代理人达到较容易的目标,然后逐渐增加其难度。这项工作的方向相似,并提出了一种双重课程计划,称为MaMiC,用于解决具有稀疏奖励的机器人操纵任务。它包括一个宏课程计划,该计划将任务分为多个子任务,然后是一个微型课程计划,使代理能够在这些发现的子任务之间学习。我们将展示宏观和微观课程策略的组合如何帮助克服机器人操作任务中考虑的主要探索性约束,而无需设计任何复杂的奖励,并说明单个课程的含义和用法。在Fetch环境中分析了这种方案的性能。

Trainable Undersampling for Class-Imbalance Learning
用于类别不平衡学习的可训练欠采样
Minlong Peng@Fudan UniversityQi Zhang@Fudan UniversityXiaoyu Xing@Fudan UniversityTao Gui@Fudan UniversityXuanjing Huang@Fudan UniversityYu-Gang Jiang@Fudan UniversityKeyu Ding@iFLYTEK Co. LtdZhigang Chen@iFLYTEK Co. Ltd
彭敏龙@复旦大学张琦@复旦大学邢小玉@复旦大学陶贵@复旦大学黄萱菁@复旦大学姜育刚@复旦大学Keyu Ding@科大讯飞陈志刚@科大讯飞
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Undersampling has been widely used in the class-imbalance learning area. The main deficiency of most existing undersampling methods is that their data sampling strategies are heuristic-based and independent of the used classifier and evaluation metric. Thus they may discard informative instances for the classifier during the data sampling. In this work we propose a meta-learning method built on the undersampling to address this issue. The key idea of this method is to parametrize the data sampler and train it to optimize the classification performance over the evaluation metric. We solve the non-differentiable optimization problem for training the data sampler via reinforcement learning. By incorporating evaluation metric optimization into the data sampling process the proposed method can learn which instance should be discarded for the given classifier and evaluation metric. In addition as a data level operation this method can be easily applied to arbitrary evaluation metric and classifier including non-parametric ones (e.g. C4.5 and KNN). Experimental results on both synthetic and realistic datasets demonstrate the effectiveness of the proposed method.
欠采样已广泛应用于班级不平衡学习领域。大多数现有欠采样方法的主要缺陷是它们的数据采样策略是基于启发式的,并且与所使用的分类器和评估指标无关。因此,它们可以在数据采样期间为分类器丢弃信息量丰富的实例。在这项工作中,我们提出了一种基于欠采样的元学习方法来解决此问题。该方法的关键思想是参数化数据采样器并对其进行训练,以优化评估指标上的分类性能。我们通过强化学习解决了用于训练数据采样器的不可微优化问题。通过将评估指标优化合并到数据采样过程中,所提出的方法可以了解对于给定的分类器和评估指标应丢弃哪个实例。另外,作为数据级别的操作,此方法可以轻松地应用于任意评估指标和分类器(包括非参数指标)(例如C4.5和KNN)。综合和实际数据集上的实验结果证明了该方法的有效性。

Natural Option Critic
自然选择评论家
Saket Tiwari@University of Massachusetts AmherstPhilip S. Thomas@University of Massachusetts Amherst
Saket Tiwari @麻省大学阿默斯特分校Philip S. Thomas @麻省大学阿默斯特分校
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
The recently proposed option-critic architecture (Bacon, Harb, and Precup 2017) provides a stochastic policy gradient approach to hierarchical reinforcement learning. Specifically, it provides a way to estimate the gradient of the expected discounted return with respect to parameters that define a finite number of temporally extended actions called options. In this paper, we show how the option-critic architecture can be extended to estimate the natural gradient (Amari 1998) of the expected discounted return. To this end, the central questions that we consider in this paper are: 1) what is the definition of the natural gradient in this context, 2) what is the Fisher information matrix associated with an option's parameterized policy, 3) what is the Fisher information matrix associated with an option's parameterized termination function, and 4) how can a compatible function approximation approach be leveraged to obtain natural gradient estimates for both the parameterized policy and parameterized termination functions of an option, with per-time-step time and space complexity linear in the total number of parameters. Based on answers to these questions, we introduce the natural option critic algorithm. Experimental results showcase improvement over the vanilla gradient approach.
最近提出的option-critic架构(Bacon, Harb, and Precup 2017)为分层强化学习提供了一种随机策略梯度方法。具体而言,它提供了一种方法,用于估计期望折扣回报关于定义有限个称为option的时间扩展动作的参数的梯度。在本文中,我们展示了如何扩展option-critic架构以估计期望折扣回报的自然梯度(Amari 1998)。为此,我们在本文中考虑的核心问题是:1)在此情境下自然梯度的定义是什么;2)与option的参数化策略相关联的Fisher信息矩阵是什么;3)与option的参数化终止函数相关联的Fisher信息矩阵是什么;4)如何利用相容函数逼近方法,为option的参数化策略和参数化终止函数同时获得自然梯度估计,且每时间步的时间和空间复杂度关于参数总数是线性的。基于这些问题的答案,我们提出了自然option-critic算法。实验结果表明其相对于普通梯度方法有所改进。
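For reference, the natural gradient and the Fisher information matrix the abstract refers to take the standard form below (generic policy-gradient notation; the option-specific policies and termination functions from the paper are not written out):

```latex
F(\theta) \;=\; \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right],
\qquad
\tilde{\nabla}_\theta J(\theta) \;=\; F(\theta)^{-1} \nabla_\theta J(\theta).
```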

Reinforcement Learning under Threats
威胁下的强化学习
Victor Gallego@Instituto de Ciencias MatemáticasRoi Naveiro@Instituto de Ciencias MatemáticasDavid Rios Insua@Instituto de Ciencias Matemáticas
维克多·加勒戈(Victor Gallego)@西恩西亚斯·马特马提卡斯大学(Roi Naveiro)@西恩西亚斯·马特马提卡斯大学(David Rios Insua)@西恩西亚斯·马特马提卡斯大学
Student Abstracts
学生文摘
In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward generating process. However, when non-stationary environments as such are considered, Q-learning leads to suboptimal results (Busoniu, Babuska, and De Schutter 2010). Previous game-theoretical approaches to this problem have focused on modeling the whole multi-agent system as a game. Instead, we shall face the problem of prescribing decisions to a single agent (the supported decision maker, DM) against a potential threat model (the adversary). We augment the MDP to account for this threat, introducing Threatened Markov Decision Processes (TMDPs). Furthermore, we propose a level-k thinking scheme resulting in a new learning framework to deal with TMDPs. We empirically test our framework, showing the benefits of opponent modeling.
在主要在安全性设置中的几种强化学习(RL)场景中,可能有对手试图干扰奖励生成过程。然而,当这样的非平稳环境被考虑时,Q学习导致次优的结果(Busoniu Babuska和De Schutter 2010)。以前针对该问题的博弈论方法集中于将整个多代理系统建模为一个游戏。相反,我们将面临针对潜在威胁模型(对手)向单个代理(受支持的决策者DM)规定决策的问题。我们扩大了MDP,以解决引入威胁马尔可夫决策过程(TMDP)的这种威胁。此外,我们提出了一个k级思维方案,从而产生了一个新的学习框架来应对TMDP。我们通过经验测试我们的框架,以显示对手建模的好处。

Goal-Oriented Dialogue Policy Learning from Failures
从失败中学习目标导向的对话政策
Keting Lu@University of Science and Technology of ChinaShiqi Zhang@State University of New York BinghamtonXiaoping Chen@University of Science and Technology of China
卢克婷@中国科学技术大学张世琦@纽约州立大学宾汉顿分校陈小平@中国科学技术大学
AAAI Technical Track: Humans and AI
AAAI技术专题:人类与人工智能
Reinforcement learning methods have been used for learning dialogue policies. However learning an effective dialogue policy frequently requires prohibitively many conversations. This is partly because of the sparse rewards in dialogues and the very few successful dialogues in early learning phase. Hindsight experience replay (HER) enables learning from failures but the vanilla HER is inapplicable to dialogue learning due to the implicit goals. In this work we develop two complex HER methods providing different tradeoffs between complexity and performance and for the first time enabled HER-based dialogue policy learning. Experiments using a realistic user simulator show that our HER methods perform better than existing experience replay methods (as applied to deep Q-networks) in learning rate.
强化学习方法已用于学习对话策略。然而,学习有效的对话政策经常需要进行过多的对话。部分原因是对话中的奖励稀少,而早期学习阶段很少有成功的对话。后视经验重播(HER)可以从失败中学习,但是由于隐含的目标,香草HER不适用于对话学习。在这项工作中,我们开发了两种复杂的HER方法,在复杂性和性能之间提供了不同的权衡,并且首次启用了基于HER的对话策略学习。使用现实的用户模拟器进行的实验表明,我们的HER方法在学习率方面比现有的体验重播方法(应用于深度Q网络)表现更好。
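As background for the hindsight mechanism discussed above, here is a minimal goal-relabeling sketch in the spirit of vanilla HER, assuming explicit goals and a simple binary reward; recovering the implicit goals of a dialogue, which is the hard part the paper addresses, is not shown. All field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: str
    action: str
    goal: str           # the goal the agent was pursuing
    achieved: str       # what had actually been achieved at this point
    reward: float

def hindsight_relabel(episode: List[Transition]) -> List[Transition]:
    """Vanilla HER 'final' strategy: pretend the achieved outcome was the goal all along,
    so even a failed episode yields positive learning signal."""
    achieved_goal = episode[-1].achieved
    relabeled = []
    for t in episode:
        reward = 1.0 if t.achieved == achieved_goal else 0.0
        relabeled.append(Transition(t.state, t.action, achieved_goal, t.achieved, reward))
    return relabeled

if __name__ == "__main__":
    failed_episode = [
        Transition("greet", "ask_cuisine", goal="book_flight", achieved="none", reward=0.0),
        Transition("ask", "book_table", goal="book_flight", achieved="booked_table", reward=0.0),
    ]
    for t in hindsight_relabel(failed_episode):
        print(t)
```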

Neural Machine Translation with Adequacy-Oriented Learning
具有适当性学习的神经机器翻译
Xiang Kong@Carnegie Mellon UniversityZhaopeng Tu@Tencent AI LabShuming Shi@Tencent AI LabEduard Hovy@Carnegie Mellon UniversityTong Zhang@Tencent AI Lab
Xiang Kong@卡内基梅隆大学涂兆鹏@腾讯AI Lab史树明@腾讯AI LabEduard Hovy@卡内基梅隆大学张潼@腾讯AI Lab
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Although Neural Machine Translation (NMT) models have advanced the state-of-the-art performance in machine translation, they face problems like inadequate translation. We attribute this to the fact that the standard Maximum Likelihood Estimation (MLE) cannot judge the real translation quality due to several limitations. In this work we propose an adequacy-oriented learning mechanism for NMT by casting translation as a stochastic policy in Reinforcement Learning (RL), where the reward is estimated by explicitly measuring translation adequacy. Benefiting from the sequence-level training of the RL strategy and a more accurate reward designed specifically for translation, our model outperforms multiple strong baselines, including (1) standard and coverage-augmented attention models with MLE-based training and (2) advanced reinforcement and adversarial training strategies with rewards based on both word-level BLEU and character-level CHRF3. Quantitative and qualitative analyses on different language pairs and NMT architectures demonstrate the effectiveness and universality of the proposed approach.
尽管神经机器翻译(NMT)模型在机器翻译中达到了最先进的性能,但它们仍面临翻译不充分等问题。我们将其归因于:标准的最大似然估计(MLE)由于自身的若干局限,无法判断真实的翻译质量。在这项工作中,我们将翻译视为强化学习(RL)中的一种随机策略,并通过显式度量翻译充分性来估计奖励,从而为NMT提出了一种面向充分性的学习机制。得益于RL策略的序列级训练以及专为翻译设计的更准确的奖励,我们的模型优于多个强基线,包括:(1)采用MLE训练的标准注意力模型和覆盖增强注意力模型;(2)基于词级BLEU和字符级CHRF3奖励的先进强化训练与对抗训练策略。在不同语言对和NMT结构上的定量与定性分析证明了所提方法的有效性和普适性。
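A small sketch of the generic ingredient the abstract relies on, sequence-level policy-gradient (REINFORCE) training with a sentence-level reward and a baseline; PyTorch is assumed here, and this is not the paper's exact adequacy reward or training recipe.

```python
import torch

def sequence_level_rl_loss(log_probs, rewards, baseline):
    """REINFORCE-style loss for sequence generation with a sequence-level reward
    (e.g. an adequacy or BLEU-like score). log_probs: (batch, T) token log-probs
    of the sampled translations; rewards, baseline: (batch,) scalars."""
    advantage = (rewards - baseline).detach()            # do not backprop through the reward
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()

# Toy usage with fake data (shapes only; a real NMT model would supply the log-probs).
logits = torch.randn(4, 7, 100, requires_grad=True)      # (batch, time, vocab)
tokens = torch.randint(0, 100, (4, 7))                   # pretend these were sampled
log_probs = torch.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
loss = sequence_level_rl_loss(log_probs, torch.rand(4), torch.rand(4))
loss.backward()
```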

Combo-Action: Training Agent For FPS Game with Auxiliary Tasks
组合动作:带有辅助任务的FPS游戏培训代理
Shiyu Huang@Tsinghua UniversityHang Su@Tsinghua UniviersityJun Zhu@Tsinghua UniversityTing Chen@Tsinghua University
黄诗瑜@清华大学杭苏@清华大学朱军@清华大学陈婷@清华大学
AAAI Technical Track: Applications
AAAI技术专题:应用
Deep reinforcement learning (DRL) has achieved surpassing human performance on Atari games using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher levels of human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper we explore a novel method which can plan on temporally-extended action sequences, which we refer to as Combo-Action, to compress the action space. We further train a deep recurrent Q-learning network model as a high-level controller, called the supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction) which enable the agent to extract high-level concepts in the FPS games. Extensive experiments show that our method is efficient in the training process and outperforms previous state-of-the-art approaches by a large margin. Ablation study experiments also indicate that our method can boost the performance of the FPS agent in a reasonable way.
深度强化学习(DRL)仅使用原始像素和奖励进行学习,已在Atari游戏中取得了超越人类的表现。但是,3D环境中的第一人称射击(FPS)游戏包含更高层次的人类概念(敌人、武器、空间结构等)和巨大的动作空间。在本文中,我们探索了一种可以在时间扩展动作序列上进行规划的新方法,我们将其称为组合动作(Combo-Action),用于压缩动作空间。我们进一步训练了一个深度循环Q学习网络模型,作为称为监督网络的高层控制器来管理组合动作。我们的方法可以通过辅助任务(敌人检测和深度预测)来增强,这些任务使智能体能够提取FPS游戏中的高层概念。大量实验表明,我们的方法训练过程高效,并大幅优于以往的最新方法。消融实验还表明,我们的方法能够以合理的方式提升FPS智能体的性能。

SDRL: Interpretable and Data-Efficient Deep Reinforcement Learning Leveraging Symbolic Planning
SDRL:利用符号规划的可解释且数据有效的深度强化学习
Daoming Lyu@Auburn UniversityFangkai Yang@Maana Inc.Bo Liu@Auburn UniversitySteven Gustafson@Maana Inc.
Daoming Lyu@奥本大学Fangkai Yang@Maana Inc.刘波@奥本大学Steven Gustafson@Maana Inc.
AAAI Technical Track: Knowledge Representation and Reasoning
AAAI技术专场:知识表示与推理
Deep reinforcement learning (DRL) has gained great success by learning directly from high-dimensional sensory inputs yet is notorious for the lack of interpretability. Interpretability of the subtasks is critical in hierarchical decision-making as it increases the transparency of black-box-style DRL approach and helps the RL practitioners to understand the high-level behavior of the system better. In this paper we introduce symbolic planning into DRL and propose a framework of Symbolic Deep Reinforcement Learning (SDRL) that can handle both high-dimensional sensory inputs and symbolic planning. The task-level interpretability is enabled by relating symbolic actions to options.This framework features a planner – controller – meta-controller architecture which takes charge of subtask scheduling data-driven subtask learning and subtask evaluation respectively. The three components cross-fertilize each other and eventually converge to an optimal symbolic plan along with the learned subtasks bringing together the advantages of long-term planning capability with symbolic knowledge and end-to-end reinforcement learning directly from a high-dimensional sensory input. Experimental results validate the interpretability of subtasks along with improved data efficiency compared with state-of-the-art approaches.
深度强化学习(DRL)通过直接从高维感官输入中学习而获得了巨大的成功,但由于缺乏可解释性而臭名昭著。子任务的可解释性在分层决策中至关重要,因为它增加了黑匣子式DRL方法的透明度,并有助于RL练习者更好地理解系统的高级行为。在本文中,我们将符号规划引入DRL,并提出了一个符号深度强化学习(SDRL)框架,该框架可以处理高维感官输入和符号规划。通过将符号动作与选项相关联,可以实现任务级的可解释性。此框架具有计划程序-控制器-元控制器体系结构,该体系结构分别负责子任务调度,数据驱动的子任务学习和子任务评估。这三个组成部分互为因果,最终与所学习的子任务融合为最佳的符号计划,将长期规划能力与符号知识的优势融合在一起,并直接从高维感官输入端到端进行强化学习。与最新方法相比,实验结果验证了子任务的可解释性以及改进的数据效率。

Deliberate Attention Networks for Image Captioning
用于图像字幕的深思注意力网络
Lianli Gao@University of Electronic Science and Technology of ChinaKaixuan Fan@University of Electronic Science and Technology of ChinaJingkuan Song@University of Electronic Science and Technology of ChinaXianglong Liu@Beihang UniversityXing Xu@University of Electronic Science and Technology of ChinaHeng Tao Shen@University of Electronic Science and Technology of China
高联丽@电子科技大学Kaixuan Fan@电子科技大学宋井宽@电子科技大学刘祥龙@北京航空航天大学徐行@电子科技大学申恒涛@电子科技大学
AAAI Technical Track: Vision
AAAI技术轨道:愿景
In daily life, deliberation is a common behavior for humans to improve or refine their work (e.g. writing, reading and drawing). To date, the encoder-decoder framework with attention mechanisms has achieved great progress for image captioning. However, such a framework is in essence a one-pass forward process while encoding to hidden states and attending to visual features, and lacks the deliberation action. The learned hidden states and visual attention are directly used to predict the final captions without further polishing. In this paper we present a novel Deliberate Residual Attention Network, namely DA, for image captioning. The first-pass residual-based attention layer prepares the hidden states and visual attention for generating a preliminary version of the captions, while the second-pass deliberate residual-based attention layer refines them. Since the second pass is based on the rough global features captured by the hidden layer and visual attention in the first pass, our DA has the potential to generate better sentences. We further equip our DA with discriminative loss and reinforcement learning to disambiguate image/caption pairs and reduce exposure bias. Our model improves the state-of-the-art on the MSCOCO dataset and reaches 37.5% BLEU-4, 28.5% METEOR and 125.6% CIDEr. It also outperforms the state-of-the-art from 25.1% BLEU-4, 20.4% METEOR and 53.1% CIDEr to 29.4% BLEU-4, 23.0% METEOR and 66.6% CIDEr on the Flickr30K dataset.
在日常生活中,深思熟虑是人类改进或完善其工作(例如写作、阅读和绘画)的常见行为。迄今为止,带注意力机制的编码器-解码器框架在图像字幕方面取得了长足进展。然而,这种框架本质上是一个单遍前向过程:在编码为隐藏状态并关注视觉特征的同时,缺乏深思环节。学习到的隐藏状态和视觉注意力被直接用于预测最终字幕,而没有进一步打磨。在本文中,我们提出了一种新颖的深思残差注意力网络(DA)用于图像字幕。第一遍基于残差的注意力层准备隐藏状态和视觉注意力,以生成字幕的初步版本,而第二遍深思式残差注意力层对其进行细化。由于第二遍基于第一遍中隐藏层和视觉注意力捕获的粗略全局特征,我们的DA有潜力生成更好的句子。我们进一步为DA配备判别损失和强化学习,以消除图像/字幕对的歧义并减少曝光偏差。我们的模型在MSCOCO数据集上刷新了最新水平,达到37.5% BLEU-4、28.5% METEOR和125.6% CIDEr。在Flickr30K数据集上,它也将最新结果从25.1% BLEU-4、20.4% METEOR和53.1% CIDEr提升到29.4% BLEU-4、23.0% METEOR和66.6% CIDEr。

Querying NoSQL with Deep Learning to Answer Natural Language Questions
通过深度学习查询NoSQL以回答自然语言问题
Sebastian Blank@inovex GmbHFlorian Wilhelm@inovex GmbHHans-Peter Zorn@inovex GmbHAchim Rettinger@Karlsruhe Institute of Technology
塞巴斯蒂安·布兰克(Sebastian Blank)@ inovex GmbH弗洛里安·威廉(Florian Wilhelm)@ inovex GmbH汉斯·彼得·佐恩(Hans-Peter Zorn)@ inovex GmbHAchim Rettinger @卡尔斯鲁厄理工学院
IAAI Technical Papers: Emerging Papers
IAAI技术论文:新兴论文
Almost all of today’s knowledge is stored in databases and thus can only be accessed with the help of domain specific query languages strongly limiting the number of people which can access the data. In this work we demonstrate an end-to-end trainable question answering (QA) system that allows a user to query an external NoSQL database by using natural language. A major challenge of such a system is the non-differentiability of database operations which we overcome by applying policy-based reinforcement learning. We evaluate our approach on Facebook’s bAbI Movie Dialog dataset and achieve a competitive score of 84.2% compared to several benchmark models. We conclude that our approach excels with regard to real-world scenarios where knowledge resides in external databases and intermediate labels are too costly to gather for non-end-to-end trainable QA systems.
如今几乎所有知识都存储在数据库中,因此只能在特定于域的查询语言的帮助下进行访问,从而严重限制了可以访问数据的人数。在这项工作中,我们演示了一个端到端的可训练问题解答(QA)系统,该系统允许用户使用自然语言来查询外部NoSQL数据库。这种系统的主要挑战是数据库操作的不可区分性,我们通过应用基于策略的强化学习克服了这一难题。我们在Facebook的bAbI电影对话数据集上评估了我们的方法,与几种基准模型相比,我们获得了84.2%的竞争得分。我们得出的结论是,对于知识驻留在外部数据库中且中间标签的成本太高而无法收集非端到端可培训QA系统的现实情况,我们的方法是出色的。

Composable Modular Reinforcement Learning
可组合模块化强化学习
Christopher Simpkins@Georgia Institute of TechnologyCharles Isbell@Georgia Institute of Technology
克里斯托弗·辛普金斯(Christopher Simpkins)@乔治亚理工学院查尔斯·伊斯贝尔@乔治亚理工学院
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Modular reinforcement learning (MRL) decomposes a monolithic multiple-goal problem into modules that solve a portion of the original problem. The modules’ action preferences are arbitrated to determine the action taken by the agent. Truly modular reinforcement learning would support not only decomposition into modules but composability of separately written modules in new modular reinforcement learning agents. However the performance of MRL agents that arbitrate module preferences using additive reward schemes degrades when the modules have incomparable reward scales. This performance degradation means that separately written modules cannot be composed in new modular reinforcement learning agents as-is – they may need to be modified to align their reward scales. We solve this problem with a Q-learningbased command arbitration algorithm and demonstrate that it does not exhibit the same performance degradation as existing approaches to MRL thereby supporting composability.
模块化强化学习(MRL)将一个整体的多目标问题分解为若干模块,每个模块解决原问题的一部分。通过对各模块的动作偏好进行仲裁,来决定智能体所采取的动作。真正模块化的强化学习不仅应支持把问题分解为模块,还应支持将单独编写的模块组合进新的模块化强化学习智能体中。然而,当各模块的奖励尺度不可比较时,使用加性奖励方案来仲裁模块偏好的MRL智能体的性能会下降。这种性能下降意味着单独编写的模块无法按原样组合进新的模块化强化学习智能体中,可能需要修改它们以对齐奖励尺度。我们使用一种基于Q学习的指挥权仲裁算法解决了这一问题,并证明它不会像现有MRL方法那样出现性能下降,从而支持可组合性。
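A simplified reading of Q-learning-based command arbitration: an arbitrator learns which module should command in each state and the chosen module's greedy action is executed, so values learned under different reward scales are never summed. The environment interface and the arbitrator's reward choice below are assumptions, not the authors' exact algorithm.

```python
import numpy as np

class CommandArbitrator:
    """Arbitrator that learns Q_arb(state, module) and delegates action choice."""

    def __init__(self, n_states, modules, alpha=0.1, gamma=0.95, eps=0.1):
        self.modules = modules                        # each module has .best_action(s) and .update(...)
        self.q_arb = np.zeros((n_states, len(modules)))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def select(self, s):
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.modules))
        return int(np.argmax(self.q_arb[s]))

    def step(self, s, env):
        m = self.select(s)
        a = self.modules[m].best_action(s)            # the commanding module picks the action
        s_next, rewards, done = env.step(a)           # rewards: one scalar per module (assumed API)
        for mod, r in zip(self.modules, rewards):     # every module learns from its own reward
            mod.update(s, a, r, s_next)
        arb_r = rewards[m]                            # arbitrator rewarded via the chosen module (a design choice)
        target = arb_r + self.gamma * self.q_arb[s_next].max() * (not done)
        self.q_arb[s, m] += self.alpha * (target - self.q_arb[s, m])
        return s_next, done
```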

Hybrid Reinforcement Learning with Expert State Sequences
具有专家状态序列的混合强化学习
Xiaoxiao Guo@IBM ResearchShiyu Chang@IBM ResearchMo Yu@IBM T. J. WatsonGerald Tesauro@IBM ResearchMurray Campbell@IBM Research
郭晓晓@ IBM Research Shiyu Chang @ IBM Research Mo Yu @ IBM T.J.WatsonGerald Tesauro @ IBM ResearchMurray Campbell @ IBM Research
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Existing imitation learning approaches often require that the complete demonstration data including sequences of actions and states are available. In this paper we consider a more realistic and difficult scenario where a reinforcement learning agent only has access to the state sequences of an expert while the expert actions are unobserved. We propose a novel tensor-based model to infer the unobserved actions of the expert state sequences. The policy of the agent is then optimized via a hybrid objective combining reinforcement learning and imitation learning. We evaluated our hybrid approach on an illustrative domain and Atari games. The empirical results show that (1) the agents are able to leverage state expert sequences to learn faster than pure reinforcement learning baselines (2) our tensor-based action inference model is advantageous compared to standard deep neural networks in inferring expert actions and (3) the hybrid policy optimization objective is robust against noise in expert state sequences.
现有的模仿学习方法通​​常要求提供包括动作和状态序列的完整演示数据。在本文中,我们考虑了一个更现实和更困难的场景,其中强化学习代理仅可以访问专家的状态序列,而无法观察到专家的行为。我们提出了一种新颖的基于张量的模型来推断专家状态序列的未观察到的动作。然后,通过结合强化学习和模仿学习的混合目标来优化代理的策略。我们在说明性领域和Atari游戏中评估了我们的混合方法。实验结果表明(1)代理能够利用状态专家序列比纯强化学习基线更快地学习(2)我们的基于张量的动作推理模型在推断专家动作方面优于标准深度神经网络,并且(3 )混合策略优化目标对专家状态序列中的噪声具有鲁棒性。

Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning
消息丢弃(Message-Dropout):一种多智能体深度强化学习的高效训练方法
Woojun Kim@Korea Advanced Institute of Science and Technology (KAIST)Myungsik Cho@Korea Advanced Institute of Science and Technology (KAIST)Youngchul Sung@Korea Advanced Institute of Science and Technology (KAIST)
Woojun Kim @韩国高等科学技术学院(KAIST)赵明三@韩国高等科学技术学院(KAIST)Youngchul Sung @韩国高等科学技术学院(KAIST)
AAAI Technical Track: Multiagent Systems
AAAI技术专题:多代理系统
In this paper we propose a new learning technique named message-dropout to improve the performance of multi-agent deep reinforcement learning under two application scenarios: 1) classical multi-agent reinforcement learning with direct message communication among agents and 2) centralized training with decentralized execution. In the first application scenario of multi-agent systems, in which direct message communication among agents is allowed, the message-dropout technique drops out the received messages from other agents in a block-wise manner with a certain probability in the training phase, and compensates for this effect by multiplying the weights of the dropped-out block units with a correction probability. The applied message-dropout technique effectively handles the increased input dimension in multi-agent reinforcement learning with communication and makes learning robust against communication errors in the execution phase. In the second application scenario of centralized training with decentralized execution, we particularly consider the application of the proposed message-dropout to Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which uses a centralized critic to train a decentralized actor for each agent. We evaluate the proposed message-dropout technique for several games, and numerical results show that the proposed message-dropout technique with a proper dropout rate improves the reinforcement learning performance significantly in terms of the training speed and the steady-state performance in the execution phase.
在本文中,我们提出了一种名为消息丢弃(message-dropout)的新学习技术,以在两种应用场景下提升多智能体深度强化学习的性能:1)智能体之间进行直接消息通信的经典多智能体强化学习;2)集中训练、分散执行。在允许智能体之间直接进行消息通信的第一种多智能体系统应用场景中,消息丢弃技术在训练阶段以一定概率按块丢弃从其他智能体接收到的消息,并通过将被丢弃块单元的权重乘以一个校正概率来补偿这种影响。所采用的消息丢弃技术有效地处理了带通信的多智能体强化学习中增大的输入维度,并使学习在执行阶段对通信错误具有鲁棒性。在集中训练、分散执行的第二种应用场景中,我们特别考虑将所提出的消息丢弃应用于多智能体深度确定性策略梯度(MADDPG),该方法使用集中式评论家为每个智能体训练分散的执行者。我们在若干游戏上评估了所提出的消息丢弃技术,数值结果表明,采用适当丢弃率的消息丢弃技术在训练速度和执行阶段的稳态性能方面均显著提升了强化学习性能。
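A minimal sketch of the block-wise message-dropout idea: each received message is dropped as a whole with some probability during training, with inverted-dropout rescaling standing in for the weight-correction described in the abstract; shapes and the scaling choice here are assumptions.

```python
import numpy as np

def message_dropout(own_obs, messages, drop_prob=0.5, training=True):
    """Block-wise dropout on received messages; the agent's own observation is kept.
    messages: list of per-agent message vectors."""
    messages = [np.asarray(m, dtype=float) for m in messages]
    if training:
        keep = np.random.rand(len(messages)) >= drop_prob           # one Bernoulli per sender (block-wise)
        messages = [m * k / (1.0 - drop_prob) for m, k in zip(messages, keep)]
    return np.concatenate([np.asarray(own_obs, dtype=float)] + messages)

# Example: two neighbours each sending a 3-dimensional message.
x = message_dropout(own_obs=[1.0, 2.0], messages=[[1, 1, 1], [2, 2, 2]], drop_prob=0.5)
```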

Trust Region Evolution Strategies
信任区域演化策略
Guoqing Liu@University of Science and Technology of ChinaLi Zhao@Microsoft ResearchFeidiao Yang@Chinese Academy of SciencesJiang Bian@Microsoft ResearchTao Qin@Microsoft Research AsiaNenghai Yu@University of Science and Technology of ChinaTie-Yan Liu@Microsoft
刘国庆@中国科学技术大学赵莉@微软研究院费迪iao @中国科学院江边@微软研究院陶琴@微软研究院亚洲分会于能海@中国科学技术大学刘铁岩@微软
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Evolution Strategies (ES), a class of black-box optimization algorithms, has recently been demonstrated to be a viable alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. ES achieves fairly good performance on challenging reinforcement learning problems and is easier to scale in a distributed setting. However, standard ES algorithms perform one gradient update per data sample, which is not very efficient. In this paper, with the purpose of making more efficient use of sampled data, we propose a novel iterative procedure that optimizes a surrogate objective function, enabling data samples to be reused for multiple epochs of updates. We prove a monotonic improvement guarantee for such a procedure. By making several approximations to the theoretically-justified procedure, we further develop a practical algorithm called Trust Region Evolution Strategies (TRES). Our experiments demonstrate the effectiveness of TRES on a range of popular MuJoCo locomotion tasks in the OpenAI Gym, achieving better performance than the ES algorithm.
进化策略(ES)一类黑箱优化算法最近被证明是一种可行的替代方法,可以替代基于流行的基于MDP的RL技术(例如Qlearning和Policy Gradients)。 ES在挑战性强化学习问题上取得了相当不错的成绩,并且在分布式环境中更易于扩展。然而,标准的ES算法对每个数据样本执行一次梯度更新,这不是很有效。在本文中,为了更有效地使用采样数据,我们提出了一种新颖的迭代过程,该过程优化了替代目标函数,从而能够将数据采样重用于多个更新周期。我们证明了这种程序的单调改进保证。通过对理论上合理的过程进行一些近似,我们进一步开发了一种实用的算法,称为信任区域演进策略(TRES)。我们的实验证明了TRES在OpenAI健身房中一系列流行的MuJoCo运动任务上的有效性,其性能优于ES算法。

Human-in-the-Loop Feature Selection
人在环特征选择
Alvaro H. C. Correia@Universidade de São PauloFreddy Lecue@INRIA
Alvaro H.C. Correia @圣保罗大学弗雷迪·莱库@ INRIA
AAAI Technical Track: Human-AI Collaboration
AAAI技术专栏:人与人工智能的协作
Feature selection is a crucial step in the conception of Machine Learning models, which is often performed via data-driven approaches that overlook the possibility of tapping into the human decision-making of the model's designers and users. We present a human-in-the-loop framework that interacts with domain experts by collecting their feedback regarding the variables (of few samples) they evaluate as the most relevant for the task at hand. Such information can be modeled via Reinforcement Learning to derive a per-example feature selection method that tries to minimize the model's loss function by focusing on the most pertinent variables from a human perspective. We report results on a proof-of-concept image classification dataset and on a real-world risk classification task in which the model successfully incorporated feedback from experts to improve its accuracy.
特征选择是机器学习模型概念中至关重要的一步,而机器学习模型通常是通过数据驱动的方法来执行的,而这种方法忽略了利用模型设计者和用户的人为决策的可能性。我们提供了一个与人联系的领域中的环环相扣的框架,可收集领域专家的反馈意见,这些反馈是他们认为与手头任务最相关的变量(少数样本)。可以通过强化学习对此类信息进行建模,以得出每个示例的特征选择方法,该方法通过从人的角度关注最相关的变量来尝试最小化模型的损失函数。我们在概念验证图像分类数据集和实际风险分类任务中报告结果,在该任务中,模型成功地结合了专家的反馈以提高其准确性。

Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation
用于局部连贯的视觉故事生成的分层结构强化学习
Qiuyuan Huang@Microsoft Research AIZhe Gan@MicrosoftAsli Celikyilmaz@Microsoft ResearchDapeng Wu@University of FloridaJianfeng Wang@Microsoft ResearchXiaodong He@JD AI Research
黄秋元@微软研究院AIZhe Gan @ MicrosoftAsli Celikyilmaz @微软研究院Dapeng Wu @佛罗里达大学王建峰@微软研究院Xiaodong He @ JD AI Research
AAAI Technical Track: Vision
AAAI技术轨道:愿景
We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e. topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results from both automatic and human evaluations demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a strong flat deep reinforcement learning baseline.
我们提出了一种层次结构的强化学习方法,以解决为视觉故事任务生成连贯的多句子故事的计划制定中的挑战。在我们的框架内,在给定图像序列的情况下生成故事的任务被划分为两级分层解码器。高级解码器通过为每个图像依次生成语义概念(即主题)来构造计划。低级解码器使用语义合成网络为每个图像生成句子,该语义合成网络有效地使以主题为条件的句子生成为基础。使用增强学习对两个解码器进行端到端联合培训。我们根据视觉故事(VIST)数据集评估模型。来自自动评估和人工评估的经验结果表明,与强大而平坦的深度强化学习基线相比,所提出的分层结构强化训练可显着提高性能。

Strategic Tasks for Explainable Reinforcement Learning
可解释性强化学习的战略任务
Rey Pocius@Oregon State UniversityLawrence Neal@Oregon State UniversityAlan Fern@Oregon State University
Rey Pocius @俄勒冈州立大学Lawrence Neal @俄勒冈州立大学艾伦·费尔恩@俄勒冈州立大学
Student Abstracts
学生文摘
Commonly used sequential decision making tasks such as the games in the Arcade Learning Environment (ALE) provide rich observation spaces suitable for deep reinforcement learning. However they consist mostly of low-level control tasks which are of limited use for the development of explainable artificial intelligence(XAI) due to the fine temporal resolution of the tasks. Many of these domains also lack built-in high level abstractions and symbols. Existing tasks that provide for both strategic decision-making and rich observation spaces are either difficult to simulate or are intractable. We provide a set of new strategic decision-making tasks specialized for the development and evaluation of explainable AI methods built as constrained mini-games within the StarCraft II Learning Environment.
诸如Arcade Learning Environment(ALE)中的游戏之类的常用顺序决策任务提供了适合深度强化学习的丰富观察空间。但是,它们主要由低级控制任务组成,由于这些任务的时间分辨率较高,因此在开发可解释的人工智能(XAI)时用途有限。这些域中的许多域还缺少内置的高级抽象和符号。提供战略决策和丰富观察空间的现有任务很难模拟或难以处理。我们提供了一系列新的战略决策任务,这些任务专门用于开发和评估可解释的AI方法,这些方法是在《星际争霸II》学习环境中构建的受约束的小型游戏。

Differentiated Distribution Recovery for Neural Text Generation
用于神经文本生成的差异化分布恢复
Jianing Li@Chinese Academy of SciencesYanyan Lan@Chinese Academy of SciencesJiafeng Guo@Chinese Academy of SciencesJun Xu@Chinese Academy of SciencesXueqi Cheng@Chinese Academy of Sciences
李健宁@中国科学院兰艳艳@中国科学院郭家峰@中国科学院徐俊@中国科学院郑学启@中国科学院
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Neural language models based on recurrent neural networks (RNNLM) have significantly improved the performance of text generation, yet the quality of generated text, as measured by the Turing Test pass rate, is still far from satisfying. Some researchers propose to use adversarial training or reinforcement learning to promote the quality; however, such methods usually introduce great challenges in the training and parameter tuning processes. Through our analysis, we find the problem of RNNLM comes from the usage of maximum likelihood estimation (MLE) as the objective function, which requires the generated distribution to precisely recover the true distribution. Such a requirement favors high generation diversity, which restricts the generation quality. This is not suitable when the overall quality is low, since high generation diversity usually indicates lots of errors rather than diverse good samples. In this paper we propose to achieve differentiated distribution recovery, DDR for short. The key idea is to make the optimal generation probability proportional to the β-th power of the true probability, where β > 1. In this way the generation quality can be greatly improved by sacrificing diversity from noises and rare patterns. Experiments on synthetic data and two public text datasets show that our DDR method achieves a more flexible quality-diversity trade-off and a higher Turing Test pass rate as compared with baseline methods, including RNNLM, SeqGAN and LeakGAN.
基于循环神经网络的神经语言模型(RNNLM)显著提升了文本生成的性能,但以图灵测试通过率衡量的生成文本质量仍远不能令人满意。一些研究者提出使用对抗训练或强化学习来提升质量,然而此类方法通常在训练和调参过程中带来巨大挑战。通过分析,我们发现RNNLM的问题来自将最大似然估计(MLE)用作目标函数,它要求生成分布精确地恢复真实分布。这一要求偏向于高生成多样性,从而限制了生成质量。当整体质量较低时这并不合适,因为高生成多样性通常意味着大量错误,而非多样的好样本。在本文中,我们提出实现差异化分布恢复(简称DDR)。其核心思想是使最优生成概率与真实概率的β次幂成正比,其中β>1。通过这种方式,可以以牺牲来自噪声和罕见模式的多样性为代价,大幅提升生成质量。在合成数据和两个公开文本数据集上的实验表明,与RNNLM、SeqGAN和LeakGAN等基线方法相比,我们的DDR方法实现了更灵活的质量-多样性权衡和更高的图灵测试通过率。
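The core DDR idea is easy to state concretely: the target generation distribution is proportional to the β-th power of the true distribution. A small numeric sketch (the toy probabilities are assumed):

```python
import numpy as np

def ddr_target(p_true, beta=2.0):
    """Target distribution proportional to p_true ** beta (beta > 1 sharpens it,
    trading diversity on rare patterns for quality on frequent ones)."""
    q = np.power(np.asarray(p_true, dtype=float), beta)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])      # a toy "true" next-token distribution
print(ddr_target(p, beta=1.0))            # beta = 1 recovers p (the MLE target)
print(ddr_target(p, beta=2.0))            # beta = 2 concentrates mass on frequent tokens
```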

Logic-Based Sequential Decision-Making
基于逻辑的顺序决策
Daoming Lyu@Auburn UniversityFangkai Yang@Maana Inc.Bo Liu@Auburn UniversityDaesub Yoon@Electronics and Telecommunications Research Institute
Daoming Lyu@奥本大学Fangkai Yang@Maana Inc.刘波@奥本大学Daesub Yoon@电子与电信研究院(ETRI)
Student Abstracts
学生文摘
Deep reinforcement learning (DRL) has gained great success by learning directly from high-dimensional sensory inputs yet is notorious for the lack of interpretability. Interpretability of the subtasks is critical in hierarchical decision-making as it increases the transparency of black-box-style DRL approach and helps the RL practitioners to understand the high-level behavior of the system better. In this paper we introduce symbolic planning into DRL and propose a framework of Symbolic Deep Reinforcement Learning (SDRL) that can handle both high-dimensional sensory inputs and symbolic planning. The task-level interpretability is enabled by relating symbolic actions to options. This framework features a planner – controller – meta-controller architecture which takes charge of subtask scheduling data-driven subtask learning and subtask evaluation respectively. The three components cross-fertilize each other and eventually converge to an optimal symbolic plan along with the learned subtasks bringing together the advantages of long-term planning capability with symbolic knowledge and end-to-end reinforcement learning directly from a high-dimensional sensory input. Experimental results validate the interpretability of subtasks along with improved data efficiency compared with state-of-the-art approaches.
深度强化学习(DRL)通过直接从高维感官输入中学习而获得了巨大的成功,但由于缺乏可解释性而臭名昭著。子任务的可解释性在分层决策中至关重要,因为它增加了黑匣子式DRL方法的透明度,并有助于RL练习者更好地理解系统的高级行为。在本文中,我们将符号规划引入DRL,并提出了一个符号深度强化学习(SDRL)框架,该框架可以处理高维感官输入和符号规划。通过将符号操作与选项相关联,可以实现任务级的可解释性。该框架具有计划者-控制器-元控制器体系结构,该体系结构分别负责子任务调度,数据驱动的子任务学习和子任务评估。这三个组成部分互为因果,最终与所学习的子任务融合为最佳的符号计划,将长期规划能力与符号知识的优势融合在一起,并直接从高维感官输入端到端进行强化学习。与最新方法相比,实验结果验证了子任务的可解释性以及改进的数据效率。

A Topic-Aware Reinforced Model for Weakly Supervised Stance Detection
用于弱监督立场检测的主题感知增强模型
Penghui Wei@Chinese Academy of SciencesWenji Mao@Chinese Academy of SciencesGuandan Chen@Chinese Academy of Sciences
魏鹏辉@中国科学院温文ji @中国科学院陈冠旦@中国科学院
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Analyzing public attitudes plays an important role in opinion mining systems. Stance detection aims to determine from a text whether its author is in favor of against or neutral towards a given target. One challenge of this task is that a text may not explicitly express an attitude towards the target but existing approaches utilize target content alone to build models. Moreover although weakly supervised approaches have been proposed to ease the burden of manually annotating largescale training data such approaches are confronted with noisy labeling problem. To address the above two issues in this paper we propose a Topic-Aware Reinforced Model (TARM) for weakly supervised stance detection. Our model consists of two complementary components: (1) a detection network that incorporates target-related topic information into representation learning for identifying stance effectively; (2) a policy network that learns to eliminate noisy instances from auto-labeled data based on off-policy reinforcement learning. Two networks are alternately optimized to improve each other’s performances. Experimental results demonstrate that our proposed model TARM outperforms the state-of-the-art approaches.
分析公众态度在观点挖掘系统中起着重要作用。立场检测旨在从文本中判断其作者对给定目标是支持、反对还是中立。这项任务的一个挑战在于,文本可能不会显式表达对目标的态度,而现有方法仅利用目标内容来构建模型。此外,尽管已有弱监督方法被提出以减轻人工标注大规模训练数据的负担,但这类方法面临标签噪声问题。为了解决上述两个问题,本文提出了一种用于弱监督立场检测的主题感知增强模型(TARM)。我们的模型由两个互补的组件构成:(1)一个检测网络,它将与目标相关的主题信息融入表示学习,以有效识别立场;(2)一个策略网络,它基于离策略强化学习,学习从自动标注数据中剔除噪声样本。两个网络交替优化,以相互提升性能。实验结果表明,我们提出的TARM模型优于最新方法。

How to Combine Tree-Search Methods in Reinforcement Learning
如何在强化学习中结合树型搜索方法
Yonathan Efroni@Technion – Israel Institute of TechnologyGal Dalal@Bruno Scherrer@French Institute for Research in Computer Science and AutomationShie Mannor@Technion – Israel Institute of Technology
Yonathan Efroni @ Technion-以色列理工学院Gal Dalal @ Bruno Scherrer @法国计算机科学与自动化研究所Shie Mannor @ Technion-以色列理工学院
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero (Silver et al. 2017b)). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves, while the information obtained at the root is not leveraged other than for updating the policy. Here we question the potency of this approach. Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a γ^h-contracting procedure, where γ is the discount factor and h is the tree depth. To establish our results we first introduce a notion called multiple-step greedy consistency. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected to both the tree search stage and the value estimation stage.
有限视野前瞻策略在强化学习中被大量使用,并展现了令人印象深刻的经验成功。通常,前瞻策略通过特定的规划方法实现,例如蒙特卡洛树搜索(如在AlphaZero中(Silver等,2017b))。将该规划问题视为树搜索,这些实现中的常见做法是仅在叶子节点处回传价值,而在根节点处获得的信息除了用于更新策略外并未被利用。在这里,我们质疑这种做法的效力:后一种过程一般不具有收缩性,其收敛性无法保证。我们提出的改进直接而简单:使用最优树路径上的回报来回传根节点后代的价值。这得到一个γ^h-收缩的过程,其中γ是折扣因子,h是树深度。为了建立我们的结果,我们首先引入一个称为多步贪婪一致性的概念。然后,在树搜索阶段和价值估计阶段都注入噪声的情况下,我们为上述改进的两种算法实例给出了收敛速率。
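A hedged sketch of the proposed backup, simplified to updating only the root: the discounted return along the optimal branch of a depth-h search plus a γ^h-discounted leaf value forms the target; the path encoding used below is an assumption.

```python
import numpy as np

def backup_from_tree_path(values, path, gamma=0.99, lr=0.5):
    """Update the root value toward the return of the optimal tree path.
    path: list of (state, reward) pairs along the optimal branch, where reward is
    received when leaving that state, plus a final (leaf_state, None) entry."""
    (root, _), *_middle, (leaf, _) = path
    h = len(path) - 1                                                 # tree depth actually traversed
    ret = sum(gamma ** t * r for t, (_, r) in enumerate(path[:-1]))   # discounted path rewards
    target = ret + gamma ** h * values[leaf]                          # gamma^h-contracting target
    values[root] += lr * (target - values[root])
    return values

values = np.zeros(4)
# Optimal branch found by some tree search: state 0 (r=1.0) -> state 1 (r=0.0) -> leaf state 3.
values = backup_from_tree_path(values, [(0, 1.0), (1, 0.0), (3, None)], gamma=0.9)
```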

Multi-Task Deep Reinforcement Learning with PopArt
PopArt的多任务深度强化学习
Matteo Hessel@DeepMindHubert Soyer@DeepMindLasse Espeholt@DeepMindWojciech Czarnecki@DeepMindSimon Schmitt@DeepMindHado van Hasselt@DeepMind
Matteo Hessel @ DeepMindHubert Soyer @ DeepMindLasse Espeholt @ DeepMindWojciech Czarnecki @ DeepMindSimon施密特@ DeepMindHado van Hasselt @ DeepMind
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
The reinforcement learning (RL) community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at the time each new task requiring to train a brand new agent instance. This means the learning algorithm is general but each solution is not; each agent can only solve the one task it was trained on. In this work we study the problem of learning to master not one but multiple sequentialdecision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent’s updates so that all tasks have a similar impact on the learning dynamics. This resulted in state of the art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly our method learned a single trained policy - with a single set of weights - that exceeds median human performance. To our knowledge this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state of the art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
强化学习(RL)社区在设计能够在特定任务上超越人类表现的算法方面取得了长足进步。在每个新任务都需要训练一个全新的代理实例时,这些算法大多被训练为一个任务。这意味着学习算法是通用的,但每种解决方案都不是。每个特工只能解决一个受过训练的任务。在这项工作中,我们研究了一次学会掌握多个顺序决策任务的问题。多任务学习中的一个普遍问题是,必须在竞争单个学习系统有限资源的多个任务的需求之间找到平衡。许多学习算法会因要解决的任务集中的某些任务而分神。例如,由于任务中奖励的密度或大小,此类任务对学习过程显得更加重要。这导致算法以普遍性为代价将重点放在那些显着的任务上。我们建议自动调整每个任务对代理更新的贡献,以使所有任务对学习动态产生相似的影响。这导致了在学习玩57种Atari游戏中的所有游戏方面的最先进表现。令人兴奋的是,我们的方法学会了一项经过训练的策略-具有一组权重-超出了人类平均表现。据我们所知,这是单个代理在此多任务域上首次超过人员级别的性能。在3D强化学习平台DeepMind Lab中,相同的方法还展示了一组30项任务的最新性能。
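The adaptive rescaling this multi-task work builds on is PopArt-style normalization: value targets are normalized with running statistics while the last linear layer is rescaled so that unnormalized outputs are preserved. A minimal sketch follows; the exponential-moving-average statistics are an assumption rather than the paper's exact update.

```python
import numpy as np

class PopArt:
    """Preserve Outputs Precisely while Adaptively Rescaling Targets (linear head sketch)."""

    def __init__(self, dim, step=1e-3):
        self.w, self.b = np.zeros(dim), 0.0            # normalized value head
        self.mu, self.nu, self.step = 0.0, 1.0, step   # running first/second moments of targets

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu += self.step * (target - self.mu)
        self.nu += self.step * (target ** 2 - self.nu)
        # Rescale the head so sigma * (w.x + b) + mu is unchanged by the new statistics.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalized_target(self, target):
        return (target - self.mu) / self.sigma         # train the head against this

    def value(self, features):
        return self.sigma * (self.w @ features + self.b) + self.mu  # unnormalized prediction
```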

Learning Vine Copula Models for Synthetic Data Generation
学习用于合成数据生成的藤蔓Copula模型
Yi Sun@Massachusetts Institute of TechnologyAlfredo Cuesta-Infante@Universidad Rey Juan CarlosKalyan Veeramachaneni@Massachusetts Institute of Technology
孙毅@麻省理工学院Alfredo Cuesta-Infante @ Rey Juan CarlosKalyan Veeramachaneni @麻省理工学院
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
A vine copula model is a flexible high-dimensional dependence model which uses only bivariate building blocks. However the number of possible configurations of a vine copula grows exponentially as the number of variables increases making model selection a major challenge in development. In this work we formulate a vine structure learning problem with both vector and reinforcement learning representation. We use neural network to find the embeddings for the best possible vine model and generate a structure. Throughout experiments on synthetic and real-world datasets we show that our proposed approach fits the data better in terms of loglikelihood. Moreover we demonstrate that the model is able to generate high-quality samples in a variety of applications making it a good candidate for synthetic data generation.
藤copula(vine copula)模型是一种灵活的高维相依模型,仅使用二元(双变量)构建模块。然而,随着变量数量的增加,藤copula可能的结构配置数量呈指数增长,使得模型选择成为建模中的主要挑战。在这项工作中,我们将藤结构学习问题同时用向量表示和强化学习表示来刻画。我们使用神经网络为最佳藤模型寻找嵌入并生成结构。在合成数据集和真实数据集上的实验表明,我们提出的方法在对数似然方面能更好地拟合数据。此外,我们证明该模型能够在多种应用中生成高质量样本,使其成为合成数据生成的良好候选方法。

Adversarial Actor-Critic Method for Task and Motion Planning Problems Using Planning Experience
使用计划经验的任务和动作计划问题的对抗性Actor-Critic方法
Beomjoon Kim@Massachusetts Institute of TechnologyLeslie Pack Kaelbling@Massachusetts Institute of TechnologyTomás Lozano-Pérez@Massachusetts Institute of Technology
Beomjoon Kim @麻省理工学院Leslie Pack Kaelbling @麻省理工学院TomásLozano-Pérez@麻省理工学院
AAAI Technical Track: Robotics
AAAI技术专栏:机器人技术
We propose an actor-critic algorithm that uses past planning experience to improve the efficiency of solving robot task-and-motion planning (TAMP) problems. TAMP planners search for goal-achieving sequences of high-level operator instances specified by both discrete and continuous parameters. Our algorithm learns a policy for selecting the continuous parameters during search, using a small training set generated from the search trees of previously solved instances. We also introduce a novel fixed-length vector representation for world states with varying numbers of objects with different shapes, based on a set of key robot configurations. We demonstrate experimentally that our method learns more efficiently from less data than standard reinforcement learning approaches and that using a learned policy to guide a planner results in the improvement of planning efficiency.
我们提出了一种演员批评算法,该算法利用过去的规划经验来提高解决机器人任务与动作规划(TAMP)问题的效率。 TAMP计划人员搜索由离散和连续参数指定的高级操作员实例的目标达成序列。我们的算法学习了一种策略,该策略使用从先前求解的实例的搜索树生成的小型训练集在搜索过程中选择连续参数。我们还将基于一组关键的机器人配置,针对具有不同数量,不同形状的对象的世界状态,推出一种新颖的定长矢量表示。我们通过实验证明,与标准的强化学习方法相比,我们的方法可从更少的数据中更有效地学习,并且使用学习的策略指导规划者可以提高规划效率。

Model-Free IRL Using Maximum Likelihood Estimation
使用最大似然估计的无模型IRL
Vinamra Jain@University of GeorgiaPrashant Doshi@University of GeorgiaBikramjit Banerjee@University of Southern Mississippi
Vinamra Jain @乔治亚大学Prashant Doshi @乔治亚大学Bikramjit Banerjee @南密西西比大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
The problem of learning an expert’s unknown reward function using a limited number of demonstrations recorded from the expert’s behavior is investigated in the area of inverse reinforcement learning (IRL). To gain traction in this challenging and underconstrained problem IRL methods predominantly represent the reward function of the expert as a linear combination of known features. Most of the existing IRL algorithms either assume the availability of a transition function or provide a complex and inefficient approach to learn it. In this paper we present a model-free approach to IRL which casts IRL in the maximum likelihood framework. We present modifications of the model-free Q-learning that replace its maximization to allow computing the gradient of the Q-function. We use gradient ascent to update the feature weights to maximize the likelihood of expert’s trajectories. We demonstrate on two problem domains that our approach improves the likelihood compared to previous methods.
在逆向强化学习(IRL)领域中研究了使用从专家的行为记录的有限数量的演示来学习专家的未知奖励函数的问题。为了在这个具有挑战性和约束不足的问题中获得牵引力,IRL方法主要将专家的奖励功能表示为已知特征的线性组合。大多数现有的IRL算法要么假定转换函数可用,要么提供一种复杂而效率低下的方法来学习它。在本文中,我们提出了一种无模型的IRL方法,该方法将IRL转换为最大似然框架。我们提出了无模型Q学习的修改,这些修改取代了它的最大化以允许计算Q函数的梯度。我们使用梯度上升来更新特征权重,以最大化专家轨迹的可能性。我们在两个问题域上证明,与以前的方法相比,我们的方法提高了可能性。
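A heavily simplified stand-in for the likelihood-based idea: assume a Boltzmann-rational expert with a reward linear in features and ascend the gradient of the demonstrations' log-likelihood. The paper instead differentiates through a soft Q-learning construction; the function names and toy data below are illustrative only.

```python
import numpy as np

def mle_irl_step(w, demos, phi, n_actions, lr=0.05, beta=1.0):
    """One gradient-ascent step on the expert log-likelihood under a softmax
    policy over w . phi(s, a). demos: iterable of (state, expert_action)."""
    grad = np.zeros_like(w)
    for s, a_expert in demos:
        feats = np.array([phi(s, a) for a in range(n_actions)])
        p = np.exp(beta * feats @ w)
        p /= p.sum()
        grad += beta * (feats[a_expert] - p @ feats)   # gradient of log softmax likelihood
    return w + lr * grad / max(len(demos), 1)

# Toy usage: 2 actions, 3-dimensional features that depend only on the action.
phi = lambda s, a: np.eye(3)[a % 3]
w = np.zeros(3)
for _ in range(50):
    w = mle_irl_step(w, demos=[(0, 1), (1, 1), (2, 0)], phi=phi, n_actions=2)
```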

A Deep Reinforcement Learning Framework for Rebalancing Dockless Bike Sharing Systems
深度强化学习框架,用于重新平衡无基座自行车共享系统
Ling Pan@Tsinghua UniversityQingpeng Cai@Tsinghua UniversityZhixuan Fang@The Chinese University of Hong KongPingzhong Tang@Tsinghua UniversityLongbo Huang@Tsinghua Univeristy
潘玲@清华大学蔡庆鹏@清华大学Zhixuan Fang@香港中文大学唐平中@清华大学黄隆波@清华大学
AAAI Technical Track: Computational Sustainability
AAAI技术专栏:计算可持续性
Bike sharing provides an environment-friendly way for traveling and is booming all over the world. Yet due to the high similarity of user travel patterns the bike imbalance problem constantly occurs especially for dockless bike sharing systems causing significant impact on service quality and company revenue. Thus it has become a critical task for bike sharing operators to resolve such imbalance efficiently. In this paper we propose a novel deep reinforcement learning framework for incentivizing users to rebalance such systems. We model the problem as a Markov decision process and take both spatial and temporal features into consideration. We develop a novel deep reinforcement learning algorithm called Hierarchical Reinforcement Pricing (HRP) which builds upon the Deep Deterministic Policy Gradient algorithm. Different from existing methods that often ignore spatial information and rely heavily on accurate prediction HRP captures both spatial and temporal dependencies using a divide-and-conquer structure with an embedded localized module. We conduct extensive experiments to evaluate HRP based on a dataset from Mobike a major Chinese dockless bike sharing company. Results show that HRP performs close to the 24-timeslot look-ahead optimization and outperforms state-of-the-art methods in both service level and bike distribution. It also transfers well when applied to unseen areas.
共享单车提供了一种环保的出行方式,并且在全世界范围内都在蓬勃发展。然而,由于用户出行方式的高度相似性,自行车不平衡问题经常发生,特别是对于无坞座自行车共享系统,这对服务质量和公司收入产生了重大影响。因此,有效地解决这种不平衡成为自行车共享操作者的关键任务。在本文中,我们提出了一种新颖的深度强化学习框架,以激励用户重新平衡此类系统。我们将问题建模为马尔可夫决策过程,并同时考虑了时空特征。我们开发了一种新的深度强化学习算法,称为“层次强化定价”(HRP),它基于“深度确定性策略梯度”算法。与通常忽略空间信息并严重依赖准确预测的现有方法不同,HRP使用带有嵌入式局部模块的分而治之结构捕获空间和时间相关性。我们进行了广泛的实验,以基于中国主要的非停靠自行车共享公司Mobike的数据集评估HRP。结果表明,HRP的性能接近24时隙前瞻性优化,并且在服务水平和自行车分配方面均优于最新方法。当应用于看不见的区域时,它也可以很好地转移。

Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games Using Baselines
使用基线对广泛形式游戏的蒙特卡洛反事实后悔最小化(VR-MCCFR)的方差减少
Martin Schmid@DeepMindNeil Burch@DeepMindMarc Lanctot@DeepMindMatej Moravcik@DeepMindRudolf Kadlec@Google DeepMindMichael Bowling@DeepMind
马丁·施密德@DeepMind尼尔·伯奇@DeepMind马克·兰克托@DeepMind马泰·莫拉夫奇克@DeepMind鲁道夫·卡德莱茨@Google DeepMind迈克尔·鲍林@DeepMind
AAAI Technical Track: Game Theory and Economic Paradigms
AAAI技术专题:博弈论与经济范式
Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order of magnitude speedup, while the empirical variance decreases by three orders of magnitude. The decreased variance allows for the first time CFR+ to be used with sampling, increasing the speedup to two orders of magnitude.
从交互样本中学习不完全信息博弈的策略是一个具有挑战性的问题。该设定下的常用方法蒙特卡洛反事实遗憾最小化(MCCFR)由于方差较大,长期收敛速度可能较慢。在本文中,我们提出了一种适用于MCCFR任意采样变体的方差缩减技术(VR-MCCFR)。利用这种技术,每次迭代的估计值和更新被重新表述为采样值与状态-动作基线的函数,类似于它们在策略梯度强化学习中的用法。新的表述允许估计值由同一回合内的其他估计值进行自举,从而沿采样轨迹传播基线带来的好处;即使从其他估计值自举,这些估计仍保持无偏。最后,我们表明,在给定完美基线的情况下,价值估计的方差可以降为零。实验评估表明,VR-MCCFR带来一个数量级的加速,而经验方差降低了三个数量级。方差的降低使得CFR+首次能够与采样结合使用,将加速提升到两个数量级。
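The baseline construction can be summarized as a control-variate estimator: every action keeps its baseline value, and the sampled action additionally receives an importance-corrected residual, so the estimate stays unbiased while a good baseline absorbs variance. A generic sketch in that spirit (not the paper's full recursive definition):

```python
def baseline_corrected_value(sampled_action, sample_prob, sampled_value, baseline):
    """Baseline-corrected per-action value estimates.
    baseline: dict mapping action -> b(I, a)."""
    est = {}
    for a, b in baseline.items():
        est[a] = b
        if a == sampled_action:
            est[a] = b + (sampled_value - b) / sample_prob   # importance-corrected residual
    return est

# If the baseline were perfect (b equals the true value), the residual term has zero variance.
print(baseline_corrected_value("call", 0.5, 1.2, {"fold": 0.0, "call": 1.0, "raise": 0.4}))
```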

Task Transfer by Preference-Based Cost Learning
通过基于偏好的成本学习进行任务转移
Mingxuan Jing@Tsinghua UniversityXiaojian Ma@Tsinghua UniversityWenbing Huang@Tencent AI LabFuchun Sun@Tsinghua UniversityHuaping Liu@Tsinghua University
井明轩@清华大学马小建@清华大学黄文兵@腾讯AI LabFuchun孙@清华大学刘怀平@清华大学
AAAI Technical Track: Human-AI Collaboration
AAAI技术专栏:人与人工智能的协作
The goal of task transfer in reinforcement learning is migrating the action policy of an agent to the target task from the source task. Given their successes on robotic action planning current methods mostly rely on two requirements: exactlyrelevant expert demonstrations or the explicitly-coded cost function on target task both of which however are inconvenient to obtain in practice. In this paper we relax these two strong conditions by developing a novel task transfer framework where the expert preference is applied as a guidance. In particular we alternate the following two steps: Firstly letting experts apply pre-defined preference rules to select related expert demonstrates for the target task. Secondly based on the selection result we learn the target cost function and trajectory distribution simultaneously via enhanced Adversarial MaxEnt IRL and generate more trajectories by the learned target distribution for the next preference selection. The theoretical analysis on the distribution learning and convergence of the proposed algorithm are provided. Extensive simulations on several benchmarks have been conducted for further verifying the effectiveness of the proposed method.
强化学习中任务转移的目标是将代理的操作策略从源任务迁移到目标任务。考虑到它们在机器人行动计划上的成功,当前的方法主要依赖于两个要求:完全相关的专家演示或目标任务上明确编码的成本函数,但是在实践中这两种方法都不方便。在本文中,我们通过开发一种新颖的任务转移框架来放松这两个强条件,该框架将专家的偏好作为指导。特别是,我们交替执行以下两个步骤:首先,让专家应用预定义的偏好规则来选择目标任务的相关专家演示。其次,基于选择结果,我们通过增强的对抗性MaxEnt IRL同时学习目标成本函数和轨迹分布,并通过学习的目标分布生成更多轨迹用于下一个偏好选择。提供了该算法的分布学习和收敛性的理论分析。为了进一步验证所提出方法的有效性,已经在几个基准上进行了广泛的仿真。

Large-Scale Interactive Recommendation with Tree-Structured Policy Gradient
具有树状结构策略梯度的大规模交互式推荐
Haokun Chen@Shanghai Jiao Tong UniversityXinyi Dai@Shanghai Jiao Tong UniversityHan Cai@Shanghai Jiao Tong UniversityWeinan Zhang@Shanghai Jiao Tong UniversityXuejian Wang@Shanghai Jiao Tong UniversityRuiming Tang@HuaweiYuzhou Zhang@HuaweiYong Yu@Shanghai Jiao Tong University
陈浩坤@上海交通大学戴信义@上海交通大学汉才@上海交通大学张伟楠@上海交通大学Xuejian Wang@上海交通大学唐睿明@华为Yuzhou Zhang@华为俞勇@上海交通大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Reinforcement learning (RL) has recently been introduced to interactive recommender systems (IRS) because of its nature of learning from dynamic interactions and planning for long-run performance. As an IRS always has thousands of items to recommend (i.e. thousands of actions), most existing RL-based methods fail to handle such a large discrete action space problem and thus become inefficient. The existing work that tries to deal with the large discrete action space problem by utilizing the deep deterministic policy gradient framework suffers from the inconsistency between the continuous action representation (the output of the actor network) and the real discrete action. To avoid such inconsistency and achieve high efficiency and recommendation effectiveness, in this paper we propose a Tree-structured Policy Gradient Recommendation (TPGR) framework, where a balanced hierarchical clustering tree is built over the items and picking an item is formulated as seeking a path from the root to a certain leaf of the tree. Extensive experiments on carefully-designed environments based on two real-world datasets demonstrate that our model provides superior recommendation performance and significant efficiency improvement over state-of-the-art methods.
强化学习(RL)因其从动态交互中学习并为长期性能做规划的特性,最近被引入交互式推荐系统(IRS)。由于IRS通常有数千个待推荐物品(即数千个动作),大多数现有的基于RL的方法无法处理如此大的离散动作空间问题,因而效率低下。已有工作尝试利用深度确定性策略梯度框架来处理大离散动作空间问题,但受到连续动作表示(演员网络的输出)与真实离散动作之间不一致的困扰。为了避免这种不一致并兼顾高效率与推荐效果,本文提出了一种树结构策略梯度推荐(TPGR)框架:在物品之上构建一棵平衡的层次聚类树,并将选择一个物品表述为寻找一条从树根到某个叶子节点的路径。在基于两个真实数据集精心设计的环境上进行的大量实验表明,与最新方法相比,我们的模型提供了更优的推荐性能和显著的效率提升。
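A sketch of the tree-structured selection step: each internal node of a balanced clustering tree holds a small policy over its children, and recommending an item means sampling a root-to-leaf path. The tree construction and the node policies themselves are outside this sketch, and the data layout below is an assumption.

```python
import numpy as np

def sample_item(tree, policies, state, rng=np.random.default_rng()):
    """Sample a root-to-leaf path. tree: dict node -> list of children (leaves are
    item ids given as ints); policies: dict node -> callable state -> probabilities
    over that node's children."""
    node, log_prob = "root", 0.0
    while node in tree:                                # descend until reaching a leaf (an item id)
        children = tree[node]
        probs = policies[node](state)
        idx = rng.choice(len(children), p=probs)
        log_prob += np.log(probs[idx])
        node = children[idx]
    return node, log_prob                              # item id and its path log-probability

# Toy 2-level tree over 4 items with uniform node policies.
tree = {"root": ["c0", "c1"], "c0": [0, 1], "c1": [2, 3]}
uniform = lambda state: np.array([0.5, 0.5])
policies = {"root": uniform, "c0": uniform, "c1": uniform}
item, logp = sample_item(tree, policies, state=None)
```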

State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning
用于风险敏感型强化学习的状态增强转换
Shuai Ma@Concordia UniversityJia Yuan Yu@Concordia University
马帅@康科迪亚大学贾元瑜@康科迪亚大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In the framework of MDPs, although the general reward function takes three arguments (current state, action, and successor state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective involves only the expected total reward, this simplification works perfectly. However, when the objective is risk-sensitive, this simplification leads to an incorrect value. We propose three successively more general state-augmentation transformations (SATs), which preserve the reward sequences as well as the reward distributions and the optimal policy in risk-sensitive reinforcement learning. In risk-sensitive scenarios, firstly we prove that for every MDP with a stochastic transition-based reward function there exists an MDP with a deterministic state-based reward function, such that for any given (randomized) policy for the first MDP there exists a corresponding policy for the second MDP such that both Markov reward processes share the same reward sequence. Secondly, we illustrate that two situations require the proposed SATs in an inventory control problem. One could be using Q-learning (or other learning methods) on MDPs with transition-based reward functions, and the other could be using methods intended for Markov processes with deterministic state-based reward functions on Markov processes with general reward functions. We show the advantage of the SATs by considering Value-at-Risk as an example, which is a risk measure on the reward distribution instead of measures (such as mean and variance) of the distribution. We illustrate the error in the reward distribution estimation caused by the reward simplification, and show how the SATs enable a variance formula to work on Markov processes with general reward functions.
在MDP的框架中,尽管通用奖励函数采用三个参数-当前状态动作和后继状态;它通常简化为两个参数的函数-当前状态和操作。前者称为基于过渡的奖励函数,而后者称为基于状态的奖励函数。当目标涉及预期的总回报时,只有这种简化才能完美地发挥作用。但是,当目标对风险敏感时,这种简化会导致错误的值。我们提出了三个连续的更一般的状态增强转换(SAT),它们在风险敏感型强化学习中保留了奖励序列,奖励分布和最优策略。首先,在风险敏感的场景中,我们证明,对于每个具有基于随机过渡的奖励函数的MDP,都存在一个具有基于状态的确定性奖励函数的MDP,从而对于第一个MDP的任何给定(随机)策略,都存在一个对应的策略。对于第二个MDP,以使两个Markov奖励过程共享相同的奖励序列。其次,我们说明在库存控制问题中有两种情况需要拟议的SAT。一种可能是在具有基于过渡的奖励函数的MDP上使用Q学习(或其他学习方法),另一种可能是在具有通用奖励函数的Markov过程上使用具有确定性基于状态的奖励函数的Markov过程的方法。我们以风险价值为例来说明SAT的优势,该风险价值是对报酬分布的风险度量,而不是对分布的度量(例如均值和方差)。我们通过简化奖励来说明奖励分配估计中的错误,并说明SAT如何使方差公式在具有通用奖励函数的Markov过程上起作用。
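One way to make the state-augmentation idea concrete is a wrapper that carries the accumulated discounted reward inside the state, after which distributional quantities such as Value-at-Risk of the return can be estimated from terminal augmented states. This is only an illustration of the general trick, with a gym-like interface and a crude discretization as assumptions, not the paper's three SATs.

```python
import numpy as np

class RewardAugmentedState:
    """Environment wrapper whose state is (original state, discretized return-so-far)."""

    def __init__(self, env, gamma=0.99, bin_width=0.5):
        self.env, self.gamma, self.bin_width = env, gamma, bin_width

    def reset(self):
        self.t, self.acc = 0, 0.0
        return (self.env.reset(), 0)

    def step(self, action):
        s_next, r, done, info = self.env.step(action)   # gym-like 4-tuple (assumed interface)
        self.acc += (self.gamma ** self.t) * r
        self.t += 1
        acc_bin = int(round(self.acc / self.bin_width))  # crude discretization for tabular use
        return (s_next, acc_bin), r, done, info

def value_at_risk(returns, alpha=0.05):
    """Empirical VaR_alpha of a sample of returns (the lower alpha-quantile)."""
    return float(np.quantile(np.asarray(returns), alpha))
```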

Regularized Evolution for Image Classifier Architecture Search
图像分类器体系结构搜索的正则化进化
Esteban Real@Google BrainAlok Aggarwal@Google BrainYanping Huang@Google BrainQuoc V. Le@Google Brain
Esteban Real @ Google BrainAlok Aggarwal @ Google BrainYanping Huang @ Google BrainQuoc V. Le @ Google Brain
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
The effort devoted to hand-crafting neural network image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to neural network topologies, the image classifiers thus discovered have remained inferior to human-crafted ones. Here we evolve an image classifier, AmoebaNet-A, that surpasses hand-designs for the first time. To do this, we modify the tournament selection evolutionary algorithm by introducing an age property to favor the younger genotypes. At matching size, AmoebaNet-A has comparable accuracy to current state-of-the-art ImageNet models discovered with more complex architecture-search methods. Scaled to larger size, AmoebaNet-A sets a new state-of-the-art 83.9% top-1 / 96.6% top-5 ImageNet accuracy. In a controlled comparison against a well-known reinforcement learning algorithm, we give evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search. This is relevant when fewer compute resources are available. Evolution is thus a simple method to effectively discover high-quality architectures.
致力于手工制作神经网络图像分类器的工作促使使用体系结构搜索来自动发现它们。尽管进化算法已经被重复地应用于神经网络拓扑,但是由此发现的图像分类器仍然不如人为的分类器。在这里,我们开发了一种图像分类器AmoebaNet-A,该分类器首次超过了手工设计。为此,我们通过引入年龄属性来支持年轻的基因型来修改锦标赛选择进化算法。匹配大小的AmoebaNet-A具有与通过更复杂的体系结构搜索方法发现的当前最新ImageNet模型相当的准确性。缩放到更大尺寸AmoebaNet-A设置了最新的83.9%top-1 / 96.6%top-5 ImageNet精度。在与众所周知的强化学习算法的受控比较中,我们提供了证据,表明进化可以使用相同的硬件更快地获得结果,尤其是在搜索的早期阶段。当可用的计算资源较少时,这是相关的。因此,进化是一种有效发现高质量架构的简单方法。
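The aging-evolution loop described in the abstract is short enough to sketch directly: tournament selection picks the parent, and the oldest individual (not the worst) is removed each cycle; `mutate`, `random_arch` and `fitness` are problem-specific callables supplied by the user, and the toy bit-string example is an assumption.

```python
import random
import collections

def regularized_evolution(mutate, random_arch, fitness, cycles=1000,
                          population_size=50, sample_size=10):
    """Tournament selection with age-based removal (aging evolution)."""
    population, history = collections.deque(), []
    for _ in range(population_size):                 # initialize with random architectures
        arch = random_arch()
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)   # tournament
        parent = max(sample, key=lambda x: x[1])
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                          # aging: remove the oldest, not the worst
    return max(history, key=lambda x: x[1])

# Toy usage: "architectures" are bit strings and fitness counts ones.
best = regularized_evolution(
    mutate=lambda a: [b ^ (random.random() < 0.1) for b in a],
    random_arch=lambda: [random.randint(0, 1) for _ in range(16)],
    fitness=sum, cycles=200)
```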

Verifiable and Interpretable Reinforcement Learning through Program Synthesis
通过程序综合可验证且可解释的强化学习
Abhinav Verma@Rice University
阿比纳夫·维尔玛@莱斯大学
Doctoral Consortium Track Abstracts
博士联合会文摘
We study the problem of generating interpretable and verifiable policies for Reinforcement Learning (RL). Unlike the popular Deep Reinforcement Learning (DRL) paradigm in which the policy is represented by a neural network the aim of this work is to find policies that can be represented in highlevel programming languages. Such programmatic policies have several benefits including being more easily interpreted than neural networks and being amenable to verification by scalable symbolic methods. The generation methods for programmatic policies also provide a mechanism for systematically using domain knowledge for guiding the policy search. The interpretability and verifiability of these policies provides the opportunity to deploy RL based solutions in safety critical environments. This thesis draws on and extends work from both the machine learning and formal methods communities.
我们研究为强化学习(RL)生成可解释和可验证的策略的问题。与流行的深度强化学习(DRL)范式不同,在后者中,策略由神经网络表示。这项工作的目的是找到可以用高级编程语言表示的策略。这样的程序化策略具有许多好处,包括比神经网络更容易解释,并且易于通过可伸缩的符号方法进行验证。程序化策略的生成方法还提供了一种系统地使用领域知识来指导策略搜索的机制。这些策略的可解释性和可验证性提供了在安全关键环境中部署基于RL的解决方案的机会。本文借鉴并扩展了机器学习和形式方法社区的工作。

Learning Resource Allocation and Pricing for Cloud Profit Maximization
学习资源分配和定价以实现云利润最大化
Bingqian Du@The University of Hong KongChuan Wu@The University of Hong KongZhiyi Huang@The University of Hong Kong
杜炳谦@香港大学吴传@香港大学黄志怡@香港大学
AAAI Technical Track: Planning Routing and Scheduling
AAAI技术专区:规划路由和计划
Cloud computing has been widely adopted to support various computation services. A fundamental problem faced by cloud providers is how to efficiently allocate resources upon user requests and price the resource usage, in order to maximize resource efficiency and hence provider profit. Existing studies establish detailed performance models of cloud resource usage and propose offline or online algorithms to decide allocation and pricing. Differently, we adopt a black-box approach and leverage model-free Deep Reinforcement Learning (DRL) to capture dynamics of cloud users and better characterize inherent connections between an optimal allocation/pricing policy and the states of the dynamic cloud system. The goal is to learn a policy that maximizes the net profit of the cloud provider through trial and error, which is better than decisions made on explicit performance models. We combine long short-term memory (LSTM) units with fully-connected neural networks in our DRL to deal with online user arrivals, and adjust the output and update methods of basic DRL algorithms to address both resource allocation and pricing. Evaluation based on real-world datasets shows that our DRL approach outperforms basic DRL algorithms and state-of-the-art white-box online cloud resource allocation/pricing algorithms significantly, in terms of both profit and the number of accepted users.
云计算已被广泛采用以支持各种计算服务。云提供商面临的一个基本问题是如何根据用户请求有效地分配资源并为资源使用定价,以最大程度地提高资源效率,从而最大化提供商的利润。现有研究建立了详细的云资源使用性能模型,并提出了离线或在线算法来决定分配和定价。不同地,我们采用黑盒方法,并利用无模型的深度强化学习(DRL)来捕获云用户的动态,并更好地表征最佳分配/定价策略与动态云系统状态之间的内在联系。目标是学习一种通过反复试验使云提供商的净利润最大化的策略,该策略比对显式性能模型做出的决策要好。我们在DRL中将长短期记忆(LSTM)单元与完全连接的神经网络相结合,以处理在线用户的到来情况,并调整基本DRL算法的输出和更新方法,以解决资源分配和定价问题。基于现实世界数据集的评估表明,我们的DRL方法在利润和接受用户数量方面均明显优于基本DRL算法和最新的白盒在线云资源分配/定价算法。

Deictic Image Mapping: An Abstraction for Learning Pose Invariant Manipulation Policies
Deictic图像映射:学习姿势不变操作策略的抽象
Robert Platt@Northeastern UniversityColin Kohler@Northeastern UniversityMarcus Gualtieri@Northeastern University
罗伯特·普拉特(Robert Platt)@东北大学科林·科勒(Colin Kohler)@东北大学马库斯·古铁里(Marcus Gualtieri)@东北大学
AAAI Technical Track: Robotics
AAAI技术专栏:机器人技术
In applications of deep reinforcement learning to robotics, it is often the case that we want to learn pose invariant policies: policies that are invariant to changes in the position and orientation of objects in the world. For example, consider a peg-in-hole insertion task. If the agent learns to insert a peg into one hole, we would like that policy to generalize to holes presented in different poses. Unfortunately, this is a challenge using conventional methods. This paper proposes a novel state and action abstraction that is invariant to pose shifts, called deictic image maps, that can be used with deep reinforcement learning. We provide broad conditions under which optimal abstract policies are optimal for the underlying system. Finally, we show that the method can help solve challenging robotic manipulation problems.
在将深度强化学习应用于机器人领域时,我们常常希望学习位姿不变的策略:即对世界中物体位置和朝向变化保持不变的策略。例如,考虑一个轴孔装配(peg-in-hole)插入任务。如果智能体学会将销钉插入某一个孔,我们希望该策略能够推广到以不同位姿呈现的孔。不幸的是,使用常规方法这是一个挑战。本文提出了一种对位姿变化保持不变的新颖状态与动作抽象,称为指示性图像映射(deictic image maps),可与深度强化学习一起使用。我们给出了较宽泛的条件,使得最优抽象策略对底层系统也是最优的。最后,我们表明该方法有助于解决具有挑战性的机器人操作问题。

The Utility of Sparse Representations for Control in Reinforcement Learning
稀疏表示法在强化学习中的控制作用
Vincent Liu@University of AlbertaRaksha Kumaraswamy@University of AlbertaLei Le@Indiana University BloomingtonMartha White@University of Alberta
Vincent Liu @ Alberta大学Raksha Kumaraswamy @ Alberta大学Lei Le @印第安纳大学BloomingtonMartha White @ Alberta大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We investigate sparse representations for control in reinforcement learning. While these representations are widely used in computer vision their prevalence in reinforcement learning is limited to sparse coding where extracting representations for new data can be computationally intensive. Here we begin by demonstrating that learning a control policy incrementally with a representation from a standard neural network fails in classic control domains whereas learning with a representation obtained from a neural network that has sparsity properties enforced is effective. We provide evidence that the reason for this is that the sparse representation provides locality and so avoids catastrophic interference and particularly keeps consistent stable values for bootstrapping. We then discuss how to learn such sparse representations. We explore the idea of Distributional Regularizers where the activation of hidden nodes is encouraged to match a particular distribution that results in sparse activation across time. We identify a simple but effective way to obtain sparse representations not afforded by previously proposed strategies making it more practical for further investigation into sparse representations for reinforcement learning.
我们调查在强化学习中控制的稀疏表示。尽管这些表示法已广泛用于计算机视觉,但它们在强化学习中的流行仅限于稀疏编码,在这种情况下,提取新数据的表示法可能需要大量计算。在这里,我们首先说明,在经典控制域中,使用来自标准神经网络的表示来增量学习控制策略会失败,而使用具有强制稀疏性的神经网络所获得的表示来学习则是有效的。我们提供的证据表明,这样做的原因是稀疏表示提供了局部性,因此避免了灾难性干扰,尤其是对于自举保持了稳定的稳定值。然后,我们讨论如何学习这种稀疏表示。我们探讨了分布正则器的思想,其中鼓励隐藏节点的激活以匹配特定分布,从而导致跨时间的稀疏激活。我们确定了一种简单但有效的方法来获得先前提出的策略无法提供的稀疏表示,这使得它对于进一步研究稀疏表示以进行强化学习更为实用。
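A simplified stand-in for the distributional-regularizer idea: penalize the KL divergence between a small target activation level and each hidden unit's average activation over a batch, which pushes the representation toward sparsity. The paper's regularizer is defined over more general target distributions, so treat this as the core intuition only; the array shapes are assumptions.

```python
import numpy as np

def sparsity_regularizer(activations, target_rho=0.1, eps=1e-8):
    """KL(target_rho || rho_hat) summed over hidden units.
    activations: (batch, n_hidden) array with entries in [0, 1]."""
    rho_hat = np.clip(np.mean(activations, axis=0), eps, 1 - eps)
    kl = (target_rho * np.log(target_rho / rho_hat)
          + (1 - target_rho) * np.log((1 - target_rho) / (1 - rho_hat)))
    return kl.sum()

acts = np.random.rand(32, 64) ** 4        # a skewed toy batch of hidden activations
penalty = sparsity_regularizer(acts, target_rho=0.1)
```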

Determinantal Reinforcement Learning
行列式强化学习
Takayuki Osogami@IBM Research - TokyoRudy Raymond@IBM Research - Tokyo
Takayuki Osogami@IBM研究院-东京Rudy Raymond@IBM研究院-东京
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We study reinforcement learning for controlling multiple agents in a collaborative manner. In some of those tasks it is insufficient for the individual agents to take relevant actions but those actions should also have diversity. We propose the approach of using the determinant of a positive semidefinite matrix to approximate the action-value function in reinforcement learning where we learn the matrix in a way that it represents the relevance and diversity of the actions. Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning. The proposed approach is also shown to achieve the performance that cannot be achieved with conventional approaches in partially observable environment with exponentially large action space.
我们研究以协作方式控制多个代理的强化学习。在其中某些任务中,单个代理人不足以采取相关行动,但这些行动也应具有多样性。我们提出了使用正半定矩阵的行列式逼近强化学习中的动作值函数的方法,其中我们以表示动作的相关性和多样性的方式学习矩阵。实验结果表明,在多主体强化学习的基准任务中,所提出的方法使代理能够学习比基准方法快近十倍的最佳策略。还证明了所提出的方法在具有指数大的动作空间的部分可观察的环境中实现了常规方法无法实现的性能。
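As a rough illustration of the determinant-based value: a joint action (a set of chosen actions) can be scored by the log-determinant of the corresponding principal submatrix of a learned positive semidefinite kernel, which is large when the chosen actions are both individually relevant and mutually diverse. The sketch below assumes such a kernel L is already given; how the paper parameterizes and learns it is not shown here.

    import numpy as np

    def determinantal_action_value(L, chosen_actions):
        # L: (num_actions, num_actions) positive semidefinite matrix whose diagonal
        # reflects per-action relevance and whose off-diagonals reflect similarity.
        # The log-determinant of the principal submatrix for the chosen joint action
        # rewards sets of actions that are relevant and diverse at the same time.
        sub = L[np.ix_(chosen_actions, chosen_actions)]
        sign, logdet = np.linalg.slogdet(sub)
        return logdet if sign > 0 else -np.inf

    # example: a 3-agent joint action choosing actions 0, 2 and 5
    # q = determinantal_action_value(L, [0, 2, 5])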

Geometric Multi-Model Fitting by Deep Reinforcement Learning
深度强化学习的几何多模型拟合
Zongliang Zhang@Xiamen UniversityHongbin Zeng@Xiamen UniversityJonathan Li@Xiamen UniversityYiping Chen@Xiamen UniversityChenhui Yang@Xiamen UniversityCheng Wang@Xiamen University
张宗亮@厦门大学曾宏斌@厦门大学李宗盛@厦门大学陈萍萍@厦门大学杨晨辉@厦门大学王成@厦门大学
Student Abstracts
学生文摘
This paper deals with the geometric multi-model fitting from noisy unstructured point set data (e.g. laser scanned point clouds). We formulate multi-model fitting problem as a sequential decision making process. We then use a deep reinforcement learning algorithm to learn the optimal decisions towards the best fitting result. In this paper we have compared our method against the state-of-the-art on simulated data. The results demonstrated that our approach significantly reduced the number of fitting iterations.
本文从嘈杂的非结构化点集数据(例如激光扫描点云)处理几何多模型拟合。我们将多模型拟合问题公式化为顺序决策过程。然后,我们使用深度强化学习算法来学习朝着最佳拟合结果的最佳决策。在本文中,我们将我们的方法与模拟数据的最新技术进行了比较。结果表明,我们的方法大大减少了拟合迭代的次数。

VidyutVanika: A Reinforcement Learning Based Broker Agent for a Power Trading Competition
VidyutVanika:基于增强学习的电力交易竞赛经纪人代理
Susobhan Ghosh@International Institute of Information Technology HyderabadEaswar Subramanian@Tata Consultancy Services LimitedSanjay P. Bhat@Tata Consultancy Services LimitedSujit Gujar@International Institute of Information Technology HyderabadPraveen Paruchuri@Indian Institute of Technology Hyderabad
Susobhan Ghosh @海得拉巴国际信息技术研究所Easwar Subramanian @塔塔咨询服务有限公司Sanjay P. Bhat @塔塔咨询服务有限公司Sujit Gujar @国际信息技术研究所海得拉巴Praveen Paruchuri @印度海得拉巴技术学院
AAAI Technical Track: Applications
AAAI技术专题:应用
A smart grid is an efficient and sustainable energy system that integrates diverse generation entities, distributed storage capacity, and smart appliances and buildings. A smart grid brings new kinds of participants into the energy market it serves, whose effect on the grid can only be determined through high-fidelity simulations. Power TAC offers one such simulation platform using real-world weather data and complex state-of-the-art customer models. In Power TAC, autonomous energy brokers compete to make profits across tariff, wholesale, and balancing markets while maintaining the stability of the grid. In this paper we design VidyutVanika, an autonomous broker and the runner-up in the 2018 Power TAC competition. VidyutVanika relies on reinforcement learning (RL) in the tariff market and dynamic programming in the wholesale market to solve modified versions of known Markov Decision Process (MDP) formulations in the respective markets. The novelty lies in defining the reward functions for the MDPs, solving these MDPs, and applying these solutions to real actions in the market. Unlike previous participating agents, VidyutVanika uses a neural network to predict the energy consumption of various customers using weather data. We use several heuristic ideas to bridge the gap between the restricted action spaces of the MDPs and the much more extensive action space available to VidyutVanika. These heuristics allow VidyutVanika to convert near-optimal fixed tariffs to time-of-use tariffs aimed at mitigating transmission capacity fees, spread out its orders across several auctions in the wholesale market to procure energy at a lower price, more accurately estimate the parameters required for implementing the MDP solution in the wholesale market, and account for wholesale procurement costs while optimizing tariffs. We use Power TAC 2018 tournament data and controlled experiments to analyze the performance of VidyutVanika and illustrate the efficacy of the above strategies.
智能电网是一种高效且可持续的能源系统,它集成了分布式发电容量,智能电器和建筑物等各种发电实体。智能电网为能源市场带来了新的参与者,其对电网的影响只能通过高保真度模拟来确定。 Power TAC使用真实的天气数据和复杂的最新客户模型提供了这样一种模拟平台。在Power TAC中,自治的能源经纪人竞争以在整个电价批发和平衡市场中获利,同时保持电网的稳定性。在本文中,我们设计了自治经纪人VidyutVanika在2018 Power TAC竞赛中获得亚军。 VidyutVanika依靠关税市场中的强化学习(RL)和批发市场中的动态编程来解决各个市场中已知的马尔可夫决策过程(MDP)公式的修改版本。新颖之处在于为解决这些MDP的MDP定义奖励功能,以及将这些解决方案应用于市场中的实际行为。与以前的参与代理商不同,VidyutVanika使用神经网络使用天气数据预测各种客户的能源消耗。我们使用几种启发式的想法来弥合MDP受限行动空间与VidyutVanika可用的更广泛行动空间之间的鸿沟。这些试探法使VidyutVanika可以将近乎最佳的固定关税转换为使用时间关税,以减轻在批发市场的几次拍卖中散布其订单的传输容量费用,从而以较低的价格获取能源,从而更准确地估算实施该协议所需的参数。批发市场中的MDP解决方案,并在优化关税的同时考虑了批发采购成本。我们使用Power TAC 2018锦标赛数据和受控实验来分析VidyutVanika的表现并说明上述策略的功效。

Towards Better Interpretability in Deep Q-Networks
在深度Q网络中实现更好的可解释性
Raghuram Mandyam Annasamy@Carnegie Mellon UniversityKatia Sycara@Carnegie Mellon University
Raghuram Mandyam Annasamy @卡内基·梅隆大学Katia Sycara @卡内基·梅隆大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Deep reinforcement learning techniques have demonstrated superior performance in a wide variety of environments. As improvements in training algorithms continue at a brisk pace theoretical or empirical studies on understanding what these networks seem to learn are far behind. In this paper we propose an interpretable neural network architecture for Q-learning which provides a global explanation of the model’s behavior using key-value memories attention and reconstructible embeddings. With a directed exploration strategy our model can reach training rewards comparable to the state-of-the-art deep Q-learning models. However results suggest that the features extracted by the neural network are extremely shallow and subsequent testing using out-of-sample examples shows that the agent can easily overfit to trajectories seen during training.
深度强化学习技术已在各种环境中表现出卓越的性能。随着训练算法的快速发展,关于理解这些网络似乎要学习什么的理论或实证研究也远远落后。在本文中,我们提出了一种用于Q学习的可解释的神经网络体系结构,该体系结构使用键值存储注意力和可重构的嵌入方式来提供模型行为的全局解释。通过定向探索策略,我们的模型可以获得与最新的深度Q学习模型相当的培训奖励。但是结果表明,由神经网络提取的特征非常浅,随后使用样本外示例进行的测试表明,该代理很容易过度适应训练期间看到的轨迹。

Bootstrap Estimated Uncertainty of the Environment Model for Model-Based Reinforcement Learning
基于模型的强化学习的Bootstrap估计环境模型的不确定性
Wenzhen Huang@Chinese Academy of SciencesJunge Zhang@Chinese Academy of SciencesKaiqi Huang@Chinese Academy of Sciences
黄文珍@中国科学院Junge Zhang@中国科学院黄凯琪@中国科学院
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Model-based reinforcement learning (RL) methods attempt to learn a dynamics model to simulate the real environment and utilize the model to make better decisions. However, the learned environment simulator often has more or less model error, which can disturb decision making and reduce performance. We propose a bootstrapped model-based RL method which bootstraps the modules in each depth of the planning tree. This method can quantify the uncertainty of the environment model on different state-action pairs and lead the agent to explore the pairs with higher uncertainty to reduce the potential model errors. Moreover, we sample target values from their bootstrap distribution to connect the uncertainties at current and subsequent time-steps, and introduce a prior mechanism to improve exploration efficiency. Experiment results demonstrate that our method efficiently decreases model error and outperforms TreeQN and other state-of-the-art methods on multiple Atari games.
基于模型的强化学习(RL)方法尝试学习动力学模型以模拟实际环境,并利用该模型做出更好的决策。但是,学习型环境模拟器通常具有或多或少的模型误差,这会干扰决策并降低性能。我们提出了一种基于自举模型的RL方法,该方法将模块引导到规划树的每个深度。该方法可以量化不同状态-动作对上环境模型的不确定性,并使代理探索具有更高不确定性的对,以减少潜在的模型误差。此外,我们从其引导分布中采样目标值,以连接当前和后续时间步的不确定性,并介绍了提高勘探效率的现有机制。实验结果表明,我们的方法可以有效地减少模型误差,并且在多个Atari游戏中均优于TreeQN和其他最新方法。
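One simple way to read the uncertainty idea: train several dynamics models on bootstrapped data and use their disagreement on a state-action pair as an uncertainty signal that exploration can prefer. The predict interface and the standard-deviation measure below are illustrative assumptions, not the paper's exact construction.

    import numpy as np

    def model_uncertainty(models, state, action):
        # models: a list of dynamics models trained on bootstrapped samples of the
        # replay data, each mapping (state, action) to a predicted next state.
        preds = np.stack([m.predict(state, action) for m in models])  # (K, state_dim)
        return preds.std(axis=0).mean()  # disagreement across the ensemble

    # exploration could then favor the candidate action with the largest disagreement:
    # a = max(candidate_actions, key=lambda a: model_uncertainty(models, s, a))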

Safe Policy Improvement with Baseline Bootstrapping in Factored Environments
通过分解环境中的基准引导来改进安全策略
Thiago D. Simão@Delft University of TechnologyMatthijs T. J. Spaan@Delft University of Technology
ThiagoD.Simão@代尔夫特理工大学Matthijs T.J.Spaan @代尔夫特理工大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy it is necessary to provide a confidence level regarding its expected performance. However algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning on the other hand is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
我们提出了一种新颖的安全强化学习算法,该算法利用环境的因式动力学来降低保守性。我们专注于已经在运行策略并且与环境的交互受到限制的问题设置。为了安全地部署更新的策略,有必要提供有关其预期性能的置信度。但是,用于改进安全策略的算法可能需要大量的过去经验,以变得足够自信以改变代理的行为。另一方面,已知因式强化学习可以很好地利用所提供的数据。通过利用环境特征之间的独立性,可以实现更好的样本复杂性,但是它缺乏置信度。我们研究如何通过利用环境的分解结构来使用基准自举算法提高安全策略改进的样本效率。我们的主要结果是理论上的界限,该界限在因式表示的参数数量而不是状态数量上呈线性关系。实证分析表明,我们的方法可以使用可能比扁平算法小一个数量级的样本来改善策略。

A Hierarchical Framework for Relation Extraction with Reinforcement Learning
强化学习中关系提取的层次框架
Ryuichi Takanobu@Tsinghua UniversityTianyang Zhang@Tsinghua UniversityJiexi Liu@Tsinghua UniversityMinlie Huang@Tsinghua University
高信隆一@清华大学张天阳@清华大学刘洁熙@清华大学黄敏烈@清华大学
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
Most existing methods determine relation types only after all the entities have been recognized; thus the interaction between relation types and entity mentions is not fully modeled. This paper presents a novel paradigm to deal with relation extraction by regarding the related entities as the arguments of a relation. We apply a hierarchical reinforcement learning (HRL) framework in this paradigm to enhance the interaction between entity mentions and relation types. The whole extraction process is decomposed into a hierarchy of two-level RL policies for relation detection and entity extraction, respectively, so that it is more feasible and natural to deal with overlapping relations. Our model was evaluated on public datasets collected via distant supervision, and results show that it gains better performance than existing methods and is more powerful for extracting overlapping relations.
大多数现有方法仅在所有实体都已被识别之后才确定关系类型,因此关系类型与实体提及之间的交互未完全建模。通过将相关实体视为关系的参数,本文提出了一种处理关系提取的新范式。我们在此范例中应用了分层强化学习(HRL)框架,以增强实体提及和关系类型之间的交互。整个提取过程被分解为分别用于关系检测和实体提取的两级RL策略层次结构,因此处理重叠关系更为可行和自然。通过远程监督收集的公共数据集对我们的模型进行了评估,结果表明,该模型比现有方法具有更好的性能,并且在提取重叠关系方面更强大1。

State Abstraction as Compression in Apprenticeship Learning
学徒学习中的状态抽象作为压缩
David Abel@Brown UniversityDilip Arumugam@Stanford UniversityKavosh Asadi@Brown UniversityYuu Jinnai@Brown UniversityMichael L. Littman@Brown UniversityLawson L.S. Wong@Northeastern University
David Abel @布朗大学Dilip Arumugam @斯坦福大学Kavosh Asadi @布朗大学Yuu Jinnai @布朗大学Michael L.Littman @布朗大学Lawson L.S.黄@东北大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
State abstraction can give rise to models of environments that are both compressed and useful thereby enabling efficient sequential decision making. In this work we offer the first formalism and analysis of the trade-off between compression and performance made in the context of state abstraction for Apprenticeship Learning. We build on Rate-Distortion theory the classic Blahut-Arimoto algorithm and the Information Bottleneck method to develop an algorithm for computing state abstractions that approximate the optimal tradeoff between compression and performance. We illustrate the power of this algorithmic structure to offer insights into effective abstraction compression and reinforcement learning through a mixture of analysis visuals and experimentation.
状态抽象可以产生既压缩又有用的环境模型,从而实现有效的顺序决策。在这项工作中,我们提供了第一个形式主义,并分析了在学徒学习状态抽象的背景下进行的压缩与性能之间的权衡。我们基于速率失真理论,经典的Blahut-Arimoto算法和信息瓶颈方法,开发了一种用于计算状态抽象的算法,该算法可近似估算压缩与性能之间的最佳折衷。我们说明了这种算法结构的强大功能,可通过结合分析视觉效果和实验来提供有效的抽象压缩和强化学习的见解。

Online Pandora’s Boxes and Bandits
在线潘多拉魔盒和土匪
Hossein Esfandiari@Google ResearchMohammadTaghi HajiAghayi@University of MarylandBrendan Lucier@Microsoft Research New EnglandMichael Mitzenmacher@Harvard University
侯赛因·埃斯凡迪亚里(Hossein Esfandiari)@ Google ResearchMohammadTaghi HajiAghayi @马里兰大学布兰丹·卢西尔(Brendan Lucier)@微软新英格兰研究院迈克尔·米森马赫(Michael Mitzenmacher)@哈佛大学
AAAI Technical Track: Game Theory and Economic Paradigms
AAAI技术专题:博弈论与经济范式
We consider online variations of the Pandora’s box problem (Weitzman 1979), a standard model for understanding issues related to the cost of acquiring information for decision-making. Our problem generalizes both the classic Pandora’s box problem and the prophet inequality framework. Boxes are presented online, each with a random value and cost drawn jointly from some known distribution. Pandora chooses online whether to open each box given its cost, and then chooses irrevocably whether to keep the revealed prize or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box and use optimal offline policies to decide which boxes to open (without knowledge of the value inside). We consider variations where Pandora can collect multiple prizes subject to feasibility constraints such as cardinality, matroid, or knapsack constraints. We also consider variations related to classic multi-armed bandit problems from reinforcement learning. Our results use a reduction-based framework where we separate the issues of the cost of acquiring information from the online decision process of which prizes to keep. Our work shows that in many scenarios Pandora can achieve a good approximation to the best possible performance.
我们认为潘多拉魔盒问题的在线变化(Weitzman 1979)是理解与决策信息获取成本有关的问题的标准模型。我们的问题概括了经典的潘多拉盒子问题和先知不平等框架。在线显示每个框,每个框的随机值和成本是从某些已知分布中共同得出的。潘多拉(Pandora)在线选择是否打开每个已确定成本的盒子,然后无可挽回地选择是保留所显示的奖品还是将其传递。我们的目标是针对可以选择任何已打开包装盒中最大奖品并使用最佳离线策略来决定打开哪个包装盒的对手(无需了解其中的价值)的对手的近似算法。我们考虑了各种变化,其中Pandora可以根据可行性约束(例如基数拟阵或背包约束)收集多个奖项。我们还考虑了强化学习中与经典多臂匪问题有关的变化。我们的结果使用基于减少的框架,在该框架中,我们将获取信息的成本问题与保留哪些奖品的在线决策过程分开。我们的工作表明,在许多情况下,Pandora可以很好地逼近最佳性能。

Meta Learning for Image Captioning
元学习的图像字幕
Nannan Li@Wuhan UniversityZhenzhong Chen@Wuhan UniversityShan Liu@Tencent America
李南南@武汉大学陈振中@武汉大学刘珊@腾讯美国
AAAI Technical Track: Vision
AAAI技术轨道:愿景
Reinforcement learning (RL) has shown its advantages in image captioning by optimizing the non-differentiable metric directly in the reward learning process. However due to the reward hacking problem in RL maximizing reward may not lead to better quality of the caption especially from the aspects of propositional content and distinctiveness. In this work we propose to use a new learning method meta learning to utilize supervision from the ground truth whilst optimizing the reward function in RL. To improve the propositional content and the distinctiveness of the generated captions the proposed model provides the global optimal solution by taking different gradient steps towards the supervision task and the reinforcement task simultaneously. Experimental results on MS COCO validate the effectiveness of our approach when compared with the state-of-the-art methods.
强化学习(RL)通过直接在奖励学习过程中优化不可微化指标,显示了其在图像字幕中的优势。但是,由于RL中存在奖励黑客问题,因此,最大化奖励可能无法带来更好的字幕质量,尤其是从命题内容和独特性方面而言。在这项工作中,我们建议使用一种新的学习方法元学习来利用来自地面事实的监督,同时优化RL中的奖励功能。为了提高命题内容和字幕的独特性,提出的模型通过同时采取不同的梯度步骤来实现监督任务和加固任务,从而提供了全局最优解。与最新技术相比,MS COCO的实验结果验证了我们方法的有效性。

Long Short-Term Memory with Dynamic Skip Connections
具有动态跳过连接的长短期记忆
Tao Gui@Fudan UniversityQi Zhang@Fudan UniversityLujun Zhao@Fudan UniversityYaosong Lin@Fudan UniversityMinlong Peng@Fudan UniversityJingjing Gong@Fudan UniversityXuanjing Huang@Fudan University
陶桂@复旦大学张琦@复旦大学赵陆军@复旦大学林耀嵩@复旦大学彭敏龙@复旦大学经京经@复旦大学黄宣京@复旦大学
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
In recent years long short-term memory (LSTM) has been successfully used to model sequential data of variable length. However LSTM can still experience difficulty in capturing long-term dependencies. In this work we tried to alleviate this problem by introducing a dynamic skip connection which can learn to directly connect two dependent words. Since there is no dependency information in the training data we propose a novel reinforcement learning-based method to model the dependency relationship and connect dependent words. The proposed model computes the recurrent transition functions based on the skip connections which provides a dynamic skipping advantage over RNNs that always tackle entire sentences sequentially. Our experimental results on three natural language processing tasks demonstrate that the proposed method can achieve better performance than existing methods. In the number prediction experiment the proposed model outperformed LSTM with respect to accuracy by nearly 20%.
近年来,长期短期记忆(LSTM)已成功用于建模可变长度的顺序数据。但是,LSTM在捕获长期依赖性方面仍然会遇到困难。在这项工作中,我们试图通过引入动态跳过连接来缓解此问题,该动态跳过连接可以学习直接连接两个从属词。由于训练数据中没有依赖信息,因此我们提出了一种新的基于强化学习的方法来对依赖关系进行建模并连接依赖词。所提出的模型基于跳过连接来计算递归转换函数,这相对于始终顺序处理整个句子的RNN具有动态跳过的优势。我们在三个自然语言处理任务上的实验结果表明,该方法可以比现有方法获得更好的性能。在数字预测实验中,所提出的模型在准确性方面优于LSTM约20%。
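A minimal sketch of the dynamic-skip idea: at each time step a small policy scores the recent hidden states, and the selected one (rather than only the immediately preceding state) feeds the recurrent transition. The lstm_cell and policy callables, the greedy argmax choice, and the unbounded history are simplifying assumptions; in the paper the selection is trained with reinforcement learning.

    import numpy as np

    def dynamic_skip_step(lstm_cell, policy, x_t, history):
        # history: list of earlier (hidden, cell) pairs; the policy picks which
        # past state the current recurrent transition connects to.
        scores = policy(x_t, history)          # one score per candidate past state
        k = int(np.argmax(scores))             # greedy selection for illustration
        h_prev, c_prev = history[k]
        h_t, c_t = lstm_cell(x_t, (h_prev, c_prev))
        history.append((h_t, c_t))
        return h_t, c_t, k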

Human-Like Delicate Region Erasing Strategy for Weakly Supervised Detection
弱监督检测的类似人的精细区域擦除策略
Qing En@Beijing University of TechnologyLijuan Duan@Beijing University of TechnologyZhaoxiang Zhang@Chinese Academy of SciencesXiang Bai@Huazhong University of Science and TechnologyYundong Zhang@Vimicro Corporation
恩恩@北京工业大学段丽娟@北京工业大学张兆祥@中国科学院白象翔@华中科技大学张云东@中星微电子股份有限公司
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We explore a principle method to address the weakly supervised detection problem. Many deep learning methods solve weakly supervised detection by mining various object proposal or pooling strategies which may cause redundancy and generate a coarse location. To overcome this limitation we propose a novel human-like active searching strategy that recurrently ignores the background and discovers class-specific objects by erasing undesired pixels from the image. The proposed detector acts as an agent providing guidance to erase unremarkable regions and eventually concentrating the attention on the foreground. The proposed agents which are composed of a deep Q-network and are trained by the Q-learning algorithm analyze the contents of the image features to infer the localization action according to the learned policy. To the best of our knowledge this is the first attempt to apply reinforcement learning to address weakly supervised localization with only image-level labels. Consequently the proposed method is validated on the PASCAL VOC 2007 and PASCAL VOC 2012 datasets. The experimental results show that the proposed method is capable of locating a single object within 5 steps and has great significance to the research on weakly supervised localization with a human-like mechanism.
我们探索一种解决弱监督检测问题的原理方法。许多深度学习方法通​​过挖掘各种可能导致冗余并生成粗糙位置的对象建议或合并策略来解决弱监督检测问题。为克服此限制,我们提出了一种新颖的类似于人的主动搜索策略,该策略经常忽略背景,并通过从图像中删除不需要的像素来发现特定于类别的对象。提出的检测器充当提供指导以擦除不明显区域并最终将注意力集中在前景上的媒介。提出的由深层Q网络组成并由Q学习算法训练的智能体,根据学习到的策略,分析图像特征的内容,以推断出定位动作。据我们所知,这是首次应用强化学习来解决仅图像级标签的弱监督定位问题。因此,该建议方法在PASCAL VOC 2007和PASCAL VOC 2012数据集上得到了验证。实验结果表明,该方法能够在5步之内定位单个物体,对研究类人机制的弱监督定位具有重要意义。

Read Watch and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
阅读观看和移动:增强学习以使视频中的自然语言描述暂时扎根
Dongliang He@Baidu Inc.Xiang Zhao@Baidu Inc.Jizhou Huang@Baidu Inc.Fu Li@Baidu Inc.Xiao Liu@Baidu Inc.Shilei Wen@Baidu Research
何东亮@百度公司赵翔@百度公司Jizhou Huang @百度公司Fu Li @百度Inc.Liu Liu @百度公司Shilei Wen @百度研究
AAAI Technical Track: Vision
AAAI技术轨道:愿景
The task of video grounding which temporally localizes a natural language description in a video plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a presegmented video which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically we propose a reinforcement learning based framework improved by multi-task learning and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or less clips per video.
在视频中暂时定位自然语言描述的视频接地任务在理解视频中起着重要作用。现有研究已采用在整个视频上滑动窗口或在预定视频中穷举所有可能的短句对的策略,这些视频不可避免地遭受穷举枚举。为了缓解此问题,我们通过学习基于其策略逐步调节时间接地边界的代理,将该任务表述为顺序决策问题。具体来说,我们提出了一种基于强化学习的框架,该框架经过多任务学习的改进,通过在训练过程中考虑其他受监督的边界信息,显示出稳定的性能提升。我们提出的框架在ActivityNet'18 DenseCaption数据集(Krishna等人2017)和Charades-STA数据集(Sigurdsson等人2016; Gao等人2017)上实现了最先进的性能,同时仅观察了10个或更少的剪辑每个视频。

Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains
自主派生子目标以加速稀疏奖励域中的学习
Michael Dann@RMIT UniversityFabio Zambetta@RMIT UniversityJohn Thangarajah@RMIT University
迈克尔·丹恩(Michael Dann)@ RMIT大学(Fabio Zambetta)@ RMIT大学约翰Thangarajah @ RMIT大学
AAAI Technical Track: Applications
AAAI技术专题:应用
Sparse reward games such as the infamous Montezuma’s Revenge pose a significant challenge for Reinforcement Learning (RL) agents. Hierarchical RL which promotes efficient exploration via subgoals has shown promise in these games. However existing agents rely either on human domain knowledge or slow autonomous methods to derive suitable subgoals. In this work we describe a new autonomous approach for deriving subgoals from raw pixels that is more efficient than competing methods. We propose a novel intrinsic reward scheme for exploiting the derived subgoals applying it to three Atari games with sparse rewards. Our agent’s performance is comparable to that of state-of-the-art methods demonstrating the usefulness of the subgoals found.
诸如臭名昭著的蒙特祖玛的复仇之类的稀疏奖励游戏对强化学习(RL)特工构成了重大挑战。通过子目标促进有效探索的Hierarchical RL在这些游戏中显示出了希望。但是,现有代理依赖于人类领域知识或缓慢的自治方法来得出合适的子目标。在这项工作中,我们描述了一种新的自主方法,用于从原始像素中导出子目标,该方法比竞争方法更有效。我们提出了一种新颖的内在奖励方案,用于利用派生的子目标将其应用于稀疏奖励的三个Atari游戏。我们的代理人的表现与最先进的方法相当,这证明了所发现的子目标的有用性。

Improving Optimization Bounds Using Machine Learning: Decision Diagrams Meet Deep Reinforcement Learning
使用机器学习改善优化界限:决策图满足深度强化学习
Quentin Cappart@Ecole Polytechnique de MontréalEmmanuel Goutierre@Ecole PolytechniqueDavid Bergman@University of ConnecticutLouis-Martin Rousseau@Ecole Polytechnique de Montréal
Quentin Cappart @蒙特利尔理工学院Emmanuel Goutierre @理工学院大卫·伯格曼@康涅狄格大学路易斯·马丁·卢梭@蒙特利尔理工学院
AAAI Technical Track: Constraint Satisfaction and Optimization
AAAI技术专栏:约束满足与优化
Finding tight bounds on the optimal solution is a critical element of practical solution methods for discrete optimization problems. In the last decade decision diagrams (DDs) have brought a new perspective on obtaining upper and lower bounds that can be significantly better than classical bounding mechanisms such as linear relaxations. It is well known that the quality of the bounds achieved through this flexible bounding method is highly reliant on the ordering of variables chosen for building the diagram and finding an ordering that optimizes standard metrics is an NP-hard problem. In this paper we propose an innovative and generic approach based on deep reinforcement learning for obtaining an ordering for tightening the bounds obtained with relaxed and restricted DDs. We apply the approach to both the Maximum Independent Set Problem and the Maximum Cut Problem. Experimental results on synthetic instances show that the deep reinforcement learning approach by achieving tighter objective function bounds generally outperforms ordering methods commonly used in the literature when the distribution of instances is known. To the best knowledge of the authors this is the first paper to apply machine learning to directly improve relaxation bounds obtained by general-purpose bounding mechanisms for combinatorial optimization problems.
在最佳解决方案上找到严格的界限是离散优化问题的实用解决方案方法的关键要素。在过去的十年中,决策图(DD)为获取上限和下限带来了新的视角,该上限和下限可能比经典的边界机制(如线性松弛)好得多。众所周知,通过这种灵活的边界方法获得的边界质量高度依赖于为构建图表而选择的变量的排序,而找到优化标准度量的排序则是一个难题。在本文中,我们提出了一种基于深度强化学习的创新且通用的方法,该方法用于获得有序的条件,以收紧使用松弛和受限DD所获得的边界。我们将方法应用于最大独立集问题和最大割集问题。对合成实例的实验结果表明,当实例分布已知时,通过实现更严格的目标函数边界的深度强化学习方法通​​常会优于文献中常用的排序方法。就作者所知,这是第一篇应用机器学习直接改善通过通用边界机制获得的组合优化问题的松弛边界的论文。

Virtual-Taobao: Virtualizing Real-World Online Retail Environment for Reinforcement Learning
虚拟淘宝:虚拟现实世界在线零售环境以进行强化学习
Jing-Cheng Shi@Nanjing UniversityYang Yu@Nanjing UniversityQing Da@Alibaba GroupShi-Yong Chen@Alibaba GroupAn-Xiang Zeng@Alibaba Group
史静成@南京大学杨宇@南京大学青达@阿里巴巴集团陈世勇@阿里巴巴集团曾安祥@阿里巴巴集团
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample the large number of trials required by current reinforcement learning methods in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and, meanwhile, a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our environment-building approach: we build Virtual-Taobao, a simulator learned from historical customer behavior data, and then we train policies in Virtual-Taobao with no physical sampling costs. To improve the simulation precision, we propose GAN-SD (GAN for Simulating Distributions) for customer feature generation with better matched distribution; we propose MAIL (Multiagent Adversarial Imitation Learning) for generating better generalizable customer actions. To further avoid overfitting to the imperfection of the simulator, we propose an ANC (Action Norm Constraint) strategy to regularize the policy model. In experiments, Virtual-Taobao is trained from hundreds of millions of real Taobao customers’ records. Compared with the real Taobao, Virtual-Taobao faithfully recovers important properties of the real environment. We further show that the policies trained purely in Virtual-Taobao, which has zero physical sampling cost, can have significantly superior real-world performance to the traditional supervised approaches through online A/B tests. We hope this work may shed some light on applying reinforcement learning in complex physical environments.
在物理世界的任务中应用强化学习极具挑战性。根据当前在物理环境中的强化学习方法的要求,通常无法对大量试验进行抽样。本文报告了我们的项目,该项目是在最大的在线零售平台之一的淘宝上使用强化学习进行更好的商品搜索,同时也是一个抽样成本较高的物理环境。我们没有直接在淘宝上训练强化学习,而是介绍了我们的环境构建方法:我们构建虚拟淘宝,它是从历史客户行为数据中学到的模拟器,然后我们在虚拟淘宝中训练策略,而无需实际抽样成本。为了提高仿真精度,我们建议使用GAN-SD(用于仿真分布的GAN)来生成具有更好匹配分布的客户特征。我们建议使用MAIL(多代理对抗模拟学习)来生成更好的可推广客户行为。为了进一步避免过度拟合模拟器的缺陷,我们提出了ANC(行动规范约束)策略来规范化策略模型。在实验中,虚拟淘宝是从数亿真实淘宝客户的记录中训练出来的。与真实的淘宝相比,虚拟的淘宝忠实地恢复了真实环境的重要属性。我们进一步表明,通过在线A / B测试,仅在虚拟淘宝网中训练的,具有零物理采样成本的策略可以具有比传统的受监督方法更好的真实世界性能。我们希望这项工作可以为在复杂的物理环境中应用强化学习提供一些启发。

Classification with Costly Features Using Deep Reinforcement Learning
使用深度强化学习进行具有昂贵功能的分类
Jaromír Janisch@Czech Technical University in PragueTomáš Pevný@Czech Technical University in PragueViliam Lisý@Czech Technical University in Prague
JaromírJanisch @布拉格捷克技术大学TomášPevný@布拉格捷克技术大学ViliamLisý@布拉格捷克技术大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
We study a classification problem where each feature can be acquired for a cost, and the goal is to optimize a trade-off between the expected classification error and the feature cost. We revisit a former approach that framed the problem as a sequential decision-making problem and solved it by Q-learning with a linear approximation, where individual actions either request feature values or terminate the episode by providing a classification decision. On a set of eight problems, we demonstrate that by replacing the linear approximation with neural networks the approach becomes comparable to the state-of-the-art algorithms developed specifically for this problem. The approach is flexible, as it can be improved with any new reinforcement learning enhancement; it allows inclusion of a pre-trained high-performance classifier; and unlike prior art, its performance is robust across all evaluated datasets.
我们研究了一个分类问题,其中每个特征都可以以成本获得,并且目标是优化预期分类误差与特征成本之间的权衡。我们重新审视以前的方法,该方法将问题框架化为顺序决策问题,并通过线性近似的Q学习来解决它,其中单个动作要么是对特征值的请求,要么通过提供分类决策来终止事件。在一组八个问题上,我们证明了通过用神经网络代替线性逼近,该方法可以与专门针对此问题开发的最新算法相媲美。该方法是灵活的,因为可以通过任何新的强化学习增强对其进行改进,它允许包含预训练的高性能分类器,并且与现有技术不同,其性能在所有评估的数据集上都非常可靠。
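To make the sequential formulation concrete, one episode can be modeled roughly as follows: actions 0..d-1 purchase the corresponding feature (paying its cost, scaled by a trade-off weight), and the remaining actions terminate the episode with a class prediction. The environment class, the observation encoding (masked features plus mask), and the trade_off parameter are illustrative assumptions, not the paper's exact reward design.

    import numpy as np

    class CostlyFeatureEpisode:
        def __init__(self, x, y, costs, num_classes, trade_off=0.01):
            self.x, self.y = x, y                      # one example and its label
            self.costs, self.trade_off = costs, trade_off
            self.num_classes = num_classes
            self.mask = np.zeros(len(x), dtype=bool)   # features acquired so far

        def step(self, action):
            d = len(self.x)
            if action < d:                             # acquire feature `action`
                self.mask[action] = True
                obs = np.concatenate([self.x * self.mask, self.mask.astype(float)])
                return obs, -self.trade_off * self.costs[action], False
            prediction = action - d                    # classify and terminate
            return None, 1.0 if prediction == self.y else 0.0, True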

Improving Image Captioning with Conditional Generative Adversarial Nets
使用条件生成对抗网络改善图像字幕
Chen Chen@TencentShuai Mu@TencentWanpeng Xiao@TencentZexiong Ye@TencentLiesi Wu@TencentQi Ju@Tencent
陈晨@ TencentShuai Mu @ TencentWanpeng Xiao @ TencentZexiong Ye @ TencentLiesi Wu @ TencentQi Ju @腾讯
AAAI Technical Track: Vision
AAAI技术轨道:愿景
In this paper we propose a novel conditional-generative-adversarial-nets-based image captioning framework as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture. To deal with the inconsistent evaluation problem among different objective language metrics, we are motivated to design some “discriminator” networks to automatically and progressively determine whether a generated caption is human described or machine generated. Two kinds of discriminator architectures (CNN- and RNN-based structures) are introduced, since each has its own advantages. The proposed algorithm is generic, so that it can enhance any existing RL-based image captioning framework, and we show that the conventional RL training method is just a special case of our approach. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. In addition, the well-trained discriminators can also be viewed as objective image captioning evaluators.
在本文中,我们提出了一种新颖的基于条件生成对抗网络的图像字幕框架,作为基于传统增强学习(RL)的编码器-解码器体系结构的扩展。为了解决不同目标语言指标之间不一致的评估问题,我们鼓励设计一些“区分”网络,以自动,逐步确定生成的字幕是人为描述的还是机器生成的。介绍了两种区分器架构(基于CNN和RNN的结构),因为每种都有其自身的优势。所提出的算法是通用的,因此它可以增强任何现有的基于RL的图像字幕框架,并且我们证明了常规的RL训练方法只是我们方法的一种特殊情况。根据经验,我们针对不同的最新图像字幕模型显示了所有语言评估指标的持续改进。此外,训练有素的鉴别器也可以视为客观的图像字幕评估器。

ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search
ACE:使用树搜索进行连续控制的演员合奏算法
Shangtong Zhang@University of AlbertaHengshuai Yao@Huawei Technologies
张尚通@阿尔伯塔大学姚恒帅@华为技术
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In this paper we propose an actor ensemble algorithm named ACE for continuous control with a deterministic policy in reinforcement learning. In ACE we use actor ensemble (i.e. multiple actors) to search the global maxima of the critic. Besides the ensemble perspective we also formulate ACE in the option framework by extending the option-critic architecture with deterministic intra-option policies revealing a relationship between ensemble and options. Furthermore we perform a look-ahead tree search with those actors and a learned value prediction model resulting in a refined value estimation. We demonstrate a significant performance boost of ACE over DDPG and its variants in challenging physical robot simulators.
在本文中,我们提出了一种名为ACE的角色集合算法,用于在强化学习中采用确定性策略进行连续控制。在ACE中,我们使用演员合奏(即多个演员)来搜索评论者的全局最大值。除了整体视角之外,我们还通过使用确定性的内部选项政策扩展选项批评体系来揭示ACE与选项之间的关系,从而在选项框架中制定ACE。此外,我们对那些参与者和学习值预测模型执行了前瞻性树搜索,从而得到了精确的值估计。在具有挑战性的物理机器人模拟器中,我们展示了ACE优于DDPG及其变体的显着性能提升。
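The ensemble part of the action selection can be pictured as: each deterministic actor proposes an action, the critic scores every proposal, and the proposal with the highest estimated value is executed. The callables below are placeholders; the paper additionally refines the estimate with a look-ahead tree search and a learned value prediction model, which are omitted here.

    import numpy as np

    def actor_ensemble_act(actors, critic, state):
        proposals = [actor(state) for actor in actors]      # one action per actor
        q_values = [critic(state, a) for a in proposals]    # critic scores each proposal
        return proposals[int(np.argmax(q_values))]          # execute the best-valued one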

What and Where the Themes Dominate in Image
主题在图像中占主导地位和位置
Xinyu Xiao@National Laboratory of Pattern RecognitionLingfeng Wang@Chinese Academy of SciencesShiming Xiang@Chinese Academy of SciencesChunhong Pan@Chinese Academy of Sciences
肖新宇@模式识别国家实验室王灵峰@中国科学院向世明@中国科学院潘春红@中国科学院
AAAI Technical Track: Vision
AAAI技术轨道:愿景
Image captioning aims to describe an image in natural language as a human would; it has benefited from the advances in deep neural networks and achieved substantial progress in performance. However, the perspective of how humans describe a scene has not been fully considered in this task recently. In fact, the human description of a scene is tightly related to endogenous knowledge and exogenous salient objects simultaneously, which implies that the content of the description is confined to the known salient objects. Inspired by this observation, this paper proposes a novel framework which explicitly applies the known salient objects in image captioning. Under this framework, the known salient objects serve as themes to guide the description generation. According to the property of the known salient object, a theme is composed of two components: its endogenous concept (what) and the exogenous spatial attention feature (where). Specifically, the prediction of each word is dominated by the concept and spatial attention feature of the corresponding theme in the process of caption prediction. Moreover, we introduce a novel learning method, Distinctive Learning (DL), to get more specificity in the generated captions, as in human descriptions. It formulates two constraints in the theme learning process to encourage distinctiveness between different images. In particular, reinforcement learning is introduced into the framework to address the exposure bias problem between the training and the testing modes. Extensive experiments on the COCO and Flickr30K datasets achieve superior results when compared with the state-of-the-art methods.
图像标题是用人类的自然语言描述图像,该图像得益于深度神经网络的进步并在性能上取得了实质性进展。但是,最近在此任务中尚未充分考虑人对场景的描述的角度。实际上,对场景的人类描述同时与内生知识和外来显着对象紧密相关,这意味着描述中的内容仅限于已知的显着对象。受此观察启发,本文提出了一种新颖的框架,该框架可将已知的显着对象明确应用于图像字幕。在此框架下,已知的显着对象用作主题以指导描述的生成。根据已知显着对象的属性,主题由两个部分组成:其内源性概念(what)和外源性空间关注特征(where)。具体来说,在字幕预测过程中,每个单词的预测受相应主题的概念和空间注意特征支配。此外,我们引入了一种新颖的独特学习(DL)学习方法,以使生成的字幕(如人类描述)更具特异性。它在主题学习过程中提出了两个约束条件,以鼓励不同图像之间的独特性。框架中特别引入了强化学习,以解决培训和测试模式之间的曝光偏差问题。与最先进的方法相比,在COCO和Flickr30K数据集上进行的广泛实验获得了出色的结果。

Diversity-Driven Extensible Hierarchical Reinforcement Learning
多样性驱动的可扩展层次强化学习
Yuhang Song@University of OxfordJianyi Wang@Beihang UniversityThomas Lukasiewicz@University of OxfordZhenghua Xu@Hebei University of TechnologyMai Xu@Beihang University
宋宇航@牛津大学王建一@北京航空航天大学托马斯·卢卡西维奇@牛津大学徐正华@河北工业大学徐迈@北京航空航天大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Hierarchical reinforcement learning (HRL) has recently shown promising advances on speeding up learning, improving exploration, and discovering inter-task transferable skills. Most recent works focus on HRL with two levels, i.e., a master policy manipulates subpolicies, which in turn manipulate primitive actions. However, HRL with multiple levels is usually needed in many real-world scenarios, whose ultimate goals are highly abstract while their actions are very primitive. Therefore, in this paper, we propose a diversity-driven extensible HRL (DEHRL), where an extensible and scalable framework is built and learned levelwise to realize HRL with multiple levels. DEHRL follows a popular assumption: diverse subpolicies are useful, i.e., subpolicies are believed to be more useful if they are more diverse. However, existing implementations of this diversity assumption usually have their own drawbacks, which makes them inapplicable to HRL with multiple levels. Consequently, we further propose a novel diversity-driven solution to achieve this assumption in DEHRL. Experimental studies evaluate DEHRL with nine baselines from four perspectives in two domains; the results show that DEHRL outperforms the state-of-the-art baselines in all four aspects.
分层强化学习(HRL)最近在加速学习,改进探索和发现任务间可转移技能方面显示出令人鼓舞的进展。最新作品集中于两个层次的HRL,即主策略操纵子策略,而子策略又操纵原始动作。但是,在许多实际场景中,通常需要使用多个级别的HRL,这些场景的最终目标是高度抽象的,而其动作却非常原始。因此,在本文中,我们提出了多样性驱动的可扩展HRL(DEHRL),其中构建了可扩展和可扩展的框架,并逐级学习以实现多层次的HRL。 DEHRL遵循一个普遍的假设:多样的子政策是有用的,即人们认为,子政策越多样化,其作用就越大。然而,这种多样性假设的现有实现通常具有其自身的缺点,这使得它们不适用于具有多个级别的HRL。因此,我们进一步提出了一种新颖的多样性驱动解决方案,以实现DEHRL中的这一假设。实验研究从两个领域的四个角度用九个基线评估了DEHRL。结果表明,DEHRL在所有四个方面均优于最新的基准。

Deep Reinforcement Learning for Syntactic Error Repair in Student Programs
用于学生程序中语法错误修复的深度强化学习
Rahul Gupta@Indian Institute of ScienceAditya Kanade@Indian Institute of ScienceShirish Shevade@Indian Institute of Science
拉胡尔·古普塔(Rahul Gupta)@印度科学研究所阿迪亚·卡纳德(Aditya Kanade)@印度科学研究所瑟里什·谢瓦德(Shirish Shevade)@印度科学研究所
AAAI Technical Track: Applications
AAAI技术专题:应用
Novice programmers often struggle with the formal syntax of programming languages. In the traditional classroom setting, they can make progress with the help of real-time feedback from their instructors, which is often impossible to get in the massive open online course (MOOC) setting. Syntactic error repair techniques have huge potential to assist them at scale. Towards this, we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We demonstrate that the agent can be trained through self-exploration directly from the raw input, that is, the program text itself, without either supervision or any prior knowledge of the formal syntax of the programming language. We evaluate our technique on a publicly available dataset containing 6975 erroneous C programs with typographic errors written by students during an introductory programming course. Our technique fixes 1699 (24.4%) programs completely and 1310 (18.8%) programs partially, outperforming DeepFix, a state-of-the-art syntactic error repair technique which uses a fully supervised neural machine translation approach.
新手程序员经常会为编程语言的形式语法而苦恼。在传统的教室环境中,他们可以借助教师的实时反馈来取得进步,而在大规模开放式在线课程(MOOC)环境中,这通常是不可能的。语法错误修复技术具有巨大的潜力,可以大规模地帮助他们。为此,我们设计了一种适合强化学习的新颖的编程语言校正框架。该框架允许代理模仿人的行为来进行文本导航和编辑。我们证明了可以直接从原始输入(即程序文本本身)通过自我探索来训练代理,而无需监督或没有任何编程语言形式语法的先验知识。我们在一个公开的数据集上评估我们的技术,该数据集包含6975个错误的C程序,这些程序由学生在入门编程课程中编写的印刷错误。我们的技术可完全修复1699(24.4%)个程序,而1310(18.8%)程序则部分优于DeepFix,后者是一种使用完全监督的神经机器翻译方法的最新语法错误修复技术。

No-Reference Image Quality Assessment with Reinforcement Recursive List-Wise Ranking
增强递归列表明智排序的无参考图像质量评估
Jie Gu@Chinese Academy of SciencesGaofeng Meng@Chinese Academy of SciencesCheng Da@Chinese Academy of SciencesShiming Xiang@Chinese Academy of SciencesChunhong Pan@Chinese Academy of Sciences
顾洁@中国科学院孟高峰@中国科学院成达@中国科学院向世明@中国科学院潘春红@中国科学院
AAAI Technical Track: Vision
AAAI技术轨道:愿景
Opinion-unaware no-reference image quality assessment (NR-IQA) methods have received many interests recently because they do not require images with subjective scores for training. Unfortunately it is a challenging task and thus far no opinion-unaware methods have shown consistently better performance than the opinion-aware ones. In this paper we propose an effective opinion-unaware NR-IQA method based on reinforcement recursive list-wise ranking. We formulate the NR-IQA as a recursive list-wise ranking problem which aims to optimize the whole quality ordering directly. During training the recursive ranking process can be modeled as a Markov decision process (MDP). The ranking list of images can be constructed by taking a sequence of actions and each of them refers to selecting an image for a specific position of the ranking list. Reinforcement learning is adopted to train the model parameters in which no ground-truth quality scores or ranking lists are necessary for learning. Experimental results demonstrate the superior performance of our approach compared with existing opinion-unaware NR-IQA methods. Furthermore our approach can compete with the most effective opinion-aware methods. It improves the state-of-the-art by over 2% on the CSIQ benchmark and outperforms most compared opinion-aware models on TID2013.
无意见无参考图像质量评估(NR-IQA)方法最近受到了很多关注,因为它们不需要具有主观评分的图像来进行训练。不幸的是,这是一项具有挑战性的任务,因此到目前为止,没有意见感知的方法始终表现出比意见感知的方法更好的性能。在本文中,我们提出了一种基于增强递归列表式排序的有效的无意识的NR-IQA方法。我们将NR-IQA公式化为递归列表式排名问题,旨在直接优化整体质量排序。在训练期间,可以将递归排名过程建模为马尔可夫决策过程(MDP)。图像的排名列表可以通过采取一系列动作来构造,并且每个动作都指的是针对排名列表的特定位置选择图像。采用强化学习来训练模型参数,在该模型参数中,学习不需要地面质量分数或排名列表。实验结果表明,与现有的无意识的NR-IQA方法相比,我们的方法具有更高的性能。此外,我们的方法可以与最有效的意见意识方法竞争。它使CSIQ基准上的最新技术水平提高了2%以上,并且胜过TID2013上大多数可比较的舆论感知模型。
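The recursive list-wise ranking can be read as a simple loop over an MDP: the state is the partial ranking plus the remaining pool, and each action appends one more image to the list. The policy callable and the loop below are illustrative; in the paper the policy is trained with reinforcement learning from ranking rewards rather than from subjective scores.

    def build_ranking(images, policy):
        remaining, ranking = list(images), []
        while remaining:                       # one action per position in the list
            idx = policy(ranking, remaining)   # choose the image for the next slot
            ranking.append(remaining.pop(idx))
        return ranking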

IPOMDP-Net: A Deep Neural Network for Partially Observable Multi-Agent Planning Using Interactive POMDPs
IPOMDP-Net:使用交互式POMDP进行部分可观察的多代理规划的深层神经网络
Yanlin Han@University of Illinois at ChicagoPiotr Gmytrasiewicz@University of Illinois at Chicago
韩燕林@芝加哥伊利诺伊大学彼得·米特拉西维奇@芝加哥伊利诺伊大学
AAAI Technical Track: Multiagent Systems
AAAI技术专题:多代理系统
This paper introduces the IPOMDP-net, a neural network architecture for multi-agent planning under partial observability. It embeds an interactive partially observable Markov decision process (I-POMDP) model and a QMDP planning algorithm that solves the model in a neural network architecture. The IPOMDP-net is fully differentiable and allows for end-to-end training. In the learning phase, we train an IPOMDP-net on various fixed and randomly generated environments in a reinforcement learning setting, assuming observable reinforcements and unknown (randomly initialized) model functions. In the planning phase, we test the trained network on new unseen variants of the environments under the planning setting, using the trained model to plan without reinforcements. Empirical results show that our model-based IPOMDP-net outperforms the other state-of-the-art model-free network and generalizes better to larger unseen environments. Our approach provides a general neural computing architecture for multi-agent planning using I-POMDPs. It suggests that in a multi-agent setting, having a model of other agents benefits our decision-making, resulting in a policy of higher quality and better generalizability.
本文介绍了IPOMDP-net神经网络体系结构,用于在部分可观察性下进行多主体规划。它嵌入了交互式的部分可观察的马尔可夫决策过程(I-POMDP)模型和QMDP规划算法,该算法在神经网络体系结构中解决了该模型。 IPOMDP网络是完全可区分的,并允许端到端培训。在学习阶段,我们在加固学习设置中,在各种固定和随机生成的环境中训练IPOMDP网络,并假设可观察到的加固和未知(随机初始化)的模型函数。在计划阶段,我们使用训练好的模型在没有增强的情况下进行计划的测试,以在计划设置下的新的看不见的环境变体上测试训练后的网络。经验结果表明,我们基于模型的IPOMDP网络优于其他最新的无模型网络,并且可以更好地推广到更大的看不见的环境。我们的方法为使用I-POMDP的多主体规划提供了一种通用的神经计算架构。它表明,在具有其他代理模型的多代理环境中,我们的决策会受益,从而导致更高质量和更好的可推广性的策略。
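For reference, the QMDP approximation that the network embeds has the standard form below: action values of the fully observable model are averaged under the current belief. In the I-POMDP case the belief ranges over interactive states (the physical state together with models of the other agents), but the shape of the approximation is the same; the notation is the generic one, not the paper's.

    Q(b, a) \;\approx\; \sum_{s} b(s)\, Q_{\mathrm{MDP}}(s, a),
    \qquad
    \pi(b) \;=\; \arg\max_{a} Q(b, a).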

Policy Optimization with Model-Based Explorations
基于模型的探索的策略优化
Feiyang Pan@Chinese Academy of SciencesQingpeng Cai@Tsinghua UniversityAn-Xiang Zeng@AlibabaChun-Xiang Pan@Alibaba GroupQing Da@Alibaba GroupHualin He@Alibaba GroupQing He@Chinese Academy of SciencesPingzhong Tang@Tsinghua University
潘飞阳@中国科学院蔡庆鹏@清华大学曾安祥@阿里巴巴潘春湘@阿里巴巴集团Qing Da@阿里巴巴集团Hualin He@阿里巴巴集团Qing He@中国科学院Pingzhong Tang@清华大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Model-free reinforcement learning methods such as the Proximal Policy Optimization algorithm (PPO) have been successfully applied in complex decision-making problems such as Atari games. However, these methods suffer from high variance and high sample complexity. On the other hand, model-based reinforcement learning methods that learn the transition dynamics are more sample efficient, but they often suffer from the bias of the transition estimation. How to make use of both model-based and model-free learning is a central problem in reinforcement learning.
诸如近端策略优化算法(PPO)之类的无模型强化学习方法已成功应用于诸如Atari游戏之类的复杂决策问题。然而,这些方法具有高方差和高样本复杂性的缺点。另一方面,学习过渡动态的基于模型的强化学习方法的样本效率更高,但它们经常遭受过渡估计的偏差。如何同时利用基于模型的学习和无模型的学习是强化学习中的核心问题。

A Deep Reinforcement Learning Based Multi-Step Coarse to Fine Question Answering (MSCQA) System
基于深度强化学习的多步粗到细答(MSCQA)系统
Yu Wang@Samsung Research AmericaHongxia Jin@Samsung Research America
王瑜@三星研究美国Hongxia Jin@三星研究美国
AAAI Technical Track: Natural Language Processing
AAAI技术专栏:自然语言处理
In this paper we present a multi-step coarse to fine question answering (MSCQA) system which can efficiently process documents with different lengths by choosing appropriate actions. The system is designed using an actor-critic based deep reinforcement learning model to achieve multi-step question answering. Compared to previous QA models targeting datasets mainly containing either short or long documents, our multi-step coarse to fine model takes the merits from multiple system modules, which can handle both short and long documents. The system hence obtains a much better accuracy and faster training speed compared to the current state-of-the-art models. We test our model on four QA datasets, WIKIREADING, WIKIREADING LONG, CNN, and SQuAD, and demonstrate 1.3%-1.7% accuracy improvements with 1.5x-3.4x training speed-ups in comparison to the baselines using state-of-the-art models.
在本文中,我们提出了一种多步粗略到精细的问答系统(MSCQA),该系统可以通过选择适当的操作来有效地处理不同长度的文档。该系统使用基于行为者批评的深度强化学习模型进行设计,以实现多步问题解答。与以前针对主要包含短文档或长文档的数据集的质量保证模型相比,我们的多步粗到精模型采用了多个系统模块的优点,这些模块可以处理短文档和长文档。因此,与当前最先进的模型相比,该系统具有更高的准确性和更快的训练速度。我们在四个QA数据集WIKEREADING WIKIREADING LONG CNN和SQuAD上测试了我们的模型,并证明与最新技术相比,与基线相比,训练速度提高了1.5%-3.4x,训练精度提高了1.3%-1.7%。

Self-Supervised Mixture-of-Experts by Uncertainty Estimation
通过不确定性估计进行自我监督的专家混合物
Zhuobin Zheng@Tsinghua UniversityChun Yuan@Tsinghua UniversityXinrui Zhu@Tsinghua UniversityZhihui Lin@Tsinghua UniversityYangyang Cheng@Tsinghua UniversityCheng Shi@Tsinghua UniversityJiahui Ye@Tsinghua University
郑卓斌@清华大学春媛@清华大学朱新瑞@清华大学林志辉@清华大学程阳阳@清华大学程史@清华大学叶嘉辉@清华大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
Learning related tasks in various domains and transferring exploited knowledge to new situations is a significant challenge in Reinforcement Learning (RL). However, most RL algorithms are data inefficient and fail to generalize in complex environments, limiting their adaptability and applicability in multi-task scenarios. In this paper we propose Self-Supervised Mixture-of-Experts (SUM), an effective algorithm driven by predictive uncertainty estimation for multi-task RL. SUM utilizes a multi-head agent with shared parameters as experts to learn a series of related tasks simultaneously by Deep Deterministic Policy Gradient (DDPG). Each expert is extended by predictive uncertainty estimation on known and unknown states to enhance the Q-value evaluation capacity against overfitting and the overall generalization ability. These enable the agent to capture and diffuse common knowledge across different tasks, improving sample efficiency in each task and the effectiveness of expert scheduling across multiple tasks. Instead of the task-specific design of common MoEs, a self-supervised gating network is adopted to determine a potential expert to handle each interaction from unseen environments, calibrated completely by the uncertainty feedback from the experts without explicit supervision. To alleviate imbalanced expert utilization, the crux of MoE optimization, decayed-masked experience replay is adopted, which encourages both diversification and specialization of experts during different periods. We demonstrate that our approach learns faster and achieves better performance by efficient transfer and robust generalization, outperforming several related methods on extended OpenAI Gym MuJoCo multi-task environments.
在强化学习(RL)中,学习各个领域的相关任务并将所利用的知识转移到新情况中是一项重大挑战。但是,大多数RL算法数据效率低下,无法在复杂的环境中推广,从而限制了它们在多任务场景中的适应性和适用性。在本文中,我们提出了一种自我监督专家混合物(SUM),它是一种由预测不确定性估计驱动的多任务RL的有效算法。 SUM利用具有共享参数的多头代理作为专家,通过深度确定性策略梯度(DDPG)同时学习一系列相关任务。通过对已知和未知状态的预测不确定性估计来扩展每个专家,以增强针对过度拟合的Q值评估能力和总体泛化能力。这些使代理能够捕获和分散不同任务之间的常识,从而提高每个任务中的样本效率以及跨多个任务进行专家计划的效率。代替一般的MoE特定于任务的设计,而是采用自我监督的选通网络来确定潜在的专家来处理来自看不见的环境中的每个交互,并通过专家的不确定性反馈进行完全校准,而无需明确的监督。为了减少专家利用的不平衡,因为MoE优化的关键是通过隐蔽的经验重播来完成的,这鼓励了不同时期专家的多元化和专业化。我们证明,通过在扩展的OpenAI Gym的MuJoCo多任务环境中进行有效的传输和强大的概括,我们的方法可以更快地学习并达到更好的性能。

Hierarchical Reinforcement Learning for Course Recommendation in MOOCs
用于MOOC的课程推荐的分层强化学习
Jing Zhang@Renmin University of ChinaBowen Hao@Renmin University of ChinaBo Chen@Renmin University of ChinaCuiping Li@Renmin University of ChinaHong Chen@Renmin University of ChinaJimeng Sun@Georgia Institute of Technology
张静@中国人民大学郝博文@中国人民大学陈博@中国人民大学李翠萍@中国人民大学陈洪@中国人民大学Jimeng Sun@乔治亚理工学院
AAAI Technical Track: AI and the Web
AAAI技术专题:AI和Web
The proliferation of massive open online courses (MOOCs) demands an effective way of personalized course recommendation. The recent attention-based recommendation models can distinguish the effects of different historical courses when recommending different target courses. However when a user has interests in many different courses the attention mechanism will perform poorly as the effects of the contributing courses are diluted by diverse historical courses. To address such a challenge we propose a hierarchical reinforcement learning algorithm to revise the user profiles and tune the course recommendation model on the revised profiles.
大规模开放在线课程(MOOC)的激增要求个性化课程推荐的有效方法。最近的基于注意力的推荐模型可以在推荐不同的目标课程时区分不同历史课程的效果。但是,当用户对许多不同的课程感兴趣时,注意力机制将表现不佳,因为贡献的课程的效果被各种历史课程所削弱。为了解决这一挑战,我们提出了一种分层强化学习算法,以修改用户个人资料并在修改后的个人资料上调整课程推荐模型。

TAPAS: Train-Less Accuracy Predictor for Architecture Search
TAPAS:用于架构搜索的火车精度预测器
R. Istrate@IBM Research - ZurichF. Scheidegger@IBM Research - ZurichG. Mariani@IBM Research - ZurichD. Nikolopoulos@Queens University of BelfastC. Bekas@IBM ResearchA. C. I. Malossi@IBM Research - Zurich
R. Istrate@IBM研究-苏黎世F. Scheidegger@IBM研究-苏黎世G. Mariani@IBM研究-苏黎世D. Nikolopoulos@贝尔法斯特女王大学C. Bekas@IBM研究A. C. I. Malossi@IBM研究-苏黎世
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In recent years an increasing number of researchers and practitioners have been suggesting algorithms for large-scale neural network architecture search: genetic algorithms, reinforcement learning, learning curve extrapolation, and accuracy predictors. None of them, however, demonstrated high performance without training new experiments in the presence of unseen datasets. We propose a new deep neural network accuracy predictor that estimates, in fractions of a second, classification performance for unseen input datasets, without training. In contrast to previously proposed approaches, our prediction is not only calibrated on the topological network information but also on the characterization of the dataset difficulty, which allows us to re-tune the prediction without any training. Our predictor achieves a performance which exceeds 100 networks per second on a single GPU, thus creating the opportunity to perform large-scale architecture search within a few minutes. We present results of two searches performed in 400 seconds on a single GPU. Our best discovered networks reach 93.67% accuracy for CIFAR-10 and 81.01% for CIFAR-100, verified by training. These networks are performance competitive with other automatically discovered state-of-the-art networks; however, we only needed a small fraction of the time to solution and computational resources.
近年来,越来越多的研究人员和从业人员提出了用于大规模神经网络体系结构搜索的算法:遗传算法增强了学习曲线的外推性和准确性预测器。但是,如果没有在看不见的数据集的情况下进行新的实验,则它们都无法表现出高性能。我们提出了一种新的深度神经网络精度预测器,无需训练即可对看不见的输入数据集进行第二分类性能的分数估算。与先前提出的方法相反,我们的预测不仅根据拓扑网络信息进行校准,而且根据数据集的困难度进行表征,这使我们无需进行任何培训即可重新调整预测。我们的预测器在单个GPU上的性能超过每秒100个网络,从而为在几分钟之内执行大规模架构搜索提供了机会。我们展示了在单个GPU上在400秒内执行的两次搜索的结果。经过培训验证,我们发现的最佳网络对于CIFAR-10的准确性达到93.67%,对于CIFAR-100的准确性达到81.01%。这些网络与其他自动发现的最新网络在性能上具有竞争力,但是我们只需要一小部分时间来解决和计算资源。

Dynamic Vehicle Traffic Control Using Deep Reinforcement Learning in Automated Material Handling System
自动化物料处理系统中基于深度强化学习的动态车辆交通控制
Younkook Kang@Seoul National UniversitySungwon Lyu@Seoul National UniversityJeeyung Kim@Seoul National UniversityBongjoon Park@Seoul National UniversitySungzoon Cho@Seoul National University
Younkook Kang@首尔国立大学Sungwon Lyu@首尔国立大学Jeeyung Kim@首尔国立大学Bongjoon Park@首尔国立大学Sungzoon Cho@首尔国立大学
Student Abstracts
学生文摘
In automated material handling systems (AMHS) delivery time is an important issue directly associated with the production cost and the quality of the product. In this paper we propose a dynamic routing strategy to shorten delivery time and delay. We set the target of control by analyzing traffic flows and selecting the region with the highest flow rate and congestion frequency. Then we impose a routing cost in order to dynamically reflect the real-time changes of traffic states. Our deep reinforcement learning model consists of a Q-learning step and a recurrent neural network through which traffic states and action values are predicted. Experiment results show that the proposed method decreases manufacturing costs while increasing productivity. Additionally we find evidence the reinforcement learning structure proposed in this study can autonomously and dynamically adjust to the changes in traffic patterns.
在自动化物料搬运系统(AMHS)中,交货时间是与生产成本和产品质量直接相关的重要问题。在本文中,我们提出了一种动态路由策略,以缩短交付时间和延迟。我们通过分析交通流量并选择流量和拥堵频率最高的区域来设置控制目标。然后,我们施加了路由成本,以便动态反映流量状态的实时变化。我们的深度强化学习模型包括一个Q学习步骤和一个递归神经网络,通过该网络可以预测交通状态和行动值。实验结果表明,该方法在降低生产成本的同时提高了生产率。此外,我们发现有证据表明本研究中提出的强化学习结构可以自动动态地适应交通模式的变化。

Theory of Minds: Understanding Behavior in Groups through Inverse Planning
心智理论:通过逆向计划了解群体行为
Michael Shum@Massachusetts Institute of TechnologyMax Kleiman-Weiner@Massachusetts Institute of TechnologyMichael L. Littman@Brown UniversityJoshua B. Tenenbaum@Massachusetts Institute of Technology
Michael Shum @麻省理工学院Max Kleiman-Weiner @麻省理工学院Michael L. Littman @布朗大学Joshua B. Tenenbaum @麻省理工学院
AAAI Technical Track: Multiagent Systems
AAAI技术专题:多代理系统
Human social behavior is structured by relationships. We form teams groups tribes and alliances at all scales of human life. These structures guide multi-agent cooperation and competition but when we observe others these underlying relationships are typically unobservable and hence must be inferred. Humans make these inferences intuitively and flexibly often making rapid generalizations about the latent relationships that underlie behavior from just sparse and noisy observations. Rapid and accurate inferences are important for determining who to cooperate with who to compete with and how to cooperate in order to compete. Towards the goal of building machine-learning algorithms with human-like social intelligence we develop a generative model of multiagent action understanding based on a novel representation for these latent relationships called Composable Team Hierarchies (CTH). This representation is grounded in the formalism of stochastic games and multi-agent reinforcement learning. We use CTH as a target for Bayesian inference yielding a new algorithm for understanding behavior in groups that can both infer hidden relationships as well as predict future actions for multiple agents interacting together. Our algorithm rapidly recovers an underlying causal model of how agents relate in spatial stochastic games from just a few observations. The patterns of inference made by this algorithm closely correspond with human judgments and the algorithm makes the same rapid generalizations that people do.
人类的社会行为是由关系构成的。我们组成了人类各个层面的团队,部落部落和联盟。这些结构指导多主体的合作与竞争,但是当我们观察其他结构时,这些潜在的关系通常是不可观察的,因此必须进行推断。人类凭直觉和灵活地做出这些推论,通常是对潜在关系的快速概括,这些潜在关系仅来自稀疏和嘈杂的观察而构成了行为。快速而准确的推论对于确定谁与谁合作以及与谁竞争以及如何进行竞争至关重要。为了建立具有类似于人类的社会智能的机器学习算法的目标,我们基于这种潜在关系的新颖表示形式(可组合团队层次结构(CTH)),开发了多主体动作理解的生成模型。这种表示基于随机游戏和多主体强化学习的形式主义。我们将CTH用作贝叶斯推理的目标,从而产生了一种用于理解组中行为的新算法,该算法既可以推断隐藏的关系,也可以预测多个代理交互在一起的未来动作。我们的算法仅通过少量观察就可以快速恢复潜在的因果模型,以了解代理在空间随机博弈中的关系。该算法做出的推理模式与人类的判断紧密相关,并且该算法与人们所做的快速概括相同。

Successor Features Based Multi-Agent RL for Event-Based Decentralized MDPs
基于后继功能的多事件RL,用于基于事件的分散MDP
Tarun Gupta@Indian Institute of Technology HyderabadAkshat Kumar@Singapore Management UniversityPraveen Paruchuri@Indian Institute of Technology Hyderabad
Tarun Gupta @印度海得拉巴技术学院Akshat Kumar @新加坡管理大学Praveen Paruchuri @印度海得拉巴技术学院
AAAI Technical Track: Multiagent Systems
AAAI技术专题:多代理系统
Decentralized MDPs (Dec-MDPs) provide a rigorous framework for collaborative multi-agent sequential decisionmaking under uncertainty. However their computational complexity limits the practical impact. To address this we focus on a class of Dec-MDPs consisting of independent collaborating agents that are tied together through a global reward function that depends upon their entire histories of states and actions to accomplish joint tasks. To overcome scalability barrier our main contributions are: (a) We propose a new actor-critic based Reinforcement Learning (RL) approach for event-based Dec-MDPs using successor features (SF) which is a value function representation that decouples the dynamics of the environment from the rewards; (b) We then present Dec-ESR (Decentralized Event based Successor Representation) which generalizes learning for event-based Dec-MDPs using SF within an end-to-end deep RL framework; (c) We also show that Dec-ESR allows useful transfer of information on related but different tasks hence bootstraps the learning for faster convergence on new tasks; (d) For validation purposes we test our approach on a large multi-agent coverage problem which models schedule coordination of agents in a real urban subway network and achieves better quality solutions than previous best approaches.
分散式MDP(Dec-MDP)为不确定性下的协作多主体顺序决策提供了严格的框架。然而,它们的计算复杂度限制了实际影响。为了解决这个问题,我们重点关注由独立协作代理组成的一类Dec-MDP,这些协作代理通过全局奖励函数绑定在一起,全局奖励函数取决于其整个状态历史和完成联合任务的行动。为了克服可扩展性障碍,我们的主要贡献是:(a)我们提出了一种新的基于行为者批评的强化学习(RL)方法,用于使用后继功能(SF)的基于事件的Dec-MDP,这是一种价值函数表示形式,可将奖励带来的环境; (b)然后,我们介绍Dec-ESR(基于分散事件的后继表示),它概括了在端到端的深度RL框架中使用SF学习基于事件的Dec-MDP; (c)我们还表明,Dec-ESR允许对相关但不同的任务进行有用的信息传递,因此引导学习,以便更快地集中于新任务; (d)为了进行验证,我们在一个大型的多代理覆盖问题上测试了我们的方法,该问题对实际城市地铁网络中的代理调度协调进行建模,并且比以前的最佳方法获得更好的质量解决方案。
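The successor-feature decomposition this approach builds on can be written in its standard form: if one-step rewards are (approximately) linear in features φ with weights w, then the expected discounted sum of features ψ captures the dynamics, and the action value factorizes as their inner product. This is the generic SF identity; the event-based, history-dependent global reward of the paper adds structure on top of it.

    r(s, a, s') \approx \phi(s, a, s')^{\top} w,
    \qquad
    \psi^{\pi}(s, a) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\, a_0 = a\Big],
    \qquad
    Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top} w.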

Combined Reinforcement Learning via Abstract Representations
通过抽象表示进行组合强化学习
Vincent Francois-Lavet@McGill UniversityYoshua Bengio@Universite de MontrealDoina Precup@McGill UniversityJoelle Pineau@McGill Unversity
文森特·弗朗索瓦·拉维特(Vincent Francois-Lavet)@麦吉尔大学约书亚·本吉欧(Yoshua Bengio)@蒙特利尔大学Doina Precup@麦吉尔大学Joelle Pineau@麦吉尔大学
AAAI Technical Track: Machine Learning
AAAI技术专题:机器学习
In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper, we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration, and transfer learning.
在寻求高效且鲁棒的强化学习方法时,无模型方法和基于模型的方法都各有优势。在本文中,我们提出了一种新的方式,通过共享的、学习得到的低维环境编码来显式地桥接这两种方法,该编码旨在捕获概括性的抽象。我们证明了这种方法带来的模块化既能实现良好的泛化能力,又因为规划发生在较小的潜在状态空间中而具有很高的计算效率。此外,这种方法还可以恢复对环境的充分低维表示,从而为可解释AI、探索和迁移学习开辟了新的策略。
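A minimal PyTorch-style sketch of the idea of sharing a low-dimensional learned encoding between a model-free head and a model-based head. Module names, sizes, and the residual transition form are assumptions for illustration, not the authors' implementation.

```python
# Sketch: one shared encoder feeds both a Q-head (model-free) and a learned
# transition/reward model (model-based) over a small abstract state.
import torch
import torch.nn as nn

class AbstractStateAgent(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=3, n_actions=4):
        super().__init__()
        # Shared encoder: observation -> abstract (latent) state
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Model-free branch: Q-values from the abstract state
        self.q_head = nn.Linear(latent_dim, n_actions)
        # Model-based branch: predicted next abstract state and reward
        self.transition = nn.Linear(latent_dim + n_actions, latent_dim)
        self.reward = nn.Linear(latent_dim + n_actions, 1)

    def forward(self, obs, action_onehot):
        z = self.encoder(obs)
        q = self.q_head(z)                        # model-free control signal
        za = torch.cat([z, action_onehot], dim=-1)
        z_next_pred = z + self.transition(za)     # latent-space dynamics for planning
        r_pred = self.reward(za)
        return q, z_next_pred, r_pred

agent = AbstractStateAgent()
q, z_next, r = agent(torch.randn(1, 64), torch.eye(4)[[2]])
# Short planning rollouts can then be performed entirely in the small latent
# space, while the Q-head provides the model-free estimate.
```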

Deep Reinforcement Learning for Green Security Games with Real-Time Information
具有实时信息的绿色安全游戏的深度强化学习
Yufei Wang@Peking UniversityZheyuan Ryan Shi@Carnegie Mellon UniversityLantao Yu@Stanford UniversityYi Wu@University of California BerkeleyRohit Singh@World Wide Fund for NatureLucas Joppa@Microsoft ResearchFei Fang@Carnegie Mellon University
王宇飞@北京大学Zheyuan Ryan Shi@卡内基梅隆大学Lantao Yu@斯坦福大学Yi Wu@加州大学伯克利分校Rohit Singh@世界自然基金会Lucas Joppa@微软研究院Fei Fang@卡内基梅隆大学
AAAI Technical Track: Computational Sustainability
AAAI技术专栏:计算可持续性
Green Security Games (GSGs) have been proposed and applied to optimize patrols conducted by law enforcement agencies in green security domains such as combating poaching, illegal logging, and overfishing. However, real-time information such as footprints, and agents' subsequent actions upon receiving the information, e.g., rangers following the footprints to chase the poacher, have been neglected in previous work. To fill the gap, we first propose a new game model, GSG-I, which augments GSGs with sequential movement and the vital element of real-time information. Second, we design a novel deep reinforcement learning-based algorithm, DeDOL, to compute a patrolling strategy that adapts to the real-time information against a best-responding attacker. DeDOL is built upon the double oracle framework and the policy-space response oracle, solving a restricted game and iteratively adding best response strategies to it through training deep Q-networks. Exploring the game structure, DeDOL uses domain-specific heuristic strategies as initial strategies and constructs several local modes for efficient and parallelized training. To our knowledge, this is the first attempt to use Deep Q-Learning for security games.
已经提出了绿色安全游戏(GSG),并将其用于优化执法机构在绿色安全领域(例如打击偷猎、非法伐木和过度捕捞)中的巡逻。但是,实时信息(例如足迹)以及智能体收到信息后的后续行动(例如巡护员沿着足迹追捕偷猎者)在以前的工作中被忽略了。为了填补这一空白,我们首先提出了一种新的游戏模型GSG-I,该模型通过顺序移动和实时信息这一关键要素来增强GSG。其次,我们设计了一种新颖的基于深度强化学习的算法DeDOL来计算巡逻策略,该策略能够针对采取最佳响应的攻击者,根据实时信息进行自适应调整。DeDOL建立在双重预言机框架和策略空间响应预言机之上,求解一个受限游戏,并通过训练深度Q网络向其迭代地添加最佳响应策略。通过探索游戏结构,DeDOL使用特定于领域的启发式策略作为初始策略,并构建了若干局部模式以进行高效且并行化的训练。据我们所知,这是将深度Q学习用于安全游戏的首次尝试。
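A minimal double-oracle skeleton on a known zero-sum matrix game, to illustrate the framework DeDOL builds on. In DeDOL the strategy space cannot be enumerated and each best response is obtained by training a deep Q-network; here the payoff matrix is given and best responses are exact argmaxes, purely for illustration.

```python
# Double oracle on a small zero-sum matrix game (illustrative stand-in for
# DeDOL's restricted-game / best-response loop).
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Row player's maximin mixture for a zero-sum matrix game via an LP."""
    m, n = payoff.shape
    c = np.concatenate([np.zeros(m), [-1.0]])        # variables: x (m), v; minimize -v
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])   # v - x^T A[:, j] <= 0 for each column j
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res.x[:m], res.x[m]

def double_oracle(full_payoff, iters=20):
    rows, cols = [0], [0]                            # restricted strategy sets
    for _ in range(iters):
        sub = full_payoff[np.ix_(rows, cols)]
        x, _ = solve_zero_sum(sub)                   # defender mixture on restricted game
        y, _ = solve_zero_sum(-sub.T)                # attacker mixture on restricted game
        # Best responses against the opponent's current mixture, over the full game.
        br_row = int(np.argmax(full_payoff[:, cols] @ y))
        br_col = int(np.argmin(x @ full_payoff[rows, :]))
        if br_row in rows and br_col in cols:
            return x, y, rows, cols                  # no improving strategy: done
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return x, y, rows, cols

payoffs = np.array([[3., -1., 2.], [0., 2., -2.], [1., 1., 0.]])
print(double_oracle(payoffs))
```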

A Model-Free Affective Reinforcement Learning Approach to Personalization of an Autonomous Social Robot Companion for Early Literacy Education
一种无模型的情感强化学习方法,用于面向早期读写教育的自主社交机器人伙伴的个性化
Hae Won Park@Massachusetts Institute of TechnologyIshaan Grover@Massachusetts Institute of TechnologySamuel Spaulding@Massachusetts Institute of TechnologyLouis Gomez@Wichita State UniversityCynthia Breazeal@Massachusetts Institute of Technology
Hae Won Park@麻省理工学院伊桑·格罗弗@麻省理工学院塞缪尔·斯波丁@麻省理工学院路易斯·戈麦斯@威奇托州立大学辛西娅·布雷阿泽尔@麻省理工学院
AAAI Technical Track: AI for Social Impact
AAAI技术专栏:增强社会影响力的AI
Personalized education technologies capable of delivering adaptive interventions could play an important role in addressing the needs of diverse young learners at a critical time of school readiness. We present an innovative personalized social robot learning companion system that utilizes children's verbal and nonverbal affective cues to modulate their engagement and maximize their long-term learning gains. We propose an affective reinforcement learning approach to train a personalized policy for each student during an educational activity where a child and a robot tell stories to each other. Using the personalized policy, the robot selects stories that are optimized for each child's engagement and linguistic skill progression. We recruited 67 bilingual and English language learners between the ages of 4-6 years old to participate in a between-subjects study to evaluate our system. Over a three-month deployment in schools, a unique storytelling policy was trained to deliver a personalized story curriculum for each child in the Personalized group. We compared their engagement and learning outcomes to a Non-personalized group with a fixed-curriculum robot and to a baseline group that had no robot intervention. In the Personalization condition, our results show that the affective policy successfully personalized to each child, boosting their engagement and outcomes: compared to children in the other groups, they learned and retained more target words and used more target syntax structures.
能够提供适应性干预措施的个性化教育技术,可以在入学准备的关键时期,为满足多样化的年轻学习者的需求发挥重要作用。我们提出了一种创新的个性化社交机器人学习伴侣系统,该系统利用儿童的言语和非言语情感线索来调节他们的参与度并最大限度地提高他们的长期学习收益。我们提出了一种情感强化学习方法,以在孩子和机器人互相讲故事的教育活动中为每个学生训练个性化的策略。机器人使用个性化策略选择最适合每个孩子的参与度和语言技能发展的故事。我们招募了67位4至6岁之间的双语和英语学习者,参加一项被试间研究以评估我们的系统。在学校进行的三个月部署中,为"个性化"组中的每个孩子训练了一个独特的讲故事策略,以提供个性化的故事课程。我们将他们的参与度和学习成果与使用固定课程机器人的非个性化组以及没有机器人干预的基线组进行了比较。在"个性化"条件下,我们的结果表明,情感策略成功地针对每个孩子进行了个性化,与其他组的孩子相比,提高了他们的参与度和学习成果:他们学习并记住了更多目标词汇,并使用了更多目标句法结构。
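A heavily simplified, hypothetical sketch of an affect-aware personalization policy: a linear Q-function over a child's affective/engagement features chooses the next story level and is updated from an engagement-plus-learning reward. The feature names, reward definition, and bandit-style update are assumptions for illustration, not the system described in the paper.

```python
# Hypothetical affect-aware personalization policy (epsilon-greedy, linear Q).
import numpy as np

rng = np.random.default_rng(0)
n_features = 4            # assumed features: [valence, arousal, attention, skill_estimate]
story_levels = [0, 1, 2]  # easy / medium / hard (illustrative)
theta = np.zeros((len(story_levels), n_features))  # per-action linear weights
alpha, epsilon = 0.05, 0.1

def choose_story(affect_features):
    """Epsilon-greedy choice of the next story level."""
    if rng.random() < epsilon:
        return int(rng.integers(len(story_levels)))
    return int(np.argmax(theta @ affect_features))

def update(affect_features, action, reward):
    """One-step (bandit-style) update toward the observed reward."""
    prediction = theta[action] @ affect_features
    theta[action] += alpha * (reward - prediction) * affect_features

# One interaction: observe affect, pick a story, observe engagement + learning gain.
x = np.array([0.6, 0.4, 0.8, 0.3])
a = choose_story(x)
update(x, a, reward=0.7)
```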
 
