Explainability in deep reinforcement learning

Alexandre Heuillet, Fabien Couthouis, Natalia Díaz-Rodríguez
https://doi.org/10.1016/j.knosys.2020.106685

Highlights


We review concepts related to the explainability of Deep Reinforcement Learning models.

We provide a comprehensive analysis of the Explainable Reinforcement Learning literature.

We propose a categorization of existing Explainable Reinforcement Learning methods.

We discuss ideas emerging from the literature and provide insights for future work.

Abstract

A large body of the explainable Artificial Intelligence (XAI) literature is emerging on feature relevance techniques to explain a deep neural network (DNN) output or on explaining models that ingest image source data. However, assessing how XAI techniques can help understand models beyond classification tasks, e.g. for reinforcement learning (RL), has not been extensively studied. We review recent works aiming to attain Explainable Reinforcement Learning (XRL), a relatively new subfield of Explainable Artificial Intelligence, intended for use in general public applications with diverse audiences, which require ethical, responsible and trustable algorithms. In critical situations where it is essential to justify and explain the agent's behaviour, better explainability and interpretability of RL models could help gain scientific insight into the inner workings of what is still considered a black box. We mainly evaluate studies directly linking explainability to RL, and split these into two categories according to the way the explanations are generated: transparent algorithms and post-hoc explainability. We also review the most prominent XAI works through the lens of how they could potentially enlighten the further deployment of the latest advances in RL in the demanding present and future of everyday problems.


Keywords

Reinforcement learning; Explainable artificial intelligence; Machine learning; Deep learning; Responsible artificial intelligence; Representation learning

1. Introduction

During the past decade, Artificial Intelligence (AI), and by extension Machine Learning (ML), have seen an unprecedented rise in both industry and research. The progressive improvement of computer hardware, combined with the need to process ever larger amounts of data, has cast these once underestimated techniques in a new light. Reinforcement Learning (RL) focuses on learning how to map situations to actions in order to maximize a numerical reward signal [1]. The learner is not told which actions to take, but instead must discover which actions are the most rewarding by trying them. Reinforcement learning addresses the problem of how agents should learn a policy that takes actions to maximize the cumulative reward through interaction with the environment [2].

Recent progress in Deep Learning (DL) for learning feature representations has significantly impacted RL, and the combination of both methods (known as deep RL) has led to remarkable results in many areas. Typically, RL is used to solve optimization problems when the system has a very large number of states and a complex stochastic structure. Notable examples include training agents to play Atari games from raw pixels [3], [4], board games [5], [6], complex real-world robotics problems such as manipulation [7] or grasping [8], and other real-world applications such as resource management in computer clusters [9], network traffic signal control [10], chemical reaction optimization [11] or recommendation systems [12].

The success of deep RL could augur its imminent arrival in the industrial world. However, like many Machine Learning algorithms, RL algorithms suffer from a lack of explainability. This defect can be highly crippling, as many promising RL applications (defence, finance, medicine, etc.) need a model that can explain its decisions and actions to human users [14] as a condition for their full acceptance by society. Furthermore, deep RL models are complex to debug for developers, as they rely on many factors: the environment (in particular the design of the reward function), the encoding of observations, large DL models and the algorithm used to train the policy. Thus, an explainable model could help fix problems more quickly and drastically speed up new developments in RL methods. These last two points are the main arguments in favour of the necessity of explainable reinforcement learning (XRL).

Table 1. Target audience in XAI. This table shows the different objectives of explainability in Machine Learning models for different audience profiles.

Target audience | Description | Explainability purposes | Pursued goals
Experts | Domain experts, model users (e.g. medical doctors, insurance agents) | Trust the model itself, gain scientific knowledge | Trustworthiness, causality, transferability, informativeness, confidence, interactivity
Users | Users affected by model decisions | Understand their situation, verify fair decisions | Trustworthiness, informativeness, fairness, accessibility, interactivity, privacy awareness
Developers | Developers, researchers, data scientists, product owners… | Ensure and improve product efficiency, research, new functionalities… | Transferability, informativeness, confidence
Executives | Managers, executive board members… | Assess regulatory compliance, understand corporate AI applications… | Causality, informativeness, confidence
Regulation | Regulatory entities/agencies | Certify model compliance with the legislation in force, audits, … | Causality, informativeness, confidence, fairness, privacy awareness
Inspired by the diagram presented in Barredo Arrieta et al. [13].

While explainability is starting to be well developed for standard ML models and neural networks [15], [16], [17], the particular domain of RL still has many intricacies to be better understood: both in terms of its functioning and in terms of conveying the decisions of an RL model to different audiences. The difficulty lies in the very recent human-level performance of deep RL algorithms and in their complexity, as they are normally parameterized with thousands if not millions of parameters [18]. The present work intends to provide a non-exhaustive state-of-the-art review of explainable reinforcement learning, highlighting the main methods that we envision as most promising. In the following, we briefly recall some important concepts in XAI.

1.1. Explainable AI: Audience
Explaining a Machine Learning model may involve different goals: trustworthiness, causality, transferability, informativeness, fairness, confidence, accessibility, interactivity and privacy awareness. These goals have to be taken into account while explaining a model, because the expected type of explanation may differ depending on the pursued objective. For example, a saliency map explaining what is recognized as a dog in an input image does not tell us much about privacy awareness. In addition, each goal may be a dimension of interest, but only for a certain audience (the public to whom the explanations will be addressed). Indeed, the transferability of a model can be significant for a developer, who can save time by training only one model for different tasks, while the user will not be impacted by, or even aware of, this aspect.

The understandability of an ML model therefore depends on its transparency (its capacity to be understandable by itself) but also on human understanding. Given these considerations, it is essential to take into account the concept of audience, as the intelligibility and comprehensibility of a model depend on the goals and cognitive skills of its users. Barredo Arrieta et al. [13] discuss these aspects in additional detail (see Table 1).

1.2. Evaluating explanations
The broad concept of evaluation is based on metrics aiming to compare how well one technique performs compared to another. In the case of model explainability, metrics should evaluate how well a model fits the definition of explainable and how well it performs in a certain aspect of explainability.

Explanation evaluation in XAI has proven to be quite a challenging task. First, because the concept of explainability in Machine Learning is not well or uniformly accepted by the community: there is no clear definition and thus no clear consensus on which metrics to use. Second, because an explanation is relative to a specific audience, which is sometimes difficult to deal with (in particular when this specific audience is composed of domain experts who can be hard to involve in a testing phase). Third, because the quality of an explanation is always qualitative and subjective, since it depends on the audience, the pursued goal and even on human variability, as two people can have a different level of understanding of the same explanation. That is why user studies are so popular for evaluating explanations: they make it possible to convert qualitative evaluations into quantitative ones by asking questions about the accuracy and clarity of the explanation, such as "Does this explanation allow you to understand why the model predicted that this image is a dog? Did the context help the model?", etc. Generally in XAI, there is only a single model to explain at a time; however, it is more complicated in XRL, as we generally want to explain a policy, or "Why did the agent take action x in state s?".
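To make the last point concrete, the following minimal sketch (with purely illustrative questions and answers, not data from any study) shows how Likert-scale responses collected in a user study can be aggregated into quantitative per-question scores:

```python
# Illustrative aggregation of 1-5 Likert-scale answers from an XAI user study
# into per-question mean scores and standard deviations.
from statistics import mean, stdev

responses = {
    "The explanation lets me understand why the model predicted 'dog'": [4, 5, 3, 4, 5],
    "The context helped the model make its prediction": [2, 3, 3, 2, 4],
}

for question, answers in responses.items():
    print(f"{question}: mean={mean(answers):.2f}, "
          f"std={stdev(answers):.2f}, n={len(answers)}")
```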

Doshi-Velez et al. [19] attempt to formulate some approaches to evaluate XAI methods. The authors introduce three main levels to evaluate the quality of the explanations provided by an XAI method, as summarized in Table 2.

A common example of evaluation of an application-level or human-level task is to evaluate the quality of the mental model built by the user after seeing the explanation(s). Mental models can be described as internal representations, built upon experience, which allow one to mentally simulate how something works in the real world. Hoffman et al. [20] propose to evaluate mental models by (1) asking post-task questions about the behaviour of the agent (such as "How does it work?" or "What does it achieve?") and (2) asking the participants to predict the agent's next action. These evaluations are often done using Likert scales.

Table 2. Three main levels for evaluating the explanations provided by an XAI method.

Level of evaluation | Type of task | Required humans | Modus operandi
Application level | Real task | Domain expert | Put the explanation into the product and have it tested by the end user.
Human level | Simplified task | Layperson | Carry out the application-level experiments with laypersons, as it makes the experiment cheaper and it is easier to find more testers.
Function level | Proxy task | No human required | Use a proxy to evaluate the explanation quality. Works best when the model class used has already been evaluated by someone else at human level. For instance, a proxy for decision trees can be the depth of the tree.
Inspired by the explanations provided in Doshi-Velez et al. [19].

1.3. Organization of this work
In this survey, we first introduce XAI and its main challenges in Section 1. We then review the recent literature in XAI applied to reinforcement learning in Section 2. In Section 3, we discuss the different approaches employed in the literature. Finally, we conclude in Section 4 with some directions for future research.

The key contributions of this paper are as follows:


A recent state-of-the-art review of explainability in the latest Reinforcement Learning models.

A broad categorization of explainable RL (XRL) methods.

Discussion and future work recommendations.

We hope that this work will give more visibility to existing XRL methods, while helping develop new ideas in this field, accounting for the different audiences involved.

2. XAI in RL: State of the art and reviewed literature

We reviewed the state of the art on XRL and summarized it in Table 3. This table presents, for each paper, the task(s) for which an explanation is provided, the employed RL algorithms (a glossary of these algorithms can be found in Appendix A), and the type of explanations provided, i.e., based on images, diagrams (graphical components such as bar charts, plots or graphs), or text. We also present the level of the provided explanation (local if it explains only predictions, global if it explains the whole model), and the audience concerned by the explanation, as discussed in Section 1.1.

In Table 3 we summarized the literature focusing on explainable fundamental RL algorithms. However, we also reviewed articles about state-of-the-art XAI techniques that can be used in the context of current RL, which we did not include in Table 3. Next, we describe the main ideas provided by these papers, which can help bring explainability to RL. It is possible to classify all recent studies into two main categories, transparent methods and post-hoc explainability, according to the XAI taxonomies in Barredo Arrieta et al. [13]. On the one hand, inherently transparent algorithms include by definition every algorithm which is understandable by itself, such as decision trees. On the other hand, post-hoc explainability includes all methods that provide explanations of an RL algorithm after its training, such as SHAP (SHapley Additive exPlanations) [16] or LIME [15] for standard ML models. Reviewed papers are referenced by type of explanation in Fig. 1.

Table 3. Summary of reviewed literature on explainable RL (XRL) and deep RL (DRL).

Reference | Task/Environment | Decision process | Algorithm(s) | Explanation type (Level) | Target
Relational Deep RL [21] | Planning + strategy games (Box-World / Starcraft II) | POMDP | IMPALA | Images (Local) | Experts
Symbolic RL with Common Sense [22] | Game (object retrieval) | POMDP | SRL+CS, DQL | Images (Global) | Experts
Decoupling feature extraction from policy learning [23] | Robotics (grasping) and navigation | MDP | PPO | Diagrams (state plot & image slider) (Local) | Experts
Explainable RL via Reward Decomposition [24] | Game (grid and landing) | MDP | HRA, SARSA, Q-learning | Diagrams (Local) | Experts, Users, Executives
Explainable RL Through a Causal Lens [25] | Games (OpenAI benchmark and Starcraft II) | Both | PG, DQN, DDPG, A2C, SARSA | Diagrams, Text (Local) | Experts, Users, Executives
Shapley Q-value: A Local Reward Approach to Solve Global Reward Games [26] | Multiagent (Cooperative Navigation, Prey-and-Predator and Traffic Junction) | POMDP | DDPG | Diagrams (Local) | Experts
Dot-to-Dot: Explainable HRL For Robotic Manipulation [27] | Robotics (grasping) | MDP | DDPG, HER, HRL | Diagrams (Global) | Experts, Developers
Self-Educated Language Agent With HER For Instruction Following [28] | Instruction following (MiniGrid) | MDP | Textual HER | Text (Local) | Experts, Users, Developers
Commonsense and Semantic-guided Navigation [29] | Room navigation | POMDP | – | Text (Global) | Experts
Boolean Task Algebra [30] | Game (grid) | MDP | DQN | Diagrams | Experts
Visualizing and Understanding Atari [31] | Games (Pong, Breakout, Space Invaders) | MDP | A3C | Images (Global) | Experts, Users, Developers
Interestingness Elements for XRL through Introspection [32], [33] | Arcade game (Frogger) | POMDP | Q-learning | Images (Local) | Users
Composable DRL for Robotic Manipulation [34] | Robotics (pushing and reaching) | MDP | Soft Q-learning | Diagrams (Local) | Experts
Symbolic-Based Recognition of Contact States for Learning Assembly Skills [35] | Robotic grasping | POMDP | HMM, PAA, K-means | Diagrams (Local) | Experts
Safe Reinforcement Learning with Model Uncertainty Estimates [36] | Collision avoidance | POMDP | Monte Carlo Dropout, bootstrapping | Diagrams (Local) | Experts

Fig. 1. Taxonomy of the reviewed literature identified for bringing explainability to RL models.

2.1. Transparent algorithms
Transparent algorithms are well known and used in standard Machine Learning (e.g., linear regression, decision trees or rule-based systems). Their strength lies in the fact that they are designed to have a transparent architecture that makes them explainable by themselves, without the need for any external processing. However, the situation is quite different for RL, as standard DRL algorithms (e.g., DQN, PPO, DDPG, A2C) are not transparent by nature. In addition, the large majority of studies related to transparency in XRL chose to build algorithms targeting only a specific task. Nonetheless, most of the time, and contrary to standard Machine Learning models, transparent RL algorithms can achieve state-of-the-art performance on these specific tasks [21], [22], [26].

2.1.1. Explanation through representation learning
Representation learning algorithms focus on learning abstract features that characterize data, in order to make it easier to extract useful information when building predictors [37], [38]. These learned features have the advantage of having low dimensionality, which generally improves training speed and generalization of Deep Learning models [23], [37], [39].

In the context of RL, learning representations of states, actions or policies can be useful to explain an RL algorithm, as these representations can give clues about the functioning of the algorithm. Indeed, State Representation Learning (SRL) [37] is a particular type of representation learning that aims at building a low-dimensional and meaningful representation of a state space by processing high-dimensional raw observation data (e.g., learning a position (x, y) from raw image pixels). This makes it possible to capture the variations in the environment influenced by the agent's actions and thus to extrapolate explanations. SRL can be especially useful in RL for robotics and control [23], [40], [41], [42], [43], and can help to understand how the agent interprets the observations and what is relevant to learn to act, i.e., actionable or controllable features [39]. Indeed, the dimensionality reduction induced by SRL, coupled with the link to control and the possible disentanglement of variation factors, could be highly beneficial to improve our capacity to understand the decisions made by RL algorithms using a state representation method [37]. For example, SRL can be used to split the state representation [23] according to the different training objectives to be optimized before learning a policy. This allows allocating room for encoding each necessary objective within the embedding state to be learned (in that case, reward prediction, a reconstruction objective and an inverse model). In this context, tools such as the S-RL Toolbox [40] allow sampling from the embedding state space (learned through SRL) to enable a visual interpretation of the model's internal state, pairing it with its associated input observation. Comprehensibility is thus enhanced, as one can more easily observe whether smoothness is preserved in the state space, as well as whether other invariants related to learning a specific control task are guaranteed.

Several approaches are employed for SRL: reconstructing the observations using autoencoders [44], [45], training a forward model to predict the next state [46], [47], teaching an inverse model how to predict actions from previous state(s) [47], [48], or using prior knowledge to constrain the state space [39], [49].
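As a rough illustration of how these objectives can be combined, the sketch below (a simplification in PyTorch, not the implementation of any reviewed work; all dimensions and module names are illustrative) trains a state encoder with a reconstruction loss, a forward-model loss and an inverse-model loss:

```python
import torch
import torch.nn as nn

class SRLModel(nn.Module):
    def __init__(self, obs_dim=64, state_dim=8, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
        self.decoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))
        # Forward model: predicts the next state embedding from (state, action).
        self.forward_model = nn.Linear(state_dim + n_actions, state_dim)
        # Inverse model: predicts the action from (state, next state).
        self.inverse_model = nn.Linear(2 * state_dim, n_actions)

    def losses(self, obs, next_obs, action_onehot, action_idx):
        s, s_next = self.encoder(obs), self.encoder(next_obs)
        recon = nn.functional.mse_loss(self.decoder(s), obs)          # reconstruction objective
        pred_next = self.forward_model(torch.cat([s, action_onehot], dim=-1))
        fwd = nn.functional.mse_loss(pred_next, s_next.detach())      # forward-model objective
        pred_action = self.inverse_model(torch.cat([s, s_next], dim=-1))
        inv = nn.functional.cross_entropy(pred_action, action_idx)    # inverse-model objective
        return recon + fwd + inv                                      # combined SRL loss

# Illustrative usage with random tensors standing in for observations/actions.
model = SRLModel()
actions = torch.randint(0, 4, (32,))
loss = model.losses(torch.randn(32, 64), torch.randn(32, 64),
                    nn.functional.one_hot(actions, 4).float(), actions)
```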

Along the same lines, learning disentangled representations [50], [51], [52], [53] is another interesting idea used for unsupervised learning, which decomposes (or disentangles) each feature into narrowly defined variables and encodes them as separate low-dimensional features (generally using a Variational Autoencoder [54]). It is also possible to make use of this concept, as well as lifelong learning, to learn more interpretable representations on unsupervised classification tasks. In addition, one could argue that learning through life would allow compacting and updating old knowledge with new knowledge while preventing catastrophic forgetting [55]. Thus, this is a key concept that could lead to more versatile RL agents, able to learn new tasks without forgetting the previous ones. Information Maximizing Generative Adversarial Networks (InfoGAN) [56] is another model based on the principles of learning disentangled representations. The noise vector used in traditional GANs is decomposed into two parts: (i) the incompressible noise z; and (ii) the latent code c, used to target the salient semantic features of the data distribution. The main idea is to feed both z and c to the generator G and to maximize the mutual information I(c; G(z, c)) between c and the generated sample, in order to ensure that the information contained in c is preserved during the generation process. As a result, the InfoGAN model is able to create an interpretable representation via the latent code c (i.e., values changing according to the shape and features of the input data).
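The following sketch illustrates this idea with placeholder networks (the architectures and dimensions are illustrative, not those of [56]): an auxiliary network Q tries to recover the categorical latent code c from the generated sample, and minimizing this prediction error maximizes a lower bound on the mutual information I(c; G(z, c)):

```python
import torch
import torch.nn as nn

z_dim, c_dim, x_dim = 16, 10, 784            # noise, categorical code, output sizes
G = nn.Sequential(nn.Linear(z_dim + c_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
Q = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, c_dim))  # predicts c

batch = 32
z = torch.randn(batch, z_dim)                         # incompressible noise
c_idx = torch.randint(0, c_dim, (batch,))             # sampled latent code
c = nn.functional.one_hot(c_idx, c_dim).float()

x_fake = G(torch.cat([z, c], dim=1))                  # generated sample
mi_loss = nn.functional.cross_entropy(Q(x_fake), c_idx)  # mutual-information lower-bound term
# In training, mi_loss is added (with a weight) to the usual GAN losses of G and D.
```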

Fig. 2. Visualization of attention weights for the Box-World task (environment created by authors of [21]), in which the agent has to open boxes to obtain either keys or rewards. (a) The underlying graph of one example level. (b) The result of the analysis for that level, using each entity (represented as coloured pixels) along the solution path (1–5) as the source of attention. Boxes are represented by two adjacent coloured pixels. On each box, the pixel on the right represents the box’s lock and its colour indicates which key can be used to open it. The pixel on the left indicates the content of the box which is inaccessible while the box is locked. Arrows point to the entities that the source is attending to. The arrow’s transparency is determined by the corresponding attention weight.

Reproduced with permission of Vinicius Zambaldi [21].
Some work has been done on learning representations by combining symbolic AI with deep RL in order to facilitate the use of background knowledge, exploit learnt knowledge, and improve generalization [57], [58], [59], [60]. Consequently, this also improves the explainability of the algorithms, while preserving state-of-the-art performance.

Zambaldi et al. [21] propose making use of Inductive Logic Programming and self-attention to represent states, actions and policies using first-order logic, with a mechanism similar to graph neural networks and, more generally, message-passing computations [61], [62], [63], [64]. In these kinds of models, entity–entity relations are explicitly computed when considering the messages passed between connected nodes of the graph, as shown in Fig. 2. Self-attention is used here as a method to compute interactions between these different entities (i.e. relevant pixels in an RGB image for the example from [21]), and thus perform non-local pairwise relational computations. This technique allows an expert to visualize the agent's attention weights associated with its available actions and thus improve the understanding of its strategy.

Another work that aims to incorporate common sense into the agent, in terms of a symbolic abstraction to represent the problem, is [22]. This method subdivides the world state representation into many sub-states, each with a degree of associated importance based on how far the object is from the agent. This helps understand the relevance of the actions taken by the agent by determining which sub-states were chosen.

Fig. 3. Left: reward decompositions for DQN. Right: Hybrid Reward Architecture (HRA) at cell (3,4) in Cliffworld. HRA predicts an extra "gold" reward for actions which do not lead to a terminal state.

Reproduced with permission of Zoe Juozapaitis [24].
2.1.2. Simultaneous learning of the explanation and the policy
While standard DRL algorithms struggle to provide explanations, they can be tweaked to learn both the policy and the explanation simultaneously. Thus, explanations become a learned component of the model. These methods are recommended for specific problems where it is possible to introduce knowledge, such as classifying rewards by type, adding relationships between states, etc. Thus, tweaking the algorithm to introduce some task knowledge and to learn explanations generally also improves performance. A general notion is that the knowledge gained from the auxiliary task objective must be useful for downstream tasks. In this direction, Juozapaitis et al. [24] introduced reward decomposition, whose main principle is to decompose the reward function into a sum of meaningful reward types. The authors used reward decomposition to improve performance on Cliffworld and Lunar Lander, where each action can be classified according to its type. This method consists of using a custom decomposed-reward DQN that defines a vector-valued reward function, where each component is the reward for a certain type, so that actions can be compared in terms of trade-offs among the types. In the same way, the Q-function is also vector-valued and each component gives action values that account for only one reward type. Summing the components of these vector-valued functions gives the overall Q-function or reward function. Learning multiple Q-functions, one for each type of reward, allows the model to learn the best policy while also learning the explanations (i.e. the type of reward that the agent wanted to maximize by its action, illustrated in Fig. 4). They introduce the concept of Reward Difference Explanation (RDX, in Fig. 3), which makes it possible to understand the reasons why an action has an advantage (or disadvantage) over another. They also define Minimal Sufficient Explanations (MSX, see Fig. 4), in order to help humans identify a small set of the most important reasons why the agent chooses specific actions over others. MSX+ and MSX− are the sets of critical positive and negative reasons (respectively) for the actions preferred by the agent.

While reward decompositions help to understand the agent's choice preferences between several actions, minimal sufficient explanations are used to help select the most important reward decompositions. Other works that facilitate the explainability of RL models by using reward-based losses for more interpretable RL include [47], [48], [65].
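A minimal sketch of these two notions, assuming the per-type Q-values of the decomposed-reward agent are already available (the reward types, values and greedy selection below are illustrative simplifications of the definitions in [24]):

```python
# Sketch of Reward Difference Explanations (RDX) and positive Minimal Sufficient
# Explanations (MSX+) computed from per-reward-type Q-values.
def rdx(q_decomposed, state, a1, a2):
    """Per-reward-type advantage of action a1 over a2 in the given state."""
    return {rtype: q[state][a1] - q[state][a2] for rtype, q in q_decomposed.items()}

def msx_plus(delta):
    """Smallest set of positive reasons whose sum outweighs the total disadvantage."""
    disadvantage = sum(-v for v in delta.values() if v < 0)
    acc, chosen = 0.0, []
    for rtype, v in sorted(delta.items(), key=lambda kv: -kv[1]):
        if v <= 0 or acc > disadvantage:
            break
        chosen.append(rtype)
        acc += v
    return chosen

# Illustrative decomposed Q-values for two actions in a single state "s".
q_decomposed = {
    "landing":   {"s": {"fire_engine": 1.2, "noop": 0.4}},
    "fuel_cost": {"s": {"fire_engine": -0.5, "noop": 0.0}},
}
delta = rdx(q_decomposed, "s", "fire_engine", "noop")
print(delta, msx_plus(delta))
```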

In the same vein, Madumal et al. [25] exploit the way humans understand and represent knowledge through causal relationships and introduce an action influence model: a causal model which can explain the behaviour of agents using causal explanations. Structural causal models [66] represent the world using random variables, some of which might have causal relationships that can be described by a set of structural equations. In this work, structural causal models are extended to include actions as part of the causal relationships. An action influence model is a tuple formed by the state–action ensemble and the corresponding set of structural equations. The whole process is divided into three phases:

Fig. 4. Top: Minimal Sufficient Explanations (MSX) (fire-down-engine action vs. do-nothing action) for the decomposed-reward DQN in the Lunar Lander environment near the landing site. The shaping rewards dominate decisions. Bottom: RDX (noop vs. fire-main-engine) for HRA in Lunar Lander before a crash. The RDX shows that noop is preferred to avoid penalties such as fuel cost.

Reproduced with permission of Zoe Juozapaitis [24].

1. Defining the qualitative causal relationships of variables as an action influence model.

2. Learning the structural equations (as multivariate regression models during the training phase of the agent).

3. Generating explanations, called explanans, by traversing the action influence graph (see Fig. 5) from the root to the leaf reward node.

This kind of model allows encoding cause–effect relations between events (actions and states), as shown by the graph in Fig. 5. Thus, it can be used to generate explanations of the agent's behaviour ("why" and "why not" questions), based on knowledge about how actions influence the environment. The method was evaluated through a user study showing that, compared to watching video game play without any explanations or with relevant-variable explanations, this model performs significantly better on (1) task prediction and (2) explanation goodness. However, trust was not shown to be significantly improved.
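A hedged sketch of how such a causal explanation can be produced once the action influence graph is known (the variable names below are illustrative and the learned structural equations are omitted):

```python
# Toy action influence graph: nodes are state variables, edges point to the
# variables they causally influence, ending at the reward node.
causal_graph = {
    "build_supply_depot": ["supply_depots"],     # action -> influenced variable
    "supply_depots": ["army_size"],
    "army_size": ["destroyed_units", "destroyed_buildings"],
    "destroyed_units": ["reward"],
    "destroyed_buildings": ["reward"],
}

def explain_why(action):
    """Collect the causal chain from an action to the reward node."""
    chain, frontier = [], list(causal_graph.get(action, []))
    while frontier:
        node = frontier.pop(0)
        if node not in chain:
            chain.append(node)
            frontier.extend(causal_graph.get(node, []))
    return f"Action '{action}' was chosen because it increases " + " -> ".join(chain)

print(explain_why("build_supply_depot"))
```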

The authors of [36] also learn explanations along with the model policy on pedestrian collision avoidance tasks. In this paper, an ensemble of LSTM networks was trained using Monte Carlo Dropout [67] and bootstrapping [68] to estimate collision probabilities and thus produce uncertainty estimates to detect novel observations. The magnitude of those uncertainty estimates was shown to reveal novel obstacles in a variety of scenarios, indicating that the model knows what it does not know. The result is a collision avoidance policy that can measure the novelty of an observation (via model uncertainty) and cautiously avoids pedestrians that exhibit unseen behaviour. Measures of model uncertainty can also be used to identify unseen data during training or testing. In simulation, the resulting policies proved to be more robust to novel observations and took safer actions than an uncertainty-unaware baseline. This work also responds to the problem of safe reinforcement learning [69], whose goal is to ensure reasonable system performance and/or respect safety constraints during the deployment phase as well.
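A minimal sketch of the Monte Carlo Dropout part of this idea (the predictor architecture and dimensions are placeholders): the collision predictor is run several times with dropout kept active, and the spread of its predictions serves as an uncertainty estimate that flags novel observations:

```python
import torch
import torch.nn as nn

# Illustrative collision predictor with a dropout layer.
predictor = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
                          nn.Linear(64, 1), nn.Sigmoid())

def mc_dropout_predict(model, obs, n_samples=50):
    model.train()                       # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(obs) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # predicted collision prob. and its uncertainty

obs = torch.randn(1, 10)
p_collision, uncertainty = mc_dropout_predict(predictor, obs)
# A high uncertainty flags an observation the policy has rarely seen,
# so a more cautious (safer) action can be preferred.
```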

Fig. 5. Action influence graph of a Starcraft II agent. The causal chain (explanation) for the chosen action is depicted with bold arrows, and the extracted explanan (the subset of causes given as the explanation) is shown as darkened nodes. The explanan of the counterfactual ("why not") action is shown as a greyed node (B). The explanandum is the action for which the user needs an explanation. Thus, we can answer the question "Why not build_barrack?". Indeed, the explanation provided by the graph in bold arrows is: "Because it is more desirable to do action build_supply_depot to have more Supply Depots, as the goal is to have more Destroyed Units and Destroyed Buildings".

Reproduced with permission of [25].
Some work has also been done to explain multiagent RL. Wang et al. [26] developed an approach named Shapley Q-value Deep Deterministic Policy Gradient (SQDDPG) to solve global reward games in a multiagent context, based on Shapley values and DDPG. The proposed approach relies on distributing the global reward more efficiently across all agents. They show that integrating Shapley values into DDPG makes it possible to share the global reward between all agents according to their contributions: the more an agent contributes, the more reward it will get. This contrasts with the classical shared-reward approach, which could cause inefficient learning by assigning rewards to an agent who contributed poorly. The experiments showed that SQDDPG presents a faster convergence rate and fairer credit assignment in comparison with other algorithms (i.e. IA2C, IDDPG, COMA and MADDPG). This method allows plotting the credit assigned to each agent, which can explain how the global reward is divided during training and which agent contributed the most to obtaining the global reward (see Fig. 6).
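A hedged sketch of the underlying Shapley-value credit assignment (not the SQDDPG training procedure itself): each agent's credit is estimated as its average marginal contribution over randomly sampled agent orderings, for a placeholder coalition value function:

```python
import random

def shapley_credits(agents, coalition_value, n_permutations=1000):
    """Monte Carlo estimate of each agent's Shapley value."""
    credits = {a: 0.0 for a in agents}
    for _ in range(n_permutations):
        order = random.sample(agents, len(agents))   # random agent ordering
        coalition = []
        for agent in order:
            before = coalition_value(coalition)
            coalition.append(agent)
            credits[agent] += coalition_value(coalition) - before  # marginal contribution
    return {a: v / n_permutations for a, v in credits.items()}

def value(coalition):
    """Toy additive coalition value: predator 'a' contributes most to catching the prey."""
    base = {"a": 3.0, "b": 1.0, "c": 1.0}
    return sum(base[m] for m in coalition)

print(shapley_credits(["a", "b", "c"], value))
```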

Fig. 6. Credit assignment to each predator for a fixed trajectory in the prey-and-predator task (Multiagent Particles environment [70]). Left figure: trajectory sampled by an expert policy. The square represents the initial position, whereas the circle indicates the final position of each agent. The dots on the trajectory indicate each agent's intermediate positions. Right figures: normalized credit assignments generated by different multiagent RL algorithms for this trajectory. SQDDPG presents fairer credit assignments in comparison with other methods.

Reproduced with permission of Jianhong Wang [26].
2.1.3. Explanation through hierarchical goals
Methods based on Hierarchical RL [71] and sub-task decomposition [72] consist of a high-level agent dividing the main goal into sub-goals for a low-level agent, which follows them one by one to perform the high-level task. By learning what sub-goals are optimal for the low-level agent, the high-level agent forms a representation of the environment that is interpretable by humans. Often, Hindsight Experience Replay (HER) [73] is used in order to ignore whether or not goals and sub-goals have been reached during an episode and to extract as much information as possible from past experience.

Beyret et al. [27] used this kind of method along with HER for robotic manipulation (grasping and moving an item). The high-level agent learns which sub-goals can make the low-level agent reach the main goal, while the low-level agent learns to maximize the rewards for these sub-goals. The high-level agent provides a representation of the learned environment and the associated Q-values, which can be represented as heat maps, as shown in Fig. 7.
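A minimal sketch of the HER relabelling step these methods rely on (the data structures and the sparse reward below are illustrative): an unsuccessful trajectory is stored a second time with its goal replaced by a state that was actually reached, so the agent still obtains a useful learning signal:

```python
def her_relabel(trajectory, reward_fn):
    """trajectory: list of (state, action, next_state, goal) tuples."""
    achieved_goal = trajectory[-1][2]          # pretend the final state was the goal
    relabelled = []
    for state, action, next_state, _ in trajectory:
        reward = reward_fn(next_state, achieved_goal)
        relabelled.append((state, action, reward, next_state, achieved_goal))
    return relabelled                          # appended to the replay buffer alongside the original

# Example sparse reward: 1 if the achieved position matches the goal, else 0.
reward_fn = lambda state, goal: 1.0 if state == goal else 0.0
```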

Based on the same ideas, Cideron et al. [28] proposed Textual Hindsight Experience Replay (THER), which extends HER to a natural language setting, allowing the agent to learn from past experiences and to map goals to trajectories without the need for an external expert. The mapping function labels unsuccessful trajectories by automatically predicting a substitute goal. THER is composed of two models: the instruction generator, which outputs a language encoding of the final state, and an agent model, which picks an action given the last observations and the language-encoded goal. The model learns to encode goals and states via natural language, and can thus be interpreted by a human operator (Fig. 8).

Fig. 7. Setup with initial state and goal diagonally opposed on the table. The heat maps show the value of the different areas (highest in yellow) for the high-level agent to predict a sub-goal. Black squares represent the position of the cube, the red circle is the end goal. Thus, the low-level agent will have a succession of sub-goals (e.g. multiple actions that the robotic arm must perform such as moving or opening its pinch) that will ultimately lead to the achievement of the high-level goal (i.e. grasping the red ball).

Reproduced with permission of [27].
Another interesting work finds inspiration in human behaviour to improve generalization on a room navigation task, just like common sense and semantic understanding are used by humans to navigate unseen environments [29]. The entire model is composed of three parts: (1) a semantically grounded navigator used to predict the next action; (2) a common-sense planning module, used for route planning, which predicts the next room based on the observed scene, helps find intermediate targets, and learns what rooms are near the current one; and (3) a semantic grounding module used to recognize rooms, which allows the detection of the current room and incorporates semantic understanding by generating questions about what the agent saw ("Did you see a bathroom?"). Self-supervision is then used for fine-tuning on unseen environments. Explainability can be obtained from the outputs of all parts of the model: we can get information about what room is detected by the agent, what rooms are targeted next (sub-goals), what rooms are predicted around the current room, and what rooms have already been seen by the agent.

Fig. 8. MiniGrid environment [74], where the agent is instructed through a textual string to pick up an object and place it next to another one. The model learns to represent the achieved goal (e.g. “Pick the purple ball”) via language. As this achieved goal differs from the initial goal (“Pick the red ball”), the goal mapper relabels the episode, and both trajectories are appended to the replay buffer.

Reproduced with permission of M. Seurin [28].
An original idea proposed by Tasse et al. [30] consists of making an agent learn basic tasks and then allowing it to perform new ones by composing the previously learned tasks in a Boolean formula (i.e., with conjunctions, disjunctions and negations). The main strength of this method is that the agent is able to perform new tasks without the need for a learning phase. From an XRL point of view, the explainability comes from the fact that the agent is able to express its actions as Boolean formulas, which are easily readable by humans.
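A hedged sketch of what zero-shot Boolean composition can look like at the level of Q-values, loosely following [30] (the numeric values are made up, and the min/max/negation operators shown are a simplification of the formal operators defined in that work):

```python
import numpy as np

q_pick_red  = np.array([0.9, 0.1, 0.3])   # Q(s, a) for each action, base task "red"
q_pick_ball = np.array([0.9, 0.8, 0.2])   # base task "ball"
q_max_task  = np.array([1.0, 1.0, 1.0])   # task solvable from every goal (upper bound)
q_min_task  = np.array([0.0, 0.0, 0.0])   # task solvable from no goal (lower bound)

q_or  = np.maximum(q_pick_red, q_pick_ball)        # red OR ball (disjunction)
q_and = np.minimum(q_pick_red, q_pick_ball)        # red AND ball (conjunction)
q_not = (q_max_task + q_min_task) - q_pick_red     # NOT red (negation)
best_action = int(np.argmax(q_and))                # act greedily on the composed task
```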

2.2. Post-Hoc explainability
Post-hoc explainability refers to explainability methods that rely on an analysis done after the RL algorithm finishes its training and execution. In other terms, it is a way of "enhancing" the considered RL algorithm from a black box to something that is somewhat explainable. Most post-hoc methods encountered were used in a perception context, i.e., when the data manipulated by the RL algorithm consists of visual input such as images.

2.2.1. Explanation through saliency maps
When an RL algorithm is learning from images, it can be useful to know which elements of those images hold the most relevant information (i.e., the salient elements). These elements can be detected using saliency methods that produce saliency maps [17], [75]. In most cases, a saliency or heat map consists of a filter applied to an image that highlights the areas salient for the agent.

A major advantage of saliency maps is that they can produce elements that are easily interpretable by humans, even non-experts. Of course, the interpreting difficulty of a saliency map greatly depends on the saliency method used to compute that map and on other parameters such as the colour scheme or the highlighting technique. A disadvantage is that they are very sensitive to different input variations, and schemes to debug such visual explanations may not be straightforward [76].

A very interesting example [31] introduces a new perturbation-based saliency computation method that produces crisp and easily interpretable saliency maps for RL agents playing OpenAI Gym Atari 2600 games with the Asynchronous Advantage Actor–Critic (A3C) algorithm [78]. The main idea is to apply a perturbation to the considered image that removes information from a specific pixel region without adding new information (by generating an interpolation with a Gaussian blur of the same image). Indeed, this perturbation can be interpreted as adding spatial uncertainty to the region around its point of application. This spatial uncertainty can help understand how removing information in a specific area of the input image affects the agent's policy, and is quantified with a saliency metric. The saliency map is then produced by computing this metric for every pixel of the input image, leading to images such as those in Fig. 9.
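A hedged sketch of such a perturbation-based saliency computation for a grayscale frame and a placeholder policy_fn returning action logits (the mask shape, blur strength and policy function are assumptions, not the exact procedure of [31]):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency(policy_fn, frame, i, j, sigma_mask=5.0, sigma_blur=3.0):
    """Saliency of the region around pixel (i, j) for the given policy."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-((ys - i) ** 2 + (xs - j) ** 2) / (2 * sigma_mask ** 2))  # soft Gaussian mask
    perturbed = frame * (1 - mask) + gaussian_filter(frame, sigma_blur) * mask  # blur out the region
    diff = policy_fn(frame) - policy_fn(perturbed)
    return 0.5 * float(np.sum(diff ** 2))      # large value = the region matters to the policy

# saliency_map = np.array([[saliency(policy_fn, frame, i, j) for j in range(w)]
#                          for i in range(h)])   # in practice computed on a coarse grid
```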

Fig. 9. Comparison of Jacobian saliency (left) first introduced by Simonyan et al. [77] to the authors’ perturbation-based approach (right) in an actor–critic model. Red indicates saliency for the critic; blue is saliency for the actor.

Reproduced with permission of Sam Greydanus [31].
However, saliency methods are not a perfect solution in every situation, as pointed out in [79], [80]. They need to respect a certain number of rules, such as implementation invariance or input invariance, in order to be reliable, especially when it comes to their relation to either the model or the input data.

2.2.2. Explanation through interaction data
In a more generic way, the behaviour of an agent can be explained by gathering data from its interaction with the environment while running, and analysing it in order to extract key information. For instance, Caselles-Dupré et al. demonstrate that symmetry-based disentangled representation learning requires interaction and not only static perception [81].

This idea is exploited by Sequeira et al. [33], where interaction is the core basis upon which their Interestingness Framework is built. This framework relies on introspection conducted by the autonomous RL agent: the agent extracts interestingness elements that denote meaningful interactions from its history of interaction with the environment. This is done using interaction data collected by the agent, analysed with statistical methods organized in a three-level introspection analysis: level 0, environment analysis; level 1, interaction analysis; and level 2, meta-analysis. From these interestingness elements, it is then possible to generate visual explanations (in the form of videos compiling specific highlight situations of interest in the agent's behaviour), where the different introspection levels and their interconnections provide contextualized explanations (see Fig. 10).

The authors applied their framework to the game Frogger and used it to generate video highlights of agents that were included in a user study. The study showed that no single summarizing technique among those used to generate highlight videos is suited to all types of agents and scenarios. A related result is that agents having a monotonous, predictable performance will lack the variety of interactions needed by the interestingness framework to generate pertinent explanations. Finally, and counter-intuitively, highlighting all the different aspects of an agent's interactions is not the best course of action, as it may confuse users by consecutively showing the best and poorest performances of an agent.

Fig. 10. The interestingness framework. The introspection framework analyses interaction data collected by the agent and identifies interestingness elements of the interaction history. These elements are used by an explanation framework to expose the agent’s behaviour to a human user.

Reproduced with permission from Pedro Sequeira [33].
2.3. Other concepts aiding XRL
Some of the studies encountered do not fit in the above categories, mainly because they are not linked to RL or do not directly provide explanations; nonetheless, they present interesting concepts that could contribute to the creation of new XRL methods in the future.

2.3.1. Explainability of CNNs
Although deep neural networks have exhibited superior performance in various tasks, their interpretability has always been their Achilles' heel. Since CNNs are still considered black boxes, many recent research papers focus on providing different levels and notions of explanation to make them more explainable.

As many RL models harness visual DL models to process their input (for instance, when working from pixel observations), they could profit from better explainability of these algorithms. That way, the complete block formed by a CNN and the policy learned on top of it would be explainable as a whole. In addition, some techniques used in the visual domain, such as representation disentanglement, could be relevant to apply in RL. Among the approaches detailed by Zhang et al. [82], one of the most promising aims at creating disentangled (interpretable) representations of the conv-layers of these networks [83], [84]; end-to-end learning of interpretable networks that work directly with comprehensible patterns is also a trending angle [85].
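
For instance, a simple gradient-based saliency map (in the spirit of [31], [77]) already hints at which pixels of an observation drive a visual policy's action choice. The tiny network below is a hypothetical stand-in, not a model from the reviewed works.

import torch
import torch.nn as nn

# Hypothetical pixel-based policy network (a stand-in for any visual RL agent).
policy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 31 * 31, 4),  # 4 discrete actions
)

obs = torch.rand(1, 3, 64, 64, requires_grad=True)  # one RGB observation
logits = policy(obs)
action = logits.argmax(dim=1)

# Saliency: gradient of the chosen action's score with respect to the input pixels.
logits[0, action].sum().backward()
saliency = obs.grad.abs().max(dim=1).values  # (1, 64, 64) pixel-importance map
print(saliency.shape)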

Explaining when, how, and under which conditions catastrophic forgetting [86] or memorization of datasets occurs is another relevant aspect of life-long or continual learning [55] in DNNs that is not yet fully understood. An interesting method towards this vision is Learning Without Memorizing (LwM) [87], an extension of Learning Without Forgetting Multi-Class (LwF-MC) [88] applied to image classification. This model is able to incrementally learn new classes without forgetting previously learned classes and without storing data related to them. The main idea is that at each step a new model, the student, is trained to incrementally learn new classes, while the previous one, the teacher, only has knowledge of the base classes. By improving LwF-MC with a new loss, called the attention distillation loss, LwM tries to preserve base-class knowledge across all model iterations. This loss produces attention maps that a human expert can study in order to interpret the model's logic by inspecting the areas where its attention is focused.
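
A minimal sketch of the attention distillation idea, assuming Grad-CAM-style attention maps have already been extracted for teacher and student; the full loss in [87] also keeps the original LwF terms, which are omitted here.

import torch

def attention_distillation_loss(student_map, teacher_map, eps=1e-8):
    """Penalize the student when its spatial attention drifts away from the
    teacher's, so base-class knowledge is preserved (cf. LwM [87]).
    Both inputs are (batch, H, W) attention maps, e.g. obtained with Grad-CAM."""
    s = student_map.flatten(1)
    t = teacher_map.flatten(1)
    # L2-normalize each map, then take an L1 distance between them.
    s = s / (s.norm(dim=1, keepdim=True) + eps)
    t = t / (t.norm(dim=1, keepdim=True) + eps)
    return (s - t).abs().sum(dim=1).mean()

loss = attention_distillation_loss(torch.rand(2, 7, 7), torch.rand(2, 7, 7))
print(loss.item())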

Another approach, aimed at scene analysis, builds a graph where each node represents an object detected in the scene and is capable of building a context-aware representation of itself by sending messages to the other nodes [89]. This makes it possible for the network to support relational reasoning, allowing it to be effectively transparent. Thus, users are able to make textual enquiries about relationships between objects (e.g., "Is the plate next to a white bowl?").
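
A single round of such message passing can be sketched as follows. This is purely illustrative: the node features and adjacency matrix are made up, and the actual model in [89] additionally conditions the messages on the language query.

import numpy as np

def message_passing_round(node_feats, adjacency, weight):
    """Each detected object updates its representation with an aggregate of
    the messages sent by its neighbours in the scene graph."""
    messages = adjacency @ node_feats           # sum of neighbour features
    return np.tanh((node_feats + messages) @ weight)

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 16))                # 5 detected objects, 16-d features
adj = (rng.random((5, 5)) > 0.5).astype(float)  # which objects exchange messages
updated = message_passing_round(nodes, adj, rng.normal(size=(16, 16)))
print(updated.shape)  # (5, 16): context-aware object representations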

2.3.2. Compositionality as a proxy tool to improve understandability
Compositionality is a universal concept stating that a complex (composed) problem can be decomposed into a set of simpler ones [90]. Thus, in the RL world, this idea can be translated into making an agent solve a complex task by hierarchically completing lesser ones (e.g. by first solving atomic ones, since lesser tasks could also be complex) [91]. This provides reusability, enables quick initialization of policies and makes the learning process much faster, by training an optimal policy for each reward and later combining them. Haarnoja et al. [34] showed that maximum entropy RL methods can produce much more composable policies. Empirical demonstrations were performed on a Sawyer robot trained to avoid a fixed obstacle and to stack Lego blocks with both policies combined. They introduced the Soft Q-learning algorithm, based on maximum entropy RL [92] and energy-based models [93], as well as an extension of this algorithm that enables the composition of learned skills. Methods that optimize for compositionality do not provide a direct explanation tool; however, compositionality can be qualitatively observed as self-organized modules [94] and used to train multiple policies that benefit from being combined. Compositionality may also help better explain each policy along the training evolution in time, or each learned skill separately. However, it is also observed that compositionality may not emerge in the manner humans would conceptually understand or expect it, e.g. based on symbolic abstract functionality modules. Some examples of language emergence in multi-agent RL settings show that generalization and acquisition speed [95] do not co-occur with compositionality, or that compositionality may not go hand in hand with language efficiency, as it does in human communication [96].
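
In the maximum entropy setting, one way to compose two learned skills is to combine their soft Q-functions and act with a Boltzmann policy on the result. The sketch below is a toy, discrete-action rendering of the idea behind [34]; the averaged Q-function is only approximately optimal for the combined task, and the original work deals with continuous control.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def composed_policy(q1, q2, temperature=1.0):
    """Maximum-entropy policies can be composed by averaging their soft
    Q-values; the resulting policy approximately addresses both tasks at once."""
    q_comb = 0.5 * (q1 + q2)
    return softmax(q_comb / temperature)

# Toy discrete example: Q-values of two separately trained skills.
q_avoid = np.array([1.0, 0.2, -1.0])   # e.g. 'avoid the obstacle'
q_stack = np.array([0.1, 1.5, 0.3])    # e.g. 'stack the block'
print(composed_policy(q_avoid, q_stack))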

Distillation has also been used, in the DisCoRL model [97], to learn closely related tasks whose learning should speed up the learning of nearby tasks; this helps transfer from simulation to real settings in navigation and goal-based robotic tasks. We may then be able to further explain each policy along the training evolution timeline, or each learned skill separately.
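
At its core, this kind of policy distillation minimizes a divergence between the teacher's and the student's action distributions. The sketch below shows only that loss, under the assumption of discrete actions; the actual DisCoRL [97] pipeline distills several teacher policies on top of state representation learning features.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs):
    """KL divergence pushing the student's action distribution towards the
    teacher policy, averaged over a batch of observations."""
    log_p_student = F.log_softmax(student_logits, dim=1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean")

teacher = torch.tensor([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])  # teacher action probabilities
student_logits = torch.rand(2, 3, requires_grad=True)
loss = distillation_loss(student_logits, teacher)
loss.backward()
print(loss.item())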

2.3.3. Improving trust via imitation learning
Imitation learning is a way of enabling algorithms to learn from human demonstrations, such as teaching robots assembly skills [35], [98]. While improving training time (compared to more traditional approaches [43]), this method also allows for a better understanding of the agent's behaviour, since it learns according to human expert actions [99]. It can also be a way to improve trust in the model, as it behaves seemingly like a human expert operator and can explain the basis of its decisions textually or verbally. Moreover, when human advice is incorporated during training, it can be extended into advisable learning, which further improves user trust, as the model can understand human natural language and yields clear and precise explanations [100].
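
The simplest instance of this idea is behavioural cloning, where the agent is trained with supervised learning to reproduce the expert's actions. The sketch below uses made-up demonstration data and a hypothetical small policy network, not a setup from the cited works.

import torch
import torch.nn as nn

# Hypothetical demonstrations: states and the discrete actions an expert took.
states = torch.rand(64, 10)
expert_actions = torch.randint(0, 4, (64,))

policy = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(100):  # behavioural cloning = supervised learning on the demos
    loss = nn.functional.cross_entropy(policy(states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())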

2.3.4. Transparency-oriented explanation building
Transparency has been given multiple meanings over time, especially in robotics and AI ethics. Theodorou et al. [101] recently defined it as a mechanism to expose decision making, which could allow AI models to be debugged like traditional programs, as they would communicate information about their operation in real time. However, the relevance of this information should adapt to the user's technological background, from simple progress bars to complex debug logs. An interesting concept is that an AI system could be created using a visual editor that helps communicate which decision will be taken in which situation (very much like decision trees). These concepts have already been successfully implemented in an RL setup, using the Temporal Difference (TD) error to create an emotional model of an agent [102].
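
The TD error that [102] maps to an "emotional" display is the standard quantity delta = r + gamma * V(s') - V(s), i.e. how much better or worse an outcome was than expected; a minimal sketch:

def td_error(reward, value_s, value_next_s, gamma=0.99, done=False):
    """Temporal-difference error: the gap between the bootstrapped target and
    the current value estimate; a positive value signals a pleasant surprise."""
    target = reward + (0.0 if done else gamma * value_next_s)
    return target - value_s

print(td_error(reward=1.0, value_s=0.5, value_next_s=0.7))  # positive surprise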

3. Discussion
Despite explainable deep RL still being an emerging research field, we observed that numerous approaches have been developed so far, as detailed in Section 2. However, there is no clear-cut method that serves all purposes. Most of the reviewed XRL methods are specifically designed to fit a particular task, often related to games or robotics, with no straightforward extension to other real-world RL applications. Furthermore, these methods cannot be generalized to other tasks or algorithms, as they often make specific assumptions (e.g. on the MDP or environment properties). In fact, in XRL there can be more than one model (as in Actor–Critic architectures) and different kinds of algorithms (DQN, DDPG, SARSA…), each with its own particularities. Moreover, there exists a wide variety of environments, each bringing its own constraints. The necessity to adapt to the considered algorithm and environment means that it is hard to provide a holistic or generic explainability method. Thus, in our opinion, Shapley value-based methods [16], [26] can be considered an interesting lead towards this goal. Shapley values could be used to explain the roles taken by agents when learning a policy to achieve a collaborative task, but also to detect defects in training agents or in the data fed to the network. In addition, as a post-hoc explainability method, it may be possible to generalize Shapley value computation to numerous RL environments and models, in the same way it was done with SHAP [16] for other black-box deep learning classifiers and regressors.
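
As a hedged illustration of what this could look like, per-agent contributions to a cooperative team reward can be approximated with Monte Carlo sampling of coalitions, in the spirit of [26]. The team_value function below is a placeholder for re-evaluating the team with only a subset of agents active; it is our own toy assumption, not an existing benchmark or the cited authors' implementation.

import random

def shapley_contributions(agents, team_value, n_samples=200, seed=0):
    """Monte Carlo estimate of each agent's Shapley value, i.e. its average
    marginal contribution to the coalition value over random join orders.
    `team_value(coalition)` must return the reward obtained by that subset."""
    rng = random.Random(seed)
    phi = {a: 0.0 for a in agents}
    for _ in range(n_samples):
        order = agents[:]
        rng.shuffle(order)
        coalition, value = [], team_value(frozenset())
        for a in order:
            coalition.append(a)
            new_value = team_value(frozenset(coalition))
            phi[a] += new_value - value
            value = new_value
    return {a: v / n_samples for a, v in phi.items()}

# Placeholder evaluation: agent 'scout' is twice as useful as 'support'.
def team_value(coalition):
    return 2.0 * ("scout" in coalition) + 1.0 * ("support" in coalition)

print(shapley_contributions(["scout", "support"], team_value))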

Meanwhile, the research community would benefit if more globally oriented approaches, which do not focus on a particular task or algorithm, were developed in the future, as has already been done in general XAI with, for instance, LIME [15] or SHAP [16].

Moreover, some promising approaches to bring explainability to RL involve representation learning related concepts such as Hindsight Experience Replay, Hierarchical RL and self-attention. However, despite the ability of these concepts to improve performance and interpretability in a mathematical sense (in particular representation learning), they somewhat lack concrete explanations targeted at end users, as they mostly address technical domain experts and researchers. This is a key element to develop further in order to allow the deployment of RL in the real world and to make algorithms more trustable and understandable to the general public.

The state of the art shows there is still room for progress in better explaining deep RL models in terms of the preservation of different invariants and other common assumptions of disentangled representation learning [103], [104].

4. Conclusion and future work
We reviewed and analysed different state-of-the-art approaches to RL and how XAI techniques can elucidate and inform their training, debugging and communication to different stakeholder audiences.

We focused on agent-based RL in this work; however, explainability in RL involving humans (e.g. in collaborative problem solving [105]) should involve explainability methods that better assess when robots are able to perform the requested task, and when uncertainty indicates that a task is better delegated to a human. Equally important is to evaluate and explain other aspects of reinforcement learning, e.g. formally explaining the role of curriculum learning [106], quality diversity, or other human-learning-inspired aspects of open-ended learning [42], [107], [108]. Thus, more theoretical bases to serve explainable-by-design DRL are required. The future development of post-hoc XAI techniques should adapt to the requirements of building, training, and conveying DRL models. Furthermore, it is worth noting that all presented methods decompose the final prediction into additive components attributed to particular features [109]; interaction between features should therefore be accounted for and included in the elaboration of explanations. Since most of the presented strategies to explain RL have mainly considered discrete model interpretations, as advocated in [110], continuous formulations of the proposed approaches (such as Integrated Gradients [111], based on the Aumann–Shapley cost-sharing technique, a continuous extension of the Shapley value) should be devised in the future in RL contexts.
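
For reference, the Integrated Gradients attribution [111] of feature i, for a model F, an input x and a baseline x', is the path integral of the gradients along the straight line from x' to x:

\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, \mathrm{d}\alpha ,

which coincides with the Aumann–Shapley cost-sharing scheme applied to the model seen as a cost function.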

We believe the reviewed approaches, and future extensions tackling the identified issues, will likely be critical in the demanding future applications of RL. We advocate for the need to target, in the future, more diverse audiences (developers, testers, end-users, the general public) not yet approached in the development of XAI tools. Only in this way will we produce actionable explanations and more comprehensive frameworks for explainable, trustable and responsible RL that can be deployed in practice.

CRediT authorship contribution statement
Alexandre Heuillet: Conceptualization, Investigation, Visualization, Writing - original draft. Fabien Couthouis: Conceptualization, Investigation, Visualization, Writing - original draft. Natalia Díaz-Rodríguez: Conceptualization, Project administration, Supervision, Writing - review & editing, Resources.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
We thank Sam Greydanus, Zoe Juozapaitis, Benjamin Beyret, Prashan Madumal, Pedro Sequeira, Jianhong Wang, Mathieu Seurin and Vinicius Zambaldi for allowing us to use their original images for illustration purposes. We also would like to thank Frédéric Herbreteau and Adrien Bennetot for their help and support.

Appendix A.
A.1. Glossary

A2C: Advantage Actor–Critic [78]
AI: Artificial Intelligence
COMA: Counterfactual Multi-Agent [112]
CNN: Convolutional Neural Network [113]
DDPG: Deep Deterministic Policy Gradient [114]
DL: Deep Learning
DRL: Deep Reinforcement Learning
DQN: Deep Q-Network [3]
GAN: Generative Adversarial Network [115]
HER: Hindsight Experience Replay [73]
HMM: Hidden Markov Model
HRA: Hybrid Reward Architecture [71]
HRL: Hierarchical Reinforcement Learning [72]
IDDPG: Independent DDPG [114]
MADDPG: Multi-Agent DDPG [70]
MDP: Markov Decision Process
ML: Machine Learning
POMDP: Partially Observable Markov Decision Process
PPO: Proximal Policy Optimization [116]
R-CNN: Region Convolutional Neural Network [117]
RL: Reinforcement Learning
SARSA: State–Action–Reward–State–Action [118]
SRL: State Representation Learning [37]
VAE: Variational Auto-Encoder [54]
XAI: Explainable Artificial Intelligence
XRL: Explainable Reinforcement Learning

References
[1] Sutton R.S., Barto A.G., Reinforcement Learning: An Introduction, second ed., The MIT Press, 2018. URL http://incompleteideas.net/book/the-book-2nd.html
[2] Duan Y., Chen X., Houthooft R., Schulman J., Abbeel P., Benchmarking deep reinforcement learning for continuous control, 2016. arXiv:1604.06778
[3] Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M., Playing Atari with deep reinforcement learning, 2013. arXiv:1312.5602
[4] Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Graves A., Riedmiller M., Fidjeland A.K., Ostrovski G., et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015), p. 529.
[5] Silver D., Hubert T., Schrittwieser J., Antonoglou I., Lai M., Guez A., Lanctot M., Sifre L., Kumaran D., Graepel T., Lillicrap T., Simonyan K., Hassabis D., Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. arXiv:1712.01815
[6] Silver D., Schrittwieser J., Simonyan K., Antonoglou I., Huang A., Guez A., Hubert T., Baker L.R., Lai M., Bolton A., Chen Y., Lillicrap T.P., Hui F., Sifre L., van den Driessche G., Graepel T., Hassabis D., Mastering the game of Go without human knowledge, Nature 550 (2017), pp. 354-359.
[7] Andrychowicz O.M., Baker B., Chociej M., Józefowicz R., McGrew B., Pachocki J., Petron A., Plappert M., Powell G., Ray A., et al., Learning dexterous in-hand manipulation, Int. J. Robot. Res. (2019). 10.1177/0278364919887447
[8] Kalashnikov D., Irpan A., Pastor P., Ibarz J., Herzog A., Jang E., Quillen D., Holly E., Kalakrishnan M., Vanhoucke V., Levine S., QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018. arXiv:1806.10293
[9] Mao H., Alizadeh M., Menache I., Kandula S., Resource management with deep reinforcement learning, in: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets '16, ACM, New York, NY, USA, 2016, pp. 50-56. 10.1145/3005745.3005750
[10] Arel I., Liu C., Urbanik T., Kohls A.G., Reinforcement learning-based multi-agent system for network traffic signal control, IET Intell. Transp. Syst. 4 (2) (2010), pp. 128-135. 10.1049/iet-its.2009.0070
[11] Zhou Z., Li X., Zare R.N., Optimizing chemical reactions with deep reinforcement learning, ACS Cent. Sci. 3 (12) (2017), pp. 1337-1344. 10.1021/acscentsci.7b00492
[12] Zheng G., Zhang F., Zheng Z., Xiang Y., Yuan N., Xie X., Li Z., DRN: A deep reinforcement learning framework for news recommendation, 2018, pp. 167-176. 10.1145/3178876.3185994
[13] Arrieta A.B., Díaz-Rodríguez N., Ser J.D., Bennetot A., Tabik S., Barbado A., García S., Gil-López S., Molina D., Benjamins R., Chatila R., Herrera F., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, 2019. arXiv:1910.10045
[14] Gunning D., Aha D.W., DARPA's explainable artificial intelligence program, AI Mag. 40 (2) (2019), pp. 44-58.
[15] Ribeiro M.T., Singh S., Guestrin C., "Why Should I Trust You?": Explaining the predictions of any classifier, 2016. arXiv:1602.04938
[16] Lundberg S., Lee S.-I., A unified approach to interpreting model predictions, 2017. arXiv:1705.07874
[17] Selvaraju R.R., Cogswell M., Das A., Vedantam R., Parikh D., Batra D., Grad-CAM: Visual explanations from deep networks via gradient-based localization, Int. J. Comput. Vis. (2019). 10.1007/s11263-019-01228-7
[18] Brown T.B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., Agarwal S., Herbert-Voss A., Krueger G., Henighan T., Child R., Ramesh A., Ziegler D.M., Wu J., Winter C., Hesse C., Chen M., Sigler E., Litwin M., Gray S., Chess B., Clark J., Berner C., McCandlish S., Radford A., Sutskever I., Amodei D., Language models are few-shot learners, 2020. arXiv:2005.14165
[19] Doshi-Velez F., Kim B., Towards a rigorous science of interpretable machine learning, 2017. arXiv:1702.08608
[20] Hoffman R.R., Mueller S.T., Klein G., Litman J., Metrics for explainable AI: Challenges and prospects, 2018. arXiv:1812.04608
[21] Zambaldi V., Raposo D., Santoro A., Bapst V., Li Y., Babuschkin I., Tuyls K., Reichert D., Lillicrap T., Lockhart E., Shanahan M., Langston V., Pascanu R., Botvinick M., Vinyals O., Battaglia P., Relational deep reinforcement learning, 2018. arXiv:1806.01830
[22] d'Avila Garcez A., Resende Riquetti Dutra A., Alonso E., Towards symbolic reinforcement learning with common sense, 2018. arXiv:1804.08597
[23] Raffin A., Hill A., Traoré K.R., Lesort T., Díaz-Rodríguez N., Filliat D., Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics, 2019. arXiv:1901.08651
[24] Juozapaitis Z., Koul A., Fern A., Erwig M., Doshi-Velez F., Explainable reinforcement learning via reward decomposition. URL http://web.engr.oregonstate.edu/~afern/papers/reward_decomposition__workshop_final.pdf
[25] Madumal P., Miller T., Sonenberg L., Vetere F., Explainable reinforcement learning through a causal lens, 2019. arXiv:1905.10958
[26] Wang J., Zhang Y., Kim T.-K., Gu Y., Shapley Q-value: A local reward approach to solve global reward games, 2019. arXiv:1907.05707
[27] Beyret B., Shafti A., Faisal A., Dot-to-Dot: Explainable hierarchical reinforcement learning for robotic manipulation, 2019. URL https://arxiv.org/pdf/1904.06703.pdf
[28] Cideron G., Seurin M., Strub F., Pietquin O., Self-educated language agent with hindsight experience replay for instruction following, 2019. arXiv:1910.09451
[29] Yu D., Khatri C., Papangelis A., Madotto A., Namazifar M., Huizinga J., Ecoffet A., Zheng H., Molino P., Clune J., et al., Commonsense and semantic-guided navigation through language in embodied environment. URL https://vigilworkshop.github.io/static/papers/49.pdf
[30] Tasse G.N., James S., Rosman B., A boolean task algebra for reinforcement learning, 2020. arXiv:2001.01394
[31] Greydanus S., Koul A., Dodge J., Fern A., Visualizing and understanding Atari agents, 2017. arXiv:1711.00138
[32] Sequeira P., Yeh E., Gervasio M.T., Interestingness elements for explainable reinforcement learning through introspection, in: IUI Workshops, 2019. URL https://explainablesystems.comp.nus.edu.sg/2019/wp-content/uploads/2019/02/IUI19WS-ExSS2019-1.pdf
[33] Sequeira P., Gervasio M., Interestingness elements for explainable reinforcement learning: Understanding agents' capabilities and limitations, 2019. arXiv:1912.09007
[34] Haarnoja T., Pong V., Zhou A., Dalal M., Abbeel P., Levine S., Composable deep reinforcement learning for robotic manipulation, in: 2018 IEEE International Conference on Robotics and Automation, ICRA, IEEE, 2018. 10.1109/icra.2018.8460756
[35] Al-Yacoub A., Zhao Y., Lohse N., Goh M., Kinnell P., Ferreira P., Hubbard E.-M., Symbolic-based recognition of contact states for learning assembly skills, Front. Robot. AI 6 (2019), p. 99. 10.3389/frobt.2019.00099
[36] Lütjens B., Everett M., How J.P., Safe reinforcement learning with model uncertainty estimates, 2018. arXiv:1810.08700
[37] Lesort T., Díaz-Rodríguez N., Goudou J.-F., Filliat D., State representation learning for control: An overview, Neural Netw. 108 (2018), pp. 379-392. 10.1016/j.neunet.2018.07.006
[38] Bengio Y., Courville A., Vincent P., Representation learning: A review and new perspectives, 2012. arXiv:1206.5538
[39] Lesort T., Seurin M., Li X., Díaz-Rodríguez N., Filliat D., Deep unsupervised state representation learning with robotic priors: a robustness analysis, in: 2019 International Joint Conference on Neural Networks, IJCNN, 2019, pp. 1-8. URL https://hal.archives-ouvertes.fr/hal-02381375/document
[40] Raffin A., Hill A., Traoré R., Lesort T., Díaz-Rodríguez N., Filliat D., S-RL toolbox: Environments, datasets and evaluation metrics for state representation learning, 2018. arXiv:1809.09369
[41] Traoré R., Caselles-Dupré H., Lesort T., Sun T., Cai G., Díaz-Rodríguez N., Filliat D., DisCoRL: Continual reinforcement learning via policy distillation, 2019. arXiv:1907.05855
[42] Doncieux S., Bredeche N., Goff L.L., Girard B., Coninx A., Sigaud O., Khamassi M., Díaz-Rodríguez N., Filliat D., Hospedales T., Eiben A., Duro R., DREAM architecture: a developmental approach to open-ended learning in robotics, 2020. arXiv:2005.06223
[43] Doncieux S., Filliat D., Díaz-Rodríguez N., Hospedales T., Duro R., Coninx A., Roijers D.M., Girard B., Perrin N., Sigaud O., Open-ended learning: A conceptual framework based on representational redescription, Front. Neurorobotics 12 (2018), p. 59. 10.3389/fnbot.2018.00059
[44] Alvernaz S., Togelius J., Autoencoder-augmented neuroevolution for visual doom playing, 2017. arXiv:1707.03902
[45] Finn C., Tan X.Y., Duan Y., Darrell T., Levine S., Abbeel P., Deep spatial autoencoders for visuomotor learning, 2015. arXiv:1509.06113
[46] van Hoof H., Chen N., Karl M., van der Smagt P., Peters J., Stable reinforcement learning with autoencoders for tactile and visual data, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2016, pp. 3928-3934.
[47] Pathak D., Agrawal P., Efros A.A., Darrell T., Curiosity-driven exploration by self-supervised prediction, 2017. arXiv:1705.05363
[48] Shelhamer E., Mahmoudieh P., Argus M., Darrell T., Loss is its own reward: Self-supervision for reinforcement learning, 2016. arXiv:1612.07307
[49] Jonschkowski R., Brock O., Learning state representations with robotic priors, Auton. Robots (2015), pp. 407-428. 10.1007/s10514-015-9459-7
[50] Higgins I., Amos D., Pfau D., Racaniere S., Matthey L., Rezende D., Lerchner A., Towards a definition of disentangled representations, 2018. arXiv:1812.02230
[51] Achille A., Eccles T., Matthey L., Burgess C., Watters N., Lerchner A., Higgins I., Life-long disentangled representation learning with cross-domain latent homologies, in: Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, Inc., 2018, pp. 9873-9883.
[52] Achille A., Soatto S., Emergence of invariance and disentanglement in deep representations, 2017. arXiv:1706.01350
[53] Caselles-Dupré H., Garcia Ortiz M., Filliat D., Symmetry-based disentangled representation learning requires interaction with environments, in: Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019, pp. 4606-4615.
[54] Kingma D.P., Welling M., Auto-encoding variational Bayes, 2013. arXiv:1312.6114
[55] Lesort T., Lomonaco V., Stoian A., Maltoni D., Filliat D., Díaz-Rodríguez N., Continual learning for robotics, 2019. URL https://arxiv.org/pdf/1907.00182.pdf
[56] Chen X., Duan Y., Houthooft R., Schulman J., Sutskever I., Abbeel P., InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, 2016. arXiv:1606.03657
[57] Garnelo M., Shanahan M., Reconciling deep learning with symbolic artificial intelligence: representing objects and relations, Curr. Opin. Behav. Sci. 29 (2019), pp. 17-23. 10.1016/j.cobeha.2018.12.010
[58] Garcez A., Besold T., De Raedt L., Földiák P., Hitzler P., Icard T., Kühnberger K.-U., Lamb L., Miikkulainen R., Silver D., Neural-symbolic learning and reasoning: Contributions and challenges, 2015. 10.13140/2.1.1779.4243
[59] Santoro A., Raposo D., Barrett D.G.T., Malinowski M., Pascanu R., Battaglia P., Lillicrap T., A simple neural network module for relational reasoning, 2017. arXiv:1706.01427
[60] Garnelo M., Arulkumaran K., Shanahan M., Towards deep symbolic reinforcement learning, 2016. arXiv:1609.05518
[61] Denil M., Colmenarejo S.G., Cabi S., Saxton D., de Freitas N., Programmable agents, 2017. arXiv:1706.06383
[62] Kipf T.N., Welling M., Semi-supervised classification with graph convolutional networks, 2016. arXiv:1609.02907
[63] Battaglia P.W., Hamrick J.B., Bapst V., Sanchez-Gonzalez A., Zambaldi V., Malinowski M., Tacchetti A., Raposo D., Santoro A., Faulkner R., Gulcehre C., Song F., Ballard A., Gilmer J., Dahl G., Vaswani A., Allen K., Nash C., Langston V., Dyer C., Heess N., Wierstra D., Kohli P., Botvinick M., Vinyals O., Li Y., Pascanu R., Relational inductive biases, deep learning, and graph networks, 2018. arXiv:1806.01261
[64] Scarselli F., Gori M., Tsoi A.C., Hagenbuchner M., Monfardini G., The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2009), pp. 61-80.
[65] Zhang A., Satija H., Pineau J., Decoupling dynamics and reward for transfer learning, 2018. arXiv:1804.10689
[66] Halpern J.Y., Pearl J., Causes and explanations: A structural-model approach. Part I: Causes, British J. Philos. Sci. 56 (4) (2005), pp. 843-887. 10.1093/bjps/axi147
[67] Gal Y., Ghahramani Z., Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, 2015. arXiv:1506.02142
[68] Osband I., Blundell C., Pritzel A., Roy B.V., Deep exploration via bootstrapped DQN, 2016. arXiv:1602.04621
[69] García J., Fernández F., A comprehensive survey on safe reinforcement learning, J. Mach. Learn. Res. 16 (42) (2015), pp. 1437-1480. URL http://jmlr.org/papers/v16/garcia15a.html
[70] Lowe R., Wu Y., Tamar A., Harb J., Abbeel P., Mordatch I., Multi-agent actor-critic for mixed cooperative-competitive environments, in: Neural Information Processing Systems, NIPS, 2017.
[71] van Seijen H., Fatemi M., Romoff J., Laroche R., Barnes T., Tsang J., Hybrid reward architecture for reinforcement learning, 2017. arXiv:1706.04208
[72] Kawano H., Hierarchical sub-task decomposition for reinforcement learning of multi-robot delivery mission, in: 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 2013, pp. 828-835. 10.1109/ICRA.2013.6630669
[73] Andrychowicz M., Wolski F., Ray A., Schneider J., Fong R., Welinder P., McGrew B., Tobin J., Abbeel P., Zaremba W., Hindsight experience replay, 2017. arXiv:1707.01495
[74] Chevalier-Boisvert M., Willems L., Pal S., Minimalistic gridworld environment for OpenAI Gym, GitHub repository, 2018. https://github.com/maximecb/gym-minigrid
[75] Mundhenk T.N., Chen B.Y., Friedland G., Efficient saliency maps for explainable AI, 2019. arXiv:1911.11293
[76] Jain S., Wallace B.C., Attention is not explanation, 2019. arXiv:1902.10186
[77] Simonyan K., Vedaldi A., Zisserman A., Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013. arXiv:1312.6034
[78] Mnih V., Badia A.P., Mirza M., Graves A., Lillicrap T.P., Harley T., Silver D., Kavukcuoglu K., Asynchronous methods for deep reinforcement learning, 2016. arXiv:1602.01783
[79] Kindermans P.-J., Hooker S., Adebayo J., Alber M., Schütt K.T., Dähne S., Erhan D., Kim B., The (un)reliability of saliency methods, 2017. arXiv:1711.00867
[80] Adebayo J., Gilmer J., Muelly M., Goodfellow I., Hardt M., Kim B., Sanity checks for saliency maps, 2018. arXiv:1810.03292
[81] Caselles-Dupré H., Garcia-Ortiz M., Filliat D., Symmetry-based disentangled representation learning requires interaction with environments, 2019. arXiv:1904.00243
[82] Zhang Q., Zhu S.-C., Visual interpretability for deep learning: a survey, 2018. arXiv:1802.00614
[83] Zhang Q., Cao R., Wu Y.N., Zhu S.-C., Growing interpretable part graphs on convnets via multi-shot learning, 2016. arXiv:1611.04246
[84] Zhang Q., Cao R., Shi F., Wu Y.N., Zhu S.-C., Interpreting CNN knowledge via an explanatory graph, 2017. arXiv:1708.01785
[85] Wu T., Sun W., Li X., Song X., Li B., Towards interpretable R-CNN by unfolding latent structures, 2017. arXiv:1711.05226
[86] Díaz-Rodríguez N., Lomonaco V., Filliat D., Maltoni D., Don't forget, there is more than forgetting: new metrics for continual learning, 2018. arXiv:1810.13166
[87] Dhar P., Singh R.V., Peng K.-C., Wu Z., Chellappa R., Learning without memorizing, 2018. arXiv:1811.08051
[88] Li Z., Hoiem D., Learning without forgetting, 2016. arXiv:1606.09282
[89] Hu R., Rohrbach A., Darrell T., Saenko K., Language-conditioned graph networks for relational reasoning, 2019. arXiv:1905.04405
[90] Baroni M., Linguistic generalization and compositionality in modern artificial neural networks, Philos. Trans. R. Soc. B 375 (1791) (2019), Article 20190307. 10.1098/rstb.2019.0307
[91] Pierrot T., Ligner G., Reed S., Sigaud O., Perrin N., Laterre A., Kas D., Beguir K., de Freitas N., Learning compositional neural programs with recursive tree search and planning, 2019. arXiv:1905.12941
[92] Ziebart B.D., Maas A.L., Bagnell J.A., Dey A.K., Maximum entropy inverse reinforcement learning, 2008.
[93] Haarnoja T., Tang H., Abbeel P., Levine S., Reinforcement learning with deep energy-based policies, 2017. arXiv:1702.08165
[94] Han D., Doya K., Tani J., Emergence of hierarchy via reinforcement learning using a multiple timescale stochastic RNN, 2019. arXiv:1901.10113
[95] Kharitonov E., Baroni M., Emergent language generalization and acquisition speed are not tied to compositionality, 2020. arXiv:2004.03420
[96] Chaabouni R., Kharitonov E., Dupoux E., Baroni M., Anti-efficient encoding in emergent communication, in: Advances in Neural Information Processing Systems, 2019, pp. 6290-6300.
[97] Traoré R., Caselles-Dupré H., Lesort T., Sun T., Cai G., Díaz-Rodríguez N., Filliat D., DisCoRL: Continual reinforcement learning via policy distillation, 2019. arXiv:1907.05855
[98] Abbeel P., Ng A., Apprenticeship learning via inverse reinforcement learning, in: Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004, 2004.
[99] Christiano P., Leike J., Brown T.B., Martic M., Legg S., Amodei D., Deep reinforcement learning from human preferences, 2017. arXiv:1706.03741
[100] Kim J., Moon S., Rohrbach A., Darrell T., Canny J., Advisable learning for self-driving vehicles by internalizing observation-to-action rules, in: The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
[101] Theodorou A., Wortham R.H., Bryson J.J., Designing and implementing transparency for real time inspection of autonomous robots, Connect. Sci. 29 (3) (2017), pp. 230-241. 10.1080/09540091.2017.1310182
[102] Matarese M., Rossi S., Sciutti A., Rea F., Towards transparency of TD-RL robotic systems with a human teacher, 2020. arXiv:2005.05926
[103] Locatello F., Bauer S., Lucic M., Rätsch G., Gelly S., Schölkopf B., Bachem O., Challenging common assumptions in the unsupervised learning of disentangled representations, 2018. arXiv:1811.12359
[104] Achille A., Soatto S., A separation principle for control in the age of deep learning, 2017. arXiv:1711.03321
[105] Bennetot A., Charisi V., Díaz-Rodríguez N., Should artificial agents ask for help in human-robot collaborative problem-solving?, 2020. arXiv:2006.00882
[106] Portelas R., Colas C., Weng L., Hofmann K., Oudeyer P.-Y., Automatic curriculum learning for deep RL: A short survey, 2020. arXiv:2003.04664
[107] Mouret J.-B., Clune J., Illuminating search spaces by mapping elites, 2015. arXiv:1504.04909
[108] Pugh J.K., Soros L.B., Stanley K.O., Quality diversity: A new frontier for evolutionary computation, Front. Robot. AI 3 (2016), p. 40.
[109] Staniak M., Biecek P., Explanations of model predictions with live and breakDown packages, 2018. arXiv:1804.01955
[110] Sundararajan M., Najmi A., The many Shapley values for model explanation, 2019. arXiv:1908.08474
[111] Sundararajan M., Taly A., Yan Q., Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, PMLR, Sydney, Australia, 2017, pp. 3319-3328.
[112] Foerster J., Farquhar G., Afouras T., Nardelli N., Whiteson S., Counterfactual multi-agent policy gradients, 2017. arXiv:1705.08926
[113] LeCun Y., Bengio Y., et al., Convolutional networks for images, speech, and time series, in: The Handbook of Brain Theory and Neural Networks, Vol. 3361, No. 10, 1995, p. 1995.
[114] Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D., Continuous control with deep reinforcement learning, 2015. arXiv:1509.02971
[115] Goodfellow I.J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., Generative adversarial networks, 2014. arXiv:1406.2661
[116] Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O., Proximal policy optimization algorithms, 2017. arXiv:1707.06347
[117] Girshick R., Donahue J., Darrell T., Malik J., Rich feature hierarchies for accurate object detection and semantic segmentation, 2013. arXiv:1311.2524
[118] Rummery G., Niranjan M., On-Line Q-Learning Using Connectionist Systems, Technical Report CUED/F-INFENG/TR 166, 1994.
