
Multi-Objective Deep Reinforcement Learning

Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, Shimon Whiteson
Department of Computer Science University of Oxford Oxford, United Kingdom {ms15ham, yannis.assael, diederik.roijers, shimon.whiteson}@cs.ox.ac.uk

Abstract

        We propose Deep Optimistic Linear Support Learning (DOL) to solve high-dimensional multi-objective decision problems where the relative importances of the objectives are not known a priori. Using features from the high-dimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives. To our knowledge, this is the first time that deep reinforcement learning has succeeded in learning multi-objective policies. In addition, we provide a testbed with two experiments to be used as a benchmark for deep multi-objective reinforcement learning.

1 Introduction


In recent years, advances in deep learning have been instrumental in solving a number of challenging reinforcement learning (RL) problems, including high-dimensional robot control Levine et al. (2015); Assael et al. (2015); Watter et al. (2015), visual attention Ba et al. (2015), solving riddles Foerster et al. (2016), the Atari learning environment (ALE) Guo et al. (2014); Mnih et al. (2015); Stadie et al. (2015); Wang et al. (2015); Schaul et al. (2016); van Hasselt et al. (2016); Oh et al. (2015); Bellemare et al. (2016); Nair et al. (2015) and Go Maddison et al. (2015); Silver et al. (2016).

While the aforementioned approaches have focused on single-objective settings, many real-world problems have multiple, possibly conflicting objectives. For example, an agent may want to maximise the performance of a web application server while minimising its power consumption Tesauro et al. (2007). Such problems can be modelled as multi-objective Markov decision processes (MOMDPs), and solved with multi-objective reinforcement learning (MORL) Roijers et al. (2013). Because it is typically not clear how to evaluate the available trade-offs between different objectives a priori, there is no single optimal policy. Hence, it is desirable to produce a coverage set (CS), which contains at least one optimal policy (and associated value vector) for each possible utility function that a user might have.

So far, deep learning methods for Markov decision processes (MDPs) have not been extended to MOMDPs. One reason is that it is not clear how neural networks can account for unknown preferences and the resulting sets of value vectors. In this paper, we circumvent this issue by taking an outer loop approach Roijers et al. (2015a) to multi-objective reinforcement learning, i.e., we aim to learn an approximate coverage set of policies, each represented by a neural network, by evaluating a sequence of scalarised single-objective problems. In order to enable the use of deep Q-Networks Mnih et al. (2015) for learning in MOMDPs, we build off the state-of-the-art optimistic linear support (OLS) framework Roijers (2016); Roijers et al. (2015a). OLS is a generic outer loop method for solving multi-objective decision problems, i.e., it repeatedly calls a single-objective solver as a subroutine. OLS terminates after a finite number of calls to that subroutine and produces an approximate CS. In principle any single-objective solver can be used, as long as it is OLS-compliant, i.e., produces policy value vectors rather than scalar values. Making a single-objective solver OLS-compliant typically requires little effort.

We present three new deep multi-objective RL algorithms. First, we investigate how the learning setting affects OLS, and how deep RL can be made OLS-compliant. Using an OLS-compliant neural network combined with the OLS framework results in Deep OLS Learning (DOL). Our empirical evaluation shows that DOL can tackle multi-objective problems with much larger inputs than classical multi-objective RL algorithms. We improve upon DOL by leveraging the fact that the OLS framework solves a series of single-objective problems that become increasingly similar as the series progresses Roijers et al. (2015b), which results in increasingly similar optimal value vectors. Deep Q-networks produce latent embeddings of the features of a problem with respect to the value function. Hence, we hypothesise that we can reuse parts of the network used to solve the previous single-objective problem in order to speed up learning on the next one. This results in two new algorithms: Deep OLS Learning with Full Reuse (DOL-FR), which reuses all parameter values of the neural networks, and Deep OLS Learning with Partial Reuse (DOL-PR), which reuses all parameter values except those of the last layer of the network. We show empirically that reusing only part of the network (DOL-PR) is more effective than reusing the entire network (DOL-FR), and drastically improves performance compared to DOL without reuse.

2 Background 


In a single-objective RL setting Sutton and Barto (1998), an agent observes the current state $s_t \in \mathcal{S}$ at each discrete time step $t$, chooses an action $a_t \in \mathcal{A}$ according to a potentially stochastic policy $\pi$, observes a reward signal $r(s_t, a_t) = r_t \in \mathcal{R}$, and transitions to a new state $s_{t+1}$. Its objective is to maximise an expectation over the discounted return, $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$, where $r_t$ is the reward received at time $t$ and $\gamma \in [0,1]$ is a discount factor.

Markov Decision Process (MDP). Such sequential decision problems are commonly modelled as a finite single-objective Markov decision process (MDP), a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$. The $Q$-function of a policy $\pi$ is $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$. The optimal action-value function $Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a)$ obeys the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s'}\!\left[\, r(s,a) + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \right]. \qquad (1)$$

Deep Q-Networks (DQN). Deep $Q$-learning Mnih et al. (2015) uses neural networks parameterised by $\theta$ to represent $Q(s,a;\theta)$. DQNs are optimised by minimising:

$$\mathcal{L}_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[ \left( y_i^{\mathrm{DQN}} - Q(s,a;\theta_i) \right)^2 \right], \qquad (2)$$

at each iteration $i$, with target $y_i^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$. Here, $\theta_i^-$ are the parameters of a target network that is frozen for a number of iterations while the online network $Q(s,a;\theta_i)$ is updated by gradient descent. The action $a$ is chosen from $Q(s,a;\theta_i)$ by an action selector, which typically implements an $\epsilon$-greedy policy that selects the action that maximises the Q-value with probability $1-\epsilon$ and chooses randomly with probability $\epsilon$. DQN uses experience replay Lin (1993): during learning, the agent builds a dataset of episodic experiences and is then trained by sampling mini-batches of experiences. Experience replay is used in Mnih et al. (2015) to reduce variance by breaking correlations among the samples, while also enabling re-use of past experiences for learning.
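To make Equation (2) concrete, the following is a minimal sketch of the target and loss computation for a PyTorch-style Q-network; the names q_net and target_net, and the batch layout, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the DQN target and loss of Eq. (2), assuming a PyTorch
# Q-network `q_net` and a periodically-synchronised copy `target_net`.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.97):
    s, a, r, s_next, done = batch                        # mini-batch sampled from the replay buffer
    # Q(s, a; theta_i) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_i^DQN = r + gamma * max_a' Q(s', a'; theta_i^-), from the frozen target network
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                           # E[(y_i^DQN - Q(s, a; theta_i))^2]
```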


Figure 1: The two corner weights of $V_S^*(\mathbf{w})$ for an $S$ containing three value vectors in a 2-objective MOMDP.

Multi-Objective MDPs (MOMDP). An MOMDP is an MDP in which the reward function $\mathbf{R}(s_t, a_t) = \mathbf{r}_t \in \mathcal{R}^d$ describes a vector of $d$ rewards, one for each objective Roijers et al. (2013). We use bold variables to denote vectors. The solution to an MOMDP is a set of policies called a coverage set, which contains at least one optimal policy for each possible preference, i.e., utility or scalarisation function, $f$, that a user might have. This scalarisation function maps each possible policy value vector, $\mathbf{V}^{\pi}$, onto a scalar value. In this paper, we focus on the highly prevalent case where the scalarisation function is linear, i.e., $f(\mathbf{V}^{\pi}, \mathbf{w}) = \mathbf{w} \cdot \mathbf{V}^{\pi}$, where $\mathbf{w}$ is a vector that determines the relative importance of the objectives, such that $f(\mathbf{V}^{\pi}, \mathbf{w})$ is a convex combination of the objectives. The corresponding coverage set is called the convex coverage set (CCS) Roijers et al. (2013).
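As a small illustration of linear scalarisation, the sketch below uses three hypothetical policy value vectors to show that different weight vectors $\mathbf{w}$ select different best policies, which is why a coverage set rather than a single policy is required.

```python
# Illustrative only: three hypothetical 2-objective policy value vectors.
import numpy as np

V = np.array([[10.0, 1.0],   # one row per policy: V^pi
              [ 6.0, 6.0],
              [ 1.0, 9.0]])

for w in ([0.9, 0.1], [0.5, 0.5], [0.1, 0.9]):
    scalarised = V @ np.array(w)            # f(V^pi, w) = w . V^pi for every policy
    print(w, "-> best policy index:", int(scalarised.argmax()))
```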

Optimistic Linear Support (OLS). OLS takes an outer loop approach in which the CCS is incrementally constructed by solving a series of scalarised, i.e., single-objective, MDPs for different linear scalarisation vectors $\mathbf{w}$. This enables the use of DQNs as a single-objective MDP solver. In each iteration, OLS finds one policy by solving a scalarised MDP, and its value vector $\mathbf{V}^{\pi}$ is added to an intermediate approximate coverage set, $S$.

Unlike other outer loop methods, OLS uses the concept of corner weights to pick the weights to use for creating scalarised instances, and the concept of estimated improvement to prioritise those corner weights. To define corner weights, we first define the scalarised value function $V_S^*(\mathbf{w}) = \max_{\mathbf{V}^{\pi} \in S} \mathbf{w} \cdot \mathbf{V}^{\pi}$, as a function of the linear scalarisation vector $\mathbf{w}$, for a set of value vectors $S$. $V_S^*(\mathbf{w})$ for an $S$ containing three value vectors is depicted in Figure 1. $V_S^*(\mathbf{w})$ forms a piecewise linear and convex function that comprises the upper surface of the scalarised values of the value vectors. The corner weights are the weights at the corners of this convex upper surface Cheng (1988), marked with crosses in the figure. OLS always selects the corner weight $\mathbf{w}$ that maximises an optimistic upper bound on the difference between $V_S^*(\mathbf{w})$ and the optimal scalarised value function, i.e., $V_{CCS}^*(\mathbf{w}) - V_S^*(\mathbf{w})$, and solves the single-objective MDP scalarised by the selected $\mathbf{w}$.
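For two objectives, $V_S^*(\mathbf{w})$ and its corner weights can be computed directly, since each value vector induces a line over $w_1$ with $\mathbf{w} = (w_1, 1 - w_1)$. The sketch below, reusing the hypothetical value vectors above, recovers the two corner weights depicted in Figure 1; it is a simplified illustration, not the authors' implementation of OLS.

```python
# Corner weights of V_S^*(w) for 2 objectives, writing w = (w1, 1 - w1).
import numpy as np

def corner_weights_2d(S, tol=1e-9):
    """S: (n, 2) array of value vectors. Returns the corner weights w1 in (0, 1)."""
    def upper(w1):                                       # V_S^*(w) at this w1
        return (S @ np.array([w1, 1.0 - w1])).max()

    corners = []
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            # line i: S[i,1] + w1 * (S[i,0] - S[i,1]); intersect it with line j
            denom = (S[i, 0] - S[i, 1]) - (S[j, 0] - S[j, 1])
            if abs(denom) < tol:
                continue                                 # parallel scalarised values
            w1 = (S[j, 1] - S[i, 1]) / denom
            value = S[i, 1] + w1 * (S[i, 0] - S[i, 1])
            if 0.0 < w1 < 1.0 and value >= upper(w1) - 1e-6:
                corners.append(w1)                       # intersection lies on the upper surface
    return sorted(corners)

S = np.array([[10.0, 1.0], [6.0, 6.0], [1.0, 9.0]])
print(corner_weights_2d(S))                              # two corner weights, as in Figure 1
```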

In the planning setting for which OLS was devised, such an upper bound can typically be computed using upper bounds on the error with respect to the optimal value of the scalarised policy values at each previous $\mathbf{w}$ in the series, in combination with linear programs. The error bounds at the previous $\mathbf{w}$ stem from the approximation quality of the single-objective planning methods that OLS uses. However, in reinforcement learning, the true CCS is fundamentally unknown and no upper bounds can be given on the approximation quality of deep $Q$-learning. Therefore, we use $\bar{V}_{CCS}^*(\mathbf{w}) - V_S^*(\mathbf{w})$ as a heuristic to determine the priority, where $\bar{V}_{CCS}^*(\mathbf{w})$ is defined as the maximal attainable scalarised value under the assumption that the values found for the previous $\mathbf{w}$ in the series were optimal for those $\mathbf{w}$.

3 Methodology

In this section, we propose our algorithms for MORL that employ deep Q-learning. Firstly, we propose our basic deep OLS learning (DOL) algorithm; we build off the OLS framework for multi-objective learning and integrate DQN. Then, we improve on this algorithm by introducing Deep OLS Learning with Partial (DOL-PR) and Full Reuse (DOL-FR). DOL, DOL-PR, and DOL-FR make use of a single-objective subroutine, which is defined together with DOL in Section 3.1.

3.1 Deep OLS Learning (DOL)

There are two requirements to make use of the OLS framework. First, we need a scalarised, i.e., single-objective, learning algorithm that is OLS-compliant. OLS compliance entails that, rather than learning a single value per $Q(s,a)$, we need a vector-valued Q-value $\mathbf{Q}(s,a)$. The estimates of $\mathbf{Q}(s,a)$ need to be accurate enough to determine the next corner weight in the series of linear scalarisation weights, $\mathbf{w}$, that OLS is going to generate. To satisfy those requirements we adjust our neural network architectures to output a matrix of size $|\mathcal{A}| \times d$ (where $d$ is the number of objectives) instead of just $|\mathcal{A}|$ values, and we train for an extended number of episodes.

We define scalarised deep Q-learning, which uses this network architecture and optimises the parameters to maximise the inner product of $\mathbf{w}$ and the $\mathbf{Q}$-values for a given $\mathbf{w}$, instead of the scalar $Q$-values as in standard deep Q-learning. Using scalarised deep Q-learning as a subroutine in OLS results in our first algorithm: deep OLS learning (DOL).
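A minimal sketch of what an OLS-compliant network and scalarised deep Q-learning could look like is given below: the network outputs an $|\mathcal{A}| \times d$ matrix of vector-valued Q-estimates, and both action selection and the bootstrap target are scalarised with the current $\mathbf{w}$. The class and function names, and the vector-valued bootstrap target, are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch of an OLS-compliant Q-network with an |A| x d output (d = number of objectives).
import torch
import torch.nn as nn

class MultiObjectiveDQN(nn.Module):
    def __init__(self, state_dim, n_actions, n_objectives, hidden=100):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions * n_objectives)
        self.n_actions, self.n_objectives = n_actions, n_objectives

    def forward(self, s):                                  # -> (batch, |A|, d)
        return self.head(self.body(s)).view(-1, self.n_actions, self.n_objectives)

def greedy_action(net, s, w):
    q = net(s)                                             # (batch, |A|, d)
    return (q * w).sum(dim=-1).argmax(dim=1)               # argmax_a  w . Q(s, a)

def scalarised_target(target_net, r_vec, s_next, w, gamma=0.97):
    q_next = target_net(s_next)                            # (batch, |A|, d)
    a_star = (q_next * w).sum(dim=-1).argmax(dim=1)        # greedy action w.r.t. w . Q
    best = q_next[torch.arange(q_next.shape[0]), a_star]   # (batch, d)
    return r_vec + gamma * best                            # vector-valued bootstrap target
```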

3.2 Deep OLS Learning with Full (DOL-FR) and Partial Reuse (DOL-PR)

While DOL can already tackle very large MOMDPs, re-learning the parameters for the entire network when we move to the next $\mathbf{w}$ in the sequence is rather inefficient. Fortunately, we can exploit the following observation: the optimal value vectors (and thus optimal policies) for scalarised MOMDPs with a $\mathbf{w}$ and a $\mathbf{w}'$ that are close together are typically close as well Roijers et al. (2015b). Because deep Q-networks learn to extract the features of a problem that are relevant to the rewards of an MOMDP, we can speed up computation by reusing the neural network parameters that were trained earlier in the sequence.

In Algorithm 1, we present an umbrella version of three novel algorithms, which we denote DOL. The different algorithms are obtained by setting the reuse parameter (i.e., the type of reuse) to one of three values: DOL (without reuse) is obtained by setting reuse to 'none', DOL with full reuse (DOL-FR) is obtained by setting reuse to 'full', and DOL with partial reuse (DOL-PR) is obtained by setting reuse to 'partial'.

Algorithm 1 Deep OLS Learning (with different types of reuse)

1:  function DOL(m, ε, template, reuse)
2:    ▷ Where m – the (MOMDP) environment, ε – improvement threshold,
3:    ▷ template – specification of the DQN architecture, reuse – the type of reuse
4:    S = empty partial CCS
5:    W = empty list of explored corner weights
6:    Q = priority queue initialised with the extrema of the weight simplex, with infinite priority
7:    DQN_Models = empty table of DQNs, indexed by the weight w for which each was learnt
8:    while Q is not empty ∧ it ≤ max_it do
9:      w = Q.pop()
10:     if reuse = 'none' ∨ DQN_Models is empty then
11:       model = a randomly initialised DQN, from the pre-specified architecture template
12:     else
13:       model = copyNearestModel(w, DQN_Models)
14:       if reuse = 'partial' then reinitialise the last layer of model with random weights
15:     V, new_model = scalarisedDeepQLearning(m, w, model)
16:     W = W ∪ {w}
17:     if (∃w') w'·V > max_{U∈S} w'·U then
18:       W_del = corner weights in Q made obsolete by V
19:       W_del = W_del ∪ {w}
20:       Remove W_del from Q
21:       Remove vectors from S that are no longer optimal for any w after adding V
22:       W_V = newCornerWeights(S, V)
23:       S = S ∪ {V}
24:       DQN_Models[w] = new_model
25:       for each w' ∈ W_V do
26:         if estimateImprovement(w', S, W) > ε then
27:           Q.add(w')
28:     it++
29:   return S, DQN_Models

DOL-FR applies full deep Q-network reuse; we start learning for a new scalarisation weight $\mathbf{w}'$ using the complete network we optimised for the previous $\mathbf{w}$ that is closest to $\mathbf{w}'$ in the sequence of scalarisation weights that OLS has generated so far. DOL-PR applies partial deep Q-network reuse; we take the same network as for full reuse, but we reinitialise the last layer of the network randomly, in order to escape local optima. DOL (without reuse) does no reuse whatsoever, i.e., all network parameters are initialised randomly at the start of each iteration.
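The sketch below illustrates the difference between full and partial reuse, assuming a network like the MultiObjectiveDQN sketched in Section 3.1; the attribute name head and the choice of initialiser are assumptions.

```python
# Full reuse copies all parameters from the model learnt for the nearest previous w;
# partial reuse additionally re-initialises the final (output) layer.
import copy
import torch.nn as nn

def reuse_model(nearest_model, reuse):
    model = copy.deepcopy(nearest_model)             # start from the closest previously-learnt w
    if reuse == "partial":
        nn.init.xavier_uniform_(model.head.weight)   # reset only the last layer
        nn.init.zeros_(model.head.bias)
    return model
```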

DOL keeps track of the partial CCS, $S$, to which at most one value vector is added at each iteration (line 4). To find these vectors, scalarised deep Q-learning (Section 3.1) is run for different corner weights. The corner weights that are not yet explored are kept in a priority queue, $Q$, and after they have been explored they are stored in a list $W$ (lines 5 and 6). $Q$ is initialised with the extrema weights and keeps track of the scalarisation weights ordered by estimated improvement. In order to reuse the learnt parameters in DOL-PR/FR, DOL keeps track of them, along with the corner weight $\mathbf{w}$ for which they were found, in DQN_Models.

Following the OLS framework, at each iteration of DOL, the weight with the highest improvement is popped (line 9). After selecting $\mathbf{w}$, DOL now reuses the DQNs it learnt in previous iterations (depending on the parameter reuse). The function copyNearestModel finds the network learnt for the weight closest to the current corner weight (line 13). In the case of full reuse (reuse = 'full'), all parameter values are copied. In the case of partial reuse (reuse = 'partial'), the last layer is reinitialised with random parameter values (line 14), and in the case of no reuse (reuse = 'none') all the network parameters are reset (line 11).

Following the different types of reuse, scalarised deep Q-learning, as described in Section 3.1, is invoked for the $\mathbf{w}$ popped off of $Q$ on line 9. Scalarised deep Q-learning returns a value vector, $\mathbf{V}$, corresponding to the learnt policy represented by a DQN, which is also returned (line 15). The current corner weight is added to the list of explored weights (line 16), which is used to determine the priorities of subsequently discovered corner weights in the current and future iterations. If there is a weight vector $\mathbf{w}$ in the weight simplex for which the scalarised value of $\mathbf{V}$ is higher than that of any of the vectors in $S$, the value vector is added to $S$, and new corner weights are determined and stored (lines 18-27). The DQN that corresponds to $\mathbf{V}$ is stored in DQN_Models[$\mathbf{w}$]. If $\mathbf{V}$ does not improve upon $S$ for any $\mathbf{w}$, it is discarded.

Extending $S$ with $\mathbf{V}$ leads to new corner weights. These new corner weights and their estimated improvements are calculated using the newCornerWeights and estimateImprovement methods of OLS Roijers (2016). The new corner weights are added to $Q$ if their improvement value is greater than the threshold $\varepsilon$ (lines 25-27). Also, corner weights in $Q$ that are made obsolete (i.e., that are no longer on the convex upper surface) by the new value vector are removed (lines 18-19). This is repeated until there are no more corner weights in $Q$, at which point DOL terminates.

4 Experimental Evaluation

In this section, we evaluate the performance of DOL and DOL-PR/FR. We make use of two multi-objective reinforcement learning problems, called mountain car (MC) and deep sea treasure (DST). We first show how DOL and DOL-PR/FR are able to learn the correct CCS, using direct access to the state $s_t$ of the problems. Then, to explore the scalability of our proposed methods and to evaluate the performance of weight reuse, we create an image version of the DST problem, in which we use a bitmap as input for scalarised deep Q-learning.

4.1 Setup

For both the raw and image problems we follow the DQN setup of Mnih et al. (2015), employing experience replay and a target network to stabilise learning. We use an $\epsilon$-greedy exploration policy, with $\epsilon$ annealed from $1$ to $0.05$ over the first 2000 and 3000 episodes, respectively, and learning continues for an equal number of episodes. The discount factor is $\gamma = 0.97$, and the target network is reset every 100 episodes. To stabilise learning, we execute parallel episodes in batches of 32. The parameters are optimised using Adam with a learning rate of $10^{-3}$. In each experiment we average over 5 runs.
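For convenience, the hyperparameters listed above can be collected into a single configuration sketch (values taken from the text; the dictionary itself is purely illustrative):

```python
# Hyperparameters from Section 4.1, collected here for convenience.
config = {
    "epsilon_start": 1.0,
    "epsilon_final": 0.05,
    "epsilon_anneal_episodes": (2000, 3000),  # first 2000/3000 episodes for the two settings, respectively
    "gamma": 0.97,                 # discount factor
    "target_reset_every": 100,     # episodes between target-network resets
    "parallel_episodes": 32,       # parallel episodes executed per batch
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "runs_per_experiment": 5,
}
```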


Figure 2: DST architecture.

For the raw state model we used an MLP architecture with one hidden layer of 100 neurons and rectified linear unit activations. To process the 3×11×10 image inputs of deep sea treasure we employed two convolutional layers of 16×3×3 and 32×3×3 filters and a fully connected layer on top. Finally, to facilitate future research, we publish the source code to replicate our experiments (GitHub: hossam-mossalam/multi-objective-deep-rl).

4.2 Multi-Objective Mountain Car

In order to show that DOL, DOL-FR, and DOL-PR can learn a CCS, we first test on the multi-objective mountain car problem (MC). MC is a variant of the classic mountain car problem introduced in Sutton and Barto (1998). In the single-objective mountain car problem, the agent controls a car located in a valley between two hills and tries to get the car to the top of the hill on the right side. The car has limited engine power, so the agent needs to oscillate the car between the two hills until it has gathered enough momentum to reach the goal.


Figure 3: MC raw version mean CCS error.

The reward in the single-objective variant is −1 for all time steps and 0 when the goal is reached. Our multi-objective variant adds a second reward, the fuel consumption at each time step, which is proportional to the force exerted by the car. In MC there are only 2 value vectors in the CCS, so it is a small problem.
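An illustrative sketch of the resulting two-objective reward follows; the exact fuel scaling is an assumption, as the text only states that it is proportional to the exerted force.

```python
# Two-objective reward for multi-objective mountain car: (time, fuel).
def mc_reward(force, reached_goal):
    time_reward = 0.0 if reached_goal else -1.0   # -1 per step, 0 at the goal
    fuel_reward = -abs(force)                     # fuel consumption ~ exerted force
    return (time_reward, fuel_reward)
```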

Raw version. We evaluate our proposed methods in the MC environment with the agent having direct access to the state $s_t$. We employ the same neural network architecture as for DST. For MC, we used the CCS obtained by a tabular Q-learning algorithm as the ground-truth CCS for the max CCS error calculations. As can be seen in Figure 3, the three algorithms achieve very similar results on the MC problem, with DOL-PR achieving the least error. The algorithms learn a good approximation of the CCS in 2 iterations. After that, they continue making tiny improvements to these vectors that are not visible on the graph. The different algorithms behave equally well because, for the extrema of the weight space, i.e., the first two iterations, the optimal policies are very different, and reuse does not contribute significantly.

4.3 Deep Sea Treasure


Figure 4: Image DST map.

To test the performance of our algorithms on a problem with a larger CCS, we adapt the well-known deep sea treasure (DST) benchmark for MORL Vamplew et al. (2011). In DST, the agent controls a submarine searching for treasures in a 10×11 grid. The state $s_t$ consists of the agent's current coordinates $(x, y)$. The grid contains 10 treasures whose rewards increase in proportion to the distance from the starting point $s_0 = (0, 0)$. The agent's action space consists of navigation in four directions, and the map is depicted in Figure 4.

At each time step the agent is rewarded on the two different objectives. The first reward is zero unless a treasure value is received, and the second is a time penalty of −1 per time step. To be able to learn a CCS instead of a Pareto front, as in the original work Vamplew et al. (2011), we adapted the values of the treasures such that the value of the most efficient policy for reaching each treasure is in the CCS. The rewards for both objectives were normalised to $[0, 1]$ to facilitate learning.


Figure 5: CCS after 4000 episodes in the DST raw version.

Raw version. We first evaluate our proposed methods in a simple scenario, where the agent has direct access to the state $s_t$. Hence, we employ a simple neural network architecture and measure the maximum error in scalarised value with respect to the true CCS. The true CCS is obtained by planning with an exact algorithm on the underlying MOMDP. We refer to this error as the max CCS error. A visualisation comparing the true CCS and the discovered CCS is shown in Figure 5. As can be seen in Figure 6(a), DOL exhibits the highest error. Contrary to our preliminary expectations, having access to the raw state information $s_t$ does not make feature extraction and reuse redundant. Furthermore, we found that when DOL-FR was used and the initialisation model already corresponded to an optimal policy, the approximation error increased significantly, and less so for DOL-PR. We therefore conclude that our algorithms can efficiently approximate a CCS, and that reuse enables more accurate learning.


(a) DST raw version.


(b) DST image version.


(c) DST episodes vs accuracy.

Figure 6: Figures (a) and (b) illustrate the maximum CCS error in the DST raw and image versions, respectively. The results are averaged over 5 experiments. Figure (c) shows the accuracy achieved for different numbers of episodes for DOL-PR.

Image version. Similar to the raw version, our deep convolutional architectures for the image version are still able to approximate the CCS with high accuracy. As seen in Figure 6(b), the reuse methods show higher performance than DOL, and DOL-PR exhibits the highest stability as well. This is attributed to the fact that the network has learned to encode the state space, which paves the way towards efficient learning of the Q-values. DOL-PR exhibits the highest performance: by resetting the last layer, we keep this encoded state space but still allow DOL to train a new set of Q-values from scratch. We therefore conclude that DOL-PR is the preferred algorithm.

Accuracy vs episodes. We further investigated the effect of the number of training episodes on the max CCS error. As can be seen in Figure 6(c), the error is highly affected by the number of training episodes. Specifically, for a small number of episodes DOL-PR is unable to provide sufficient accuracy to build the CCS. It is interesting to note that although the error decreases up to 4000 episodes, at 10000 episodes the network overfits, which results in lower performance.

5 Related Work

Multi-objective reinforcement learning Roijers et al. (2013); Vamplew et al. (2011) has recently seen renewed interest. Most algorithms in the literature Barrett and Narayanan (2008); Moffaert and Nowé (2014); Wiering et al. (2014) are however based on an inner loop approach, i.e., they replace the inner workings of single-objective solvers so as to work with sets of value vectors in the innermost parts of the algorithm. This is a fundamentally different approach, and it is not clear how it could be applied to DQN: back-propagation cannot be transformed into a multi-objective algorithm in such a way. Other work does apply an outer loop approach but does not employ deep RL Yahyaa et al. (2014); Van Moffaert et al. (2014); Natarajan and Tadepalli (2005). We argue that enabling deep RL is essential for scaling up to larger problems.

Another popular class of MORL algorithms are heuristic policy search methods that find a set of alternative policies. These are, for example, based on multi-objective evolutionary algorithms (MOEAs) Coello et al. (2007); Handa (2009) or Pareto local search (PLS) Kooijman et al. (2015). MOEAs in particular are compatible with neural networks, but evolutionary optimisation of NNs is typically rather slow compared to back-propagation (which is what the deep Q-learning algorithm that we employ in this paper as a single-objective subroutine uses).

Outside of MORL, there are algorithms that are based on OLS but apply to different problem settings. Notably, the OLSAR algorithm Roijers et al. (2015b) does planning in multi-objective partially observable MDPs (POMDPs), and applies reuse to the alpha matrices that it makes use of to represent the multi-objective value function. Unlike in our work, however, these alpha matrices form a guaranteed lower bound on the value function and can be reused fully without affecting the necessary exploration for learning in later iterations. Furthermore, the variational OLS (VOLS) algorithm Roijers et al. (2015c), applies OLS to multi-objective coordination graphs and reuses reparameterisations of these graphs that are returned by the single-objective variational inference methods that VOLS uses as a subroutine. These variational subroutines are not made OLS compliant, like the DQNs in this paper, but the value vectors are retrieved by a separate policy evaluation step (which would be suboptimal in the context of deep RL).

6 Discussion

In this work, we proposed three new algorithms that enable the use of deep Q-learning for multi-objective reinforcement learning. Our algorithms build off the recent optimistic linear support framework, and as such tackle the problem by learning one policy and corresponding value vector per iteration. Further, we extend the basic deep OLS learning (DOL) algorithm to take advantage of the nature of neural networks, introducing full (DOL-FR) and partial (DOL-PR) parameter reuse between iterations to pave the way towards faster learning.

We showed empirically that in problems with large inputs, our algorithms can learn the CCS with high accuracy. For these problems DOL-PR outperforms DOL and DOL-FR, indicating that a) reuse is useful, and b) partial reuse, rather than full reuse, effectively prevents the model from getting stuck in a policy that was optimal for a previous $\mathbf{w}$. In future work, we plan to incorporate an early stopping technique and to optimise our model for the accuracy requirements of OLS, while lowering the number of episodes required.

Acknowledgements

This work is in part supported by the TERESA project (EC-FP7 grant #611153).

References

  • Levine et al. (2015) S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
  • Assael et al. (2015) Y. M. Assael, N. Wahlström, T. B. Schön, and M. P. Deisenroth. Data-efficient learning of feedback policies from image pixels using deep dynamical models. NIPS Deep Reinforcement Learning Workshop, 2015.
  • Watter et al. (2015) M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
  • Ba et al. (2015) J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
  • Foerster et al. (2016) J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. arXiv preprint arXiv:1605.06676, 2016.
  • Guo et al. (2014) X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pages 3338–3346, 2014.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Stadie et al. (2015) B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • Wang et al. (2015) Z. Wang, N. de Freitas, and M. Lanctot. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In ICLR, 2016.
  • van Hasselt et al. (2016) H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
  • Oh et al. (2015) J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, pages 2845–2853, 2015.
  • Bellemare et al. (2016) M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016.
  • Nair et al. (2015) A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.
  • Maddison et al. (2015) C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.
  • Silver et al. (2016) D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Tesauro et al. (2007) G. Tesauro, R. Das, H. Chan, J. O. Kephart, C. Lefurgy, D. W. Levine, and F. Rawson. Managing power consumption and performance of computing systems using reinforcement learning. In NIPS 2007: Advances in Neural Information Processing Systems 20, pages 1497–1504, 2007.
  • Roijers et al. (2013) D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 47:67–113, 2013.
  • Roijers et al. (2015a) D. M. Roijers, S. Whiteson, and F. A. Oliehoek. Computing convex coverage sets for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52:399–443, 2015a.
  • Roijers (2016) D. M. Roijers. Multi-Objective Decision-Theoretic Planning. PhD thesis, University of Amsterdam, 2016.
  • Roijers et al. (2015b) D. M. Roijers, S. Whiteson, and F. A. Oliehoek. Point-based planning for multi-objective POMDPs. In IJCAI 2015: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 1666–1672, July 2015b.
  • Sutton and Barto (1998) R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.
  • Lin (1993) L. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, January 1993.
  • Cheng (1988) H.-T. Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, Vancouver, 1988.
  • Vamplew et al. (2011) P. Vamplew, R. Dazeley, A. Berry, E. Dekker, and R. Issabekov. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1-2):51–80, 2011.
  • Barrett and Narayanan (2008) L. Barrett and S. Narayanan. Learning all optimal policies with multiple criteria. In ICML, pages 41–47, 2008.
  • Moffaert and Nowé (2014) K. V. Moffaert and A. Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3483–3512, 2014.
  • Wiering et al. (2014) M. A. Wiering, M. Withagen, and M. M. Drugan. Model-based multi-objective reinforcement learning. In ADPRL 2014: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 1–6, 2014.
  • Yahyaa et al. (2014) S. Q. Yahyaa, M. M. Drugan, and B. Manderick. The scalarized multi-objective multi-armed bandit problem: an empirical study of its exploration vs. exploitation tradeoff. In IJCNN 2014: Proceedings of the 2014 International Joint Conference on Neural Networks, pages 2290–2297, 2014.
  • Van Moffaert et al. (2014) K. Van Moffaert, T. Brys, A. Chandra, L. Esterle, P. R. Lewis, and A. Nowé. A novel adaptive weight selection algorithm for multi-objective multi-agent reinforcement learning. In IJCNN 2014: Proceedings of the International Joint Conference on Neural Networks, pages 2306–2314, 2014.
  • Natarajan and Tadepalli (2005) S. Natarajan and P. Tadepalli. Dynamic preferences in multi-criteria reinforcement learning. In ICML, 2005.
  • Coello et al. (2007) C. C. Coello, G. B. Lamont, and D. A. Van Veldhuizen. Evolutionary algorithms for solving multi-objective problems. Springer Science & Business Media, 2007.
  • Handa (2009) H. Handa. Solving multi-objective reinforcement learning problems by EDA-RL - acquisition of various strategies. In ISDA 2009: Proceedings of the Ninth International Conference on Intelligent Systems Design and Applications, pages 426–431, 2009.
  • Kooijman et al. (2015) C. Kooijman, M. de Waard, M. Inja, D. M. Roijers, and S. Whiteson. Pareto local policy search for MOMDP planning. In ESANN 2015: Special Session on Emerging Techniques and Applications in Multi-Objective Reinforcement Learning at the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning 2015, pages 53–58, April 2015.
  • Roijers et al. (2015c) D. M. Roijers, S. Whiteson, A. T. Ihler, and F. A. Oliehoek. Variational multi-objective coordination. In MALIC 2015: NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2015c.