Continuous Control with Deep Reinforcement Learning (DDPG): Paper Translation


CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver & Daan Wierstra

Abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies “end-to-end”: directly from raw pixel inputs.

1 INTRODUCTION

One of the primary goals of the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional, sensory input. Recently, significant progress has been made by combining advances in deep learning for sensory processing (Krizhevsky et al., 2012) with reinforcement learning, resulting in the “Deep Q Network” (DQN) algorithm (Mnih et al., 2015) that is capable of human level performance on many Atari video games using unprocessed pixels for input. To do so, deep neural network function approximators were used to estimate the action-value function.
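To make this setup concrete, here is a minimal PyTorch sketch (not the authors' code; the layer sizes and observation/action dimensions are illustrative assumptions) of a Q-network that outputs one action-value per discrete action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates the action-value function: one Q-value per discrete action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per discrete action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Greedy action selection is a cheap argmax over a finite set of outputs;
# this is exactly the step that breaks down for continuous actions.
q = QNetwork(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)
action = q(obs).argmax(dim=1)
```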

However, while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces. DQN cannot be straightforwardly applied to continuous domains since it relies on finding the action that maximizes the action-value function, which in the continuous valued case requires an iterative optimization process at every step.

An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom. For example, a 7 degree of freedom system (as in the human arm) with the coarsest discretization $a_i \in \{-k, 0, k\}$ for each joint leads to an action space with dimensionality $3^7 = 2187$. The situation is even worse for tasks that require fine control of actions as they require a correspondingly finer grained discretization, leading to an explosion of the number of discrete actions. Such large action spaces are difficult to explore efficiently, and thus successfully training DQN-like networks in this context is likely intractable. Additionally, naive discretization of action spaces needlessly throws away information about the structure of the action domain, which may be essential for solving many problems.
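A quick sanity check of the action-count arithmetic above (the seven-joint arm with three bins per joint is the example from the text):

```python
# Coarse discretization of a 7-degree-of-freedom arm: each joint picks one of
# {-k, 0, k}, so the number of joint actions grows exponentially with joints.
bins_per_joint = 3
joints = 7
n_discrete_actions = bins_per_joint ** joints
print(n_discrete_actions)  # 2187, i.e. 3**7 as stated in the text
```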

In this work we present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces. Our work is based on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) (itself similar to NFQCA (Hafner & Riedmiller, 2011), and similar ideas can be found in (Prokhorov et al., 1997)). However, as we show below, a naive application of this actor-critic method with neural function approximators is unstable for challenging problems.
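As a rough illustration of the deterministic-policy-gradient idea behind this actor-critic setup (a sketch under assumed network shapes and optimizer settings, not the paper's exact algorithm), the actor $\mu(s)$ is updated by following the critic's gradient with respect to the action, which in an autograd framework reduces to minimizing $-Q(s, \mu(s))$:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state to a continuous action."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded continuous action
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a) for continuous actions."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(3, 1), Critic(3, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

# Deterministic policy gradient step: push the actor's output towards
# higher critic values by minimizing -Q(s, mu(s)) over a batch of states.
states = torch.randn(64, 3)  # dummy batch of states
actor_loss = -critic(states, actor(states)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```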

Here we combine the actor-critic approach with insights from the recent success of Deep Q Network (DQN) (Mnih et al., 2013; 2015). Prior to DQN, it was generally believed that learning value functions using large, non-linear function approximators was difficult and unstable. DQN is able to learn value functions using such function approximators in a stable and robust way due to two innovations: 1. the network is trained off-policy with samples from a replay buffer to minimize correlations between samples; 2. the network is trained with a target Q network to give consistent targets during temporal difference backups. In this work we make use of the same ideas, along with batch normalization (Ioffe & Szegedy, 2015), a recent advance in deep learning.
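Both stabilizing ideas can be sketched in a few lines. This is a minimal sketch, assuming a generic torch module as a stand-in critic; the soft-update factor `tau` anticipates how DDPG (later in the paper) adapts DQN's periodic hard copy of the target network:

```python
import copy
import random
from collections import deque

import torch

# 1. Replay buffer: train off-policy on randomly sampled past transitions,
#    which minimizes the correlations between samples in a minibatch.
class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

# 2. Target network: a separate, slowly-changing copy of the critic used to
#    compute temporal-difference targets, keeping those targets consistent.
critic = torch.nn.Linear(4, 1)          # stand-in for a real critic network
critic_target = copy.deepcopy(critic)

tau = 0.001
with torch.no_grad():
    for p_target, p in zip(critic_target.parameters(), critic.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```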

In order to evaluate our method we constructed a variety of challenging physical control problems that involve complex multi-joint movements, unstable and rich contact dynamics, and gait behavior. Among these are classic problems such as the cartpole swing-up problem, as well as many new domains. A long-standing challenge of robotic control is to learn an action policy directly from raw sensory input such as video. Accordingly, we placed a fixed viewpoint camera in the simulator and attempted all tasks using both low-dimensional observations (e.g. joint angles) and directly from pixels.

Our model-free approach, which we call Deep DPG (DDPG), can learn competitive policies for all of our tasks using low-dimensional observations (e.g. cartesian coordinates or joint angles) using the same hyper-parameters and network structure. In many cases, we are also able to learn good policies directly from pixels, again keeping hyper-parameters and network structure constant.

A key feature of the approach is its simplicity: it requires only a straightforward actor-critic architecture and learning algorithm with very few “moving parts”, making it easy to implement and scale to more difficult problems and larger networks. For the physical control problems we compare our results to a baseline computed by a planner (Tassa et al., 2012) that has full access to the underlying simulated dynamics and its derivatives (see supplementary information). Interestingly, DDPG can sometimes find policies that exceed the performance of the planner, in some cases even when learning from pixels (the planner always plans over the underlying low-dimensional state space).

2 BACKGROUND

We consider a standard reinforcement learning setup consisting of an agent interacting with an environment $E$ in discrete timesteps. At each timestep $t$ the agent receives an observation $x_t$, takes an action $a_t$ and receives a scalar reward $r_t$. In all the environments considered here the actions are real-valued, $a_t \in \mathbb{R}^N$. In general, the environment may be partially observed so that the entire history of the observation, action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$ may be required to describe the state. Here, we assumed the environment is fully-observed so $s_t = x_t$.
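The interaction loop described above can be written down directly; the `env.reset()`/`env.step()` interface below is a gym-style assumption, not something defined in the paper:

```python
# Collect one episode of (x_t, a_t, r_t, x_{t+1}) transitions from an
# environment E under a given policy. The environment interface is
# hypothetical (gym-style): reset() returns x_1, step(a) returns
# (x_{t+1}, r_t, done).
def run_episode(env, policy, max_steps: int = 1000):
    transitions = []
    x = env.reset()
    for _ in range(max_steps):
        a = policy(x)
        x_next, r, done = env.step(a)
        transitions.append((x, a, r, x_next, done))
        x = x_next
        if done:
            break
    return transitions
```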

An agent’s behavior is defined by a policy, $\pi$, which maps states to a probability distribution over the actions, $\pi: S \to P(A)$. The environment, $E$, may also be stochastic. We model it as a Markov decision process with a state space $S$, action space $A = \mathbb{R}^N$, an initial state distribution $p(s_1)$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$.
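For completeness, the objective in this setting is the discounted return; the standard definition, with discount factor $\gamma \in [0, 1]$, is

$R_t = \sum_{i=t}^{T} \gamma^{(i-t)} \, r(s_i, a_i)$,

and the agent aims to learn a policy that maximizes the expected return from the start distribution.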
