Background
While DDPG can sometimes achieve great performance, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:
- Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
- Trick Two: "Delayed" Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-functions. The paper recommends one policy update for every two Q-function updates.
- Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
Together, these three tricks result in substantially improved performance over baseline DDPG.
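The intuition behind Trick One can be illustrated with a small numerical sketch (this is a toy experiment, not part of any TD3 implementation; the explicit max over actions stands in for the policy's maximization of Q): when Q-estimates carry zero-mean noise, maximizing over actions biases the target upward, while taking the minimum of two independent estimates pushes that bias back down.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Noisy Q-estimates from two independent "critics"; the true value
# of every action is 0, so any nonzero mean below is pure bias.
q1 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single-critic target: max over actions overestimates the true value
single = q1.max(axis=1).mean()

# Clipped double-Q style target: take the min of the two critics first
double = np.minimum(q1, q2).max(axis=1).mean()

print(f"single-critic bias:    {single:+.3f}")  # clearly positive
print(f"clipped double-Q bias: {double:+.3f}")  # noticeably smaller
```

The min does not remove the bias entirely, but it shrinks it substantially, which is the effect TD3 relies on.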
Quick Facts
- TD3 is an off-policy algorithm.
- TD3 can only be used for environments with continuous action spaces.
- The Spinning Up implementation of TD3 does not support parallelization.
Key Equations
TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we'll work from the innermost part of the loss function outwards.
- 1: **Target policy smoothing.** Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions, $a$, satisfy $a_{Low} \leq a \leq a_{High}$). The target actions are thus:

a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)

Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can occur in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak, resulting in brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which is what target policy smoothing is designed to do.
- 2: **Clipped double-Q learning.** Both Q-functions use a single target, calculated using whichever of the two Q-functions gives the smaller target value:

y(r,s',d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i, \text{targ}}}(s', a'(s')),

and then both are learned by regressing to this target:
L(\phi_1, {\mathcal D}) = E_{(s,a,r,s',d) \sim {\mathcal D}}{ \Bigg( Q_{\phi_1}(s,a) - y(r,s',d) \Bigg)^2 },

L(\phi_2, {\mathcal D}) = E_{(s,a,r,s',d) \sim {\mathcal D}}{ \Bigg( Q_{\phi_2}(s,a) - y(r,s',d) \Bigg)^2 }.

Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.
- 3: **Delayed policy updates.** The policy is learned just by maximizing $Q_{\phi_1}$:

\max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],

which is essentially unchanged from DDPG. However, in TD3 the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
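The target computation that combines tricks 1 and 2 can be sketched as follows (a minimal numpy sketch, not the Spinning Up implementation; `pi_targ`, `q1_targ`, and `q2_targ` are stand-ins for the target networks):

```python
import numpy as np

def td3_target(r, s2, d, pi_targ, q1_targ, q2_targ,
               gamma=0.99, target_noise=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=None):
    """Compute the TD3 Bellman target y(r, s', d) for a batch."""
    if rng is None:
        rng = np.random.default_rng()

    # Target policy smoothing: clipped Gaussian noise on the target action,
    # then clip the result back into the valid action range.
    a2 = pi_targ(s2)
    eps = np.clip(rng.normal(0.0, target_noise, size=np.shape(a2)),
                  -noise_clip, noise_clip)
    a2 = np.clip(a2 + eps, act_low, act_high)

    # Clipped double-Q: use the smaller of the two target Q-values.
    q_min = np.minimum(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - d) * q_min

# Delayed policy updates live in the training loop, e.g.:
#   if update_counter % policy_delay == 0:
#       update_policy_and_targets()
```

Both Q-functions would then be regressed towards this single target, and the policy update (maximizing Q over the first critic) runs only once every `policy_delay` Q-updates.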
Exploration vs. Exploitation
TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep the noise scale fixed throughout.)
At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.
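The action-selection logic described above can be sketched as follows (a hypothetical `get_action` helper; `pi` stands in for the learned deterministic policy and `sample_uniform` for `env.action_space.sample`):

```python
import numpy as np

def get_action(pi, obs, t, sample_uniform,
               start_steps=10000, act_noise=0.1,
               act_low=-1.0, act_high=1.0, rng=None):
    """Training-time action selection for TD3.

    For the first start_steps environment steps, sample uniformly at
    random over valid actions; afterwards, act with the policy plus
    mean-zero Gaussian noise, clipped to the valid action range.
    """
    if rng is None:
        rng = np.random.default_rng()
    if t < start_steps:
        return sample_uniform()                       # uniform-random exploration
    a = pi(obs)
    a = a + act_noise * rng.normal(size=np.shape(a))  # Gaussian exploration noise
    return np.clip(a, act_low, act_high)
```

At test time one would simply call `pi(obs)` with no noise added.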
Pseudocode
Documentation
spinup.td3(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2, max_ep_len=1000, logger_kwargs={}, save_freq=1)
Parameters:
- env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
- actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
- ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TD3.
- seed (int) – Seed for random number generators.
- steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
- epochs (int) – Number of epochs to run and train agent.
- replay_size (int) – Maximum length of replay buffer.
- gamma (float) – Discount factor. (Always between 0 and 1.)
- polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to: \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta, where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
- pi_lr (float) – Learning rate for policy.
- q_lr (float) – Learning rate for Q-networks.
- batch_size (int) – Minibatch size for SGD.
- start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
- act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
- target_noise (float) – Stddev for smoothing noise added to target policy.
- noise_clip (float) – Limit for absolute value of target policy smoothing noise.
- policy_delay (int) – Policy will only be updated once every policy_delay times for each update of the Q-networks.
- max_ep_len (int) – Maximum length of trajectory / episode / rollout.
- logger_kwargs (dict) – Keyword args for EpochLogger.
- save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
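For reference, the polyak averaging described under the polyak parameter amounts to the following (a one-line sketch, with numpy arrays standing in for parameter tensors):

```python
import numpy as np

def polyak_update(theta_targ, theta, polyak=0.995):
    """theta_targ <- polyak * theta_targ + (1 - polyak) * theta."""
    return polyak * theta_targ + (1.0 - polyak) * theta
```

With polyak close to 1, the target networks trail the main networks slowly, which stabilizes the Bellman targets.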