Background
While DDPG can sometimes achieve great performance, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:
- Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
- Trick Two: "Delayed" Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-functions. The paper recommends one policy update for every two Q-function updates.
- Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
Together, these three tricks result in substantially improved performance over baseline DDPG.
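The intuition behind Trick One can be illustrated with a small numerical sketch (this is a toy experiment, not part of any TD3 implementation; the explicit max over actions stands in for the policy's maximization of Q): when Q-estimates carry zero-mean noise, maximizing over actions biases the target upward, while taking the minimum of two independent estimates pushes that bias back down.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Noisy Q-estimates from two independent "critics"; the true value
# of every action is 0, so any nonzero mean below is pure bias.
q1 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single-critic target: max over actions overestimates the true value
single = q1.max(axis=1).mean()

# Clipped double-Q style target: take the min of the two critics first
double = np.minimum(q1, q2).max(axis=1).mean()

print(f"single-critic bias:    {single:+.3f}")  # clearly positive
print(f"clipped double-Q bias: {double:+.3f}")  # noticeably smaller
```

The min does not remove the bias entirely, but it shrinks it substantially, which is the effect TD3 relies on.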
Quick Facts
- TD3 is an off-policy algorithm.
- TD3 can only be used for environments with continuous action spaces.
- The Spinning Up implementation of TD3 does not support parallelization.
Key Equations
TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we'll work from the innermost part of the loss function outwards.
- 1: **Target policy smoothing.** Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions, $a$, satisfy $a_{Low} \leq a \leq a_{High}$). The target actions are thus:

a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)

Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can occur in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak, resulting in brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which is what target policy smoothing is designed to do.
- 2: **Clipped double-Q learning.** Both Q-functions use a single target, calculated using whichever of the two Q-functions gives the smaller target value:

y(r,s',d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i, \text{targ}}}(s', a'(s')),

and then both are learned by regressing to this target:
L(\phi_1, {\mathcal D}) = E_{(s,a,r,s',d) \sim {\mathcal D}}{ \Bigg( Q_{\phi_1}(s,a) - y(r,s',d) \Bigg)^2 },

L(\phi_2, {\mathcal D}) = E_{(s,a,r,s',d) \sim {\mathcal D}}{ \Bigg( Q_{\phi_2}(s,a) - y(r,s',d) \Bigg)^2 }.

Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.
- 3: **Delayed policy updates.** The policy is learned just by maximizing $Q_{\phi_1}$:

\max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],

which is essentially unchanged from DDPG. However, in TD3 the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
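The target computation that combines tricks 1 and 2 can be sketched as follows (a minimal numpy sketch, not the Spinning Up implementation; `pi_targ`, `q1_targ`, and `q2_targ` are stand-ins for the target networks):

```python
import numpy as np

def td3_target(r, s2, d, pi_targ, q1_targ, q2_targ,
               gamma=0.99, target_noise=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=None):
    """Compute the TD3 Bellman target y(r, s', d) for a batch."""
    if rng is None:
        rng = np.random.default_rng()

    # Target policy smoothing: clipped Gaussian noise on the target action,
    # then clip the result back into the valid action range.
    a2 = pi_targ(s2)
    eps = np.clip(rng.normal(0.0, target_noise, size=np.shape(a2)),
                  -noise_clip, noise_clip)
    a2 = np.clip(a2 + eps, act_low, act_high)

    # Clipped double-Q: use the smaller of the two target Q-values.
    q_min = np.minimum(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - d) * q_min

# Delayed policy updates live in the training loop, e.g.:
#   if update_counter % policy_delay == 0:
#       update_policy_and_targets()
```

Both Q-functions would then be regressed towards this single target, and the policy update (maximizing Q over the first critic) runs only once every `policy_delay` Q-updates.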
Exploration vs. Exploitation
TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep the noise scale fixed throughout.)
At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.
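The action-selection logic described above can be sketched as follows (a hypothetical `get_action` helper; `pi` stands in for the learned deterministic policy and `sample_uniform` for `env.action_space.sample`):

```python
import numpy as np

def get_action(pi, obs, t, sample_uniform,
               start_steps=10000, act_noise=0.1,
               act_low=-1.0, act_high=1.0, rng=None):
    """Training-time action selection for TD3.

    For the first start_steps environment steps, sample uniformly at
    random over valid actions; afterwards, act with the policy plus
    mean-zero Gaussian noise, clipped to the valid action range.
    """
    if rng is None:
        rng = np.random.default_rng()
    if t < start_steps:
        return sample_uniform()                       # uniform-random exploration
    a = pi(obs)
    a = a + act_noise * rng.normal(size=np.shape(a))  # Gaussian exploration noise
    return np.clip(a, act_low, act_high)
```

At test time one would simply call `pi(obs)` with no noise added.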
Pseudocode
Documentation
spinup.td3(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2, max_ep_len=1000, logger_kwargs={}, save_freq=1)
Parameters:
- env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
- actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
- ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TD3.
- seed (int) – Seed for random number generators.
- steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
- epochs (int) – Number of epochs to run and train agent.
- replay_size (int) – Maximum length of replay buffer.
- gamma (float) – Discount factor. (Always between 0 and 1.)
- polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to: \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta, where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
- pi_lr (float) – Learning rate for policy.
- q_lr (float) – Learning rate for Q-networks.
- batch_size (int) – Minibatch size for SGD.
- start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
- act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
- target_noise (float) – Stddev for smoothing noise added to target policy.
- noise_clip (float) – Limit for absolute value of target policy smoothing noise.
- policy_delay (int) – Policy will only be updated once every policy_delay times for each update of the Q-networks.
- max_ep_len (int) – Maximum length of trajectory / episode / rollout.
- logger_kwargs (dict) – Keyword args for EpochLogger.
- save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
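For reference, the polyak averaging described under the polyak parameter amounts to the following (a one-line sketch, with numpy arrays standing in for parameter tensors):

```python
import numpy as np

def polyak_update(theta_targ, theta, polyak=0.995):
    """theta_targ <- polyak * theta_targ + (1 - polyak) * theta."""
    return polyak * theta_targ + (1.0 - polyak) * theta
```

With polyak close to 1, the target networks trail the main networks slowly, which stabilizes the Bellman targets.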