- Paper title: Addressing Function Approximation Error in Actor-Critic Methods
What problem is addressed?
Value-based reinforcement learning with function approximation tends to overestimate the value function (as in DQN). The authors bring the overestimation-correction idea of Double Q-Learning into the actor-critic setting. (The overestimation problem is that accumulated errors can drive the value of some poor states very high, a consequence of insufficient exploration.) A large part of the paper is also devoted to the increased variance that the overestimation correction brings with it.
The authors carry the overestimation analysis into continuous action spaces. The difficulty there is that the policy changes very slowly, so the current and target value estimates stay close together, too similar to avoid maximization bias.
Background
Earlier algorithms address overestimation with the Double Q-Learning recipe. Although this reduces bias, it introduces high variance (when choosing the action at the next state s', the larger uncertainty is exactly what softens the hard max of DQN, and the price paid is higher variance), which still harms policy optimization. The authors resolve this with Clipped Double Q-Learning.
What method is used?
The authors employ several components to reduce variance:
- The target network from DQN is used for variance reduction by limiting the accumulation of errors (without a target network the updates oscillate and can diverge).
- To address the coupling of value and policy, they propose delaying policy updates until the value estimate has converged.
- They propose a novel SARSA-style regularization of the target update (variance reduction by averaging over value estimates). This follows Nachum et al. (2018), who show that smoothing the value function reduces variance.
- Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.
Multi-step returns can also trade off bias against variance (a standard formulation is recalled below); some further options are collected under Further Reading at the end.
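As a reminder (standard material rather than something introduced in this paper), the $n$-step TD target interpolates between low-variance, high-bias one-step bootstrapping ($n=1$) and high-variance, low-bias Monte Carlo returns, written here in the deterministic actor-critic notation used later:

$$y_{t}^{(n)} = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} Q_{\theta^{\prime}}\left(s_{t+n}, \pi_{\phi^{\prime}}(s_{t+n})\right)$$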
The authors apply these corrections to the Deep Deterministic Policy Gradient algorithm and name the result Twin Delayed Deep Deterministic policy gradient (TD3): an actor-critic method that accounts for the errors introduced by function approximation in both the policy and the value function.
Review of prior algorithms
First, recall the update rule of the DPG algorithm:

$$\nabla_{\phi} J(\phi)=\mathbb{E}_{s \sim p_{\pi}}\left[\left.\nabla_{a} Q^{\pi}(s, a)\right|_{a=\pi(s)} \nabla_{\phi} \pi_{\phi}(s)\right]$$
where

$$Q^{\pi}(s,a) = r+\gamma \mathbb{E}_{s^{\prime},a^{\prime}}\left[Q^{\pi}(s^{\prime},a^{\prime})\right].$$

$Q^{\pi}(s,a)$ can be approximated with parameters $\theta$. DQN additionally uses a frozen target network $Q_{\theta^{\prime}}(s,a)$, and the update target is:

$$y=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, a^{\prime}\right), \quad a^{\prime} \sim \pi_{\phi^{\prime}}\left(s^{\prime}\right)$$

A sketch of how this target and the DPG gradient are used in practice follows.
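A minimal PyTorch-style sketch of one DPG/DDPG-style update step built from the two formulas above (the network and optimizer objects are assumed to exist; this is an illustration, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def ddpg_style_update(batch, actor, critic, actor_target, critic_target,
                      actor_opt, critic_opt, gamma=0.99):
    """One update step: critic regression toward y, then the deterministic policy gradient."""
    state, action, reward, next_state, done = batch

    # Critic: regress Q_theta(s, a) toward y = r + gamma * Q_theta'(s', pi_phi'(s')).
    with torch.no_grad():
        y = reward + gamma * (1.0 - done) * critic_target(next_state, actor_target(next_state))
    critic_loss = F.mse_loss(critic(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient, i.e. minimize -Q_theta(s, pi_phi(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```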
If the estimate is perturbed by an error $\varepsilon$, then:

$$\mathbb{E}_{\varepsilon}\left[\max_{a^{\prime}}\left(Q(s^{\prime},a^{\prime})+\varepsilon\right)\right] \geq \max_{a^{\prime}}Q(s^{\prime},a^{\prime})$$
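A quick numerical illustration of this inequality (my own sketch, not from the paper): even zero-mean noise on the Q-estimates inflates the expected maximum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true Q-values over a small discrete action set.
true_q = np.array([1.0, 0.5, 0.2])

# Average the max over many noisy estimates and compare with the noise-free max.
noisy_max = np.mean([
    np.max(true_q + rng.normal(0.0, 0.5, size=true_q.shape))
    for _ in range(10_000)
])

print(f"max_a Q(s,a)            = {true_q.max():.3f}")
print(f"E[max_a (Q(s,a) + eps)] = {noisy_max:.3f}")  # consistently larger: maximization bias
```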
In the actor-critic framework, let $\phi_{approx}$ denote the policy parameters the actor reaches when following the approximate value function $Q_{\theta}(s,a)$, and let $\phi_{true}$ denote the parameters it would reach when following the true value function $Q^{\pi}(s,a)$ (which is not known during learning):

$$\begin{aligned} \phi_{\text{approx}} &=\phi+\frac{\alpha}{Z_{1}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s) \nabla_{a} Q_{\theta}(s, a)\big|_{a=\pi_{\phi}(s)}\right]\\ \phi_{\text{true}} &=\phi+\frac{\alpha}{Z_{2}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s) \nabla_{a} Q^{\pi}(s, a)\big|_{a=\pi_{\phi}(s)}\right] \end{aligned}$$
where $Z_{1}$ and $Z_{2}$ normalize the gradients, i.e. $Z^{-1}\left\|\mathbb{E}[\cdot]\right\| = 1$. Normalization is assumed because it makes the argument easier to establish; without normalized gradients, overestimation bias is still guaranteed to occur under slightly stricter conditions.
Since the gradient points in the direction of local maximization, there exists a sufficiently small $\varepsilon_{1}$ such that for any step size $\alpha \leq \varepsilon_{1}$, the approximate value of $\pi_{approx}$ is bounded below by the approximate value of $\pi_{true}$ (the approximation overestimates, which is what the following inequality states):

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right]$$
Conversely, there exists a sufficiently small $\varepsilon_{2}$ such that for any $\alpha \leq \varepsilon_{2}$, the true value of $\pi_{approx}$ is bounded above by the true value of $\pi_{true}$ (the actions produced by the approximate policy cannot do better than those of the true policy under the true action-value function):

$$\mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$
If, in addition, the value estimate is at least as large as the true value, $\mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right]$, then combining the three inequalities gives:

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$
Clipped Double Q-Learning
The target used in Double DQN:

$$y = r + \gamma Q_{\theta^{\prime}}\left(s^{\prime},\pi_{\phi}(s^{\prime})\right)$$
Double Q-learning:

$$\begin{array}{l} y_{1}=r+\gamma Q_{\theta_{2}^{\prime}}\left(s^{\prime}, \pi_{\phi_{1}}\left(s^{\prime}\right)\right) \\ y_{2}=r+\gamma Q_{\theta_{1}^{\prime}}\left(s^{\prime}, \pi_{\phi_{2}}\left(s^{\prime}\right)\right) \end{array}$$
Clipped Double Q-learning:

$$y_{1} = r + \gamma \min_{i=1,2}Q_{\theta_{i}^{\prime}}\left(s^{\prime},\pi_{\phi_{1}}(s^{\prime})\right)$$
Here $\phi_{1}$ refers to the target actor (see the pseudocode; only a single actor is used). This approach can introduce an underestimation bias, and because of it the algorithm needs more exploration, otherwise it becomes inefficient.
If $Q_{\theta_{2}} > Q_{\theta_{1}}$, the auxiliary estimate $Q_{\theta_{2}}$ is effectively unused and no additional bias is introduced; if $Q_{\theta_{1}} > Q_{\theta_{2}}$, the minimum selects $Q_{\theta_{2}}$. The appendix of the paper proves convergence. A minimal sketch of the clipped target computation follows.
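A minimal PyTorch-style sketch of the clipped target (names such as `critic1_target` are mine; the official implementation lives at https://github.com/sfujim/TD3):

```python
import torch

@torch.no_grad()
def clipped_double_q_target(reward, next_state, done, gamma,
                            actor_target, critic1_target, critic2_target):
    """y = r + gamma * min_i Q_i'(s', pi'(s')) for a batch of transitions."""
    next_action = actor_target(next_state)
    q1 = critic1_target(next_state, next_action)
    q2 = critic2_target(next_state, next_action)
    target_q = torch.min(q1, q2)   # clip with the smaller of the two estimates
    return reward + gamma * (1.0 - done) * target_q
```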
Addressing Variance
Target networks are used to reduce the variance brought in by policy updates; without them, the approximate state value easily diverges and fails to converge. A sketch of the soft target update is given below.
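For reference, the soft (Polyak) target update used by DDPG/TD3 with rate $\tau$; a minimal sketch assuming PyTorch modules `net` and `target_net`:

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```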
The authors update the policy less frequently than the value function (Delayed Policy Updates). This ensures that the TD error has been minimized before each policy update, so the policy is not updated against a noisy value estimate, which would otherwise raise its variance. A sketch of the schedule follows.
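A minimal sketch of the update schedule (the delay of 2 matches the paper's default; `update_critics`, `update_actor`, and `soft_updates` are hypothetical callbacks standing in for the usual TD3 losses and Polyak updates):

```python
def delayed_policy_training(replay_batches, update_critics, update_actor,
                            soft_updates, policy_freq=2):
    """Critics learn every step; the actor and targets move once per `policy_freq` steps."""
    for step, batch in enumerate(replay_batches):
        update_critics(batch)           # minimize the TD error on both critics
        if step % policy_freq == 0:     # delayed policy update (d = 2 in the paper)
            update_actor(batch)         # deterministic policy gradient through critic 1
            soft_updates()              # move all target networks toward the online networks
```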
Target Policy Smoothing Regularization
The authors argue that similar actions should have similar values, so adding a small amount of noise to the target action makes the learned value function generalize better:

$$\begin{aligned} y &=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, \pi_{\phi^{\prime}}\left(s^{\prime}\right)+\epsilon\right) \\ \epsilon & \sim \operatorname{clip}(\mathcal{N}(0, \sigma),-c, c) \end{aligned}$$
A similar idea appears in Nachum et al. (2018), except that it smooths $Q_{\theta}$ rather than $Q_{\theta^{\prime}}$; a sketch of the smoothed target action is given after the reference.
- Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.
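A minimal PyTorch-style sketch of the smoothed target action (the argument names are mine; $\sigma = 0.2$ and $c = 0.5$ are the paper's defaults):

```python
import torch

@torch.no_grad()
def smoothed_target_action(actor_target, next_state, max_action,
                           sigma=0.2, noise_clip=0.5):
    """epsilon ~ clip(N(0, sigma), -c, c), added to the target policy's action."""
    action = actor_target(next_state)
    noise = (torch.randn_like(action) * sigma).clamp(-noise_clip, noise_clip)
    return (action + noise).clamp(-max_action, max_action)
```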
Algorithm pseudocode (Algorithm 1, TD3, in the paper):
What results were achieved?
The authors compare against current state-of-the-art algorithms, with the following results:
They also verify the effect of the target network on convergence:
Final experiments:
Where was it published? Who are the authors?
This is an ICML 2018 paper. Scott Fujimoto is a PhD student at McGill University and Mila. He is the author of TD3 as well as some of the recent developments in batch deep reinforcement learning. Two of his other papers are also worth reading: Off-Policy Deep Reinforcement Learning without Exploration and Benchmarking Batch Deep Reinforcement Learning Algorithms.
Further reading
- Paper code: https://github.com/sfujim/TD3
To support reproducibility, the authors follow Henderson et al. (2017) and run their experiments over many random seeds.
- Reference: Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.
There are other ways to balance bias and variance, for example:
- importance sampling
- Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
- Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
- distributed methods
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- approximate bounds
- He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.
- reduce discount factor to reduce the contribution of each error
- Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.