[5-Minute Paper] Deterministic Policy Gradient Algorithms

  • Paper title: Deterministic Policy Gradient Algorithms

Title and author information

What problem does it solve?

Stochastic-policy methods involve randomness in action selection, so they are sample-inefficient and suffer from high variance. A deterministic policy is more sample-efficient than a stochastic one, but it cannot explore the environment on its own, so it has to be trained off-policy, with exploration supplied by a separate behaviour policy.

Background

Previous work models the policy as an action distribution $\pi_{\theta}(a|s)$; the authors instead propose outputting a deterministic policy $a = \mu_{\theta}(s)$.

In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.

  • Stochastic Policy Gradient

Prior work uses off-policy stochastic policy gradient methods, with a behaviour policy $\beta(a|s) \neq \pi_{\theta}(a|s)$:

$$
\begin{aligned}
J_{\beta}\left(\pi_{\theta}\right) &= \int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\pi}(s)\,\mathrm{d}s \\
&= \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s)\, \pi_{\theta}(a|s)\, Q^{\pi}(s,a)\,\mathrm{d}a\,\mathrm{d}s
\end{aligned}
$$

Differentiating the performance objective and applying an approximation gives the off-policy policy gradient (Degris et al., 2012b):

$$
\begin{aligned}
\nabla_{\theta} J_{\beta}\left(\pi_{\theta}\right) &\approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s)\, \nabla_{\theta}\pi_{\theta}(a|s)\, Q^{\pi}(s,a)\,\mathrm{d}a\,\mathrm{d}s \\
&= \mathbb{E}_{s \sim \rho^{\beta},\, a \sim \beta}\!\left[\frac{\pi_{\theta}(a|s)}{\beta_{\theta}(a|s)}\, \nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi}(s,a)\right]
\end{aligned}
$$

This approximation drops a term that depends on the action-value gradient $\nabla_{\theta} Q^{\pi}(s,a)$ (Degris et al., 2012b).
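
To make this estimator concrete, here is a small numpy sketch (my own illustration, not code from the paper) of a single importance-weighted gradient sample $\frac{\pi_{\theta}(a|s)}{\beta(a|s)}\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi}(s,a)$ for a tabular softmax target policy and a uniform behaviour policy; the critic value `Q_hat` is just a placeholder.

```python
# A minimal sketch (not the paper's code) of one importance-weighted
# off-policy gradient sample for a tabular softmax target policy and a
# uniform behaviour policy.
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))      # policy parameters

def pi(s):
    """Softmax target policy pi_theta(.|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a|s) for the tabular softmax parameterisation."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

beta = np.full(n_actions, 1.0 / n_actions)          # uniform behaviour policy

# One transition sampled from the behaviour policy; Q_hat stands in for a critic.
s = rng.integers(n_states)
a = rng.choice(n_actions, p=beta)
Q_hat = 1.0                                         # placeholder critic value

grad_sample = (pi(s)[a] / beta[a]) * grad_log_pi(s, a) * Q_hat
print(grad_sample)                                  # same shape as theta
```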

Update rule for $\mu_{\theta}(s)$:

$$
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[\nabla_{\theta} Q^{\mu^{k}}\!\left(s, \mu_{\theta}(s)\right)\right]
$$

Applying the chain rule:

$$
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu^{k}}(s,a)\big|_{a=\mu_{\theta}(s)}\right]
$$
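
As a toy sanity check of this chain-rule form (my own example, not the paper's), the sketch below runs the update for a linear policy $\mu_{\theta}(s) = \Theta s$ against an assumed quadratic critic whose action gradient is known in closed form; `W_star`, the step size, and the dimensions are all illustrative choices.

```python
# A toy numpy sketch of the chain-rule update above, with a linear policy
# mu_theta(s) = Theta @ s and an assumed critic Q(s, a) = -||a - W_star @ s||^2.
import numpy as np

state_dim, action_dim = 3, 2
rng = np.random.default_rng(0)
Theta = rng.normal(size=(action_dim, state_dim))    # policy parameters
W_star = rng.normal(size=(action_dim, state_dim))   # defines the assumed critic

def mu(Theta, s):
    return Theta @ s                                # deterministic action

def grad_a_Q(s, a):
    return -2.0 * (a - W_star @ s)                  # analytic grad_a Q(s, a)

alpha = 0.05
for _ in range(500):
    s = rng.normal(size=state_dim)                  # sample a state
    a = mu(Theta, s)
    # For mu(s) = Theta @ s we have d a_i / d Theta[j, k] = (i == j) * s[k],
    # so grad_theta mu(s) * grad_a Q collapses to an outer product.
    Theta += alpha * np.outer(grad_a_Q(s, a), s)

print(np.abs(Theta - W_star).max())                 # should shrink towards 0
```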

What method is used?

  • On-Policy Deterministic Actor-Critic

If the environment itself is noisy enough to give the agent exploration, this algorithm can still work. The critic is updated with SARSA, using $Q^{w}(s,a)$ to approximate the true action-value $Q^{\mu}$:

$$
\begin{aligned}
\delta_{t} &= r_{t} + \gamma Q^{w}\left(s_{t+1}, a_{t+1}\right) - Q^{w}\left(s_{t}, a_{t}\right) \\
w_{t+1} &= w_{t} + \alpha_{w}\, \delta_{t}\, \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\
\theta_{t+1} &= \theta_{t} + \alpha_{\theta}\, \nabla_{\theta}\mu_{\theta}\left(s_{t}\right)\, \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\big|_{a=\mu_{\theta}(s)}
\end{aligned}
$$
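
A minimal sketch of one such step, assuming a linear critic $Q^{w}(s,a) = w^{\top}\phi(s,a)$; the callables `phi`, `grad_a_phi`, `mu`, `grad_theta_mu` and the step sizes are my own notation, not the paper's.

```python
# One on-policy deterministic actor-critic step, assuming a linear critic
# Q^w(s, a) = w^T phi(s, a). Arrays are numpy vectors; theta is a flat
# parameter vector and grad_theta_mu(theta, s) returns the Jacobian of
# mu_theta(s) with shape (len(theta), action_dim).
def dac_step(w, theta, s, a, r, s_next, a_next,
             phi, grad_a_phi, mu, grad_theta_mu,
             gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One SARSA critic update followed by one deterministic actor update."""
    # TD error with the on-policy (SARSA) target r + gamma * Q^w(s', a').
    delta = r + gamma * (w @ phi(s_next, a_next)) - w @ phi(s, a)

    # Critic: for the linear critic, grad_w Q^w(s, a) = phi(s, a).
    w_new = w + alpha_w * delta * phi(s, a)

    # Actor: theta <- theta + alpha_theta * grad_theta mu(s) * grad_a Q^w(s, a)|_{a=mu(s)}.
    a_mu = mu(theta, s)
    grad_a_Q = grad_a_phi(s, a_mu).T @ w            # d/da [w^T phi(s, a)] at a = mu(s)
    theta_new = theta + alpha_theta * grad_theta_mu(theta, s) @ grad_a_Q
    return w_new, theta_new, delta
```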

  • Off-Policy Deterministic Actor-Critic

We modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy:

$$
\begin{aligned}
J_{\beta}\left(\mu_{\theta}\right) &= \int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\mu}(s)\,\mathrm{d}s \\
&= \int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}\!\left(s, \mu_{\theta}(s)\right)\mathrm{d}s
\end{aligned}
$$

$$
\begin{aligned}
\nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) &\approx \int_{\mathcal{S}} \rho^{\beta}(s)\, \nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\,\mathrm{d}s \\
&= \mathbb{E}_{s \sim \rho^{\beta}}\!\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\right]
\end{aligned}
$$

This gives the off-policy deterministic actor-critic (OPDAC) algorithm:

$$
\begin{aligned}
\delta_{t} &= r_{t} + \gamma Q^{w}\!\left(s_{t+1}, \mu_{\theta}\left(s_{t+1}\right)\right) - Q^{w}\left(s_{t}, a_{t}\right) \\
w_{t+1} &= w_{t} + \alpha_{w}\, \delta_{t}\, \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\
\theta_{t+1} &= \theta_{t} + \alpha_{\theta}\, \nabla_{\theta}\mu_{\theta}\left(s_{t}\right)\, \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\big|_{a=\mu_{\theta}(s)}
\end{aligned}
$$

Unlike the stochastic off-policy algorithm, because the policy here is deterministic, no importance sampling is needed.
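
Under the same assumptions as the on-policy sketch above (linear critic $Q^{w}(s,a) = w^{\top}\phi(s,a)$, user-supplied `phi`, `grad_a_phi`, `mu`, `grad_theta_mu`), one OPDAC step can be written as follows; the only change is the critic target, which bootstraps from $\mu_{\theta}(s_{t+1})$, and no importance weight appears in either update.

```python
# One OPDAC step; same illustrative conventions as dac_step above, except the
# critic bootstraps from the target policy's action mu_theta(s_{t+1}).
def opdac_step(w, theta, s, a, r, s_next,
               phi, grad_a_phi, mu, grad_theta_mu,
               gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One off-policy critic update followed by one deterministic actor update."""
    a_greedy = mu(theta, s_next)                    # action the target policy would take
    delta = r + gamma * (w @ phi(s_next, a_greedy)) - w @ phi(s, a)

    w_new = w + alpha_w * delta * phi(s, a)         # critic update

    a_mu = mu(theta, s)
    grad_a_Q = grad_a_phi(s, a_mu).T @ w            # grad_a Q^w(s, a)|_{a = mu_theta(s)}
    theta_new = theta + alpha_theta * grad_theta_mu(theta, s) @ grad_a_Q
    return w_new, theta_new, delta
```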

What results were achieved?

Experimental results

Publication and author information

This paper was published at ICML 2014. The first author, David Silver, is a Research Scientist at Google DeepMind. He did his undergraduate and master's studies at the University of Cambridge and his PhD at the University of Alberta in Canada, joined DeepMind in 2013, and is one of the creators and the project leader of AlphaGo.

David Silver

References

  • Reference: Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.

Further reading

Suppose the true action-value function is $Q^{\pi}(s,a)$ and we approximate it with a function $Q^{w}(s,a) \approx Q^{\pi}(s,a)$. If the function approximator is compatible, i.e. (1) $Q^{w}(s,a) = \nabla_{\theta}\log\pi_{\theta}(a|s)^{\top} w$ (linear in these "features"), and (2) the parameters $w$ are chosen to minimise the mean-squared error $\varepsilon^{2}(w) = \mathbb{E}_{s\sim\rho^{\pi},\, a\sim\pi_{\theta}}\left[\left(Q^{w}(s,a) - Q^{\pi}(s,a)\right)^{2}\right]$ (a linear regression problem in these features), then there is no bias (Sutton et al., 1999):

$$
\nabla_{\theta} J\left(\pi_{\theta}\right) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\!\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{w}(s,a)\right]
$$
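
As a toy illustration (mine, not the paper's), the sketch below builds the compatible features $\phi(s,a) = \nabla_{\theta}\log\pi_{\theta}(a|s)$ for a tabular softmax policy, fits $w$ by least squares against sampled return targets, and forms the resulting policy-gradient estimate; the return signal and problem sizes are assumptions for the example only.

```python
# A toy sketch of compatible function approximation for a tabular softmax
# policy: critic features phi(s, a) = grad_theta log pi_theta(a|s), w fit by
# least squares, then the policy gradient estimated with Q^w in place of Q^pi.
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):                              # the compatible feature phi(s, a)
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g.ravel()

# Sample (s, a, return) triples on-policy; the return is a toy stand-in for
# Q^pi(s, a) targets.
data = []
for _ in range(500):
    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=pi(s))
    G = float(s % n_actions == a)                   # assumed toy return signal
    data.append((s, a, G))

Phi = np.stack([grad_log_pi(s, a) for s, a, _ in data])
G = np.array([g for _, _, g in data])
w, *_ = np.linalg.lstsq(Phi, G, rcond=None)         # minimise the MSE in these features

# Policy-gradient estimate with the compatible critic Q^w(s, a) = phi(s, a)^T w.
grad_J = np.mean([(Phi[i] @ w) * Phi[i] for i in range(len(data))], axis=0)
print(grad_J.reshape(n_states, n_actions))
```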

Finally, the paper gives a compatible function approximation theorem for DPG with linear function approximators, together with the theoretical foundations and proofs.

  • Reference: Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.

I should reread this paper when I have time; some of the proofs still need to be worked through carefully.

