[RL 6] Deterministic Policy Gradient Algorithms (ICML, 2014)

Stochastic PGT (SPGT)

  1. Theorem
    $$\begin{aligned} \nabla_{\theta} J(\pi_{\theta}) &= \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \,\mathrm{d}a \,\mathrm{d}s \\ &= \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right] \end{aligned}$$
    • Proof see: https://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf
  2. PGT derived algorithms
    1. on-policy AC
      1. actor update: PGT
      2. critic update: any TD learning (a minimal on-policy AC sketch follows this list)
    2. off-policy AC
      1. actor update: off-PGT (TODO proof see Degris 2012)
      2. critic: any TD (or TODO more general GAE)
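As a concrete illustration of the on-policy pairing above, here is a minimal sketch assuming a linear-Gaussian policy and a linear critic with a hand-rolled feature map (function names such as `actor_critic_step` are illustrative, not from the paper):

```python
# Minimal on-policy actor-critic sketch: the actor uses the stochastic PGT
# (grad log pi(a|s) * Q(s,a)); the critic uses one-step Sarsa-style TD learning.
# The linear-Gaussian policy and linear critic are assumptions for illustration.
import numpy as np

def gaussian_log_prob_grad(theta, s, a, sigma=0.5):
    """Gradient w.r.t. theta of log N(a | theta @ s, sigma^2)."""
    return (a - theta @ s) / sigma**2 * s

def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      gamma=0.99, alpha_actor=1e-3, alpha_critic=1e-2):
    """One on-policy update. Critic Q_w(s,a) = w @ phi(s,a), with phi(s,a) = [s, a]."""
    phi = np.concatenate([s, [a]])
    phi_next = np.concatenate([s_next, [a_next]])
    td_error = r + gamma * (w @ phi_next) - (w @ phi)   # Sarsa TD error
    w = w + alpha_critic * td_error * phi               # critic: TD learning
    q_sa = w @ phi                                      # plug-in estimate of Q^pi(s,a)
    theta = theta + alpha_actor * q_sa * gaussian_log_prob_grad(theta, s, a)  # actor: SPGT
    return theta, w
```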

Intuition: DPGT

  1. Greedy policy improvement in GPI
    • argmax over Q is not suitable for a continuous action space
  2. DPGT
    • move the policy in the direction of the gradient of Q, rather than globally maximising Q
    • same idea as the argmax: change the policy so that it picks actions with larger Q values (see the sketch after this list)
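A tiny numerical illustration of this point, assuming a toy quadratic critic (not from the paper): greedy improvement needs a global argmax over the action space, while a DPG-style improvement only takes a local gradient step on Q.

```python
# Contrast greedy improvement (argmax over actions) with a DPG-style improvement
# (a single gradient step on Q in action space). The quadratic Q is an assumption.
import numpy as np

def q(s, a):                      # toy critic: Q(s, a) = -(a - s)^2, maximised at a = s
    return -(a - s) ** 2

def dq_da(s, a):                  # analytic gradient of the toy Q w.r.t. the action
    return -2.0 * (a - s)

s, a = 1.0, 0.2                   # current state and current policy output mu_theta(s)
greedy_a = max(np.linspace(-2, 2, 401), key=lambda x: q(s, x))  # argmax: global search over A
dpg_a = a + 0.1 * dq_da(s, a)     # DPG: nudge the action uphill on Q
print(greedy_a, dpg_a)            # argmax jumps to ~1.0; the gradient step moves 0.2 -> 0.36
```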

Formal DPGT

  1. Settings
    • episodic
    • with discount factor
    • for a continuing task, set $\gamma=1$ and use the state distribution $\mu_\theta(s)$ from chapter 9 of the RL book
  2. on-policy
    1. Objective
      $$J(\mu_{\theta}) = \int_{\mathcal{S}} p_{1}(s)\, V^{\mu_{\theta}}(s) \,\mathrm{d}s$$
    2. Theorem
      1. on-policy DPG
        $$\begin{aligned} \nabla_{\theta} J(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \,\mathrm{d}s \\ &= \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
        where $\rho^{\mu}(s') = \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p(s \rightarrow s', t, \mu_\theta)\, \mathrm{d}s$
        1. discounted state distribution $\rho^\mu(s)$:
          • Definition: a state distribution; intuitively, the (discounted) probability of encountering a state when following policy $\mu_\theta$
          • Computation: sum, over all time steps t, the probability of reaching the state at step t, weighted by $\gamma^{t-1}$
          • Sampling from it: simply roll out policy $\mu_\theta$ in the environment; since the policy maximises the cumulative discounted reward, later rewards (and the corresponding states) carry less weight. (A numerical sketch of the resulting DPG estimator follows at the end of this Formal DPGT section.)
    3. Regularity Conditions
       Regularity conditions A.1: $p(s' \mid s, a)$, $\nabla_{a} p(s' \mid s, a)$, $\mu_{\theta}(s)$, $\nabla_{\theta} \mu_{\theta}(s)$, $r(s, a)$, $\nabla_{a} r(s, a)$, $p_{1}(s)$ are continuous in all parameters and variables $s$, $a$, $s'$ and $x$.
       Regularity conditions A.2: there exist a $b$ and an $L$ such that $\sup_{s} p_{1}(s) < b$, $\sup_{a, s, s'} p(s' \mid s, a) < b$, $\sup_{a, s} r(s, a) < b$, $\sup_{a, s, s'} \left\|\nabla_{a} p(s' \mid s, a)\right\| < L$, and $\sup_{a, s} \left\|\nabla_{a} r(s, a)\right\| < L$.
      • A.1 guarantees that $V^{\mu_\theta}$ is differentiable with respect to $\theta$, and lets the derivation use
        1. the Leibniz integral rule, to swap differentiation and integration: $\nabla \int \rightarrow \int \nabla$
        2. Fubini's theorem, to swap the order of integration
      • A.2 guarantees that the gradients are bounded
    4. Part of Proof
      $$\begin{aligned} \nabla_{\theta} V^{\mu_{\theta}}(s) &= \nabla_{\theta} Q^{\mu_{\theta}}(s, \mu_{\theta}(s)) \\ &= \nabla_{\theta}\left(r(s, \mu_{\theta}(s)) + \int_{\mathcal{S}} \gamma p(s' \mid s, \mu_{\theta}(s))\, V^{\mu_{\theta}}(s') \,\mathrm{d}s'\right) \\ &= \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} r(s, a)\big|_{a=\mu_{\theta}(s)} + \nabla_{\theta} \int_{\mathcal{S}} \gamma p(s' \mid s, \mu_{\theta}(s))\, V^{\mu_{\theta}}(s') \,\mathrm{d}s' \\ &= \nabla_{\theta} \textcolor{red}{\mu_{\theta}}(s)\, \nabla_{a} \textcolor{red}{r}(s, a)\big|_{a=\mu_{\theta}(s)} + \int_{\mathcal{S}} \gamma\left(p(s' \mid s, \mu_{\theta}(s))\, \nabla_{\theta} V^{\mu_{\theta}}(s') + \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} \textcolor{red}{p}(s' \mid s, a)\big|_{a=\mu_{\theta}(s)}\, V^{\mu_{\theta}}(s')\right) \mathrm{d}s' \\ &= \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a}\left(r(s, a) + \int_{\mathcal{S}} \gamma p(s' \mid s, a)\, V^{\mu_{\theta}}(s') \,\mathrm{d}s'\right)\bigg|_{a=\mu_{\theta}(s)} + \int_{\mathcal{S}} \gamma p(s' \mid s, \mu_{\theta}(s))\, \nabla_{\theta} V^{\mu_{\theta}}(s') \,\mathrm{d}s' \\ &= \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu_{\theta}}(s, a)\big|_{a=\mu_{\theta}(s)} + \int_{\mathcal{S}} \gamma p(s \rightarrow s', 1, \mu_{\theta})\, \nabla_{\theta} V^{\mu_{\theta}}(s') \,\mathrm{d}s' \end{aligned}$$
      From the terms highlighted in red, we can conclude:
      1. The action space must be continuous: the derivation requires the function $\mu_{\theta}$ (a mapping S -> A) to be continuous, which in turn requires the action space A to be continuous. At the same time, since no function here outputs a state, the state space of the modelled MDP does not necessarily need to be continuous; it suffices that the functions taking a state as input are defined for those states.
      2. The reward function must be continuous.
      3. The transition function (a probability density) must be continuous.
      4. TODO: try to prove the PGT & DPG for a discrete state space & continuous action space.
  3. off-policy DPG
    1. Objective
      $$\begin{aligned} J_{\beta}(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\mu}(s) \,\mathrm{d}s \\ &= \int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}(s, \mu_{\theta}(s)) \,\mathrm{d}s \end{aligned}$$
    2. Theorem
      $$\begin{aligned} \nabla_{\theta} J_{\beta}(\mu_{\theta}) &\approx \int_{\mathcal{S}} \rho^{\beta}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \,\mathrm{d}s \\ &= \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
      1. Compared with the off-policy SPG expectation, the DPG expression contains no importance sampling ratio. This is because DPG involves no integral over the action space, so there is no need to rewrite such an integral as an expectation over the behaviour policy's actions in order to estimate the gradient by sampling.
      2. This expression only gives the off-policy DPG actor update; to achieve off-policy control, the critic must also be able to learn from off-policy data.
    3. Proof
      1. TODO: may be partly supported by Degris et al. (2012)
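The estimator $\mathbb{E}_{s}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q(s, a)\big|_{a=\mu_{\theta}(s)}\right]$ above is just the chain rule applied to $Q(s, \mu_{\theta}(s))$. A minimal numerical sketch, assuming PyTorch and tiny linear actor/critic modules (all names here are illustrative):

```python
# Backpropagate through Q(s, mu_theta(s)) to obtain the DPG actor gradient.
import torch

state_dim, action_dim = 3, 1
actor = torch.nn.Linear(state_dim, action_dim)        # mu_theta(s)
critic = torch.nn.Linear(state_dim + action_dim, 1)   # Q_w(s, a)
for p in critic.parameters():                         # freeze the critic so only theta gets a gradient
    p.requires_grad_(False)

s = torch.randn(64, state_dim)                        # batch of states, assumed sampled from rho
a = actor(s)                                          # a = mu_theta(s), keeps the graph back to theta
q = critic(torch.cat([s, a], dim=1))                  # Q(s, mu_theta(s))
actor_loss = -q.mean()                                # ascend the DPG <=> descend -Q
actor_loss.backward()                                 # chain rule: grad_theta mu(s) * grad_a Q(s,a)
print(actor.weight.grad.shape)                        # gradient w.r.t. the actor parameters theta
```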

DPGT Derived AC Algorithms

  1. on-policy AC
    1. actor update: DPGT
    2. critic update: Sarsa TD learning
  2. off-policy AC
    1. actor update: off-DPGT
    2. critic update: Q-learning (off-policy TD learning); see the sketch after this list
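A simplified sketch of the off-policy pairing (linear actor and critic with a hand-rolled feature map; this illustrates the two update rules under those assumptions, and is not the paper's COPDAC-Q with compatible function approximation):

```python
# One off-policy deterministic actor-critic step with linear function approximation.
import numpy as np

def off_policy_dac_step(theta, w, s, a_beta, r, s_next,
                        gamma=0.99, alpha_actor=1e-4, alpha_critic=1e-3):
    """theta: actor weights, mu_theta(s) = theta @ s (scalar action).
       w: critic weights, Q_w(s, a) = w @ phi(s, a) with phi(s, a) = [s, a]."""
    phi = np.concatenate([s, [a_beta]])                 # behaviour action a_beta ~ beta(.|s)
    a_next = theta @ s_next                             # bootstrap with the target action mu_theta(s')
    phi_next = np.concatenate([s_next, [a_next]])
    td_error = r + gamma * (w @ phi_next) - (w @ phi)   # Q-learning-style TD error
    w = w + alpha_critic * td_error * phi               # critic: off-policy TD update
    dq_da = w[-1]                                       # grad_a Q_w(s,a) is constant (= w[-1]) for phi = [s, a]
    theta = theta + alpha_actor * dq_da * s             # actor: grad_theta mu_theta(s) * grad_a Q = s * dq_da
    return theta, w
```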

7. Discussion and Related Work

  1. problems of SPG
    1. as the policy becomes nearly deterministic (its variance shrinks), the policy gradient $\nabla_{\theta} \pi_{\theta}(a \mid s)$ changes increasingly rapidly near the mean
    2. the inner integral over actions must be estimated by sampling a possibly high-dimensional action space (see the variance sketch below)
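A small Monte Carlo check of point 1, under an assumed one-dimensional Gaussian policy with a toy critic: the per-sample SPG estimates keep the right mean, but their spread grows roughly like $1/\sigma$ as the policy's variance shrinks.

```python
# Variance of the score-function (SPG) estimator as the policy becomes nearly deterministic.
import numpy as np

rng = np.random.default_rng(0)

def q(a):                                  # toy critic at a fixed state
    return -(a - 1.0) ** 2

mu = 0.0                                   # policy mean, the parameter being differentiated
for sigma in [1.0, 0.1, 0.01]:
    a = rng.normal(mu, sigma, size=100_000)
    score = (a - mu) / sigma**2            # d/d(mu) of log N(a | mu, sigma^2)
    grads = score * q(a)                   # per-sample SPG estimates
    print(sigma, grads.mean(), grads.std())  # mean stays near 2; std grows roughly like 1/sigma
```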