Deterministic Policy Gradient Algorithms (ICML, 2014)
Stochastic PGT (SPGT)
- Theorem
$$
\begin{aligned}
\nabla_{\theta} J(\pi_{\theta}) &= \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s \\
&= \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right]
\end{aligned}
$$
- Proof: see https://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf
- PGT derived algorithms
- on-policy AC (a minimal sketch follows this list)
- actor update: PGT
- critic update: any TD learning
- off-policy AC
- actor update: off-policy PGT (TODO: proof, see Degris et al. 2012)
- critic update: any TD learning (or TODO: a more general GAE)
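A minimal sketch of the on-policy actor-critic above, assuming a Gaussian policy with a learned mean, a TD(0) value critic whose TD error stands in for the advantage, and PyTorch. The network sizes, dimensions, and the transition arguments are illustrative assumptions, not part of the paper.

```python
# On-policy AC sketch: actor follows the score-function (PGT) direction,
# critic is trained with TD(0); the TD error serves as the advantage estimate.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99  # assumed dimensions

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

actor = GaussianActor(obs_dim, act_dim)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(obs, act, rew, next_obs, done):
    obs = torch.as_tensor(obs, dtype=torch.float32)
    act = torch.as_tensor(act, dtype=torch.float32)
    next_obs = torch.as_tensor(next_obs, dtype=torch.float32)

    # Critic: TD(0) target r + gamma * V(s')
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * critic(next_obs)
    critic_loss = (critic(obs) - target).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: PGT update, grad log pi(a|s) weighted by the TD error
    with torch.no_grad():
        td_error = target - critic(obs)
    logp = actor.dist(obs).log_prob(act).sum(-1)
    actor_loss = -(logp * td_error).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```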
Intuition of DPGT
- Greedy policy improvement in GPI, $\operatorname{argmax}_a Q(s, a)$, is not suitable for continuous action spaces
- DPGT
- move the policy in the direction of the gradient of Q, rather than globally maximising Q.
- the idea is the same as argmax: change the policy so that it picks actions with larger Q values (the two update rules are contrasted below)
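The two improvement rules side by side: greedy improvement requires a global maximisation over $\mathcal{A}$ at every state, while the gradient-based improvement only needs a local chain-rule step:

$$
\mu^{k+1}(s)=\underset{a}{\operatorname{argmax}}\, Q^{\mu^{k}}(s, a)
\qquad\text{vs.}\qquad
\theta^{k+1}=\theta^{k}+\alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[\nabla_{\theta} Q^{\mu^{k}}\!\left(s, \mu_{\theta}(s)\right)\right],
$$
where, by the chain rule,
$$
\nabla_{\theta} Q^{\mu^{k}}\!\left(s, \mu_{\theta}(s)\right)=\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu^{k}}(s, a)\big|_{a=\mu_{\theta}(s)} .
$$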
Formal DPGT
- Settings
- episodic task
- with discount factor $\gamma$
- for continuing tasks, set $\gamma = 1$ and use the state distribution $\mu_{\theta}(s)$ from Sutton & Barto, Chapter 9
- on-policy
- Objective
$$
J(\mu_{\theta}) = \int_{\mathcal{S}} p_{1}(s)\, V^{\mu_{\theta}}(s)\, \mathrm{d}s
$$
- Theorem
- on-policy DPG
$$
\begin{aligned}
\nabla_{\theta} J(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, \mathrm{d}s \\
&= \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right]
\end{aligned}
$$
where the discounted state distribution is
$$
\rho^{\mu}(s') := \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p\left(s \rightarrow s', t, \mu\right) \mathrm{d}s
$$
- discounted state distribution $\rho^{\mu}(s)$:
- definition: the state distribution, i.e. (roughly) the probability of encountering state $s$ when following policy $\mu_{\theta}$
- computation: sum the probability of reaching $s$ at each time step $t$, weighted by $\gamma^{t-1}$
- sampling from it: simply interact with the environment using policy $\mu_{\theta}$; because of the discount, states encountered later in a trajectory carry smaller weight (a sampling sketch follows below)
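A sketch of what sampling from $\rho^{\mu}$ can look like in practice. The gym-style `env` (Gymnasium `reset`/`step` API) and the policy callable `mu` are assumptions; the $\gamma^{t-1}$ weights are realised by stopping each rollout with probability $1-\gamma$, which returns states in proportion to the normalised discounted distribution.

```python
# Draw one state (approximately) from the discounted state distribution rho^mu
# by rolling out the deterministic policy and stopping geometrically.
import random

def sample_state_from_rho(env, mu, gamma=0.99):
    while True:                                # retry if the episode ends before we stop
        s, _ = env.reset()                     # s_1 ~ p_1
        while True:
            if random.random() > gamma:        # stop with probability 1 - gamma
                return s                       # returned state ~ rho^mu (normalised)
            s, _, terminated, truncated, _ = env.step(mu(s))
            if terminated or truncated:
                break                          # episode ended first: resample a fresh episode
```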
- Regularity Conditions
$$
\begin{aligned}
&\text{Regularity conditions A.1: } p(s' \mid s, a),\ \nabla_{a} p(s' \mid s, a),\ \mu_{\theta}(s),\ \nabla_{\theta} \mu_{\theta}(s),\ r(s, a),\ \nabla_{a} r(s, a),\ p_{1}(s) \text{ are continuous in} \\
&\text{all parameters and variables } s, a, s' \text{ and } x. \\
&\text{Regularity conditions A.2: there exist } b \text{ and } L \text{ such that } \sup_{s} p_{1}(s)<b,\ \sup_{a, s, s'} p(s' \mid s, a)<b,\ \sup_{a, s} r(s, a)<b, \\
&\sup_{a, s, s'}\left\|\nabla_{a} p(s' \mid s, a)\right\|<L, \text{ and } \sup_{a, s}\left\|\nabla_{a} r(s, a)\right\|<L
\end{aligned}
$$
- A.1 guarantees that $V^{\mu_{\theta}}$ is differentiable with respect to $\theta$, and it lets the derivation use:
- the Leibniz integral rule, to swap differentiation and integration: $\nabla \int \rightarrow \int \nabla$ (see the worked step after this list)
- Fubini's theorem, to swap the order of integration
- A.2 guarantees that the gradients are bounded
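Concretely, A.1 (with A.2 keeping everything bounded) is what licenses steps of the form

$$
\nabla_{\theta} \int_{\mathcal{S}} \gamma\, p\left(s' \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}(s')\, \mathrm{d}s'
= \int_{\mathcal{S}} \gamma\, \nabla_{\theta}\left[ p\left(s' \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}(s') \right] \mathrm{d}s'
$$

which appears in the proof below.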
- Part of Proof
$$
\begin{aligned}
\nabla_{\theta} V^{\mu_{\theta}}(s)
=&\ \nabla_{\theta} Q^{\mu_{\theta}}\left(s, \mu_{\theta}(s)\right) \\
=&\ \nabla_{\theta}\left(r\left(s, \mu_{\theta}(s)\right)+\int_{\mathcal{S}} \gamma\, p\left(s' \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}(s')\, \mathrm{d}s'\right) \\
=&\ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} r(s, a)\big|_{a=\mu_{\theta}(s)}+\nabla_{\theta} \int_{\mathcal{S}} \gamma\, p\left(s' \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}(s')\, \mathrm{d}s' \\
=&\ \nabla_{\theta} \textcolor{red}{\mu}_{\theta}(s)\, \nabla_{a} \textcolor{red}{r}(s, a)\big|_{a=\mu_{\theta}(s)} \\
&+\int_{\mathcal{S}} \gamma\left(p\left(s' \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}(s')+\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} \textcolor{red}{p}\left(s' \mid s, a\right)\big|_{a=\mu_{\theta}(s)} V^{\mu_{\theta}}(s')\right) \mathrm{d}s' \\
=&\ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a}\left(r(s, a)+\int_{\mathcal{S}} \gamma\, p\left(s' \mid s, a\right) V^{\mu_{\theta}}(s')\, \mathrm{d}s'\right)\bigg|_{a=\mu_{\theta}(s)}
+\int_{\mathcal{S}} \gamma\, p\left(s' \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}(s')\, \mathrm{d}s' \\
=&\ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu_{\theta}}(s, a)\big|_{a=\mu_{\theta}(s)}+\int_{\mathcal{S}} \gamma\, p\left(s \rightarrow s', 1, \mu_{\theta}\right) \nabla_{\theta} V^{\mu_{\theta}}(s')\, \mathrm{d}s'
\end{aligned}
$$
From the terms highlighted in red we can see:
- the action space must be continuous: the map $\mu_{\theta}: \mathcal{S} \to \mathcal{A}$ is required to be continuous, which requires the action space $\mathcal{A}$ to be continuous. Since no function outputs a state, the state space of the modelled MDP does not have to be continuous; it is enough that every function taking a state as input is continuously defined over the states
- the reward function must be continuous
- the transition (probability) function must be continuous
- TODO: try to prove PGT & DPG for a discrete state space with a continuous action space
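Following the paper's appendix, iterating this recursion unrolls the second term, and integrating over the start-state distribution $p_1$ (exchanging the order of integration by Fubini) yields the theorem:

$$
\nabla_{\theta} V^{\mu_{\theta}}(s)=\int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^{t}\, p\left(s \rightarrow s', t, \mu_{\theta}\right) \nabla_{\theta} \mu_{\theta}(s')\, \nabla_{a} Q^{\mu_{\theta}}(s', a)\big|_{a=\mu_{\theta}(s')}\, \mathrm{d}s'
$$
$$
\nabla_{\theta} J(\mu_{\theta})=\int_{\mathcal{S}} p_{1}(s)\, \nabla_{\theta} V^{\mu_{\theta}}(s)\, \mathrm{d}s
=\int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, \mathrm{d}s
$$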
- off-policy DPG
- Objective
$$
\begin{aligned}
J_{\beta}(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\mu}(s)\, \mathrm{d}s \\
&= \int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}\left(s, \mu_{\theta}(s)\right) \mathrm{d}s
\end{aligned}
$$
- Theorem
$$
\begin{aligned}
\nabla_{\theta} J_{\beta}(\mu_{\theta}) &\approx \int_{\mathcal{S}} \rho^{\beta}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, \mathrm{d}s \\
&= \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right]
\end{aligned}
$$
- Unlike the SPG expectation, there is no importance sampling ratio here. DPG never integrates over the action space, so there is no action-space integral that has to be rewritten as an expectation over the behaviour policy in order to estimate the gradient from samples
- This gives the actor update for off-policy DPG; for full off-policy control the critic must also be able to learn from off-policy data
- Proof
- TODO: may be partly supported by Degris et al. (2012)
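The approximation drops the term that accounts for the dependence of $Q^{\mu_{\theta}}$ itself on $\theta$; roughly, differentiating the objective gives

$$
\nabla_{\theta} J_{\beta}(\mu_{\theta})
=\int_{\mathcal{S}} \rho^{\beta}(s)\Big(\underbrace{\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu_{\theta}}(s, a)\big|_{a=\mu_{\theta}(s)}}_{\text{kept}}
+\underbrace{\nabla_{\theta} Q^{\mu_{\theta}}(s, a)\big|_{a=\mu_{\theta}(s)}}_{\text{dropped, following Degris et al. 2012}}\Big)\, \mathrm{d}s .
$$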
DPGT Derived AC Algorithms
- on-policy AC
- actor update: DPGT
- critic update: SARSA TD learning
- off-policy AC (a minimal sketch follows this list)
- actor update: off-policy DPGT
- critic update: Q-learning (off-policy TD)
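A minimal sketch of the off-policy deterministic actor-critic update: the critic is trained with a Q-learning-style TD target, and the actor follows the DPG direction $\nabla_{\theta}\mu_{\theta}(s)\,\nabla_{a}Q(s,a)|_{a=\mu_{\theta}(s)}$ via the chain rule. The neural networks, dimensions, and replay-batch interface are assumptions; the paper itself uses compatible linear function approximation (COPDAC-Q), and this sketch is closer to the later DDPG setup.

```python
# Off-policy deterministic actor-critic sketch (batch sampled from a replay
# buffer filled by an arbitrary behaviour policy beta, e.g. mu_theta + noise).
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99  # assumed dimensions

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim), nn.Tanh())  # mu_theta(s)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))        # Q_w(s, a)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(obs, act):
    return critic(torch.cat([obs, act], dim=-1)).squeeze(-1)

def update(batch):
    obs, act, rew, next_obs, done = (torch.as_tensor(x, dtype=torch.float32) for x in batch)

    # Critic: Q-learning-style TD target, bootstrapping with the greedy action mu_theta(s')
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * q(next_obs, actor(next_obs))
    critic_loss = (q(obs, act) - target).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: ascend Q(s, mu_theta(s)); backprop realises
    # grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)}
    actor_loss = -q(obs, actor(obs)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```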
7. Discussion and Related Work
- problems of SPG
- the policy gradient $\nabla_{\theta} \pi_{\theta}(a \mid s)$ changes more rapidly near the mean as the policy becomes more deterministic (see the Gaussian example below)
- the inner integral over $\mathcal{A}$ has to be estimated by sampling a potentially high-dimensional action space
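For instance, for a one-dimensional Gaussian policy with mean $\mu_{\theta}(s)$ and standard deviation $\sigma$,

$$
\nabla_{\theta} \log \pi_{\theta}(a \mid s)=\frac{a-\mu_{\theta}(s)}{\sigma^{2}}\, \nabla_{\theta} \mu_{\theta}(s),
$$

so as $\sigma \rightarrow 0$ the score function blows up for actions away from the mean and the variance of the SPG estimator grows; the paper shows (Theorem 2) that DPG is the limiting case of SPG as the policy variance goes to zero.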