Trust Region Policy Optimization (ICML, 2015)
1 Introduction
- policy optimization categories
  - policy iteration (GPI)
  - policy gradient (PG) methods (e.g. TRPO)
  - derivative-free optimization methods
2 Preliminaries
- Consider an infinite-horizon discounted MDP
  - rather than an average-reward one
- objective measurement $\eta(\tilde{\pi})$ (Kakade & Langford, 2002)
  $$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
  - improving the expectation on the right-hand side leads to policy improvement
  - but $\rho_{\tilde{\pi}}(s)$ is hard to estimate, since it depends on the new policy $\tilde{\pi}$ we have not yet deployed
  - so use a local approximation instead
- local approximation $L_{\pi}(\tilde{\pi})$ (Kakade & Langford, 2002)
  $$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
  - matches $\eta(\tilde{\pi})$ to first order (stated concretely below)
  - Kakade & Langford give a lower bound on the improvement
    - but only for mixture policies, i.e. a limited form of policy update
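- first-order matching, stated concretely for a parameterized policy $\pi_{\theta}$ (this is Eq. 2 in the paper):
  $$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \left.\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\right|_{\theta=\theta_0} = \left.\nabla_{\theta} \eta(\pi_{\theta})\right|_{\theta=\theta_0}$$
  - so a sufficiently small step that improves $L_{\pi}$ also improves $\eta$, but this alone does not say how big the step may be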
3 Monotonic Improvement Guarantee for General Stochastic Policies
- Theorem 1: lower bound for general stochastic policies
  $$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}},\; \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$$
  - the KL term can be seen as a penalty
  - $C$ is a large constant, which leads to a very small step size (see the quick check after this list)
- Algorithm 1: monotonic policy iteration
    - requires accurate advantage estimates
    - can be extended to continuous state and action spaces
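A quick check of how large $C$ gets: $\gamma = 0.99$ is a typical discount, and the advantage scale $\epsilon = 1$ is an assumed value for illustration.

```python
gamma, eps = 0.99, 1.0  # eps = max|A| is problem-dependent; 1.0 is an assumed scale
C = 4 * eps * gamma / (1 - gamma) ** 2
print(C)  # 39600.0 -- a penalty this heavy only tolerates tiny policy steps
```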
4 Optimization of Parameterized Policies
- Approximations
  - trust region constraint
    - the penalized step size in Algorithm 1 could be very small
    - so change the KL penalty into a constraint
  - average KL
    - the $\max$ operation requires traversing the whole state space
    - so use the average KL instead (sample estimator sketched below)
    $$\bar{D}_{\mathrm{KL}}^{\rho_{old}}\left(\theta_{1}, \theta_{2}\right) := \mathbb{E}_{\textcolor{red}{s \sim \rho_{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$$
  - the trust-region-constrained optimization problem up to now
    $$\underset{\theta}{\operatorname{maximize}} \sum_{s} \rho_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\text{old}}}(s, a) \quad \text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}\left(\theta_{\text{old}}, \theta\right) \leq \delta$$
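A minimal numpy sketch of the sample-based average-KL estimator for a discrete action space; the function name and array shapes are mine, not from the paper.

```python
import numpy as np

def average_kl(pi_old, pi_new):
    """Monte Carlo estimate of the average KL: each row is an action
    distribution pi(.|s) at a state sampled s ~ rho_old."""
    kl_per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
    return kl_per_state.mean()

# e.g. two nearly identical 3-action policies at 2 sampled states
pi_old = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])
pi_new = np.array([[0.48, 0.32, 0.2], [0.12, 0.58, 0.3]])
print(average_kl(pi_old, pi_new))  # small positive number
```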
5 Sample-Based Estimation of the Objective and Constraint
$$\underset{\theta}{\operatorname{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta$$
- $s \sim \rho_{\theta_{\mathrm{old}}}$
  - by the definition of expectation
- $a \sim q$
  - by importance sampling
- $Q_{\theta_{\mathrm{old}}}(s, a)$ instead of the advantage
  - by the facts that
    - $A(s, a) = Q(s, a) - V(s)$
    - the $V$ term contributes $\sum_{s} \rho_{\theta_{old}}(s) V_{\theta_{old}}(s)$, a constant w.r.t. $\theta$
  - so plain MC returns can be used to estimate $Q$ (there is no equally direct MC estimator for the advantage); see the sketch after this list
- estimation methods
- single path
- Vine
  - better (lower-variance) estimates
  - but needs more simulator calls
  - limited to systems where states can be restored (the simulator must support resetting to an arbitrary state)
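A minimal numpy sketch of the single-path scheme, assuming $q = \pi_{\theta_{old}}$ (function names and shapes are mine, for illustration): discounted reward-to-go along one trajectory serves as the MC estimate of $Q_{\theta_{\mathrm{old}}}$, which then feeds the importance-weighted surrogate.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """MC estimate of Q(s_t, a_t) along a single path: the
    discounted sum of rewards from each timestep onward."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def surrogate_objective(logp_new, logp_old, q_values):
    """Sample estimate of E[(pi_theta / q) * Q_old] with q = pi_old."""
    ratios = np.exp(logp_new - logp_old)
    return np.mean(ratios * q_values)
```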
6 Practical Algorithm
- Update Q estimates from collected trajectories
- Calculate the objective and its constraint
- Solve the constrained problem approximately with the conjugate gradient algorithm plus a line search (see the sketch below)
- references
  - the paper's Appendix C
  - blog post with the theory and an implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
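A minimal numpy sketch of the conjugate gradient solver that gets the step direction from $Fx = g$ without ever materializing the Fisher matrix $F$; `fvp` is assumed to be a function computing Fisher-vector products $Fv$ (e.g. via the double-backprop trick described in the post above).

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products."""
    x = np.zeros_like(g)
    r = g.copy()        # residual g - F x (x starts at zero)
    p = r.copy()        # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x
```

The paper then rescales the direction to $\sqrt{2\delta / (x^{\top} F x)}\, x$ so the step lands on the KL-constraint boundary, and backtracks with a line search on the surrogate objective.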
8 Experiments
- TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient
- the results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress than using a fixed penalty
- TRPO can learn high-quality locomotion controllers from scratch, which is considered a hard problem
- the method is scalable and has strong theoretical foundations