Reinforcement Learning: Trust Region Policy Optimization (TRPO)
1 Trust Region Algorithm
problem:
$\theta^\star=\mathop{argmax}\limits_{\theta} J(\theta)$
repeat:
- Approximation: given $\theta_{old}$, construct $L(\theta|\theta_{old})$ as an approximation of $J(\theta)$, where $\theta$ is restricted to the trust region of $\theta_{old}$, denoted $N(\theta_{old})$.
- Maximization: within the trust region, solve for the updated parameters: $\theta_{new}=\mathop{argmax}\limits_{\theta\in N(\theta_{old})}L(\theta|\theta_{old})$
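The two-step loop above is generic. Below is a minimal sketch of it, assuming a Euclidean trust region and treating `build_surrogate` and `maximize_in_region` as hypothetical callables supplied by the user (they are not part of the original notes).

```python
import numpy as np

def trust_region_optimize(build_surrogate, maximize_in_region,
                          theta_init, num_iters=100, radius=0.1):
    """Generic trust-region loop: approximate J locally, then maximize inside the region.

    build_surrogate(theta_old)          -> L(. | theta_old), a local approximation of J
    maximize_in_region(L, theta_old, r) -> argmax of L over ||theta - theta_old|| < r
    Both callables are hypothetical placeholders for this sketch.
    """
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(num_iters):
        L = build_surrogate(theta)                    # Approximation step
        theta = maximize_in_region(L, theta, radius)  # Maximization step (constrained)
    return theta
```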
2 Trust Region Policy Optimization (TRPO)
- state-value function:
$V_{\pi}(s)=\sum_{a}\pi(a|s;\theta)\,Q_\pi(s,a)=E_{A\sim\pi}[Q_\pi(s,A)]$
- objective function:
$J(\theta)=E_S[V_\pi(S)]$
- approximation (importance sampling with respect to the old policy):
$V_\pi(s)=\sum_a\frac{\pi(a|s;\theta)}{\pi(a|s;\theta_{old})}\cdot Q_\pi(s,a)\cdot \pi(a|s;\theta_{old})=E_{A\sim\pi(\cdot|s;\theta_{old})}\left[\frac{\pi(A|s;\theta)}{\pi(A|s;\theta_{old})}\cdot Q_\pi(s,A)\right]$
$J(\theta)=E_S\left[E_A\left[\frac{\pi(A|S;\theta)}{\pi(A|S;\theta_{old})}\cdot Q_\pi(S,A)\right]\right]$
- trajectory sampled from $\pi(a|s;\theta_{old})$:
$s_1,a_1,r_1,s_2,a_2,r_2,...,s_n,a_n,r_n$
- Monte Carlo approximation:
$L(\theta|\theta_{old})=\frac{1}{n}\sum_{i=1}^n \frac{\pi(a_i|s_i;\theta)}{\pi(a_i|s_i;\theta_{old})}\cdot Q_\pi(s_i,a_i)$
$Q_\pi(s_i,a_i)\approx u_i=r_i + \gamma r_{i+1}+...+\gamma^{n-i}r_n$
$\tilde{L}(\theta|\theta_{old})=\frac{1}{n}\sum_{i=1}^n \frac{\pi(a_i|s_i;\theta)}{\pi(a_i|s_i;\theta_{old})}\cdot u_i$
- trust region $\theta\in N(\theta_{old})$ (a concrete sketch of the surrogate and the KL constraint follows this list):
option 1: $||\theta-\theta_{old}||<\Delta$
option 2: $\frac{1}{n}\sum_{i=1}^n KL[\pi(\cdot |s_i;\theta)\,||\,\pi(\cdot|s_i;\theta_{old})]<\Delta$
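To make the Monte Carlo approximation above concrete, here is a minimal NumPy sketch that, for one sampled trajectory, computes the discounted returns $u_i$, the surrogate $\tilde{L}(\theta|\theta_{old})$, and the mean-KL quantity of option 2. It assumes a discrete action space; `policy_probs(theta, s)` is a hypothetical helper returning the distribution $\pi(\cdot|s;\theta)$ as a probability vector.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """u_i = r_i + gamma*r_{i+1} + ... + gamma^{n-i}*r_n, accumulated backwards."""
    u = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        u[i] = running
    return u

def surrogate_and_kl(theta, theta_old, states, actions, rewards,
                     policy_probs, gamma=0.99):
    """Evaluate L~(theta | theta_old) and the mean KL over the sampled states.

    policy_probs(theta, s) is a hypothetical helper returning pi(.|s; theta)
    as a probability vector over the discrete action set.
    """
    u = discounted_returns(rewards, gamma)
    ratios, kls = [], []
    for s, a in zip(states, actions):
        p_new = policy_probs(theta, s)
        p_old = policy_probs(theta_old, s)
        ratios.append(p_new[a] / p_old[a])                 # importance ratio pi / pi_old
        kls.append(np.sum(p_new * np.log(p_new / p_old)))  # KL[pi_new || pi_old] at s_i
    L_tilde = np.mean(np.asarray(ratios) * u)              # Monte Carlo surrogate L~
    mean_kl = float(np.mean(kls))                          # option 2 trust-region measure
    return L_tilde, mean_kl
```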
3 Summary for TRPO
- trajectory sampled from $\pi(\cdot|s;\theta_{old})$:
$s_1,a_1,r_1,s_2,a_2,r_2,...,s_n,a_n,r_n$
- discounted returns:
$u_i=\sum_{k=i}^n \gamma^{k-i}\cdot r_k$
- approximation:
$\tilde L(\theta|\theta_{old})=\frac{1}{n}\sum_{i=1}^n \frac{\pi(a_i|s_i;\theta)}{\pi(a_i|s_i;\theta_{old})}\cdot u_i$
- maximization (a sketch follows this list):
$\theta_{new}=\mathop{argmax}\limits_{\theta \in N(\theta_{old})}\tilde L(\theta|\theta_{old})$
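The maximization step is left abstract above. Published TRPO solves it with a natural-gradient (conjugate-gradient) step followed by a line search; purely as an illustration of "maximize the surrogate while staying inside the trust region", the hedged sketch below takes plain gradient-ascent steps on $\tilde L$ and rejects any step whose mean KL exceeds $\Delta$. `surrogate_and_kl` is the helper from the previous section; `grad_surrogate` is a hypothetical gradient routine (in practice it would come from automatic differentiation).

```python
import numpy as np

def trpo_maximization(theta_old, states, actions, rewards, policy_probs,
                      surrogate_and_kl, grad_surrogate,
                      delta=0.01, step_size=0.1, num_steps=20):
    """Approximately solve argmax L~(theta | theta_old) s.t. mean KL < delta (option 2).

    grad_surrogate(theta, theta_old, states, actions, rewards, policy_probs) is a
    hypothetical routine returning the gradient of L~ with respect to theta.
    """
    theta = np.array(theta_old, dtype=float, copy=True)
    for _ in range(num_steps):
        g = grad_surrogate(theta, theta_old, states, actions, rewards, policy_probs)
        candidate = theta + step_size * g
        _, mean_kl = surrogate_and_kl(candidate, theta_old,
                                      states, actions, rewards, policy_probs)
        if mean_kl < delta:      # candidate stays inside the trust region: accept it
            theta = candidate
        else:                    # otherwise shrink the step and try again
            step_size *= 0.5
    return theta
```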
by CyrusMay 2022 07 14