Reinforcement Learning Notes - TRPO (1): Basic Theory

TRPO

1. Baseline

  • Policy Gradient

    The policy gradient method is a central algorithm in reinforcement learning. It works with a parameterized policy $\pi(a \mid s, \theta)$: given network parameters $\theta$ and state (environment) $s$, it gives the probability of taking each action $a$. The policy gradient theorem states
    $$
    \frac{\partial V_\pi(s)}{\partial \theta} = E_{A \sim \pi}\!\left[\frac{\partial \ln \pi(A \mid S, \theta)}{\partial \theta} \cdot Q_\pi(S, A)\right]
    $$

  • Baseline

There exists a constant $b$ that does not depend on the action $A$:
$$
\begin{aligned}
E_{A \sim \pi}\!\left[\frac{\partial \ln \pi(A \mid S, \theta)}{\partial \theta} \cdot b\right]
&= b \cdot E_{A \sim \pi}\!\left[\frac{\partial \ln \pi(A \mid S, \theta)}{\partial \theta}\right] \\
&= b \cdot \sum_a \pi(a \mid s, \theta) \cdot \frac{\partial \ln \pi(a \mid s, \theta)}{\partial \theta} \quad \text{(definition of expectation)} \\
&= b \cdot \sum_a \frac{\partial \pi(a \mid s, \theta)}{\partial \theta} \quad \text{(derivative of } \ln x\text{)} \\
&= b \cdot \frac{\partial}{\partial \theta} \sum_a \pi(a \mid s, \theta) = b \cdot \frac{\partial\, 1}{\partial \theta} = 0 \quad \text{(the sum is over } a \text{ and the derivative over } \theta \text{, so they can be exchanged, and } \textstyle\sum_a \pi(a \mid s, \theta) = 1\text{)}
\end{aligned}
$$
So there exists a constant $b$, independent of $A$, whose corresponding term has zero expectation. We can therefore subtract such a baseline inside the gradient: it does not change the expected gradient, but it does change the Monte Carlo approximation, reducing its variance so that the estimate stays closer to the true value (see the sketch after this list).
$$
\frac{\partial V_\pi(s)}{\partial \theta} = E_{A \sim \pi}\!\left[\frac{\partial \ln \pi(A \mid S, \theta)}{\partial \theta} \cdot \big(Q_\pi(S, A) - b\big)\right]
$$

  • Common Baselines

    $b = V_\pi(s_t)$, since $V_\pi(s_t)$ depends only on $S_t$ and not on the action

    $b = 0$
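
To make the effect of the baseline concrete, here is a minimal NumPy sketch (not part of the original notes; the Q-values, parameters, and sample counts are made-up assumptions) for a single state with a tabular softmax policy. It checks that subtracting $b = V_\pi(s)$ leaves the Monte Carlo gradient estimate unchanged in expectation while shrinking its per-sample spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta = rng.normal(size=n_actions)          # policy parameters for one state (assumed)
q_values = np.array([1.0, 2.0, 0.5, 1.5])   # assumed Q_pi(s, a), for illustration only

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - softmax(theta)
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - softmax(theta)

def mc_gradient(baseline, n_samples=20_000):
    """Monte Carlo estimate of E_A[ grad log pi(A|s,theta) * (Q(s,A) - b) ]."""
    pi = softmax(theta)
    samples = []
    for _ in range(n_samples):
        a = rng.choice(n_actions, p=pi)
        samples.append(grad_log_pi(theta, a) * (q_values[a] - baseline))
    samples = np.array(samples)
    return samples.mean(axis=0), samples.std(axis=0)

mean_b0, std_b0 = mc_gradient(baseline=0.0)
v_s = softmax(theta) @ q_values              # b = V_pi(s) = sum_a pi(a|s) * Q(s, a)
mean_bv, std_bv = mc_gradient(baseline=v_s)

print("b = 0      :", mean_b0, std_b0)
print("b = V_pi(s):", mean_bv, std_bv)
# The two mean gradients agree up to Monte Carlo noise (the baseline term has zero
# expectation), while the per-sample standard deviation is smaller with b = V_pi(s).
```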

2. Trust Region Policy Optimization (TRPO)

2.1 Optimization

Gradient Ascent
$$
\text{Find } \theta^* = \arg\max_\theta J(\theta)
$$

  • At $\theta = \theta_{old}$, compute the gradient
    $$
    g = \frac{\partial J(\theta)}{\partial \theta}\bigg|_{\theta = \theta_{old}}
    $$

  • $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$

However, this exact gradient is hard to compute, because $J(\theta) = E_S[V(S;\theta)]$ involves an expectation over states, so stochastic gradient ascent is used instead (sketched after the list below).

Stochastic Gradient

  • $S \leftarrow$ random sampling

  • $g = \dfrac{\partial V(S;\theta)}{\partial \theta}\bigg|_{\theta = \theta_{old}}$, an unbiased stochastic estimate of $\dfrac{\partial J(\theta)}{\partial \theta}$

  • $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$
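
As a small illustration of the update rule above, here is a sketch of a stochastic gradient ascent loop. `sample_state` and `grad_V` are hypothetical placeholders for a state sampler and an estimator of $\partial V(S;\theta)/\partial\theta$; the toy objective at the end is made up just to check the loop.

```python
import numpy as np

def stochastic_gradient_ascent(theta, sample_state, grad_V, alpha=0.05, n_steps=2000):
    """Maximize J(theta) = E_S[V(S; theta)] by repeating: sample S, step along grad_V."""
    for _ in range(n_steps):
        s = sample_state()          # S <- random sampling
        g = grad_V(s, theta)        # stochastic gradient, unbiased for dJ/dtheta
        theta = theta + alpha * g   # theta_new <- theta_old + alpha * g
    return theta

# Toy check: J(theta) = E_S[-(theta - S)^2] with S ~ N(1, 0.1), so theta* = E[S] = 1.
rng = np.random.default_rng(0)
theta_star = stochastic_gradient_ascent(
    theta=0.0,
    sample_state=lambda: rng.normal(1.0, 0.1),
    grad_V=lambda s, th: -2.0 * (th - s),
)
print(theta_star)   # close to 1.0
```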

2.2 Trust Region

Let $N(\theta_{old})$ denote the neighborhood of $\theta_{old}$:
$$
N(\theta_{old}) = \{\theta \mid \|\theta - \theta_{old}\|_2 \leq \Delta\}
$$
Within this neighborhood, $J(\theta)$ is approximated by a surrogate function $L(\theta \mid \theta_{old})$.

Steps

  • Approximation: $J(\theta) \approx L(\theta \mid \theta_{old})$
  • Maximization: $\theta_{new} \leftarrow \arg\max_{\theta \in N(\theta_{old})} L(\theta \mid \theta_{old})$ (see the sketch below)
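
A minimal sketch of these two steps, under the simplifying assumption that the surrogate $L(\theta \mid \theta_{old})$ is just the first-order Taylor model of $J$ at $\theta_{old}$ (an illustrative choice, not the surrogate TRPO actually uses); maximizing a linear model over the ball $\|\theta - \theta_{old}\|_2 \leq \Delta$ then has a closed-form solution.

```python
import numpy as np

def trust_region_step(theta_old, grad_J, delta):
    """One Approximation + Maximization step with a linear surrogate."""
    g = grad_J(theta_old)
    # Approximation: L(theta | theta_old) = J(theta_old) + g . (theta - theta_old)
    # Maximization over ||theta - theta_old||_2 <= delta: move a full radius along g.
    return theta_old + delta * g / (np.linalg.norm(g) + 1e-12)

# Toy usage: maximize J(theta) = -||theta||^2 starting from (3, 4).
grad_J = lambda th: -2.0 * th
theta = np.array([3.0, 4.0])
for _ in range(60):
    theta = trust_region_step(theta, grad_J, delta=0.1)
print(theta)   # ends up within the trust radius of the maximizer at the origin
```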

2.3 Policy-Based RL

  • $\pi(a \mid s; \theta)$

$$
\begin{aligned}
V_\pi(s) &= E_{A \sim \pi}[Q_\pi(S, A)] = \sum_a \pi(a \mid s, \theta) \cdot Q_\pi(s, a) \\
&= \sum_a \pi(a \mid s, \theta_{old}) \cdot Q_\pi(s, a) \cdot \frac{\pi(a \mid s, \theta)}{\pi(a \mid s, \theta_{old})} \\
&= E_{A \sim \pi(\cdot \mid s, \theta_{old})}\!\left[Q_\pi(s, A) \cdot \frac{\pi(A \mid s, \theta)}{\pi(A \mid s, \theta_{old})}\right]
\end{aligned}
$$

  • $J(\theta) = E_S[V_\pi(S)] = E_S\!\left[E_{A \sim \pi(\cdot \mid S, \theta_{old})}\!\left[Q_\pi(S, A) \cdot \dfrac{\pi(A \mid S, \theta)}{\pi(A \mid S, \theta_{old})}\right]\right]$

Using importance sampling, we can evaluate the objective of the new policy from samples generated by the old policy.
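
Below is a small numerical check of this importance-sampling identity (illustrative only; the Q-values and policy parameters are made-up): for one state with tabular softmax policies, the expectation under the new policy is recovered from actions sampled under the old one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
q = np.array([1.0, 2.0, 0.5, 1.5])          # assumed Q_pi(s, a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta_old = rng.normal(size=n_actions)
theta_new = theta_old + 0.3 * rng.normal(size=n_actions)
pi_old, pi_new = softmax(theta_old), softmax(theta_new)

# Exact value under the new policy: V = sum_a pi_new(a|s) * Q(s, a)
exact = pi_new @ q

# Importance-sampled estimate using actions drawn from the OLD policy
actions = rng.choice(n_actions, size=100_000, p=pi_old)
weights = pi_new[actions] / pi_old[actions]
estimate = np.mean(q[actions] * weights)

print(exact, estimate)   # the two agree up to Monte Carlo noise
```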

2.4 TRPO

Problems with the plain policy gradient:

  • Learning is not very stable: the result is highly sensitive to hyperparameters, and the policy can easily slide from a local optimum into a bad policy.

  • Sample efficiency: samples are used very inefficiently.

Two steps:

  • Approximation: $L(\theta \mid \theta_{old})$

    1. Play one episode of the game with the old policy and record the rewards $r_1, r_2, r_3, \dots, r_n$, which are used to estimate $Q_\pi(s_i, a_i)$.

    2. Use importance sampling to estimate
    $$
    L(\theta \mid \theta_{old}) = \frac{1}{n} \sum_{i=1}^{n} Q_\pi(s_i, a_i) \cdot \frac{\pi(a_i \mid s_i, \theta)}{\pi(a_i \mid s_i, \theta_{old})}
    $$
    Here each $Q_\pi(s_i, a_i)$ is determined by the sampled return $u_i$.

    3. Maximize $L(\theta \mid \theta_{old})$ over $\theta$ (the Maximization step below).

  • Maximization: $\theta_{new} \leftarrow \arg\max_{\theta \in N(\theta_{old})} L(\theta \mid \theta_{old})$, where the trust region $N(\theta_{old})$ can be defined by either of the following (a rough end-to-end sketch follows this list):

    1. $\|\theta - \theta_{old}\|_2 \leq \Delta$
    2. KL divergence: $\mathrm{KL}\big(\pi(\cdot \mid s, \theta_{old}) \,\|\, \pi(\cdot \mid s, \theta)\big) \leq \Delta$, measuring the distance between the two distributions so that they stay close enough.
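
Putting the two steps together, below is a rough single-iteration sketch for a tabular softmax policy $\theta[s, a]$. Everything here is an illustrative assumption: `returns` stands in for the sampled returns $u_i$, the gradient is taken by finite differences to keep the code short, and the trust region is enforced by a simple backtracking search on the mean KL rather than the conjugate-gradient procedure of the actual TRPO paper.

```python
import numpy as np

def softmax_rows(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def surrogate_and_kl(theta, theta_old, states, actions, returns):
    pi, pi_old = softmax_rows(theta), softmax_rows(theta_old)
    ratio = pi[states, actions] / pi_old[states, actions]
    L = np.mean(returns * ratio)                                # L(theta | theta_old)
    kl = np.mean(np.sum(pi_old * np.log(pi_old / pi), axis=1))  # mean KL(old || new) over all states
    return L, kl

def trpo_step(theta_old, states, actions, returns, delta=0.01,
              step_size=1.0, n_backtracks=10):
    """One Approximation + Maximization iteration on sampled (s_i, a_i, u_i)."""
    L0, _ = surrogate_and_kl(theta_old, theta_old, states, actions, returns)
    # Gradient of L at theta_old by finite differences (short, not efficient).
    grad, eps = np.zeros_like(theta_old), 1e-5
    for idx in np.ndindex(*theta_old.shape):
        t = theta_old.copy()
        t[idx] += eps
        grad[idx] = (surrogate_and_kl(t, theta_old, states, actions, returns)[0] - L0) / eps
    # Backtracking: largest step that improves L while keeping mean KL within delta.
    for k in range(n_backtracks):
        theta = theta_old + (step_size * 0.5 ** k) * grad
        L, kl = surrogate_and_kl(theta, theta_old, states, actions, returns)
        if kl <= delta and L > L0:
            return theta
    return theta_old

# Toy usage with a made-up batch for a 3-state, 2-action problem.
rng = np.random.default_rng(0)
theta_old = rng.normal(size=(3, 2))
states = rng.integers(0, 3, size=200)
actions = np.array([rng.choice(2, p=softmax_rows(theta_old)[s]) for s in states])
returns = rng.normal(size=200)               # stand-ins for the returns u_i
theta_new = trpo_step(theta_old, states, actions, returns)
print(theta_new - theta_old)                 # a small update within the KL trust region
```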