TRPO
1. Baseline
- Policy Gradient
The policy gradient is one of the most important algorithms in reinforcement learning. It models $\pi(a|s,\theta)$: with network parameters $\theta$, the probability of taking action $a$ when the state (environment) is $s$.
$$\frac{\nabla V_{\pi}(s)}{\nabla \theta} = E_{A \sim \pi}\left[\frac{\nabla \ln\pi(A|S,\theta)}{\nabla \theta} \cdot Q_{\pi}(S,A)\right]$$
- Baseline
There exists a constant $b$ that does not depend on the action $A$:
$$\begin{aligned}
E_{A \sim \pi}\left[\frac{\nabla \ln\pi(A|S,\theta)}{\nabla \theta} \cdot b\right]
&= b \cdot E_{A \sim \pi}\left[\frac{\nabla \ln\pi(A|S,\theta)}{\nabla \theta}\right] \\
&= b \cdot \sum_a \pi(a|S,\theta) \cdot \frac{\nabla \ln\pi(a|S,\theta)}{\nabla \theta} \quad \text{(definition of expectation)} \\
&= b \cdot \sum_a \frac{\nabla \pi(a|S,\theta)}{\nabla \theta} \quad \text{(derivative of } \ln x\text{)} \\
&= b \cdot \frac{\nabla}{\nabla \theta}\sum_a \pi(a|S,\theta) = b \cdot \frac{\nabla 1}{\nabla \theta} = 0 \quad \text{(the sum is over } a\text{, the derivative over } \theta\text{, so they can be exchanged)}
\end{aligned}$$
So any constant $b$ that does not depend on $A$ contributes zero in expectation. Subtracting such a baseline in the policy gradient therefore does not change the gradient in expectation, but it does affect the Monte Carlo approximation: the estimated gradient has lower variance, so the estimate is closer to the true value.
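The zero-expectation claim can be checked numerically. Below is a small sanity check (not from the original notes) using a softmax policy over three actions; the policy, the logits, and the constant `b` are illustrative assumptions.

```python
# Numerical check that E_{A~pi}[ d ln pi(A|s,theta)/d theta * b ] = 0
# for an action-independent constant b, using a 3-action softmax policy.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                     # one logit per action (toy policy)
pi = np.exp(theta) / np.exp(theta).sum()       # softmax policy pi(a | s, theta)

# For a softmax policy: d ln pi(a) / d theta_k = 1{a == k} - pi_k
score = np.eye(3) - pi                          # row a = gradient of ln pi(a) w.r.t. theta
b = 5.0                                         # any constant baseline

expectation = (pi[:, None] * score * b).sum(axis=0)
print(expectation)                              # ~ [0, 0, 0] up to floating-point error
```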
The policy gradient with baseline then becomes:
$$\frac{\nabla V_{\pi}(s)}{\nabla \theta} = E_{A \sim \pi}\left[\frac{\nabla \ln\pi(A|S,\theta)}{\nabla \theta} \cdot (Q_{\pi}(S,A) - b)\right]$$
- Common baselines
  - $b = V_\pi(s_t)$, which depends only on $S_t$ and not on the action
  - $b = 0$
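To make the baseline concrete, here is a minimal sketch of a policy-gradient update with the baseline $b = V_\pi(s_t)$, assuming a discrete-action PyTorch policy; `policy_net`, `value_net`, and the tensor shapes are hypothetical, not part of the original notes.

```python
# Minimal sketch of a policy-gradient update with baseline b = V(s_t).
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3)

def update(states, actions, returns):
    """states: [T, 4], actions: [T], returns: [T] Monte-Carlo estimates of Q(s_t, a_t)."""
    logits = policy_net(states)
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baseline = value_net(states).squeeze(-1)             # b = V(s_t)
    advantage = returns - baseline.detach()              # Q(s,a) - b, lower variance
    pg_loss = -(log_prob * advantage).mean()             # gradient ascent on J(theta)
    v_loss = (baseline - returns).pow(2).mean()          # fit the baseline to the returns
    optimizer.zero_grad()
    (pg_loss + v_loss).backward()
    optimizer.step()
```

With $b = 0$ this reduces to the plain policy gradient (REINFORCE).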
2. Trust Region Policy Optimization (TRPO)
2.1 Optimization
Gradient Ascent
$$
\text{Find } \theta^* = \mathop{\arg\max}_\theta J(\theta)
$$
- Current parameter $\theta_{old}$
- $g = \left.\frac{\nabla J(\theta)}{\nabla \theta}\right|_{\theta = \theta_{old}}$
- $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$

However, the exact gradient is hard to compute, so a stochastic gradient is used instead, with $J(\theta) = E_S[V(S;\theta)]$ (a minimal sketch follows the list below).
Stochastic Gradient
- $S \leftarrow$ random sampling
- $g = \left.\frac{\nabla V(S;\theta)}{\nabla \theta}\right|_{\theta = \theta_{old}}$
- $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$
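As an illustration of these three steps, here is a minimal sketch of stochastic gradient ascent on $J(\theta) = E_S[V(S;\theta)]$ with a toy quadratic $V$; both the sampling distribution of $S$ and $V$ itself are assumptions made for the example.

```python
# Stochastic gradient ascent on J(theta) = E_S[V(S; theta)] with a toy V.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
alpha = 0.1                                   # step size

def grad_V(s, theta):
    # toy V(s; theta) = -(theta - s)^2, so dV/dtheta = -2 * (theta - s)
    return -2.0 * (theta - s)

for _ in range(1000):
    s = rng.normal(loc=2.0, scale=1.0)        # S <- random sampling
    g = grad_V(s, theta)                      # stochastic gradient at theta_old
    theta = theta + alpha * g                 # theta_new <- theta_old + alpha * g
print(theta)                                  # fluctuates around the maximizer E[S] = 2.0
```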
2.2 Trust Region
$N(\theta_{old})$ denotes the neighborhood of $\theta_{old}$:
$$N(\theta_{old}) = \{\theta \mid \|\theta-\theta_{old}\|_2 \leq \Delta\}$$
Within this neighborhood, $J(\theta)$ is approximated by $L(\theta \mid \theta_{old})$.
Steps (a toy sketch follows the list):
- Approximation: $J(\theta) \approx L(\theta \mid \theta_{old})$
- Maximization: $\theta_{new} \leftarrow \mathop{\arg\max}_{\theta \in N(\theta_{old})} L(\theta \mid \theta_{old})$
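The approximation/maximization loop can be illustrated on a toy one-dimensional objective. The sketch below uses a linear surrogate $L$ and an explicit radius $\Delta$; the objective `J`, the surrogate, and all constants are illustrative assumptions, not part of TRPO itself.

```python
# Generic trust-region loop: approximate locally, maximize inside a neighborhood.
import numpy as np

def J(theta):
    return -(theta - 3.0) ** 2        # toy objective to maximize

def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta_old, delta = 0.0, 0.5           # trust-region radius Delta
for _ in range(20):
    g = grad_J(theta_old)
    # Approximation: L(theta | theta_old) = J(theta_old) + g * (theta - theta_old)
    # Maximization of this linear surrogate over |theta - theta_old| <= Delta
    theta_new = theta_old + delta * np.sign(g) if g != 0 else theta_old
    if J(theta_new) >= J(theta_old):  # accept the step only if the true objective improves
        theta_old = theta_new
print(theta_old)                      # converges to the maximizer theta* = 3
```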
2.3 Policy-Based RL
- Policy network: $\pi(a|s;\theta)$
$$\begin{aligned}
V_{\pi}(s) &= E_{A \sim \pi}[Q_\pi(S,A)] = \sum_a \pi(a|s,\theta) \cdot Q_\pi(s,a) \\
&= \sum_a \pi(a|s,\theta_{old}) \cdot Q_\pi(s,a) \cdot \frac{\pi(a|s,\theta)}{\pi(a|s,\theta_{old})} \\
&= E_{A \sim \pi(\cdot|s,\theta_{old})}\left[Q_\pi(s,A) \cdot \frac{\pi(A|s,\theta)}{\pi(A|s,\theta_{old})}\right]
\end{aligned}$$
$$J(\theta) = E_S[V_\pi(S)] = E_S\left[E_{A \sim \pi(\cdot|S,\theta_{old})}\left[Q_\pi(S,A) \cdot \frac{\pi(A|S,\theta)}{\pi(A|S,\theta_{old})}\right]\right]$$
Using importance sampling, the objective under the new policy can be estimated from actions sampled with the old policy, as sketched below.
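A minimal sketch of this importance-sampling estimate, assuming log-probabilities from a PyTorch policy; the function name and tensor arguments are hypothetical.

```python
# Importance-sampling surrogate:
# L(theta | theta_old) ~ mean_i [ Q(s_i, a_i) * pi(a_i|s_i, theta) / pi(a_i|s_i, theta_old) ]
import torch

def surrogate(log_prob_new, log_prob_old, q_values):
    """log_prob_new: log pi(a_i|s_i, theta), differentiable w.r.t. theta.
    log_prob_old: log pi(a_i|s_i, theta_old), from the behavior policy (no gradient).
    q_values: Monte-Carlo estimates of Q_pi(s_i, a_i)."""
    ratio = torch.exp(log_prob_new - log_prob_old.detach())  # pi_theta / pi_theta_old
    return (ratio * q_values).mean()                          # maximize this surrogate
```

Only `log_prob_new` carries gradients with respect to $\theta$; the old policy acts purely as the sampling (behavior) distribution.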
2.4 TRPO
Problems with the vanilla policy gradient:
- Training is not stable: the result is very sensitive to hyperparameters, and the policy can easily move from a local optimum to a bad policy.
- Sample efficiency: samples are used very inefficiently.
TRPO runs in two steps (a simplified sketch follows the list):
- Approximation: construct $L(\theta \mid \theta_{old})$
  1. Play one episode to obtain the rewards $r_1, r_2, \dots, r_n$, which give $Q_\pi(s_i,a_i)$.
  2. Use importance sampling to estimate
     $$L(\theta \mid \theta_{old}) = \sum_{i = 1}^n Q_\pi(s_i,a_i) \cdot \frac{\pi(a_i|s_i,\theta)}{\pi(a_i|s_i,\theta_{old})}$$
     Here $Q_\pi$ is determined by the sampled trajectory.
- Maximization: $\theta_{new} \leftarrow \mathop{\arg\max}_{\theta \in N(\theta_{old})} L(\theta \mid \theta_{old})$
  - Trust-region constraint: $\|\theta - \theta_{old}\| < \Delta$
  - Alternatively, the KL divergence between $\pi(\cdot \mid s,\theta)$ and $\pi(\cdot \mid s,\theta_{old})$ can be used as the distance, keeping the new policy close enough to the old one.
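Putting the two steps together, here is a deliberately simplified sketch: it builds the surrogate $L(\theta \mid \theta_{old})$ from samples and accepts a gradient step only if the new policy stays inside a KL neighborhood of the old one. Real TRPO solves the constrained maximization with a natural-gradient (conjugate-gradient) step plus line search; `policy_net`, the optimizer, and `DELTA` are illustrative assumptions.

```python
# Simplified TRPO-style update: surrogate maximization with a KL trust-region check.
import copy
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)
DELTA = 0.01                                         # KL trust-region radius

def trpo_step(states, actions, q_values):
    old_net = copy.deepcopy(policy_net)              # freeze pi(.|s, theta_old)
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=old_net(states))
    log_prob_old = old_dist.log_prob(actions)

    new_dist = torch.distributions.Categorical(logits=policy_net(states))
    ratio = torch.exp(new_dist.log_prob(actions) - log_prob_old)
    surrogate = (ratio * q_values).mean()            # L(theta | theta_old)

    optimizer.zero_grad()
    (-surrogate).backward()                          # gradient ascent on the surrogate
    optimizer.step()

    # Maximization constraint: undo the update if it leaves the KL neighborhood.
    check_dist = torch.distributions.Categorical(logits=policy_net(states))
    kl = torch.distributions.kl_divergence(old_dist, check_dist).mean()
    if kl > DELTA:
        policy_net.load_state_dict(old_net.state_dict())
```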