I. Constructing the Optimization Problem
1. Starting from the expected cumulative reward
Denote the expected cumulative reward of a policy $\pi$ by $\eta(\pi)$:

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho(s_0),\, a_t \sim \pi(a|s),\, s_{t+1}\sim P(s_{t+1}|s_t,a_t)}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t)\right], \tag{1}$$

where $\rho(s_0)$ is the distribution of the initial state $s_0$.
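As a concrete reference point, the infinite discounted sum in (1) can be estimated by truncated Monte Carlo rollouts. Below is a minimal sketch; `reset`, `step`, and `policy` are hypothetical stubs for the environment and policy, not anything defined in the paper.

```python
import numpy as np

def estimate_eta(reset, step, policy, gamma=0.99, n_episodes=100, horizon=1000):
    """Monte Carlo estimate of eta(pi) = E[ sum_t gamma^t r(s_t) ].

    Hypothetical stubs: reset() samples s0 ~ rho(s0); policy(s) samples
    a ~ pi(a|s); step(s, a) returns (next_state, reward).
    """
    returns = []
    for _ in range(n_episodes):
        s = reset()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):  # truncate the infinite horizon
            a = policy(s)
            s, r = step(s, a)
            ret += discount * r
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))
```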
It can be shown that

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0 \sim \rho(s_0),\, a_t \sim \tilde{\pi}(a|s),\, s_{t+1}\sim P(s_{t+1}|s_t,a_t)}\left[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)\right]; \tag{2}$$

see Appendix A, Lemma 1 of the TRPO paper for the proof. Note that in the expectation of Eq. (2), the actions $a_t$ are sampled from the *new* policy $\tilde\pi$. Define the discounted state visitation frequency

$$\rho_{\pi}(s) = P(s_0=s)+\gamma P(s_1=s) + \cdots.$$
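For a finite MDP with a known transition matrix, $\rho_\pi$ has a closed form: writing $\mu_t = (P_\pi^\top)^t \mu_0$ for the state distribution at time $t$, the geometric series gives $\rho_\pi = \sum_t \gamma^t \mu_t = (I-\gamma P_\pi^\top)^{-1}\mu_0$. A small numeric sketch (the two-state chain below is a hypothetical example):

```python
import numpy as np

def discounted_visitation(P_pi, mu0, gamma=0.9):
    """Closed-form rho_pi for a finite MDP.

    P_pi[s, s'] = P(s_{t+1}=s' | s_t=s) under policy pi; mu0[s] = P(s_0=s).
    Since rho_pi = sum_t gamma^t (P_pi^T)^t mu0, summing the series gives
    rho_pi = (I - gamma * P_pi^T)^{-1} mu0.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)

# Toy 2-state chain (hypothetical numbers):
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
mu0 = np.array([1.0, 0.0])
print(discounted_visitation(P_pi, mu0))  # unnormalized: entries sum to 1/(1-gamma)
```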
With this definition, Eq. (2) can be rewritten as

$$\begin{aligned}\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s}P(s_t=s|\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\,\gamma^t A_\pi(s,a) \\ &= \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty} \gamma^t P(s_t=s|\tilde{\pi}) \sum_{a}\tilde{\pi}(a|s)A_\pi(s,a) \\ &= \eta(\pi) + \sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a).\end{aligned} \tag{3}$$

The left-hand side of Eq. (3) is the expected cumulative reward $\eta(\tilde{\pi})$ of the new policy $\tilde{\pi}$; the right-hand side is the expected cumulative reward $\eta(\pi)$ of the old policy $\pi$ plus the term $\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a)$. The idea of TRPO is to improve policy performance by maximizing this term. It cannot be optimized directly, however: the factor $\rho_{\tilde{\pi}}(s)$ puts an unknown quantity inside the objective, since $\rho_{\tilde{\pi}}(s)$ is not available before $\tilde\pi$ itself has been found, and differentiating through it is complicated and hard to compute.
2. Approximating and simplifying the objective function
To sidestep the problem above, Eq. (3) is first approximated by

$$L_{\pi}(\tilde\pi) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a), \tag{4}$$

where, note, $\rho_{\tilde{\pi}}(s)$ has been replaced by $\rho_{\pi}(s)$.
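Because the states in the second term of (4) are weighted by $\rho_\pi$ and the action sum can be corrected with the likelihood ratio $\tilde\pi(a|s)/\pi(a|s)$, that term can be estimated from rollouts collected under the old policy $\pi$ (up to the $1/(1-\gamma)$ normalization of $\rho_\pi$). A minimal sketch, where all inputs are hypothetical arrays gathered from such rollouts:

```python
import numpy as np

def surrogate_advantage_term(pi_new_probs, pi_old_probs, advantages):
    """Sample estimate of sum_s rho_pi(s) sum_a pi~(a|s) A_pi(s,a),
    up to the 1/(1-gamma) normalization of rho_pi.

    For each sampled (s_t, a_t) visited under the old policy pi, the
    arrays hold pi~(a_t|s_t), pi(a_t|s_t), and an advantage estimate
    A_pi(s_t, a_t) (all hypothetical inputs).
    """
    ratio = pi_new_probs / pi_old_probs  # importance weight pi~ / pi
    return float(np.mean(ratio * advantages))
```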
Equation (4) has the following properties:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}),\qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_{\theta})\big|_{\theta = \theta_0}, \tag{5}$$

i.e., $L$ matches $\eta$ to first order at $\theta_0$ (see the separate proof of Eq. (5)). The following theorem shows the relationship between $L_{\pi}(\tilde\pi)$ and $\eta(\tilde{\pi})$:
$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde\pi) - C\left(D_{TV}^{max}(\pi, \tilde\pi)\right)^2,$$

where

$$C=\frac{4\epsilon\gamma}{(1-\gamma)^2},\qquad \epsilon=\max_{s,a}\left|A_\pi(s,a)\right|,$$

$$D_{TV}^{max}(\pi, \tilde\pi)=\max_{s}D_{TV}\big(\pi(\cdot|s),\tilde\pi(\cdot|s)\big)=\max_{s}\frac{1}{2}\sum_{a}\big|\pi(a|s)-\tilde\pi(a|s)\big|.$$

The proof is in Appendix A, Lemmas 2 and 3 of the TRPO paper. The paper then replaces the TV divergence with the KL divergence: by the inequality

$$D_{TV}(\pi, \tilde\pi)^2 \le D_{KL}(\pi, \tilde\pi),$$

we obtain
$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde\pi) - C\,D_{KL}^{max}(\pi, \tilde\pi), \tag{6}$$

with the same $C=\frac{4\epsilon\gamma}{(1-\gamma)^2}$, $\epsilon=\max_{s,a}|A_\pi(s,a)|$. Maximizing the right-hand side $L_{\pi}(\tilde\pi) - C D_{KL}^{max}(\pi, \tilde\pi)$ therefore guarantees a monotonic improvement of the policy performance $\eta(\tilde{\pi})$.
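As a quick sanity check of the inequality $D_{TV}(\pi,\tilde\pi)^2 \le D_{KL}(\pi,\tilde\pi)$ used to pass from the TV bound to Eq. (6), one can compare the two quantities numerically on random discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # total variation distance: (1/2) * sum_a |p(a) - q(a)|
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # KL divergence D_KL(p || q) for discrete distributions
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    assert tv(p, q) ** 2 <= kl(p, q)  # the Pinsker-type inequality above
    print(f"TV^2 = {tv(p, q)**2:.4f} <= KL = {kl(p, q):.4f}")
```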
3. Further simplifying the optimization problem
Write $\pi = \pi_{\theta_{old}}$ and $\tilde\pi=\pi_{\theta}$. The optimization problem can then be stated as

$$\operatorname*{maximize}_{\theta}\ \left[L_{\theta_{old}}(\theta)-C\,D_{KL}^{max}(\theta_{old}, \theta)\right].$$

The paper points out that if $C$ is set to its theoretical value, each update step is very small and learning is inefficient. The penalized problem is therefore replaced by a trust-region constrained problem:

$$\operatorname*{maximize}_{\theta}\ L_{\theta_{old}}(\theta),\qquad \text{s.t.}\ D_{KL}^{max}(\theta_{old}, \theta) \le \delta.$$
The paper further relaxes the max-KL constraint to an average-KL constraint:

$$\operatorname*{maximize}_{\theta}\ L_{\theta_{old}}(\theta),\qquad \text{s.t.}\ \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta,$$

where

$$\bar D_{KL}^{\rho}(\theta_1,\theta_2) = \mathbb{E}_{s\sim\rho}\left[D_{KL}\big(\pi_{\theta_1}(\cdot|s)\,\|\,\pi_{\theta_2}(\cdot|s)\big)\right].$$
Since samples in reinforcement learning are generally collected by interaction, the objective above can also be written in expectation form. Start from

$$\operatorname*{maximize}_{\theta}\ \sum_{s}\rho_{\theta_{old}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{old}}(s,a),\qquad \text{s.t.}\ \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta.$$

Here $\sum_{s}\rho_{\theta_{old}}(s)[\dots]$ corresponds to sampling states according to $s\sim\rho_{\theta_{old}}(s)$ and can be written as $\mathbb{E}_{s\sim \rho_{\theta_{old}}(s)}[\dots]$. Moreover, samples are usually collected with some stochastic behavior policy $q(\cdot|s)$ to increase exploration of the state space, so the inner sum $\sum_a\pi_{\theta}(a|s)(\dots)$ does not match the data, in which $a_t$ comes from $q(\cdot|s)$; importance sampling rewrites it as $\mathbb{E}_{a\sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}(\dots)\right]$. Altogether, the optimization problem becomes

$$\operatorname*{maximize}_{\theta}\ \mathbb{E}_{s\sim \rho_{\theta_{old}},\,a\sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}A_{\theta_{old}}(s,a)\right],\qquad \text{s.t.}\ \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta. \tag{7}$$
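Below is a minimal PyTorch sketch of the sample-based objective and constraint in (7) for a categorical policy, under the common single-path choice $q = \pi_{\theta_{old}}$; the function name and inputs are hypothetical:

```python
import torch

def surrogate_and_kl(logits_new, logits_old, actions, advantages):
    """Sample estimates of the objective and constraint in Eq. (7).

    logits_new: pi_theta outputs on sampled states (requires grad);
    logits_old: pi_theta_old outputs (the sampler q = pi_theta_old here);
    actions (int64), advantages: data from rollouts (hypothetical inputs).
    """
    logp_new = torch.log_softmax(logits_new, dim=-1)
    logp_old = torch.log_softmax(logits_old, dim=-1).detach()
    # importance ratio pi_theta(a|s) / q(a|s) at the sampled actions
    idx = actions.unsqueeze(-1)
    ratio = torch.exp(logp_new.gather(-1, idx) - logp_old.gather(-1, idx)).squeeze(-1)
    objective = (ratio * advantages).mean()
    # average KL over sampled states: E_s[ KL(pi_old(.|s) || pi_theta(.|s)) ]
    p_old = logp_old.exp()
    avg_kl = (p_old * (logp_old - logp_new)).sum(-1).mean()
    return objective, avg_kl
```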
II. Solving the Optimization Problem
The paper follows the idea of sequential quadratic programming: it linearizes the objective, applies a second-order Taylor expansion to the constraint, and solves problem (7) iteratively.
1. Linearizing the objective function
Let $f(\theta) = \frac{\pi_{\theta}(a|s)}{q(a|s)}A_{\theta_{old}}(s,a)$. Then, to first order,

$$f(\theta) \approx f(\theta_{old})+\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}}(\theta-\theta_{old}),$$

where the gradient $\nabla_{\theta}f(\theta)|_{\theta=\theta_{old}}$ can be computed by automatic differentiation.
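For a neural-network policy this is one reverse-mode pass; a minimal sketch using torch.autograd (the helper name and the commented usage are hypothetical):

```python
import torch

def flat_grad(scalar, params):
    """Gradient of a scalar w.r.t. a list of parameters, flattened
    into a single vector (this is the b of Eq. (9) below)."""
    grads = torch.autograd.grad(scalar, params)
    return torch.cat([g.reshape(-1) for g in grads])

# Usage sketch (policy_net and the data tensors are hypothetical):
# objective, _ = surrogate_and_kl(policy_net(states), logits_old, actions, advantages)
# b = flat_grad(objective, list(policy_net.parameters()))
```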
2. Second-order Taylor expansion of the constraint
The second-order Taylor expansion of the KL divergence takes the following form:

$$g(\theta) = D_{KL}(p_{\theta_0}\,\|\,p_{\theta}) = D_{KL}(p_{\theta_0}\,\|\,p_{\theta_0}) + \nabla_\theta D_{KL}(p_{\theta_0}\,\|\,p_{\theta})\big|_{\theta = \theta_0}(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)^T F(\theta-\theta_0) + O(\|\theta-\theta_0\|^3).$$

The zeroth-order term vanishes, $D_{KL}(p_{\theta_0}\,\|\,p_{\theta_0})=0$, and so does the first-order term:

$$\begin{aligned}\nabla_\theta D_{KL}(p_{\theta_0}\,\|\,p_{\theta})\big|_{\theta = \theta_0} &= \nabla_\theta \int p_{\theta_0}(x)\left[\log p_{\theta_0}(x)-\log p_{\theta}(x)\right]dx\,\Big|_{\theta = \theta_0} \\ &= -\int p_{\theta_0}(x)\,\nabla_\theta\log p_{\theta}(x)\,dx\,\Big|_{\theta = \theta_0} \\ &= -\int p_{\theta_0}(x)\,\frac{\nabla_\theta p_{\theta}(x)}{p_{\theta}(x)}\,dx\,\Big|_{\theta = \theta_0} \\ &= -\nabla_\theta \int p_\theta(x)\,dx\,\Big|_{\theta = \theta_0} = -\nabla_\theta 1 = 0.\end{aligned}$$

Hence only the quadratic term remains. $F$ is the Fisher information matrix (see the definition of the Fisher information matrix; the full derivation is given in the note on the second-order Taylor expansion of the KL divergence).
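In practice $F$ is never formed explicitly: since $F$ equals the Hessian of the average KL at $\theta = \theta_{old}$, the product $Fv$ needed later can be computed with a double-backward pass. A minimal sketch (the damping term is a common practical addition, not part of the derivation):

```python
import torch

def fisher_vector_product(avg_kl, params, v, damping=0.1):
    """Compute F v, where F is the Hessian of `avg_kl` at the current
    parameters (equal to the Fisher matrix at theta = theta_old).

    Double-backward trick: F v = grad( grad(avg_kl)^T v ).
    `damping` adds a small multiple of the identity for numerical
    stability (a standard practical tweak).
    """
    grads = torch.autograd.grad(avg_kl, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    gv = (flat * v).sum()
    hv = torch.autograd.grad(gv, params)
    return torch.cat([h.reshape(-1) for h in hv]) + damping * v
```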
In summary, the optimization problem can be written as

$$\operatorname*{maximize}_{\theta}\ \mathbb{E}_{s\sim \rho_{\theta_{old}},\,a\sim q}\left[f(\theta_{old})+\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}}(\theta-\theta_{old})\right],\qquad \text{s.t.}\ \frac{1}{2}(\theta - \theta_{old})^T F(\theta-\theta_{old}) \le \delta. \tag{8}$$
3. Efficiently solving the optimization problem
Applying the method of Lagrange multipliers to problem (8) gives the optimality condition

$$\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}} + \lambda F(\theta-\theta_{old})=0,$$

i.e., the update direction $g=\theta-\theta_{old}$ satisfies (up to scaling)

$$Fg = b, \tag{9}$$

where $F$ is the Fisher information matrix and $b=\nabla_{\theta}f(\theta)|_{\theta=\theta_{old}}$. The paper solves Eq. (9) efficiently with the conjugate gradient method. Once the update direction $g$ has been obtained from (9), the step size $\beta$ must satisfy the KL constraint, i.e.

$$\delta = D_{KL} \approx \frac{1}{2}(\beta g)^T F(\beta g)= \frac{1}{2}\beta^2\, g^T F g.$$
Solving gives $\beta = \sqrt{2\delta/g^T F g}$. Starting from $\beta = \sqrt{2\delta/g^T F g}$, the paper performs a line search for the $\beta^*$ that maximizes

$$L_{\theta_{old}}(\theta)-\mathbb{I}\big(D_{KL}(\theta_{old},\theta)\le \delta\big),$$

where $\mathbb{I}(\cdot)$ is an indicator penalty that equals $0$ when the condition holds and $+\infty$ otherwise.
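Putting these pieces together, here is a compact sketch of the solve-and-search procedure: conjugate gradient on $Fg=b$ using only Fisher-vector products, an initial step $\beta = \sqrt{2\delta/g^T F g}$, and backtracking until the surrogate improves while the KL constraint holds (the paper phrases this as maximizing $L$ minus the indicator penalty; accepting the first feasible improving step is the usual implementation). The `fvp` and `eval_surrogate_and_kl` closures are hypothetical hooks, e.g. built from the earlier sketches:

```python
import torch

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Solve F g = b for g, using only Fisher-vector products `fvp`."""
    g = torch.zeros_like(b)
    r, p = b.clone(), b.clone()   # residual and search direction (x0 = 0)
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        g += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return g

def trpo_step(fvp, b, delta, eval_surrogate_and_kl, max_backtracks=10):
    """Direction via CG, initial step beta = sqrt(2*delta / g^T F g),
    then backtracking until the surrogate improves and KL <= delta.

    eval_surrogate_and_kl(step) -> (surrogate improvement, avg KL) is a
    hypothetical closure evaluating a candidate parameter offset.
    """
    g = conjugate_gradient(fvp, b)
    beta = torch.sqrt(2 * delta / (g @ fvp(g)))
    for i in range(max_backtracks):
        step = (0.5 ** i) * beta * g   # halve the step on each backtrack
        improve, kl = eval_surrogate_and_kl(step)
        if improve > 0 and kl <= delta:
            return step
    return torch.zeros_like(b)  # no acceptable step found
```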