The theoretical foundation of reinforcement learning is the MDP (Markov Decision Process). Once the policy $\pi$ of an MDP is fixed, the MDP reduces to a plain Markov process. We first review some basic MDP concepts:
(1) The cumulative discounted return under policy $\pi$: $G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k}$, where $\gamma\in[0,1)$ is the discount factor and $R_t$ denotes the reward at time $t$.
(2) The action-value function under policy $\pi$, $q_{\pi}(s,a)$. Definition: $q_{\pi}(s,a)=\mathbf{E}[G_t\mid s_t=s,a_t=a,\pi]$;
expansion: $q_{\pi}(s,a)=r(s,a)+\gamma\sum_{s'}\mathbf{Pr}(s'\mid s,a)\,v_{\pi}(s')$, where $r(s,a)$ is the expected reward for taking action $a$ in state $s$.
(3) The state-value function under policy $\pi$, $v_{\pi}(s)$. Definition: $v_{\pi}(s)=\mathbf{E}[G_t\mid s_t=s,\pi]$;
expansion: $v_{\pi}(s)=\mathbf{E}_{a\sim\pi(\cdot\mid s)}[q_{\pi}(s,a)]=\sum_{a}\pi(a\mid s)\,q_{\pi}(s,a)$.
(4) The Bellman equation: $v_{\pi}(s)=\sum_{a}\pi(a\mid s)\big(r(s,a)+\gamma\sum_{s'}\mathbf{Pr}(s'\mid s,a)\,v_{\pi}(s')\big)$, whose matrix form is $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$.
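The matrix form can be solved in closed form as $v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi}$. Below is a minimal sketch of this exact policy-evaluation step; the 3-state chain $P_{\pi}$ and rewards $r_{\pi}$ are made up purely for illustration:

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by a fixed policy pi:
# P_pi[i, j] = Pr(s' = j | s = i); r_pi[i] = expected one-step reward in state i.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Bellman equation v = r + gamma * P v  =>  v = (I - gamma * P)^{-1} r.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)
```

This closed-form $v_{\pi}$ serves as the ground truth that the iterative TD estimates below should converge to.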
(Convergence of the TD algorithm) Under policy $\pi$, the agent's interaction with the environment produces a random sequence $(s_0,a_0,r_1,s_1,a_1,r_2,s_2,\dots)$. Suppose the value function at time $t$ is updated by the iteration
$$v_{t+1}(s_t)=v_t(s_t)-\alpha_t(s_t)\big[v_t(s_t)-(r_{t+1}+\gamma v_t(s_{t+1}))\big],\qquad v_{t+1}(s)=v_t(s),\ \forall s\ne s_t.$$
If the following conditions hold:
(1) the state space $S$ is finite;
(2) $\forall s\in S$, $\sum_t\alpha_t(s)=\infty$ and $\sum_t\alpha_t^2(s)<\infty$;
then $\forall s\in S$, $v_t(s)\rightarrow v_{\pi}(s)$ w.p.1.
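A minimal sketch of this tabular TD(0) iteration on the hypothetical chain from above; the visit-count step size $\alpha_t(s)=1/N_t(s)$ satisfies condition (2):

```python
import numpy as np

rng = np.random.default_rng(0)
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

v = np.zeros(3)        # TD estimate v_t
visits = np.zeros(3)   # visit counts N_t(s), used for alpha_t(s) = 1 / N_t(s)
s = 0
for _ in range(200_000):
    s_next = rng.choice(3, p=P_pi[s])
    visits[s] += 1
    alpha = 1.0 / visits[s]   # sum_t alpha_t(s) = inf, sum_t alpha_t^2(s) < inf
    # TD(0): move v(s) toward the bootstrapped target r + gamma * v(s').
    v[s] -= alpha * (v[s] - (r_pi[s] + gamma * v[s_next]))
    s = s_next

v_true = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)        # TD estimate
print(v_true)   # closed-form v_pi; the two should be close
```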
Proof. At time $t$, if the current state is $s_t=s$, then $\alpha_t(s)>0$; otherwise $\alpha_t(s)=0$ for all $s\ne s_t$. The iteration can therefore be rewritten as:
$$v_{t+1}(s)=v_t(s)-\alpha_t(s)\big[v_t(s)-(r_{t+1}+\gamma v_t(s'))\big]=(1-\alpha_t(s))v_t(s)+\alpha_t(s)\big[r_{t+1}+\gamma v_t(s')\big],\quad\forall s\in S,\ t=t_0,t_1,\dots$$
where $s'$ is the next state reached from $s$ at the current time $t$. Let $\Delta_{k+1}(s)=v_{k+1}(s)-v_{\pi}(s)$; substituting this into the iteration above gives:
$$\Delta_{k+1}(s)=(1-\alpha_k(s))\Delta_k(s)+\alpha_k(s)\big[r_{k+1}+\gamma v_k(s_{k+1})-v_{\pi}(s)\big]$$
where $s_k=s$ denotes the state at the current time $k$ and $s_{k+1}$ the state at time $k+1$. Define
$$e_k(s)=r_{k+1}+\gamma v_k(s_{k+1})-v_{\pi}(s),\qquad\Delta_k=[\Delta_k(s_1),\Delta_k(s_2),\dots,\Delta_k(s_{|S|})]^T,$$
$$H_k=\{\Delta_k,\Delta_{k-1},\dots,e_{k-1},\dots,\alpha_{k-1},\dots\},$$
$$e_k=[e_k(s_1),e_k(s_2),\dots,e_k(s_{|S|})]^T,\qquad v_{\pi}=[v_{\pi}(s_1),\dots,v_{\pi}(s_{|S|})]^T.$$
Moreover,
$$\mathbf{E}[v_k(s_{k+1})\mid H_k]=\mathbf{E}_{s_{k+1}}[v_k(s_{k+1})\mid s_k=s]=\sum_{s'}\mathbf{Pr}(s'\mid s)\,v_k(s'),$$
from which we obtain:
$$\|\mathbf{E}[e_k\mid H_k]\|_{\infty}=\|r_{\pi}+\gamma P_{\pi}v_k-v_{\pi}\|_{\infty}=\|r_{\pi}+\gamma P_{\pi}v_k-(r_{\pi}+\gamma P_{\pi}v_{\pi})\|_{\infty}=\gamma\|P_{\pi}(v_k-v_{\pi})\|_{\infty}\le\gamma\|v_k-v_{\pi}\|_{\infty}=\gamma\|\Delta_k\|_{\infty}$$
(the inequality holds because $P_{\pi}$ is row-stochastic, so $\|P_{\pi}x\|_{\infty}\le\|x\|_{\infty}$ for any $x$). Similarly, $\mathbf{Var}[e_k\mid H_k]$ can be shown to be bounded. By the extension of Dvoretzky's convergence theorem, $\Delta_k(s)\rightarrow 0$ w.p.1, i.e., $v_k(s)\rightarrow v_{\pi}(s)$ w.p.1.
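A quick numerical sanity check of the contraction step $\|P_{\pi}(v_k-v_{\pi})\|_{\infty}\le\|v_k-v_{\pi}\|_{\infty}$, using a randomly generated row-stochastic matrix as a stand-in for $P_{\pi}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)  # normalize rows: P is row-stochastic
x = rng.standard_normal(n)         # stands in for the error vector v_k - v_pi

lhs = np.abs(P @ x).max()          # ||P x||_inf: each entry is a convex average of x
rhs = np.abs(x).max()              # ||x||_inf
print(lhs, rhs)                    # lhs <= rhs always holds
```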
(Convergence of linear value-function approximation) Consider the linear approximation $\hat{v}(s;w)=\phi(s)^Tw$ with feature vector $\phi(s)\in R^m$. Suppose the TD algorithm is used to update $w$ so that $\hat{v}(s;w)$ approximates $v_{\pi}(s)$, i.e., to solve:
$$\min_{w}\mathbf{E}_{s\sim d(\cdot)}\big[(\hat{v}(s;w)-v_{\pi}(s))^2\big],$$
which TD approximates by replacing the unknown $v_{\pi}(s)$ with the bootstrapped target:
$$\min_{w}\mathbf{E}_{s_t\sim d(\cdot)}\big[(\hat{v}(s_t;w_t)-(r_{t+1}+\gamma\hat{v}(s_{t+1};w_t)))^2\big].$$
The parameters are updated with the following iteration:
$$w_{t+1}=w_t+\alpha_t\,\mathbf{E}_t\big[(r_{t+1}+\gamma\phi^T(s_{t+1})w_t-\phi^T(s_t)w_t)\,\phi(s_t)\big]$$
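In practice the expectation is replaced by sampled transitions, giving semi-gradient TD(0) with linear features. A minimal sketch on the hypothetical chain from above, with made-up features $\phi(s)$ and step-size schedule:

```python
import numpy as np

rng = np.random.default_rng(2)
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9
Phi = np.array([[1.0, 0.0],   # rows are phi(s)^T: 3 states, m = 2 features
                [0.0, 1.0],
                [1.0, 1.0]])

w = np.zeros(2)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P_pi[s])
    alpha = 10.0 / (100.0 + t)         # decaying step size
    td_error = r_pi[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
    w += alpha * td_error * Phi[s]     # sampled version of the update above
    s = s_next

print(Phi @ w)   # learned estimate of v_pi at the TD fixed point
```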
Then the following conclusions hold:
(1) The expectation in the iteration can be written as:
$$\mathbf{E}_t\big[(r_{t+1}+\gamma\phi^T(s_{t+1})w_t-\phi^T(s_t)w_t)\,\phi(s_t)\big]=b-Aw_t$$
where $A=\Phi^TD(I-\gamma P_{\pi})\Phi\in R^{m\times m}$ and $b=\Phi^TDr_{\pi}\in R^m$, with
$$\Phi=\begin{pmatrix}\vdots\\\phi^T(s)\\\vdots\end{pmatrix}\in R^{|S|\times m},\qquad D=\begin{pmatrix}\ddots&&\\&d_{\pi}(s)&\\&&\ddots\end{pmatrix}\in R^{|S|\times|S|},$$
where $D$ is the diagonal matrix of the state distribution $d_{\pi}$ under policy $\pi$.
(2) When the parameters are updated by gradient descent as $w_{t+1}=w_t+\alpha_t(b-Aw_t)$, if $\sum_t\alpha_t=\infty$ and $\sum_t\alpha_t^2<\infty$, or under certain other conditions, then $w_t\rightarrow w^*=A^{-1}b$; the induced estimate $\Phi w^*$ is the TD fixed-point approximation of $v_{\pi}$ (and recovers $v_{\pi}$ exactly when $v_{\pi}$ lies in the span of the features, e.g., the tabular case $\Phi=I$).
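Continuing the sketch above, $A$ and $b$ can be formed explicitly and the fixed point $w^*=A^{-1}b$ compared against the exact $v_{\pi}$; here $d_{\pi}$ is taken to be the stationary distribution of the hypothetical chain (its left eigenvector for eigenvalue 1):

```python
import numpy as np

P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

# Stationary distribution d_pi: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d /= d.sum()
D = np.diag(d)

A = Phi.T @ D @ (np.eye(3) - gamma * P_pi) @ Phi  # A = Phi^T D (I - gamma P) Phi
b = Phi.T @ D @ r_pi                              # b = Phi^T D r_pi
w_star = np.linalg.solve(A, b)

v_true = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(Phi @ w_star)  # TD fixed-point estimate of v_pi
print(v_true)        # exact v_pi; equal only if v_pi lies in span(Phi)
```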
Proof. (1) The proof is omitted; see the original book for the details.
(2) Let $\delta_t=w_t-w^*$ with $w^*=A^{-1}b$. Substituting into $w_{t+1}=w_t+\alpha_t(b-Aw_t)$ gives $\delta_{t+1}=(I-\alpha_tA)\delta_t$, i.e.:
$$\delta_{t+1}=\prod_{k=0}^{t}(I-\alpha_kA)\,\delta_0$$
If $\alpha_t=\alpha$ for all $t$, then
$$\|\delta_{t+1}\|\le\|(I-\alpha A)^{t+1}\|\,\|\delta_0\|.$$
If $\alpha>0$ and the spectral radius satisfies $\rho(I-\alpha A)<1$, then $(I-\alpha A)^{t+1}\rightarrow 0$, so $\delta_t\rightarrow 0$, i.e., $w_t\rightarrow w^*$.
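A quick numerical illustration of this last step, reusing the $A$ from the sketch above: for a suitable constant step size, $\rho(I-\alpha A)<1$ and the error $\delta_t$ decays geometrically:

```python
import numpy as np

# Rebuild the A from the previous sketch.
P_pi = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d /= d.sum()
A = Phi.T @ np.diag(d) @ (np.eye(3) - 0.9 * P_pi) @ Phi

alpha = 0.5
M = np.eye(2) - alpha * A
print(np.abs(np.linalg.eigvals(M)).max())  # spectral radius rho(I - alpha*A) < 1

delta = np.array([1.0, -1.0])  # arbitrary initial error delta_0
for _ in range(100):
    delta = M @ delta          # delta_{t+1} = (I - alpha*A) delta_t
print(np.linalg.norm(delta))   # near 0: w_t has converged to w*
```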