This article assumes a fairly strong mathematical background; most of the convergence proofs are adapted from *Mathematical Foundations of Reinforcement Learning*. Understanding how the important convergence results in reinforcement learning are proved is valuable both for designing good RL algorithms and for seeing where the basic conclusions of the field come from. This section presents several key convergence theorems from stochastic approximation theory, which will serve as the theoretical foundation for the convergence analysis of the major RL algorithms discussed later.
(Dvoretzky's convergence theorem) Consider a stochastic process $\omega_{k+1}=(1-\alpha_k)\omega_k+\beta_k\eta_k$, where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, $\{\eta_k\}_{k=1}^{\infty}$ are random sequences with $\alpha_k,\beta_k\geq 0$ for all $k$. If the following conditions hold:
(1) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, $\sum_{k=1}^{\infty}\beta_k^2<\infty$ uniformly w.p.1;
(2) $\mathbf E[\eta_k\mid H_k]=0$ and $\mathbf E[\eta_k^2\mid H_k]\leq C$;
where $H_k=\{\omega_k,\omega_{k-1},\dots,\eta_{k-1},\eta_{k-2},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$ is the history, then $\omega_k\rightarrow 0$ w.p.1.
Proof. Assume that $\alpha_k,\beta_k$ are fully determined by $H_k$, i.e. $\alpha_k=\alpha_k(H_k)$, $\beta_k=\beta_k(H_k)$. Then:
$$\mathbf E[\alpha_k\mid H_k]=\alpha_k,\quad \mathbf E[\beta_k\mid H_k]=\beta_k.$$
Define $h_k=\omega_k^2$. Expanding $\omega_{k+1}^2=((1-\alpha_k)\omega_k+\beta_k\eta_k)^2$ gives:
$$\begin{aligned}\mathbf E[h_{k+1}-h_k\mid H_k]&=\mathbf E[\omega_{k+1}^2-\omega_k^2\mid H_k]\\&=\mathbf E[-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\eta_k^2+(2-2\alpha_k)\beta_k\eta_k\omega_k\mid H_k]\\&=-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\mathbf E[\eta_k^2\mid H_k]+(2-2\alpha_k)\beta_k\omega_k\mathbf E[\eta_k\mid H_k]\\&\leq-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2C.\end{aligned}$$
Since $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, we have $\alpha_k\rightarrow 0$. Hence there exists $N$ such that for all $k>N$, $\alpha_k\leq|\alpha_k|<1$ (by the definition of the limit), so $-\alpha_k(2-\alpha_k)\omega_k^2\leq 0$ and therefore $\mathbf E[h_{k+1}-h_k\mid H_k]\leq\beta_k^2C$ for $k>N$.
Moreover, since $\sum_{k=1}^{\infty}\beta_k^2=C_{\beta^2}<\infty$, we obtain:
$$\begin{aligned}\sum_{k=1}^{\infty}\mathbf E[h_{k+1}-h_k\mid H_k]&=\Big(\sum_{k=1}^{N}+\sum_{k=N+1}^{\infty}\Big)\mathbf E[h_{k+1}-h_k\mid H_k]\\&\leq\sum_{k=1}^{N}\mathbf E[h_{k+1}-h_k\mid H_k]+\sum_{k=N+1}^{\infty}\beta_k^2C\\&\leq\sum_{k=1}^{N}\mathbf E[h_{k+1}-h_k\mid H_k]+C_{\beta^2}C<\infty.\end{aligned}$$
Continuing the derivation, note that for $k>N$ we have $\alpha_k<1$ and hence $\alpha_k<\alpha_k(2-\alpha_k)$, so:
$$\begin{aligned}\sum_{k=1}^{\infty}\alpha_k\omega_k^2&=\sum_{k=1}^{N}\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k\omega_k^2\\&<\sum_{k=1}^{N}\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k(2-\alpha_k)\omega_k^2\\&\leq\sum_{k=1}^{N}\alpha_k\omega_k^2+\sum_{k=1}^{\infty}\alpha_k(2-\alpha_k)\omega_k^2\\&\leq\sum_{k=1}^{N}\alpha_k\omega_k^2-\sum_{k=1}^{\infty}\mathbf E[h_{k+1}-h_k\mid H_k]+\sum_{k=1}^{\infty}\beta_k^2C.\end{aligned}$$
Since, as shown above, $\sum_{k=1}^{\infty}\mathbf E[h_{k+1}-h_k\mid H_k]<\infty$ and $\sum_{k=1}^{\infty}\beta_k^2C=C_{\beta^2}C<\infty$, it follows that:
$$\sum_{k=1}^{\infty}\alpha_k\omega_k^2<\infty.$$
Combining $\sum_{k=1}^{\infty}\alpha_k=\infty$ with $\sum_{k=1}^{\infty}\alpha_k\omega_k^2<\infty$ forces $\omega_k\rightarrow 0$ w.p.1, which completes the proof.
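As a quick sanity check (a simulation sketch of ours, not part of the proof), the process in the theorem can be run numerically, assuming $\alpha_k=\beta_k=1/k$ and bounded uniform noise:

```python
import random

# Dvoretzky iteration w_{k+1} = (1 - a_k) w_k + b_k * eta_k with a_k = b_k = 1/k,
# so sum a_k = inf, sum a_k^2 = sum b_k^2 < inf (condition (1)).
random.seed(0)
w = 5.0                               # deliberately bad initial value
for k in range(1, 200_001):
    a_k = b_k = 1.0 / k               # step sizes
    eta = random.uniform(-1.0, 1.0)   # E[eta|H_k] = 0, E[eta^2|H_k] <= 1/3 (condition (2))
    w = (1.0 - a_k) * w + b_k * eta
print(abs(w))                         # small, as the theorem predicts
```

With these step sizes the iterate is exactly a running average of the noise, so the decay toward $0$ is easy to see directly as well.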
(Extension of Dvoretzky's convergence theorem) Let $X$ be a set with elements $x\in X$, and consider, for each $x$, the stochastic process:
$$\Delta_{k+1}(x)=(1-\alpha_k(x))\Delta_k(x)+\beta_k(x)e_k(x).$$
If the following conditions hold:
(1) the set $X$ is finite;
(2) $\sum_k\alpha_k(x)=\infty$, $\sum_k\alpha_k^2(x)<\infty$, $\sum_k\beta_k^2(x)<\infty$;
(3) $\mathbf E[\beta_k(x)\mid H_k]\leq\mathbf E[\alpha_k(x)\mid H_k]$ uniformly w.p.1;
(4) $\|\mathbf E[e_k\mid H_k]\|_{\infty}\leq\gamma\|\Delta_k\|_{\infty}$ with $\gamma\in(0,1)$, where $e_k=[e_k(x)]_{x\in X}^T$ and $\Delta_k=[\Delta_k(x)]_{x\in X}^T$;
(5) there exists $C\geq 0$ such that $\mathbf{Var}[e_k(x)\mid H_k]\leq C(1+\|\Delta_k\|_{\infty})$;
where $H_k=\{\Delta_k,\Delta_{k-1},\dots,e_{k-1},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$, then $\Delta_k(x)\rightarrow 0$ w.p.1 for every $x\in X$.
Proof. The proof is quite involved; see Jaakkola, T., M. I. Jordan, and S. Singh, "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," Neural Computation, 1994, 6: 1185-1201.
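Although the proof is omitted here, the statement can be checked numerically. The sketch below is a toy instance of our own construction (the set $X=\{0,1,2\}$, the cyclic shift in $e_k$, $\gamma=0.5$, and step sizes $\alpha_k=\beta_k=k^{-0.6}$ are all illustrative assumptions): $e_k(x)=\gamma\,\Delta_k((x+1)\bmod 3)+\text{noise}$ satisfies conditions (4) and (5).

```python
import random

# Extended Dvoretzky iteration on the finite set X = {0, 1, 2}:
#   Delta_{k+1}(x) = (1 - a_k) Delta_k(x) + a_k e_k(x)
# with e_k(x) = gamma * Delta_k((x+1) % 3) + uniform noise, so
# ||E[e_k|H_k]||_inf <= gamma * ||Delta_k||_inf with gamma = 0.5 < 1.
random.seed(3)
gamma = 0.5
delta = [10.0, -5.0, 7.0]
for k in range(1, 100_001):
    a_k = k ** -0.6                   # sum a_k = inf, sum a_k^2 < inf
    e = [gamma * delta[(x + 1) % 3] + random.uniform(-1.0, 1.0) for x in range(3)]
    delta = [(1.0 - a_k) * delta[x] + a_k * e[x] for x in range(3)]
print(max(abs(v) for v in delta))     # small for every x, as claimed
```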
(Robbins-Monro theorem) Consider the iteration $\omega_{k+1}=\omega_k-\alpha_k\bar g(\omega_k,\eta_k)$, where $\eta_k$ is a random variable and $\bar g(\omega_k,\eta_k)=g(\omega_k)+\eta_k$. If the following conditions hold:
(1) $0<c_1\leq\nabla_\omega g(\omega)\leq c_2$ for all $\omega$;
(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;
(3) $\mathbf E[\eta_k\mid H_k]=0$, $\mathbf E[\eta_k^2\mid H_k]<\infty$;
where $H_k=\{\omega_k,\omega_{k-1},\dots\}$, then $\omega_k\rightarrow\omega^*$ w.p.1, where $g(\omega^*)=0$.
Proof. We have:
$$\begin{aligned}\omega_{k+1}-\omega^*&=\omega_k-\omega^*-\alpha_k\bar g(\omega_k,\eta_k)\\&=\omega_k-\omega^*-\alpha_k(g(\omega_k)+\eta_k)\\&=\omega_k-\omega^*-\alpha_k(g(\omega_k)-g(\omega^*)+\eta_k)\\&=\omega_k-\omega^*-\alpha_k\nabla_\omega g(\omega_k')(\omega_k-\omega^*)+\alpha_k(-\eta_k)\\&=(1-\alpha_k\nabla_\omega g(\omega_k'))(\omega_k-\omega^*)+\alpha_k(-\eta_k),\end{aligned}$$
where $\omega_k'=\theta\omega_k+(1-\theta)\omega^*$ for some $\theta\in[0,1]$ (the mean value theorem, using $g(\omega^*)=0$ in the third line). Let $\Delta_k=\omega_k-\omega^*$; then:
$$\Delta_{k+1}=(1-\alpha_k\nabla_\omega g(\omega_k'))\Delta_k+\alpha_k(-\eta_k).$$
Since $\sum_{k=1}^{\infty}\alpha_k\nabla_\omega g(\omega_k')\geq c_1\sum_{k=1}^{\infty}\alpha_k$ and $\sum_{k=1}^{\infty}\alpha_k=\infty$, we get $\sum_{k=1}^{\infty}\alpha_k\nabla_\omega g(\omega_k')=\infty$. Meanwhile, $\sum_{k=1}^{\infty}(\alpha_k\nabla_\omega g(\omega_k'))^2\leq c_2^2\sum_{k=1}^{\infty}\alpha_k^2<\infty$ and $\mathbf E[-\eta_k\mid H_k]=0$. The iteration thus has the form required by Dvoretzky's convergence theorem (with $\alpha_k\nabla_\omega g(\omega_k')$ and $\alpha_k$ playing the roles of $\alpha_k$ and $\beta_k$), so $\Delta_k\rightarrow 0$ w.p.1, i.e. $\omega_k\rightarrow\omega^*$ w.p.1.
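To see the theorem in action, the sketch below (an illustrative toy of ours; the function $g$, step sizes, and noise are all assumptions) runs the Robbins-Monro iteration on $g(\omega)=2\omega+\sin\omega$, whose gradient lies in $[1,3]$ and whose root is $\omega^*=0$, observing only the noisy values $\bar g(\omega_k,\eta_k)=g(\omega_k)+\eta_k$:

```python
import math
import random

# Robbins-Monro root finding: w_{k+1} = w_k - a_k * (g(w_k) + noise)
# for g(w) = 2w + sin(w). Here g'(w) = 2 + cos(w) lies in [1, 3],
# so condition (1) holds with c1 = 1, c2 = 3, and the root is w* = 0.
random.seed(0)
g = lambda w: 2.0 * w + math.sin(w)
w = 3.0
for k in range(1, 100_001):
    noise = random.gauss(0.0, 1.0)        # E[eta|H_k] = 0, finite variance
    w = w - (1.0 / k) * (g(w) + noise)    # a_k = 1/k: sum a_k = inf, sum a_k^2 < inf
print(abs(w))                             # close to the root w* = 0
```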
The Robbins-Monro theorem makes it easy to estimate the expectation of a random variable: given i.i.d. random variables $\{x_k\}_{k=1}^{\infty}$ with mean $\mathbf E[X]$, use the iteration $w_{k+1}=(1-\alpha_k)w_k+\alpha_kx_k$. If $\sum_k\alpha_k=\infty$ and $\sum_k\alpha_k^2<\infty$, then $w_k\rightarrow\mathbf E[X]$. (For the proof, simply set $g(w_k)=w_k-\mathbf E[X]$ and $\eta_k=\mathbf E[X]-x_k$, so that $\bar g(w_k,\eta_k)=w_k-x_k$ and the iteration becomes $w_{k+1}=w_k-\alpha_k(w_k-x_k)$.)
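A useful special case (our own illustration, not from the text above): with $\alpha_k=1/k$ the iteration reproduces the exact running sample mean $w_{k+1}=\frac{1}{k}\sum_{i=1}^{k}x_i$, provable by induction on $k$:

```python
import random

# With a_k = 1/k, the iterate w_{k+1} = (1 - 1/k) w_k + (1/k) x_k is exactly
# the running average of x_1, ..., x_k.
random.seed(1)
xs = [random.gauss(10.0, 2.0) for _ in range(5000)]   # i.i.d. samples, E[X] = 10
w = 0.0
for k, x in enumerate(xs, start=1):
    w = (1.0 - 1.0 / k) * w + (1.0 / k) * x           # Robbins-Monro mean estimate
exact_mean = sum(xs) / len(xs)
print(w, exact_mean)   # the two agree up to floating-point error
```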
(Convergence of stochastic gradient descent (SGD)) For the optimization problem $\min_w J(w)=\mathbf E_X[f(w,X)]$, consider the parameter update $w_{k+1}=w_k-\alpha_k\nabla_wf(w_k,x_k)$. If the following conditions hold:
(1) $0<c_1\leq\nabla_w^2f(w,X)\leq c_2$;
(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;
(3) $\{x_k\}_{k=1}^{\infty}$ are i.i.d. random variables;
then $w_k\rightarrow w^*$ w.p.1, where $\nabla_w\mathbf E_X[f(w^*,X)]=0$.
Proof. Let $g(w_k)=\nabla_w\mathbf E_X[f(w_k,X)]$, $\eta_k=\nabla_wf(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)]$, and $\bar g(w_k,\eta_k)=g(w_k)+\eta_k=\nabla_wf(w_k,x_k)$.
Since $0<c_1\leq\nabla_w^2f(w,X)\leq c_2$, we have:
$$c_1\leq\nabla_wg(w_k)=\nabla_w^2\mathbf E_X[f(w_k,X)]\leq c_2.$$
Moreover:
$$\mathbf E[\eta_k\mid H_k]=\mathbf E[\nabla_wf(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)]\mid H_k]=0,$$
and similarly $\mathbf E[\eta_k^2\mid H_k]=\mathbf E[(\nabla_wf(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)])^2\mid H_k]<\infty$. Therefore, by the Robbins-Monro theorem, $w_k\rightarrow w^*$ w.p.1, where $g(w^*)=\nabla_w\mathbf E_X[f(w^*,X)]=0$.
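The sketch below (a toy example of ours, not from the text) runs SGD on $f(w,x)=\frac12(w-x)^2$, for which $\nabla_w^2f=1$ (so $c_1=c_2=1$) and the minimizer of $J(w)=\mathbf E_X[f(w,X)]$ is $w^*=\mathbf E[X]$:

```python
import random

# SGD on f(w, x) = 0.5 * (w - x)^2: the stochastic gradient is (w - x),
# the Hessian is 1, and J(w) = E[f(w, X)] is minimized at w* = E[X].
random.seed(2)
true_mean = 4.0
w = -10.0
for k in range(1, 100_001):
    x = random.gauss(true_mean, 1.0)   # i.i.d. samples of X (condition (3))
    grad = w - x                       # noisy gradient of f at (w, x)
    w = w - (1.0 / k) * grad           # a_k = 1/k satisfies condition (2)
print(w)                               # close to w* = E[X] = 4.0
```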
(Contraction mapping principle) Let $(X,d)$ be a nonempty complete metric space and let $T:X\rightarrow X$ be a contraction mapping, i.e.:
$$d(Tx_1,Tx_2)\leq Cd(x_1,x_2),\quad x_1,x_2\in X,\ 0<C<1.$$
Then $T$ has a unique fixed point $x_0\in X$ satisfying $Tx_0=x_0$, and iterating $x_{n+1}=Tx_n$ from any starting point yields $x_n\rightarrow x_0$.
Proof. Omitted; this is the famous Banach fixed-point theorem, covered in any standard functional analysis textbook.
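As an illustration (our own example), $T(x)=\cos x$ is a contraction on $[0,1]$, since $|T'(x)|=|\sin x|\leq\sin 1<1$ there, and iterating it converges to the unique solution of $\cos x=x$ (the Dottie number, approximately $0.739085$):

```python
import math

# Banach fixed-point iteration x_{n+1} = T(x_n) for T(x) = cos(x).
# On [0, 1], |T'(x)| = |sin(x)| <= sin(1) < 1, so T is a contraction there.
x = 0.5
for _ in range(200):
    x = math.cos(x)   # each step shrinks the distance to x0 by a factor <= sin(1)
print(x)              # the Dottie number, approx 0.7390851332
```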
(Stationary distribution theorem for Markov chains) Let a Markov process have state space $S$ with $|S|$ states, and let $P_\pi\in\mathbb R^{|S|\times|S|}$ be the state transition probability matrix under policy $\pi$. Define the $k$-step transition probability matrix $P_\pi^k=\{p_{ij,\pi}^{(k)}\}_{|S|\times|S|}$, where:
$$p_{ij,\pi}^{(k)}=\mathbf{Prob}(S_{t_k}=j\mid S_{t_0}=i,\pi),$$
which satisfies $P_\pi^k=P_\pi P_\pi^{k-1}$. For any initial state distribution $d_0\in\mathbb R^{|S|}$, the state distribution after $k$ steps under policy $\pi$ is $d_k$, given by $d_k^T=d_0^TP_\pi^k$.
If the following condition holds: there exists a finite $k$ such that $[P_\pi^k]_{ij}>0$ for all states $i,j\in S$ (every state is reachable from every other state in the same finite number of steps), then:
(1) $P_\pi^k\rightarrow\mathbf 1_{|S|}d_\pi^T$;
(2) $d_k^T\rightarrow d_0^T\mathbf 1_{|S|}d_\pi^T=d_\pi^T$ (since $d_0^T\mathbf 1_{|S|}=1$);
(3) $d_\pi^T$ satisfies $d_\pi^T=d_\pi^TP_\pi$.
Such a Markov process is called regular.
Proof. Omitted; this is the classical ergodic theorem for Markov processes, covered in any standard textbook on stochastic processes.
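A small numerical check (a toy chain of ours): for the regular transition matrix below, iterating $d_{k+1}^T=d_k^TP_\pi$ from an arbitrary $d_0$ converges to the stationary distribution $d_\pi=(0.25,0.5,0.25)$, which indeed satisfies $d_\pi^T=d_\pi^TP_\pi$:

```python
# Power iteration d_{k+1}^T = d_k^T P for a 3-state regular Markov chain.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
d = [1.0, 0.0, 0.0]   # arbitrary initial distribution d_0
for _ in range(100):
    d = [sum(d[i] * P[i][j] for i in range(3)) for j in range(3)]
print(d)              # approx the stationary distribution (0.25, 0.5, 0.25)
# Check conclusion (3): d^T P should equal d^T.
dP = [sum(d[i] * P[i][j] for i in range(3)) for j in range(3)]
```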
(Cauchy sequences converge in a complete metric space) In a complete metric space $(X,d)$, every Cauchy sequence $\{x_n\}\subset X$ converges to some $x\in X$. Moreover, a subspace $(Y,d|_{Y\times Y})\subset(X,d)$ is a complete metric space if and only if $Y$ is closed.
Proof. Omitted; this theorem is covered in any standard functional analysis textbook.
(Squeeze theorem) Suppose the sequences $\{X_n\},\{Y_n\},\{Z_n\}$ satisfy:
(1) $Y_n\leq X_n\leq Z_n$ for all $n>N_0\in\mathbb N^*$;
(2) $\lim_{n\rightarrow\infty}Y_n=\lim_{n\rightarrow\infty}Z_n=a<\infty$;
then the limit of $\{X_n\}$ exists and $\lim_{n\rightarrow\infty}X_n=a$.
Proof. Since $\lim_{n\rightarrow\infty}Y_n=\lim_{n\rightarrow\infty}Z_n=a$, by the definition of the limit:
$\forall\varepsilon>0$, $\exists N_1$ such that $|Y_n-a|<\varepsilon$ for all $n>N_1$;
$\forall\varepsilon>0$, $\exists N_2$ such that $|Z_n-a|<\varepsilon$ for all $n>N_2$.
For any $\varepsilon>0$, take $N=\max\{N_0,N_1,N_2\}$; then for $n>N$ we have $X_n\geq Y_n>a-\varepsilon$ and $X_n\leq Z_n<a+\varepsilon$, so $|X_n-a|<\varepsilon$, and hence $\lim_{n\rightarrow\infty}X_n=a$.
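A quick numerical illustration (our own example): $-1/n\leq\sin(n)/n\leq 1/n$ and both bounds tend to $0$, so $\sin(n)/n\rightarrow 0$:

```python
import math

# Squeeze: -1/n <= sin(n)/n <= 1/n for n >= 1, and both bounds tend to 0.
n = 10_000
x_n = math.sin(n) / n
assert -1.0 / n <= x_n <= 1.0 / n   # the sandwich from condition (1)
print(abs(x_n))                     # at most 1/n = 1e-4
```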
(Convergence of sequence averages) If $\{a_n\}_{n=1}^{\infty}\subset\mathbb R$ is a convergent sequence with $\lim_{n\rightarrow\infty}a_n=a^*$, then $\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^{n}a_k=a^*$.
Proof. Let $b_n=\frac{1}{n}\sum_{k=1}^{n}a_k$, which satisfies $(n+1)b_{n+1}-nb_n=a_{n+1}$, and let $\Delta_n=b_n-a^*$. Substituting gives:
$$(n+1)\Delta_{n+1}-n\Delta_n=a_{n+1}-a^*.$$
Telescoping this relation from $1$ to $n-1$ (with $\Delta_1=a_1-a^*$) yields:
$$n\Delta_n=\sum_{k=1}^{n}(a_k-a^*),\quad\text{so}\quad|\Delta_n|\leq\frac{1}{n}\sum_{k=1}^{n}|a_k-a^*|.$$
Since $a_n\rightarrow a^*$, for any $\varepsilon>0$ there exists $N_1$ such that $|a_k-a^*|<\frac{\varepsilon}{2}$ for all $k>N_1$. Splitting the sum at $N_1$:
$$|\Delta_n|\leq\frac{1}{n}\sum_{k=1}^{N_1}|a_k-a^*|+\frac{n-N_1}{n}\cdot\frac{\varepsilon}{2}\leq\frac{C_{N_1}}{n}+\frac{\varepsilon}{2},$$
where $C_{N_1}=\sum_{k=1}^{N_1}|a_k-a^*|$ is a fixed constant. Taking $N_2=\left[\frac{2C_{N_1}}{\varepsilon}\right]+1$, we have $\frac{C_{N_1}}{n}<\frac{\varepsilon}{2}$ for $n>N_2$.
In summary, for any $\varepsilon>0$, taking $N=\max\{N_1,N_2\}$ gives $|\Delta_n|<\varepsilon$ for all $n>N$.
This shows $\Delta_n\rightarrow 0$, i.e. $b_n=\frac{1}{n}\sum_{k=1}^{n}a_k\rightarrow a^*$, as claimed.
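A quick numerical check (our own toy sequence): $a_n=1+(-1)^n/n$ converges to $a^*=1$, and its running averages converge to $1$ as well, though more slowly:

```python
# Averages b_n = (1/n) * sum_{k=1}^n a_k for a_n = 1 + (-1)^n / n, a* = 1.
n_max = 100_000
total = 0.0
for n in range(1, n_max + 1):
    a_n = 1.0 + ((-1) ** n) / n
    total += a_n
b_n = total / n_max
print(b_n)   # close to a* = 1
```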