Important Convergence Results in Reinforcement Learning (1): Commonly Used Theorems

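A note up front: this article assumes a fairly strong mathematical background, and most of the proofs follow the book "Mathematical Foundations of Reinforcement Learning". Understanding how the key convergence results in reinforcement learning are proved helps both in designing sound reinforcement learning algorithms and in seeing where some of the field's basic conclusions come from. This part focuses on several important convergence theorems from stochastic approximation theory, which provide the theoretical basis for the convergence analysis of important reinforcement learning algorithms in later parts.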

(Dvoretzky's convergence theorem) Consider the stochastic process $\omega_{k+1}=(1-\alpha_k)\omega_{k}+\beta_k\eta_k$, where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, and $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences with $\alpha_k,\beta_k\geq 0$ for all $k$. Suppose the following conditions hold:

(1) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, and $\sum_{k=1}^{\infty}\beta_k^2<\infty$, uniformly w.p.1 (with probability 1);

(2) $\mathbf{E}[\eta_k|H_k]=0$ and $\mathbf{E}[\eta_k^2|H_k]\leq C$;

where $H_k=\{\omega_k,\omega_{k-1},\dots,\eta_{k-1},\eta_{k-2},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$. Then $\omega_k\rightarrow 0$ w.p.1.

Proof. Assume that $\alpha_k$ and $\beta_k$ are completely determined by $H_k$, i.e. $\alpha_k=\alpha_k(H_k)$ and $\beta_k=\beta_k(H_k)$. Then:
$$\mathbf E[\alpha_k|H_k]=\alpha_k,\quad \mathbf E[\beta_k|H_k]=\beta_k$$
Define $h_k=\omega_k^2$. Then:
$$\begin{aligned}
\mathbf E[h_{k+1}-h_k|H_k]&=\mathbf E[\omega_{k+1}^2-\omega_{k}^2|H_k]\\
&=\mathbf E[-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\eta_k^2+(2-2\alpha_k)\beta_k\eta_k\omega_k|H_k]\\
&=-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\mathbf{E}[\eta_k^2|H_k]+(2-2\alpha_k)\beta_k\omega_k\mathbf{E}[\eta_k|H_k]\\
&\leq-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2C
\end{aligned}$$
Because $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, we have $\alpha_k\rightarrow 0$.

Hence, by the definition of a limit, there exists $N$ such that $0\leq\alpha_k<1$ for all $k>N$. For such $k$ we have $-\alpha_k(2-\alpha_k)\omega_k^2\leq 0$, and therefore $\mathbf E[h_{k+1}-h_k|H_k]\leq\beta_k^2C$ for $k>N$.

Moreover, since $\sum_{k=1}^{\infty}\beta_k^2=C_{\beta^2}<\infty$, we have:
$$\begin{aligned}
\sum_{k=1}^{\infty}\mathbf E[h_{k+1}-h_k|H_k]&=\Big(\sum_{k=1}^N+\sum_{k=N+1}^{\infty}\Big)\mathbf E[h_{k+1}-h_k|H_k]\\
&\leq \sum_{k=1}^N \mathbf E[h_{k+1}-h_k|H_k]+\sum_{k=N+1}^{\infty}\beta_k^2C\\
&\leq\sum_{k=1}^N \mathbf E[h_{k+1}-h_k|H_k]+ C_{\beta^2}C<\infty
\end{aligned}$$
Continuing the derivation, for $k>N$ we have $2-\alpha_k>1$, so:
$$\begin{aligned}
\sum_{k=1}^{\infty}\alpha_k\omega_k^2&=\sum_{k=1}^N\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k\omega_k^2\\
&\leq\sum_{k=1}^N\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k(2-\alpha_k)\omega_k^2\\
&\leq\sum_{k=1}^N\alpha_k\omega_k^2-\sum_{k=N+1}^{\infty}\mathbf E[h_{k+1}-h_k|H_k]+\sum_{k=N+1}^{\infty}\beta_k^2C
\end{aligned}$$
The summed conditional increments are finite by the bound on $\sum_k\mathbf E[h_{k+1}-h_k|H_k]$ together with $h_k\geq 0$ (this is where the quasimartingale convergence theorem enters), and $\sum_{k}\beta_k^2C\leq C_{\beta^2}C<\infty$. It follows that:
$$\sum_{k=1}^{\infty}\alpha_k\omega_k^2<\infty$$
The same convergence theorem gives that $h_k=\omega_k^2$ converges w.p.1. Since $\sum_k\alpha_k=\infty$, the sum above can only be finite if that limit is $0$. Therefore $\omega_k \rightarrow 0$ w.p.1, which completes the proof.
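As a quick numerical sanity check of the theorem, the following is a minimal sketch that simulates the process for one admissible choice of sequences. The choices $\alpha_k=\beta_k=1/(k+1)$, standard Gaussian noise, the starting value, and the function name `simulate_dvoretzky` are illustrative assumptions, not part of the theorem.

```python
import random


def simulate_dvoretzky(num_steps=100_000, omega_init=5.0, seed=0):
    """Simulate omega_{k+1} = (1 - alpha_k) * omega_k + beta_k * eta_k."""
    rng = random.Random(seed)
    omega = omega_init
    for k in range(1, num_steps + 1):
        alpha_k = 1.0 / (k + 1)      # sum alpha_k = inf, sum alpha_k^2 < inf
        beta_k = 1.0 / (k + 1)       # sum beta_k^2 < inf
        eta_k = rng.gauss(0.0, 1.0)  # zero mean, bounded second moment
        omega = (1.0 - alpha_k) * omega + beta_k * eta_k
    return omega


final_omega = simulate_dvoretzky()
print(f"omega after 100000 steps: {final_omega:.5f}")
```

With these particular step sizes the iterate equals the initial value plus the accumulated noise, divided by the number of steps plus one, so it should shrink toward zero as more steps are taken.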

(An extension of Dvoretzky's convergence theorem) Let $X$ be a set with elements $x\in X$, and consider the stochastic process:
$$\Delta_{k+1}(x)=(1-\alpha_k(x))\Delta_k(x)+\beta_k(x)e_k(x)$$
Suppose the following conditions hold:

(1) the set $X$ is finite;

(2) $\sum_{k}\alpha_k(x)=\infty$, $\sum_{k}\alpha_k^2(x)<\infty$, and $\sum_{k}\beta_k^2(x)<\infty$;

(3) $\mathbf E[\beta_k(x)|H_k]\leq \mathbf E[\alpha_k(x)|H_k]$ uniformly w.p.1;

(4) $\|\mathbf E[e_k|H_k]\|_{\infty}\leq \gamma\|\Delta_k\|_{\infty}$ with $\gamma \in(0,1)$, where $e_k=[e_k(x)]_{x\in X}^T$ and $\Delta_k=[\Delta_k(x)]_{x\in X}^T$;

(5) there exists $C\geq0$ such that $\mathbf{Var}[e_k(x)|H_k]\leq C(1+\|\Delta_k\|_{\infty})$.

Here $H_k=\{\Delta_k,\Delta_{k-1},\dots,e_{k-1},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$. Then $\Delta_k(x) \rightarrow 0$ w.p.1 for all $x\in X$.

Proof. The proof is too involved to reproduce here; see Jaakkola, T., M. I. Jordan, and S. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 1993, 6: pp. 1185-1201.

(Robbins-Monro theorem) Consider the iteration $\omega_{k+1}=\omega_{k}-\alpha_k\bar{g}(\omega_k,\eta_k)$, where $\eta_k$ is a random variable and $\bar{g}(\omega_k,\eta_k)=g(\omega_{k})+\eta_k$. Suppose the following conditions hold:

(1) $0<c_1\leq \nabla_{\omega} g(\omega)\leq c_2$ for all $\omega$;

(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$ and $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;

(3) $\mathbf E[\eta_k|H_k]=0$ and $\mathbf E[\eta_k^2|H_k]<\infty$;

where $H_k=\{\omega_k,\omega_{k-1},\dots\}$. Then $\omega_k\rightarrow \omega^*$ w.p.1, where $\omega^*$ is the root satisfying $g(\omega^*)=0$.

Proof. We have:
$$\begin{aligned}
\omega_{k+1}-\omega^*&=\omega_{k}-\omega^*-\alpha_k\bar{g}(\omega_k,\eta_k)\\
&=\omega_k-\omega^*-\alpha_k(g(\omega_k)+\eta_k)\\
&=\omega_k-\omega^*-\alpha_k(g(\omega_k)-g(\omega^*)+\eta_k)\\
&=\omega_k-\omega^*-\alpha_k\nabla_{\omega}g(\omega_k^{'})(\omega_k-\omega^*)+\alpha_k(-\eta_k)\\
&=(1-\alpha_k\nabla_{\omega}g(\omega_k^{'}))(\omega_k-\omega^*)+\alpha_k(-\eta_k)
\end{aligned}$$
The third line uses $g(\omega^*)=0$, and the fourth uses the mean value theorem with $\omega_k^{'}=\theta\omega_k+(1-\theta)\omega^*$, $\theta\in[0,1]$. Let $\Delta_k=\omega_k-\omega^*$; then $\Delta_{k+1}=(1-\alpha_k\nabla_{\omega}g(\omega_k^{'}))\Delta_k+\alpha_k(-\eta_k)$.

Because $\alpha_k\nabla_{\omega}g(\omega_k^{'})\geq c_1\alpha_k$ and $\sum_{k=1}^{\infty}\alpha_k=\infty$, we get $\sum_{k=1}^{\infty}\alpha_k\nabla_{\omega}g(\omega_k^{'})=\infty$.

In addition, $\sum_{k=1}^{\infty}(\alpha_k\nabla_{\omega}g(\omega_k^{'}))^2\leq c_2^2\sum_{k=1}^{\infty}\alpha^2_k<\infty$ and $\mathbf E[-\eta_k|H_k]=0$.

Hence Dvoretzky's convergence theorem applies, with $\alpha_k\nabla_{\omega}g(\omega_k^{'})$ in the role of $\alpha_k$, $\alpha_k$ in the role of $\beta_k$, and $-\eta_k$ in the role of $\eta_k$. It gives $\Delta_k\rightarrow 0$ w.p.1, i.e. $\omega_k\rightarrow \omega^*$ w.p.1.
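To make the Robbins-Monro iteration concrete, here is a minimal sketch of root finding from noisy observations. The target function `g`, the Gaussian noise, and the step size $\alpha_k=1/k$ are hypothetical choices made for illustration; `g` was picked so that its derivative $1+0.5\cos(\omega)$ stays in $[0.5,1.5]$, matching condition (1).

```python
import math
import random


def g(w):
    """Hypothetical target function; g'(w) = 1 + 0.5*cos(w) lies in [0.5, 1.5]."""
    return w + 0.5 * math.sin(w) - 2.0


def robbins_monro(num_steps=100_000, w_init=0.0, noise_std=1.0, seed=0):
    """Iterate w_{k+1} = w_k - alpha_k * (g(w_k) + eta_k) with alpha_k = 1/k."""
    rng = random.Random(seed)
    w = w_init
    for k in range(1, num_steps + 1):
        alpha_k = 1.0 / k
        noisy_g = g(w) + rng.gauss(0.0, noise_std)  # only the noisy value is observed
        w = w - alpha_k * noisy_g
    return w


w_hat = robbins_monro()
print(f"estimate: {w_hat:.4f}, g(estimate): {g(w_hat):.4f}")
```

The algorithm never evaluates `g` without noise during the iteration; the final call to `g` is only there to check how close the estimate is to a root.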

The Robbins-Monro theorem makes it easy to estimate the expectation of a random variable. Let $\{x_k\}_{k=1}^{\infty}$ be independent and identically distributed random variables with expectation $\mathbf{E}[X]$, and compute the iteration $w_{k+1}=(1-\alpha_k)w_k+\alpha_kx_k$. If $\sum_{k}\alpha_k=\infty$ and $\sum_{k}\alpha_k^2<\infty$, then $w_k\rightarrow \mathbf{E}[X]$ w.p.1. (For the proof, simply take $g(w_k)=w_k-\mathbf E[X]$ and $\eta_k=\mathbf E[X]-x_k$, so that $g(w_k)+\eta_k=w_k-x_k$.)
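The mean-estimation iteration can be sketched in a few lines. The step size $\alpha_k=1/k$ and the uniform test distribution are assumptions chosen for illustration; with this step size the iterate coincides with the ordinary sample mean of the data seen so far, which the usage lines compare against.

```python
import random


def running_mean(samples):
    """Iterate w_{k+1} = (1 - alpha_k) * w_k + alpha_k * x_k with alpha_k = 1/k."""
    w = 0.0  # initial value; it is overwritten at k = 1 because alpha_1 = 1
    for k, x_k in enumerate(samples, start=1):
        alpha_k = 1.0 / k
        w = (1.0 - alpha_k) * w + alpha_k * x_k
    return w


rng = random.Random(0)
samples = [rng.uniform(0.0, 10.0) for _ in range(50_000)]  # E[X] = 5
estimate = running_mean(samples)
print(f"iterative estimate: {estimate:.4f}")
print(f"batch sample mean:  {sum(samples) / len(samples):.4f}")
```

The practical appeal of the iterative form is that samples can be processed one at a time without storing them.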

(Convergence of stochastic gradient descent (SGD)) For the optimization problem $\min_{w}J(w)=\mathbf{E}_{X}[f(w,X)]$, suppose the parameters are updated by the iteration $w_{k+1}=w_k-\alpha_k\nabla_{w}f(w_k,x_k)$. If the following conditions hold:

(1) $0<c_1\leq \nabla_w^2f(w,X)\leq c_2$;

(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$ and $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;

(3) $\{x_k\}_{k=1}^{\infty}$ are independent and identically distributed random variables;

then $w_k\rightarrow w^*$ w.p.1, where $\nabla_w \mathbf{E}_X[f(w^*,X)]=0$.

Proof. Let $g(w_k)=\nabla_w \mathbf{E}_X[f(w_k,X)]$ and $\eta_k=\nabla_{w}f(w_k,x_k)-\nabla_w \mathbf{E}_X[f(w_k,X)]$, so that $\bar{g}(w_k,\eta_k)=g(w_k)+\eta_k=\nabla_{w}f(w_k,x_k)$.

Since $0<c_1\leq \nabla_w^2f(w,X)\leq c_2$, we have:
$$c_1\leq\nabla_w g(w_k)=\nabla_w^2 \mathbf{E}_X[f(w_k,X)]\leq c_2$$
Because $x_k$ is independent of $H_k$ and has the same distribution as $X$:
$$\mathbf E[\eta_k|H_k]=\mathbf{E}[\nabla_{w}f(w_k,x_k)-\nabla_w \mathbf{E}_X[f(w_k,X)]|H_k]=0$$

Similarly, $\mathbf E[\eta_k^2|H_k]=\mathbf{E}[(\nabla_{w}f(w_k,x_k)-\nabla_w \mathbf{E}_X[f(w_k,X)])^2|H_k]<\infty$.

Therefore, by the Robbins-Monro theorem, $w_k\rightarrow w^*$ w.p.1, where $g(w^*)=\nabla_w \mathbf{E}_X[f(w^*,X)]=0$.
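The SGD result can be illustrated on a scalar least-squares problem. In this sketch the loss $f(w,(a,y))=\tfrac12(aw-y)^2$, the data model $y=3a+\text{noise}$ with $a$ uniform on $[1,2]$, and the step size $\alpha_k=1/k$ are all assumptions made for the example. The second derivative is $a^2\in[1,4]$, so condition (1) holds with $c_1=1$ and $c_2=4$, and the minimizer of the expected loss is $w^*=3$.

```python
import random


def sgd_scalar_regression(num_steps=100_000, w_init=0.0, seed=0):
    """SGD on f(w, (a, y)) = 0.5 * (a * w - y)^2 with y = 3 * a + noise."""
    rng = random.Random(seed)
    w = w_init
    for k in range(1, num_steps + 1):
        a = rng.uniform(1.0, 2.0)          # second derivative a^2 lies in [1, 4]
        y = 3.0 * a + rng.gauss(0.0, 0.5)  # true slope is 3
        grad = a * (a * w - y)             # gradient of f with respect to w
        alpha_k = 1.0 / k
        w = w - alpha_k * grad
    return w


w_hat = sgd_scalar_regression()
print(f"SGD estimate of the slope: {w_hat:.4f}")
```

Each update uses a single fresh sample, which is exactly the setting of the theorem: the stochastic gradient equals the true gradient plus zero-mean noise.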

(Contraction mapping principle) In a nonempty complete metric space $(X,d)$, let $T:X\rightarrow X$ be a contraction mapping, i.e. a mapping that satisfies:
$$d(Tx_1,Tx_2)\leq Cd(x_1,x_2),\quad x_1,x_2\in X,\ 0 < C<1$$
Then $T$ has a unique fixed point $x_0$ in this space satisfying $Tx_0=x_0$, and it can be obtained from any starting point by the iteration $x_{n+1}=Tx_{n}$, for which $x_n\rightarrow x_0$.

Proof. Omitted. This is the well-known Banach fixed-point theorem, which is covered in standard functional analysis textbooks.
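The fixed-point iteration can be tried on a small affine map. The sketch below assumes the map $T(v)=r+\gamma Pv$ with a row-stochastic matrix $P$ and $\gamma\in(0,1)$; since $\|Pv\|_\infty\leq\|v\|_\infty$, this map is a contraction with constant $\gamma$ in the sup norm. The specific numbers and the helper names `apply_T` and `fixed_point_iteration` are hypothetical.

```python
def apply_T(v, r, P, gamma):
    """Affine map T(v) = r + gamma * P v."""
    n = len(v)
    return [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]


def fixed_point_iteration(r, P, gamma, tol=1e-12, max_iter=10_000):
    """Iterate v_{n+1} = T(v_n) until successive iterates differ by less than tol."""
    v = [0.0] * len(r)  # arbitrary starting point
    for _ in range(max_iter):
        v_next = apply_T(v, r, P, gamma)
        if max(abs(a - b) for a, b in zip(v_next, v)) < tol:
            return v_next
        v = v_next
    return v


P = [[0.9, 0.1], [0.2, 0.8]]  # row-stochastic matrix (hypothetical values)
r = [1.0, 0.0]
gamma = 0.9
v_star = fixed_point_iteration(r, P, gamma)
residual = max(abs(a - b) for a, b in zip(apply_T(v_star, r, P, gamma), v_star))
print(f"fixed point: {v_star}, residual: {residual:.2e}")
```

The residual $\|T(v)-v\|_\infty$ printed at the end is a direct check that the returned vector is (numerically) a fixed point.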

(Stationary distribution theorem for Markov chains) Let $S$ be the state space of a Markov process, with $|S|$ states. Let $P_{\pi}\in \mathbb{R}^{|S|\times |S|}$ denote the state transition probability matrix under policy $\pi$, and define the $k$-step state transition probability matrix $P_{\pi}^k=\{p_{ij,\pi}^{(k)}\}_{|S|\times |S|}$, where:
$$p_{ij,\pi}^{(k)}=\mathbf{Prob}(S_{t_k}=j|S_{t_0}=i,\pi)$$
It satisfies $P_{\pi}^k=P_{\pi}P_{\pi}^{k-1}$. For any initial state distribution $d_0\in \mathbb{R}^{|S|}$, the state distribution after $k$ steps under policy $\pi$ is $d_k$, given by $d_k^T=d_0^TP_{\pi}^k$.

Suppose the following condition holds:

there exists a finite number of steps $k$ such that $[P_{\pi}^k]_{ij}>0$ for every pair of states $i,j\in S$.

Then, as the number of steps tends to infinity:

(1) $P_{\pi}^k\rightarrow \mathbf{1}_{|S|}d_{\pi}^T$;

(2) $d_k^T\rightarrow d_0^T\mathbf{1}_{|S|}d_{\pi}^T=d_{\pi}^T$ (since $d_0^T\mathbf{1}_{|S|}=1$);

(3) $d_{\pi}$ satisfies $d_{\pi}^T=d_{\pi}^TP_{\pi}$.

A Markov process satisfying this condition is called regular.

Proof. Omitted. This is the classical ergodic theorem for Markov processes, which is covered in standard textbooks on stochastic processes.
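The convergence $d_k^T=d_0^TP_\pi^k\rightarrow d_\pi^T$ is easy to observe numerically. The sketch below uses a hypothetical two-state transition matrix whose entries are all positive, so the regularity condition holds with one step; the helper names are illustrative. For this matrix, solving $d^T=d^TP$ by hand gives $d_\pi=(5/6,\,1/6)$.

```python
def step_distribution(d, P):
    """One step of d_{k+1}^T = d_k^T P (row vector times matrix)."""
    n = len(d)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_after(d0, P, num_steps):
    """Return d_k for k = num_steps, starting from the distribution d0."""
    d = list(d0)
    for _ in range(num_steps):
        d = step_distribution(d, P)
    return d


P = [[0.9, 0.1], [0.5, 0.5]]  # all entries positive (hypothetical values)
d_from_state_0 = distribution_after([1.0, 0.0], P, 100)
d_from_state_1 = distribution_after([0.0, 1.0], P, 100)
print(d_from_state_0)
print(d_from_state_1)
```

Both starting distributions lead to the same limit, which is conclusion (2): the stationary distribution does not depend on $d_0$.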

(Every Cauchy sequence in a complete metric space converges) In a complete metric space $(X,d)$, every Cauchy sequence $\{x_n\}\subset X$ converges, with $x_n\rightarrow x \in X$. Moreover, a subspace $(Y,d|_{Y\times Y})\subset (X,d)$ is a complete metric space if and only if $Y$ is a closed set.

Proof. Omitted. This theorem is covered in standard functional analysis textbooks.

(Squeeze theorem) Suppose the sequences $\{X_n\},\{Y_n\},\{Z_n\}$ satisfy the following conditions:

(1) there exists $N_0\in \mathbb{N}^*$ such that $Y_n\leq X_n \leq Z_n$ for all $n>N_0$;

(2) $\lim_{n \rightarrow \infty}Y_n=\lim_{n \rightarrow \infty}Z_n=a<\infty$.

Then the limit of $\{X_n\}$ exists, and $\lim_{n\rightarrow \infty }X_n=a$.

Proof. Since $\lim_{n \rightarrow \infty}Y_n=\lim_{n \rightarrow \infty}Z_n=a$, the definition of a limit gives:

$\forall \varepsilon >0$, $\exists N_1$ such that $|Y_n-a|<\varepsilon$ for all $n>N_1$;

$\forall \varepsilon >0$, $\exists N_2$ such that $|Z_n-a|<\varepsilon$ for all $n>N_2$.

For any $\varepsilon >0$, take $N=\max\{N_0,N_1,N_2\}$. Then for $n>N$ we have $X_n\geq Y_n>a-\varepsilon$ and $X_n\leq Z_n<a+\varepsilon$, which gives $|X_n-a|<\varepsilon$. Hence $\lim_{n\rightarrow \infty}X_n=a$.

(Convergence of the running average of a sequence) Let $\{a_n\}_{n=1}^{\infty}\subset \mathbb{R}$ be a convergent sequence with $\lim_{n\rightarrow \infty}a_n=a^*$. Then $\lim_{n\rightarrow \infty}\frac{1}{n}\sum_{k=1}^na_k=a^*$.

Proof. Let $b_n=\frac{1}{n}\sum_{k=1}^na_k$. Then $(n+1)b_{n+1}-nb_n=a_{n+1}$. Let $\Delta_n=b_n-a^*$, which gives:
$$\begin{aligned}
(n+1)b_{n+1}-nb_n&=(n+1)(\Delta_{n+1}+a^*)-n(\Delta_n+a^*)\\
&=(n+1)\Delta_{n+1}-n\Delta_n+a^*\\
&=a_{n+1}
\end{aligned}$$
Simplifying, $(n+1)\Delta_{n+1}-n\Delta_n=a_{n+1}-a^*$. Summing this identity over the index, and using $\Delta_1=a_1-a^*$, gives $n\Delta_n=\sum_{k=1}^n(a_k-a^*)$.

Since $a_n\rightarrow a^*$, for every $\varepsilon>0$ there exists $N_1>0$ such that $|a_k-a^*|<\frac{\varepsilon}{2}$ for all $k>N_1$. Let $M=\sum_{k=1}^{N_1}|a_k-a^*|$, which is a finite constant. For $n>N_1$ we then have:
$$|\Delta_n|\leq\frac{1}{n}\sum_{k=1}^{N_1}|a_k-a^*|+\frac{1}{n}\sum_{k=N_1+1}^{n}|a_k-a^*|<\frac{M}{n}+\frac{\varepsilon}{2}$$
Taking $N_2=[\frac{2M}{\varepsilon}]+1$, we have $\frac{M}{n}<\frac{\varepsilon}{2}$ for all $n>N_2$.

In summary, for every $\varepsilon>0$, taking $N=\max\{N_1,N_2\}$ gives $|\Delta_n|<\varepsilon$ for all $n>N$.

This shows that $\Delta_n\rightarrow 0$, i.e. $b_n=\frac{1}{n}\sum_{k=1}^na_k\rightarrow a^*$, which completes the proof.
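A short numerical illustration of this property is sketched below. The test sequence $a_n=a^*+1/n$ and the helper name `cesaro_means` are assumptions chosen for the example.

```python
def cesaro_means(sequence):
    """Return b_n = (a_1 + ... + a_n) / n for n = 1, ..., len(sequence)."""
    means = []
    running_sum = 0.0
    for n, a_n in enumerate(sequence, start=1):
        running_sum += a_n
        means.append(running_sum / n)
    return means


a_star = 2.0
a = [a_star + 1.0 / n for n in range(1, 100_001)]  # a_n -> a_star
b = cesaro_means(a)
print(f"|a_n - a*| = {abs(a[-1] - a_star):.2e}")
print(f"|b_n - a*| = {abs(b[-1] - a_star):.2e}")
```

For this sequence the averages converge to the same limit but more slowly than the sequence itself, because early terms keep contributing to the average.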
