ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 5 - Banach Fixed Point Theorem in Dynamic Programming

5. Banach Fixed Point Theorem in Dynamic Programming

As shown in the previous two sections, the VI and PI algorithms each have their own strengths and weaknesses. Specifically, VI is a simple iterative algorithm, but its convergence in the space of total cost functions can be inefficient. PI enjoys the nice property of exploring the finite policy space and achieving better convergence there, yet it suffers from the bottleneck of exact policy evaluation. Although the OPI algorithm naturally bridges the two, its performance is dictated by the chosen number of truncated policy evaluation steps. In this section, we aim to alleviate this limitation by exploiting properties of the Banach fixed point theorem, while retaining the appealing features of simplicity and fast convergence.

5.1 The Banach Fixed Point Theorem

First, for completeness of the discussion, we review a recently proposed alternative way of proving the contraction principle [19].

Definition 5.1 (Metric space)

A metric space is an ordered pair $(\mathcal{X}, d)$, where $\mathcal{X}$ is a set and $d$ is a metric on $\mathcal{X}$, i.e., a function $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ such that for all $x, y, z \in \mathcal{X}$ the following hold:

  1. $d(x, y)=0 \Longleftrightarrow x=y$ (identity of indiscernibles),

  2. $d(x, y)=d(y, x)$ (symmetry),

  3. $d(x, z) \leq d(x, y)+d(y, z)$ (triangle inequality).
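
For instance, the space $\mathbb{R}^{K}$ of total cost functions equipped with $d(J, J^{\prime})=\|J-J^{\prime}\|_{\infty}$ satisfies all three axioms; this is the metric space used throughout these notes.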

Definition 5.2 (Contraction mapping)

Let $(\mathcal{X}, d)$ be a metric space. A mapping $\mathrm{T}: \mathcal{X} \rightarrow \mathcal{X}$ is called a contraction mapping on $\mathcal{X}$ if there exists $\gamma \in[0,1)$ such that

$$d(\mathrm{T}(x), \mathrm{T}(y)) \leq \gamma d(x, y), \quad \text{for all } x, y \in \mathcal{X} \tag{5.1}$$
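
For example, recalling the compact MDP notation $G_{\pi}$ and $P_{\pi}$ from the previous note, the affine map $J \mapsto G_{\pi}+\gamma P_{\pi} J$ on $\left(\mathbb{R}^{K},\|\cdot\|_{\infty}\right)$ with $\gamma \in[0,1)$ and a row-stochastic $P_{\pi}$ is a contraction with constant $\gamma$, since $\left\|\gamma P_{\pi}\left(J-J^{\prime}\right)\right\|_{\infty} \leq \gamma\left\|J-J^{\prime}\right\|_{\infty}$; this is precisely why Bellman operators fit into this framework.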

Lemma 5.1 (Fundamental contraction inequality)

If $\mathrm{T}: \mathcal{X} \rightarrow \mathcal{X}$ is a contraction mapping with contraction constant $\gamma$, then for any $x, y \in \mathcal{X}$,
$$d(x, y) \leq \frac{1}{1-\gamma}\big(d(x, \mathrm{T}(x))+d(y, \mathrm{T}(y))\big) \tag{5.2}$$


Proof

By the triangle inequality and the fact that $\mathrm{T}$ is a contraction mapping, we have
$$\begin{aligned} d(x, y) & \leq d(x, \mathrm{T}(x))+d(\mathrm{T}(x), \mathrm{T}(y))+d(\mathrm{T}(y), y) \\ & \leq d(x, \mathrm{T}(x))+\gamma d(x, y)+d(\mathrm{T}(y), y) \end{aligned} \tag{5.3}$$

Rearranging gives $(1-\gamma) d(x, y) \leq d(x, \mathrm{T}(x))+d(y, \mathrm{T}(y))$, and dividing by $1-\gamma>0$ completes the proof.


Corollary 5.1 (Fixed point of a contraction mapping)

A contraction mapping $\mathrm{T}$ has at most one fixed point.


Proof

Suppose, for contradiction, that the contraction mapping $\mathrm{T}$ has more than one fixed point, say $x$ and $y$ with $x \neq y$. By the definition of a fixed point of $\mathrm{T}$, we get $d(x, \mathrm{T}(x))=0$ and $d(y, \mathrm{T}(y))=0$. By Lemma 5.1, this implies $d(x, y)=0$, i.e., $x=y$, a contradiction.


Proposition 5.1 (Cauchy sequence)

If $\mathrm{T}$ is a contraction, then for any $x \in \mathcal{X}$ the sequence of iterates $\left(\mathrm{T}^{k}(x)\right)_{k \geq 0}$ is a Cauchy sequence.


Proof

x 1 = T k ( x ) x_{1}=\mathrm{T}^{k}(x) x1=Tk(x) x 2 = T l ( x ) . x_{2}=\mathrm{T}^{l}(x). x2=Tl(x). 那么,根据基本压缩不等式,我们有
d (   T k ( x ) , T l ( x ) ) ≤ 1 1 − γ ( d (   T k ( x ) , T k + 1 ( x ) ) + d (   T l ( x ) , T l + 1 ( x ) ) ) ≤ γ k + γ l 1 − γ ( d ( x ,   T ( x ) ) + d ( x ,   T ( x ) ) ) (5.4) \begin{aligned} d\left(\mathrm{~T}^{k}(x), \mathrm{T}^{l}(x)\right) & \leq \frac{1}{1-\gamma}\left(d\left(\mathrm{~T}^{k}(x), \mathrm{T}^{k+1}(x)\right)+d\left(\mathrm{~T}^{l}(x), \mathrm{T}^{l+1}(x)\right)\right) \\ & \leq \frac{\gamma^{k}+\gamma^{l}}{1-\gamma}(d(x, \mathrm{~T}(x))+d(x, \mathrm{~T}(x))) \end{aligned}\tag{5.4} d( Tk(x),Tl(x))1γ1(d( Tk(x),Tk+1(x))+d( Tl(x),Tl+1(x)))1γγk+γl(d(x, T(x))+d(x, T(x)))(5.4)

显然,当 k k k l l l趋于无穷大时, d (   T k ( x ) , T l ( x ) ) d\left(\mathrm{~T}^{k}(x), \mathrm{T}^{l}(x)\right) d( Tk(x),Tl(x))的距离会归于零。最后,我们准备提出经典的巴拿赫不动点定理(Banach fixed-point theorem)


Theorem 5.1 (The Banach fixed-point theorem)

( X , d ) (\mathcal{X}, d) (X,d)是一个非空的完整度量空间,有一个压缩映射 T \mathrm{T} T。那么 T \mathrm{T} T就有一个唯一的固定点 x ∗ x^{*} x。即 T ( x ∗ ) = x ∗ \mathrm{T}\left(x^{*}\right)=x^{*} T(x)=x。此外, x ∗ x^{*} x可以通过迭代来找到。

$$x_{n+1}=\mathrm{T}\left(x_{n}\right), \quad \text{with arbitrary } x_{0} \in \mathcal{X} \tag{5.5}$$

That is, $x_{n} \rightarrow x^{*}$ as $n \rightarrow \infty$.


Proof

By Proposition 5.1, the sequence $\left(\mathrm{T}^{k}(x)\right)_{k \geq 0}$ is Cauchy, so by completeness of $\mathcal{X}$ it converges to some $x^{*} \in \mathcal{X}$, which is a fixed point since a contraction is continuous; uniqueness follows from Corollary 5.1. Applying Lemma 5.1 with $y=x^{*}$, $d\left(x^{*}, \mathrm{T}\left(x^{*}\right)\right)=0$, and the contraction property directly gives

$$d\left(\mathrm{T}^{k}(x), x^{*}\right) \leq \frac{\gamma^{k}}{1-\gamma} d(x, \mathrm{T}(x)) \tag{5.6}$$

i.e., $k \rightarrow \infty$ implies that $\mathrm{T}^{k}(x)$ converges to $x^{*}$.
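
To make the iteration in Eq. (5.5) concrete, the following is a minimal Python sketch; the mapping $T(x)=\tfrac{1}{2}\cos(x)$ is a hypothetical toy contraction on $\mathbb{R}$ (modulus $\gamma=\tfrac{1}{2}$), not part of the original notes.

```python
import numpy as np

def banach_iterate(T, x0, tol=1e-12, max_iter=1000):
    """Fixed-point iteration x_{n+1} = T(x_n), stopped once successive
    iterates are within tol of each other."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next, n
        x = x_next
    return x, max_iter

# Toy contraction with modulus 1/2: |d/dx (cos(x)/2)| = |sin(x)|/2 <= 1/2 < 1.
T = lambda x: 0.5 * np.cos(x)
x_star, n_iters = banach_iterate(T, x0=0.0)
print(x_star, n_iters)  # the unique fixed point of x = cos(x)/2
```

The a priori bound in Eq. (5.6) predicts the geometric convergence rate: the error after $k$ iterations is at most $\frac{\gamma^{k}}{1-\gamma} d\left(x_{0}, \mathrm{T}\left(x_{0}\right)\right)$.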

5.2 The $\lambda$-Geometrically Averaged Bellman Operator

One possible way to exploit the OPI algorithm is to aggregate OPI variants with different numbers of policy evaluation steps. Specifically, we can construct a geometrically averaged Bellman operator from the family of $m$-step Bellman operators defined in Eq. (4.33):
$$\begin{aligned} \mathrm{T}_{\pi, \lambda}^{m}: \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}, \quad J & \mapsto \frac{1}{\sum_{i=1}^{m} \lambda^{i-1}} \sum_{i=1}^{m} \lambda^{i-1} \mathrm{T}_{\pi}^{i} J \\ &=\frac{1-\lambda}{1-\lambda^{m}} \sum_{i=1}^{m} \lambda^{i-1} \mathrm{T}_{\pi}^{i} J \end{aligned} \tag{5.7}$$

It is easy to see that the total cost function $J_{\pi}$ is the unique fixed point of every $m$-step geometrically averaged Bellman operator. Its contraction property is characterized in the following result.
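
As a minimal sketch of Eq. (5.7), the operator can be evaluated by accumulating the iterates $\mathrm{T}_{\pi} J, \mathrm{T}_{\pi}^{2} J, \ldots, \mathrm{T}_{\pi}^{m} J$ with geometric weights; here `T_pi` stands for any single-step Bellman operator acting on a cost vector (the function name is illustrative).

```python
import numpy as np

def geom_avg_bellman(T_pi, J, m, lmbd):
    """m-step geometrically averaged Bellman operator of Eq. (5.7):
    (1 - lmbd) / (1 - lmbd**m) * sum_{i=1..m} lmbd**(i-1) * T_pi^i(J)."""
    J_i = np.asarray(J, dtype=float)
    total = np.zeros_like(J_i)
    for i in range(1, m + 1):
        J_i = T_pi(J_i)                 # J_i now equals T_pi^i applied to J
        total += lmbd ** (i - 1) * J_i
    return (1 - lmbd) / (1 - lmbd ** m) * total
```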

Proposition 5.2 (Contraction property of the $m$-step geometrically averaged Bellman operator)

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and $J, J^{\prime} \in \mathbb{R}^{K}$, the $m$-step truncated, geometrically averaged Bellman operator is a contraction mapping with respect to the infinity norm, i.e.,

$$\left\|\mathrm{T}_{\pi, \lambda}^{m} J-\mathrm{T}_{\pi, \lambda}^{m} J^{\prime}\right\|_{\infty} \leq \frac{\gamma\left(1-\lambda^{m} \gamma^{m}\right)(1-\lambda)}{(1-\lambda \gamma)\left(1-\lambda^{m}\right)}\left\|J-J^{\prime}\right\|_{\infty} \tag{5.8}$$


Proof

By the triangle inequality for the infinity norm, it follows that
$$\begin{aligned} \left\|\mathrm{T}_{\pi, \lambda}^{m} J-\mathrm{T}_{\pi, \lambda}^{m} J^{\prime}\right\|_{\infty} &=\frac{1-\lambda}{1-\lambda^{m}}\left\|\sum_{i=1}^{m} \lambda^{i-1}\left(\mathrm{T}_{\pi}^{i} J-\mathrm{T}_{\pi}^{i} J^{\prime}\right)\right\|_{\infty} \\ & \leq \frac{1-\lambda}{1-\lambda^{m}} \sum_{i=1}^{m} \lambda^{i-1}\left\|\mathrm{T}_{\pi}^{i} J-\mathrm{T}_{\pi}^{i} J^{\prime}\right\|_{\infty} \\ & \leq \frac{1-\lambda}{1-\lambda^{m}} \sum_{i=1}^{m} \lambda^{i-1} \gamma^{i}\left\|J-J^{\prime}\right\|_{\infty} \\ &=\frac{\gamma(1-\lambda)\left(1-\lambda^{m} \gamma^{m}\right)}{(1-\lambda \gamma)\left(1-\lambda^{m}\right)}\left\|J-J^{\prime}\right\|_{\infty} \end{aligned} \tag{5.9}$$

where the last step evaluates the geometric sum $\sum_{i=1}^{m} \lambda^{i-1} \gamma^{i}=\gamma \frac{1-\lambda^{m} \gamma^{m}}{1-\lambda \gamma}$.

Since this contraction modulus is strictly smaller than $1$, the proof is complete.


It is worth noting that the contraction modulus of the geometrically averaged Bellman operator $\mathrm{T}_{\pi, \lambda}^{m}$ decreases as the number of policy evaluation steps $m$ increases. In other words, the larger $m$ is, the faster the associated algorithm converges. Note that such a construction becomes impractical as $m$ grows to infinity. Nevertheless, by letting $m$ tend to infinity, it is straightforward to derive the geometrically averaged Bellman operator

$$\mathrm{T}_{\pi, \lambda}^{\infty}: \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}, \quad J \mapsto(1-\lambda) \sum_{i=1}^{\infty} \lambda^{i-1} \mathrm{T}_{\pi}^{i} J \tag{5.10}$$

which is referred to as the $\lambda$-geometrically averaged Bellman operator in these notes.

Proposition 5.3 (Contraction property of the $\lambda$-geometrically averaged Bellman operator)

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and $J, J^{\prime} \in \mathbb{R}^{K}$, the $\lambda$-geometrically averaged Bellman operator is a contraction mapping with respect to the infinity norm, i.e.,

$$\left\|\mathrm{T}_{\pi, \lambda}^{\infty} J-\mathrm{T}_{\pi, \lambda}^{\infty} J^{\prime}\right\|_{\infty} \leq \frac{\gamma(1-\lambda)}{1-\lambda \gamma}\left\|J-J^{\prime}\right\|_{\infty} \tag{5.11}$$


Proof

The result follows from letting $m$ tend to infinity in Proposition 5.2, Eq. (5.8).
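
As a quick numeric sanity check (with hypothetical values $\gamma=0.9$ and $\lambda=0.8$), the modulus in Eq. (5.8) decreases monotonically in $m$ from $\gamma$ toward the limit $\frac{\gamma(1-\lambda)}{1-\lambda \gamma}$ of Eq. (5.11):

```python
gamma, lmbd = 0.9, 0.8  # hypothetical discount factor and averaging weight

def modulus(m):
    """Contraction modulus of Eq. (5.8) for the m-step averaged operator."""
    return (gamma * (1 - (lmbd * gamma) ** m) * (1 - lmbd)
            / ((1 - lmbd * gamma) * (1 - lmbd ** m)))

for m in (1, 2, 5, 10, 100):
    print(m, modulus(m))        # modulus(1) == gamma; values decrease with m
print(gamma * (1 - lmbd) / (1 - lmbd * gamma))  # the m -> infinity limit, Eq. (5.11)
```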


Obviously, the $\lambda$-geometrically averaged Bellman operator is only of theoretical interest and has no direct practical use. In the following, we study a carefully designed Bellman operator that is originally inspired by the $\lambda$-geometrically averaged Bellman operator. Its connection to the concept of eligibility traces in the RL context will be discussed in later chapters.

5.3 The $\lambda$-Policy Iteration Algorithm

J ′ ∈ R K J^{\prime} \in \mathbb{R}^{K} JRK成为总成本函数 J π ∈ R K J_{\pi} \in \mathbb{R}^{K} JπRK的一个估计。我们定义以下映射
$$\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}: \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}, \quad J \mapsto(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi} J \tag{5.12}$$

which we call the $\lambda$-cascaded Bellman operator. It is easy to see that for any $J_{1}, J_{2} \in \mathbb{R}^{K}$, we have
$$\begin{aligned} \left\|\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{1}-\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{2}\right\|_{\infty} & \leq \lambda\left\|\mathrm{T}_{\pi} J_{1}-\mathrm{T}_{\pi} J_{2}\right\|_{\infty} \\ & \leq \lambda \gamma\left\|J_{1}-J_{2}\right\|_{\infty} \end{aligned} \tag{5.13}$$

Clearly, the $\lambda$-cascaded Bellman operator is a contraction with modulus $\lambda \gamma$ with respect to the infinity norm; hence it has a unique fixed point. When $\lambda=0$, the fixed point of $\mathrm{T}_{\pi, J^{\prime}}^{(0)}$ is $\mathrm{T}_{\pi} J^{\prime}$, which is simply a one-step application of the Bellman operator $\mathrm{T}_{\pi}$ to the estimate $J^{\prime}$. When $\lambda=1$, the fixed point of $\mathrm{T}_{\pi, J^{\prime}}^{(1)}$ satisfies the equation
$$J=\mathrm{T}_{\pi} J \tag{5.14}$$

which is indeed the true total cost function of the policy $\pi$, i.e., $J_{\pi}$. Hence, the fixed point of the $\lambda$-cascaded Bellman operator $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$ defined in Eq. (5.12) can be viewed as a convex combination of the one-step and the infinite-step Bellman operators. This relation coincides with the construction of the $\lambda$-geometrically averaged Bellman operator $\mathrm{T}_{\pi, \lambda}^{\infty}$.
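
A minimal sketch of iterating Eq. (5.12) to its fixed point, again assuming `T_pi` is any single-step Bellman operator on cost vectors (an illustrative name):

```python
import numpy as np

def cascaded_fixed_point_iter(T_pi, J_prime, lmbd, tol=1e-10, max_iter=10000):
    """Iterate the lambda-cascaded operator of Eq. (5.12),
    J <- (1 - lmbd) * T_pi(J') + lmbd * T_pi(J), with J' held fixed.
    By Eq. (5.13), convergence is geometric with modulus lmbd * gamma."""
    J = np.asarray(J_prime, dtype=float)
    base = (1 - lmbd) * T_pi(J)          # the J'-term never changes
    for _ in range(max_iter):
        J_next = base + lmbd * T_pi(J)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J
```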

Interestingly, the monotonicity and constant shift properties are also preserved by the $\lambda$-cascaded Bellman operator.

Lemma 5.2 (Monotonicity and constant shift of the $\lambda$-cascaded Bellman operator)

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, the $\lambda$-cascaded Bellman operator $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$ enjoys both the monotonicity and the constant shift property.

(1) Let $J_{1}, J_{2} \in \mathbb{R}^{K}$ satisfy $J_{1}(x) \leq J_{2}(x)$ for all $x \in \mathcal{X}$. Then, for $k=1,2, \ldots$, we have
$$\left(\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\right)^{k} J_{1}(x) \leq\left(\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\right)^{k} J_{2}(x) \tag{5.15}$$

(2) Let $J_{1}, J_{2} \in \mathbb{R}^{K}$ be such that $J_{2}$ is a constant shift of $J_{1}$, i.e., $J_{2}(x)=J_{1}(x)+c$ for all $x \in \mathcal{X}$. Then, for $k=1,2, \ldots$, we have

$$\left(\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\right)^{k} J_{2}(x)=\left(\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\right)^{k} J_{1}(x)+\lambda^{k} \gamma^{k} c \tag{5.16}$$


Proof

Recalling the definition of $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$, we have

$$\begin{aligned} \mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{1} &=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi} J_{1} \\ & \leq(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi} J_{2} \\ &=\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{2} \end{aligned} \tag{5.17}$$

where the inequality is due to the monotonicity of the standard Bellman operator $\mathrm{T}_{\pi}$, as well as

$$\begin{aligned} \mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{2} &=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi}\left(J_{1}+c \mathbf{1}\right) \\ &=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi} J_{1}+\lambda \gamma c \mathbf{1} \\ &=\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{1}+\lambda \gamma c \mathbf{1} \end{aligned} \tag{5.18}$$

The claim for general $k$ then follows from a simple induction argument.


Proposition 5.4 (Contraction property of the $\lambda$-cascaded Bellman operator)

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and two bounded total cost function estimates $J_{1}, J_{2} \in \mathbb{R}^{K}$, the $\lambda$-cascaded Bellman operator is a contraction with modulus $\lambda \gamma$ with respect to the infinity norm, i.e.,

$$\left\|\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{1}-\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{2}\right\|_{\infty} \leq \lambda \gamma\left\|J_{1}-J_{2}\right\|_{\infty} \tag{5.19}$$

Moreover, the one-step evaluation of the $\lambda$-geometrically averaged Bellman operator at $J^{\prime}$ is the unique fixed point of $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$, i.e.,
$$\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\left(\mathrm{T}_{\pi, \lambda}^{\infty} J^{\prime}\right)=\mathrm{T}_{\pi, \lambda}^{\infty} J^{\prime} \tag{5.20}$$


Proof

Directly, we have

$$\begin{aligned} \left\|\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{1}-\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)} J_{2}\right\|_{\infty} & =\left\|\lambda \mathrm{T}_{\pi} J_{1}-\lambda \mathrm{T}_{\pi} J_{2}\right\|_{\infty} \\ & \leq \lambda \gamma\left\|J_{1}-J_{2}\right\|_{\infty} \end{aligned} \tag{5.21}$$

which simply follows from the contraction property of the Bellman operator. Substituting $\mathrm{T}_{\pi, \lambda}^{\infty} J^{\prime}$ into $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$ yields

$$\begin{aligned} \mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\left(\mathrm{T}_{\pi, \lambda}^{\infty} J^{\prime}\right) &=\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}\left((1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \mathrm{T}_{\pi}^{k} J^{\prime}\right) \\ &=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi}\left((1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \mathrm{T}_{\pi}^{k} J^{\prime}\right) \\ &=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+(1-\lambda) \sum_{k=2}^{\infty} \lambda^{k-1} \mathrm{T}_{\pi}^{k} J^{\prime} \\ &=(1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \mathrm{T}_{\pi}^{k} J^{\prime} \end{aligned} \tag{5.22}$$

where the third equality uses that $\mathrm{T}_{\pi}$ is affine and that the geometric weights sum to one. This completes the proof.


Clearly, the evaluation of the $\lambda$-geometrically averaged Bellman operator $\mathrm{T}_{\pi, \lambda}^{\infty}$ can be computed directly by finding the fixed point of the $\lambda$-cascaded Bellman operator $\mathrm{T}_{\pi, J^{\prime}}^{(\lambda)}$, i.e., by solving
$$J=(1-\lambda) \mathrm{T}_{\pi} J^{\prime}+\lambda \mathrm{T}_{\pi} J \tag{5.23}$$

In the MDP setting, we can exploit the compact formulation of the Bellman operator as in Eq. (4.25), which leads to the closed-form expression of the fixed point
$$J=\left(I_{K}-\gamma \lambda P_{\pi}\right)^{-1}\left(G_{\pi}+\gamma(1-\lambda) P_{\pi} J^{\prime}\right) \tag{5.24}$$
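
As a minimal sketch, Eq. (5.24) can be evaluated with a single linear solve; `P_pi` and `G_pi` stand for the transition matrix and stage-cost vector of the policy (illustrative names, assuming the compact formulation of Eq. (4.25)):

```python
import numpy as np

def cascaded_fixed_point(P_pi, G_pi, J_prime, gamma, lmbd):
    """Closed-form fixed point of the lambda-cascaded operator, Eq. (5.24):
    J = (I_K - gamma*lmbd*P_pi)^{-1} (G_pi + gamma*(1 - lmbd)*P_pi @ J_prime)."""
    K = P_pi.shape[0]
    A = np.eye(K) - gamma * lmbd * P_pi
    b = G_pi + gamma * (1 - lmbd) * P_pi @ J_prime
    return np.linalg.solve(A, b)  # a linear solve is preferable to an explicit inverse
```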

By taking the same approach as in Section 4.4 for developing optimistic policy iteration, we construct an optimistic $\lambda$-policy iteration algorithm (Algorithm 4). With the same strategy as in the proof of Proposition 4.7, the following convergence result for the optimistic $\lambda$-PI algorithm can be derived.

Proposition 5.5 (Convergence of the optimistic $\lambda$-PI algorithm)

Given an infinite-horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $\left\{J_{k}\right\}$ and $\left\{\pi_{k}\right\}$ be the sequences generated by the optimistic $\lambda$-PI algorithm. Then $J_{k}$ converges to $J^{*}$; moreover, there exists an index $\kappa$ such that $\pi_{k}$ is optimal for all $k>\kappa$ and, for such $k$,

$$\left\|J_{k+1}-J^{*}\right\|_{\infty} \leq\left(\frac{\gamma(1-\lambda)\left(1-\lambda^{m} \gamma^{m}\right)}{1-\lambda \gamma}+\lambda^{m} \gamma^{m}\right)\left\|J_{k}-J^{*}\right\|_{\infty} \tag{5.25}$$
*(Algorithm 4: the optimistic $\lambda$-policy iteration algorithm.)*


Proof

Similar to the proof for the plain OPI algorithm, we first assume $\mathrm{T}_{\mathfrak{g}} J_{0} \leq J_{0}$. Applying the policy improvement step, i.e., $\mathrm{T}_{\pi_{0}} J_{0}=\mathrm{T}_{\mathfrak{g}} J_{0}$, implies

$$\mathrm{T}_{\pi_{0}, J_{0}}^{(\lambda)} J_{0}=\mathrm{T}_{\pi_{0}} J_{0}=\mathrm{T}_{\mathfrak{g}} J_{0} \leq J_{0} \tag{5.26}$$

Now let us assume $\mathrm{T}_{\pi_{k}} J_{k} \leq J_{k}$. By the definition of the $\lambda$-cascaded Bellman operator in Eq. (5.12), we have
$$\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)} J_{k}=\mathrm{T}_{\pi_{k}} J_{k}=\mathrm{T}_{\mathfrak{g}} J_{k} \leq J_{k} \tag{5.27}$$

Then the construction $J_{k+1}=\left(\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)}\right)^{m_{k}} J_{k}$ implies $J_{k+1} \leq J_{k}$. Consequently, we have

$$\begin{aligned} \mathrm{T}_{\pi_{k+1}} J_{k+1} &=\mathrm{T}_{\mathfrak{g}} J_{k+1} \\ & \leq \mathrm{T}_{\pi_{k}} J_{k+1} \\ &=(1-\lambda) \mathrm{T}_{\pi_{k}} J_{k+1}+\lambda \mathrm{T}_{\pi_{k}} J_{k+1} \\ & \leq(1-\lambda) \mathrm{T}_{\pi_{k}} J_{k}+\lambda \mathrm{T}_{\pi_{k}} J_{k+1} \\ &=\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)} J_{k+1} \\ &=\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)} \circ\left(\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)}\right)^{m_{k}} J_{k} \\ & \leq J_{k+1}, \end{aligned} \tag{5.28}$$

where the last inequality follows from the monotonicity of the $\lambda$-cascaded Bellman operator. In summary, we have shown by induction that the inequality

$$\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)} J_{k}=\mathrm{T}_{\pi_{k}} J_{k}=\mathrm{T}_{\mathfrak{g}} J_{k} \leq J_{k} \tag{5.29}$$

holds for the whole sequence $\left\{J_{k}\right\}$ generated by the optimistic $\lambda$-policy iteration algorithm. Finally, for any $k$, we have
$$\begin{aligned} J_{k+1} &=\left(\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)}\right)^{m_{k}} J_{k} \\ & \leq \mathrm{T}_{\pi_{k}} J_{k} \\ &=\mathrm{T}_{\mathfrak{g}} J_{k} \end{aligned} \tag{5.30}$$

As a result,
$$J^{*} \leq J_{k} \leq \mathrm{T}_{\mathfrak{g}}^{k} J_{0} \tag{5.31}$$

k → ∞ k\rightarrow\infty k导致的结果是 J k J_{k} Jk在极限时收敛于 J ∗ J^{*} J。 没有假设 T g J 0 ≤ J 0 \mathrm{T}_{\mathfrak{g}} J_{0} \leq J_{0} TgJ0J0的结果与证明命题4.7的思路相同。

Now assume $m_{k}=m$ for all steps. Since $J_{k} \rightarrow J^{*}$, there exists an index $\kappa$ such that $\pi_{k}$ is an optimal policy for all $k$ greater than $\kappa$, i.e., $\mathrm{T}_{\pi_{k}}=\mathrm{T}_{\mathfrak{g}}$. Consequently, we have

$$\begin{aligned} \left\|J_{k+1}-J^{*}\right\|_{\infty} &=\left\|\left(\mathrm{T}_{\pi_{k}, J_{k}}^{(\lambda)}\right)^{m} J_{k}-J^{*}\right\|_{\infty} \\ &=\left\|(1-\lambda) \sum_{i=1}^{m} \lambda^{i-1} \mathrm{T}_{\pi_{k}}^{i} J_{k}+\lambda^{m} \mathrm{T}_{\pi_{k}}^{m} J_{k}-\mathrm{T}_{\pi_{k}} J^{*}\right\|_{\infty} \\ & \leq\left(\frac{\gamma(1-\lambda)\left(1-\lambda^{m} \gamma^{m}\right)}{1-\lambda \gamma}+\lambda^{m} \gamma^{m}\right)\left\|J_{k}-J^{*}\right\|_{\infty} \end{aligned} \tag{5.32}$$

where the inequality follows from the triangle inequality for the infinity norm, together with the analogous decomposition of $J^{*}=\mathrm{T}_{\pi_{k}} J^{*}$. This completes the proof.


Remark 5.1

m = 1 m=1 m=1时,该结果与基本VI算法的结果相吻合。当 m → ∞ m\rightarrow\infty m,对于所有 k k k大于某个指数 κ \kappa κ,使得 π k \pi_{k} πk是一个最优策略时,总成本函数的收敛特点为
∥ J k + 1 − J ∗ ∥ ∞ ≤ γ ( 1 − λ ) 1 − λ γ ∥ J k − J ∗ ∥ ∞ (5.33) \left\|J_{k+1}-J^{*}\right\|_{\infty} \leq \frac{\gamma(1-\lambda)}{1-\lambda \gamma}\left\|J_{k}-J^{*}\right\|_{\infty}\tag{5.33} Jk+1J1λγγ(1λ)JkJ(5.33)

which is consistent with the result for the $\lambda$-geometrically averaged Bellman operator in Proposition 5.3.
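
A quick numeric check of the rate in Eq. (5.25), again with hypothetical $\gamma=0.9$ and $\lambda=0.8$: at $m=1$ it reduces exactly to $\gamma$, the VI rate, and as $m$ grows it approaches the limit $\frac{\gamma(1-\lambda)}{1-\lambda \gamma}$ of Eq. (5.33).

```python
gamma, lmbd = 0.9, 0.8  # hypothetical values

def rate(m):
    """Convergence factor in Eq. (5.25) for optimistic lambda-PI."""
    return (gamma * (1 - lmbd) * (1 - (lmbd * gamma) ** m) / (1 - lmbd * gamma)
            + (lmbd * gamma) ** m)

print(rate(1))                                   # equals gamma = 0.9, the VI rate
print(rate(50))                                  # near the m -> infinity limit
print(gamma * (1 - lmbd) / (1 - lmbd * gamma))   # the limit in Eq. (5.33)
```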

5.4 Worked Example

Question 1: Optimistic $\lambda$-Policy Iteration: E-Bus, extended

Consider a group of electric buses running round trips 24 hours a day. The task is to identify optimal operating actions at the different battery states. A battery's endurance and charging speed gradually degrade as the battery ages; hence, different buses have different transition probabilities between battery states. The following figure illustrates the transitions between the states.

  • Five states: $H$ - high battery, $E$ - empty battery, $L_{1}, L_{2}, L_{3}$ - three different low-level battery states
  • Two actions: $S$ - continue to serve, $C$ - charge
  • Numbers on the edges refer to transition probabilities, with $p_{1}=0.4$ and $p_{2}=0.6$
  • Discount factor $\gamma=0.9$.

*(Figure: state-transition diagram over the five battery states.)*

We choose the number of unserviced passengers as the local cost:

  • In the high battery state, if the bus keeps serving, the number of unserviced passengers is 0.
  • In all the low battery states, if the bus keeps serving, the number of unserviced passengers is 2.
  • In all the low battery states and the empty battery state, if the bus charges, the number of unserviced passengers is 5.

Task 1

Implement one sweep of the $\lambda$-Policy Iteration Algorithm. Let us initialize the total cost as $J_{0}=0$ and set $\lambda=0.8$.
(i) Generate a GIP $\pi_{0}$ from $J_{0}$, and define its Bellman operator $\mathrm{T}_{\pi_{0}}$;
(ii) Iterate the Bellman operator $\mathrm{T}_{\pi_{0}}$ on $J_{0}$ until reaching the stopping criterion $\left\|J_{i+1}-J_{i}\right\|_{\infty}<10^{-2}$, where the $J_{i}$'s are the iterates of $\mathrm{T}_{\pi_{0}}$;
(iii) Iterate the $\lambda$-cascaded Bellman operator $\mathrm{T}_{\pi_{0}, J_{0}}^{(\lambda)}$ on $J_{0}$ until reaching the same stopping criterion as in (ii);
(iv) Compare the results of tasks (ii) and (iii), as well as the number of iterations required until stopping.

import numpy as np
import matplotlib.pyplot as plt

# transition probability
p1 = 0.4
p2 = 0.6

# initial cost function
jh = jl1 = jl2 = jl3 = je = 0

# cost of each state-action pair
ghs = 0
gl1s = gl2s = gl3s = 2
gec = gl3c = gl2c = gl1c = 5

lmbd = 0.6  # averaging weight; note: the task statement says lambda = 0.8, but this sample run uses 0.6
gamma = 0.9  # discount factor


def T_pi(J_k, ul1, ul2, ul3):
    """
    J_k: numpy array, total cost of five states
    ul1, ul2, ul3: str, 'S' or 'C'

    return: numpy array, updated total cost of five states
    """
    jh, jl1, jl2, jl3, je = float(J_k[0]), float(J_k[1]), float(J_k[2]), float(J_k[3]), float(J_k[4])

    jh_ = p1 * (ghs + gamma * jl1) + p2 * (ghs + gamma * jl2)  # S

    if ul1 == 'S':
        jl1_ = p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3)
    elif ul1 == 'C':
        jl1_ = gl1c + gamma * jh

    if ul2 == 'S':
        jl2_ = p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je)
    elif ul2 == 'C':
        jl2_ = p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh)

    if ul3 == 'S':
        jl3_ = gl3s + gamma * je
    elif ul3 == 'C':
        jl3_ = p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1)

    je_ = p1 * (gec + gamma * jl3) + p2 * (gec + gamma * jl2)  # C

    return np.array([jh_, jl1_, jl2_, jl3_, je_])


# Q1: Generate GIP and its Bellman operator T_pi0
jl1_all = [p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh]
ul1 = jl1_all.index(min(jl1_all))

jl2_all = [p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
           p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh)]
ul2 = jl2_all.index(min(jl2_all))

jl3_all = [gl3s + gamma * je,
           p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1)]
ul3 = jl3_all.index(min(jl3_all))

Actions = ['S', 'C']
print('Q1:\tThe GIPs of L1, L2, L3 are {}, {}, {}'.format(Actions[ul1], Actions[ul2], Actions[ul3]))

J_i = [0., 0., 0., 0., 0.]
T_pi0 = T_pi(J_i, Actions[ul1], Actions[ul2], Actions[ul3])
print('\t and its Bellman operator applied to J_0 gives {}'.format(T_pi0))

# Q2 : the number of iterations of Bellman operator
Threshold = 1
i = 0
J_i_data = []
while Threshold > 1e-2:
    J_i_ = T_pi(J_i, Actions[ul1], Actions[ul2], Actions[ul3])
    Threshold = max(abs(J_i_ - J_i))
    J_i_data.append(J_i)
    J_i = J_i_
    i += 1

print('\nQ2: After {} iterations the cost function J_i is {}'.format(i, J_i))

# Q3: the number of iterations of lambda-cascade Bellman operator
Threshold = 1
k = 0
J_k = [0., 0., 0., 0., 0.]
J_0 = [0., 0., 0., 0., 0.]
J_k_data = []
while Threshold > 1e-2:
    J_k_ = (1 - lmbd) * T_pi(J_0, Actions[ul1], Actions[ul2], Actions[ul3]) \
        + lmbd * (T_pi(J_k, Actions[ul1], Actions[ul2], Actions[ul3]))
    Threshold = max(abs(J_k_ - J_k))
    J_k_data.append(J_k)
    J_k = J_k_
    k += 1

print('\nQ3: After {} iterations the cost function J_k is {}'.format(k, J_k))

# Q4: compare the results and plot the results
J_i_error = np.max(np.abs(np.array(J_i_data) - J_i), axis=1)
J_k_error = np.max(np.abs(np.array(J_k_data) - J_k), axis=1)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Iteration i')
ax.set_ylabel(r'$||J - J^*||_{\infty}$')
ax.plot(J_i_error, marker='*', label='$PE$')
ax.plot(J_k_error, marker='.', label=r'$\lambda \ - \ PE$')
ax.legend()
ax.set_title('Threshold = 0.01')
plt.show()

The outputs are:

Q1:	The GIPs of L1, L2, L3 are S, S, S
	 and its Bellman operator applied to J_0 gives [0. 2. 2. 2. 5.]

Q2: After 57 iterations the cost function J_i is [28.71067246 31.33441578 32.29354488 32.80920569 34.24159683]

Q3: After 11 iterations the cost function J_k is [3.15423695 5.43385663 6.12451136 6.52698135 8.39038175]

*(Figure: error $\left\|J_{i}-J^{*}\right\|_{\infty}$ versus iteration for standard policy evaluation (PE) and $\lambda$-PE, threshold 0.01.)*

Task 2

Implement and run the $\lambda$-Policy Iteration Algorithm until convergence to generate an optimal policy. Let us set the number of optimistic $\lambda$-policy evaluation steps to $m_{k}=5$.

# Task 2 : generate the optimal policy
print('\n-------The optimal policy---------\n')
J_k = np.array([0., 0., 0., 0., 0.])
for k in range(0, 5):
    # Policy Improvement
    jh, jl1, jl2, jl3, je = J_k
    ul1 = np.argmin([p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh])

    ul2 = np.argmin([p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
           p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh)])

    ul3 = np.argmin([gl3s + gamma * je,
           p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1)])
    
    # Policy Evaluation: iterate the lambda-cascaded operator of Eq. (5.12)
    # with the reference estimate J' held fixed at the start-of-sweep J_k.
    # (Note: the original version fed the current J_k to both terms, which
    # reduces to plain Bellman iteration; the sample output below was
    # produced by that uncorrected run.)
    J_prime = J_k.copy()
    for m in range(5):
        J_k = (1 - lmbd) * T_pi(J_prime, Actions[ul1], Actions[ul2], Actions[ul3]) \
              + lmbd * T_pi(J_k, Actions[ul1], Actions[ul2], Actions[ul3])

    print('In step {} the greedy policy in L1 is {}, in L2 is {}, '
          'in L3 is {}'.format(k, Actions[ul1], Actions[ul2], Actions[ul3]))

The outputs are:

-------The optimal policy---------

In step 0 the greedy policy in L1 is S, in L2 is S, in L3 is S
In step 1 the greedy policy in L1 is C, in L2 is C, in L3 is S
In step 2 the greedy policy in L1 is C, in L2 is C, in L3 is S
In step 3 the greedy policy in L1 is C, in L2 is C, in L3 is S
In step 4 the greedy policy in L1 is C, in L2 is C, in L3 is S