【Mathematical Foundations of Reinforcement Learning】Lecture 2: The Bellman Equation

【Example: why the return matters】

[Figure: three different policies starting from $s_1$]

Question: starting from $s_1$, can we use a mathematical tool to describe which policy is best?

Answer: the return can be used to evaluate a policy.

  • Policy 1:
    $$\begin{aligned} \operatorname{return}_1 &= 0+\gamma 1+\gamma^2 1+\ldots \\ &= \gamma\left(1+\gamma+\gamma^2+\ldots\right) \\ &= \frac{\gamma}{1-\gamma} \end{aligned}$$

  • Policy 2:
    $$\begin{aligned} \operatorname{return}_2 &= -1+\gamma 1+\gamma^2 1+\ldots \\ &= -1+\gamma\left(1+\gamma+\gamma^2+\ldots\right) \\ &= -1+\frac{\gamma}{1-\gamma} \end{aligned}$$

  • Policy 3:
    $$\begin{aligned} \operatorname{return}_3 &= 0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right) \\ &= -0.5+\frac{\gamma}{1-\gamma} \end{aligned}$$

$$\operatorname{return}_1 > \operatorname{return}_3 > \operatorname{return}_2$$
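As a quick numerical check of these closed-form returns, here is a minimal sketch in Python (the value $\gamma=0.9$ is only an illustrative choice, not something fixed by the lecture):

```python
# Compare the returns of the three policies for a chosen discount rate.
gamma = 0.9  # assumed value; any gamma in (0, 1) gives the same ordering

return_1 = gamma / (1 - gamma)               # 0 + gamma*1 + gamma^2*1 + ...
return_2 = -1 + gamma / (1 - gamma)          # -1 + gamma*1 + gamma^2*1 + ...
return_3 = 0.5 * return_2 + 0.5 * return_1   # 50/50 mixture of the two trajectories

print(return_1, return_3, return_2)          # 9.0, 8.5, 8.0 -> return_1 > return_3 > return_2
```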

✨Computing returns

[Figure: four states in a cycle $s_1 \to s_2 \to s_3 \to s_4 \to s_1$ with rewards $r_1, r_2, r_3, r_4$]

Method 1: let $v_i$ denote the return obtained starting from $s_i\ (i=1,2,3,4)$:
$$\begin{aligned} v_1 &= r_1+\gamma r_2+\gamma^2 r_3+\ldots \\ v_2 &= r_2+\gamma r_3+\gamma^2 r_4+\ldots \\ v_3 &= r_3+\gamma r_4+\gamma^2 r_1+\ldots \\ v_4 &= r_4+\gamma r_1+\gamma^2 r_2+\ldots \end{aligned}$$
Method 2 (bootstrapping): the return obtained starting from one state depends on the returns obtained starting from the other states:
$$\begin{aligned} v_1 &= r_1+\gamma\left(r_2+\gamma r_3+\ldots\right)=r_1+\gamma v_2 \\ v_2 &= r_2+\gamma\left(r_3+\gamma r_4+\ldots\right)=r_2+\gamma v_3 \\ v_3 &= r_3+\gamma\left(r_4+\gamma r_1+\ldots\right)=r_3+\gamma v_4 \\ v_4 &= r_4+\gamma\left(r_1+\gamma r_2+\ldots\right)=r_4+\gamma v_1 \end{aligned}$$

$$\underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{\mathbf{v}}=\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}+\begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix}=\underbrace{\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}}_{\mathbf{r}}+\gamma \underbrace{\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}}_{\mathbf{P}} \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{\mathbf{v}}$$

$$\mathbf{v}=\mathbf{r}+\gamma \mathbf{P} \mathbf{v}$$

  • This is the Bellman equation (for this particular deterministic problem).
  • It says that the value of one state depends on the values of other states.
  • How do we solve this matrix-vector form? A minimal numerical sketch is given below.
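Here is a minimal sketch of solving this specific system; the reward values are placeholders, and only the cyclic structure $s_1\to s_2\to s_3\to s_4\to s_1$ and the equation $\mathbf{v}=\mathbf{r}+\gamma\mathbf{P}\mathbf{v}$ come from the example above:

```python
import numpy as np

gamma = 0.9                              # assumed discount rate
r = np.array([0.0, 1.0, 1.0, 1.0])       # placeholder values for r1..r4 (the actual rewards come from the figure)
P = np.array([                           # deterministic cycle s1 -> s2 -> s3 -> s4 -> s1
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# v = r + gamma * P v   <=>   (I - gamma * P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)                                 # the four returns v1..v4
```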

【State value】

✨Single-step process

$$S_t \stackrel{A_t}{\longrightarrow} R_{t+1},\ S_{t+1}$$

  • $t, t+1$: discrete time instants
  • $S_t$: the state at time $t$
  • $A_t$: the action taken in state $S_t$
  • $R_{t+1}$: the reward obtained after taking action $A_t$
  • $S_{t+1}$: the state reached after taking action $A_t$

Each step is governed by the following probability distributions:

  • $S_t \rightarrow A_t$: the policy $\pi\left(A_t=a \mid S_t=s\right)$
  • $S_t, A_t \rightarrow R_{t+1}$: the reward probability $p\left(R_{t+1}=r \mid S_t=s, A_t=a\right)$
  • $S_t, A_t \rightarrow S_{t+1}$: the state transition probability $p\left(S_{t+1}=s^{\prime} \mid S_t=s, A_t=a\right)$

✨Multi-step process

$$S_t \stackrel{A_t}{\longrightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\longrightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\longrightarrow} R_{t+3}, \ldots$$

  • Discounted return: $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots$, where $\gamma \in [0,1)$ is the discount rate

✨State value

$G_t$ is the discounted return of a single trajectory; the state value is the expectation of $G_t$:
$$v_\pi(s)=\mathbb{E}\left[G_t \mid S_t=s\right]$$

  • $v_\pi(s)$ depends on the starting state $s$: different trajectories from $s$ yield different values of $G_t$
  • $v_\pi(s)$ depends on the policy $\pi$: different policies produce different trajectories and therefore different values of $G_t$
  • The state value is not merely a number but a measure of worth: the larger it is, the more return can be obtained

Question: what is the difference between the return and the state value?

Answer:

  • The return is computed for a single trajectory.

  • The state value is the average of the returns over all trajectories.

    If multiple trajectories can start from a state, the two are clearly different; if only a single trajectory can start from a state, they coincide. The sketch below illustrates how averaging returns over sampled trajectories recovers the state value.
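A minimal sketch of that averaging, assuming episodes are sampled under policy 3 from the example above (first reward $-1$ or $0$ with probability $0.5$ each, then $+1$ forever) and truncating the infinite sum at a long horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_episodes = 0.9, 200, 10_000   # truncation horizon and sample size are arbitrary choices

returns = []
for _ in range(n_episodes):
    first_reward = -1.0 if rng.random() < 0.5 else 0.0   # stochastic first step of policy 3
    rewards = [first_reward] + [1.0] * (horizon - 1)      # afterwards the agent collects +1 forever
    g = sum(gamma**k * r_k for k, r_k in enumerate(rewards))
    returns.append(g)

print(np.mean(returns))              # Monte Carlo average of returns ~ state value of s1
print(-0.5 + gamma / (1 - gamma))    # analytic value from the example: 8.5
```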

【Derivation of the Bellman equation】

Definition

The Bellman equation describes the relationship between the state values of different states.

Derivation

Consider a random trajectory:
$$S_t \stackrel{A_t}{\longrightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\longrightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\longrightarrow} R_{t+3}, \ldots$$
Its return $G_t$ is:
$$\begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \\ &= R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \\ &= R_{t+1}+\gamma G_{t+1} \end{aligned}$$
and its state value is:
$$\begin{aligned} v_\pi(s) &= \mathbb{E}\left[G_t \mid S_t=s\right] \\ &= \mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\ &= \mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \end{aligned}$$
We now analyze the two expectations separately:

  • The first term, $\mathbb{E}\left[R_{t+1} \mid S_t=s\right]$, is the mean of the immediate reward:
    $$\begin{aligned} \mathbb{E}\left[R_{t+1} \mid S_t=s\right] &= \sum_a \pi(a \mid s) \mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right] \\ &= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a) r \end{aligned}$$

  • The second term, $\mathbb{E}\left[G_{t+1} \mid S_t=s\right]$, is the mean of the future rewards (the second equality below drops the conditioning on $S_t=s$, which is justified by the Markov property once $S_{t+1}=s^{\prime}$ is given):
    $$\begin{aligned} \mathbb{E}\left[G_{t+1} \mid S_t=s\right] &= \sum_{s^{\prime}} \mathbb{E}\left[G_{t+1} \mid S_t=s, S_{t+1}=s^{\prime}\right] p\left(s^{\prime} \mid s\right) \\ &= \sum_{s^{\prime}} \mathbb{E}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right] p\left(s^{\prime} \mid s\right) \\ &= \sum_{s^{\prime}} v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s\right) \\ &= \sum_{s^{\prime}} v_\pi\left(s^{\prime}\right) \sum_a p\left(s^{\prime} \mid s, a\right) \pi(a \mid s) \end{aligned}$$

Putting the two terms together gives the Bellman equation, which holds for every state in the state space:
$$\begin{aligned} v_\pi(s) &= \mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\ &= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s, a) r}_{\text{mean of immediate rewards}}+\underbrace{\gamma \sum_a \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)}_{\text{mean of future rewards}} \\ &= \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right], \quad \forall s \in \mathcal{S}. \end{aligned}$$
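To make the structure of this formula concrete, here is a sketch of a helper that evaluates the right-hand side for a single state from tabular representations of $\pi$, $p(r\mid s,a)$, $p(s'\mid s,a)$ and a current estimate of $v_\pi$; the array layouts are my own conventions, not something fixed by the lecture:

```python
import numpy as np

def bellman_rhs(s, pi, p_r, rewards, p_s, v, gamma):
    """Evaluate sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ].

    pi:      (n_states, n_actions),            pi[s, a]      = pi(a|s)
    p_r:     (n_states, n_actions, n_rewards), p_r[s, a, i]  = p(rewards[i] | s, a)
    rewards: (n_rewards,),                      the possible reward values
    p_s:     (n_states, n_actions, n_states),  p_s[s, a, s'] = p(s' | s, a)
    v:       (n_states,),                       current state-value estimates
    """
    value = 0.0
    for a in range(pi.shape[1]):
        immediate = p_r[s, a] @ rewards      # sum_r p(r|s,a) * r
        future = p_s[s, a] @ v               # sum_s' p(s'|s,a) * v(s')
        value += pi[s, a] * (immediate + gamma * future)
    return value
```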

✨Example 1:

[Figure: a grid world with a deterministic policy]

Let us write out the Bellman equation for every state in this figure.

For $s_1$:
$$\begin{aligned} & \pi\left(a=a_3 \mid s_1\right)=1 \text{ and } \pi\left(a \neq a_3 \mid s_1\right)=0, \\ & p\left(s^{\prime}=s_3 \mid s_1, a_3\right)=1 \text{ and } p\left(s^{\prime} \neq s_3 \mid s_1, a_3\right)=0, \\ & p\left(r=0 \mid s_1, a_3\right)=1 \text{ and } p\left(r \neq 0 \mid s_1, a_3\right)=0. \end{aligned}$$
Substituting these into the Bellman equation gives:
$$v_\pi\left(s_1\right)=0+\gamma v_\pi\left(s_3\right)$$
Similarly, for the other states we obtain:
$$\begin{aligned} & v_\pi\left(s_1\right)=0+\gamma v_\pi\left(s_3\right), \\ & v_\pi\left(s_2\right)=1+\gamma v_\pi\left(s_4\right), \\ & v_\pi\left(s_3\right)=1+\gamma v_\pi\left(s_4\right), \\ & v_\pi\left(s_4\right)=1+\gamma v_\pi\left(s_4\right). \end{aligned}$$
Solving these equations (starting from the last one and substituting backward) gives:
$$\begin{aligned} & v_\pi\left(s_4\right)=\frac{1}{1-\gamma}, \\ & v_\pi\left(s_3\right)=\frac{1}{1-\gamma}, \\ & v_\pi\left(s_2\right)=\frac{1}{1-\gamma}, \\ & v_\pi\left(s_1\right)=\frac{\gamma}{1-\gamma}. \end{aligned}$$
Taking $\gamma=0.9$, we obtain:
$$\begin{aligned} & v_\pi\left(s_4\right)=\frac{1}{1-0.9}=10, \\ & v_\pi\left(s_3\right)=\frac{1}{1-0.9}=10, \\ & v_\pi\left(s_2\right)=\frac{1}{1-0.9}=10, \\ & v_\pi\left(s_1\right)=\frac{0.9}{1-0.9}=9. \end{aligned}$$
[Figure: the computed state values placed on the grid]

A state with a larger state value is more valuable: starting from it, the agent can obtain more return.

✨Example 2:

[Figure: a policy that is stochastic at $s_1$: with probability 0.5 it moves toward $s_2$ (reward $-1$) and with probability 0.5 toward $s_3$ (reward $0$)]

Its Bellman equations are:
$$\begin{aligned} & v_\pi\left(s_1\right)=0.5\left[0+\gamma v_\pi\left(s_3\right)\right]+0.5\left[-1+\gamma v_\pi\left(s_2\right)\right], \\ & v_\pi\left(s_2\right)=1+\gamma v_\pi\left(s_4\right), \\ & v_\pi\left(s_3\right)=1+\gamma v_\pi\left(s_4\right), \\ & v_\pi\left(s_4\right)=1+\gamma v_\pi\left(s_4\right). \end{aligned}$$
Solving them gives:
$$\begin{aligned} v_\pi\left(s_4\right) &= \frac{1}{1-\gamma}, \quad v_\pi\left(s_3\right)=\frac{1}{1-\gamma}, \quad v_\pi\left(s_2\right)=\frac{1}{1-\gamma}, \\ v_\pi\left(s_1\right) &= 0.5\left[0+\gamma v_\pi\left(s_3\right)\right]+0.5\left[-1+\gamma v_\pi\left(s_2\right)\right] \\ &= -0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$
Taking $\gamma=0.9$, we obtain:
$$v_\pi\left(s_4\right)=10, \quad v_\pi\left(s_3\right)=10, \quad v_\pi\left(s_2\right)=10, \quad v_\pi\left(s_1\right)=-0.5+9=8.5.$$
The state value of $s_1$ is 8.5, so this policy is worse than the previous one.

【Matrix-vector form of the Bellman equation】

Recall the elementwise form of the Bellman equation:
$$v_\pi(s)=\sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right]$$
It can be rewritten as:
$$v_\pi(s)=r_\pi(s)+\gamma \sum_{s^{\prime}} p_\pi\left(s^{\prime} \mid s\right) v_\pi\left(s^{\prime}\right)$$

  • $r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a) r$: the mean of the immediate rewards
  • $p_\pi\left(s^{\prime} \mid s\right) \triangleq \sum_a \pi(a \mid s) p\left(s^{\prime} \mid s, a\right)$: the probability of moving from $s$ to $s^{\prime}$ under policy $\pi$

Indexing the states as $s_i\ (i=1, \ldots, n)$, the Bellman equation for state $s_i$ is:
$$v_\pi\left(s_i\right)=r_\pi\left(s_i\right)+\gamma \sum_{s_j} p_\pi\left(s_j \mid s_i\right) v_\pi\left(s_j\right)$$
Stacking the equations for all states gives the matrix-vector form:
$$v_\pi=r_\pi+\gamma P_\pi v_\pi$$

  • $v_\pi=\left[v_\pi\left(s_1\right), \ldots, v_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$

  • $r_\pi=\left[r_\pi\left(s_1\right), \ldots, r_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$

  • $P_\pi \in \mathbb{R}^{n \times n}$, where $\left[P_\pi\right]_{ij}=p_\pi\left(s_j \mid s_i\right)$, is the state transition matrix

    For example, with four states, $v_\pi=r_\pi+\gamma P_\pi v_\pi$ can be written out as:
    $$\underbrace{\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}}_{v_\pi}=\underbrace{\begin{bmatrix} r_\pi\left(s_1\right) \\ r_\pi\left(s_2\right) \\ r_\pi\left(s_3\right) \\ r_\pi\left(s_4\right) \end{bmatrix}}_{r_\pi}+\gamma \underbrace{\begin{bmatrix} p_\pi\left(s_1 \mid s_1\right) & p_\pi\left(s_2 \mid s_1\right) & p_\pi\left(s_3 \mid s_1\right) & p_\pi\left(s_4 \mid s_1\right) \\ p_\pi\left(s_1 \mid s_2\right) & p_\pi\left(s_2 \mid s_2\right) & p_\pi\left(s_3 \mid s_2\right) & p_\pi\left(s_4 \mid s_2\right) \\ p_\pi\left(s_1 \mid s_3\right) & p_\pi\left(s_2 \mid s_3\right) & p_\pi\left(s_3 \mid s_3\right) & p_\pi\left(s_4 \mid s_3\right) \\ p_\pi\left(s_1 \mid s_4\right) & p_\pi\left(s_2 \mid s_4\right) & p_\pi\left(s_3 \mid s_4\right) & p_\pi\left(s_4 \mid s_4\right) \end{bmatrix}}_{P_\pi} \underbrace{\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}}_{v_\pi}$$
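As a sketch of how $r_\pi$ and $P_\pi$ could be assembled from the same tabular quantities used in the earlier sketch (array layouts again being my own assumptions):

```python
import numpy as np

def build_r_and_P(pi, p_r, rewards, p_s):
    """r_pi[s]     = sum_a pi(a|s) * sum_r p(r|s,a) * r
       P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
    Shapes follow the conventions of the previous sketch."""
    expected_r = p_r @ rewards                      # (n_states, n_actions): E[R | s, a]
    r_pi = np.einsum("sa,sa->s", pi, expected_r)    # average over actions under the policy
    P_pi = np.einsum("sa,sat->st", pi, p_s)         # marginalize the action out of p(s'|s,a)
    return r_pi, P_pi
```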

✨Example 1:

[Figure: the deterministic policy from the earlier Example 1]

$$\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}=\begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}+\gamma\begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}$$

✨Example 2:

[Figure: the stochastic policy from the earlier Example 2]

$$\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}=\begin{bmatrix} 0.5(0)+0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix}+\gamma\begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}$$

【Solving the Bellman equation for state values】

Policy evaluation is a key tool in reinforcement learning: only by evaluating how good a policy is can we go on to improve it and eventually find an optimal one. The equation to solve is:
$$v_\pi=r_\pi+\gamma P_\pi v_\pi$$

✨Method 1 (closed-form solution):

$$v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi$$

This gives the solution directly, but it requires computing a matrix inverse, so in practice it is rarely used.

✨Method 2 (iterative solution):

$$v_{k+1}=r_\pi+\gamma P_\pi v_k$$


Starting from an arbitrary initial guess $v_0$, the iterates converge to the true state values:
$$v_k \rightarrow v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi, \quad k \rightarrow \infty$$
A minimal sketch of both methods on the earlier example follows.
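This sketch uses $r_\pi$ and $P_\pi$ from the matrix-vector form of Example 1 above, with $\gamma=0.9$:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])   # r_pi from Example 1
P_pi = np.array([                        # P_pi from Example 1
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

# Method 1: closed-form solution v = (I - gamma * P_pi)^(-1) r_pi
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Method 2: iterate v_{k+1} = r_pi + gamma * P_pi v_k from an arbitrary initial guess
v = np.zeros(4)
for _ in range(1000):
    v_next = r_pi + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:   # stop when the update becomes negligible
        v = v_next
        break
    v = v_next

print(v_closed)   # [ 9. 10. 10. 10.]
print(v)          # the iteration converges to the same values
```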

✨Example 1:

[Figure: two different good policies whose state values are the same]

Both of these policies are good: although they are different policies, the resulting state values are the same.

✨Example 2:

[Figure: two policies that are not good, with negative state values]

These two policies are not good: their state values are negative.

So we can compute state values to evaluate whether a policy is good or bad.

【Action value】

  • State value: the average return the agent obtains when starting from a state.

  • Action value: the average return the agent obtains when starting from a state and taking a particular action.

    In reinforcement learning, a policy specifies which action to take in each state; action values tell us which actions are worth choosing.

✨Definition:

$$q_\pi(s, a)=\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]$$

  • $q_\pi(s, a)$ depends on the starting state $s$ and the action $a$ taken there; it also depends on the policy $\pi$

✨Relationship between state value and action value:

$$\begin{aligned} \underbrace{\mathbb{E}\left[G_t \mid S_t=s\right]}_{v_\pi(s)} &= \sum_a \underbrace{\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]}_{q_\pi(s, a)} \pi(a \mid s) \\ v_\pi(s) &= \sum_a \pi(a \mid s) q_\pi(s, a) \end{aligned}$$

In other words, the state value is the average of the action values of the available actions, weighted by the policy.

Comparing this with the Bellman equation derived earlier,
$$v_\pi(s)=\sum_a \pi(a \mid s)[\underbrace{\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)}_{q_\pi(s, a)}]$$
we obtain the expression for the action value:
$$q_\pi(s, a)=\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)$$
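A short sketch of computing all action values from a known $v_\pi$, reusing the tabular layout assumed in the earlier sketches:

```python
import numpy as np

def action_values(p_r, rewards, p_s, v, gamma):
    """q_pi[s, a] = sum_r p(r|s,a) * r + gamma * sum_s' p(s'|s,a) * v_pi(s')."""
    return p_r @ rewards + gamma * (p_s @ v)   # shape (n_states, n_actions)

# Consistency check with the relation above:
#   v_pi(s) == sum_a pi(a|s) * q_pi(s, a)
# v_check = np.einsum("sa,sa->s", pi, action_values(p_r, rewards, p_s, v, gamma))
```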

✨Example:

[Figure: at $s_1$ the policy takes action $a_2$ (move right) and receives reward $-1$]

The action value of $a_2$ at $s_1$ is: $q_\pi\left(s_1, a_2\right)=-1+\gamma v_\pi\left(s_2\right)$

Question: what are $q_\pi\left(s_1, a_1\right)$, $q_\pi\left(s_1, a_3\right)$, $q_\pi\left(s_1, a_4\right)$, $q_\pi\left(s_1, a_5\right)$?

Answer: although the current policy tells us to move right, that action is not necessarily the best; in fact, the action values of all the other actions can be computed as well:
$$\begin{aligned} & q_\pi\left(s_1, a_1\right)=-1+\gamma v_\pi\left(s_1\right) \\ & q_\pi\left(s_1, a_3\right)=0+\gamma v_\pi\left(s_3\right) \\ & q_\pi\left(s_1, a_4\right)=-1+\gamma v_\pi\left(s_1\right) \\ & q_\pi\left(s_1, a_5\right)=0+\gamma v_\pi\left(s_1\right) \end{aligned}$$

【Summary】

[Figure: lecture summary]
