Derivation of the Bellman equation for the value of a policy

In Chapter 3 of the book 'Reinforcement Learning: An Introduction', the authors give the Bellman equation for $v_\pi$ as equation (3.14), but without a detailed derivation. That left me confused and uncomfortable, so I tried to derive the Bellman equation myself. The details of the derivation are given below:
$$
\begin{aligned}
v_\pi(s) &= \mathbb E_\pi(G_t \mid S_t = s) \\
&= \mathbb E_\pi(R_{t+1} + \gamma \cdot G_{t+1} \mid S_t = s) \\
&= \mathbb E_\pi(R_{t+1} \mid S_t = s) + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_t = s) \\
&= \sum_a \bigl[ \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a) \cdot \Pr(A_t = a \mid S_t = s) \\
&\quad + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_t = s, A_t = a) \cdot \Pr(A_t = a \mid S_t = s) \bigr] \\
&= \sum_a \Pr(A_t = a \mid S_t = s) \bigl[ \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a) + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_t = s, A_t = a) \bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r r \cdot \Pr(R_{t+1} = r \mid S_t = s, A_t = a) + \gamma \sum_g g \cdot \Pr(G_{t+1} = g \mid S_t = s, A_t = a) \Bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r \sum_{s'} r \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \Pr(G_{t+1} = g, R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r \sum_{s'} r \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \frac{\Pr(G_{t+1} = g, R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{\Pr(S_t = s, A_t = a)} \Bigr] \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} r \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \Bigl[ \Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \\
&\quad \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \, \Pr(S_t = s, A_t = a) / \Pr(S_t = s, A_t = a) \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} r \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \Bigl[ \Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \\
&\quad \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \sum_g g \cdot \Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \Bigr] \biggr\}
\end{aligned}
$$
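For readers who want the justification of each step, the derivation above seems to rest on three standard identities for discrete random variables. They are written out here as a sketch; $X$ is a placeholder symbol (not from the book) standing for either $R_{t+1}$ or $G_{t+1}$, and all sums are assumed to run over finite sets.

$$
\begin{aligned}
&\text{Total expectation over actions:} \\
&\quad \mathbb E_\pi(X \mid S_t = s) = \sum_a \Pr(A_t = a \mid S_t = s) \cdot \mathbb E_\pi(X \mid S_t = s, A_t = a) \\
&\text{Marginalization over } s'\text{:} \\
&\quad \Pr(R_{t+1} = r \mid S_t = s, A_t = a) = \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\text{Chain rule for conditional probability:} \\
&\quad \Pr(G_{t+1} = g, R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad = \Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \cdot \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a)
\end{aligned}
$$

The first identity gives the sum over $a$, the second introduces the sum over $s'$ (and, applied to $G_{t+1}$, the sums over $r$ and $s'$), and the third is what the two lines involving $\Pr(S_t = s, A_t = a)$ work out by hand.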
$\because$ By the Markov property, under policy $\pi$ the return $G_{t+1}$ depends only on $S_{t+1}$: once $S_{t+1}$ is given, $R_{t+1}$, $S_t$ and $A_t$ contribute nothing further to $G_{t+1}$,
$$
\therefore \Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) = \Pr(G_{t+1} = g \mid S_{t+1} = s')
$$
$$
\begin{aligned}
\therefore v_\pi(s) &= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \sum_g g \cdot \Pr(G_{t+1} = g \mid S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \, \mathbb E_\pi(G_{t+1} \mid S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} p(r, s' \mid s, a) \cdot \Bigl[ r + \gamma v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
That is the Bellman equation for $v_\pi$, which is what we set out to derive.
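As a sanity check on the result, here is a minimal Python sketch that tests the equation numerically on a toy problem. Everything in it is an assumption made for illustration: the two-state, two-action MDP, its probabilities and rewards, the discount `GAMMA`, and the helper names `bellman_rhs` and `value_from_definition`. The value is computed straight from the definition $v_\pi(s) = \mathbb E_\pi(G_t \mid S_t = s)$, by propagating the state distribution forward and summing discounted expected rewards over a long horizon, and is then plugged into the right-hand side of the Bellman equation.

```python
# Toy MDP; every number below is made up for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# POLICY[s][a] = pi(a | s)
POLICY = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}

# DYNAMICS[(s, a)] = list of (probability, reward, next_state), i.e. p(r, s' | s, a)
DYNAMICS = {
    (0, 0): [(0.7, 1.0, 0), (0.3, 0.0, 1)],
    (0, 1): [(1.0, 2.0, 1)],
    (1, 0): [(0.4, -1.0, 0), (0.6, 3.0, 1)],
    (1, 1): [(1.0, 0.5, 0)],
}


def bellman_rhs(v, s):
    """sum_a pi(a|s) sum_{r,s'} p(r,s'|s,a) * [r + gamma * v(s')]."""
    return sum(
        POLICY[s][a]
        * sum(p * (r + GAMMA * v[s2]) for p, r, s2 in DYNAMICS[(s, a)])
        for a in ACTIONS
    )


def value_from_definition(s, horizon=500):
    """Approximate E_pi[G_t | S_t = s] as sum_k gamma^k * E[R_{t+k+1} | S_t = s].

    The sum is truncated after `horizon` steps; with gamma = 0.9 the
    neglected tail is far below floating-point precision.
    """
    dist = {x: 1.0 if x == s else 0.0 for x in STATES}  # distribution of S_{t+k}
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        next_dist = {x: 0.0 for x in STATES}
        for x in STATES:
            for a in ACTIONS:
                for p, r, x2 in DYNAMICS[(x, a)]:
                    weight = dist[x] * POLICY[x][a] * p
                    total += discount * weight * r  # discounted expected reward
                    next_dist[x2] += weight         # push probability mass forward
        dist, discount = next_dist, discount * GAMMA
    return total


v = {s: value_from_definition(s) for s in STATES}
residual = max(abs(v[s] - bellman_rhs(v, s)) for s in STATES)
print("v_pi:", v)
print("max Bellman residual:", residual)
```

If the derivation holds, the printed residual should be zero up to floating-point error, since the value obtained from the definition must satisfy the Bellman equation in every state.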
