[Reinforcement Learning] Bellman Equation Derivation

Bellman Equation Derivation:

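  • Return ($G_t$) is the total reward collected from time $t$ onward after discounting: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$.
  • The state value function ($V_t(s)$, written $V(s)$ below) is the expected return of an MRP (Markov Reward Process, $\langle s, r, s' \rangle$). It can be defined as follows: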
    $$
    \begin{aligned}
    V(s) &= E[G_t \mid s_t = s] \\
    &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \mid s_t = s] \\
    &= E[R_{t+1} \mid s_t = s] + E[\gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \mid s_t = s] \\
    &= R(s) + \gamma E[R_{t+2} + \gamma R_{t+3} + \dots + \gamma^{T-t-2} R_T \mid s_t = s] \\
    &= R(s) + \gamma E[G_{t+1} \mid s_t = s]
    \end{aligned}
    $$
    Here $R(s) = E[R_{t+1} \mid s_t = s]$ is the expected immediate reward in state $s$. The Bellman equation is then derived using the following identity:
    $$E[V(s_{t+1}) \mid s_t] = E[E[G_{t+1} \mid s_{t+1}] \mid s_t] = E[G_{t+1} \mid s_t]$$
    (This identity follows from the Law of Total Expectation, $E(X) = \sum_i E(X \mid A_i) P(A_i)$, together with the Markov property, which gives $E[G_{t+1} \mid s_{t+1}, s_t] = E[G_{t+1} \mid s_{t+1}]$.)
    By definition, $V(s_{t+1}) = E[G_{t+1} \mid s_{t+1}]$, so $G_{t+1}$ can be replaced by $V(s_{t+1})$ inside the expectation conditioned on $s_t$:
    $$
    \begin{aligned}
    V(s) &= R(s) + \gamma E[G_{t+1} \mid s_t = s] \\
    &= R(s) + \gamma E[V(s_{t+1}) \mid s_t = s] \\
    &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V(s')
    \end{aligned}
    $$
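
The final form of the Bellman equation, $V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s')$, can be checked numerically. The sketch below is only an illustration: the 3-state MRP, its rewards `R`, its transition matrix `P`, and the discount `GAMMA` are all made-up assumptions and do not come from the derivation above. It applies the right-hand side of the equation repeatedly until the values stop changing, which gives a $V$ that satisfies the equation in every state.

```python
# Hypothetical 3-state MRP. All numbers below are illustrative assumptions.
GAMMA = 0.9                      # discount factor gamma
R = [1.0, 2.0, 0.0]              # R[s]: expected immediate reward in state s
P = [                            # P[s][s2]: probability of moving from s to s2
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],             # state 2 is absorbing and pays no reward
]


def bellman_backup(V):
    """Return R(s) + gamma * sum_{s'} P(s'|s) * V(s') for every state s."""
    n = len(V)
    return [
        R[s] + GAMMA * sum(P[s][s2] * V[s2] for s2 in range(n))
        for s in range(n)
    ]


def evaluate(tol=1e-12, max_iters=10_000):
    """Iterate the Bellman backup from V = 0 until successive values agree."""
    V = [0.0] * len(R)
    for _ in range(max_iters):
        V_new = bellman_backup(V)
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


V = evaluate()
# Largest gap between V(s) and the right-hand side of the Bellman equation.
residual = max(abs(a - b) for a, b in zip(V, bellman_backup(V)))
print("V =", V)
print("Bellman residual =", residual)
```

Because state 2 is absorbing with zero reward, its value is 0, and the other two values can also be solved by hand from the equation (for example, $V(1) = 2 + 0.9 \cdot 0.5 \cdot V(1)$), which makes this small example easy to verify.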