Bellman Equation Derivation:
- Return ($G_t$) is the total payoff obtained by summing the rewards, with discounting, from time step $t$ onward.
- State value function ($V_t(s)$) is the expected return of an MRP (Markov Reward Process, $\langle s, r, s' \rangle$). It can be defined as follows:
$$
\begin{aligned}
V(s) &= E[G_t \mid s_t = s] \\
&= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \mid s_t = s] \\
&= E[R_{t+1} \mid s_t = s] + E[\gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \mid s_t = s] \\
&= R(s) + \gamma E[R_{t+2} + \gamma R_{t+3} + \dots + \gamma^{T-t-2} R_T \mid s_t = s] \\
&= R(s) + \gamma E[G_{t+1} \mid s_t = s]
\end{aligned}
$$
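Before continuing, a quick sanity check: the derivation above relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The following minimal Python sketch (the reward sequence and discount factor are made-up values, not from the original text) verifies that the backward recursion reproduces the direct definition of the return:

```python
# Sanity check: the return satisfies G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.9
rewards = [1.0, 0.5, 2.0, -1.0]  # R_1, ..., R_T for a hypothetical episode

# Direct definition: G_t = sum_k gamma^k * R_{t+k+1}
def direct_return(t):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Backward recursion: G_T = 0, then G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()  # now returns[t] == G_t

for t in range(len(rewards)):
    assert abs(returns[t] - direct_return(t)) < 1e-12
print(returns)
```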
To rewrite $E[G_{t+1} \mid s_t = s]$ in terms of the value of the next state, the Bellman equation is derived using the following identity:

$$
E[V(s_{t+1}) \mid s_t] = E\big[\,E[G_{t+1} \mid s_{t+1}] \mid s_t\,\big] = E[G_{t+1} \mid s_t]
$$

(This identity follows from the Law of Total Expectation: $E[X] = \sum_i E[X \mid A_i]\,P(A_i)$.) Since, by definition,

$$
V(s_{t+1}) = E[G_{t+1} \mid s_{t+1}]
$$

substituting back into the derivation above gives the Bellman equation:

$$
\begin{aligned}
V(s) &= R(s) + \gamma E[G_{t+1} \mid s_t = s] \\
&= R(s) + \gamma E[V(s_{t+1}) \mid s_t = s] \\
&= R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')
\end{aligned}
$$
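In matrix form the last line reads $V = R + \gamma P V$, which can be solved exactly as $V = (I - \gamma P)^{-1} R$. Below is a minimal NumPy sketch of this solve; the 3-state transition matrix and reward vector are made-up toy values, not from the original text:

```python
import numpy as np

# Bellman equation in matrix form: V = R + gamma * P @ V,
# hence V = (I - gamma * P)^{-1} R.
gamma = 0.9
# Toy 3-state MRP; P and R are hypothetical values for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])   # P[s, s'] = P(s' | s), rows sum to 1
R = np.array([1.0, 0.0, -0.5])    # R(s): expected immediate reward in s

# Solve the linear system (I - gamma * P) V = R exactly.
V = np.linalg.solve(np.eye(3) - gamma * P, R)

# Each component satisfies V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s').
assert np.allclose(V, R + gamma * P @ V)
print(V)
```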