1. Bellman Equation
$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s)V(s')$$
where:
- $R(s)$ is the immediate reward
- $\gamma \sum_{s' \in S} P(s'|s)V(s')$ is the discounted sum of future rewards
- $s'$ ranges over all possible future states
- $V(s')$ is the value of a particular future state
- $P(s'|s)$ is the probability of transitioning from the current state to a future state
The Bellman equation defines the relationship between the current state and future states.
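This relationship can be sketched numerically. The snippet below iterates the Bellman backup $V \leftarrow R + \gamma P V$ on a hypothetical 3-state Markov reward process (the states, rewards, and transition matrix are made-up assumptions, not from the text); since $\gamma < 1$, the iteration converges to the unique fixed point.

```python
import numpy as np

# A made-up 3-state Markov reward process for illustration.
R = np.array([1.0, 0.0, -1.0])       # R(s): immediate reward of each state
P = np.array([[0.5, 0.5, 0.0],       # P(s'|s): row s is the transition
              [0.2, 0.6, 0.2],       # distribution over next states s'
              [0.0, 0.3, 0.7]])
gamma = 0.9                          # discount factor

# Repeatedly apply V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s')
V = np.zeros(3)
for _ in range(1000):
    V = R + gamma * P @ V

print(V)  # the converged state values
```

At convergence, `V` satisfies the Bellman equation exactly (up to floating point), which you can verify by checking `V ≈ R + gamma * P @ V`.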
2. Probability Refresher
- Conditional probability: $P(A,B) = P(AB) = P(A|B)P(B) = P(B|A)P(A)$
- Law of total probability: $P(B) = \sum_i P(A_i)P(B|A_i)$
- Bayes' rule: $P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$
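The three identities above can be checked numerically on a toy joint distribution. The distribution below over two binary events is an arbitrary made-up example, not from the text.

```python
# P(A, B): a made-up example joint distribution over two binary events.
p_joint = {
    (True, True): 0.12, (True, False): 0.28,
    (False, True): 0.18, (False, False): 0.42,
}
p_A = sum(p for (a, _), p in p_joint.items() if a)   # marginal P(A) = 0.40
p_B = sum(p for (_, b), p in p_joint.items() if b)   # marginal P(B) = 0.30

# Conditional probability: P(B|A) = P(A, B) / P(A)
p_B_given_A = p_joint[(True, True)] / p_A

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

# Sanity check against the direct definition P(A|B) = P(A, B) / P(B)
print(p_A_given_B, p_joint[(True, True)] / p_B)  # both 0.12 / 0.30 = 0.4
```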
3. Law of Total Expectation (LIE)
If $\{A_i\}$ is a finite or countable partition of the sample space, the law of total expectation can be written as:
$$\mathbb{E}(X) = \sum_i \mathbb{E}(X|A_i)P(A_i)$$
For brevity, let $s = s_t$, $g' = G_{t+1}$, $s' = s_{t+1}$. The expected return can then be written as:
$$\mathbb{E}\left[ G_{t+1}|s_{t+1} \right] = \mathbb{E}\left[ g'|s' \right] = \sum_{g'} g'\, p(g'|s')$$
Letting $s_t = s$ and taking the expectation of the expression above conditioned on $s_t$ (the second line uses the Markov property, $p(g'|s') = p(g'|s',s)$):
$$\begin{aligned}
\mathbb{E}\left[ \mathbb{E}\left[ G_{t+1}|s_{t+1} \right]|s_t \right] &= \mathbb{E}\left[ \mathbb{E}\left[ g'|s' \right]|s \right]\\
&= \sum_{s'} \sum_{g'} g'\, p(g'|s',s)\, p(s'|s)\\
&= \sum_{s'} \sum_{g'} \frac{g'\, p(g'|s',s)\, p(s'|s)\, p(s)}{p(s)}\\
&= \sum_{s'} \sum_{g'} \frac{g'\, p(g'|s',s)\, p(s',s)}{p(s)}\\
&= \sum_{g'} \sum_{s'} \frac{g'\, p(g',s',s)}{p(s)}\\
&= \sum_{g'} \frac{g'\, p(g',s)}{p(s)}\\
&= \sum_{g'} g'\, p(g'|s)\\
&= \mathbb{E}\left[ g'|s \right] = \mathbb{E}\left[ G_{t+1}|s_t \right]
\end{aligned}$$
That is:
$$\mathbb{E}\left[ V(s_{t+1})|s_t \right] = \mathbb{E}\left[ \mathbb{E}\left[ G_{t+1}|s_{t+1} \right]|s_t \right] = \mathbb{E}\left[ G_{t+1}|s_t \right]$$
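The tower identity above can be verified numerically. The snippet below builds a toy model with two next states and three possible returns (all probabilities and return values are made-up assumptions) and checks that the iterated expectation $\mathbb{E}[\mathbb{E}[g'|s']\,|\,s]$ equals the direct expectation $\mathbb{E}[g'|s]$.

```python
import numpy as np

# Toy model: next states s' in {0, 1}, returns g' in {0.0, 1.0, 2.0}.
# All distributions below are made-up examples.
p_s_prime = np.array([0.4, 0.6])                # p(s'|s) for a fixed s
p_g_given_s_prime = np.array([[0.5, 0.3, 0.2],  # p(g'|s'=0)
                              [0.1, 0.4, 0.5]]) # p(g'|s'=1)
g_vals = np.array([0.0, 1.0, 2.0])

# Inner expectation E[g'|s'] for each s', then the outer expectation over s'
inner = p_g_given_s_prime @ g_vals              # E[g'|s'=0], E[g'|s'=1]
lhs = p_s_prime @ inner                         # E[ E[g'|s'] | s ]

# Direct route: marginalize first, p(g'|s) = sum_{s'} p(g'|s') p(s'|s),
# which uses the same Markov property as the derivation above.
p_g_given_s = p_s_prime @ p_g_given_s_prime
rhs = p_g_given_s @ g_vals                      # E[g'|s]

print(lhs, rhs)  # equal up to floating point
```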
4. Deriving the Bellman Equation
$$\begin{aligned}
V(s) &= \mathbb{E}\left[ G_t|s_t = s \right]\\
&= \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots |s_t = s \right]\\
&= \mathbb{E}\left[ R_{t+1}|s_t = s \right] + \gamma\, \mathbb{E}\left[ R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \ldots |s_t = s \right]\\
&= R(s) + \gamma\, \mathbb{E}\left[ G_{t+1}|s_t = s \right]\\
&= R(s) + \gamma\, \mathbb{E}\left[ V(s_{t+1})|s_t = s \right]\\
&= R(s) + \gamma \sum_{s' \in S} P(s'|s)V(s')
\end{aligned}$$
The Bellman equation is thus a recursive relationship between the current state and future states: the value function of the current state can be computed from the value function of the next state.
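Since the final line is linear in $V$, it can also be solved in closed form: rearranging $V = R + \gamma P V$ gives $(I - \gamma P)V = R$, i.e. $V = (I - \gamma P)^{-1} R$. The sketch below solves this linear system for a hypothetical 3-state MRP (the rewards and transition matrix are made-up assumptions).

```python
import numpy as np

# Made-up 3-state MRP: reward vector R(s) and transition matrix P(s'|s).
R = np.array([1.0, 0.0, -1.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])
gamma = 0.9

# Solve (I - gamma * P) V = R exactly instead of iterating to a fixed point.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```

The matrix $I - \gamma P$ is always invertible for $\gamma < 1$ (its eigenvalues are bounded away from zero), so the closed-form solution exists; for large state spaces, the iterative backup is typically used instead because inversion costs $O(|S|^3)$.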
These are a beginner's study notes; corrections from more experienced readers are welcome.
If this post was useful to you, please give it a like~