Bellman Equation Derivation [2]


Bellman Equations

All the mathematical tools are now in place. What remains is to analyze the relationships among the four expressions. First, extend each of the four expressions above by one step, i.e., to the next time step, for analysis.
The action-value function at step $t+1$:
$$
\begin{split} V(a_{t+1}|f_a,f_r,f_t,s_{t+1})&=\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+2}}\cdots\sum_{a_{t+n}}[r_{t+1}(s_{t+1},a_{t+1})+\gamma{r_{t+2}(s_{t+2},a_{t+2})}+\cdots+\gamma^{n-1}r_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+2}|s_{t+1},a_{t+1})*P(a_{t+2}|s_{t+2})*\cdots*P(a_{t+n}|s_{t+n})] \\&=\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+2}}\cdots\sum_{a_{t+n}}U(a_{t+1}|s_{t+1},f_a,f_r,f_t)*[P(s_{t+2}|s_{t+1},a_{t+1})*P(a_{t+2}|s_{t+2})*\cdots*P(a_{t+n}|s_{t+n})] \end{split}
$$

Copying down the step-$t$ action-value function from above:
$$
\begin{split} V(a_t|f_a,f_r,f_t,s_t)&=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^nr_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})] \\&=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}U(a_t|s_t,f_a,f_r,f_t)*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n}) \end{split}
$$

Since:
$$
U(a_t|s_t,f_a,f_r,f_t)=r_t(s_t,a_t)+\gamma{U(a_{t+1}|s_{t+1},f_a,f_r,f_t)}
$$

the step-$t$ and step-$(t+1)$ action-value functions are related by Bellman equation $A.1$:
$$
\begin{split} V(a_t|f_a,f_r,f_t,s_t)&=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}r_t(s_t,a_t)[P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})]+\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1}) \\&=r_t(s_t,a_t)+\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1}) \\&=r_t(s_t,a_t)+\gamma\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}U(a_{t+1}|s_{t+1},f_a,f_r,f_t)*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n}) \\&=r_t(s_t,a_t)+\gamma\sum_{s_{t+1}}\sum_{a_{t+1}}P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+2}}\cdots\sum_{a_{t+n}}U(a_{t+1}|s_{t+1},f_a,f_r,f_t)*\cdots*P(a_{t+n}|s_{t+n}) \\&=r_t(s_t,a_t)+\gamma\sum_{s_{t+1}}\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1}) \end{split}
$$

The optimal action-value function at step $t+1$:
$$
\begin{split} V(a_{t+1}|f_a^*,f_r,f_t,s_{t+1})&=\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+2}}\cdots\sum_{a_{t+n}}[r_{t+1}(s_{t+1},a_{t+1})+\gamma{r_{t+2}(s_{t+2},a_{t+2})}+\cdots+\gamma^{n-1}r_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+2}|s_{t+1},a_{t+1})*P^*(a_{t+2}|s_{t+2})*\cdots*P^*(a_{t+n}|s_{t+n})] \\&=\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+2}}\cdots\sum_{a_{t+n}}{U(a_{t+1}|s_{t+1},f_a,f_r,f_t)}*[P(s_{t+2}|s_{t+1},a_{t+1})*P^*(a_{t+2}|s_{t+2})*\cdots*P^*(a_{t+n}|s_{t+n})] \end{split}
$$

Copying down the step-$t$ optimal action-value function:
$$
V(a_t|f_a^*,f_r,f_t,s_t)=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^nr_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

The same Bellman equation $A.1$ holds here as well:
$$
V(a_t|f_a^*,f_r,f_t,s_t)=r_t(s_t,a_t)+\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma{V(a_{t+1}|f_a^*,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})
$$

Since the optimal policy is the one that maximizes the action-value function:
$$
\begin{split} \underset{f_a}{\max}V(a_t|f_a,f_r,f_t,s_t)&=\underset{f_a}{\max}[r_t(s_t,a_t)+\gamma\sum_{s_{t+1}}\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})] \\&=r_t(s_t,a_t)+\gamma\underset{f_a}{\max}[\sum_{s_{t+1}}\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})] \\&=r_t(s_t,a_t)+\gamma\sum_{s_{t+1}}P(s_{t+1}|s_t,a_t)*\underset{f_a}{\max}[\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(a_{t+1}|s_{t+1})] \end{split}
$$

Note: what we adjust is the policy $P(a_{t+k}|s_{t+k})$. Considering the steps in order, when we first optimize $P(a_{t+1}|s_{t+1})$, doing so does not affect $V(a_{t+1}|f_a,f_r,f_t,s_{t+1})$, i.e., it does not affect the downstream action-value functions. The optimized result is therefore:
$$
P^*(a_{t+1}|s_{t+1})=[0,\cdots,1,0,\cdots,0]
$$

where the action at the position of the $1$ satisfies:
$$
a_{t+1}=\arg\underset{a_{t+1}}{\max}V(a_{t+1}|f_a,f_r,f_t,s_{t+1})
$$

This gives Bellman equation $A.4$.
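The one-hot claim can be checked numerically: for a fixed vector of action values, no probability distribution over actions can beat the greedy one-hot distribution. The sketch below (all values illustrative) samples random stochastic policies and compares:

```python
import numpy as np

np.random.seed(1)
Q = np.random.rand(5)            # hypothetical action values for 5 actions
best = Q.max()                   # value achieved by the greedy one-hot policy

# A convex combination sum_a pi(a) * Q(a) can never exceed max_a Q(a).
for _ in range(1000):
    pi = np.random.dirichlet(np.ones(5))   # an arbitrary stochastic policy
    assert pi @ Q <= best + 1e-12

one_hot = np.zeros(5)
one_hot[Q.argmax()] = 1.0        # probability 1 on the argmax action
print(np.isclose(one_hot @ Q, best))  # True: the one-hot policy attains the max
```

This is just the fact that a linear function over the probability simplex is maximized at a vertex, which is why a deterministic optimal policy always exists here.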

We can also understand this from another direction, where the interchange below follows from analyzing the causal structure:
$$
\begin{split} \underset{f_a}{\max}[\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(a_{t+1}|s_{t+1})]&=\underset{P(a_{t+1}|s_{t+1})}{\max}\underset{P(a_{t+2}|s_{t+2})}{\max}\cdots\underset{P(a_{t+n}|s_{t+n})}{\max}[\sum_{a_{t+1}}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}*P(a_{t+1}|s_{t+1})] \\&=\underset{P(a_{t+1}|s_{t+1})}{\max}[\sum_{a_{t+1}}[\underset{P(a_{t+2}|s_{t+2})}{\max}\cdots\underset{P(a_{t+n}|s_{t+n})}{\max}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}]*P(a_{t+1}|s_{t+1})] \end{split}
$$

The analysis shows that adjusting $P(a_{t+1}|s_{t+1})$ does not affect $[\underset{P(a_{t+2}|s_{t+2})}{\max}\cdots\underset{P(a_{t+n}|s_{t+n})}{\max}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}]$, so this term can be treated as a constant independent of $P(a_{t+1}|s_{t+1})$. The optimal policy function is therefore:
$$
P^*(a_{t+1}|s_{t+1})=[0,\cdots,1,0,\cdots,0]
$$

where the action $a_{t+1}^*$ at the position of the $1$ is:
$$
a_{t+1}^*=\arg\underset{a_{t+1}}{\max}[\underset{P(a_{t+2}|s_{t+2})}{\max}\cdots\underset{P(a_{t+n}|s_{t+n})}{\max}{V(a_{t+1}|f_a,f_r,f_t,s_{t+1})}]
$$

In plain terms: once the downstream action-value function is optimal in the policy sense, the current policy selects, with probability $1$, the action that maximizes that optimal action-value function. This is the optimal policy.

The state-value function at step $t+1$:
$$
V(s_{t+1}|f_a,f_r,f_t)=\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}[r_{t+1}(s_{t+1},a_{t+1})+\gamma{r_{t+2}(s_{t+2},a_{t+2})}+\cdots+\gamma^{n-1}{r_{t+n}(s_{t+n},a_{t+n})}]*[P(a_{t+1}|s_{t+1})*P(s_{t+2}|s_{t+1},a_{t+1})*P(a_{t+2}|s_{t+2})*\cdots*P(a_{t+n}|s_{t+n})]
$$

Copying down the step-$t$ state-value function:
$$
V(s_t|f_a,f_r,f_t)=\sum_{a_t}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^n{r_{t+n}(s_{t+n},a_{t+n})}]*[P(a_t|s_t)*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})]
$$

Observe that by releasing the fixed variables one at a time, the remaining variables can be summed out. First fix all of them, then release the last one and sum it out, continuing back toward $s_{t+1}$; at that point, fixing $a_t$, we find that $P(s_{t+1}|a_t,s_t)$ can be summed out as well, which gives Bellman equation $A.3$:
$$
\begin{split} V(s_t|f_a,f_r,f_t)&=\sum_{a_t}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}r_t(s_t,a_t)*[P(a_t|s_t)*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})]+\sum_{a_t}\sum_{s_{t+1}}\gamma{V(s_{t+1}|f_a,f_r,f_t)}*P(a_t|s_t)*P(s_{t+1}|a_t,s_t) \\&=\sum_{a_t}r_t(s_t,a_t)*P(a_t|s_t)+\sum_{a_t}\sum_{s_{t+1}}\gamma{V(s_{t+1}|f_a,f_r,f_t)}*P(a_t|s_t)*P(s_{t+1}|a_t,s_t) \end{split}
$$
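Bellman equation $A.3$ is exactly what iterative policy evaluation implements. A minimal sketch (toy MDP; sizes and seed are assumptions for illustration) iterates $A.3$ to its fixed point and checks it against the exact linear solve:

```python
import numpy as np

np.random.seed(2)
nS, nA, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] = P(s'|s, a)
r = np.random.rand(nS, nA)                           # r(s, a)
pi = np.random.dirichlet(np.ones(nA), size=nS)       # pi[s, a] = P(a|s)

# Policy-averaged reward and transition matrix under pi.
r_pi = (pi * r).sum(axis=1)                # sum_a r(s, a) * P(a|s)
P_pi = np.einsum('sa,sat->st', pi, P)      # sum_a P(a|s) * P(s'|s, a)

# Iterate Bellman equation A.3: V <- r_pi + gamma * P_pi V.
V = np.zeros(nS)
for _ in range(500):
    V = r_pi + gamma * P_pi @ V

# The fixed point solves the linear system (I - gamma * P_pi) V = r_pi.
V_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
print(np.allclose(V, V_exact))  # True
```

The iteration converges because the update is a $\gamma$-contraction, so the error shrinks by a factor of $\gamma$ per sweep.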

The optimal state-value function at step $t+1$:
$$
V(s_{t+1}|f_a^*,f_r,f_t)=\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}[r_{t+1}(s_{t+1},a_{t+1})+\gamma{r_{t+2}(s_{t+2},a_{t+2})}+\cdots+\gamma^{n-1}{r_{t+n}(s_{t+n},a_{t+n})}]*[P^*(a_{t+1}|s_{t+1})*P(s_{t+2}|s_{t+1},a_{t+1})*P^*(a_{t+2}|s_{t+2})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

Copying down the step-$t$ optimal state-value function:
$$
V(s_t|f_a^*,f_r,f_t)=\sum_{a_t}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^n{r_{t+n}(s_{t+n},a_{t+n})}]*[P^*(a_t|s_t)*P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

A similar formula, Bellman equation $A.3$, holds here as well:
$$
V(s_t|f_a^*,f_r,f_t)=\sum_{a_t}r_t(s_t,a_t)*P^*(a_t|s_t)+\sum_{a_t}\sum_{s_{t+1}}\gamma{V(s_{t+1}|f_a^*,f_r,f_t)}*P^*(a_t|s_t)*P(s_{t+1}|a_t,s_t)
$$

Next, analyze the cross terms, i.e., the relationship between the action-value function and the state-value function.
The step-$t$ action-value function:
$$
V(a_t|f_a,f_r,f_t,s_t)=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^nr_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})]
$$

The step-$t$ state-value function:
$$
V(s_t|f_a,f_r,f_t)=\sum_{a_t}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^n{r_{t+n}(s_{t+n},a_{t+n})}]*[P(a_t|s_t)*P(s_{t+1}|s_t,a_t)*P(a_{t+1}|s_{t+1})*\cdots*P(a_{t+n}|s_{t+n})]
$$

Fixing $a_t$, the lower expression relates to the upper one as:
$$
V(s_t|f_a,f_r,f_t)=\sum_{a_t}V(a_t|f_a,f_r,f_t,s_t)*P(a_t|s_t)
$$
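This cross relation can be verified on the same kind of toy MDP: evaluate $V$ exactly, build $Q$ from it via the one-step relation, and check that averaging $Q$ under the policy recovers $V$ (a sketch only; sizes and seed are arbitrary assumptions):

```python
import numpy as np

np.random.seed(4)
nS, nA, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
r = np.random.rand(nS, nA)                           # r(s, a)
pi = np.random.dirichlet(np.ones(nA), size=nS)       # pi[s, a] = P(a|s)

# Evaluate V under pi exactly, then build Q from it via the one-step relation.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
Q = r + gamma * np.einsum('sat,t->sa', P, V)         # Q(s,a) = r + gamma * E[V(s')]

# The cross relation: V(s) = sum_a Q(s, a) * P(a|s).
print(np.allclose(V, (pi * Q).sum(axis=1)))  # True
```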

The relationship between the optimal state-value function and the action-value function is as follows.
The optimal action-value function:
$$
V(a_t|f_a^*,f_r,f_t,s_t)=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^nr_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

The optimal state-value function:
$$
V(s_t|f_a^*,f_r,f_t)=\sum_{a_t}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^n{r_{t+n}(s_{t+n},a_{t+n})}]*[P^*(a_t|s_t)*P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

The relation is:
$$
V(s_t|f_a^*,f_r,f_t)=\sum_{a_t}V(a_t|f_a^*,f_r,f_t,s_t)*P^*(a_t|s_t)
$$

The relationship between the optimal action-value function and the optimal state-value function is as follows.
The optimal action-value function:
$$
V(a_t|f_a^*,f_r,f_t,s_t)=\sum_{s_{t+1}}\cdots\sum_{s_{t+n}}\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}[r_t(s_t,a_t)+\gamma{r_{t+1}(s_{t+1},a_{t+1})}+\cdots+\gamma^nr_{t+n}(s_{t+n},a_{t+n})]*[P(s_{t+1}|s_t,a_t)*P^*(a_{t+1}|s_{t+1})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

The optimal state-value function at step $t+1$:
$$
V(s_{t+1}|f_a^*,f_r,f_t)=\sum_{a_{t+1}}\cdots\sum_{a_{t+n}}\sum_{s_{t+2}}\cdots\sum_{s_{t+n}}[r_{t+1}(s_{t+1},a_{t+1})+\gamma{r_{t+2}(s_{t+2},a_{t+2})}+\cdots+\gamma^{n-1}{r_{t+n}(s_{t+n},a_{t+n})}]*[P^*(a_{t+1}|s_{t+1})*P(s_{t+2}|s_{t+1},a_{t+1})*P^*(a_{t+2}|s_{t+2})*\cdots*P^*(a_{t+n}|s_{t+n})]
$$

The relation between them is Bellman equation $A.2$:
$$
V(a_t|f_a^*,f_r,f_t,s_t)=r_t(s_t,a_t)+\sum_{s_{t+1}}\gamma{V(s_{t+1}|f_a^*,f_r,f_t)}*P(s_{t+1}|s_t,a_t)
$$
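Combining $A.2$ with the greedy relation (the optimal state value is the maximum of the optimal action values) yields value iteration. A minimal sketch on an assumed toy MDP, checking that the converged fixed point satisfies $A.2$ exactly:

```python
import numpy as np

np.random.seed(3)
nS, nA, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
r = np.random.rand(nS, nA)                           # r(s, a)

# Value iteration: Q <- r + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a').
Q = np.zeros((nS, nA))
for _ in range(500):
    V = Q.max(axis=1)                             # V*(s) = max_a Q(s, a)
    Q = r + gamma * np.einsum('sat,t->sa', P, V)

# At the fixed point, Bellman equation A.2 holds: Q*(s,a) = r + gamma * E[V*(s')].
V = Q.max(axis=1)
print(np.allclose(Q, r + gamma * np.einsum('sat,t->sa', P, V)))  # True
```

The greedy one-hot policy read off from `Q.argmax(axis=1)` is then the optimal policy of equation $A.4$.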

Summary

With this notation system in hand, I find theoretical analysis considerably easier.
