Bellman Equation
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} v_\pi(s')\, p(s' \mid s) \\
&= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s) \\
&= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r}_{\text{mean of immediate rewards}} + \underbrace{\gamma \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')}_{\text{mean of future rewards}} \\
&= \sum_a \pi(a \mid s) \underbrace{\Big[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big]}_{q_\pi(s, a)} \\
&= \sum_a \pi(a \mid s) \underbrace{\mathbb{E}[G_t \mid S_t = s, A_t = a]}_{q_\pi(s, a)} \\
&= \sum_a \pi(a \mid s)\, q_\pi(s, a)
\end{aligned}
$$
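The derivation can be checked numerically. The following is a minimal sketch, assuming a made-up MDP with two states and two actions: the tables `pi`, `r`, and `p`, the discount `GAMMA`, and all function names are illustrative assumptions, not part of the original text. It treats the Bellman equation as a fixed-point update, repeats it until the values stop changing, and computes each state value as the policy-weighted sum of the action values $q_\pi(s, a)$.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# pi[s][a]: probability that the policy picks action a in state s.
pi = [[0.5, 0.5], [0.2, 0.8]]
# r[s][a]: expected immediate reward, i.e. sum_r p(r | s, a) * r.
r = [[1.0, 0.0], [2.0, -1.0]]
# p[s][a][s2]: transition probability p(s2 | s, a).
p = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]


def q_from_v(v, s, a):
    """Action value: immediate reward plus discounted expected next-state value."""
    return r[s][a] + GAMMA * sum(p[s][a][s2] * v[s2] for s2 in STATES)


def bellman_backup(v):
    """Apply the right-hand side of the Bellman equation once to every state."""
    return [sum(pi[s][a] * q_from_v(v, s, a) for a in ACTIONS) for s in STATES]


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Iterate the Bellman backup from v = 0 until the largest change is below tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = bellman_backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()
print(v_pi)
```

Because the backup is a contraction for any discount below 1, the iteration converges to the unique solution of the equation; for a model this small, the same values could also be obtained by solving the linear system directly.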