State-value function $v_{\pi}(s)$
Defined as the expected return when starting from state $s$ and following policy $\pi$:
$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$
where the return $G_t$ is the discounted sum of future rewards, with discount factor $\gamma$:
$$
\begin{alignedat}{2}
G_t &= R_{t+1} + \gamma R_{t+2} + \dots \\
    &= \sum_{\tau = 0}^{\infty} \gamma^\tau R_{t+1+\tau}
\end{alignedat}
$$
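The two definitions above can be illustrated with a minimal Python sketch. It assumes an episodic setting, so the infinite sum is cut off at the end of a finite reward list and later rewards are treated as zero. The function names `discounted_return` and `mc_value_estimate` are hypothetical, chosen only for this illustration. The second function approximates the expectation in $v_\pi(s)$ by averaging the returns of sampled episodes that all start in the same state $s$ and follow the same policy $\pi$.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return G_t = sum over tau of gamma^tau * R_{t+1+tau} for a finite list.

    rewards[0] plays the role of R_{t+1}. Rewards beyond the end of the
    list are assumed to be zero (episodic task), which is an assumption
    of this sketch, not part of the definition.
    """
    g = 0.0
    # Accumulate backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(episodes: Sequence[Sequence[float]], gamma: float) -> float:
    """Approximate v_pi(s) by the sample mean of returns.

    Each element of `episodes` is the reward sequence of one episode that
    started in state s and followed policy pi.
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(mc_value_estimate([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5))
```

Accumulating from the last reward backwards avoids computing powers of $\gamma$ explicitly and mirrors the recursive form $G_t = R_{t+1} + \gamma G_{t+1}$, which follows directly from the sum above.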
Expected reward for a state-action pair
r ( s , a ) = E π [ R t + 1