Being a bit lazy here, I'll borrow someone else's figure.
The policy $\pi$ is the probability distribution over actions $a$ given state $s$, and it can be written in the following form:
$$\pi_{\theta}(a|s)=\pi(a|s,\theta)=P_r\{A_t=a \mid S_t=s,\theta_t=\theta\} \tag{1}$$
where $P_r$ is the probability that the policy outputs action $a$ at time $t$, given environment state $s$ and parameters $\theta$.
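A minimal NumPy sketch of Eq. (1), assuming a discrete state/action space and a tabular softmax parameterization of $\pi_\theta$ (that parameterization, and the sizes used below, are assumptions for illustration; the text does not fix a particular form of the policy):

```python
import numpy as np

def pi_theta(theta: np.ndarray, s: int) -> np.ndarray:
    """Probability distribution pi_theta(.|s) over actions, as in Eq. (1).

    theta is a (num_states, num_actions) table of logits (assumed form).
    """
    logits = theta[s] - theta[s].max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()              # softmax over the row for state s

rng = np.random.default_rng(0)
num_states, num_actions = 4, 3              # hypothetical sizes
theta = rng.normal(size=(num_states, num_actions))  # hypothetical parameters

print(pi_theta(theta, s=2))                 # pi_theta(a|s=2) for each a; sums to 1
```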
The action trajectory of one episode of interaction between the agent and the environment is:
$$\tau=\{s_1,a_1,s_2,a_2,\dots,s_t,a_t,s_{t+1}\} \tag{2}$$
From Eq. (2), the probability that trajectory $\tau$ occurs is:
$$\begin{aligned} p(\tau)&=p\{s_1,a_1,s_2,a_2,\dots,s_t,a_t,s_{t+1}\} \\ &=p(s_1)*p(a_1|s_1)*p(s_2|s_1,a_1)*p(a_2|s_2)*p(s_3|s_2,a_2)*\dots*p(a_t|s_t)*p(s_{t+1}|s_t,a_t) \\ &=p(s_1)*\prod_{t=1}^T p(a_t|s_t)*p(s_{t+1}|s_t,a_t) \end{aligned} \tag{3}$$
Using Eq. (1), Eq. (3) can be rewritten as:
$$p_\theta(\tau)=\pi_{\theta}(\tau)=p(s_1)*\prod_{t=1}^T\pi_\theta(a_t|s_t)*p(s_{t+1}|s_t,a_t) \tag{4}$$
Here, $p_\theta(\tau)$ denotes the probability of the action trajectory under the policy $\pi_\theta$ with parameters $\theta$. In the expansion, $p(s_1)$ and $p(s_{t+1}|s_t,a_t),\ t \in \{1,2,3,\dots,T\}$ are produced by the environment, and therefore do not depend on $\theta$.
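A minimal sketch of Eqs. (3)/(4): the trajectory probability is the initial-state probability times, at every step, the policy probability and the environment transition probability. The initial-state distribution `p0`, transition table `P`, and the example trajectory below are made-up environment quantities for illustration (they do not involve $\theta$); the tabular softmax policy is the same assumed parameterization as in the earlier sketch.

```python
import numpy as np

def pi_theta(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()                      # pi_theta(.|s), assumed softmax form

def trajectory_prob(p0, P, theta, traj):
    """traj = [s1, a1, s2, a2, ..., sT, aT, s_{T+1}], as in Eq. (2)."""
    states, actions = traj[0::2], traj[1::2]
    prob = p0[states[0]]                    # p(s1): environment
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi_theta(theta, s)[a]       # pi_theta(a_t|s_t): depends on theta
        prob *= P[s, a, s_next]             # p(s_{t+1}|s_t,a_t): environment
    return prob

rng = np.random.default_rng(0)
nS, nA = 3, 2                               # hypothetical sizes
theta = rng.normal(size=(nS, nA))           # hypothetical policy parameters
p0 = np.full(nS, 1.0 / nS)                  # uniform initial-state distribution
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] sums to 1 over s'

tau = [0, 1, 2, 0, 1]                       # s1=0, a1=1, s2=2, a2=0, s3=1
print(trajectory_prob(p0, P, theta, tau))   # p_theta(tau) per Eq. (4)
```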
The total return of one episode of interaction between the agent and the environment is:
$$R(\tau)=r_1+r_2+\dots=\sum_{t=1}^T r_t=\sum_{t=1}^T r(s_t,a_t) \tag{5}$$
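A minimal sketch of Eq. (5): summing the per-step rewards $r(s_t,a_t)$ along a trajectory. The reward table `r` and the trajectory are hypothetical values used only to show the computation.

```python
import numpy as np

def episode_return(r, traj):
    """traj = [s1, a1, ..., sT, aT, s_{T+1}]; returns R(tau) = sum_t r(s_t, a_t)."""
    states, actions = traj[0::2], traj[1::2]
    return sum(r[s, a] for s, a in zip(states, actions))  # zip stops at t = T

r = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])                  # hypothetical reward table r(s, a)
tau = [0, 1, 2, 0, 1]                       # same example trajectory as above
print(episode_return(r, tau))               # r(0, 1) + r(2, 0) = 0.0 + 0.5
```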