$$Q_{\Omega}(s, \omega)=\sum_{a} \pi_{\omega, \theta}(a \mid s)\, Q_{U}(s, \omega, a) \tag{1}$$
$$Q_{U}(s, \omega, a)=r(s, a)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(\omega, s^{\prime}\right) \tag{2}$$
$$U\left(\omega, s^{\prime}\right)=\left(1-\beta_{\omega, \vartheta}\left(s^{\prime}\right)\right) Q_{\Omega}\left(s^{\prime}, \omega\right)+\beta_{\omega, \vartheta}\left(s^{\prime}\right) V_{\Omega}\left(s^{\prime}\right) \tag{3}$$
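Equations (1)–(3) form a set of Bellman-style equations that can be solved jointly by fixed-point iteration in the tabular case. The sketch below assumes a toy MDP with randomly generated transitions `P`, rewards `r`, per-option policies `pi`, terminations `beta`, and a uniform policy over options `mu` (all names and the uniform-`mu` choice are assumptions for illustration, not part of the original derivation):

```python
import numpy as np

# Toy tabular setup (an assumption for the sketch): nS states, nA actions,
# nO options, discount gamma.
rng = np.random.default_rng(0)
nS, nA, nO, gamma = 4, 2, 2, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.standard_normal((nS, nA))               # r(s, a)
pi = rng.dirichlet(np.ones(nA), size=(nO, nS))  # pi[o, s, a] = pi_{o,theta}(a|s)
beta = rng.uniform(0.1, 0.9, size=(nO, nS))     # beta[o, s] = beta_{o,vartheta}(s)
mu = np.full((nS, nO), 1.0 / nO)                # uniform policy over options (assumed)

Q_omega = np.zeros((nS, nO))
for _ in range(500):
    V = (mu * Q_omega).sum(axis=1)                           # V_Omega(s) under mu
    # Eq. (3): U(o, s') mixes continuing (Q_Omega) and switching (V_Omega)
    U = (1 - beta) * Q_omega.T + beta * V                    # U[o, s']
    # Eq. (2): Q_U(s, o, a) = r(s, a) + gamma * sum_{s'} P(s'|s,a) U(o, s')
    Q_U = r[:, None, :] + gamma * np.einsum("sap,op->soa", P, U)
    # Eq. (1): Q_Omega(s, o) = sum_a pi(a|s) Q_U(s, o, a)
    Q_omega = np.einsum("osa,soa->so", pi, Q_U)
```

Since the combined update is a `gamma`-contraction, `Q_omega` converges to the unique fixed point of (1)–(3) for the given policies and terminations.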
From (1), (2), and (3), we have:
$$\frac{\partial Q_{\Omega}(s, \omega)}{\partial \vartheta}=\sum_{a} \pi_{\omega, \theta}(a \mid s) \sum_{s^{\prime}} \gamma P\left(s^{\prime} \mid s, a\right) \frac{\partial U\left(\omega, s^{\prime}\right)}{\partial \vartheta}$$
The key quantity is therefore the gradient of $U$. This is a natural consequence of call-and-return execution, in which the "goodness" of the termination function can only be evaluated upon entering the next state. The relevant gradient can be expanded further as:
$$\frac{\partial U\left(\omega, s^{\prime}\right)}{\partial \vartheta}=-\frac{\partial \beta_{\omega, \vartheta}\left(s^{\prime}\right)}{\partial \vartheta} A_{\Omega}\left(s^{\prime}, \omega\right)+\gamma \sum_{\omega^{\prime}} \sum_{s^{\prime \prime}} P\left(s^{\prime \prime}, \omega^{\prime} \mid s^{\prime}, \omega\right) \frac{\partial U\left(\omega^{\prime}, s^{\prime \prime}\right)}{\partial \vartheta}$$
where $A_\Omega$ is the advantage function over options: $A_{\Omega}\left(s^{\prime}, \omega\right)=Q_{\Omega}\left(s^{\prime}, \omega\right)-V_{\Omega}\left(s^{\prime}\right)$. Expanding $\frac{\partial U\left(\omega^{\prime}, s^{\prime \prime}\right)}{\partial \vartheta}$ recursively yields a form similar to the intra-option policy gradient, except that here the state-option pairs transition over time steps according to the Markov chain:
$$\mu_{\Omega}\left(s_{t+1}, \omega_{t} \mid s_{t}, \omega_{t-1}\right)$$
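This augmented chain factorizes into an option-switch step followed by a state transition: with probability $1-\beta_{\omega_{t-1}}(s_t)$ the option is kept, otherwise a new one is drawn from the policy over options; the next state then follows the kept or new option's policy. A minimal sketch of building this kernel, reusing the same assumed toy quantities (`P`, `pi`, `beta`, and an assumed uniform policy over options `pi_Omega`):

```python
import numpy as np

# Toy quantities (assumptions for the sketch, as before).
rng = np.random.default_rng(0)
nS, nA, nO = 4, 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
pi = rng.dirichlet(np.ones(nA), size=(nO, nS))  # pi[o, s, a]
beta = rng.uniform(0.1, 0.9, size=(nO, nS))     # beta[o, s]
pi_Omega = np.full((nS, nO), 1.0 / nO)          # assumed uniform policy over options

# Option-switch kernel: P(w_t = w | s_t = s, w_{t-1} = v)
#   = (1 - beta_v(s)) * [w == v] + beta_v(s) * pi_Omega(w | s)
switch = (1 - beta.T)[:, :, None] * np.eye(nO)[None, :, :] \
       + beta.T[:, :, None] * pi_Omega[:, None, :]          # switch[s, v, w]

# Next-state kernel under the (possibly new) option: P(s' | s, w)
P_next = np.einsum("wsa,sap->swp", pi, P)                   # P_next[s, w, s']

# Full chain kernel: P(s_{t+1} = s', w_t = w | s_t = s, w_{t-1} = v)
chain = np.einsum("svw,swp->svpw", switch, P_next)          # chain[s, v, s', w]
```

Each row of `chain` is a proper distribution over the next state-option pair, which is what makes the discounted occupancy $\mu_\Omega$ below well defined.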
Termination Gradient Theorem. Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the expected discounted return with respect to $\vartheta$ and the initial condition $(s_1, \omega_0)$ is:
$$-\sum_{s^{\prime}, \omega} \mu_{\Omega}\left(s^{\prime}, \omega \mid s_{1}, \omega_{0}\right) \frac{\partial \beta_{\omega, \vartheta}\left(s^{\prime}\right)}{\partial \vartheta} A_{\Omega}\left(s^{\prime}, \omega\right)$$
where $\mu_{\Omega}\left(s^{\prime}, \omega \mid s_{1}, \omega_{0}\right)$ is a discounted weighting of state-option pairs starting from $(s_1, \omega_0)$:
$$\mu_{\Omega}\left(s, \omega \mid s_{1}, \omega_{0}\right)=\sum_{t=0}^{\infty} \gamma^{t} P\left(s_{t+1}=s, \omega_{t}=\omega \mid s_{1}, \omega_{0}\right)$$
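In practice the expectation over $\mu_\Omega$ is estimated along the agent's own trajectory: at each visited pair $(s', \omega)$, take a stochastic gradient-ascent step along $-\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} A_\Omega(s', \omega)$. A minimal sketch, assuming a tabular logit parameterization $\beta = \sigma(\vartheta[\omega, s'])$ (the parameterization, function name, and learning rate are illustrative assumptions):

```python
import numpy as np

def termination_update(theta, s_next, omega, A, lr=0.1):
    """One sampled step of the termination gradient theorem.

    theta[o, s] are per-(option, state) logits with beta = sigmoid(theta),
    chosen here purely for illustration; A is an estimate of the advantage
    A_Omega(s_next, omega). Ascending the discounted return means stepping
    along -dbeta/dtheta * A at the visited state-option pair.
    """
    b = 1.0 / (1.0 + np.exp(-theta[omega, s_next]))   # beta at the visited pair
    theta[omega, s_next] -= lr * b * (1.0 - b) * A    # dbeta/dtheta = b * (1 - b)
    return theta

# A positive advantage (the option is still better than average at s_next)
# lowers the logit, hence lowers beta: the option terminates there less often.
theta = np.zeros((2, 3))
theta = termination_update(theta, s_next=1, omega=0, A=2.0)
```

This is exactly the sign behavior the theorem predicts: the minus sign pushes terminations down where the current option has positive advantage, and up where it is worse than the value over options.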