We now turn to the MDP in the $T=\infty$ setting. Let us restate the model assumptions:

- The action space and the state space are both finite sets.
- State transitions satisfy the Markov property, i.e.
$$\begin{aligned} &P(s_{t+1}=s_{t+1}^\prime\mid a_{t}=a_{t}^\prime,s_t=s_t^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime)\\ =&P(s_{t+1}=s_{t+1}^\prime\mid a_{t}=a_{t}^\prime,s_t=s_t^\prime) \end{aligned}$$
The Markov property is equivalent to saying that, given the current state and action, the future state is independent of the past states and actions:
$$\begin{aligned} &P(s_{t+1}=s_{t+1}^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime\mid a_t=a_t^\prime,s_t=s_t^\prime)\\ =&P(s_{t+1}=s_{t+1}^\prime\mid a_t=a_t^\prime,s_t=s_t^\prime)\,P(a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime\mid a_t=a_t^\prime,s_t=s_t^\prime) \end{aligned}$$
- State transitions are time-homogeneous:
$$P(s_{t+1}=s^\prime\mid a_{t}=a,s_t=s)=P(s_1=s^\prime\mid a_0=a,s_0=s)$$
We call this probability the transition probability and write it as $p(s^\prime\mid a,s)$. It is determined by the environment, not by the policy, so we do not attach the superscript $\pi$.
- We could of course specify a sequence of policies $\{\pi_1,\pi_2,\cdots\}$ and solve for the whole sequence, but that is inconvenient. Here we assume the policy is stationary, i.e. the same policy $\pi$ is used at every stage; in other words $\pi=\{\pi,\pi,\cdots\}$, and we look for the optimal stationary policy:
$$p^\pi(a_t=a\mid s_t=s)=\pi_t(a\mid s)=\pi_0(a\mid s)\quad \forall s\in\mathcal{S},\,a\in\mathcal{A},\,t=0,1,2,\cdots$$
Here $\mathcal{A}=\cup_{s\in\mathcal{S}}\mathcal{A}(s)$; we may as well assume all states share one action space $\mathcal{A}$, with the policy assigning probability $0$ to any action that cannot be chosen in a given state.
- The action at any time is determined only by the state at that time and the chosen policy, not by past states and actions:
$$\begin{aligned} &P^\pi(a_{t+1}=a\mid s_{t+1}=s,a_t=a_t^\prime,s_t=s_t^\prime,\cdots,s_0=s_0^\prime)\\ =&P^\pi(a_{t+1}=a\mid s_{t+1}=s)=\pi(a\mid s) \end{aligned}$$
As before, this assumption is equivalent to saying that, given the current state, the current action is independent of the past states and actions:
$$\begin{aligned} &P^\pi(a_{t+1}=a,a_t=a_t^\prime,s_t=s_t^\prime,\cdots,s_0=s_0^\prime\mid s_{t+1}=s)\\ =&P^\pi(a_{t+1}=a\mid s_{t+1}=s)\,P^\pi(a_t=a_t^\prime,s_t=s_t^\prime,\cdots,s_0=s_0^\prime\mid s_{t+1}=s) \end{aligned}$$
Under these assumptions, the probability of any finite-length trajectory is determined:
$$\begin{aligned} &P^\pi(s_T,a_{T-1},s_{T-1},\cdots,a_0,s_0)\\ =&P^\pi(s_T\mid a_{T-1},s_{T-1},\cdots,a_0,s_0)\,P^\pi(a_{T-1},s_{T-1},\cdots,a_0,s_0)\\ =&p(s_T\mid a_{T-1},s_{T-1})\,P^\pi(a_{T-1}\mid s_{T-1},\cdots,a_0,s_0)\,P^\pi(s_{T-1},\cdots,a_0,s_0)\\ =&p(s_T\mid a_{T-1},s_{T-1})\,\pi(a_{T-1}\mid s_{T-1})\,P^\pi(s_{T-1},\cdots,a_0,s_0) \end{aligned}$$
and the factorization can be unrolled recursively. We also have the following corollaries:
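The recursive factorization is easy to compute numerically. Below is a minimal sketch, assuming a hypothetical 2-state, 2-action MDP; the arrays `p`, `pi`, and `p0` are made-up numbers for illustration, not from the text:

```python
import numpy as np

# Toy MDP (hypothetical numbers): p[s, a, s2] = p(s2 | a, s),
# pi[s, a] = pi(a | s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

def trajectory_prob(states, actions, p0):
    """P^pi(s_T, a_{T-1}, ..., a_0, s_0), unrolling the recursion above.

    states = [s_0, ..., s_T], actions = [a_0, ..., a_{T-1}],
    p0[s] = P(s_0 = s) is the initial state distribution.
    """
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        # One step of the recursion: pi(a_t | s_t) * p(s_{t+1} | a_t, s_t)
        prob *= pi[states[t], a] * p[states[t], a, states[t + 1]]
    return prob

p0 = np.array([1.0, 0.0])  # start deterministically in state 0
print(trajectory_prob([0, 1, 1], [1, 0], p0))  # 0.3 * 0.8 * 0.4 * 0.5 = 0.048
```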
- Given a policy $\pi$, the next state depends only on the current state, not on past states and actions, i.e.
$$\begin{aligned} &P^\pi(s_{t+1}=s_{t+1}^\prime\mid s_t=s_t^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime)\\ =&P^\pi(s_{t+1}=s_{t+1}^\prime\mid s_t=s_t^\prime) \end{aligned}$$
Proof:
$$\begin{aligned} &P^\pi(s_{t+1}=s_{t+1}^\prime\mid s_t=s_t^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime)\\ =&\sum_{a\in\mathcal{A}}\big[P^\pi(s_{t+1}=s_{t+1}^\prime\mid a_t=a,s_t=s_t^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime)\cdot\\ &\quad P^\pi(a_t=a\mid s_t=s_t^\prime,a_{t-1}=a_{t-1}^\prime,\cdots,s_0=s_0^\prime)\big]\\ =&\sum_{a\in\mathcal{A}}P^\pi(s_{t+1}=s_{t+1}^\prime\mid a_t=a,s_t=s_t^\prime)\,P^\pi(a_t=a\mid s_t=s_t^\prime)\\ =&P^\pi(s_{t+1}=s_{t+1}^\prime\mid s_t=s_t^\prime) \end{aligned}$$
The first factor simplifies by the Markov property; the second by the assumption that the action depends only on the current state.
The proof also gives a way to compute this probability:
$$P^\pi(s_{t+1}=s_{t+1}^\prime\mid s_t=s_t^\prime)=\sum_{a\in\mathcal{A}}p(s_{t+1}^\prime\mid a,s_t^\prime)\,\pi(a\mid s_t^\prime)$$
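This marginalization over actions yields the transition matrix of the Markov chain induced by $\pi$. A minimal sketch, reusing the hypothetical 2-state, 2-action MDP with made-up numbers:

```python
import numpy as np

# Toy MDP (hypothetical numbers): p[s, a, s2] = p(s2 | a, s),
# pi[s, a] = pi(a | s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# P_pi[s, s2] = P^pi(s_{t+1} = s2 | s_t = s) = sum_a p(s2 | a, s) * pi(a | s)
P_pi = np.einsum('sa,sat->st', pi, p)
print(P_pi)
print(P_pi.sum(axis=1))  # each row is a distribution, so rows sum to 1
```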
- For $T\geq t+2$, given a policy $\pi$,
$$\begin{aligned} &P^\pi(a_T=a_T^\prime,s_T=s_T^\prime,\cdots,a_{t+2}=a_{t+2}^\prime,s_{t+2}=s_{t+2}^\prime\mid a_{t+1}=a_{t+1}^\prime,s_{t+1}=s_{t+1}^\prime,\cdots,a_0=a_0^\prime,s_0=s_0^\prime)\\ =&P^\pi(a_T=a_T^\prime,s_T=s_T^\prime,\cdots,a_{t+2}=a_{t+2}^\prime,s_{t+2}=s_{t+2}^\prime\mid a_{t+1}=a_{t+1}^\prime,s_{t+1}=s_{t+1}^\prime)\\ =&P^\pi(a_{T-t-1}=a_T^\prime,s_{T-t-1}=s_T^\prime,\cdots,a_{1}=a_{t+2}^\prime,s_{1}=s_{t+2}^\prime\mid a_{0}=a_{t+1}^\prime,s_{0}=s_{t+1}^\prime) \end{aligned}$$
The first equality is a generalization of the Markov property, the second a generalization of time-homogeneity. The proof goes by induction and is somewhat tedious, so we omit it.
- Similarly,
$$\begin{aligned} &P^\pi(a_T=a_T^\prime,s_T=s_T^\prime,\cdots,a_{t+2}=a_{t+2}^\prime,s_{t+2}=s_{t+2}^\prime\mid s_{t+1}=s_{t+1}^\prime,\cdots,a_0=a_0^\prime,s_0=s_0^\prime)\\ =&P^\pi(a_T=a_T^\prime,s_T=s_T^\prime,\cdots,a_{t+2}=a_{t+2}^\prime,s_{t+2}=s_{t+2}^\prime\mid s_{t+1}=s_{t+1}^\prime)\\ =&P^\pi(a_{T-t-1}=a_T^\prime,s_{T-t-1}=s_T^\prime,\cdots,a_{1}=a_{t+2}^\prime,s_{1}=s_{t+2}^\prime\mid s_{0}=s_{t+1}^\prime) \end{aligned}$$
with the first equality again a generalization of the Markov property and the second a generalization of time-homogeneity.
State value function and state-action value function

Given a policy $\pi$, the state value function of a state $s$ is defined as
$$V_t^\pi(s)=E^\pi\left[\sum_{i=0}^\infty\gamma^i R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]$$
It is not hard to show that $V_t^\pi(s)$ is also time-homogeneous, i.e.
$$V_t^\pi(s)=V_0^\pi(s)\quad t=0,1,\cdots,\ \forall s\in\mathcal{S}$$
so we can drop the time subscript.
Proof:
Since $\gamma\in(0,1)$ and the reward function is bounded, the bounded convergence theorem gives
$$\begin{aligned} V_t^\pi(s)=&E^\pi\left[\sum_{i=0}^\infty\gamma^i R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]\\ =&\sum_{i=0}^\infty\gamma^iE^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right] \end{aligned}$$
$$\begin{aligned} &E^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]\\ =&\sum_{s^\prime,s^{\prime\prime}\in\mathcal{S},a\in\mathcal{A}}P(s_{t+i}=s^\prime,a_{t+i}=a,s_{t+i+1}=s^{\prime\prime}\mid s_t=s)R(s^\prime,a,s^{\prime\prime})\\ =&\sum_{s^\prime,s^{\prime\prime}\in\mathcal{S},a\in\mathcal{A}}P(s_{i}=s^\prime,a_{i}=a,s_{i+1}=s^{\prime\prime}\mid s_0=s)R(s^\prime,a,s^{\prime\prime})\\ =&E^\pi\left[R(s_{i},a_{i},s_{i+1})\,\bigg|\,s_0=s\right] \end{aligned}$$
where the second equality follows from the time-shift corollary above. Therefore
$$\begin{aligned} V_t^\pi(s)=&\sum_{i=0}^\infty\gamma^iE^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]\\ =&\sum_{i=0}^\infty\gamma^iE^\pi\left[R(s_{i},a_{i},s_{i+1})\,\bigg|\,s_0=s\right]\\ =&V_0^\pi(s) \end{aligned}$$
So the Bellman equation simplifies to
$$V^\pi(s)=E^\pi[R(s,a_0,s_1)+\gamma V^\pi(s_1)\mid s_0=s]$$
Expanding, this is
$$V^\pi(s)=E^\pi[R(s,a_0,s_1)\mid s_0=s]+\gamma\sum_{s^\prime\in\mathcal{S}}P^\pi(s_1=s^\prime\mid s_0=s)V^\pi(s^\prime)$$
This is a system of linear equations in the values $V^\pi(s)$, with one unknown per state in the state space.
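Such a linear system can be solved directly. A minimal sketch, again using the hypothetical 2-state, 2-action MDP with made-up numbers; the constant reward of 1 is chosen so the answer is easy to check by hand ($V = 1/(1-\gamma)$ for every state):

```python
import numpy as np

# Toy MDP (hypothetical numbers): p[s, a, s2] = p(s2 | a, s),
# pi[s, a] = pi(a | s), R[s, a, s2] = R(s, a, s2).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
R = np.ones((2, 2, 2))  # constant reward 1 everywhere
gamma = 0.9

# Expected one-step reward r_pi[s] = E^pi[R(s, a_0, s_1) | s_0 = s]
r_pi = np.einsum('sa,sat,sat->s', pi, p, R)
# Induced transition matrix P_pi[s, s2] = sum_a p(s2 | a, s) * pi(a | s)
P_pi = np.einsum('sa,sat->st', pi, p)

# Bellman: V = r_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V)  # constant reward 1 gives V(s) = 1 / (1 - gamma) = 10 in every state
```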
In the same way, the state-action value function is also time-homogeneous, and
$$\begin{aligned} &Q^\pi(a,s)=E^\pi[R(s,a,s_1)\mid s_0=s,a_0=a]+\gamma\sum_{s^\prime\in\mathcal{S}}p(s^\prime\mid a,s)V^\pi(s^\prime)\\ &V^\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)Q^\pi(a,s) \end{aligned}$$
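Once $V^\pi$ is known, $Q^\pi$ follows from the first identity, and the second identity serves as a consistency check. A minimal sketch with the same hypothetical toy MDP:

```python
import numpy as np

# Toy MDP (hypothetical numbers), same as before.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
R = np.ones((2, 2, 2))
gamma = 0.9

# Solve the Bellman linear system for V^pi.
r_pi = np.einsum('sa,sat,sat->s', pi, p, R)
P_pi = np.einsum('sa,sat->st', pi, p)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q^pi(a, s) = E[R(s, a, s_1) | s_0=s, a_0=a] + gamma * sum_{s'} p(s'|a,s) V(s')
# (stored as Q[s, a] to match the array layout)
Q = np.einsum('sat,sat->sa', p, R) + gamma * np.einsum('sat,t->sa', p, V)

# Consistency check: V(s) = sum_a pi(a|s) Q(a, s)
print(np.allclose(V, np.einsum('sa,sa->s', pi, Q)))  # True
```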
Now take two policies $\pi^\prime,\pi^{\prime\prime}$. The relation $\pi^\prime\geq\pi^{\prime\prime}$ is defined by $V^{\pi^\prime}(s)\ge V^{\pi^{\prime\prime}}(s)\ \forall s\in\mathcal{S}$. Our goal is to find an optimal policy $\pi^\star$ such that $\pi^\star\geq\pi$ for every policy $\pi$; its value function is denoted $V^*(s)$.
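This partial order can be checked numerically by solving the Bellman linear system for each policy and comparing value vectors componentwise. A minimal sketch with a hypothetical 2-state, 2-action MDP (made-up numbers; the state-dependent reward makes the two deterministic policies genuinely different):

```python
import numpy as np

# Toy MDP (hypothetical numbers): p[s, a, s2] = p(s2 | a, s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# Reward 1 for landing in state 1, 0 otherwise.
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9

def policy_value(pi):
    """Solve (I - gamma * P_pi) V = r_pi for a stationary policy pi[s, a]."""
    r_pi = np.einsum('sa,sat,sat->s', pi, p, R)
    P_pi = np.einsum('sa,sat->st', pi, p)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

pi1 = np.array([[0.0, 1.0], [0.0, 1.0]])  # always pick action 1
pi2 = np.array([[1.0, 0.0], [1.0, 0.0]])  # always pick action 0
V1, V2 = policy_value(pi1), policy_value(pi2)
print(np.all(V1 >= V2))  # pi1 >= pi2 iff V1 dominates V2 in every state
```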