Policy Iteration
Policy Evaluation
Matrix Algorithm
Given the dynamics $\mathcal{P}(s', r\mid s, a)$ and a policy $\pi$, we want to obtain $q_\pi(s, a)$.
Let
$$V_\pi=\left( \begin{matrix} V_\pi(s_1)\\ V_\pi(s_2)\\ \vdots\\ V_\pi(s_{|\mathcal{S}|}) \end{matrix} \right)_{|\mathcal{S}|\times 1}$$
From the definition of $V_\pi(s)$ we get
$$\begin{aligned} V_\pi(s)&=\mathbb{E}_\pi[G_t\mid S_t=s] \\ &=\mathbb{E}_\pi[R_{t+1}+\gamma V_{\pi}(S_{t+1})\mid S_t=s] \\ &=\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)[r+\gamma V_\pi(s')] \end{aligned}$$
Expanding gives
$$V_\pi(s)=\underbrace{\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)r}_{(A)}+\underbrace{\gamma\sum_{a}\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)V_\pi(s')}_{(B)}$$
Term $(A)$ can be rewritten as
$$\sum_{a}\pi(a\mid s)\sum_{r}r\mathcal{P}(r\mid s, a)=\sum_{a}\pi(a\mid s)\mathbb{E}[R_{t+1}\mid S_t=s, A_t=a]$$
Define
$$r(s, a)\triangleq\mathbb{E}[R_{t+1}\mid S_t=s, A_t=a]$$
so that
$$(A)=\sum_a\pi(a\mid s)r(s, a)=r_\pi(s)$$
Introduce the vector $r_\pi$:
$$r_\pi=\left( \begin{matrix} r_\pi(s_1)\\ r_\pi(s_2)\\ \vdots\\ r_\pi(s_{|\mathcal{S}|}) \end{matrix} \right)_{|\mathcal{S}|\times 1}$$
Term $(B)$ evaluates to
$$\begin{aligned} (B)&=\gamma\sum_a\pi(a\mid s)\sum_{s'}\mathcal{P}(s'\mid s, a)V_\pi(s') \\ &=\gamma\sum_{s'}\underbrace{\sum_a\pi(a\mid s)\mathcal{P}(s'\mid s, a)}_{P_\pi(s, s')}V_\pi(s')\\ &=\gamma\sum_{s'}P_\pi(s, s')V_\pi(s') \end{aligned}$$
Let $P_\pi\triangleq[P_\pi(s, s')]$; then
$$V_\pi=r_\pi+\gamma P_\pi V_\pi$$
Solving for $V_\pi$:
$$V_\pi=(\mathbf{I}-\gamma P_\pi)^{-1}r_\pi$$
Because of the matrix inversion, the time complexity of this algorithm is $\mathcal{O}(|\mathcal{S}|^3)$.
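As a concrete illustration, here is a minimal NumPy sketch of the closed-form evaluation. It assumes the dynamics are supplied as an array `P[s, a, s2]` of transition probabilities $\mathcal{P}(s'\mid s, a)$ and an array `R[s, a]` of expected rewards $r(s, a)$ as defined above; these names and the array layout are illustrative assumptions, not from the original post.

```python
import numpy as np

def policy_evaluation_matrix(P, R, pi, gamma):
    """Closed-form policy evaluation: V_pi = (I - gamma * P_pi)^{-1} r_pi.

    P:  (|S|, |A|, |S|) transition probabilities P(s' | s, a)  [assumed layout]
    R:  (|S|, |A|)      expected rewards r(s, a) = E[R_{t+1} | S_t=s, A_t=a]
    pi: (|S|, |A|)      policy probabilities pi(a | s)
    """
    r_pi = np.einsum('sa,sa->s', pi, R)    # r_pi(s) = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('sa,saz->sz', pi, P)  # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    n = P.shape[0]
    # Solve (I - gamma * P_pi) V = r_pi; the factorization inside
    # np.linalg.solve is the O(|S|^3) step.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

Solving the linear system is numerically preferable to forming $(\mathbf{I}-\gamma P_\pi)^{-1}$ explicitly, and both have the same $\mathcal{O}(|\mathcal{S}|^3)$ cost.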
Iterative Method
Given $\pi$, compute $V_\pi$ satisfying
$$V_\pi(s)=\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)(r+\gamma V_\pi(s'))$$
Construct a sequence $\{V_k\}_{k=1}^\infty\to V_\pi$; it converges because the Bellman expectation operator is a $\gamma$-contraction when $\gamma<1$.
The iteration equation is
$$V_{k+1}(s)\triangleq \sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)(r+\gamma V_k(s'))$$
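A sketch of this fixed-point iteration under the same assumed `P`/`R` layout as in the matrix-algorithm sketch; the stopping tolerance is an illustrative choice.

```python
import numpy as np

def policy_evaluation_iterative(P, R, pi, gamma, tol=1e-8):
    """Iterate V_{k+1} = r_pi + gamma * P_pi @ V_k until the sup-norm change < tol."""
    r_pi = np.einsum('sa,sa->s', pi, R)    # r_pi(s)
    P_pi = np.einsum('sa,saz->sz', pi, P)  # P_pi(s, s')
    V = np.zeros(P.shape[0])               # V_0 = 0 (any initialization works)
    while True:
        V_next = r_pi + gamma * P_pi @ V   # one sweep of the iteration equation
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```

After precomputing $P_\pi$, each sweep is a matrix–vector product costing $\mathcal{O}(|\mathcal{S}|^2)$, so when few sweeps suffice this is cheaper than the $\mathcal{O}(|\mathcal{S}|^3)$ closed-form solve.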
Policy Improvement
Having computed $q_\pi(s, a)$, construct a new policy $\pi'$ such that $q_{\pi'}(s, a)\geq q_{\pi}(s, a)$.
Policy improvement theorem: given $\pi$ and $\pi'$, if $\forall s\in S$, $q_\pi(s, \pi'(s))\geq V_\pi(s)$, then $\forall s\in S,\ V_{\pi'}(s)\geq V_{\pi}(s)$.
Proof: by the hypothesis and the formula for $q_\pi$,
$$\begin{aligned} V_\pi(s)&\leq q_\pi(s, \pi'(s))=\sum_{s', r}\mathcal{P}(s', r\mid s, \pi'(s))[r+\gamma V_\pi(s')] \\ &=\mathbb{E}[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s, A_t=\pi'(s)] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1}, \pi'(S_{t+1}))\mid S_t=s] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma\mathbb{E}_{\pi'}[R_{t+2}+\gamma V_\pi(S_{t+2})\mid S_{t+1}]\mid S_t=s] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2V_\pi(S_{t+2})\mid S_t=s]\\ &\;\;\vdots\\ &\leq \mathbb{E}_{\pi'}[\underbrace{R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots}_{G_t}\mid S_t=s]\\ &=V_{\pi'}(s) \end{aligned}$$
The two inequalities are exactly the hypothesis $V_\pi(\cdot)\leq q_\pi(\cdot, \pi'(\cdot))$, applied first at $s$ and then at $S_{t+1}$.
Greedy policy: $\forall s \in S,\ \pi'(s)=\argmax_a q_\pi(s, a)$.
$$V_\pi(s)\leq \max_a q_\pi(s, a)=q_\pi(s, \pi'(s))$$
so by the policy improvement theorem,
$$\forall s\in S,\ V_{\pi'}(s)\geq V_\pi(s)$$
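A sketch of greedy improvement and the full policy-iteration loop, reusing `policy_evaluation_matrix` from the matrix-algorithm sketch above (same assumed array layout):

```python
import numpy as np

def policy_improvement(P, R, V, gamma):
    """Greedy policy: pi'(s) = argmax_a q_pi(s, a), returned as a one-hot table."""
    # q_pi(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    q = R + gamma * np.einsum('saz,z->sa', P, V)
    pi_new = np.zeros_like(R)
    pi_new[np.arange(R.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi_new

def policy_iteration(P, R, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from uniform
    while True:
        V = policy_evaluation_matrix(P, R, pi, gamma)
        pi_new = policy_improvement(P, R, V, gamma)
        if np.array_equal(pi_new, pi):  # V_{pi'} = V_pi: Bellman optimality holds
            return pi, V
        pi = pi_new
```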
If $V_{\pi'}=V_\pi$, then $V_{\pi'}=V_\pi=V_*$.
Proof: $V_{\pi'}=V_{\pi}$ implies $q_{\pi'}=q_{\pi}$. Then for all $s\in S$,
$$\begin{aligned} V_{\pi'}(s)&=\sum_{a}\pi'(a\mid s)q_{\pi'}(s, a)=\sum_{a}\pi'(a\mid s)q_{\pi}(s, a)\\ &=q_\pi(s, \pi'(s))\qquad\text{(since $\pi'$ is the deterministic greedy policy)}\\ &=\max_a q_\pi(s, a)\\ &=\max_a \sum_{s', r}\mathcal{P}(s', r\mid s, a)[r+\gamma V_{\pi'}(s')] \end{aligned}\tag{1}$$
Observe that $(1)$ is the Bellman Optimality Equation, so $V_{\pi'}=V_*$.
Value Iteration
Value iteration is policy iteration in the extreme case where policy evaluation is truncated to a single sweep:
$$V_{k+1}(s)=\max_a\sum_{s', r}\mathcal{P}(s', r\mid s, a)[r+\gamma V_k(s')]$$
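A sketch of value iteration under the same assumed `P`/`R` layout; note the $\max_a$ replaces the expectation over $\pi$, so no explicit policy is maintained until the end.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """V_{k+1}(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V_k(s') ]."""
    V = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum('saz,z->sa', P, V)  # q_k(s, a)
        V_next = q.max(axis=1)                        # greedy backup
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, np.argmax(q, axis=1)       # value and greedy policy
        V = V_next
```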