Deep Reinforcement Learning (Part 2): The Bellman Equations
1. The Bellman equation (expressing $Q_\pi$ in terms of $Q_\pi$)
Theorem: Suppose $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then

$$
Q_\pi\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t+\gamma \cdot Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]. \tag{1.1}
$$
Proof: Let $\mathcal{S}_{t+1:}=\left\{S_{t+1}, S_{t+2}, \cdots\right\}$ and $\mathcal{A}_{t+1:}=\left\{A_{t+1}, A_{t+2}, \cdots\right\}$. By the definition of the return $U_t$, we have $U_t=R_t+\gamma \cdot U_{t+1}$, so
$$
\begin{aligned}
Q_\pi\left(s_t, a_t\right)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t+\gamma \cdot U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t, A_t=a_t\right]}_{(1)}+\gamma \cdot \underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]}_{(2)}
\end{aligned}
$$
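Here the reward $R_t$ at time $t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Once $S_t=s_t$ and $A_t=a_t$ are given, its only remaining randomness comes from the state $S_{t+1}$ at time $t+1$, and the distribution of $S_{t+1}$ depends only on $S_t$ and $A_t$. Therefore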
$$
\begin{aligned}
(1)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[R_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$

The last step holds because $R_t$ does not depend on $A_{t+1}$, so averaging over $A_{t+1}$ as well leaves the value unchanged.
Rewriting the expression in $(2)$ gives
$$
\begin{aligned}
(2)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[\mathbb{E}_{\mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, A_{t+1}, S_t=s_t, A_t=a_t\right] \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[\mathbb{E}_{\mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, A_{t+1}\right] \mid S_t=s_t, A_t=a_t\right] \quad \text{(Markov property)}\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
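Substituting $(1)$ and $(2)$ back into the decomposition of $Q_\pi\left(s_t, a_t\right)$ gives (1.1), which completes the proof.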
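As a rough illustration of (1.1), the following Python sketch treats the equation as an update rule and iterates it to a fixed point on a tiny tabular MDP. The MDP itself (`STATES`, `ACTIONS`, `P`, `PI`, `reward`, `GAMMA`) is hypothetical and chosen only for illustration, and the infinite-horizon discounted setting is an assumption made here so that the iteration converges.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

# P[s][a][s2] = p(s2 | s, a)
P = {
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
# PI[s][a] = pi(a | s)
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}


def reward(s, a, s2):
    """R_t as a function of (S_t, A_t, S_{t+1}): 1 when the next state is 1."""
    return 1.0 if s2 == 1 else 0.0


def backup_q(Q):
    """Apply the right-hand side of (1.1) once to every pair (s, a)."""
    new_Q = {}
    for s in STATES:
        for a in ACTIONS:
            total = 0.0
            for s2 in STATES:
                # Expectation over A_{t+1} ~ pi(. | s2).
                next_q = sum(PI[s2][a2] * Q[(s2, a2)] for a2 in ACTIONS)
                total += P[s][a][s2] * (reward(s, a, s2) + GAMMA * next_q)
            new_Q[(s, a)] = total
    return new_Q


def evaluate_q(num_iters=500):
    """Iterate the backup from Q = 0; with GAMMA < 1 it contracts to Q_pi."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(num_iters):
        Q = backup_q(Q)
    return Q


Q = evaluate_q()
backed_up = backup_q(Q)
# Largest violation of (1.1) over all (s, a); it should be close to zero.
residual = max(abs(Q[k] - backed_up[k]) for k in Q)
print("Q_pi:", Q)
print("Bellman residual:", residual)
```

The residual measures how far the table is from satisfying (1.1); a value near machine precision indicates that the iteration has reached the fixed point.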
2. The Bellman equation (expressing $Q_\pi$ in terms of $V_\pi$)
Theorem: Suppose $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then

$$
Q_\pi\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t=a_t\right] \tag{1.2}
$$
Proof: Since

$$
V_\pi\left(S_{t+1}\right)=\mathbb{E}_{A_{t+1}\sim \pi\left(\cdot \mid S_{t+1}\right)}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right)\right]=\mathbb{E}_{A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right],
$$

the term $(2)$ from the proof of (1.1) can be rewritten as
$$
\begin{aligned}
(2)&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right] \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
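Combining this with $(1)=\mathbb{E}_{S_{t+1}}\left[R_t \mid S_t=s_t, A_t=a_t\right]$ from the proof of (1.1) yields (1.2). This completes the proof.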
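One way to sanity-check (1.2) is to compare both sides numerically on a small finite-horizon problem. The sketch below computes $Q_\pi$ and $V_\pi$ by enumerating every future trajectory, so it does not rely on the Bellman equation, and then evaluates the right-hand side of (1.2). The MDP, the policy, and all helper names are hypothetical. With a finite horizon the value functions also depend on the number of rewards still to be collected, which appears here as an explicit `steps` argument.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9
P = {  # P[s][a][s2] = p(s2 | s, a)
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}  # PI[s][a] = pi(a | s)


def reward(s, a, s2):
    """R_t as a function of (S_t, A_t, S_{t+1}), with a small action cost."""
    return (1.0 if s2 == 1 else 0.0) - 0.1 * a


def trajectories(s, a, steps):
    """Yield (probability, return) for every trajectory that starts from
    (s, a) and collects `steps` rewards."""
    if steps == 0:
        yield 1.0, 0.0
        return
    for s2 in STATES:
        for a2 in ACTIONS:
            p_step = P[s][a][s2] * PI[s2][a2]
            for p_rest, u_rest in trajectories(s2, a2, steps - 1):
                # Per-trajectory identity U_t = R_t + gamma * U_{t+1}.
                yield p_step * p_rest, reward(s, a, s2) + GAMMA * u_rest


def q_brute(s, a, steps):
    """Q_pi(s, a) with `steps` rewards left, as an expectation of U_t."""
    return sum(p * u for p, u in trajectories(s, a, steps))


def v_brute(s, steps):
    """V_pi(s) = E_{A ~ pi(. | s)}[Q_pi(s, A)]."""
    return sum(PI[s][a] * q_brute(s, a, steps) for a in ACTIONS)


def rhs_1_2(s, a, steps):
    """Right-hand side of (1.2): E_{S_{t+1}}[R_t + gamma * V_pi(S_{t+1})]."""
    return sum(
        P[s][a][s2] * (reward(s, a, s2) + GAMMA * v_brute(s2, steps - 1))
        for s2 in STATES
    )


HORIZON = 4
max_gap = max(
    abs(q_brute(s, a, HORIZON) - rhs_1_2(s, a, HORIZON))
    for s in STATES
    for a in ACTIONS
)
print("largest gap between the two sides of (1.2):", max_gap)
```

Exhaustive enumeration grows exponentially with the horizon, so this is only practical for toy problems; its purpose is to provide a value of $Q_\pi$ that is independent of the identity being checked.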
3. The Bellman equation (expressing $V_\pi$ in terms of $V_\pi$)
Theorem: Suppose $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then

$$
V_\pi\left(s_t\right)=\mathbb{E}_{A_t, S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right] \tag{1.3}
$$
Proof:

$$
\begin{aligned}
V_\pi\left(s_t\right)&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_t \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t+\gamma U_{t+1} \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{A_t, \mathcal{A}_{t+1:}, \mathcal{S}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, S_t=s_t\right] \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{\mathcal{A}_{t+1:}, \mathcal{S}_{t+2:}}\left[U_{t+1} \mid S_{t+1}\right] \mid S_t=s_t\right] \quad \text{(Markov property)}\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{A_t, S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]
\end{aligned}
$$

The last step only rewrites the marginal expectation over $S_{t+1}$ as a joint expectation over $(A_t, S_{t+1})$. By linearity of expectation, the final line equals the right-hand side of (1.3). This completes the proof.
Alternatively, apply (1.2) directly and take the expectation of both sides over $A_t\sim \pi(\cdot \mid s_t)$:
$$
\begin{aligned}
\mathbb{E}_{A_t\sim \pi(\cdot \mid s_t)}\left[Q_\pi\left(s_t, A_t\right)\right]&=\mathbb{E}_{A_t\sim \pi(\cdot \mid s_t)}\left[\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t\right]\right]\\
&\Updownarrow\\
\mathbb{E}_{A_t}\left[Q_\pi\left(S_t, A_t\right) \mid S_t=s_t\right]&=\mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t\right] \mid S_t=s_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_t}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]
\end{aligned}
$$

The left-hand side is $V_\pi\left(s_t\right)$ by definition, which gives (1.3).
Using (1.3), we can further write out the explicit expression:
$$
\begin{aligned}
V_\pi\left(s_t\right)&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma \mathbb{E}_{A_t, S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[R_t \mid A_t, S_t=s_t\right] \mid S_t=s_t\right]+\gamma \mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid A_t, S_t=s_t\right] \mid S_t=s_t\right]\\
&=\sum_{a_t}\pi\left(a_t \mid s_t\right)\mathbb{E}_{S_{t+1}}\left[R_t \mid A_t=a_t, S_t=s_t\right]+\gamma \sum_{a_t}\pi\left(a_t \mid s_t\right)\mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid A_t=a_t, S_t=s_t\right]\\
&=\sum_{a_t}\pi\left(a_t \mid s_t\right)\sum_{s_{t+1}} r \cdot p\left(s_{t+1} \mid s_t, a_t\right)+\gamma \sum_{a_t}\pi\left(a_t \mid s_t\right)\sum_{s_{t+1}} V_\pi\left(s_{t+1}\right) \cdot p\left(s_{t+1} \mid s_t, a_t\right)
\end{aligned}
$$
where $r=r(s_t, s_{t+1}, a_t)$, and the sums assume discrete state and action spaces.
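The explicit form above is linear in $V_\pi$, so for a small discrete MDP it can be solved either by iterating the update or by solving the linear system $(I-\gamma P_\pi)V_\pi=r_\pi$. The following sketch does both on a hypothetical two-state MDP and compares the results. The symbols $P_\pi$ and $r_\pi$ and all identifiers are notation introduced here for illustration, and the infinite-horizon discounted setting is an assumption.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9
P = {  # P[s][a][s2] = p(s2 | s, a)
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}  # PI[s][a] = pi(a | s)


def reward(s, a, s2):
    """r as a function of (s_t, a_t, s_{t+1}), with a small action cost."""
    return (1.0 if s2 == 1 else 0.0) - 0.1 * a


def backup_v(V):
    """Explicit form of (1.3): sum over a_t weighted by pi, then over
    s_{t+1} weighted by p, with the reward and value terms merged."""
    return [
        sum(
            PI[s][a]
            * sum(
                P[s][a][s2] * (reward(s, a, s2) + GAMMA * V[s2])
                for s2 in STATES
            )
            for a in ACTIONS
        )
        for s in STATES
    ]


def evaluate_v_iterative(num_iters=500):
    """Iterate the backup from V = 0 until it has effectively converged."""
    V = [0.0 for _ in STATES]
    for _ in range(num_iters):
        V = backup_v(V)
    return V


def evaluate_v_direct():
    """Solve (I - gamma * P_pi) V = r_pi for two states by Cramer's rule."""
    # P_pi[s][s2] = sum_a pi(a | s) p(s2 | s, a)
    P_pi = [
        [sum(PI[s][a] * P[s][a][s2] for a in ACTIONS) for s2 in STATES]
        for s in STATES
    ]
    # r_pi[s] = expected one-step reward under pi
    r_pi = [
        sum(
            PI[s][a] * P[s][a][s2] * reward(s, a, s2)
            for a in ACTIONS
            for s2 in STATES
        )
        for s in STATES
    ]
    a00, a01 = 1.0 - GAMMA * P_pi[0][0], -GAMMA * P_pi[0][1]
    a10, a11 = -GAMMA * P_pi[1][0], 1.0 - GAMMA * P_pi[1][1]
    det = a00 * a11 - a01 * a10
    v0 = (r_pi[0] * a11 - a01 * r_pi[1]) / det
    v1 = (a00 * r_pi[1] - a10 * r_pi[0]) / det
    return [v0, v1]


V_iter = evaluate_v_iterative()
V_direct = evaluate_v_direct()
print("iterative:", V_iter)
print("direct:   ", V_direct)
```

The direct solve is written out for two states only to keep the sketch free of dependencies; for larger state spaces a general linear solver would take its place.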
4. The optimal Bellman equation
Theorem: Suppose $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then

$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1} \sim p\left(\cdot \mid s_t, a_t\right)}\left[R_t+\gamma \cdot \max _{A \in \mathcal{A}} Q_{\star}\left(S_{t+1}, A\right) \mid S_t=s_t, A_t=a_t\right] \tag{1.4}
$$
By the Bellman equation (1.1), applied to the optimal policy, we have

$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t+\gamma \cdot Q_{\star}\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
$$
Because the action $A_{t+1}=\operatorname{argmax}_A Q_{\star}\left(S_{t+1}, A\right)$ is a deterministic function of the state $S_{t+1}$, it follows that
$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot \max _{A \in \mathcal{A}} Q_{\star}\left(S_{t+1}, A\right) \mid S_t=s_t, A_t=a_t\right]
$$
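Equation (1.4) can also be read as an update rule. The sketch below repeatedly applies its right-hand side to a table of Q-values, a procedure commonly known as value iteration on Q. The text itself does not describe this algorithm, and the MDP and all identifiers are hypothetical, chosen so that the answer is easy to check by hand: from state 1, action 1 stays in state 1 and earns reward 1 forever, so $Q_\star(1,1)=1/(1-\gamma)=10$ for $\gamma=0.9$.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9
P = {  # P[s][a][s2] = p(s2 | s, a)
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}


def reward(s, a, s2):
    """R_t as a function of (S_t, A_t, S_{t+1}): 1 when the next state is 1."""
    return 1.0 if s2 == 1 else 0.0


def backup_q_star(Q):
    """Apply the right-hand side of (1.4) once to every pair (s, a)."""
    new_Q = {}
    for s in STATES:
        for a in ACTIONS:
            new_Q[(s, a)] = sum(
                P[s][a][s2]
                * (reward(s, a, s2) + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
                for s2 in STATES
            )
    return new_Q


def q_value_iteration(num_iters=500):
    """Iterate the optimal backup from Q = 0 (a contraction when GAMMA < 1)."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(num_iters):
        Q = backup_q_star(Q)
    return Q


Q_star = q_value_iteration()
# Greedy action in each state: argmax_A Q_star(s, A).
greedy = {s: max(ACTIONS, key=lambda a: Q_star[(s, a)]) for s in STATES}
print("Q_star:", Q_star)
print("greedy policy:", greedy)
```

The greedy policy read off from `Q_star` is the deterministic function of the state that the argument above relies on.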
5. The Bellman equation with multi-step targets
Suppose $R_k$ is a function of $S_k$, $A_k$, and $S_{k+1}$ for all $k=1, \cdots, n$. Then

$$
\underbrace{Q_\pi\left(s_t, a_t\right)}_{\text{expectation of } U_t}=\mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}}\left[\left(\sum_{i=0}^{m-1} \gamma^i R_{t+i}\right)+\gamma^m \cdot \underbrace{Q_\pi\left(S_{t+m}, A_{t+m}\right)}_{\text{expectation of } U_{t+m}} \mid S_t=s_t, A_t=a_t\right].
$$
Proof: Let the length of one episode be $n$. By definition, the return $U_t$ at time $t$ is the discounted sum of all rewards from time $t$ onward:

$$
U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\cdots+\gamma^{n-t} R_n .
$$
Similarly, the return at time $t+m$ can be written as

$$
U_{t+m}=R_{t+m}+\gamma R_{t+m+1}+\gamma^2 R_{t+m+2}+\cdots+\gamma^{n-t-m} R_n .
$$
Next we derive the relation between the two returns. Write $U_t$ as

$$
\begin{aligned}
U_t&=\left(R_t+\gamma R_{t+1}+\cdots+\gamma^{m-1} R_{t+m-1}\right)+\left(\gamma^m R_{t+m}+\cdots+\gamma^{n-t} R_n\right)\\
&=\left(\sum_{i=0}^{m-1} \gamma^i R_{t+i}\right)+\gamma^m \underbrace{\left(R_{t+m}+\gamma R_{t+m+1}+\cdots+\gamma^{n-t-m} R_n\right)}_{\text{equals } U_{t+m}} .
\end{aligned}
$$
Therefore the return can be written in the form

$$
U_t=\left(\sum_{i=0}^{m-1} \gamma^i R_{t+i}\right)+\gamma^m U_{t+m} .
$$

Then
$$
\begin{aligned}
Q_\pi\left(s_t, a_t\right)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[\left(\sum_{i=0}^{m-1} \gamma^i R_{t+i}\right)+\gamma^m U_{t+m} \mid S_t=s_t, A_t=a_t\right]\\
&=\underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[\sum_{i=0}^{m-1} \gamma^i R_{t+i} \mid S_t=s_t, A_t=a_t\right]}_{(1)}+\gamma^m \cdot \underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+m} \mid S_t=s_t, A_t=a_t\right]}_{(2)}
\end{aligned}
$$
where

$$
\begin{aligned}
(1)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[\sum_{i=0}^{m-1} \gamma^i R_{t+i} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, \cdots, S_{t+m}, A_{t+1}, \cdots, A_{t+m-1}}\left[\sum_{i=0}^{m-1} \gamma^i R_{t+i} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, \cdots, S_{t+m}, A_{t+1}, \cdots, A_{t+m}}\left[\sum_{i=0}^{m-1} \gamma^i R_{t+i} \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$

The second line is the minimal form, since all other variables are irrelevant to $R_t, \cdots, R_{t+m-1}$. Adding $A_{t+m}$ in the third line leaves the value unchanged.
and

$$
\begin{aligned}
(2)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+m} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}, \mathcal{S}_{t+m+1:}, \mathcal{A}_{t+m+1:}}\left[U_{t+m} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}}\left[\mathbb{E}_{\mathcal{S}_{t+m+1:}, \mathcal{A}_{t+m+1:}}\left[U_{t+m} \mid S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}, S_t=s_t, A_t=a_t\right] \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}}\left[\mathbb{E}_{\mathcal{S}_{t+m+1:}, \mathcal{A}_{t+m+1:}}\left[U_{t+m} \mid S_{t+m}, A_{t+m}\right] \mid S_t=s_t, A_t=a_t\right] \quad \text{(Markov property)}\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \cdots, S_{t+m}, A_{t+m}}\left[Q_\pi\left(S_{t+m}, A_{t+m}\right) \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
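Adding $(1)$ and $\gamma^m \cdot (2)$ gives the multi-step Bellman equation, which completes the proof.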
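The proof rests on the per-episode identity $U_t=\left(\sum_{i=0}^{m-1} \gamma^i R_{t+i}\right)+\gamma^m U_{t+m}$; taking expectations of both sides then gives the multi-step equation. The short sketch below checks that identity numerically on one made-up reward sequence. The function names and the 0-based indexing of `rewards` are choices made here for illustration.

```python
def discounted_return(rewards, gamma, t):
    """U_t = R_t + gamma * R_{t+1} + ... for one finished episode.

    `rewards[k]` plays the role of R_k (0-based here for convenience).
    """
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))


def m_step_form(rewards, gamma, t, m):
    """(sum_{i=0}^{m-1} gamma^i R_{t+i}) + gamma^m * U_{t+m}."""
    head = sum(gamma ** i * rewards[t + i] for i in range(m))
    tail = gamma ** m * discounted_return(rewards, gamma, t + m)
    return head + tail


rewards = [1.0, 0.0, 2.0, -1.0, 0.5]  # made-up rewards of one episode
gamma, t = 0.9, 1
# Compare the two forms for every m that keeps t + m inside the episode.
gaps = [
    abs(discounted_return(rewards, gamma, t) - m_step_form(rewards, gamma, t, m))
    for m in range(1, len(rewards) - t + 1)
]
print("largest gap over all m:", max(gaps))
```

In multi-step learning methods the tail term $\gamma^m U_{t+m}$ is typically replaced by an estimate of $Q_\pi\left(S_{t+m}, A_{t+m}\right)$, which is exactly the substitution the theorem justifies in expectation.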