Updated 2024-10-21: after doing some further work I gained a new understanding of the Bellman equation, and I also found some errors in the article, so I have re-edited and updated it here.
The Bellman equation can be written in many ways and goes by different names in different settings, but all of these forms can be converted into one another, and each describes the relation between a state (or a state-action pair) and all other states or state-action pairs. This article has three parts: first, the Bellman equation between state-value functions; second, the Bellman equation between action-value functions; third, a summary of how the parts connect.
Starting from the definitions, the connection between $V$ and $Q$ is:

v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a)]\quad(V-Q)
The Bellman equation itself has three equivalent statements:

\begin{aligned} &v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right]\right] \quad(V-V)\\ &q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right] \quad(Q-V)\\ &q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[\mathbb{E}_{a^{\prime} \sim \pi(\cdot \mid s^{\prime})}[q_\pi(s^{\prime}, a^{\prime})]\right] \quad(Q-Q) \end{aligned}
(I) The Bellman Equation: Derivation and Connections
1. The Bellman Equation between State-Value Functions
This time we start from the discounted return $G_t$, defined as:

G_t \doteq R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots
The state-value function $v_\pi(s)$ is defined as:

v_\pi(s) \doteq \mathbb{E}\left[G_t \mid S_t=s\right]
This is the form given in Section 3.5 of Sutton and Barto's *Reinforcement Learning*. Our goal is to build a relation between states. First, from

\begin{aligned} G_t & =R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \\ & =R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \\ & =R_{t+1}+\gamma G_{t+1} \end{aligned}
the state-value function can be written as:

\begin{aligned} v_\pi(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \end{aligned}
The state-value function thus consists of two parts. For $\mathbb{E}\left[R_{t+1} \mid S_t=s\right]$ we have:

\begin{aligned} \mathbb{E}\left[R_{t+1} \mid S_t=s\right] & =\sum_{r \in \mathcal{R}} r\,p(r\mid S_{t}=s)\\ &=\sum_{r \in \mathcal{R}}\sum_{a \in \mathcal{A}} r\,p(r,A_{t}=a\mid S_{t}=s)\\ &=\sum_{r \in \mathcal{R}}\sum_{a \in \mathcal{A}} r\,p(r\mid S_{t}=s,A_{t}=a)\,\pi(a\mid s)\\ &=\sum_{a \in \mathcal{A}}\pi(a\mid s)\sum_{r \in \mathcal{R}} r\,p(r\mid S_{t}=s,A_{t}=a)\\ &=\underline{\sum_{a \in \mathcal{A}}\pi(a\mid s)\sum_{s^{\prime} \in \mathcal{S}}\sum_{r \in \mathcal{R}} r\,p(s^{\prime},r\mid s,a)}\\ & =\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r \end{aligned}
The underlined line corresponds to the first term of the Bellman equation in Section 3.5 of Sutton's book. Next we handle $\mathbb{E}\left[G_{t+1} \mid S_t=s\right]$:

\begin{aligned} \mathbb{E}\left[G_{t+1} \mid S_t=s\right] &=\sum G_{t+1}\,p(G_{t+1}\mid S_t=s)\\ &=\sum_{s^{\prime}\in \mathcal{S}}\sum G_{t+1}\,p(G_{t+1},s^{\prime}\mid S_t=s)\\ &=\sum_{s^{\prime}\in \mathcal{S}}\sum G_{t+1}\,p(G_{t+1}\mid S_t=s,S_{t+1}=s^{\prime})\,p(s^{\prime}\mid s)\\ & =\sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}\left[G_{t+1} \mid S_t=s, S_{t+1}=s^{\prime}\right] p\left(s^{\prime} \mid s\right) \\ & =\sum_{s^{\prime} \in \mathcal{S}} \mathbb{E}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right] p\left(s^{\prime} \mid s\right) \\ & =\underline{\sum_{s^{\prime} \in \mathcal{S}} v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s\right)}\\ &=\sum_{s^{\prime} \in \mathcal{S}} v_\pi\left(s^{\prime}\right) \sum_{a \in \mathcal{A}} p\left(s^{\prime} \mid s, a\right) \pi(a \mid s)\\ &=\sum_{a \in \mathcal{A}} \pi(a \mid s)\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)v_\pi\left(s^{\prime}\right) \end{aligned}

The step from the fourth line to the fifth uses the Markov property: given $S_{t+1}=s^{\prime}$, the return $G_{t+1}$ no longer depends on $S_t$.
To match Sutton's notation, expand the underlined part:

\begin{aligned} \sum_{s^{\prime} \in \mathcal{S}} v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s\right)&=\sum_{a\in \mathcal{A}}\sum_{s^{\prime} \in \mathcal{S}}v_\pi\left(s^{\prime}\right) p\left(s^{\prime},a \mid s\right)\\ &=\sum_{a\in \mathcal{A}}\sum_{s^{\prime} \in \mathcal{S}}v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s,a\right)\pi(a\mid s)\\ &=\sum_{a\in \mathcal{A}}\pi(a\mid s)\sum_{s^{\prime} \in \mathcal{S}}v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s,a\right)\\ &=\underline{\sum_{a\in \mathcal{A}}\pi(a\mid s)\sum_{s^{\prime} \in \mathcal{S}}\sum_{r \in \mathcal{R}}v_\pi\left(s^{\prime}\right) p\left(s^{\prime},r \mid s,a\right)} \end{aligned}
Combining the two parts:

\begin{aligned} v_\pi(s) & =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\ & =\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right) \end{aligned}
This shows that the state-value function consists of two terms: the expected immediate reward and the discounted expected future return. Rearranging:

\begin{equation} v_\pi(s) =\sum_{a \in \mathcal{A}} \pi(a \mid s)\left[\sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right] \end{equation}
In expectation form this reads:

v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right]\right] \quad(V-V)
This form relates state values to state values. Substituting the underlined part instead yields the statement in Section 3.5 of Sutton's book:

v_\pi(s) =\sum_a \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_\pi\left(s^{\prime}\right)\right]
This is just one more application of the law of total probability, which again shows that the Bellman equation admits many equivalent descriptions. The forms above are pair-wise descriptions; stacking the equations for all states into a system gives the matrix-vector form of the Bellman equation:
\underbrace{\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]}_{v_\pi}=\underbrace{\left[\begin{array}{c} r_\pi\left(s_1\right) \\ r_\pi\left(s_2\right) \\ r_\pi\left(s_3\right) \\ r_\pi\left(s_4\right) \end{array}\right]}_{r_\pi}+\gamma \underbrace{\left[\begin{array}{llll} p_\pi\left(s_1 \mid s_1\right) & p_\pi\left(s_2 \mid s_1\right) & p_\pi\left(s_3 \mid s_1\right) & p_\pi\left(s_4 \mid s_1\right) \\ p_\pi\left(s_1 \mid s_2\right) & p_\pi\left(s_2 \mid s_2\right) & p_\pi\left(s_3 \mid s_2\right) & p_\pi\left(s_4 \mid s_2\right) \\ p_\pi\left(s_1 \mid s_3\right) & p_\pi\left(s_2 \mid s_3\right) & p_\pi\left(s_3 \mid s_3\right) & p_\pi\left(s_4 \mid s_3\right) \\ p_\pi\left(s_1 \mid s_4\right) & p_\pi\left(s_2 \mid s_4\right) & p_\pi\left(s_3 \mid s_4\right) & p_\pi\left(s_4 \mid s_4\right) \end{array}\right]}_{P_\pi} \underbrace{\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]}_{v_\pi}
That is:

v_\pi=r_\pi+\gamma P_\pi v_\pi
where $v_\pi=\left[v_\pi\left(s_1\right), \ldots, v_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$, $r_\pi=\left[r_\pi\left(s_1\right), \ldots, r_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$, and $P_\pi \in \mathbb{R}^{n \times n}$ with $\left[P_\pi\right]_{i j}=p_\pi\left(s_j \mid s_i\right)$. In tabular MDPs this form has a closed-form solution, $v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi$, and it is widely used in derivations for tabular MDPs; interested readers can look this up.
2. The Bellman Equation between Action-Value Functions
The action-value function $q_\pi(s, a)$ of a state-action pair $(s, a)$ is defined as:

q_\pi(s, a) \doteq \mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]
By the law of total probability:

\underbrace{\mathbb{E}\left[G_t \mid S_t=s\right]}_{v_\pi(s)}=\sum_{a \in \mathcal{A}} \underbrace{\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]}_{q_\pi(s, a)} \pi(a \mid s)
so that:

\begin{aligned} v_\pi(s)&=\sum_{a \in \mathcal{A}} \pi(a \mid s) q_\pi(s, a)\\ &=\mathbb{E}_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a)]\quad(V-Q) \end{aligned}
This expectation form connects state values and action values. Substituting it into Equation (1) gives:

\sum_{a \in \mathcal{A}} \pi(a \mid s) q_\pi(s, a)=\sum_{a \in \mathcal{A}} \pi(a \mid s)\left[\sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right]
Since this identity holds for every policy $\pi$, the coefficients of $\pi(a \mid s)$ on the two sides must agree term by term (equivalently, derive $q_\pi$ directly by conditioning on $A_t=a$), which gives:

\begin{equation} q_\pi(s, a)=\sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right) \end{equation}
In expectation form this can be written as:

q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right] \quad(Q-V)
This expectation form connects action values to state values. Recall the (V-Q) relation:

v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a)]\quad(V-Q)

Substituting it, evaluated at $s^{\prime}$, into (Q-V) yields:

q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[\mathbb{E}_{a^{\prime} \sim \pi(\cdot \mid s^{\prime})}[q_\pi(s^{\prime}, a^{\prime})]\right] \quad(Q-Q)
This form relates action values to action values. Analogously, the Bellman equation for action-value functions also has a matrix-vector form, which we omit here.
3. Summary
Starting from the definitions, the connection between $V$ and $Q$ is:

v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a)]\quad(V-Q)
The Bellman equation itself has three equivalent statements:

\begin{aligned} &v_\pi(s)=\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right]\right] \quad(V-V)\\ &q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[v_\pi(s^{\prime})\right] \quad(Q-V)\\ &q_\pi(s, a)=\mathbb{E}[r\mid s,a]+\gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[\mathbb{E}_{a^{\prime} \sim \pi(\cdot \mid s^{\prime})}[q_\pi(s^{\prime}, a^{\prime})]\right] \quad(Q-Q) \end{aligned}