In reinforcement learning, the first thing to understand is the definition of the (discounted) return:
$U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\cdots+\gamma^{n-t} R_{n}$
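A quick sketch of this definition. The reward sequence and discount factor below are made-up illustration values:

```python
# Discounted return U_t = R_t + γ R_{t+1} + γ² R_{t+2} + ... for a finite
# reward sequence, folded from the back via U_k = R_k + γ U_{k+1}.
def discounted_return(rewards, gamma):
    """rewards = [R_t, R_{t+1}, ..., R_n]; returns U_t."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 2.0, 3.0], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```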
Here $R$ is the reward. $R_t$ depends on the current state $S_t$, the current action $A_t$, and the next state $S_{t+1}$ (or, in a simplified view, $R_t$ depends only on the current state $S_t$ and the current action $A_t$). In other words, $R_t=r(S_t,A_t,S_{t+1})$. Uppercase letters denote random variables; lowercase letters denote values that have already been observed and carry no randomness.
The action-value function $Q_{\pi}(s_t,a_t)$ depends on the current policy $\pi$, the current action $a_t$, and the current state $s_t$. It is the expectation of the return: $Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1},...,S_{n},A_{n}}[U_t|S_t=s_t,A_t=a_t]$. That is, we take the expectation over the subsequent $S_{t+1},A_{t+1},...,S_{n},A_{n}$, which removes the randomness of those states and actions.
The state-value function $V_{\pi}(s_t)$ depends on the current policy $\pi$ and the current state $s_t$. It evaluates how good the state $s_t$ is under the current policy $\pi$; taking the expectation removes the randomness of the action:
$V_{\pi}(s_t)=\sum\limits_{a\in\mathcal{A}}\pi(a|s_t)\,Q_{\pi}(s_t,a)$
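A tiny sketch of this weighted sum; the policy probabilities and Q-values below are invented for illustration:

```python
# V_pi(s_t) = sum over actions of pi(a|s_t) * Q_pi(s_t, a).
def state_value(policy_probs, q_values):
    """policy_probs[a] = pi(a|s_t), q_values[a] = Q_pi(s_t, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

pi = [0.3, 0.7]  # pi(a|s_t) over two actions
q  = [1.0, 2.0]  # Q_pi(s_t, a) for the same two actions
v = state_value(pi, q)  # 0.3*1.0 + 0.7*2.0 = 1.7 (up to float rounding)
```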
The optimal action-value function $Q_*(s_t,a_t)$ is the value of taking action $a_t$ in state $s_t$ when the policy is optimal. This value is the highest attainable over all policies when taking $a_t$ in $s_t$, precisely because the policy is optimal: $Q_*(s_t,a_t)=\max_{\pi}Q_{\pi}(s_t,a_t)$
The optimal state-value function $V_*(s_t)$ satisfies $V_*(s_t)=\max_{a}Q_*(s_t,a)$: under the optimal policy, it is the highest optimal action value $Q_*(s_t,a)$ over all actions.
Understanding the action-value and state-value functions
For more intuition, the blog post 《强化学习中状态价值函数和动作价值函数的理解》 explains both functions clearly and accessibly.
Bellman equation 1
$Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi}(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]$
Proof:
1. $U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\cdots+\gamma^{n-t} R_{n}$
2. $U_t=R_t+\gamma U_{t+1}$
3. $Q_{\pi}(s_t,a_t)=E_{S_{t+1:},A_{t+1:}}[U_t|S_t=s_t,A_t=a_t]$, where the colon abbreviates "this time step and all later ones"
4. Substituting 2 into 3 gives $Q_{\pi}(s_t,a_t)=E_{S_{t+1:},A_{t+1:}}[R_t+\gamma U_{t+1}|S_t=s_t,A_t=a_t]$
5. Split this expectation into two parts: $E_{S_{t+1:},A_{t+1:}}[R_t|S_t=s_t,A_t=a_t]$ and $\gamma\,E_{S_{t+1:},A_{t+1:}}[U_{t+1}|S_t=s_t,A_t=a_t]$
6. In the first part, $R_t$ depends only on the current state $S_t$, the current action $A_t$, and the next state $S_{t+1}$, so it reduces to $E_{S_{t+1}}[R_t|S_t=s_t,A_t=a_t]$
7. For the second part, consider $E_{S_{t+1:},A_{t+1:}}[U_{t+1}|S_t=s_t,A_t=a_t]$
8. $=E_{S_{t+1},A_{t+1}}[E_{S_{t+2:},A_{t+2:}}[U_{t+1}|S_{t+1},A_{t+1}]|S_t=s_t,A_t=a_t]$ by the law of iterated expectations
9. $=E_{S_{t+1},A_{t+1}}[Q_{\pi}(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]$
Substituting 6 and 9 into 4 yields $Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi}(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]$, which completes the proof.
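The equation can also be checked numerically. The sketch below builds a made-up two-step MDP (the transition table `P`, policy `pi`, and reward function `r` are all invented for illustration), computes $E[U_t]$ by enumerating every trajectory, and compares it with the right-hand side of Bellman equation 1:

```python
import itertools

gamma = 0.9
P = {  # P[s][a][s2] = transition probability p(s'|s,a)
    0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}},
}
pi = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # pi(a|s)

def r(s, a, s2):  # reward R_t = r(S_t, A_t, S_{t+1})
    return float(s + a - s2)

# The episode ends after step t+1, so Q_pi at the last step is just E[R_{t+1}].
def q_last(s, a):
    return sum(p * r(s, a, s2) for s2, p in P[s][a].items())

# Left side: E[U_t] = E[R_t + gamma * R_{t+1}], enumerating all outcomes.
def q_brute(s, a):
    total = 0.0
    for s2, p2 in P[s][a].items():
        for a2, pa in pi[s2].items():
            for s3, p3 in P[s2][a2].items():
                total += p2 * pa * p3 * (r(s, a, s2) + gamma * r(s2, a2, s3))
    return total

# Right side: E[R_t + gamma * Q_pi(S_{t+1}, A_{t+1})].
def q_bellman(s, a):
    total = 0.0
    for s2, p2 in P[s][a].items():
        total += p2 * r(s, a, s2)
        for a2, pa in pi[s2].items():
            total += p2 * pa * gamma * q_last(s2, a2)
    return total

for s, a in itertools.product([0, 1], [0, 1]):
    assert abs(q_brute(s, a) - q_bellman(s, a)) < 1e-12
print("Bellman equation 1 holds on this MDP")
```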
Bellman equation 2
Since $V_{\pi}(S_{t+1})=E_{A_{t+1}}[Q_{\pi}(S_{t+1},A_{t+1})]$,
Bellman equation 1 above can therefore be rewritten as
$Q_{\pi}(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma V_{\pi}(S_{t+1})|S_t=s_t,A_t=a_t]$
Bellman equation 3
Since $V_{\pi}(S_{t})=E_{A_{t}}[Q_{\pi}(S_{t},A_{t})]$,
Bellman equation 2 above can likewise be rewritten as
$V_{\pi}(s_{t})=E_{A_t,S_{t+1}}[R_t+\gamma V_{\pi}(S_{t+1})|S_t=s_t]$
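Bellman equation 3 is the basis of iterative policy evaluation: treat it as an update rule and iterate to a fixed point. A minimal sketch on a toy MDP (the `P`, `pi`, and `r` below are invented for illustration):

```python
# Iterative policy evaluation via Bellman equation 3:
# V(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma * V(s')]
gamma = 0.9
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {1: 1.0}},   # P[s][a][s2] = p(s'|s,a)
     1: {0: {0: 1.0},         1: {1: 1.0}}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

def r(s, a, s2):  # reward 1 for landing in state 1, else 0
    return 1.0 if s2 == 1 else 0.0

V = {0: 0.0, 1: 0.0}
for _ in range(1000):  # fixed-point iteration; converges since gamma < 1
    V = {s: sum(pa * p * (r(s, a, s2) + gamma * V[s2])
                for a, pa in pi[s].items()
                for s2, p in P[s][a].items())
         for s in V}
print(V)
```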
The optimal Bellman equation
$Q_*(s_t,a_t)=E_{S_{t+1}\sim p(\cdot|s_t,a_t)}[R_t+\gamma \max_{A\in \mathcal{A}}Q_*(S_{t+1},A)|S_t=s_t,A_t=a_t]$
Let $\pi^*=\operatorname{argmax}_{\pi}Q_{\pi}(s,a)$ be the optimal policy.
By Bellman equation 1,
$Q_{\pi^*}(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi^*}(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]$
Since $Q_{\pi^*}(s_t,a_t)=Q_{*}(s_t,a_t)$, we obtain
$Q_{*}(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{*}(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]$
The action $A_{t+1}=\operatorname{argmax}_A Q_{*}(S_{t+1},A)$ is a deterministic function of the state $S_{t+1}$ (it is simply the best action there), so the inner term becomes $\max_{A\in\mathcal{A}}Q_*(S_{t+1},A)$ and
$Q_*(s_t,a_t)=E_{S_{t+1}\sim p(\cdot|s_t,a_t)}[R_t+\gamma \max_{A\in \mathcal{A}}Q_*(S_{t+1},A)|S_t=s_t,A_t=a_t]$
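The optimal Bellman equation likewise yields Q-value iteration: apply it as an update until $Q$ converges to $Q_*$, then act greedily. A minimal sketch on a toy MDP invented for illustration:

```python
# Q-value iteration via the optimal Bellman equation:
# Q(s,a) <- sum_s' p(s'|s,a) [r(s,a,s') + gamma * max_a' Q(s',a')]
gamma = 0.9
P = {0: {0: {0: 0.9, 1: 0.1}, 1: {1: 1.0}},   # P[s][a][s2] = p(s'|s,a)
     1: {0: {0: 1.0},         1: {1: 1.0}}}

def r(s, a, s2):  # reward 1 for landing in state 1, else 0
    return 1.0 if s2 == 1 else 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for _ in range(1000):  # contraction mapping, converges to Q*
    Q = {(s, a): sum(p * (r(s, a, s2) + gamma * max(Q[(s2, 0)], Q[(s2, 1)]))
                     for s2, p in P[s][a].items())
         for (s, a) in Q}

# The greedy action argmax_a Q*(s,a) in each state gives the optimal policy.
best = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(best)
```

In this toy MDP action 1 keeps collecting reward 1 in state 1, so the greedy policy picks action 1 in both states.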