Markov Decision Process
1 References
David Silver's slides: https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
Bolei Zhou's slides: https://github.com/zhoubolei/introRL
2 Overview
Markov Decision Processes (MDPs) are a formal description of the reinforcement learning environment:
- The environment is fully observable
- The current state completely characterizes the process
- Almost all RL problems can be formulated as MDPs
3 Markov Property
- Markov property: the next state depends only on the current state
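The standard formal statement (as in the David Silver slides) is that a state $S_t$ is Markov if and only if
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$$
i.e. the future is independent of the past given the present.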
4 MP, MRP, and MDP
4.1 Markov Process (MP)
4.2 Markov Reward Process (MRP)
Return $G_t$
$G_t$: the total discounted reward accumulated from time step $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma$ expresses how much we value immediate reward relative to future reward:
- $\gamma$ close to 0 means we prefer near-term reward
- $\gamma$ close to 1 means we also care about long-term reward
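As a concrete illustration, here is a minimal sketch of computing a (finite-horizon) return in Python; the function name and the sample rewards are my own, not from the slides:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards,
    where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Work backwards so each earlier reward picks up one more factor of gamma
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards 1, 0, 2 with gamma = 0.9  ->  1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```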
Value Function $v(s)$
$v(s)$ is the state value function. For an MRP, $v(s)$ gives the expected return starting from state $s$:
$$v(s) = E[G_t \mid s_t = s]$$
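A rough sketch of estimating $v(s)$ by Monte Carlo sampling from a small MRP; the 2-state transition matrix `P`, reward vector `R`, and truncation horizon are toy assumptions of mine, not from the slides:

```python
import numpy as np

def mc_value_estimate(P, R, gamma, start_state, n_episodes=3000, horizon=100, seed=0):
    """Estimate v(s) = E[G_t | s_t = s] by averaging sampled (truncated) returns."""
    rng = np.random.default_rng(seed)
    n_states = len(R)
    returns = []
    for _ in range(n_episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]              # R[s] plays the role of R_{t+1} given s_t = s
            discount *= gamma
            s = rng.choice(n_states, p=P[s])  # sample the next state from row s of P
        returns.append(g)
    return float(np.mean(returns))

# Toy 2-state MRP
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, -1.0])
print(mc_value_estimate(P, R, gamma=0.9, start_state=0))
```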
Bellman Equation for MRPs
$v(s)$ can be decomposed into two parts:
- the immediate reward $R_{t+1}$
- the discounted value of the successor state: $\gamma v(s_{t+1})$
Accordingly, $v(s)$ can be expanded as follows:
$$
\begin{aligned}
v(s) &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s_t = s] \\
&= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid s_t = s] \\
&= E[R_{t+1} + \gamma G_{t+1} \mid s_t = s] \\
&= E[R_{t+1} + \gamma v(s_{t+1}) \mid s_t = s]
\end{aligned}
$$
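To see the equation in action: stacking it over all states gives the matrix form $v = R + \gamma P v$, which for a small MRP can be solved directly as $v = (I - \gamma P)^{-1} R$. A sketch using the same toy MRP as above:

```python
import numpy as np

# Same illustrative 2-state MRP as in the Monte Carlo sketch above
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, -1.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P @ v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # exact values; the Monte Carlo estimate above should be close to v[0]
```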
4.3 Markov Decision Process (MDP)
Policies
A policy is a mapping from states $s$ to actions $a$.
State-Value Function
The state-value function $v_\pi(s)$ of an MDP is the expected return when starting from state $s$ and then following policy $\pi$:
$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$
Action-Value Function
The action-value function $q_\pi(s,a)$ of an MDP is the expected return when starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q_\pi(s,a) = E_\pi[G_t \mid S_t = s, A_t = a]$$
Bellman Equation for MDPs
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$q_\pi(s,a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
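A sketch of putting both Bellman expectation equations to work via iterative policy evaluation; the toy MDP (transition tensor `P[s, a, s']`, reward table `R[s, a]`) and the uniform random policy are illustrative assumptions, not from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, n_iters=500):
    """Iterate the Bellman expectation equations for a fixed policy pi(a|s)."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        # v(s) = sum_a pi(a|s) q(s,a)
        v = np.einsum("sa,sa->s", pi, q)
    return v, q

# Toy MDP: 2 states, 2 actions
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a]
pi = np.full((2, 2), 0.5)                   # uniform random policy pi(a|s)
v_pi, q_pi = policy_evaluation(P, R, pi, gamma=0.9)
print(v_pi, q_pi, sep="\n")
```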