Markov Decision Processes
MDPs formally describe an environment for RL
Markov Processes
Definition
A memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the [Markov property](#Markov Property)
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state [transition probability matrix](#state transition matrix), $\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$
Markov Property
“The future is independent of the past given the present”
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$$
The state is a sufficient statistic of the future
State Transition Matrix
state transition probability
(subscript convention: $\mathcal{P}_{source\_destination}$)
$$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$$
state transition matrix
$\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:
$$\mathcal{P} = \left[ \begin{matrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{matrix} \right]$$
By the properties of probability, each row must sum to one:
$$\sum_{j=1}^n \mathcal{P}_{ij} = 1 \qquad \forall i = 1, \dots, n$$
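As a quick illustration (not part of the original notes; the 3-state chain below is made up), a transition matrix can be stored as a NumPy array, checked for the row-sum property, and used to sample a trajectory:

```python
import numpy as np

# Hypothetical 3-state Markov chain; P[i, j] = P[S_{t+1} = j | S_t = i],
# so rows are source states and columns are destination states.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

# Each row must sum to 1: from any state we transition somewhere with probability 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample a short trajectory starting from state 0.
rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(5):
    state = rng.choice(len(P), p=P[state])
    trajectory.append(int(state))
print(trajectory)
```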
Example:
Markov Reward Process
Definition
a Markov chain with values
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Note that the reward here depends only on the state, $\mathcal{R}_s$: for example, in the figure below, Class 1 gives $R = -2$ whether the next state is Facebook or Class 2.
Return
The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
- $\gamma \in [0,1]$ controls how much future rewards are worth at the present time-step; because we prefer rewards now, future rewards are discounted (see the small sketch after this list)
- A $\gamma$ close to 0 is "myopic": it mainly values short-term reward
- A $\gamma$ close to 1 is "far-sighted": it values future rewards almost as much as immediate ones
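A tiny sketch (the reward sequence is invented) showing how $\gamma$ trades off immediate against future reward when computing $G_t$:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

rewards = [-2, -2, -2, 10]        # hypothetical rewards from one episode
print(discounted_return(rewards, gamma=0.1))  # ~ -2.21: "myopic", dominated by the first reward
print(discounted_return(rewards, gamma=1.0))  #    4.0 : "far-sighted", the undiscounted sum
```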
Why discount
- Some Markov processes contain cycles and never terminate; discounting avoids infinite returns.
- We do not have a perfect model of the environment, so our estimates of the future are not necessarily accurate and we cannot fully trust the model; the discount expresses this uncertainty, making us prefer rewards obtained sooner rather than at some distant future point.
- If the reward has real value, we usually prefer to receive it immediately rather than later (money now is worth more than money later).
- Human behaviour also shows a preference for immediate reward.
- Sometimes the factor can be set to 0, in which case only the immediate reward matters; it can also be set to 1, in which case future rewards are not discounted and count exactly as much as current ones.
Value Function
The state-value function $\operatorname{v}(s)$ of an MRP is the expected return starting from state $s$:
$$\operatorname{v}(s) = \mathbb{E}[G_t \mid S_t = s]$$
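As a rough sketch (the MRP below is invented, not the student MRP from the lecture), $\operatorname{v}(s)$ can be estimated by Monte-Carlo: sample many episodes starting from $s$ and average their returns:

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is terminal with reward 0.
P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
gamma, terminal = 0.9, 2
rng = np.random.default_rng(0)

def sample_return(s):
    """Sample one episode from state s and return its discounted return G_t."""
    g, discount = 0.0, 1.0
    while s != terminal:
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return g

v0 = np.mean([sample_return(0) for _ in range(5000)])
print(f"Monte-Carlo estimate of v(0) ~ {v0:.2f}")
```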
Bellman Equation
Writing $S_{t+1}$ as $s'$, by the definition of expectation:
$$\color{red}{\operatorname{v}(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\operatorname{v}(s')} \tag{4}$$
The step from equation 1. to equation 2. is not entirely obvious and needs a further proof:
$$\mathbb{E}[G_{t+1} \mid S_{t}] = \mathbb{E}\left[\operatorname{v}(S_{t+1}) \mid S_t \right] = \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right]$$
First recall the definition of conditional expectation: $\mathbb{E}(X \mid Y = y) = \sum_{x} x\operatorname{P}(X = x \mid Y = y)$
Write $G_{t+1} = g'$, $S_{t+1} = s'$, $S_t = s$:
$$\begin{aligned} \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right] &= \mathbb{E}\left[\mathbb{E}[g' \mid S'] \mid S_t \right] \\ &= \mathbb{E}\left[\sum_{g'}g'p(g' \mid s') \mid s \right] \\ &= \sum_{s'} \left(\sum_{g'}g'p(g' \mid s',s) \right)p(s' \mid s) \\ &= \sum_{s'} \sum_{g'} g' \frac{p(g',s',s)}{p(s',s)} \frac{p(s',s)}{p(s)} \\ &= \sum_{s'} \sum_{g'} g'p(g',s' \mid s) \\ &= \sum_{g'} g'p(g' \mid s) \\ & = \mathbb{E}[G_{t+1} \mid S_t = s] \end{aligned}$$

(Going from the second to the third line uses the Markov property: given $s'$, the return $g'$ is independent of $s$, so $p(g' \mid s') = p(g' \mid s', s)$.)
Equation 3. in Matrix form
In matrix form, the Bellman equation reads $\mathbf{v} = \mathcal{R} + \gamma \mathcal{P} \mathbf{v}$, which can be solved directly for the value function:
$$\mathbf{v} = \left(I - \gamma \mathcal{P}\right)^{-1} \mathcal{R}$$
However, for $n$ states the computational complexity is $O(n^3)$, so the analytic solution is only practical for small MRPs. For large MRPs, iterative methods are used (a numerical sketch of the closed-form solution follows this list):
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
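Before turning to those methods, the closed-form solution itself is easy to check numerically on a small MRP. A minimal sketch, reusing the invented 3-state MRP from the Monte-Carlo snippet above:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, computed by solving the linear system
# (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)   # exact values; v[0] should roughly match the Monte-Carlo estimate above
```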
Markov Decision Processes
Definition
An MDP is an MRP with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^{\color{red}a} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, {\color{red}A_t = a}\right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s^{\color{red}a} = \mathbb{E}[R_{t+1} \mid S_t = s, {\color{red}A_t = a}]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Policy
A policy $\pi$ is a distribution over actions given states:
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s \right]$$
Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
- The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
- The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$, where
$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_{s}^a$$
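A minimal sketch of this collapse from an MDP plus a policy into the induced MRP (all arrays below are invented for illustration; the assumed shapes are `P_a[a, s, s']`, `R_a[s, a]`, `pi[s, a]`):

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP.
P_a = np.array([[[0.7, 0.3, 0.0],      # transitions under action 0
                 [0.0, 0.6, 0.4],
                 [0.0, 0.0, 1.0]],
                [[0.2, 0.8, 0.0],      # transitions under action 1
                 [0.5, 0.0, 0.5],
                 [0.0, 0.0, 1.0]]])
R_a = np.array([[-1.0,  0.0],          # R_s^a, indexed [s, a]
                [-2.0,  1.0],
                [ 0.0,  0.0]])
pi = np.array([[0.5, 0.5],             # pi(a | s), one row per state
               [0.9, 0.1],
               [0.5, 0.5]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'} ;  R^pi_s = sum_a pi(a|s) R_s^a
P_pi = np.einsum('sa,ast->st', pi, P_a)
R_pi = (pi * R_a).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # still a valid transition matrix
print(P_pi)
print(R_pi)
```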
Value Function
- State-value function: $v_\pi(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s \right]$
- Action-value function: $q_\pi(s,a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a \right]$
Bellman Expectation Equation
- The value function can be decomposed into the immediate reward plus the discounted value of the successor state:

$${\color{blue} \operatorname{v}_{\color{red}\pi}(s) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t = s \right] } \tag{5}$$

$${\color{blue} q_{\color{red}\pi}(s,a) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] } \tag{6}$$
- Equations 5. and 6. express the relation between the value function at the current state and at the successor state.
- Next, consider the relation between the state-value function and the action-value function:
$$\operatorname{v}_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \qquad \text{(sum over all actions } a \text{, shown as filled black circles)} \tag{7}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \qquad \text{(sum over all successor states } s' \text{, shown as open circles)} \tag{8}$$
- Substituting equations 7. and 8. into each other yields the forms of equations 5. and 6. with the expectation $\mathbb{E}[\,\cdot\,]$ written out explicitly:
$$\operatorname{v}_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \tag{9}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}}\pi(a' \mid s')\, q_\pi(s',a') \tag{10}$$
Alternatively, equations 9. and 10. can be derived directly from equations 5. and 6.:
$$\begin{aligned} \operatorname{v}_{\color{red}\pi}(s) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t =s \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s \right] + \gamma \mathbb{E}_{\color{red}\pi} \left[\operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t =s \right] \\ &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^\pi \operatorname{v}_\pi(s') \\ &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \left[ \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{P}_{ss'}^a \right] \operatorname{v}_\pi(s') \\ & = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \triangleq \text{Equation 9.} \end{aligned}$$
$$\begin{aligned} q_{\color{red}\pi}(s,a) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t =a \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E}_{\color{red}\pi} \left[q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t =a \right] \\ &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}}\pi(a' \mid s')\, q_\pi(s',a') \triangleq \text{Equation 10.} \end{aligned}$$
Equation 9. can be applied directly to compute the following example:
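The figure for that example is not reproduced here. As a substitute sketch (reusing the invented `P_a`, `R_a`, `pi` arrays from the MDP/policy snippet above), equation 9. can be turned into iterative policy evaluation: apply the right-hand side repeatedly until $v_\pi$ stops changing:

```python
import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma=0.9, tol=1e-8):
    """Iterate equation (9): v(s) <- sum_a pi(a|s) [ R_s^a + gamma * sum_s' P_ss'^a v(s') ]."""
    v = np.zeros(R_a.shape[0])
    while True:
        # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')   -- equation (8)
        q = R_a + gamma * np.einsum('ast,t->sa', P_a, v)
        # v(s) = sum_a pi(a|s) q(s, a)                      -- equation (7)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# e.g. v_pi = policy_evaluation(P_a, R_a, pi) with the arrays defined above
```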
Optimal Value Function
Define: $\operatorname{v}_*(s) = \underset{\pi}{\operatorname{max}}\ \operatorname{v}_{\pi}(s)$ and $q_*(s,a) = \underset{\pi}{\operatorname{max}}\ q_{\pi}(s,a)$.
An MDP is “solved” when we know the optimal value function
Comparing (ordering) policies: $\pi \geq \pi'$ if $\operatorname{v}_\pi(s) \geq \operatorname{v}_{\pi'}(s), \ \forall s$.
Finding an Optimal Policy
If we know $q_*(s,a)$, we immediately have the optimal policy:
$$\pi_*(a \mid s) = \begin{cases} 1, & \text{if}\ a = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_*(s,a) \\ 0, & \text{otherwise} \end{cases}$$
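A small sketch of reading this deterministic greedy policy off a table of optimal action-values (the `q_star` numbers are invented):

```python
import numpy as np

# Hypothetical q_*(s, a): one row per state, one column per action.
q_star = np.array([[ 1.0,  3.0],
                   [ 0.5, -1.0],
                   [ 0.0,  0.0]])

best = q_star.argmax(axis=1)             # arg max_a q_*(s, a) for every state
pi_star = np.eye(q_star.shape[1])[best]  # one-hot rows: pi_*(a | s) = 1 for the best action
print(pi_star)
```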
Bellman Optimality Equation
$$\operatorname{v}_*(s) = \underset{a}{\operatorname{max}}\ q_{\color{red}*}(s,a) \tag{11}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \tag{12}$$
Substituting Equations 11. and 12. into each other:
$$\operatorname{v}_*(s) = \underset{a}{\operatorname{max}}\ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \tag{11}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a\ \underset{a'}{\operatorname{max}}\ q_{\color{red}*}(s',a') \tag{12}$$
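Because of the max, the Bellman optimality equation is non-linear and has no general closed-form solution, so it is typically solved iteratively. A minimal value-iteration sketch under the same assumed array layout (`P_a[a, s, s']`, `R_a[s, a]`) as the earlier snippets:

```python
import numpy as np

def value_iteration(P_a, R_a, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R_s^a + gamma * sum_s' P_ss'^a v(s') ] (equations 11 + 12)."""
    v = np.zeros(R_a.shape[0])
    while True:
        q = R_a + gamma * np.einsum('ast,t->sa', P_a, v)   # equation (12)
        v_new = q.max(axis=1)                              # equation (11)
        if np.max(np.abs(v_new - v)) < tol:
            # Greedy policy with respect to q_*, as in "Finding an Optimal Policy".
            return v_new, q.argmax(axis=1)
        v = v_new

# e.g. v_star, greedy_actions = value_iteration(P_a, R_a)
```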