Lect2_MDPs

Markov Decision Processes

MDPs formally describe an environment for RL

Markov Processes

Definition

A memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the [Markov property](#Markov Property)

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state [transition probability matrix](#state transition matrix)
    $$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right]$$

Markov Property

“The future is independent of the past given the present”
$$\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \dots, S_t\right]$$
The state is a sufficient statistic of the future

State Transition Matrix

State transition probability (subscripts read source state first, destination state second):
$$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right]$$
The state transition matrix $\mathcal{P}$ defines the transition probabilities from every state $s$ to every successor state $s'$:
$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$$
Since each row is a probability distribution, it must satisfy
$$\sum_{j=1}^n \mathcal{P}_{ij} = 1 \qquad \forall i = 1, \dots, n$$
Example:

[Figure: student Markov chain example and its transition matrix]
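A transition matrix like this can be written down and checked numerically. Below is a minimal Python sketch with a made-up 3-state chain (not the one from the figure); it verifies the row-sum constraint and samples a trajectory using only the current state, as the Markov property allows:

```python
import numpy as np

# Hypothetical 3-state Markov chain; row i holds P[S_{t+1} = j | S_t = i].
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.2, 0.8],
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row sums to 1

def sample_chain(P, start, steps, seed=0):
    """Sample S_1, S_2, ..., conditioning only on the current state."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, start=0, steps=10))
```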

Markov Reward Process

Definition

a Markov chain with values

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix,
    $$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right]$$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Note that the reward here depends only on the state, $\mathcal{R}_s$. For example, in the figure below, leaving Class 1 gives $R = -2$ regardless of whether the next state is Facebook or Class 2.

[Figure: student MRP example, with a reward attached to each state]

Return

The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • $\gamma \in [0,1]$ determines how much future rewards are worth at the present time-step: rewards available now are preferred, so future rewards are discounted.
    • $\gamma$ close to 0 gives a "myopic" evaluation that emphasizes short-term reward.
    • $\gamma$ close to 1 gives a "far-sighted" evaluation that values future rewards almost as much as immediate ones.
Why discount?
  • Some Markov processes contain cycles and never terminate; discounting avoids an infinite return.
  • We never have a perfect model of the environment, so estimates of the future are uncertain and not fully trustworthy; the discount expresses this uncertainty by preferring reward obtained soon over reward promised at some future point.
  • If the reward carries real value, immediate reward is preferable to delayed reward (money now is worth more than money later).
  • Human behaviour also shows a preference for immediate reward.
  • The factor can sometimes be set to 0, in which case only the immediate reward matters, or to 1, in which case future rewards are not discounted and count the same as the current reward.
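As a quick check of the definition above, here is a small Python sketch that computes a (truncated) discounted return for a made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}; `rewards` holds R_{t+1}, R_{t+2}, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards observed after time-step t.
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2 - 1 - 0.5 + 1.25 = -2.25
```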

Value Function

The state value function $\operatorname{v}(s)$ of an MRP is the expected return starting from state $s$:
$$\operatorname{v}(s) = \mathbb{E}[G_t \mid S_t = s]$$
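One way to read this definition operationally is as a Monte-Carlo average of sampled (truncated) returns. A rough sketch, assuming the MRP is given as a transition matrix `P` and a reward vector `R` (hypothetical inputs, with `R[s]` playing the role of $\mathcal{R}_s$):

```python
import numpy as np

def mc_value_estimate(P, R, gamma, s, episodes=5000, horizon=100, seed=0):
    """Estimate v(s) = E[G_t | S_t = s] by averaging truncated sampled returns."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[state]        # reward R_{t+1} received for leaving `state`
            discount *= gamma
            state = int(rng.choice(len(R), p=P[state]))
        total += g
    return total / episodes
```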

Bellman Equation

The value function can be decomposed into the immediate reward $R_{t+1}$ plus the discounted value of the successor state $\gamma \operatorname{v}(S_{t+1})$:

$$\operatorname{v}(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]$$
$$\operatorname{v}(s) = \mathbb{E}\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \tag{1}$$
$$\operatorname{v}(s) = \mathbb{E}\left[R_{t+1} + \gamma \operatorname{v}(S_{t+1}) \mid S_t = s\right] \tag{2}$$

Writing $S_{t+1}$ as $s'$ and expanding the expectation over the successor state:

$$\operatorname{v}(s) = \mathcal{R}_s + \gamma\, \mathbb{E}\left[\operatorname{v}(S_{t+1}) \mid S_t = s\right] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right] \operatorname{v}(s') \tag{3}$$

$$\color{red}{\operatorname{v}(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} \operatorname{v}(s')} \tag{4}$$
Going from equation 1 to equation 2 is not entirely obvious; we still need to show that $\mathbb{E}[G_{t+1} \mid S_t] = \mathbb{E}\left[\operatorname{v}(S_{t+1}) \mid S_t\right] = \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t\right]$.

First recall the definition of conditional expectation: $\mathbb{E}[X \mid Y = y] = \sum_{x} x \operatorname{P}(X = x \mid Y = y)$

Write $G_{t+1} = g'$, $S_{t+1} = s'$, $S_t = s$. Then

$$\begin{aligned} \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right] &= \mathbb{E}\left[\mathbb{E}[g' \mid s'] \mid S_t \right] \\ &= \mathbb{E}\left[\sum_{g'} g' p(g' \mid s') \;\middle|\; s \right] \\ &= \sum_{s'} \left(\sum_{g'} g' p(g' \mid s', s) \right) p(s' \mid s) \\ &= \sum_{s'} \sum_{g'} g' \frac{p(g', s', s)}{p(s', s)} \frac{p(s', s)}{p(s)} \\ &= \sum_{s'} \sum_{g'} g' p(g', s' \mid s) \\ &= \sum_{g'} g' p(g' \mid s) \\ &= \mathbb{E}[G_{t+1} \mid S_t] \end{aligned}$$
Equation 3 in matrix form, with $\mathbf{v}$ and $\mathcal{R}$ as column vectors with one entry per state:

$$\mathbf{v} = \mathcal{R} + \gamma \mathcal{P} \mathbf{v}$$

This linear system can be solved directly, giving the analytic solution of the value function: $\mathbf{v} = \left(I - \gamma \mathcal{P}\right)^{-1} \mathcal{R}$

However, for an MRP with $n$ states the matrix inversion has complexity $O(n^3)$, so the analytic solution is only practical for small MRPs. For large MRPs, iterative methods are used instead (a numerical sketch of the direct solution follows the list below):

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-Difference learning
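A direct translation of the closed-form solution, using a linear solve rather than an explicit inverse (the 3-state MRP below is made up):

```python
import numpy as np

def mrp_value_direct(P, R, gamma):
    """Solve v = R + gamma * P v, i.e. (I - gamma * P) v = R."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Hypothetical 3-state MRP.
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.2, 0.8]])
R = np.array([-1.0, -2.0, 10.0])
print(mrp_value_direct(P, R, gamma=0.9))
```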

Markov Decision Processes

Definition

An MDP is an MRP with decisions. It is an environment in which all states are Markov.

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix,
    $$\mathcal{P}_{ss'}^{\color{red}a} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, {\color{red}A_t = a}\right]$$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s^{\color{red}a} = \mathbb{E}[R_{t+1} \mid S_t = s, {\color{red}A_t = a}]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Policy

A policy π \pi π is a distribution over actions given states
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right]$$
Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$

  • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$

  • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$

  • where
    $$\mathcal{P}_{s,s'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}_s^a
    $$
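These two reductions are easy to express with arrays. A minimal sketch, assuming a hypothetical array layout `P[a, s, s']`, `R[s, a]` and `pi[s, a]`:

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Collapse an MDP under a fixed policy pi into the MRP <S, P_pi, R_pi, gamma>."""
    P_pi = np.einsum('sa,asn->sn', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P^a_{ss'}
    R_pi = np.einsum('sa,sa->s', pi, R)     # R_pi[s]     = sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```

Combined with the direct MRP solver sketched earlier, this already gives exact policy evaluation for small MDPs.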

Value Function

  • State-value function: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
  • Action-value function: $q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$

Bellman Expectation Equation

  1. The value function can be decomposed into the immediate reward plus the discounted value of the successor state:
    $${\color{blue} \operatorname{v}_{\color{red}\pi}(s) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t = s \right]} \tag{5}$$

    $${\color{blue} q_{\color{red}\pi}(s,a) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]} \tag{6}$$
    Equations 5 and 6 relate the value function at the current state to the value function at the successor state.

  2. Now consider the relationship between the state-value function and the action-value function.

[Backup diagram: the open circle is the state $s$; the filled black circles are the actions available in $s$]

$$\operatorname{v}_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \qquad \text{(sum over all actions } a \text{, the filled circles)} \tag{7}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \qquad \text{(sum over all successor states } s' \text{, the open circles)} \tag{8}$$

  3. Substituting equations 7 and 8 into each other yields equations 5 and 6 with the expectation $\mathbb{E}[\,\cdot\,]$ written out explicitly:


$$\operatorname{v}_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \tag{9}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s',a') \tag{10}$$
Equations 9 and 10 can also be derived directly from equations 5 and 6, respectively:
$$\begin{aligned} \operatorname{v}_{\color{red}\pi}(s) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t = s \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s \right] + \gamma\, \mathbb{E}_{\color{red}\pi} \left[\operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t = s \right] \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^\pi \operatorname{v}_\pi(s') \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \left[ \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}_{ss'}^a \right] \operatorname{v}_\pi(s') \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \triangleq \text{Equation 9} \end{aligned}$$

$$\begin{aligned} q_{\color{red}\pi}(s,a) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s, A_t = a \right] + \gamma\, \mathbb{E}_{\color{red}\pi} \left[q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s',a') \triangleq \text{Equation 10} \end{aligned}$$

Equation 9 can be applied directly to the following example:

[Figure: worked example evaluating a policy with Equation 9]
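As a sketch of how equations 7–9 can be turned into an iterative evaluation procedure (using the same hypothetical array layout `P[a, s, s']`, `R[s, a]`, `pi[s, a]` as above):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Repeatedly apply Equation 9 until the state values stop changing."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('asn,n->sa', P, v)   # Equation 8
        v_new = (pi * q).sum(axis=1)                   # Equation 7
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```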

Optimal Value Function

Define $\operatorname{v}_*(s) = \max_\pi \operatorname{v}_\pi(s)$ and $q_*(s,a) = \max_\pi q_\pi(s,a)$.

An MDP is “solved” when we know the optimal value function

Policies can be compared (partially ordered): $\pi \geq \pi'$ if $\operatorname{v}_\pi(s) \geq \operatorname{v}_{\pi'}(s)$ for all $s$.

Finding an Optimal Policy

If we know $q_*(s,a)$, we immediately have the optimal policy:
$$\pi_*(a \mid s) = \begin{cases} 1, & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_*(s,a) \\ 0, & \text{otherwise} \end{cases}$$
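In array form this is just a row-wise argmax; a small sketch assuming `q_star` is a hypothetical $(|\mathcal{S}|, |\mathcal{A}|)$ array of optimal action values:

```python
import numpy as np

def greedy_policy(q_star):
    """Deterministic optimal policy: probability 1 on argmax_a q*(s, a) in every state."""
    pi = np.zeros_like(q_star)
    pi[np.arange(q_star.shape[0]), q_star.argmax(axis=1)] = 1.0
    return pi
```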

Bellman Optimality Equation

$$\operatorname{v}_*(s) = \max_a q_{\color{red}*}(s,a) \tag{11}$$

$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \tag{12}$$

Substituting equations 11 and 12 into each other:
$$\operatorname{v}_*(s) = \max_a \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \right) \tag{13}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_{\color{red}*}(s',a') \tag{14}$$
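Because of the max, the Bellman optimality equation is non-linear and is usually solved iteratively; value iteration is one such method (not covered in the notes above). A minimal sketch, reusing the hypothetical `P[a, s, s']`, `R[s, a]` layout from earlier:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup (equations 11-14)."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('asn,n->sa', P, v)   # Equation 12
        v_new = q.max(axis=1)                          # Equation 11
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q
        v = v_new
```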
