Markov Decision Processes
MDPs formally describe an environment for RL
Markov Processes
Definition
A memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the [Markov property](#Markov Property)
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state [transition probability matrix](#state transition matrix), $\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$
Markov Property
“The future is independent of the past given the present”
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$$
The state is a sufficient statistic of the future
State Transition Matrix
state transition probability
(subscript convention: $\mathcal{P}_{source\_destination}$)
$$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$$
state transition matrix
$\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:
$$\mathcal{P} = \left[ \begin{matrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{matrix} \right]$$
By the properties of probability, each row must sum to one:
$$\sum_{j=1}^n \mathcal{P}_{ij} = 1 \qquad \forall i = 1, \dots, n$$
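As a quick illustration (not part of the original notes; the 3-state chain below is made up), a transition matrix can be stored as a NumPy array, checked for the row-sum property, and used to sample a trajectory:

```python
import numpy as np

# Hypothetical 3-state Markov chain; P[i, j] = P[S_{t+1} = j | S_t = i],
# so rows are source states and columns are destination states.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

# Each row must sum to 1: from any state we transition somewhere with probability 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample a short trajectory starting from state 0.
rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(5):
    state = rng.choice(len(P), p=P[state])
    trajectory.append(int(state))
print(trajectory)
```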
Example:
Markov Reward Process
Definition
a Markov chain with values
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Note that the reward here depends only on the state, $\mathcal{R}_s$: for example, in the figure below, Class 1 gives $R = -2$ whether the next state is Facebook or Class 2.
Return
The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
- $\gamma \in [0,1]$ controls how much future rewards are worth at the present time-step; because we prefer rewards now, future rewards are discounted (see the small sketch after this list)
- A $\gamma$ close to 0 is "myopic": it mainly values short-term reward
- A $\gamma$ close to 1 is "far-sighted": it values future rewards almost as much as immediate ones
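A tiny sketch (the reward sequence is invented) showing how $\gamma$ trades off immediate against future reward when computing $G_t$:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

rewards = [-2, -2, -2, 10]        # hypothetical rewards from one episode
print(discounted_return(rewards, gamma=0.1))  # ~ -2.21: "myopic", dominated by the first reward
print(discounted_return(rewards, gamma=1.0))  #    4.0 : "far-sighted", the undiscounted sum
```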
Why discount
- Some Markov processes contain cycles and never terminate; discounting avoids infinite returns.
- We do not have a perfect model of the environment, so our estimates of the future are not necessarily accurate and we cannot fully trust the model; the discount expresses this uncertainty, making us prefer rewards obtained sooner rather than at some distant future point.
- If the reward has real value, we usually prefer to receive it immediately rather than later (money now is worth more than money later).
- Human behaviour also shows a preference for immediate reward.
- Sometimes the factor can be set to 0, in which case only the immediate reward matters; it can also be set to 1, in which case future rewards are not discounted and count exactly as much as current ones.
Value Function
The state-value function $\operatorname{v}(s)$ of an MRP is the expected return starting from state $s$:
$$\operatorname{v}(s) = \mathbb{E}[G_t \mid S_t = s]$$
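As a rough sketch (the MRP below is invented, not the student MRP from the lecture), $\operatorname{v}(s)$ can be estimated by Monte-Carlo: sample many episodes starting from $s$ and average their returns:

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is terminal with reward 0.
P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
gamma, terminal = 0.9, 2
rng = np.random.default_rng(0)

def sample_return(s):
    """Sample one episode from state s and return its discounted return G_t."""
    g, discount = 0.0, 1.0
    while s != terminal:
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return g

v0 = np.mean([sample_return(0) for _ in range(5000)])
print(f"Monte-Carlo estimate of v(0) ~ {v0:.2f}")
```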
Bellman Equation
Writing $S_{t+1}$ as $s'$, by the definition of expectation:
$$\color{red}{\operatorname{v}(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\operatorname{v}(s')} \tag{4}$$
The step from equation 1. to equation 2. is not entirely obvious and needs a further proof:
$$\mathbb{E}[G_{t+1} \mid S_{t}] = \mathbb{E}\left[\operatorname{v}(S_{t+1}) \mid S_t \right] = \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right]$$
First recall the definition of conditional expectation: $\mathbb{E}(X \mid Y = y) = \sum_{x} x\operatorname{P}(X = x \mid Y = y)$
Write $G_{t+1} = g'$, $S_{t+1} = s'$, $S_t = s$:
$$\begin{aligned} \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right] &= \mathbb{E}\left[\mathbb{E}[g' \mid S'] \mid S_t \right] \\ &= \mathbb{E}\left[\sum_{g'}g'p(g' \mid s') \mid s \right] \\ &= \sum_{s'} \left(\sum_{g'}g'p(g' \mid s',s) \right)p(s' \mid s) \\ &= \sum_{s'} \sum_{g'} g' \frac{p(g',s',s)}{p(s',s)} \frac{p(s',s)}{p(s)} \\ &= \sum_{s'} \sum_{g'} g'p(g',s' \mid s) \\ &= \sum_{g'} g'p(g' \mid s) \\ & = \mathbb{E}[G_{t+1} \mid S_t = s] \end{aligned}$$

(Going from the second to the third line uses the Markov property: given $s'$, the return $g'$ is independent of $s$, so $p(g' \mid s') = p(g' \mid s', s)$.)
Equation 3. in Matrix form
In matrix form, the Bellman equation reads $\mathbf{v} = \mathcal{R} + \gamma \mathcal{P} \mathbf{v}$, which can be solved directly for the value function:
$$\mathbf{v} = \left(I - \gamma \mathcal{P}\right)^{-1} \mathcal{R}$$
However, for $n$ states the computational complexity is $O(n^3)$, so the analytic solution is only practical for small MRPs. For large MRPs, iterative methods are used (a numerical sketch of the closed-form solution follows this list):
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
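Before turning to those methods, the closed-form solution itself is easy to check numerically on a small MRP. A minimal sketch, reusing the invented 3-state MRP from the Monte-Carlo snippet above:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, computed by solving the linear system
# (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)   # exact values; v[0] should roughly match the Monte-Carlo estimate above
```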
Markov Decision Processes
Definition
An MDP is an MRP with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^{\color{red}a} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, {\color{red}A_t = a}\right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s^{\color{red}a} = \mathbb{E}[R_{t+1} \mid S_t = s, {\color{red}A_t = a}]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Policy
A policy $\pi$ is a distribution over actions given states:
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s \right]$$
Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
- The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
- The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$, where
$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_{s}^a$$
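A minimal sketch of this collapse from an MDP plus a policy into the induced MRP (all arrays below are invented for illustration; the assumed shapes are `P_a[a, s, s']`, `R_a[s, a]`, `pi[s, a]`):

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP.
P_a = np.array([[[0.7, 0.3, 0.0],      # transitions under action 0
                 [0.0, 0.6, 0.4],
                 [0.0, 0.0, 1.0]],
                [[0.2, 0.8, 0.0],      # transitions under action 1
                 [0.5, 0.0, 0.5],
                 [0.0, 0.0, 1.0]]])
R_a = np.array([[-1.0,  0.0],          # R_s^a, indexed [s, a]
                [-2.0,  1.0],
                [ 0.0,  0.0]])
pi = np.array([[0.5, 0.5],             # pi(a | s), one row per state
               [0.9, 0.1],
               [0.5, 0.5]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'} ;  R^pi_s = sum_a pi(a|s) R_s^a
P_pi = np.einsum('sa,ast->st', pi, P_a)
R_pi = (pi * R_a).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # still a valid transition matrix
print(P_pi)
print(R_pi)
```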
Value Function
- State-value function: $v_\pi(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s \right]$
- Action-value function: $q_\pi(s,a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a \right]$
Bellman Expectation Equation
- The value function can be decomposed into the immediate reward plus the discounted value of the successor state:

$${\color{blue} \operatorname{v}_{\color{red}\pi}(s) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t = s \right] } \tag{5}$$

$${\color{blue} q_{\color{red}\pi}(s,a) = \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] } \tag{6}$$
- Equations 5. and 6. express the relation between the value function at the current state and at the successor state.
- Next, consider the relation between the state-value function and the action-value function:
$$\operatorname{v}_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \qquad \text{(sum over all actions } a \text{, shown as filled black circles)} \tag{7}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \qquad \text{(sum over all successor states } s' \text{, shown as open circles)} \tag{8}$$
- Substituting equations 7. and 8. into each other yields the forms of equations 5. and 6. with the expectation $\mathbb{E}[\,\cdot\,]$ written out explicitly:
$$\operatorname{v}_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \tag{9}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}}\pi(a' \mid s')\, q_\pi(s',a') \tag{10}$$
Alternatively, equations 9. and 10. can be derived directly from equations 5. and 6.:
$$\begin{aligned} \operatorname{v}_{\color{red}\pi}(s) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma \operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t =s \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s \right] + \gamma \mathbb{E}_{\color{red}\pi} \left[\operatorname{v}_{\color{red}\pi}(S_{t+1}) \mid S_t =s \right] \\ &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^\pi \operatorname{v}_\pi(s') \\ &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \left[ \sum_{a \in \mathcal{A}}\pi(a \mid s)\mathcal{P}_{ss'}^a \right] \operatorname{v}_\pi(s') \\ & = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right) \triangleq \text{Equation 9.} \end{aligned}$$
$$\begin{aligned} q_{\color{red}\pi}(s,a) &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} + \gamma q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t =a \right] \\ &= \mathbb{E}_{\color{red}\pi} \left[R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E}_{\color{red}\pi} \left[q_{\color{red}\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t =a \right] \\ &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}}\pi(a' \mid s')\, q_\pi(s',a') \triangleq \text{Equation 10.} \end{aligned}$$
Equation 9. can be applied directly to compute the following example:
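The figure for that example is not reproduced here. As a substitute sketch (reusing the invented `P_a`, `R_a`, `pi` arrays from the MDP/policy snippet above), equation 9. can be turned into iterative policy evaluation: apply the right-hand side repeatedly until $v_\pi$ stops changing:

```python
import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma=0.9, tol=1e-8):
    """Iterate equation (9): v(s) <- sum_a pi(a|s) [ R_s^a + gamma * sum_s' P_ss'^a v(s') ]."""
    v = np.zeros(R_a.shape[0])
    while True:
        # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')   -- equation (8)
        q = R_a + gamma * np.einsum('ast,t->sa', P_a, v)
        # v(s) = sum_a pi(a|s) q(s, a)                      -- equation (7)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# e.g. v_pi = policy_evaluation(P_a, R_a, pi) with the arrays defined above
```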
Optimal Value Function
Define: $\operatorname{v}_*(s) = \underset{\pi}{\operatorname{max}}\ \operatorname{v}_{\pi}(s)$ and $q_*(s,a) = \underset{\pi}{\operatorname{max}}\ q_{\pi}(s,a)$.
An MDP is “solved” when we know the optimal value function
Comparing (ordering) policies: $\pi \geq \pi'$ if $\operatorname{v}_\pi(s) \geq \operatorname{v}_{\pi'}(s), \ \forall s$.
Finding an Optimal Policy
If we know $q_*(s,a)$, we immediately have the optimal policy:
$$\pi_*(a \mid s) = \begin{cases} 1, & \text{if}\ a = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_*(s,a) \\ 0, & \text{otherwise} \end{cases}$$
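A small sketch of reading this deterministic greedy policy off a table of optimal action-values (the `q_star` numbers are invented):

```python
import numpy as np

# Hypothetical q_*(s, a): one row per state, one column per action.
q_star = np.array([[ 1.0,  3.0],
                   [ 0.5, -1.0],
                   [ 0.0,  0.0]])

best = q_star.argmax(axis=1)             # arg max_a q_*(s, a) for every state
pi_star = np.eye(q_star.shape[1])[best]  # one-hot rows: pi_*(a | s) = 1 for the best action
print(pi_star)
```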
Bellman Optimality Equation
$$\operatorname{v}_*(s) = \underset{a}{\operatorname{max}}\ q_{\color{red}*}(s,a) \tag{11}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \tag{12}$$
Substituting Equations 11. and 12. into each other:
$$\operatorname{v}_*(s) = \underset{a}{\operatorname{max}}\ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}*}(s') \tag{11}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a\ \underset{a'}{\operatorname{max}}\ q_{\color{red}*}(s',a') \tag{12}$$
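Because of the max, the Bellman optimality equation is non-linear and has no general closed-form solution, so it is typically solved iteratively. A minimal value-iteration sketch under the same assumed array layout (`P_a[a, s, s']`, `R_a[s, a]`) as the earlier snippets:

```python
import numpy as np

def value_iteration(P_a, R_a, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R_s^a + gamma * sum_s' P_ss'^a v(s') ] (equations 11 + 12)."""
    v = np.zeros(R_a.shape[0])
    while True:
        q = R_a + gamma * np.einsum('ast,t->sa', P_a, v)   # equation (12)
        v_new = q.max(axis=1)                              # equation (11)
        if np.max(np.abs(v_new - v)) < tol:
            # Greedy policy with respect to q_*, as in "Finding an Optimal Policy".
            return v_new, q.argmax(axis=1)
        v = v_new

# e.g. v_star, greedy_actions = value_iteration(P_a, R_a)
```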