RL (2): Markov Decision Processes


The cleaning robot is simplified to the following setup:

State space:
$$\{0,1,2,3,4,5\}$$

Action space:
$$\{-1,+1\}$$

Transition function:
$$\bar f(0,\pm 1)=0,\quad \bar f(1,+1)=2,\quad \bar f(1,-1)=0,\quad \bar f(2,+1)=3,\quad \bar f(2,-1)=1$$
$$\bar f(3,+1)=4,\quad \bar f(3,-1)=2,\quad \bar f(4,+1)=5,\quad \bar f(4,-1)=4,\quad \bar f(5,\pm 1)=5$$

Reward function:
$$\rho(0,\pm 1,0)=0,\quad \rho(1,+1,2)=0,\quad \rho(1,-1,0)=1,\quad \rho(2,+1,3)=0,\quad \rho(2,-1,1)=0$$
$$\rho(3,+1,4)=0,\quad \rho(3,-1,2)=0,\quad \rho(4,+1,5)=5,\quad \rho(4,-1,3)=0,\quad \rho(5,\pm 1,5)=0$$
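For concreteness, here is a minimal Python sketch of this deterministic MDP (the helper names `transition` and `reward` are my own, not from the text):

```python
# Deterministic cleaning-robot MDP: states 0..5, actions -1 (left) and +1 (right).
# States 0 and 5 are absorbing; reaching 0 from 1 pays 1, reaching 5 from 4 pays 5.
STATES = [0, 1, 2, 3, 4, 5]
ACTIONS = [-1, +1]

def transition(s, a):
    """Deterministic transition function f(s, a) -> s'."""
    if s in (0, 5):          # absorbing terminal states
        return s
    return s + a

def reward(s, a, s_next):
    """Reward rho(s, a, s')."""
    if s == 1 and s_next == 0:
        return 1
    if s == 4 and s_next == 5:
        return 5
    return 0

# One step from state 4 moving right:
s, a = 4, +1
s_next = transition(s, a)
print(s_next, reward(s, a, s_next))   # -> 5 5
```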

1. Definitions

Definition 1.1:

Argument of the maximum: given a function $f(x)$ with maximum value $M$, the set of $x$ values at which $f(x)$ attains $M$ is written
$$\argmax_{x} f(x)$$
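For instance, if $f(x)=-(x-2)^2$ over the reals, the maximum is $M=0$ and $\argmax_x f(x)=\{2\}$.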

Definition 1.2:

Expected value of a random variable: suppose a random variable $X$ takes the value $x_1$ with probability $p_1$, $x_2$ with probability $p_2$, and so on up to $x_k$ with probability $p_k$; then the expectation of $X$ is
$$\mathbb E[X]=x_1p_1+x_2p_2+\dots+x_kp_k$$
$\mathbb E$ is called the expectation operator and has the following properties:
$$\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y],\qquad \mathbb E[X+c]=\mathbb E[X]+c,\qquad \mathbb E[cX]=c\,\mathbb E[X]$$
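For example, for a fair six-sided die, $\mathbb E[X]=\tfrac16(1+2+\dots+6)=3.5$.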

Definition 1.3:

$L^\infty$-norm: the $L^\infty$-norm of a vector $\mathbf x=[x_1,x_2,\dots,x_n]^T$, written $\|\mathbf x\|_\infty$, is the largest absolute value among the elements of $\mathbf x$:
$$\|\mathbf x\|_\infty \overset{\Delta}{=} \max_i |x_i|$$
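For example, $\|[3,-7,2]^T\|_\infty=7$.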

2. Elements of a Markov Decision Process

2.1 States, Actions, Transitions, and Rewards

States:

$$S=\{s_1,s_2,\dots,s_{|S|}\}$$

Actions:

$$A(s)=\{a_1,a_2,\dots,a_{|A|}\}$$

Transition function:

Deterministic transition function:
$$\bar f : S\times A\rightarrow S,\qquad \bar f(s,a)=s'$$

Stochastic transition function:
$$f: S\times A\times S\rightarrow [0,1],\qquad f(s,a,s')=\mathbb P(S_{t+1}=s' \mid S_t=s,A_t=a)=p^a_{ss'}$$
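As a small illustration of the stochastic case, the sketch below turns the robot's moves into a conditional distribution over next states (the 0.8/0.2 probabilities are made up for the example):

```python
import random

# Hypothetical stochastic transition f(s, a, s'): the intended move succeeds
# with probability 0.8, otherwise the robot stays where it is.
def next_state_dist(s, a):
    """Return {s': f(s, a, s')}."""
    if s in (0, 5):                 # absorbing terminal states
        return {s: 1.0}
    return {s + a: 0.8, s: 0.2}

def sample_next_state(s, a):
    """Draw S_{t+1} ~ f(s, a, .)."""
    dist = next_state_dist(s, a)
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(next_state_dist(2, +1))   # {3: 0.8, 2: 0.2}
```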

Reward function:

Deterministic reward function:
$$\bar\rho: S\times A\times S\rightarrow \mathbb R,\qquad R_t=\bar\rho(S_{t-1},A_{t-1},S_t),\quad \bar\rho(s,a,s')=r$$

State-dependent reward distribution $\rho(r\mid s)$:
$$\rho: R\times S\rightarrow [0,1]$$
$$r(s)=\mathbb E[R_t\mid S_t=s]=r_1\rho(r_1\mid s)+\dots+r_m\rho(r_m\mid s)=\sum_r r\,\rho(r\mid s)$$

State-and-action-dependent reward distribution $\rho(r\mid s,a)$:
$$\rho: R\times S\times A\rightarrow [0,1]$$
$$\begin{aligned} r(s,a)&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]\\ &=r_1\rho(r_1\mid s,a)+\dots+r_m\rho(r_m\mid s,a)\\ &=\sum_r r\,\rho(r\mid s,a) \end{aligned}$$
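In the stochastic setting, $r(s,a)$ is just a probability-weighted sum over the reward distribution; a minimal sketch (the table `reward_dist` and its numbers are illustrative, not from the text):

```python
# Hypothetical reward distribution rho(r | s, a): for each (s, a),
# a list of (reward, probability) pairs summing to 1.
reward_dist = {
    (1, -1): [(1.0, 0.9), (0.0, 0.1)],   # moving left from state 1 usually pays 1
    (4, +1): [(5.0, 0.8), (0.0, 0.2)],   # moving right from state 4 usually pays 5
}

def expected_reward(s, a):
    """r(s, a) = sum_r r * rho(r | s, a)."""
    return sum(r * p for r, p in reward_dist.get((s, a), [(0.0, 1.0)]))

print(expected_reward(1, -1))   # 0.9
print(expected_reward(4, +1))   # 4.0
```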

2.2 The Four-Argument $p$ Function

$$p(s',r\mid s,a)=f(s,a,s')\cdot\rho(r\mid s'),\qquad p: S\times R\times S\times A\rightarrow [0,1]$$

From this function we can recover:

The stochastic transition function

$$f(s,a,s')=\mathbb P\{S_t=s'\mid S_{t-1}=s,A_{t-1}=a\}=\sum_{r\in R}p(s',r\mid s,a)$$

The state-dependent reward distribution

Since $p(s',r\mid s,a)=f(s,a,s')\cdot\rho(r\mid s')$ and $f(s,a,s')=\sum_{r'\in R}p(s',r'\mid s,a)$, dividing gives
$$\rho(r\mid s')=\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)}$$

The state-and-action-dependent reward distribution

$$\rho(r\mid s,a)=\sum_{s'\in S}p(s',r\mid s,a)$$

The expected reward given a state

$$r(s)=\mathbb E[R_t\mid S_t=s]=\sum_r r\,\rho(r\mid s)$$

The expected reward given a state and an action

$$r(s,a)=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]=\sum_{r\in R}\sum_{s'\in S}r\,p(s',r\mid s,a)$$

The expected reward given a state, an action, and the next state

$$\begin{aligned} r(s,a,s')&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a,S_t=s']\\ &=\sum_r r\,\rho(r\mid s')\\ &=\sum_r r\,\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)} \end{aligned}$$
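To make the bookkeeping concrete, the sketch below stores a made-up joint table $p(s',r\mid s,a)$ and recovers $f(s,a,s')$, $\rho(r\mid s,a)$, and $r(s,a)$ by exactly the sums above; the dictionary layout and function names are my own:

```python
# Hypothetical joint table p(s', r | s, a): keys are (s, a); values map (s', r) -> probability.
p = {
    (4, +1): {(5, 5.0): 0.8, (5, 0.0): 0.1, (4, 0.0): 0.1},
}

def f(s, a, s_next):
    """f(s, a, s') = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, _), prob in p[(s, a)].items() if sp == s_next)

def rho_sa(r, s, a):
    """rho(r | s, a) = sum_{s'} p(s', r | s, a)."""
    return sum(prob for (_, rw), prob in p[(s, a)].items() if rw == r)

def r_sa(s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(rw * prob for (_, rw), prob in p[(s, a)].items())

print(f(4, +1, 5))         # 0.9
print(rho_sa(5.0, 4, +1))  # 0.8
print(r_sa(4, +1))         # 4.0
```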

2.3 Return

The discounted sum of rewards:
$$G_t\overset{\Delta}{=}R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots=\sum_{k=0}^\infty\gamma^k R_{t+k+1}$$
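For a finite episode the infinite sum is simply truncated; a one-function sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards collected after time t in the robot example: reach state 5 on the second step.
print(discounted_return([0, 5], gamma=0.9))   # 0 + 0.9*5 = 4.5
```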

2.4 Policy

$$\pi: S\times A\rightarrow[0,1],\qquad \pi(a\mid s)=\mathbb P\{A_t=a\mid S_t=s\}$$
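A stochastic policy is just a conditional distribution over actions, so acting with it means sampling; a minimal sketch (the 0.8/0.2 policy is made up):

```python
import random

# Hypothetical policy pi(a | s): move right with probability 0.8 in every non-terminal state.
policy = {s: {+1: 0.8, -1: 0.2} for s in (1, 2, 3, 4)}

def sample_action(s):
    """Draw A_t ~ pi(. | S_t = s)."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(2))   # +1 most of the time
```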

3. Value Functions and the Bellman Equation

State-value function:

$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^k R_{t+k+1}\,\Big|\,S_t=s\Big],\quad\text{for all } s\in S \end{aligned}$$

Action-value function:

$$\begin{aligned} q_\pi(s,a)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^k R_{t+k+1}\,\Big|\,S_t=s,A_t=a\Big] \end{aligned}$$

Bellman equation:

$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[R_{t+2}+\gamma R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s] \end{aligned}$$
$$\mathbb E_\pi[R_{t+1}\mid S_t=s]=\sum_a\pi(a\mid s)\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)$$
$$\begin{aligned} \mathbb E_\pi[G_{t+1}\mid S_t=s]&=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s']\\ &=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,v_\pi(s') \end{aligned}$$

$$\begin{aligned} v_\pi(s)&=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s]\\ &=\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big] \end{aligned}$$
$$\begin{aligned} q_\pi(s,a)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a] \end{aligned}$$
$$\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]=\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)$$
$$\begin{aligned} \mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a]&=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s',A_{t+1}=a']\\ &=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a') \end{aligned}$$

The Bellman equation for $q_\pi$:
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\Big[r+\gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Big]$$

The relation between $v_\pi(s)$ and $q_\pi(s,a)$:
$$v_\pi(s)=\sum_a\pi(a\mid s)\,q_\pi(s,a)$$

If the policy $\pi$ is deterministic, i.e. it puts probability $1$ on a single action $a=\pi(s)$ in each state, this reduces to
$$v_\pi(s)=q_\pi(s,\pi(s))$$

Substituting $v_\pi(s')=\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')$ into the Bellman equation for $q_\pi$ gives
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big]$$
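The Bellman equation suggests a fixed-point iteration: sweep over the states and replace $v_\pi(s)$ by its right-hand side until the values stop changing (iterative policy evaluation). Below is a sketch for the deterministic robot under a uniform random policy; the `step` and `policy_evaluation` helpers are my own names, and terminal states are fixed at value 0:

```python
# Iterative policy evaluation for the deterministic cleaning-robot MDP
# under the uniform random policy pi(a|s) = 0.5 for a in {-1, +1}.
GAMMA = 0.9
STATES = [0, 1, 2, 3, 4, 5]
TERMINAL = {0, 5}

def step(s, a):
    """Return (s', r): reward 1 for the move 1 -> 0, reward 5 for 4 -> 5."""
    if s in TERMINAL:
        return s, 0.0
    s_next = s + a
    r = 1.0 if (s, s_next) == (1, 0) else 5.0 if (s, s_next) == (4, 5) else 0.0
    return s_next, r

def policy_evaluation(tol=1e-8):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            # v(s) = sum_a pi(a|s) * [ r + gamma * v(s') ]  (transitions are deterministic)
            new_v = sum(0.5 * (r + GAMMA * v[s_next])
                        for s_next, r in (step(s, a) for a in (-1, +1)))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation())
```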

4. Optimal Policies and Optimal Value Functions

Optimal state-value and action-value functions:

$$v_*(s)\overset{\Delta}{=}\max_\pi v_\pi(s),\qquad q_*(s,a)\overset{\Delta}{=}\max_\pi q_\pi(s,a)$$

Bellman optimality equations:

$$\begin{aligned} v_*(s)&=\max_{a\in A(s)} q_*(s,a)\\ &=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ q_*(s,a)&=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ &=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma\max_{a'}q_*(s',a')\big] \end{aligned}$$
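The Bellman optimality equation gives the same kind of fixed-point iteration, now with a max over actions (value iteration); the greedy policy can then be read off with a one-step lookahead. A sketch for the deterministic robot, repeating the `step` helper so the block stands alone (names are again my own):

```python
# Value iteration for the deterministic cleaning-robot MDP.
GAMMA = 0.9
STATES = [0, 1, 2, 3, 4, 5]
TERMINAL = {0, 5}

def step(s, a):
    """Return (s', r): reward 1 for the move 1 -> 0, reward 5 for 4 -> 5."""
    if s in TERMINAL:
        return s, 0.0
    s_next = s + a
    r = 1.0 if (s, s_next) == (1, 0) else 5.0 if (s, s_next) == (4, 5) else 0.0
    return s_next, r

def value_iteration(tol=1e-8):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            # v*(s) = max_a [ r + gamma * v*(s') ]  (transitions are deterministic)
            new_v = max(r + GAMMA * v[s_next]
                        for s_next, r in (step(s, a) for a in (-1, +1)))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Greedy (optimal) policy from a one-step lookahead on v*.
    greedy = {s: max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
              for s in STATES if s not in TERMINAL}
    return v, greedy

print(value_iteration())   # with gamma = 0.9: go right everywhere, v*(4) = 5, v*(3) = 4.5, ...
```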
