Markov Decision Process
1 References
David Silver's slides: https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
Bolei Zhou's slides: https://github.com/zhoubolei/introRL
2 Overview
Markov Decision Processes (MDPs) are a formal description of the reinforcement learning environment:
- The environment is fully observable
- The current state completely characterizes the process
- Almost all RL problems can be formulated as MDPs
3 Markov Property
- Markov property: the next state depends only on the current state
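The standard formal statement (as in the David Silver slides) is that a state $S_t$ is Markov if and only if
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$$
i.e. the future is independent of the past given the present.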
4 MP, MRP, and MDP
4.1 Markov Process (MP)
4.2 Markov Reward Process (MRP)
Return $G_t$
$G_t$: the total discounted reward accumulated from time step $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma$ expresses how much we value immediate reward relative to future reward:
- $\gamma$ close to 0 means we prefer near-term reward
- $\gamma$ close to 1 means we also care about long-term reward
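As a concrete illustration, here is a minimal sketch of computing a (finite-horizon) return in Python; the function name and the sample rewards are my own, not from the slides:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards,
    where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Work backwards so each earlier reward picks up one more factor of gamma
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards 1, 0, 2 with gamma = 0.9  ->  1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```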
Value Function $v(s)$
$v(s)$ is the state value function. For an MRP, $v(s)$ gives the expected return starting from state $s$:
$$v(s) = E[G_t \mid s_t = s]$$
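A rough sketch of estimating $v(s)$ by Monte Carlo sampling from a small MRP; the 2-state transition matrix `P`, reward vector `R`, and truncation horizon are toy assumptions of mine, not from the slides:

```python
import numpy as np

def mc_value_estimate(P, R, gamma, start_state, n_episodes=3000, horizon=100, seed=0):
    """Estimate v(s) = E[G_t | s_t = s] by averaging sampled (truncated) returns."""
    rng = np.random.default_rng(seed)
    n_states = len(R)
    returns = []
    for _ in range(n_episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]              # R[s] plays the role of R_{t+1} given s_t = s
            discount *= gamma
            s = rng.choice(n_states, p=P[s])  # sample the next state from row s of P
        returns.append(g)
    return float(np.mean(returns))

# Toy 2-state MRP
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, -1.0])
print(mc_value_estimate(P, R, gamma=0.9, start_state=0))
```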
Bellman Equation for MRPs
$v(s)$ can be decomposed into two parts:
- the immediate reward $R_{t+1}$
- the discounted value of the successor state: $\gamma v(s_{t+1})$
Accordingly, $v(s)$ can be expanded as follows:
$$
\begin{aligned}
v(s) &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s_t = s] \\
&= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid s_t = s] \\
&= E[R_{t+1} + \gamma G_{t+1} \mid s_t = s] \\
&= E[R_{t+1} + \gamma v(s_{t+1}) \mid s_t = s]
\end{aligned}
$$
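To see the equation in action: stacking it over all states gives the matrix form $v = R + \gamma P v$, which for a small MRP can be solved directly as $v = (I - \gamma P)^{-1} R$. A sketch using the same toy MRP as above:

```python
import numpy as np

# Same illustrative 2-state MRP as in the Monte Carlo sketch above
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, -1.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P @ v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # exact values; the Monte Carlo estimate above should be close to v[0]
```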
4.3 Markov Decision Process (MDP)
Policies
A policy is a mapping from states $s$ to actions $a$.
State-Value Function
The state-value function $v_\pi(s)$ of an MDP is the expected return when starting from state $s$ and then following policy $\pi$:
$$v_\pi(s) = E_\pi[G_t \mid S_t = s]$$
Action-Value Function
The action-value function $q_\pi(s,a)$ of an MDP is the expected return when starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q_\pi(s,a) = E_\pi[G_t \mid S_t = s, A_t = a]$$
Bellman Equation for MDPs
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$q_\pi(s,a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
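A sketch of putting both Bellman expectation equations to work via iterative policy evaluation; the toy MDP (transition tensor `P[s, a, s']`, reward table `R[s, a]`) and the uniform random policy are illustrative assumptions, not from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, n_iters=500):
    """Iterate the Bellman expectation equations for a fixed policy pi(a|s)."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        # v(s) = sum_a pi(a|s) q(s,a)
        v = np.einsum("sa,sa->s", pi, q)
    return v, q

# Toy MDP: 2 states, 2 actions
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a]
pi = np.full((2, 2), 0.5)                   # uniform random policy pi(a|s)
v_pi, q_pi = policy_evaluation(P, R, pi, gamma=0.9)
print(v_pi, q_pi, sep="\n")
```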