In most reinforcement learning (RL) problems the environment's model is unknown, so dynamic programming cannot be applied directly. One approach is to learn the MDP itself; the earlier post in this series on understanding policy iteration and value iteration describes that idea in detail. But learning a model raises sampling problems of its own: with purely random sampling, some states are very hard to ever reach, while sampling under a fixed policy makes it hard to recover the true transition probabilities. Once the learned model is wrong, both value iteration and policy iteration break down. This motivates Model-free Reinforcement Learning: interact with the environment and learn directly from data, without building a model.
Model-free Reinforcement Learning
Model-free reinforcement learning first estimates from data the value of states (or state-action pairs), i.e., the expected cumulative reward. With those estimates in hand, it then performs model-free control: optimizing the current policy so as to maximize the value function.
So how does model-free prediction of the value function work?
In model-free RL we have no access to the state transitions or the reward function; all we have is a set of episodes. Previously we used such episodes to learn a model; in model-free methods we instead use them to learn the value function or the policy directly, with no need to learn the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.
Value Function Estimate
In model-based RL (MDP), the value function is computed by dynamic programming:
$v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s]$
In model-free methods we do not know the state transitions, so the expectation in this equation cannot be computed directly.
Monte-Carlo Methods
Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. A classic example is estimating the area of a circle, as shown in the following figure: scatter points uniformly over the bounding square, then compute
$\text{Circle Area} = \text{Square Area} \times \frac{\text{points in circle}}{\text{points in total}}$
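As a minimal sketch of this idea, the following Python snippet estimates a circle's area by uniform sampling over its bounding square (the function name and parameters are illustrative, not from the original post):

```python
import random

def estimate_circle_area(radius=1.0, n_points=100_000, seed=0):
    """Estimate a circle's area by uniform sampling in its bounding square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2 * radius) ** 2
    # circle area = square area * (points in circle / points in total)
    return square_area * inside / n_points

area = estimate_circle_area()
print(area)  # close to pi for a unit circle
```

With 100,000 points the estimate is typically within a few hundredths of the true value; the error shrinks as the number of samples grows.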
Monte-Carlo Value Estimation
We have many episodes; from each episode we can compute the total discounted reward:
$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$
The value function is the expected return, which can be approximated by the sample mean:
$V^{\pi}(s) = \mathbb{E}[G_{t} \mid s_{t}=s, \pi] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)}$
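The discounted return above can be computed from a list of sampled rewards with a simple backward sweep. This is an illustrative sketch, not code from the original post:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # sweep backwards so each earlier reward picks up one more discount factor
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging `discounted_return` over N sampled episodes starting from state s gives the Monte-Carlo estimate of V(s).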
This method can be summarized in two steps: 1. use policy $\pi$ to sample $N$ episodes starting from state $s$; 2. compute the average cumulative reward. Notice that this sampling-based method computes the value directly, with no need for the MDP's state transition probabilities.
More concretely, this idea can be written as the following procedure:
- Sample episodes under policy $\pi$
- Every time-step $t$ that state $s$ is visited in an episode:
  - Increment counter: $N(s) \leftarrow N(s) + 1$
  - Increment total return: $S(s) \leftarrow S(s) + G_{t}$
- Value is estimated by mean return: $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow V^{\pi}(s)$ as $N(s) \rightarrow \infty$
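The counting procedure above can be sketched as an every-visit Monte-Carlo estimator. The episode format and names here are assumptions for illustration:

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=1.0):
    """Every-visit Monte-Carlo prediction.

    episodes: list of trajectories, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    n = defaultdict(int)        # N(s): visit counts
    total = defaultdict(float)  # S(s): accumulated returns
    for episode in episodes:
        g = 0.0
        # backward sweep: g becomes the return G_t at each visited state
        for state, reward in reversed(episode):
            g = reward + gamma * g
            n[state] += 1
            total[state] += g
    return {s: total[s] / n[s] for s in n}  # V(s) = S(s) / N(s)

# Two tiny hypothetical episodes over states 'a' and 'b'
episodes = [[('a', 1.0), ('b', 0.0)], [('a', 0.0), ('b', 1.0)]]
print(mc_value_estimate(episodes))
```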
Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_{t}$ with cumulative return $G_{t}$:
$N(S_{t}) \leftarrow N(S_{t}) + 1$
$V(S_{t}) \leftarrow V(S_{t}) + \frac{1}{N(S_{t})}\left(G_{t} - V(S_{t})\right)$
- For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
If the environment's state transitions and reward function keep changing over time, the environment is called non-stationary (the environment itself is of course stochastic); if the distribution stays fixed, it is called stationary. When the underlying distribution does change, we need to forget older episodes: a plain average weights old and new episodes equally, so it can never forget old data. Using a constant step size $\alpha$ does the forgetting:
$V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t}))$
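A minimal sketch of this incremental update, contrasting the exact running mean ($\alpha = 1/N$) with a constant step size that forgets old episodes (names are illustrative):

```python
def incremental_update(v, g, alpha=None, n=None):
    """One incremental MC update of V(S_t) toward the return G_t.

    With alpha = 1/N this reproduces the exact running mean; a constant
    alpha down-weights old episodes, which suits non-stationary problems.
    """
    if alpha is None:
        alpha = 1.0 / n  # exact running-mean step size
    return v + alpha * (g - v)

v = 0.0
for g in [1.0, 1.0, 1.0]:
    v = incremental_update(v, g, alpha=0.5)
print(v)  # 0.875: recent returns dominate the estimate
```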
Some properties of Monte-Carlo value estimation:
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping (discussed later)
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate
Temporal-Difference Learning
TD methods replace the unknown future return with an estimate of the next state's value:
$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots \approx R_{t+1}+\gamma V(S_{t+1})$
$V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t}))$
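A tabular TD(0) evaluation loop implementing this update might look like the following sketch (the episode format is an assumption for illustration):

```python
from collections import defaultdict

def td0_evaluate(episodes, alpha=0.1, gamma=1.0, n_sweeps=1):
    """Tabular TD(0) policy evaluation.

    episodes: trajectories of (state, reward, next_state) triples;
    next_state is None at termination, where V is taken to be 0.
    """
    v = defaultdict(float)
    for _ in range(n_sweeps):
        for episode in episodes:
            for s, r, s_next in episode:
                # TD target: R_{t+1} + gamma * V(S_{t+1})
                target = r + (gamma * v[s_next] if s_next is not None else 0.0)
                v[s] += alpha * (target - v[s])  # move V(S_t) toward the target
    return dict(v)

# A two-state chain a -> b -> terminal, with reward 1 on the final step
episodes = [[('a', 0.0, 'b'), ('b', 1.0, None)]]
print(td0_evaluate(episodes, alpha=0.5, n_sweeps=100))
```

Unlike MC, each update here uses only one transition, so learning can happen online, before the episode ends.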
TD methods have the following four main characteristics:
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
Monte Carlo vs. Temporal Difference
A comparison of Monte Carlo and Temporal Difference methods:
- The same goal: learn $V^{\pi}$ from episodes of experience under policy $\pi$
- Incremental every-visit Monte-Carlo:
  - Update value $V(S_{t})$ toward the actual return $G_{t}$:
    $V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t}))$
- Simplest temporal-difference learning algorithm: TD(0):
  - Update value $V(S_{t})$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$
  - TD target: $R_{t+1} + \gamma V(S_{t+1})$
  - TD error: $\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- The return $G_{t}$ is an unbiased estimate of $V^{\pi}(S_{t})$: sampling under the current policy and averaging the returns yields an unbiased estimate.
- The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^{\pi}(S_{t})$: it contains the estimate $V(S_{t+1})$ of a future value. If that estimate were exact, the TD target would also be unbiased, but $V(S_{t+1})$ is hard to estimate exactly, so the TD target is biased.
- The TD target has much lower variance than the return: the return $G_{t}$ depends on many random actions, transitions, and rewards, whereas the TD target depends on only one random action, transition, and reward.
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias: MC methods have good convergence properties (even with function approximation), are not very sensitive to the initial value, and are very simple to understand and use, but they need many samples to drive the variance down.
- TD has low variance, some bias: TD methods are usually more efficient than MC. TD(0) converges to $V^{\pi}(s)$ (but not always with function approximation), and TD is more sensitive to the initial value than MC.
n-step model-free prediction
Due to time constraints, one may skip the n-step prediction section and head directly to model-free control.
- Define the n-step return:
  $G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})$
- n-step temporal-difference learning:
  $V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t}))$
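The n-step return can be sketched as follows. The indexing conventions are assumptions for illustration: `rewards[k]` holds $R_{k+1}$, and `values` includes the terminal state with value 0:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards: [R_1, ..., R_T] for an episode of length T.
    values:  [V(S_0), ..., V(S_T)] with V(S_T) = 0 at termination.
    Falls back to the full Monte-Carlo return when t + n runs past the end.
    """
    T = len(rewards)
    g = 0.0
    for k in range(min(n, T - t)):   # accumulate up to n discounted rewards
        g += (gamma ** k) * rewards[t + k]
    if t + n <= T:                   # bootstrap from V(S_{t+n}) if it exists
        g += (gamma ** n) * values[t + n]
    return g

# 2-step return from t=0 in a 3-step episode, then the update toward it
g = n_step_return([1.0, 1.0, 1.0], [0.0, 0.5, 0.5, 0.0], t=0, n=2, gamma=1.0)
alpha, v0 = 0.1, 0.0
v0 = v0 + alpha * (g - v0)  # V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))
```

Setting n = 1 recovers the TD(0) target, while letting n exceed the episode length recovers the Monte-Carlo return, so n interpolates between the two methods.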
With a value function estimate in hand, the next step is policy improvement.
My WeChat official account: 小小何先生
About the account: it mainly shares content on deep learning, game AI, reinforcement learning, and related topics. Follow along, and let's learn and improve together!