Chapter 5 Monte Carlo Methods

Unlike previous chapters, here we do not assume complete knowledge of the environment.

No perfect model is required; only experience is needed, organized into episodes, where an episode is a complete sequence of states, actions, and rewards from start to termination. A defining feature of Monte Carlo methods is that they use the entire sequence: the method can only be applied once an episode has ended and the full sequence of returns is available.
Monte Carlo methods can therefore be incremental episode-by-episode, but not step-by-step (online).
Monte Carlo methods here are based on averaging complete returns, and the problems they handle also involve nonstationarity.

5.1 Monte Carlo Prediction

First, consider using Monte Carlo methods to learn the state-value function under a given policy, which is analogous to Policy Evaluation (Prediction). The underlying principle is the law of large numbers, which is the foundation of all Monte Carlo methods.

  • First-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following first visits to s.
    • To evaluate state s
    • The first time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
  • Every-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following every visit to s.
    • To evaluate state s
    • Every time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

A "visit" to s here means an occurrence of state s within an episode.

First-visit MC prediction
Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits to s goes to infinity.
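As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction; the episode format (a list of (state, reward) pairs, with reward being the reward received on leaving that state) and all names are assumptions for illustration, not the book's pseudocode verbatim.

```python
# Minimal sketch of first-visit MC prediction (assumed episode format:
# a list of (state, reward) pairs, where reward is the reward received
# after leaving that state; all names here are illustrative).
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # S(s): total return following first visits to s
    returns_count = defaultdict(int)   # N(s): number of first visits to s
    V = defaultdict(float)             # V(s) = S(s) / N(s)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Work backwards so G accumulates the return following each time step.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:            # first visit to s in this episode
                returns_count[s] += 1          # N(s) <- N(s) + 1
                returns_sum[s] += G            # S(s) <- S(s) + G_t
                V[s] = returns_sum[s] / returns_count[s]
    return V
```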

5.3 Monte Carlo Control

Monte Carlo control follows the pattern of generalized policy iteration: action values are estimated by Monte Carlo evaluation, and the policy is improved by making it greedy with respect to the current action-value function,

$$\pi(s) \doteq \arg\max_a q(s, a).$$

Applying the policy improvement theorem to $\pi_k$ and $\pi_{k+1}$:

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s, a)\big) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s).$$

Monte Carlo ES
Exploring starts means that each episode begins in a state–action pair chosen so that every pair has a nonzero probability of being selected as the start, which guarantees that all state–action pairs are visited in the limit.
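A minimal sketch of the Monte Carlo ES idea, assuming a hypothetical generate_episode(state, action, policy) that returns (state, action, reward) triples for an episode started from the given pair; all names are illustrative.

```python
# Minimal sketch of Monte Carlo ES (exploring starts); generate_episode and
# the (state, action, reward) episode format are assumptions for illustration.
import random
from collections import defaultdict

def monte_carlo_es(states, actions, generate_episode, num_episodes, gamma=1.0):
    Q = defaultdict(float)                      # Q(s, a)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}    # deterministic pi(s)
    for _ in range(num_episodes):
        # Exploring start: every state-action pair has nonzero start probability.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)          # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        visited = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:                   # first visit to (s, a)
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Greedy policy improvement at s
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```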

5.4 Monte Carlo Control without Exploring Starts

There are two ways to avoid the need for exploring starts:

  • On-policy learning
    • "Learn on the job"
    • Learn about policy π from experience sampled from π
    • On-policy: the policy being updated is the same as the policy that generates the samples
  • Off-policy learning
    • "Look over someone's shoulder"
    • Learn about policy π from experience sampled from μ
    • Off-policy: the policy being updated is different from the policy that generates the samples

The definitions of on-policy and off-policy, and the relationship between them, are central to the approximation methods in later chapters.

On-policy first-visit MC control
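A minimal sketch of on-policy first-visit MC control with an ε-greedy policy; generate_episode and all names are illustrative assumptions.

```python
# Minimal sketch of on-policy first-visit MC control with an epsilon-greedy
# policy; generate_episode and all names are illustrative assumptions.
import random
from collections import defaultdict

def on_policy_first_visit_mc_control(actions, generate_episode,
                                     num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def behave(s):
        # epsilon-greedy w.r.t. the current Q: explore with probability epsilon,
        # otherwise act greedily (policy improvement is implicit in this closure).
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode(behave)          # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        visited = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:           # first visit to (s, a)
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```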

5.5 Off-policy Prediction via Importance Sampling

Off-policy methods have higher variance and converge more slowly.

The on-policy approach is actually a compromise: it learns a near-optimal policy that still explores.
The off-policy approach is more direct: it uses two policies, one that is learned about and becomes the optimal policy, and another, more exploratory one, that is used to generate behavior.

The policy being learned about is called the target policy, here $\pi$; the policy used to generate behavior is called the behavior policy, here $b$.
   In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

Because the behavior policy should be more stochastic and more exploratory, it can be, for example, an ε-greedy policy.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a starting state $S_t$, the probability of the subsequent state–action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \pi(A_t|S_t)\,p(S_{t+1}|S_t, A_t)\,\pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1}, A_{T-1}) = \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k),$$

Note the notion of a trajectory; Monte Carlo tree search later builds on this concept.

The importance-sampling ratio is then

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

Applying the importance-sampling ratio: given only returns $G_t$ obtained under the behavior policy, we want the expected returns (values) under the target policy:

$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$

In particular, we can define the set of all time steps in which state s is visited, denoted $J(s)$. This is for an every-visit method; for a first-visit method, $J(s)$ would only include time steps that were first visits to s within their episodes.

Ordinary importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T-1} G_t}{|J(s)|}$$

Weighted importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T-1} G_t}{\sum_{t \in J(s)} \rho_{t:T-1}}$$
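The two estimators differ only in the denominator. A small sketch (illustrative names) makes the contrast explicit.

```python
# Ordinary vs. weighted importance-sampling estimates of V(s), given one
# (G_t, rho_{t:T-1}) pair per visit t in J(s); all names are illustrative.
def importance_sampling_estimates(returns_and_ratios):
    if not returns_and_ratios:
        return 0.0, 0.0
    weighted_returns = [rho * g for g, rho in returns_and_ratios]
    rho_sum = sum(rho for _, rho in returns_and_ratios)
    ordinary = sum(weighted_returns) / len(returns_and_ratios)      # divide by |J(s)|
    weighted = sum(weighted_returns) / rho_sum if rho_sum else 0.0  # divide by sum of ratios
    return ordinary, weighted
```

Ordinary importance sampling is unbiased but its variance can be unbounded; weighted importance sampling is biased (the bias vanishes asymptotically) but typically has much lower variance, which is why it is usually preferred in practice.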

5.6 Incremental Implementation

Let $W_k = \rho_{t_k:T(t_k)-1}$ be the importance-sampling weight corresponding to the $k$-th return $G_k$ observed from the state of interest. Then

$$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2.$$

Writing this weighted average as an incremental update:

$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\big[G_n - V_n\big], \qquad n \ge 1,$$

$$C_{n+1} \doteq C_n + W_{n+1}, \qquad C_0 \doteq 0.$$

Off-policy MC prediction
This is simply the incremental implementation of weighted importance sampling described above; the algorithm makes explicit the relationship between the incremental implementation and importance sampling.
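A minimal sketch of these incremental updates applied to estimating $v_\pi$ from episodes generated by b, processing each episode backwards; pi and b are assumed to be callables giving action probabilities, and the episode format is illustrative.

```python
# Minimal sketch of incremental off-policy MC prediction of v_pi with weighted
# importance sampling, following the V and C updates above; names illustrative.
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """episodes: lists of (S_t, A_t, R_{t+1}) triples generated by b.
    pi(a, s), b(a, s): action probabilities of the target and behavior policies."""
    V = defaultdict(float)   # V(s), the weighted-IS estimate
    C = defaultdict(float)   # C(s), cumulative sum of the weights W
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):    # process the episode backwards
            G = gamma * G + r                # return following time t
            W *= pi(a, s) / b(a, s)          # importance-sampling ratio rho_{t:T-1}
            if W == 0:                       # earlier weights would all be zero too
                break
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])  # V <- V + (W / C) * (G - V)
    return V
```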

5.7 Off-policy Monte Carlo Control

Off-policy MC control
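A minimal sketch of the off-policy MC control idea with weighted importance sampling: the target policy is greedy with respect to Q while a soft behavior policy generates the episodes; generate_episode, b_prob, and all names are illustrative assumptions.

```python
# Minimal sketch of off-policy MC control with weighted importance sampling;
# the target policy is greedy w.r.t. Q, episodes come from a soft behavior
# policy b. generate_episode, b_prob, and all names are illustrative.
from collections import defaultdict

def off_policy_mc_control(actions, generate_episode, b_prob,
                          num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a)
    C = defaultdict(float)   # C(s, a): cumulative sum of the weights
    target = {}              # deterministic greedy target policy pi(s)
    for _ in range(num_episodes):
        episode = generate_episode()              # [(S_t, A_t, R_{t+1}), ...] from b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != target[s]:
                break                             # pi(a|s) = 0, so the ratio becomes 0
            W *= 1.0 / b_prob(a, s)               # pi(a|s) = 1 for the greedy action
    return target, Q
```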

5.8 *Discounting-aware Importance Sampling

This takes into account the internal structure of the return as a sum of discounted rewards, which can reduce variance.

The essence of the idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.

$$\bar G_{t:h} \doteq R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T,$$

The conventional full return $G_t$ can be viewed as a sum of flat partial returns:

$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\ &= (1-\gamma) R_{t+1} \\ &\quad + (1-\gamma)\gamma \,(R_{t+1} + R_{t+2}) \\ &\quad + (1-\gamma)\gamma^2 \,(R_{t+1} + R_{t+2} + R_{t+3}) \\ &\qquad \vdots \\ &\quad + (1-\gamma)\gamma^{T-t-2} \,(R_{t+1} + R_{t+2} + \cdots + R_{T-1}) \\ &\quad + \gamma^{T-t-1} \,(R_{t+1} + R_{t+2} + \cdots + R_T) \\ &= (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar G_{t:h} + \gamma^{T-t-1} \bar G_{t:T} \end{aligned}$$

Then we have the ordinary discounting-aware importance-sampling estimator

$$V(s) \doteq \frac{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)} \Big)}{|J(s)|}$$

and the weighted discounting-aware importance-sampling estimator

$$V(s) \doteq \frac{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)} \Big)}{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \Big)}$$
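A sketch (illustrative names, per-visit inputs assumed) of the per-visit numerator term shared by both discounting-aware estimators:

```python
# Sketch of the per-visit numerator term shared by both discounting-aware
# estimators above.  For a visit at time t: rewards[i] = R_{t+1+i} and
# step_ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}); names illustrative.
def discounting_aware_term(rewards, step_ratios, gamma):
    n = len(rewards)          # n = T(t) - t, the number of remaining steps
    flat, rho, term = 0.0, 1.0, 0.0
    for i in range(n):        # i = h - t - 1, so h runs from t+1 to T(t)
        flat += rewards[i]    # flat partial return G-bar_{t:h}
        rho *= step_ratios[i] # importance ratio rho_{t:h-1}
        if i < n - 1:         # interior horizons h < T(t) get the (1 - gamma) weight
            term += (1 - gamma) * (gamma ** i) * rho * flat
        else:                 # the final horizon h = T(t)
            term += (gamma ** i) * rho * flat
    return term
```

The ordinary estimator averages these terms over all t in $J(s)$; the weighted estimator divides their sum by the sum of the same expressions with each $\bar G$ replaced by 1.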

5.9 *Per-decision Importance Sampling

There is a second way in which the structure of the return, as a sum of rewards, can be taken into account in off-policy importance sampling; it too can reduce variance.

$$\rho_{t:T-1} G_t = \rho_{t:T-1}\big(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\big) = \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

The first term above can be written as

$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})} \frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})} \cdots \frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})} R_{t+1}$$

Of the factors above, only the first and the last (the reward) are correlated; the other factors are independent random variables whose expected value is 1:

$$\mathbb{E}\!\left[\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\right] \doteq \sum_a b(a|S_k)\,\frac{\pi(a|S_k)}{b(a|S_k)} = \sum_a \pi(a|S_k) = 1$$

So of all the ratio factors, only the first one survives in expectation, and

$$\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}]$$

Repeating this analysis for each of the terms gives

$$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\tilde G_t],$$

where

$$\tilde G_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T.$$

We call this idea per-decision importance sampling.

An ordinary-importance-sampling estimator using $\tilde G_t$ is

$$V(s) \doteq \frac{\sum_{t \in J(s)} \tilde G_t}{|J(s)|}$$
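A small sketch computing $\tilde G_t$ for one visit, from the rewards and per-step ratios following time t (all names are illustrative):

```python
# Sketch of the per-decision return G-tilde_t for one visit at time t:
# rewards[i] = R_{t+1+i}, step_ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}).
def per_decision_return(rewards, step_ratios, gamma):
    G, rho = 0.0, 1.0
    for i, (r, ratio) in enumerate(zip(rewards, step_ratios)):
        rho *= ratio                  # rho_{t:t+i}
        G += (gamma ** i) * rho * r   # gamma^i * rho_{t:t+i} * R_{t+1+i}
    return G
```

The estimator then simply averages per_decision_return over all visits in $J(s)$.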
