Unlike earlier chapters, here we do not assume complete knowledge of the environment.
No perfect model is required; experience alone is enough, organized into episodes. An episode is one complete sequence of states, actions, and rewards from start to termination. Monte Carlo methods are characterized by using the whole sequence: the return can only be computed, and the method applied, after an episode has finished and the full sequence is available.
Monte Carlo methods therefore learn episode by episode, not step by step (online).
Here, Monte Carlo methods estimate values by averaging complete returns, and the problem being handled is nonstationary (the returns depend on a policy that may itself be changing).
5.1 Monte Carlo Prediction
First, consider using Monte Carlo methods to learn the state-value function under a given policy, the same setting as policy evaluation (prediction). The underlying principle is the law of large numbers, which is also the foundation of all Monte Carlo methods.
- First-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following first visits to $s$.
- To evaluate state $s$:
- The first time-step $t$ that state $s$ is visited in an episode,
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
- Every-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following every visit to $s$.
- To evaluate state $s$:
- Every time-step $t$ that state $s$ is visited in an episode,
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
A "visit" to $s$ here means one occurrence of state $s$ within an episode.
Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits goes to infinity.
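A minimal Python sketch of the first-visit procedure above. The `generate_episode(policy)` function and the `(state, action, reward)` episode format are hypothetical interfaces assumed only for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) under `policy` by averaging returns after first visits.

    `generate_episode(policy)` is assumed to play one full episode and return
    a list of (state, action, reward) tuples -- a hypothetical interface.
    """
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)
    V = defaultdict(float)  # value estimates V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk the episode backwards so G is the return following step t.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:      # first-visit check
                N[s] += 1
                S[s] += G
                V[s] = S[s] / N[s]
    return V
```

Dropping the first-visit check (updating at every occurrence of $s$) turns this into every-visit MC.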
5.3 Monte Carlo Control
The policy improvement theorem is applied to $\pi_k$ and $\pi_{k+1}$: acting greedily with respect to $q_{\pi_k}$ yields a policy $\pi_{k+1}$ that is at least as good as $\pi_k$.
Exploring starts means every episode begins with a state-action pair chosen so that all pairs have a nonzero probability of being selected as the start; this guarantees that every state-action pair keeps being visited.
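A minimal sketch of Monte Carlo control with exploring starts, assuming hypothetical helpers `all_state_action_pairs` (a list of every $(s, a)$ pair) and `run_episode_from(s0, a0, policy)` (plays one episode after the forced start and returns `(state, action, reward)` tuples):

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(all_state_action_pairs, run_episode_from,
                                num_episodes, gamma=1.0):
    """Monte Carlo ES (sketch): evaluate Q, then act greedily with respect to it."""
    Q = defaultdict(float)
    N = defaultdict(int)
    policy = {}  # state -> greedy action

    for _ in range(num_episodes):
        s0, a0 = random.choice(all_state_action_pairs)  # exploring start
        episode = run_episode_from(s0, a0, policy)      # [(s, a, r), ...]
        visited = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:                 # first visit of (s, a)
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # incremental mean
                # Policy improvement: greedy action at s under the current Q.
                actions = [b for (x, b) in all_state_action_pairs if x == s]
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, policy
```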
5.4 Monte Carlo Control without Exploring Starts
There are two ways to avoid the need for exploring starts:
- On-policy learning
- "Learn on the job"
- Learn about policy $\pi$ from experience sampled from $\pi$
- On-policy: the policy being updated is the same policy that generates the samples
- Off-policy learning
- "Look over someone's shoulder"
- Learn about policy $\pi$ from experience sampled from $\mu$
- Off-policy: the policy being updated is different from the policy that generates the samples
The definitions of on-policy and off-policy, and the relationship between them, are central to the approximation methods discussed later.
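As a small illustration of the on-policy route, a sketch of $\varepsilon$-greedy action selection (the `Q` table and `actions` list are hypothetical arguments); keeping the policy soft in this way removes the need for exploring starts:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Every action keeps probability at least epsilon / len(actions), so the
    policy stays exploratory while it is being improved.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```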
5.5 Off-policy Prediction via Importance Sampling
Off-policy methods have higher variance and converge more slowly.
The on-policy approach is actually a compromise: it learns not the optimal policy itself but a near-optimal policy that still explores.
A more direct off-policy approach is to use two policies: one that is learned and becomes the optimal policy, and another, more exploratory one, that is used to generate behavior.
The policy being learned is called the target policy, here $\pi$; the policy used to generate behavior is called the behavior policy, here $b$.
In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
Because the behavior policy is more stochastic and more exploratory, it can, for example, be an $\varepsilon$-greedy (soft) policy.
Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state $S_t$, the probability of the subsequent state-action trajectory occurring under any policy $\pi$ is:
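In standard form (with $p$ denoting the environment's state-transition probabilities):

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$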
Note the concept of a trajectory; Monte Carlo tree search, discussed later, also relies on it.
The importance-sampling ratio is then:
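In standard form (note that the transition probabilities $p$ cancel, so the ratio depends only on the two policies):

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$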
Applying the importance ratio: given only returns $G_t$ obtained from the behavior policy, we want the expected returns (values) under the target policy.
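The key identity is that multiplying each return by its ratio corrects the expectation to the target policy:

$$\mathbb{E}\bigl[\rho_{t:T-1}\, G_t \mid S_t = s\bigr] = v_\pi(s)$$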
In particular, we can define the set of all time steps in which state $s$ is visited, denoted $\mathcal{J}(s)$. This is for an every-visit method; for a first-visit method, $\mathcal{J}(s)$ would only include time steps that were first visits to $s$ within their episodes.
Ordinary importance sampling:
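In standard form (with $T(t)$ denoting the termination time of the episode containing time $t$), the ordinary estimator averages the weighted returns over the number of visits:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}\, G_t}{\lvert \mathcal{J}(s) \rvert}$$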
Weighted importance sampling:
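The weighted estimator normalizes by the sum of the ratios instead, which is biased but has much lower variance:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}}$$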
5.6 Incremental Implementation
Let each return $G_k$ observed for the state have a corresponding weight $W_k$ (for example $W_k = \rho_{t_k:T(t_k)-1}$, where $t_k$ is the time of the $k$-th visit). Then
$$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2$$
To compute this weighted average incrementally, the update rules are:
$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\bigl[G_n - V_n\bigr], \qquad n \ge 1$$
and
$$C_{n+1} \doteq C_n + W_{n+1}, \qquad \text{with } C_0 \doteq 0.$$
This is just the incremental implementation of the weighted importance sampling described above; the point here is only the connection between the incremental implementation and importance sampling.
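A minimal Python sketch of that incremental update for a single state. The caller is assumed to supply `(G, W)` pairs, the return and its importance-sampling weight, one per visit:

```python
def weighted_is_incremental(returns_and_weights):
    """Incrementally compute the weighted importance-sampling estimate V(s).

    Implements C_n = C_{n-1} + W_n and V <- V + (W_n / C_n) * (G_n - V),
    so the running estimate equals sum(W * G) / sum(W) at every step.
    """
    V = 0.0  # current estimate
    C = 0.0  # cumulative sum of the weights, C_0 = 0
    for G, W in returns_and_weights:
        if W == 0.0:
            continue      # zero-weight returns leave the estimate unchanged
        C += W
        V += (W / C) * (G - V)
    return V
```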
5.7 Off-policy Monte Carlo Control
5.8 *Discounting-aware Importance Sampling
This takes the internal structure of the return, as a sum of discounted rewards, into account; doing so can reduce variance.
The essence of the idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.
$$\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T,$$
The conventional full return $G_t$ can be viewed as a sum of flat partial returns ("flat" denotes the absence of discounting, and "partial" that the return does not extend all the way to termination).
Then we have
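The decomposition referred to here, restated in the textbook's form, is

$$G_t = (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\, \bar{G}_{t:h} \;+\; \gamma^{\,T-t-1}\, \bar{G}_{t:T}$$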
ordinary importance-sampling estimator
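Restated in the textbook's form, this estimator truncates each ratio at the horizon of its flat partial return ($\rho_{t:h-1}$ rather than $\rho_{t:T-1}$):

$$V(s) \doteq \frac{\sum_{t\in\mathcal{J}(s)}\Bigl((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{\,h-t-1}\rho_{t:h-1}\bar{G}_{t:h} + \gamma^{\,T(t)-t-1}\rho_{t:T(t)-1}\bar{G}_{t:T(t)}\Bigr)}{\lvert\mathcal{J}(s)\rvert}$$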
weighted importance-sampling estimator
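And the corresponding weighted version, with the same terms in the denominator but without the returns:

$$V(s) \doteq \frac{\sum_{t\in\mathcal{J}(s)}\Bigl((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{\,h-t-1}\rho_{t:h-1}\bar{G}_{t:h} + \gamma^{\,T(t)-t-1}\rho_{t:T(t)-1}\bar{G}_{t:T(t)}\Bigr)}{\sum_{t\in\mathcal{J}(s)}\Bigl((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{\,h-t-1}\rho_{t:h-1} + \gamma^{\,T(t)-t-1}\rho_{t:T(t)-1}\Bigr)}$$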
5.9 *Per-decision Importance Sampling
There is a second way in which the structure of the return, as a sum of rewards, can be taken into account in off-policy importance sampling; it too can reduce variance.
The first sub-term of the weighted return $\rho_{t:T-1} G_t = \rho_{t:T-1}\bigl(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,T-t-1} R_T\bigr)$ can be written as
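namely (expanding the ratio into its per-step factors):

$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\,\frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})}\,\frac{\pi(A_{t+2} \mid S_{t+2})}{b(A_{t+2} \mid S_{t+2})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})}\, R_{t+1}$$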
Notice that of all these factors, only the first and the last (the reward) are correlated; all the other ratio factors are independent random variables whose expected value is 1:
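That expectation is easy to verify:

$$\mathbb{E}\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] \doteq \sum_a b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_a \pi(a \mid S_k) = 1$$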
Of all the ratio factors, only the first one survives in expectation, so
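In expectation, therefore,

$$\mathbb{E}\bigl[\rho_{t:T-1}\, R_{t+1}\bigr] = \mathbb{E}\bigl[\rho_{t:t}\, R_{t+1}\bigr]$$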
Repeating this analysis for the $k$-th sub-term gives
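$$\mathbb{E}\bigl[\rho_{t:T-1}\, R_{t+k}\bigr] = \mathbb{E}\bigl[\rho_{t:t+k-1}\, R_{t+k}\bigr],$$

and therefore the expectation of the full weighted return can be written as

$$\mathbb{E}\bigl[\rho_{t:T-1}\, G_t\bigr] = \mathbb{E}\bigl[\tilde{G}_t\bigr]$$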
where (in the textbook's notation, with $\tilde{G}_t$ the per-decision return)
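$$\tilde{G}_t \doteq \rho_{t:t} R_{t+1} + \gamma\, \rho_{t:t+1} R_{t+2} + \gamma^2\, \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{\,T-t-1}\, \rho_{t:T-1} R_T$$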
We call this idea per-decision importance sampling.
The ordinary-importance-sampling estimator using $\tilde{G}_t$ is then (restated in the textbook's form)
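$$V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \tilde{G}_t}{\lvert \mathcal{J}(s) \rvert}$$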