Chapter 5 Monte Carlo Methods

Unlike previous chapters, here we do not assume complete knowledge of the environment.

No perfect model is required; only experience is needed, organized into episodes, where an episode is a complete sequence of states, actions, and rewards from start to termination. A defining feature of Monte Carlo methods is that they use the entire sequence: the method can only be applied once an episode has ended and the full sequence of returns is available.
Monte Carlo methods can therefore be incremental episode-by-episode, but not step-by-step (online).
Monte Carlo methods here are based on averaging complete returns, and the problems they handle also involve nonstationarity.

5.1 Monte Carlo Prediction

First, consider using Monte Carlo methods to learn the state-value function under a given policy, which is analogous to Policy Evaluation (Prediction). The underlying principle is the law of large numbers, which is the foundation of all Monte Carlo methods.

  • First-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following first visits to s.
    • To evaluate state s
    • The first time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
  • Every-Visit Monte-Carlo Policy Evaluation: estimate $v_\pi(s)$ as the average of the returns following every visit to s.
    • To evaluate state s
    • Every time-step t that state s is visited in an episode,
    • Increment counter $N(s) \leftarrow N(s) + 1$
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by mean return $V(s) = S(s)/N(s)$
    • Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

A "visit" to s here means an occurrence of state s within an episode.

First-visit MC prediction
Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits to s goes to infinity.
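As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction; the episode format (a list of (state, reward) pairs, with reward being the reward received on leaving that state) and all names are assumptions for illustration, not the book's pseudocode verbatim.

```python
# Minimal sketch of first-visit MC prediction (assumed episode format:
# a list of (state, reward) pairs, where reward is the reward received
# after leaving that state; all names here are illustrative).
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # S(s): total return following first visits to s
    returns_count = defaultdict(int)   # N(s): number of first visits to s
    V = defaultdict(float)             # V(s) = S(s) / N(s)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Work backwards so G accumulates the return following each time step.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:            # first visit to s in this episode
                returns_count[s] += 1          # N(s) <- N(s) + 1
                returns_sum[s] += G            # S(s) <- S(s) + G_t
                V[s] = returns_sum[s] / returns_count[s]
    return V
```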

5.3 Monte Carlo Control

Monte Carlo control follows the pattern of generalized policy iteration: action values are estimated by Monte Carlo evaluation, and the policy is improved by making it greedy with respect to the current action-value function,

$$\pi(s) \doteq \arg\max_a q(s, a).$$

Applying the policy improvement theorem to $\pi_k$ and $\pi_{k+1}$:

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s, a)\big) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s).$$

Monte Carlo ES
Exploring starts means that each episode begins in a state–action pair chosen so that every pair has a nonzero probability of being selected as the start, which guarantees that all state–action pairs are visited in the limit.
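A minimal sketch of the Monte Carlo ES idea, assuming a hypothetical generate_episode(state, action, policy) that returns (state, action, reward) triples for an episode started from the given pair; all names are illustrative.

```python
# Minimal sketch of Monte Carlo ES (exploring starts); generate_episode and
# the (state, action, reward) episode format are assumptions for illustration.
import random
from collections import defaultdict

def monte_carlo_es(states, actions, generate_episode, num_episodes, gamma=1.0):
    Q = defaultdict(float)                      # Q(s, a)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}    # deterministic pi(s)
    for _ in range(num_episodes):
        # Exploring start: every state-action pair has nonzero start probability.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)          # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        visited = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:                   # first visit to (s, a)
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Greedy policy improvement at s
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```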

5.4 Monte Carlo Control without Exploring Starts

There are two ways to avoid the need for exploring starts:

  • On-policy learning
    • "Learn on the job"
    • Learn about policy π from experience sampled from π
    • On-policy: the policy being updated is the same as the policy that generates the samples
  • Off-policy learning
    • "Look over someone's shoulder"
    • Learn about policy π from experience sampled from μ
    • Off-policy: the policy being updated is different from the policy that generates the samples

The definitions of on-policy and off-policy, and the relationship between them, are central to the approximation methods in later chapters.

On-policy first-visit MC control
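A minimal sketch of on-policy first-visit MC control with an ε-greedy policy; generate_episode and all names are illustrative assumptions.

```python
# Minimal sketch of on-policy first-visit MC control with an epsilon-greedy
# policy; generate_episode and all names are illustrative assumptions.
import random
from collections import defaultdict

def on_policy_first_visit_mc_control(actions, generate_episode,
                                     num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def behave(s):
        # epsilon-greedy w.r.t. the current Q: explore with probability epsilon,
        # otherwise act greedily (policy improvement is implicit in this closure).
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode(behave)          # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        visited = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:           # first visit to (s, a)
                returns_count[(s, a)] += 1
                returns_sum[(s, a)] += G
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```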

5.5 Off-policy Prediction via Importance Sampling

Off-policy methods have higher variance and converge more slowly.

The on-policy approach is actually a compromise: it learns a near-optimal policy that still explores.
The off-policy approach is more direct: it uses two policies, one that is learned about and becomes the optimal policy, and another, more exploratory one, that is used to generate behavior.

The policy being learned about is called the target policy, here $\pi$; the policy used to generate behavior is called the behavior policy, here $b$.
   In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

Because the behavior policy should be more stochastic and more exploratory, it can be, for example, an ε-greedy policy.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a starting state $S_t$, the probability of the subsequent state–action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \pi(A_t|S_t)\,p(S_{t+1}|S_t, A_t)\,\pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1}, A_{T-1}) = \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k),$$

Note the notion of a trajectory; Monte Carlo tree search later builds on this concept.

The importance-sampling ratio is then

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

Applying the importance-sampling ratio: given only returns $G_t$ obtained under the behavior policy, we want the expected returns (values) under the target policy:

$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$

In particular, we can define the set of all time steps in which state s is visited, denoted $J(s)$. This is for an every-visit method; for a first-visit method, $J(s)$ would only include time steps that were first visits to s within their episodes.

Ordinary importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T-1} G_t}{|J(s)|}$$

Weighted importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T-1} G_t}{\sum_{t \in J(s)} \rho_{t:T-1}}$$
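The two estimators differ only in the denominator. A small sketch (illustrative names) makes the contrast explicit.

```python
# Ordinary vs. weighted importance-sampling estimates of V(s), given one
# (G_t, rho_{t:T-1}) pair per visit t in J(s); all names are illustrative.
def importance_sampling_estimates(returns_and_ratios):
    if not returns_and_ratios:
        return 0.0, 0.0
    weighted_returns = [rho * g for g, rho in returns_and_ratios]
    rho_sum = sum(rho for _, rho in returns_and_ratios)
    ordinary = sum(weighted_returns) / len(returns_and_ratios)      # divide by |J(s)|
    weighted = sum(weighted_returns) / rho_sum if rho_sum else 0.0  # divide by sum of ratios
    return ordinary, weighted
```

Ordinary importance sampling is unbiased but its variance can be unbounded; weighted importance sampling is biased (the bias vanishes asymptotically) but typically has much lower variance, which is why it is usually preferred in practice.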

5.6 Incremental Implementation

Let $W_k = \rho_{t_k:T(t_k)-1}$ be the importance-sampling weight corresponding to the $k$-th return $G_k$ observed from the state of interest. Then

$$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2.$$

Writing this weighted average as an incremental update:

$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\big[G_n - V_n\big], \qquad n \ge 1,$$

$$C_{n+1} \doteq C_n + W_{n+1}, \qquad C_0 \doteq 0.$$

Off-policy MC prediction
This is simply the incremental implementation of weighted importance sampling described above; the algorithm makes explicit the relationship between the incremental implementation and importance sampling.
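A minimal sketch of these incremental updates applied to estimating $v_\pi$ from episodes generated by b, processing each episode backwards; pi and b are assumed to be callables giving action probabilities, and the episode format is illustrative.

```python
# Minimal sketch of incremental off-policy MC prediction of v_pi with weighted
# importance sampling, following the V and C updates above; names illustrative.
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """episodes: lists of (S_t, A_t, R_{t+1}) triples generated by b.
    pi(a, s), b(a, s): action probabilities of the target and behavior policies."""
    V = defaultdict(float)   # V(s), the weighted-IS estimate
    C = defaultdict(float)   # C(s), cumulative sum of the weights W
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):    # process the episode backwards
            G = gamma * G + r                # return following time t
            W *= pi(a, s) / b(a, s)          # importance-sampling ratio rho_{t:T-1}
            if W == 0:                       # earlier weights would all be zero too
                break
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])  # V <- V + (W / C) * (G - V)
    return V
```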

5.7 Off-policy Monte Carlo Control

Off-policy MC control
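A minimal sketch of the off-policy MC control idea with weighted importance sampling: the target policy is greedy with respect to Q while a soft behavior policy generates the episodes; generate_episode, b_prob, and all names are illustrative assumptions.

```python
# Minimal sketch of off-policy MC control with weighted importance sampling;
# the target policy is greedy w.r.t. Q, episodes come from a soft behavior
# policy b. generate_episode, b_prob, and all names are illustrative.
from collections import defaultdict

def off_policy_mc_control(actions, generate_episode, b_prob,
                          num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a)
    C = defaultdict(float)   # C(s, a): cumulative sum of the weights
    target = {}              # deterministic greedy target policy pi(s)
    for _ in range(num_episodes):
        episode = generate_episode()              # [(S_t, A_t, R_{t+1}), ...] from b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != target[s]:
                break                             # pi(a|s) = 0, so the ratio becomes 0
            W *= 1.0 / b_prob(a, s)               # pi(a|s) = 1 for the greedy action
    return target, Q
```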

5.8 *Discounting-aware Importance Sampling

This takes into account the internal structure of the return as a sum of discounted rewards, which can reduce variance.

The essence of the idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.

$$\bar G_{t:h} \doteq R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T,$$

The conventional full return $G_t$ can be viewed as a sum of flat partial returns:

$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\ &= (1-\gamma) R_{t+1} \\ &\quad + (1-\gamma)\gamma \,(R_{t+1} + R_{t+2}) \\ &\quad + (1-\gamma)\gamma^2 \,(R_{t+1} + R_{t+2} + R_{t+3}) \\ &\qquad \vdots \\ &\quad + (1-\gamma)\gamma^{T-t-2} \,(R_{t+1} + R_{t+2} + \cdots + R_{T-1}) \\ &\quad + \gamma^{T-t-1} \,(R_{t+1} + R_{t+2} + \cdots + R_T) \\ &= (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar G_{t:h} + \gamma^{T-t-1} \bar G_{t:T} \end{aligned}$$

Then we have the ordinary discounting-aware importance-sampling estimator

$$V(s) \doteq \frac{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)} \Big)}{|J(s)|}$$

and the weighted discounting-aware importance-sampling estimator

$$V(s) \doteq \frac{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar G_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar G_{t:T(t)} \Big)}{\sum_{t \in J(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \Big)}$$
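A sketch (illustrative names, per-visit inputs assumed) of the per-visit numerator term shared by both discounting-aware estimators:

```python
# Sketch of the per-visit numerator term shared by both discounting-aware
# estimators above.  For a visit at time t: rewards[i] = R_{t+1+i} and
# step_ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}); names illustrative.
def discounting_aware_term(rewards, step_ratios, gamma):
    n = len(rewards)          # n = T(t) - t, the number of remaining steps
    flat, rho, term = 0.0, 1.0, 0.0
    for i in range(n):        # i = h - t - 1, so h runs from t+1 to T(t)
        flat += rewards[i]    # flat partial return G-bar_{t:h}
        rho *= step_ratios[i] # importance ratio rho_{t:h-1}
        if i < n - 1:         # interior horizons h < T(t) get the (1 - gamma) weight
            term += (1 - gamma) * (gamma ** i) * rho * flat
        else:                 # the final horizon h = T(t)
            term += (gamma ** i) * rho * flat
    return term
```

The ordinary estimator averages these terms over all t in $J(s)$; the weighted estimator divides their sum by the sum of the same expressions with each $\bar G$ replaced by 1.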

5.9 *Per-decision Importance Sampling

There is a second way in which the structure of the return, as a sum of rewards, can be taken into account in off-policy importance sampling; it too can reduce variance.

$$\rho_{t:T-1} G_t = \rho_{t:T-1}\big(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\big) = \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

The first term above can be written as

$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})} \frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})} \cdots \frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})} R_{t+1}$$

Of the factors above, only the first and the last (the reward) are correlated; the other factors are independent random variables whose expected value is 1:

$$\mathbb{E}\!\left[\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\right] \doteq \sum_a b(a|S_k)\,\frac{\pi(a|S_k)}{b(a|S_k)} = \sum_a \pi(a|S_k) = 1$$

So of all the ratio factors, only the first one survives in expectation, and

$$\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}]$$

Repeating this analysis for each of the terms gives

$$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\tilde G_t],$$

where

$$\tilde G_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T.$$

We call this idea per-decision importance sampling.

An ordinary-importance-sampling estimator using $\tilde G_t$ is

$$V(s) \doteq \frac{\sum_{t \in J(s)} \tilde G_t}{|J(s)|}$$
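A small sketch computing $\tilde G_t$ for one visit, from the rewards and per-step ratios following time t (all names are illustrative):

```python
# Sketch of the per-decision return G-tilde_t for one visit at time t:
# rewards[i] = R_{t+1+i}, step_ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}).
def per_decision_return(rewards, step_ratios, gamma):
    G, rho = 0.0, 1.0
    for i, (r, ratio) in enumerate(zip(rewards, step_ratios)):
        rho *= ratio                  # rho_{t:t+i}
        G += (gamma ** i) * rho * r   # gamma^i * rho_{t:t+i} * R_{t+1+i}
    return G
```

The estimator then simply averages per_decision_return over all visits in $J(s)$.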
