Importance Sampling in Reinforcement Learning - An Overview

Thanks to Sutton and Barto for their great work, Reinforcement Learning: An Introduction.

Almost all off-policy reinforcement learning methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply it by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability function. Thus, the relative probability of the trajectory under the target policy $\pi$ and the behavior policy $\mu$ (the importance-sampling ratio) is

$$\rho_t^{T} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}.$$
Note that the importance sampling ratio depends only on the two policies and not at all on the MDP.
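
Because the transition probabilities cancel, the ratio can be computed from the two policies alone. Here is a minimal sketch in Python (my own illustration, not code from the book): it assumes tabular policies stored as arrays indexed by `[state, action]`, and it assumes coverage, i.e. $\mu(a \mid s) > 0$ wherever $\pi(a \mid s) > 0$. The names `importance_sampling_ratio`, `target_policy`, and `behavior_policy` are hypothetical.

```python
import numpy as np

def importance_sampling_ratio(target_policy, behavior_policy, states, actions):
    """rho_t^T = prod_{k=t}^{T-1} pi(A_k|S_k) / mu(A_k|S_k) for one trajectory.

    `states` holds S_t, ..., S_{T-1} and `actions` holds the actions
    A_t, ..., A_{T-1} taken in them. The transition probabilities cancel,
    so only the two policies appear in the product.
    """
    ratio = 1.0
    for s, a in zip(states, actions):
        ratio *= target_policy[s, a] / behavior_policy[s, a]
    return ratio

# Example with 2 states and 2 actions (hypothetical numbers):
pi = np.array([[1.0, 0.0],     # target policy: deterministic in state 0
               [0.5, 0.5]])
mu = np.array([[0.5, 0.5],     # behavior policy: uniform everywhere
               [0.5, 0.5]])
rho = importance_sampling_ratio(pi, mu, states=[0, 1], actions=[0, 1])
# (1.0 / 0.5) * (0.5 / 0.5) = 2.0
```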

Define $J(s)$ as the set of all time steps in which state $s$ is visited. This is for an every-visit method; for a first-visit method, $J(s)$ would only include time steps that were first visits to $s$ within their episodes. Let $T(t)$ denote the first time of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$.

To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:

$$V(s) = \frac{\sum_{t \in J(s)} \rho_t^{T(t)} G_t}{|J(s)|}$$

When importance sampling is done as a simple average in this way it is called ordinary importance sampling. An important alternative is weighted importance sampling, which uses a weighted average, defined as:
$$V(s) = \frac{\sum_{t \in J(s)} \rho_t^{T(t)} G_t}{\sum_{t \in J(s)} \rho_t^{T(t)}},$$
or zero if the denominator is zero.
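
As a sketch of how the two estimators differ in code (again my own illustration, under the assumption that the returns $G_t$ and the matching ratios $\rho_t^{T(t)}$ for all $t \in J(s)$ have already been collected for one state $s$; the function names are hypothetical):

```python
import numpy as np

def ordinary_is_estimate(returns, ratios):
    """Ordinary importance sampling: simple average of the ratio-scaled returns."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return np.sum(ratios * returns) / len(returns)       # divide by |J(s)|

def weighted_is_estimate(returns, ratios):
    """Weighted importance sampling: weighted average of the returns."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    denom = np.sum(ratios)
    return np.sum(ratios * returns) / denom if denom != 0 else 0.0
```

Each return is scaled by its full ratio in both cases; the only difference is the normalizer, which is the number of visits $|J(s)|$ for the ordinary estimator and the sum of the ratios for the weighted one.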

The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased (the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
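
To make this trade-off concrete, here is a small simulation under assumed toy dynamics (not an example from the book): one-step episodes with two actions, a target policy that always takes action 0, a behavior policy that takes it with probability 0.1, and a reward of 1 for action 0 and 0 otherwise, so the true value is $v_\pi(s) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(n_episodes):
    took_a0 = rng.random(n_episodes) < 0.1        # did the behavior policy take action 0?
    returns = took_a0.astype(float)               # G_t: reward 1 only for action 0
    ratios = np.where(took_a0, 1.0 / 0.1, 0.0)    # pi/mu = 1/0.1 = 10, or 0/0.9 = 0
    ordinary = np.sum(ratios * returns) / n_episodes
    denom = np.sum(ratios)
    weighted = np.sum(ratios * returns) / denom if denom != 0 else 0.0
    return ordinary, weighted

print(run(10))       # ordinary jumps in whole-number steps around 1; weighted is 0 or 1
print(run(10_000))   # with many episodes both settle near the true value 1
```

With only a few episodes the ordinary estimate swings widely (each scaled return is either 0 or 10), while the weighted estimate is exactly 1 whenever action 0 was seen at all, and 0 (a biased answer) when it was never seen.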
