Thanks to Sutton and Barto for their great work, Reinforcement Learning: An Introduction.
Almost all off-policy reinforcement learning methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply it by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability function. The importance-sampling ratio is therefore

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

where $b$ is the behavior policy. The transition probabilities appear in both the numerator and the denominator, so they cancel: the ratio depends only on the two policies and the observed states and actions, not on the (possibly unknown) MDP dynamics.
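To make the ratio concrete, here is a minimal sketch (my own, not from the book) of computing $\rho_{t:T-1}$ for one episode, assuming the two policies are given as hypothetical functions `pi_prob(s, a)` and `b_prob(s, a)` that return action probabilities:

```python
from typing import Callable, List, Tuple

def importance_ratio(
    episode: List[Tuple[int, int]],        # [(S_t, A_t), ..., (S_{T-1}, A_{T-1})]
    pi_prob: Callable[[int, int], float],  # target policy: pi(a | s)
    b_prob: Callable[[int, int], float],   # behavior policy: b(a | s)
) -> float:
    """Product of pi(A_k | S_k) / b(A_k | S_k) over the episode.

    The transition probabilities p(S_{k+1} | S_k, A_k) cancel between the
    two trajectory probabilities, so they never need to be known.
    """
    rho = 1.0
    for s, a in episode:
        # Coverage assumption: b(a | s) > 0 whenever pi(a | s) > 0.
        rho *= pi_prob(s, a) / b_prob(s, a)
    return rho
```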
Define $J(s)$ as the set of all time steps in which state $s$ is visited. This is for an every-visit method; for a first-visit method, $J(s)$ would include only time steps that were first visits to $s$ within their episodes. Let $T(t)$ denote the first time of termination following time $t$, and let $G_t$ denote the return after $t$ up through $T(t)$.
To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|}.$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling. An important alternative is weighted importance sampling, which uses a weighted average, defined as

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}},$$

or zero if the denominator is zero.
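As a concrete illustration, here is a minimal sketch (again my own, not the book's) of both estimators, taking the per-visit ratios $\rho_{t:T(t)-1}$ and returns $G_t$ for a single state as parallel lists:

```python
from typing import List

def ordinary_is(ratios: List[float], returns: List[float]) -> float:
    """Ordinary importance sampling: a simple average of rho * G."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(ratios)

def weighted_is(ratios: List[float], returns: List[float]) -> float:
    """Weighted importance sampling: average of G weighted by rho.

    Each return's effective weight is at most one, so the estimate always
    stays within the range of the observed returns.
    """
    total = sum(ratios)
    if total == 0.0:
        return 0.0  # defined as zero when the denominator is zero
    return sum(r * g for r, g in zip(ratios, returns)) / total
```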
The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased (the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
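To see this trade-off numerically, consider a toy one-step episode (a construction of my own for illustration): the target policy always takes action 0, the behavior policy picks uniformly between two actions, and the return is 1 for action 0 and 0 otherwise, so $v_\pi(s) = 1$. Reusing `ordinary_is` and `weighted_is` from the sketch above:

```python
import random

def demo(n_episodes: int = 10, seed: int = 0) -> None:
    rng = random.Random(seed)
    ratios, returns = [], []
    for _ in range(n_episodes):
        a = rng.randrange(2)  # behavior policy: b(a|s) = 0.5 for each action
        ratios.append((1.0 if a == 0 else 0.0) / 0.5)  # pi(a|s) / b(a|s)
        returns.append(1.0 if a == 0 else 0.0)
    print("ordinary:", ordinary_is(ratios, returns))  # mean 1, but noisy
    print("weighted:", weighted_is(ratios, returns))  # 1 once action 0 is seen

demo()
```

Here the ordinary estimate averages values of 0 and 2, so with few episodes it can land far from 1 even though its expectation is exactly 1; the weighted estimate equals 1 whenever action 0 has been taken at least once, at the cost of a small-sample bias that vanishes as episodes accumulate.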