Thanks to Sutton and Barto for their great work, Reinforcement Learning: An Introduction.
Almost all off-policy reinforcement learning methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply it by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability function. The importance-sampling ratio is therefore

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

where $b$ is the behavior policy. The transition probabilities appear in both the numerator and the denominator, so they cancel: the ratio depends only on the two policies and the observed states and actions, not on the (possibly unknown) MDP dynamics.
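To make the ratio concrete, here is a minimal sketch (my own, not from the book) of computing $\rho_{t:T-1}$ for one episode, assuming the two policies are given as hypothetical functions `pi_prob(s, a)` and `b_prob(s, a)` that return action probabilities:

```python
from typing import Callable, List, Tuple

def importance_ratio(
    episode: List[Tuple[int, int]],        # [(S_t, A_t), ..., (S_{T-1}, A_{T-1})]
    pi_prob: Callable[[int, int], float],  # target policy: pi(a | s)
    b_prob: Callable[[int, int], float],   # behavior policy: b(a | s)
) -> float:
    """Product of pi(A_k | S_k) / b(A_k | S_k) over the episode.

    The transition probabilities p(S_{k+1} | S_k, A_k) cancel between the
    two trajectory probabilities, so they never need to be known.
    """
    rho = 1.0
    for s, a in episode:
        # Coverage assumption: b(a | s) > 0 whenever pi(a | s) > 0.
        rho *= pi_prob(s, a) / b_prob(s, a)
    return rho
```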
Define $J(s)$ as the set of all time steps in which state $s$ is visited. This is for an every-visit method; for a first-visit method, $J(s)$ would include only time steps that were first visits to $s$ within their episodes. Let $T(t)$ denote the first time of termination following time $t$, and let $G_t$ denote the return after $t$ up through $T(t)$.
To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|}.$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling. An important alternative is weighted importance sampling, which uses a weighted average, defined as

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}},$$

or zero if the denominator is zero.
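As a concrete illustration, here is a minimal sketch (again my own, not the book's) of both estimators, taking the per-visit ratios $\rho_{t:T(t)-1}$ and returns $G_t$ for a single state as parallel lists:

```python
from typing import List

def ordinary_is(ratios: List[float], returns: List[float]) -> float:
    """Ordinary importance sampling: a simple average of rho * G."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(ratios)

def weighted_is(ratios: List[float], returns: List[float]) -> float:
    """Weighted importance sampling: average of G weighted by rho.

    Each return's effective weight is at most one, so the estimate always
    stays within the range of the observed returns.
    """
    total = sum(ratios)
    if total == 0.0:
        return 0.0  # defined as zero when the denominator is zero
    return sum(r * g for r, g in zip(ratios, returns)) / total
```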
The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased (the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
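To see this trade-off numerically, consider a toy one-step episode (a construction of my own for illustration): the target policy always takes action 0, the behavior policy picks uniformly between two actions, and the return is 1 for action 0 and 0 otherwise, so $v_\pi(s) = 1$. Reusing `ordinary_is` and `weighted_is` from the sketch above:

```python
import random

def demo(n_episodes: int = 10, seed: int = 0) -> None:
    rng = random.Random(seed)
    ratios, returns = [], []
    for _ in range(n_episodes):
        a = rng.randrange(2)  # behavior policy: b(a|s) = 0.5 for each action
        ratios.append((1.0 if a == 0 else 0.0) / 0.5)  # pi(a|s) / b(a|s)
        returns.append(1.0 if a == 0 else 0.0)
    print("ordinary:", ordinary_is(ratios, returns))  # mean 1, but noisy
    print("weighted:", weighted_is(ratios, returns))  # 1 once action 0 is seen

demo()
```

Here the ordinary estimate averages values of 0 and 2, so with few episodes it can land far from 1 even though its expectation is exactly 1; the weighted estimate equals 1 whenever action 0 has been taken at least once, at the cost of a small-sample bias that vanishes as episodes accumulate.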