Typical Policy Evaluation Strategies in Model-free Policy Search

最新推荐文章于 2023-11-11 21:42:25 发布

止于至玄

最新推荐文章于 2023-11-11 21:42:25 发布

阅读量220

点赞数

分类专栏： Reinforcement Learning 文章标签：强化学习

本文链接：https://blog.csdn.net/philthinker/article/details/77385376

版权

Reinforcement Learning 专栏收录该内容

24 篇文章 8 订阅

订阅专栏

Thanks J. Peters et al for their great work of A Survey for Policy Search in Robotics.

Policy evaluation strategies are used to assess the quality of the executed policy. They may be used to transform sampled trjectories $\tau^{[i]}$ into a data set $\mathcal{D}$ that contains samples of either the state-action pairs $x_{t}^{[i]},u_{t}^{[i]}$ or the parameter vectors $\theta^{[i]}$ . The data set $\mathcal{D}$ is subsequently processed by the policy update strategies to determine the new policy.

Step-based Policy Evaluation

In step-based policy evaluation, we decompose the sampled trajectories $\tau^{[i]}$ into its single $x_{t}^{[i]}, u_{t}^{[i]}$ , and estimate the quality of the single actions. The quality of an action is given by:

Q [i] t = Q π t (x [i] t, u [i] t) = E p θ (τ) [\sum h = 1 T r h (x h, u h) ∣ ∣ ∣ ∣ x t = x [i] t, u t = u [i] t]

$Q_{t}^{[i]}=Q_{t}^{\pi}\left(x_{t}^{[i]},u_{t}^{[i]}\right)=\mathbb{E}_{p_{\theta}(\tau)}\left[\left.\sum_{h=1}^{T}r_{h}(x_{h},u_{h}) \right| x_{t}=x_{t}^{[i]},u_{t}=u_{t}^{[i]}\right]$ Estimating the state-action value function usually suffers from high dimensional continuous spaces, approximation errors and a bias induced by the bootstrapping approach. Monte-Carlo estimates are unbiased, however, they typically exhibit a high variance.

Algorithms based on step-based policy evaluation use a data set $\mathcal{D}_{step}=\left\{x^{[i]},u^{[i]},Q^{[i]} \right\}$ to determine the policy upadte step.

Episode-based Policy Evaluation

Episode-based policy evaluation strategies directly use the expected return $R^{[i]}=R(\theta^{[i]})$ to evaluate the quality of a parameter vector $\theta^{[i]}$ :

R (θ [i]) = E p θ (τ) [\sum t = 0 T r t ∣ ∣ ∣ ∣ θ = θ [i]]

$R(\theta^{[i]})=\mathbb{E}_{p_{\theta}(\tau)}\left[\left.\sum_{t=0}^{T}r_{t}\right|\theta=\theta^{[i]}\right]$ However, this structure of return is by no means the only choice. Any reward function

R(θ[i]) $R(\theta^{[i]})$ which depends on the resulting trajectory of the robot can be used. The expected return can be estimated by performing multiple rollouts on the real system. However, in order to avoid such an expensive operation, some approaches can cope with noisy estimates of

R[i] $R^{[i]}$ and, hence, directly use the return

∑Tt=0r[i]t $\sum_{t=0}^{T}r_{t}^{[i]}$ of a single trajectory

τ[i] $\tau^{[i]}$ to estimate

R[i] $R^{[i]}$ .

Episode-based policy evaluation produces a data set $\mathcal{D}_{ep}=\{\theta^{[i]}, R^{[i]}\}$ and is typically connected with parameter-based exploration strategies, and, hence, such algorithms can be formalized by the problem of learning an upper-level policy $\pi_{w}(\theta)$ .

An underlying problem of episode-based evaluation is the variance of the $R^{[i]}$ estimates. For a high number of time steps and highly stochastic systems, step-based algorithms should be preferred.

Generalization to Multiple Tasks

For generalizing the learned polices to multiple tasks, so far, mainly episode-based policy evaluation strategies have been which learn an upper-level policy. Define a context vector $s$ which describes all variables which do not change during the execution of the task but might change from task to task. The upper-level policy is extended to generalize the lower-level policy $\pi_{\theta}(u|x)$ to different tasks by conditioning the upper-level policy $\pi_{w}(\theta|s)$ on the context $s$ .

The problem of learning $\pi_{w}(\theta|s)$ can be characterized by maximizing the expected returns over all contexts:

J w = \int s μ (s) \int θ π w (θ | s) \int τ p (τ | θ, s) R (τ, s) d τ d θ d s = \int s μ (s) \int θ π w (θ | s) R (θ, s) d θ d s

$\begin{split}J_{w} &= \int_{s}\mu(s)\int_{\theta}\pi_{w}(\theta|s)\int_{\tau}p(\tau|\theta,s)R(\tau,s)d\tau d\theta ds\\ &=\int_{s}\mu(s)\int_{\theta}\pi_{w}(\theta|s)R(\theta,s) d\theta ds \end{split}$ where

R(θ,s) $R(\theta,s)$ is the expected return for executing the lower-level policy with parameter vector

θ $\theta$ in context

s $s$ and

μ(s) $\mu(s)$ is the distribution over the contexts. The data set

Dep={s[i],θ[i],R[i]} $\mathcal{D}_{ep}=\{s^{[i]},\theta^{[i]},R^{[i]}\}$ is used for updating the policy.

止于至玄

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Typical Policy Evaluation Strategies in Model-free Policy Search

Thanks J. Peters et al for their great work of A Survey for Policy Search in Robotics.Policy evaluation strategies are used to assess the quality of the executed policy. They may be used to transform...
复制链接

扫一扫

专栏目录