Typical Exploration Strategies in Model-free Policy Search

Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.

The exploration strategy is used to generate new trajectory samples τ^[i]. All exploration strategies in model-free policy search are local and use stochastic policies to implement exploration. Typically, Gaussian policies are used to model these stochastic policies.

Many model-free policy search approaches update the exploration distribution and, hence, the covariance of the Gaussian policy. Typically, a large exploration rate is used at the beginning of learning and is then gradually decreased to fine-tune the policy parameters.
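A minimal sketch of such an annealing schedule is given below; the exponential shrinkage factor is an assumption for illustration, not something prescribed by the survey.

```python
import numpy as np

# Minimal sketch: anneal the exploration covariance over policy updates.
# The exponential decay factor is a hypothetical choice for illustration.
Sigma0 = 1.0 * np.eye(3)   # broad exploration covariance at the start of learning
decay = 0.95               # shrinkage applied after every policy update (assumed)

# Covariance used at policy update k: starts large, shrinks to fine-tune the parameters.
Sigma_at_update = [decay**k * Sigma0 for k in range(100)]
```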

Action Space vs Parameter Space

In action space, we can simply add exploration noise ε_u to the executed actions, i.e.

$$u_t = \mu_u(x, t) + \epsilon_u$$
The exploration noise is sampled independently for each time step from a zero-mean Gaussian distribution with covariance Σ_u. The policy π is then given as:
$$\pi_\theta(u \mid x) = \mathcal{N}\!\left(u \mid \mu_u(x, t), \Sigma_u\right)$$
Applications of exploration in action space can be found in the REINFORCE and eNAC (episodic Natural Actor-Critic) algorithms.
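The following is a minimal sketch of this kind of step-based, action-space exploration; the feature function, parameter values, and covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state features and fixed policy parameters for illustration.
phi = lambda x, t: np.array([1.0, x, x * t])
theta = np.array([0.1, -0.5, 0.2])

def mu(x, t):
    """Deterministic mean action mu_u(x, t) of a linear-in-features policy."""
    return np.array([phi(x, t) @ theta])

Sigma_u = 0.05 * np.eye(1)   # exploration covariance in action space (assumed value)

def explore_in_action_space(x, t):
    """Sample u_t = mu_u(x, t) + eps_u with eps_u ~ N(0, Sigma_u)."""
    eps_u = rng.multivariate_normal(np.zeros(Sigma_u.shape[0]), Sigma_u)
    return mu(x, t) + eps_u

u_t = explore_in_action_space(x=0.3, t=2)   # a perturbed action for one time step
```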

Exploration in parameter space perturbs the parameter vector θ. In contrast to exploration in action space, exploration in parameter space can use more structured noise and can adapt the variance of the exploration noise depending on the state features φ_t(x).

Many approaches can be formulated with the concept of an upper-level policy π_w(θ), which selects the parameters of the actual control policy π_θ(u|x), i.e. the lower-level policy. The upper-level policy is typically modeled as a Gaussian distribution π_w(θ) = N(θ | μ_θ, Σ_θ). The lower-level control policy u = π_θ(x, t) is typically modeled as a deterministic policy, since exploration only takes place in parameter space.

The parameter vector w thus defines a distribution over θ, and we can use this distribution to explore directly in parameter space. The optimization problem for learning upper-level policies is to maximize:

$$J_w = \int_\theta \pi_w(\theta) \int_\tau p(\tau \mid \theta)\, R(\tau) \, d\tau \, d\theta = \int_\theta \pi_w(\theta)\, R(\theta) \, d\theta$$
For a linear control policy u = φ_t(x)^T θ, we can rewrite the deterministic lower-level policy in combination with the upper-level policy as a single, stochastic policy:
$$\pi_\theta(u_t \mid x_t, t) = \mathcal{N}\!\left(u_t \mid \phi_t(x)^T \mu_\theta,\; \phi_t(x)^T \Sigma_\theta\, \phi_t(x)\right)$$
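As a sketch of this construction (the feature function and covariance values are illustrative assumptions), one can sample θ from the upper-level Gaussian, act deterministically with the lower-level policy, and recover the state-dependent action variance φ_t(x)^T Σ_θ φ_t(x):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
mu_theta = np.zeros(d)            # mean of the upper-level policy pi_w(theta)
Sigma_theta = 0.1 * np.eye(d)     # (full) covariance of the upper-level policy

phi = lambda x, t: np.array([1.0, x, np.sin(t)])   # hypothetical state features phi_t(x)

# Parameter-space exploration: sample theta once, then act deterministically.
theta = rng.multivariate_normal(mu_theta, Sigma_theta)
u_t = phi(0.3, 2) @ theta                          # lower-level policy u = phi_t(x)^T theta

# Equivalent single stochastic policy: Gaussian with state-dependent variance.
f = phi(0.3, 2)
action_mean = f @ mu_theta                         # phi_t(x)^T mu_theta
action_var = f @ Sigma_theta @ f                   # phi_t(x)^T Sigma_theta phi_t(x)
```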

Episode-based vs Step-based

Step-based exploration uses different exploration noise at each time step and can be performed either in action space or in parameter space. Step-based exploration can be problematic, as it might produce action sequences that are not reproducible by a noise-free control law.

Episode-based exploration uses exploration noise only at the beginning of the episode, which leads to exploration in parameter space. Episode-based exploration might produce more reliable policy updates.
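A hedged sketch contrasting the two variants for the same parameter-space setup (the dynamics, horizon, and features below are placeholders, not taken from the survey):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_theta, Sigma_theta = np.zeros(3), 0.1 * np.eye(3)
phi = lambda x, t: np.array([1.0, x, np.sin(t)])   # hypothetical state features

def rollout(step_based, T=50):
    """Roll out the linear lower-level policy with placeholder dynamics."""
    x = 0.0
    theta = rng.multivariate_normal(mu_theta, Sigma_theta)   # one sample per episode
    actions = []
    for t in range(T):
        if step_based:
            # Step-based: draw new exploration noise at every time step.
            theta = rng.multivariate_normal(mu_theta, Sigma_theta)
        u = phi(x, t) @ theta            # deterministic lower-level policy
        actions.append(u)
        x = 0.9 * x + 0.1 * u            # placeholder system dynamics
    return actions

step_actions = rollout(step_based=True)      # jittery; may not be reproducible noise-free
episode_actions = rollout(step_based=False)  # single perturbation; smoother action sequence
```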

Uncorrelated vs Correlated

As most policies are represented as Gaussian distributions, uncorrelated exploration noise is obtained by using a diagonal covariance matrix. Correlated exploration can be achieved by maintaining a full representation of the covariance matrix.

Exploration in action space typically uses a diagonal covariance matrix. In parameter space, many approaches update the full covariance matrix of the Gaussian policy. Using the full covariance matrix often results in considerably increased learning speed.
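A small sketch of the difference (covariance values are illustrative): with a diagonal matrix each parameter is perturbed independently, while off-diagonal terms couple the perturbations.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_theta = np.zeros(2)

# Uncorrelated exploration: diagonal covariance, parameters perturbed independently.
Sigma_diag = np.diag([0.1, 0.4])
theta_uncorrelated = rng.multivariate_normal(mu_theta, Sigma_diag)

# Correlated exploration: a full covariance with off-diagonal terms couples the
# perturbations of the two parameters (values chosen to keep the matrix positive definite).
Sigma_full = np.array([[0.10, 0.15],
                       [0.15, 0.40]])
theta_correlated = rng.multivariate_normal(mu_theta, Sigma_full)
```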
