Typical Exploration Strategies in Model-free Policy Search

Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.

The exploration strategy is used to generate new trajectory samples τ^[i]. All exploration strategies in model-free policy search are local and use stochastic policies to implement exploration. Typically, Gaussian policies are used to model these stochastic policies.

Many model-free policy search approaches update the exploration distribution and, hence, the covariance of the Gaussian policy. Typically, a large exploration rate is used at the beginning of learning and is then gradually decreased to fine-tune the policy parameters.
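A minimal sketch of such an annealing schedule is given below; the exponential shrinkage factor is an assumption for illustration, not something prescribed by the survey.

```python
import numpy as np

# Minimal sketch: anneal the exploration covariance over policy updates.
# The exponential decay factor is a hypothetical choice for illustration.
Sigma0 = 1.0 * np.eye(3)   # broad exploration covariance at the start of learning
decay = 0.95               # shrinkage applied after every policy update (assumed)

# Covariance used at policy update k: starts large, shrinks to fine-tune the parameters.
Sigma_at_update = [decay**k * Sigma0 for k in range(100)]
```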

Action Space vs Parameter Space

In action space, we can simply add exploration noise ε_u to the executed actions, i.e.

$$u_t = \mu_u(x, t) + \epsilon_u$$
The exploration noise is sampled independently for each time step from a zero-mean Gaussian distribution with covariance Σ_u. The policy π is then given as:
$$\pi_\theta(u \mid x) = \mathcal{N}\!\left(u \mid \mu_u(x, t), \Sigma_u\right)$$
Applications of exploration in action space can be found in the REINFORCE and eNAC (episodic Natural Actor-Critic) algorithms.
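The following is a minimal sketch of this kind of step-based, action-space exploration; the feature function, parameter values, and covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state features and fixed policy parameters for illustration.
phi = lambda x, t: np.array([1.0, x, x * t])
theta = np.array([0.1, -0.5, 0.2])

def mu(x, t):
    """Deterministic mean action mu_u(x, t) of a linear-in-features policy."""
    return np.array([phi(x, t) @ theta])

Sigma_u = 0.05 * np.eye(1)   # exploration covariance in action space (assumed value)

def explore_in_action_space(x, t):
    """Sample u_t = mu_u(x, t) + eps_u with eps_u ~ N(0, Sigma_u)."""
    eps_u = rng.multivariate_normal(np.zeros(Sigma_u.shape[0]), Sigma_u)
    return mu(x, t) + eps_u

u_t = explore_in_action_space(x=0.3, t=2)   # a perturbed action for one time step
```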

Exploration in parameter space perturbs the parameter vector θ. In contrast to exploration in action space, exploration in parameter space can use more structured noise and can adapt the variance of the exploration noise depending on the state features φ_t(x).

Many approaches can be formulated with the concept of an upper-level policy π_w(θ), which selects the parameters of the actual control policy π_θ(u|x), i.e. the lower-level policy. The upper-level policy is typically modeled as a Gaussian distribution π_w(θ) = N(θ | μ_θ, Σ_θ). The lower-level control policy u = π_θ(x, t) is typically modeled as a deterministic policy, since exploration only takes place in parameter space.

The parameter vector w thus defines a distribution over θ, and we can use this distribution to explore directly in parameter space. The optimization problem for learning upper-level policies is to maximize:

$$J_w = \int_\theta \pi_w(\theta) \int_\tau p(\tau \mid \theta)\, R(\tau) \, d\tau \, d\theta = \int_\theta \pi_w(\theta)\, R(\theta) \, d\theta$$
For a linear control policy u = φ_t(x)^T θ, we can rewrite the deterministic lower-level policy in combination with the upper-level policy as a single, stochastic policy:
$$\pi_\theta(u_t \mid x_t, t) = \mathcal{N}\!\left(u_t \mid \phi_t(x)^T \mu_\theta,\; \phi_t(x)^T \Sigma_\theta\, \phi_t(x)\right)$$
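As a sketch of this construction (the feature function and covariance values are illustrative assumptions), one can sample θ from the upper-level Gaussian, act deterministically with the lower-level policy, and recover the state-dependent action variance φ_t(x)^T Σ_θ φ_t(x):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
mu_theta = np.zeros(d)            # mean of the upper-level policy pi_w(theta)
Sigma_theta = 0.1 * np.eye(d)     # (full) covariance of the upper-level policy

phi = lambda x, t: np.array([1.0, x, np.sin(t)])   # hypothetical state features phi_t(x)

# Parameter-space exploration: sample theta once, then act deterministically.
theta = rng.multivariate_normal(mu_theta, Sigma_theta)
u_t = phi(0.3, 2) @ theta                          # lower-level policy u = phi_t(x)^T theta

# Equivalent single stochastic policy: Gaussian with state-dependent variance.
f = phi(0.3, 2)
action_mean = f @ mu_theta                         # phi_t(x)^T mu_theta
action_var = f @ Sigma_theta @ f                   # phi_t(x)^T Sigma_theta phi_t(x)
```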

Episode-based vs Step-based

Step-based exploration uses different exploration noise at each time step and can be performed either in action space or in parameter space. Step-based exploration can be problematic, as it might produce action sequences that are not reproducible by a noise-free control law.

Episode-based exploration uses exploration noise only at the beginning of the episode, which leads to exploration in parameter space. Episode-based exploration might produce more reliable policy updates.
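A hedged sketch contrasting the two variants for the same parameter-space setup (the dynamics, horizon, and features below are placeholders, not taken from the survey):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_theta, Sigma_theta = np.zeros(3), 0.1 * np.eye(3)
phi = lambda x, t: np.array([1.0, x, np.sin(t)])   # hypothetical state features

def rollout(step_based, T=50):
    """Roll out the linear lower-level policy with placeholder dynamics."""
    x = 0.0
    theta = rng.multivariate_normal(mu_theta, Sigma_theta)   # one sample per episode
    actions = []
    for t in range(T):
        if step_based:
            # Step-based: draw new exploration noise at every time step.
            theta = rng.multivariate_normal(mu_theta, Sigma_theta)
        u = phi(x, t) @ theta            # deterministic lower-level policy
        actions.append(u)
        x = 0.9 * x + 0.1 * u            # placeholder system dynamics
    return actions

step_actions = rollout(step_based=True)      # jittery; may not be reproducible noise-free
episode_actions = rollout(step_based=False)  # single perturbation; smoother action sequence
```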

Uncorrelated vs Correlated

As most policies are represented as Gaussian distributions, uncorrelated exploration noise is obtained by using a diagonal covariance matrix. Correlated exploration can be achieved by maintaining a full representation of the covariance matrix.

Exploration in action space typically uses a diagonal covariance matrix. In parameter space, many approaches update the full covariance matrix of the Gaussian policy. Using the full covariance matrix often results in considerably increased learning speed.
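A small sketch of the difference (covariance values are illustrative): with a diagonal matrix each parameter is perturbed independently, while off-diagonal terms couple the perturbations.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_theta = np.zeros(2)

# Uncorrelated exploration: diagonal covariance, parameters perturbed independently.
Sigma_diag = np.diag([0.1, 0.4])
theta_uncorrelated = rng.multivariate_normal(mu_theta, Sigma_diag)

# Correlated exploration: a full covariance with off-diagonal terms couples the
# perturbations of the two parameters (values chosen to keep the matrix positive definite).
Sigma_full = np.array([[0.10, 0.15],
                       [0.15, 0.40]])
theta_correlated = rng.multivariate_normal(mu_theta, Sigma_full)
```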
