Typical Policy Representations in Policy Search Methods

Thanks to Jan Peters et al. for their great work, A Survey on Policy Search for Robotics.

Policy representations may be categorized into time-independent representations $\pi(x)$ and time-dependent representations $\pi(x, t)$. Since time-dependent representations can use different policies for different time steps, they allow a simpler structure for the individual policies.

In the following, we describe all these representations in their deterministic formulation $\pi_\theta(x, t)$. In stochastic formulations, a zero-mean Gaussian noise vector $\epsilon_t$ is typically added to $\pi_\theta(x, t)$. In this case, the parameter vector $\theta$ typically also includes the covariance matrix used for generating the noise $\epsilon_t$.
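As a concrete illustration, here is a minimal sketch in Python/NumPy of sampling an action from such a stochastic policy; the function name and signature are my own, not from the survey:

```python
import numpy as np

def stochastic_policy(pi_det, x, t, Sigma, rng=None):
    """Sample u = pi_theta(x, t) + eps with eps ~ N(0, Sigma).

    pi_det is the deterministic policy, a callable (x, t) -> action;
    Sigma is the exploration covariance, itself part of theta.
    """
    rng = rng or np.random.default_rng()
    mean = np.atleast_1d(pi_det(x, t))
    eps = rng.multivariate_normal(np.zeros(mean.shape[0]), Sigma)
    return mean + eps

# Usage with a dummy deterministic policy and a 1-D action:
u = stochastic_policy(lambda x, t: np.array([0.0]),
                      x=np.zeros(2), t=0.0, Sigma=0.1 * np.eye(1))
```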

Linear Policies

A linear policy $\pi$ has the form

$$\pi_\theta(x) = \theta^T \phi(x),$$

where $\phi$ is a vector of basis functions. Linear policies are limited to problems where appropriate basis functions are known in advance.
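A minimal sketch of such a policy in Python/NumPy, with hand-picked basis functions for a pendulum-style state (all names and values here are illustrative, not from the survey):

```python
import numpy as np

def linear_policy(theta, phi, x):
    """Evaluate pi_theta(x) = theta^T phi(x).

    theta: (n_features, n_actions) parameter matrix
    phi:   feature map, a callable x -> (n_features,) vector
    """
    return theta.T @ phi(x)

# Hand-picked basis functions for a state x = [angle, velocity].
phi = lambda x: np.array([1.0, np.sin(x[0]), x[1]])
theta = np.zeros((3, 1))  # one action dimension
u = linear_policy(theta, phi, np.array([0.3, -0.1]))
```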

Radial Basis Function Networks

An RBF policy $\pi_\theta$ is given as

$$\pi_\theta(x) = w^T \phi(x), \qquad \phi_i(x) = \exp\left(-\tfrac{1}{2}(x - \mu_i)^T D_i (x - \mu_i)\right),$$

where $D_i = \mathrm{diag}(d_i)$. The parameters $\beta = \{\mu_i, d_i\}_{i=1,\dots,n}$ of the basis functions are now free parameters to be learned; hence $\theta = \{w, \beta\}$.
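The RBF policy can be sketched the same way; here the centers $\mu_i$, precisions $d_i$, and weights $w$ are all free parameters of $\theta$ (this code is my own illustration, assuming a one-dimensional action):

```python
import numpy as np

def rbf_features(x, mus, ds):
    """phi_i(x) = exp(-0.5 (x - mu_i)^T D_i (x - mu_i)) with D_i = diag(d_i)."""
    diffs = x[None, :] - mus                    # (n_basis, dim_x)
    return np.exp(-0.5 * np.sum(ds * diffs**2, axis=1))

def rbf_policy(w, mus, ds, x):
    """pi_theta(x) = w^T phi(x), with theta = {w, mu_i, d_i}."""
    return w @ rbf_features(x, mus, ds)

# Example: 5 Gaussian basis functions over a 2-D state space.
rng = np.random.default_rng(0)
mus = rng.uniform(-1, 1, size=(5, 2))  # centers mu_i
ds = np.ones((5, 2))                   # diagonal precisions d_i
w = rng.normal(size=5)                 # linear weights
u = rbf_policy(w, mus, ds, np.array([0.2, -0.4]))
```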

Dynamic Movement Primitives

DMPs are the most widely used time-dependent policy representation in robotics. The key principle is to use a linear spring-damper system that is modulated by a nonlinear forcing function:

$$\ddot{y}_t = \tau^2 \alpha_y \big(\beta_y (g - y_t) - \dot{y}_t\big) + \tau^2 f_t,$$

where the variable $y_t$ specifies the desired joint position, $\tau$ is the time-scaling coefficient, the coefficients $\alpha_y$ and $\beta_y$ define the spring and damping constants, and the goal parameter $g$ is the unique point attractor of the system. The forcing function $f_t$ shifts this attractor during the movement and thereby shapes the trajectory toward $g$.
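A direct transcription of this equation as a Python function (the constants $\alpha_y = 25$ and $\beta_y = \alpha_y/4$ are common critically damped choices, assumed here rather than taken from the survey):

```python
def dmp_acceleration(y, yd, g, f, tau=1.0, alpha_y=25.0, beta_y=6.25):
    """Desired joint acceleration of the spring-damper system plus forcing:
    ydd = tau^2 * alpha_y * (beta_y * (g - y) - yd) + tau^2 * f
    """
    return tau**2 * alpha_y * (beta_y * (g - y) - yd) + tau**2 * f
```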

One key innovation of the DMP approach is the use of a phase variable $z_t$ to scale the execution speed of the movement:

$$\dot{z}_t = -\tau \alpha_z z_t, \qquad z_0 = 1.$$
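The phase thus decays exponentially from 1 toward 0 and can be computed in closed form or integrated step by step; a small sketch (my own notation):

```python
import numpy as np

def phase(t, tau, alpha_z):
    """Closed-form solution of dz/dt = -tau * alpha_z * z with z(0) = 1."""
    return np.exp(-tau * alpha_z * t)

def phase_step(z, dt, tau, alpha_z):
    """One explicit Euler step of the phase dynamics, for simulation loops."""
    return z + dt * (-tau * alpha_z * z)
```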

For each degree of freedom, an individual spring-damper system and forcing function are used. The forcing function is a normalized weighted sum of $K$ basis functions of the phase:

$$f(z) = \frac{\sum_{i=1}^{K} \phi_i(z)\, w_i}{\sum_{i=1}^{K} \phi_i(z)}\, z, \qquad \phi_i(z) = \exp\left(-\tfrac{1}{2\sigma_i^2}(z - c_i)^2\right).$$

The parameters $w_i$ are denoted the shape parameters of the DMP, as they modulate the acceleration profile and hence indirectly specify the shape of the movement. Because the phase $z_t$ decays to zero, the forcing function vanishes over time and the dynamics reduce to the stable spring-damper system, so the nonlinear dynamical system is globally stable. Intuitively, the goal parameter $g$ specifies the final position, while the shape parameters $w_i$ specify how that position is reached.
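A sketch of the forcing function in Python, assuming centers $c_i$ spread over the range of the phase and a shared bandwidth (names and values are illustrative):

```python
import numpy as np

def forcing(z, w, centers, sigmas):
    """f(z) = (sum_i phi_i(z) * w_i) / (sum_i phi_i(z)) * z."""
    psi = np.exp(-(z - centers)**2 / (2.0 * sigmas**2))
    # The small constant guards against division by zero when z is
    # far from all basis-function centers.
    return (psi @ w) / (psi.sum() + 1e-10) * z

# Example: K = 10 basis functions with centers spread over (0, 1].
f = forcing(z=0.5, w=np.zeros(10),
            centers=np.linspace(0.05, 1.0, 10),
            sigmas=np.full(10, 0.05))
```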

A policy $\pi_\theta(x_t, t)$ specified by a DMP directly controls the acceleration of the joint and is given by:

$$\pi_\theta(x_t, t) = \tau^2 \alpha_y \big(\beta_y (g - y_t) - \dot{y}_t\big) + \tau^2 f(z_t).$$

Note that the DMP policy is linear in the shape parameters $w$ and the goal attractor $g$, but nonlinear in the time-scaling constant $\tau$. Hence $\theta = \{w, g, \tau\}$.
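Putting the pieces together, a minimal Euler-integration rollout of a single-DoF DMP might look as follows; this is a sketch under my own parameter choices, not code from the survey:

```python
import numpy as np

def dmp_rollout(w, g, tau, y0=0.0, alpha_y=25.0, beta_y=6.25,
                alpha_z=3.0, dt=0.002, T=1.0):
    """Euler-integrate a single-DoF DMP and return the joint trajectory."""
    n_basis = len(w)
    # Spread the basis centers along the exponential path of the phase.
    centers = np.exp(-alpha_z * np.linspace(0.0, 1.0, n_basis))
    sigmas = np.full(n_basis, 0.05)
    y, yd, z = y0, 0.0, 1.0
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-(z - centers)**2 / (2.0 * sigmas**2))
        f = (psi @ w) / (psi.sum() + 1e-10) * z       # forcing function f(z)
        # Policy output: the commanded joint acceleration.
        ydd = tau**2 * alpha_y * (beta_y * (g - y) - yd) + tau**2 * f
        yd += dt * ydd
        y += dt * yd
        z += dt * (-tau * alpha_z * z)                # phase dynamics
        traj.append(y)
    return np.array(traj)

# With w = 0 the forcing term vanishes and the trajectory converges to g.
trajectory = dmp_rollout(w=np.zeros(10), g=1.0, tau=1.0)
```

With zero shape parameters the rollout reproduces the pure spring-damper behavior; learning then adjusts $w$ (and possibly $g$ and $\tau$) to shape the movement.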

Miscellaneous Representations

There exist other representations as well, such as central pattern generators for robot walking and feed-forward neural networks, which have mainly been used in simulation.
