Thanks to Jan Peters et al. for their great work, A Survey on Policy Search for Robotics.
Policy representations may be categorized into time-independent representations π(x) and time-dependent representations π(x,t). Since time-dependent representations can use a different policy at each time step, a simpler structure can be used for the individual policies.
In the following, we describe these representations in their deterministic formulation πθ(x,t). In stochastic formulations, a zero-mean Gaussian noise vector ϵt is typically added to πθ(x,t); in this case, the parameter vector θ typically also includes the covariance matrix used to generate ϵt.
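Written out, the stochastic formulation described above is a Gaussian policy (a standard construction; the symbol Σθ below is my notation for the noise covariance, not the survey's):

$$u_t = \pi_{\theta}(x_t, t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \Sigma_{\theta}),$$

which is equivalent to sampling the action from $\pi_{\theta}(u_t \mid x_t, t) = \mathcal{N}\big(u_t \mid \pi_{\theta}(x_t, t), \Sigma_{\theta}\big)$.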
Linear Policies
Linear policy π:
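The equation appears to be missing here; a common form of a linear policy (my reconstruction, where φ(x) denotes a vector of basis functions of the state) is:

$$\pi_{\theta}(x) = \theta^{T} \phi(x),$$

so the policy parameters θ are simply the linear weights on the features φ(x).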
Radial Basis Function Networks
An RBF policy πθ is given as
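The equation appears to be missing here; in the standard formulation (my reconstruction, with μi and Di denoting the centers and bandwidth matrices of the basis functions):

$$\pi_{\theta}(x) = w^{T} \phi(x), \qquad \phi_i(x) = \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^{T} D_i (x - \mu_i)\right),$$

where the parameter vector θ collects the weights w and, optionally, the centers μi and bandwidths Di.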
Dynamic Movement Primitives
DMPs are the most widely used time-dependent policy representation in robotics. The key principle is to use a linear spring-damper system that is modulated by a nonlinear forcing function:
One key innovation of the DMP approach is the use of a phase variable z, which replaces the explicit time dependency and can be used to adjust the execution speed of the movement.
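A standard choice for the phase dynamics (following Ijspeert et al.'s formulation; αz and τ are the conventional symbols for the decay gain and the time-scaling factor) is a first-order linear system:

$$\dot{z}_t = -\tau \alpha_z z_t, \qquad z_0 = 1,$$

so that z decays monotonically from 1 to 0 and serves as a surrogate for time, while τ scales the execution speed.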
For each degree of freedom, an individual spring-damper system and forcing function are used.
A policy πθ(xt,t) specified by a DMP directly controls the joint accelerations and is given by:
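The equation appears to be missing here; in the standard DMP formulation (my reconstruction following Ijspeert et al.'s conventions), the acceleration of joint i is:

$$\ddot{y}_i = \tau^{2}\alpha_y\!\left(\beta_y (g_i - y_i) - \frac{\dot{y}_i}{\tau}\right) + \tau^{2} f_i(z),$$

with goal attractor gi, time-scaling factor τ, spring-damper gains αy and βy, and forcing function

$$f_i(z) = \frac{\sum_j \psi_j(z)\, w_{ij}\, z}{\sum_j \psi_j(z)},$$

where the ψj are Gaussian basis functions of the phase z and the weights wij form the policy parameters θ. Since f is scaled by z, the forcing term vanishes as z → 0, and the spring-damper alone drives yi to the goal gi.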
Miscellaneous Representations
There exist other representations, such as central pattern generators for robot walking and feed-forward neural networks, which have mostly been used in simulation.
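To make the DMP representation above concrete, here is a minimal single-DoF rollout sketch. This is not code from the survey: the gains (αy = 25, βy = αy/4 for critical damping, αz = 8), the Euler integration, and the basis-center layout are common but assumed choices.

```python
import numpy as np

def dmp_rollout(w, y0=0.0, g=1.0, tau=1.0, T=1.0, dt=0.001,
                alpha_y=25.0, beta_y=6.25, alpha_z=8.0):
    """Euler-integrate a single-DoF DMP.

    w      -- weights of the forcing-function basis (one per Gaussian)
    y0, g  -- start and goal positions
    tau    -- time-scaling factor (larger -> faster execution)
    Gains alpha_y and beta_y = alpha_y / 4 give critical damping
    (assumed values, not from the survey).
    """
    n = len(w)
    # Gaussian basis centers placed at equal time intervals of the
    # exponentially decaying phase (a common heuristic).
    c = np.exp(-alpha_z * np.linspace(0.0, 1.0, n))
    h = 1.0 / np.diff(c) ** 2           # widths from center spacing
    h = np.append(h, h[-1])
    y, yd, z = y0, 0.0, 1.0             # position, velocity, phase
    traj = [y]
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (z - c) ** 2)
        f = (psi @ w) * z / psi.sum()   # forcing term vanishes as z -> 0
        ydd = tau**2 * alpha_y * (beta_y * (g - y) - yd / tau) + tau**2 * f
        yd += ydd * dt
        y += yd * dt
        z += -tau * alpha_z * z * dt    # phase dynamics replace explicit time
        traj.append(y)
    return np.array(traj)
```

With all weights set to zero, the forcing term drops out and the spring-damper alone drives y to the goal g; learning then amounts to fitting the weights w to shape the transient.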