Typical Policy Representations in Policy Search Methods

Thanks to Jan Peters et al. for their great work, A Survey on Policy Search for Robotics.

Policy representations may be categorized into time-independent representations $\pi(x)$ and time-dependent representations $\pi(x, t)$. Since time-dependent representations can use different policies for different time steps, they allow a simpler structure for the individual policies.

In the following, we describe all these representations in their deterministic formulation $\pi_\theta(x, t)$. In stochastic formulations, a zero-mean Gaussian noise vector $\epsilon_t$ is typically added to $\pi_\theta(x, t)$. In this case, the parameter vector $\theta$ typically also includes the covariance matrix used for generating the noise $\epsilon_t$.
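As a concrete illustration, here is a minimal sketch in Python/NumPy of sampling an action from such a stochastic policy; the function name and signature are my own, not from the survey:

```python
import numpy as np

def stochastic_policy(pi_det, x, t, Sigma, rng=None):
    """Sample u = pi_theta(x, t) + eps with eps ~ N(0, Sigma).

    pi_det is the deterministic policy, a callable (x, t) -> action;
    Sigma is the exploration covariance, itself part of theta.
    """
    rng = rng or np.random.default_rng()
    mean = np.atleast_1d(pi_det(x, t))
    eps = rng.multivariate_normal(np.zeros(mean.shape[0]), Sigma)
    return mean + eps

# Usage with a dummy deterministic policy and a 1-D action:
u = stochastic_policy(lambda x, t: np.array([0.0]),
                      x=np.zeros(2), t=0.0, Sigma=0.1 * np.eye(1))
```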

Linear Policies

A linear policy $\pi$ has the form

$$\pi_\theta(x) = \theta^T \phi(x),$$

where $\phi$ is a vector of basis functions. Linear policies are limited to problems where appropriate basis functions are known in advance.
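A minimal sketch of such a policy in Python/NumPy, with hand-picked basis functions for a pendulum-style state (all names and values here are illustrative, not from the survey):

```python
import numpy as np

def linear_policy(theta, phi, x):
    """Evaluate pi_theta(x) = theta^T phi(x).

    theta: (n_features, n_actions) parameter matrix
    phi:   feature map, a callable x -> (n_features,) vector
    """
    return theta.T @ phi(x)

# Hand-picked basis functions for a state x = [angle, velocity].
phi = lambda x: np.array([1.0, np.sin(x[0]), x[1]])
theta = np.zeros((3, 1))  # one action dimension
u = linear_policy(theta, phi, np.array([0.3, -0.1]))
```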

Radial Basis Function Networks

An RBF policy $\pi_\theta$ is given as

$$\pi_\theta(x) = w^T \phi(x), \qquad \phi_i(x) = \exp\left(-\tfrac{1}{2}(x - \mu_i)^T D_i (x - \mu_i)\right),$$

where $D_i = \mathrm{diag}(d_i)$. The parameters $\beta = \{\mu_i, d_i\}_{i=1,\dots,n}$ of the basis functions are now free parameters to be learned; hence $\theta = \{w, \beta\}$.
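The RBF policy can be sketched the same way; here the centers $\mu_i$, precisions $d_i$, and weights $w$ are all free parameters of $\theta$ (this code is my own illustration, assuming a one-dimensional action):

```python
import numpy as np

def rbf_features(x, mus, ds):
    """phi_i(x) = exp(-0.5 (x - mu_i)^T D_i (x - mu_i)) with D_i = diag(d_i)."""
    diffs = x[None, :] - mus                    # (n_basis, dim_x)
    return np.exp(-0.5 * np.sum(ds * diffs**2, axis=1))

def rbf_policy(w, mus, ds, x):
    """pi_theta(x) = w^T phi(x), with theta = {w, mu_i, d_i}."""
    return w @ rbf_features(x, mus, ds)

# Example: 5 Gaussian basis functions over a 2-D state space.
rng = np.random.default_rng(0)
mus = rng.uniform(-1, 1, size=(5, 2))  # centers mu_i
ds = np.ones((5, 2))                   # diagonal precisions d_i
w = rng.normal(size=5)                 # linear weights
u = rbf_policy(w, mus, ds, np.array([0.2, -0.4]))
```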

Dynamic Movement Primitives

DMPs are the most widely used time-dependent policy representation in robotics. The key principle is to use a linear spring-damper system that is modulated by a nonlinear forcing function:

$$\ddot{y}_t = \tau^2 \alpha_y \big(\beta_y (g - y_t) - \dot{y}_t\big) + \tau^2 f_t,$$

where the variable $y_t$ specifies the desired joint position, $\tau$ is the time-scaling coefficient, the coefficients $\alpha_y$ and $\beta_y$ define the spring and damping constants, and the goal parameter $g$ is the unique point attractor of the system. The forcing function $f_t$ shifts this attractor during the movement and thereby shapes the trajectory toward $g$.
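A direct transcription of this equation as a Python function (the constants $\alpha_y = 25$ and $\beta_y = \alpha_y/4$ are common critically damped choices, assumed here rather than taken from the survey):

```python
def dmp_acceleration(y, yd, g, f, tau=1.0, alpha_y=25.0, beta_y=6.25):
    """Desired joint acceleration of the spring-damper system plus forcing:
    ydd = tau^2 * alpha_y * (beta_y * (g - y) - yd) + tau^2 * f
    """
    return tau**2 * alpha_y * (beta_y * (g - y) - yd) + tau**2 * f
```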

One key innovation of the DMP approach is the use of a phase variable $z_t$ to scale the execution speed of the movement:

$$\dot{z}_t = -\tau \alpha_z z_t, \qquad z_0 = 1.$$
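The phase thus decays exponentially from 1 toward 0 and can be computed in closed form or integrated step by step; a small sketch (my own notation):

```python
import numpy as np

def phase(t, tau, alpha_z):
    """Closed-form solution of dz/dt = -tau * alpha_z * z with z(0) = 1."""
    return np.exp(-tau * alpha_z * t)

def phase_step(z, dt, tau, alpha_z):
    """One explicit Euler step of the phase dynamics, for simulation loops."""
    return z + dt * (-tau * alpha_z * z)
```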

For each degree of freedom, an individual spring-damper system and forcing function are used. The forcing function is a normalized weighted sum of $K$ basis functions of the phase:

$$f(z) = \frac{\sum_{i=1}^{K} \phi_i(z)\, w_i}{\sum_{i=1}^{K} \phi_i(z)}\, z, \qquad \phi_i(z) = \exp\left(-\tfrac{1}{2\sigma_i^2}(z - c_i)^2\right).$$

The parameters $w_i$ are denoted the shape parameters of the DMP, as they modulate the acceleration profile and hence indirectly specify the shape of the movement. Because the phase $z_t$ decays to zero, the forcing function vanishes over time and the dynamics reduce to the stable spring-damper system, so the nonlinear dynamical system is globally stable. Intuitively, the goal parameter $g$ specifies the final position, while the shape parameters $w_i$ specify how that position is reached.
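A sketch of the forcing function in Python, assuming centers $c_i$ spread over the range of the phase and a shared bandwidth (names and values are illustrative):

```python
import numpy as np

def forcing(z, w, centers, sigmas):
    """f(z) = (sum_i phi_i(z) * w_i) / (sum_i phi_i(z)) * z."""
    psi = np.exp(-(z - centers)**2 / (2.0 * sigmas**2))
    # The small constant guards against division by zero when z is
    # far from all basis-function centers.
    return (psi @ w) / (psi.sum() + 1e-10) * z

# Example: K = 10 basis functions with centers spread over (0, 1].
f = forcing(z=0.5, w=np.zeros(10),
            centers=np.linspace(0.05, 1.0, 10),
            sigmas=np.full(10, 0.05))
```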

A policy $\pi_\theta(x_t, t)$ specified by a DMP directly controls the acceleration of the joint and is given by:

$$\pi_\theta(x_t, t) = \tau^2 \alpha_y \big(\beta_y (g - y_t) - \dot{y}_t\big) + \tau^2 f(z_t).$$

Note that the DMP policy is linear in the shape parameters $w$ and the goal attractor $g$, but nonlinear in the time-scaling constant $\tau$. Hence $\theta = \{w, g, \tau\}$.
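Putting the pieces together, a minimal Euler-integration rollout of a single-DoF DMP might look as follows; this is a sketch under my own parameter choices, not code from the survey:

```python
import numpy as np

def dmp_rollout(w, g, tau, y0=0.0, alpha_y=25.0, beta_y=6.25,
                alpha_z=3.0, dt=0.002, T=1.0):
    """Euler-integrate a single-DoF DMP and return the joint trajectory."""
    n_basis = len(w)
    # Spread the basis centers along the exponential path of the phase.
    centers = np.exp(-alpha_z * np.linspace(0.0, 1.0, n_basis))
    sigmas = np.full(n_basis, 0.05)
    y, yd, z = y0, 0.0, 1.0
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-(z - centers)**2 / (2.0 * sigmas**2))
        f = (psi @ w) / (psi.sum() + 1e-10) * z       # forcing function f(z)
        # Policy output: the commanded joint acceleration.
        ydd = tau**2 * alpha_y * (beta_y * (g - y) - yd) + tau**2 * f
        yd += dt * ydd
        y += dt * yd
        z += dt * (-tau * alpha_z * z)                # phase dynamics
        traj.append(y)
    return np.array(traj)

# With w = 0 the forcing term vanishes and the trajectory converges to g.
trajectory = dmp_rollout(w=np.zeros(10), g=1.0, tau=1.0)
```

With zero shape parameters the rollout reproduces the pure spring-damper behavior; learning then adjusts $w$ (and possibly $g$ and $\tau$) to shape the movement.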

Miscellaneous Representations

There exist other representations as well, such as central pattern generators for robot walking and feed-forward neural networks, which have mainly been used in simulation.
