Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
Policy evaluation strategies are used to assess the quality of the executed policy. They transform the sampled trajectories $\tau^{[i]}$ into a data set $\mathcal{D}$ that contains samples of either the state-action pairs $(x_t^{[i]}, u_t^{[i]})$ or the parameter vectors $\theta^{[i]}$. The data set $\mathcal{D}$ is subsequently processed by the policy update strategies to determine the new policy.
Step-based Policy Evaluation
In step-based policy evaluation, we decompose the sampled trajectories $\tau^{[i]}$ into their single state-action pairs $(x_t^{[i]}, u_t^{[i]})$ and estimate the quality of the single actions. The quality of an action is given by the reward to come:

$$Q_t^{[i]} = \sum_{h=t}^{T} r\big(x_h^{[i]}, u_h^{[i]}\big).$$
Algorithms based on step-based policy evaluation use a data set $\mathcal{D}_{\text{step}} = \{x^{[i]}, u^{[i]}, Q^{[i]}\}$ to determine the policy update step.
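As a minimal sketch of how such a data set can be built (the `rollout` helper and the `env.reset()`/`env.step(x, u)` interface are hypothetical placeholders, not from the survey), the rewards to come $Q_t^{[i]}$ can be computed by a single backward pass over each sampled trajectory:

```python
def rollout(policy, env, T):
    """Hypothetical helper: execute the policy for T steps and record the trajectory."""
    states, actions, rewards = [], [], []
    x = env.reset()
    for t in range(T):
        u = policy(x, t)
        x, r = env.step(x, u)          # assumed interface: returns (next state, reward)
        states.append(x); actions.append(u); rewards.append(r)
    return states, actions, rewards

def build_step_dataset(policy, env, num_rollouts, T):
    """Step-based evaluation: D_step = {(x_t, u_t, Q_t)} with Q_t the reward to come."""
    D_step = []
    for _ in range(num_rollouts):
        states, actions, rewards = rollout(policy, env, T)
        Q, Qs = 0.0, [0.0] * T
        for t in reversed(range(T)):   # backward pass accumulates the rewards to come
            Q += rewards[t]
            Qs[t] = Q
        D_step.extend(zip(states, actions, Qs))
    return D_step
```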
Episode-based Policy Evaluation
Episode-based policy evaluation strategies directly use the expected return $R^{[i]} = R(\theta^{[i]})$ to evaluate the quality of a parameter vector $\theta^{[i]}$:

$$R^{[i]} = \mathbb{E}\!\left[\sum_{t=0}^{T} r_t \,\middle|\, \theta^{[i]}\right].$$
Episode-based policy evaluation produces a data set $\mathcal{D}_{\text{ep}} = \{\theta^{[i]}, R^{[i]}\}$ and is typically combined with parameter-based exploration strategies; hence, such algorithms can be formalized as the problem of learning an upper-level policy $\pi_w(\theta)$.
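A corresponding sketch for the episode-based case (reusing the hypothetical `rollout` from above; `sample_theta` and `make_policy` are assumed placeholders for parameter-based exploration and for constructing the lower-level policy): each parameter vector is evaluated by one or more rollouts, and only the summed return is kept.

```python
def build_episode_dataset(sample_theta, make_policy, env,
                          num_samples, T, rollouts_per_theta=1):
    """Episode-based evaluation: D_ep = {(theta, R)} with R the (averaged) episode return."""
    D_ep = []
    for _ in range(num_samples):
        theta = sample_theta()            # parameter-based exploration, e.g. theta ~ pi_w
        policy = make_policy(theta)       # lower-level policy for this parameter vector
        returns = []
        for _ in range(rollouts_per_theta):
            _, _, rewards = rollout(policy, env, T)
            returns.append(sum(rewards))  # only the total return is kept, not the steps
        D_ep.append((theta, sum(returns) / len(returns)))
    return D_ep
```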
An underlying problem of episode-based evaluation is the variance of the $R^{[i]}$ estimates: since $R^{[i]}$ sums the rewards of an entire episode, its variance grows with the number of time steps and with the stochasticity of the system. For long episodes and highly stochastic systems, step-based algorithms should therefore be preferred.
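To make the variance argument concrete, here is a purely illustrative toy simulation (not from the survey) with i.i.d. noisy rewards: with per-step reward variance $\sigma^2$, the variance of the episode return is $T\sigma^2$, so longer episodes yield noisier $R^{[i]}$ estimates for the same number of rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    # Each episode return sums T noisy rewards with unit variance,
    # so Var(R) grows roughly linearly with the episode length T.
    returns = rng.normal(loc=1.0, scale=1.0, size=(5000, T)).sum(axis=1)
    print(T, returns.var())
```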
Generalization to Multiple Tasks
For generalizing the learned policies to multiple tasks, so far, mainly episode-based policy evaluation strategies have been used, which learn an upper-level policy. We define a context vector $s$ that describes all variables which do not change during the execution of a task but might change from task to task. The upper-level policy is extended to $\pi_w(\theta \mid s)$, which generalizes the lower-level policy to multiple contexts.
The problem of learning the upper-level policy then becomes maximizing the expected return over contexts and parameters, $J(w) = \iint \mu(s)\,\pi_w(\theta \mid s)\,R(\theta, s)\,d\theta\,ds$, where $\mu(s)$ denotes the distribution over contexts.
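As one common parametrization (a sketch under that assumption, not the only choice), the contextual upper-level policy can be modeled as a linear-Gaussian distribution over $\theta$ conditioned on context features; the following code samples task-specific lower-level parameters:

```python
import numpy as np

class LinearGaussianUpperPolicy:
    """pi_w(theta | s) = N(theta | W @ phi(s), Sigma) -- one common parametrization."""

    def __init__(self, dim_theta, dim_context, seed=0):
        self.W = np.zeros((dim_theta, dim_context + 1))  # weights on context features
        self.Sigma = np.eye(dim_theta)                   # exploration covariance
        self.rng = np.random.default_rng(seed)

    def features(self, s):
        return np.concatenate(([1.0], s))                # bias term + raw context

    def sample(self, s):
        """Draw a lower-level parameter vector theta for context s."""
        mean = self.W @ self.features(s)
        return self.rng.multivariate_normal(mean, self.Sigma)

# Usage: one theta per task context
policy = LinearGaussianUpperPolicy(dim_theta=3, dim_context=2)
theta = policy.sample(np.array([0.5, -1.0]))
```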