Expectation-Maximization Algorithm
Policy gradient methods require the user to specify a learning rate, which can be problematic and often results in an unstable learning process or slow convergence. By formulating policy search as an inference problem with latent variables and using the EM algorithm to infer a new policy, this problem can be avoided, since no learning rate is required.
The standard EM algorithm, which is well known for determining the maximum likelihood solution of a probabilistic latent variable model, takes the parameter update as a weighted maximum likelihood estimate, which has a closed-form solution for most commonly used policies.
Let’s assume that:
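- We observe data $Y$, while $Z$ is a latent (unobserved) variable, and the joint model $p_\theta(Y, Z)$ is parameterized by $\theta$.
- The goal is to maximize the marginal log-likelihood $\log p_\theta(Y)$, for which no closed-form solution exists in general.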
An equation is said to be a closed-form solution if it solves a given problem in terms of functions and mathematical operations from a given generally-accepted set. For example, an infinite sum would generally not be considered closed-form. However, the choice of what to call closed-form and what not is rather arbitrary since a new “closed-form” function could simply be defined in terms of the infinite sum.
EM (Expectation-Maximization) is a powerful method for estimating the parameters of latent variable models. The basic idea is: if the parameter $\theta$ is known, we can infer the optimal latent variable $Z$ given the observations $Y$ (E-step); if the latent variable $Z$ is known, we can estimate $\theta$ by maximum likelihood estimation (M-step). The EM method can be seen as a kind of coordinate ascent that maximizes a lower bound of the log-likelihood.
The iterative procedure for maximizing the log-likelihood consists of the two main steps mentioned above: the Expectation step (E-step) and the Maximization step (M-step). Assume we begin at $\theta_0$. Then we execute the following iterative steps:
- Based on $\theta_t$, estimate the expectation of the latent variable $Z_t$.
- Based on $Y$ and $Z_t$, estimate the parameter $\theta_{t+1}$ by maximum likelihood estimation.
In general, what we need is not the expectation of $Z$ but its distribution, i.e. $p_{\theta_t}(Z|Y)$. To be specific, let's introduce an auxiliary variational distribution $q(Z)$ to decompose the marginal log-likelihood, using the identity $p_\theta(Y) = p_\theta(Y, Z)\,/\,p_\theta(Z|Y)$:
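Carrying out this decomposition for an arbitrary $q(Z)$ gives the standard identity

$$\log p_\theta(Y) \;=\; \underbrace{\int q(Z)\,\log\frac{p_\theta(Y, Z)}{q(Z)}\,dZ}_{\mathcal{L}(q,\,\theta)} \;+\; \underbrace{\int q(Z)\,\log\frac{q(Z)}{p_\theta(Z|Y)}\,dZ}_{KL(q(Z)\,\|\,p_\theta(Z|Y))},$$

where $\mathcal{L}(q, \theta)$ is a lower bound on $\log p_\theta(Y)$ because the KL divergence is non-negative.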
E-Step
In the E-step we update the variational distribution $q(Z)$ by minimizing the KL divergence $KL(q(Z)\,\|\,p_\theta(Z|Y))$, i.e. setting $q(Z) = p_\theta(Z|Y)$. Note that the value of the log-likelihood $\log p_\theta(Y)$ does not depend on the variational distribution $q(Z)$. In summary, the E-step is:
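$$q_{t+1}(Z) \;=\; \arg\min_{q}\,KL\big(q(Z)\,\|\,p_{\theta_t}(Z|Y)\big) \;=\; p_{\theta_t}(Z|Y),$$

which makes the lower bound tight, $\mathcal{L}(q_{t+1}, \theta_t) = \log p_{\theta_t}(Y)$.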
M-Step
In the M-step we optimize the lower bound w.r.t. $\theta$, i.e.
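$$\theta_{t+1} \;=\; \arg\max_{\theta}\,\mathcal{L}(q_{t+1}, \theta) \;=\; \arg\max_{\theta}\,\mathbb{E}_{q_{t+1}(Z)}\big[\log p_\theta(Y, Z)\big],$$

since the entropy term of $q_{t+1}$ does not depend on $\theta$. To make the E/M loop concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture; the synthetic data and all variable names are illustrative assumptions rather than part of the original derivation:

```python
import numpy as np

# Toy data: a mixture of two Gaussians (assumed for illustration only).
rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

def gaussian_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guess for theta = (mixing weights, means, variances).
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for t in range(50):
    # E-step: set q(Z) = p_theta(Z | Y), i.e. the posterior responsibilities.
    resp = pi * gaussian_pdf(Y[:, None], mu, var)          # shape (N, 2)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize E_q[log p_theta(Y, Z)]; the weighted MLE is closed-form.
    Nk = resp.sum(axis=0)
    pi = Nk / len(Y)
    mu = (resp * Y[:, None]).sum(axis=0) / Nk
    var = (resp * (Y[:, None] - mu) ** 2).sum(axis=0) / Nk

print("estimated means:", mu)  # should approach the true component means (-2, 3)
```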
Reformulate Policy Search as an Inference Problem
Let’s assume that:
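- The reward is encoded as a binary "reward event" $R$, whose probability $p(R\,|\,\tau)$ is given by (a monotone transformation of) the return of a trajectory $\tau$.
- The trajectory $\tau$ is treated as a latent variable whose distribution $p_\theta(\tau)$ is induced by the policy with parameters $\theta$.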
We would like to find a parameter vector $\theta$ that maximizes the probability of the reward event, i.e.
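$$\theta^{*} \;=\; \arg\max_{\theta}\, p_\theta(R) \;=\; \arg\max_{\theta} \int p(R\,|\,\tau)\, p_\theta(\tau)\, d\tau,$$

where the trajectory $\tau$ plays the role of the latent variable $Z$ and the reward event $R$ plays the role of the observation $Y$ in the EM derivation above.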
E-Step
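As in the general derivation, the E-step sets the variational distribution over trajectories to the posterior given the reward event, written here with the old policy parameters $\theta'$:

$$q(\tau) \;=\; p_{\theta'}(\tau\,|\,R) \;\propto\; p(R\,|\,\tau)\, p_{\theta'}(\tau).$$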
M-Step
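The M-step maximizes the expected complete-data log-likelihood under $q(\tau)$; since $p(R\,|\,\tau)$ does not depend on $\theta$, this is a reward-weighted maximum likelihood problem,

$$\theta_{\text{new}} \;=\; \arg\max_{\theta} \int q(\tau)\, \log p_\theta(\tau)\, d\tau,$$

which has a closed-form solution for most commonly used policy classes.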
EM-based Policy Search Algorithms
Monte-Carlo EM-based Policy Search
The MC-EM algorithm uses a sample-based approximation of the variational distribution $q$, i.e. in the E-step, MC-EM minimizes the KL divergence by using samples $Z_j \sim p_\theta(Z|Y)$. Subsequently, these samples $Z_j$ are used to estimate the expectation of the complete-data log-likelihood:
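$$\mathbb{E}_{q(Z)}\big[\log p_\theta(Y, Z)\big] \;\approx\; \frac{1}{N}\sum_{j=1}^{N} \log p_\theta(Y, Z_j).$$

In the policy search setting, trajectories $\tau_j$ are sampled from the old policy $p_{\theta'}(\tau)$ and weighted by $p(R\,|\,\tau_j)$, so the M-step reduces to maximizing the reward-weighted objective $\sum_j p(R\,|\,\tau_j)\,\log p_\theta(\tau_j)$.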
There are episode-based EM algorithms such as Reward-Weighted Regression (RWR) and Cost-Regularized Kernel Regression (CrKR), and step-based EM algorithms such as Episodic Reward-Weighted Regression (eRWR) and Policy Learning by Weighting Exploration with Returns (PoWER).
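To make the reward-weighted update concrete, here is a minimal sketch of an episode-based EM policy search iteration for a Gaussian search distribution over policy parameters; the toy return function, the exponential reward transformation with temperature `beta`, and all names are assumptions for illustration rather than the exact algorithms listed above:

```python
import numpy as np

def toy_return(params):
    # Hypothetical episodic return: higher the closer params are to a target.
    target = np.array([1.0, -0.5])
    return -np.sum((params - target) ** 2)

def em_policy_update(mu, cov, rng, n_samples=100, beta=5.0):
    # Sample "episodes": parameter vectors drawn from the current Gaussian policy.
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    returns = np.array([toy_return(s) for s in samples])

    # E-step (Monte-Carlo): turn returns into reward-event probabilities and
    # use them as weights, w_j proportional to exp(beta * R_j).
    weights = np.exp(beta * (returns - returns.max()))
    weights /= weights.sum()

    # M-step: weighted maximum likelihood estimate of the Gaussian policy
    # (closed-form weighted mean and covariance).
    new_mu = weights @ samples
    centered = samples - new_mu
    new_cov = (weights[:, None] * centered).T @ centered + 1e-6 * np.eye(len(mu))
    return new_mu, new_cov

rng = np.random.default_rng(0)
mu, cov = np.zeros(2), np.eye(2)
for _ in range(20):
    mu, cov = em_policy_update(mu, cov, rng)
print("final mean:", mu)  # should move toward the target [1.0, -0.5]
```

No learning rate appears anywhere in the update: the new mean and covariance come directly from the weighted maximum likelihood solution.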
Variational Inference-based Methods
The MC-EM approach uses a weighted maximum likelihood estimate to obtain the new parameters $\theta$ of the policy. It therefore averages over several modes of the reward function. Such behavior might result in slow convergence to good policies, as the average of several modes might lie in an area with low reward.
The maximization used for the MC-EM approach is equivalent to minimizing:
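$$\arg\min_{\theta}\,KL\big(p(R\,|\,\tau)\,p_{\theta'}(\tau)\,\big\|\,p_\theta(\tau)\big),$$

i.e. the moment projection of the reward-weighted trajectory distribution onto the policy class, which averages over its modes.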
Alternatively, we can use the information projection $\arg\min_\theta KL\big(p_\theta(\tau)\,\|\,p(R\,|\,\tau)\,p_{\theta'}(\tau)\big)$ to update the policy. This projection forces the new trajectory distribution $p_\theta(\tau)$ to be zero everywhere the reward-weighted trajectory distribution is zero.
- Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
- Thanks to Zhou Zhihua (周志华), Machine Learning, Tsinghua University Press.