If you major in ML, you will run into the EM algorithm, which divides into two steps:
- E step (Expectation)
- M step (Maximization)
Going through the steps alone can be confusing, so let's explore the theory behind them.
Goal
The first thing is to state our aim. E.g., given input and output data, we build a model to fit the data; our aim is to estimate the parameters of that model.
We denote $\theta$ as the parameter, $X$ as the input, $Y$ as the output, and $Z$ as the latent variable.
So we use the conditional model $P(Y \mid \theta)$, and we will use maximum likelihood to solve it:
$$L(\theta) = \log P(Y \mid \theta)$$
If there is no latent variable in the model, we can use maximum likelihood directly.
E.g.: toss a coin five times and get three heads and two tails; what is the probability of heads?
We will use maximum likelihood to deal with this. First, define the probability of heads as $\theta$ and the output (result) as $Y$. So the likelihood function is
$$L(\theta) = P(Y \mid \theta)$$
We can also write $P(Y \mid \theta)$ in terms of each single toss $y_i \in \{0, 1\}$:
$$P(y_i \mid \theta) = \theta^{y_i}(1-\theta)^{1-y_i}$$
In the end, over $n$ independent tosses,
$$L(\theta) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}$$
Plugging in the observed values and maximizing $L(\theta)$ by setting the derivative equal to zero, we get $\theta = \frac{3}{5}$.
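As a quick sanity check, here is a minimal Python sketch; the encoding of the tosses in `y` and the grid search are our own illustrative choices, not part of the original example. It confirms that the likelihood peaks at $\theta = 3/5$:

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0])  # 1 = heads, 0 = tails (three heads, two tails)

def log_likelihood(theta, y):
    # log L(theta) = sum_i [ y_i*log(theta) + (1 - y_i)*log(1 - theta) ]
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Grid search over theta in (0, 1); the analytic maximizer is 3/5.
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, y) for t in thetas])]
print(best)  # ~0.6
```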
BUT in the EM setting, there is a latent variable in the model.
Such as
$$\begin{aligned} L(\theta) &= P(Y, Z \mid \theta) \\ &= P(Y \mid Z, \theta) P(Z \mid \theta) \end{aligned}$$
Then, in the three-coin model of [1] (a hidden coin A with head probability $\pi$ decides whether we toss coin B, with head probability $p$, or coin C, with head probability $q$):
$$P(Y \mid \theta) = \prod_{i=1}^{n}\left[\pi p^{y_i}(1-p)^{1-y_i} + (1-\pi) q^{y_i}(1-q)^{1-y_i}\right]$$
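A short sketch of evaluating this observed-data likelihood; the parameter values and the sample `y` are illustrative assumptions, not from the source:

```python
import numpy as np

def observed_likelihood(y, pi, p, q):
    # P(Y|theta) = prod_i [ pi*p^{y_i}(1-p)^{1-y_i} + (1-pi)*q^{y_i}(1-q)^{1-y_i} ]
    coin_b = pi * p**y * (1 - p)**(1 - y)        # branch through coin B
    coin_c = (1 - pi) * q**y * (1 - q)**(1 - y)  # branch through coin C
    return np.prod(coin_b + coin_c)

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])  # an illustrative sample
print(observed_likelihood(y, pi=0.5, p=0.6, q=0.5))
```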
Our aim is to find
$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid \theta)$$
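Before the proof, here is a minimal EM sketch for the three-coin model above. The E-step/M-step update rules are the standard ones for this model (cf. [1]); the data sequence and starting values below are illustrative assumptions:

```python
import numpy as np

def em_three_coin(y, pi, p, q, n_iter=100):
    for _ in range(n_iter):
        # E step: mu_i = posterior probability that toss i went through coin B
        b = pi * p**y * (1 - p)**(1 - y)
        c = (1 - pi) * q**y * (1 - q)**(1 - y)
        mu = b / (b + c)
        # M step: re-estimate pi, p, q from the expected assignments
        pi = mu.mean()
        p = (mu * y).sum() / mu.sum()
        q = ((1 - mu) * y).sum() / (1 - mu).sum()
    return pi, p, q

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])  # illustrative observations
print(em_three_coin(y, pi=0.46, p=0.55, q=0.67))
```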
Proof
We have already obtained
$$L(\theta) = \log P(Y, Z \mid \theta) = \log P(Y \mid Z, \theta) P(Z \mid \theta)$$
In order to maximize $L(\theta)$, we do it step by step, hoping that each step always increases $L(\theta)$.
Jensen's Inequality
Let $f$ be a convex function, and let $X$ be a random variable. Then [2]:
$$E(f(X)) \ge f(E(X))$$
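A small numeric illustration of the inequality; the sample for $X$ and the choice $f(x) = x^2$ are arbitrary assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # equally likely outcomes of X
f = lambda v: v**2                  # a convex function

print(np.mean(f(x)))  # E[f(X)]  = 7.5
print(f(np.mean(x)))  # f(E[X]) = 6.25, so E[f(X)] >= f(E[X]) holds
```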
[1] 李航, 《统计学习方法》
[2] Andrew Ng, CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes8.pdf