Reference:
https://mbernste.github.io/posts/elbo/
https://mbernste.github.io/posts/variational_inference/
https://mbernste.github.io/posts/em/
https://mbernste.github.io/posts/gmm_em/
Theodoridis S. Machine learning: a Bayesian and optimization perspective[M]. Academic press, 2015.
Content
Latent variable model
- We posit that our observed data x x x is a realization from some random variable X X X.
- We posit the existence of another random variable Z Z Z where X X X and Z Z Z are distributed according to a joint distribution p ( X , Z ; θ ) p(X,Z;\theta) p(X,Z;θ) where θ \theta θ parameterizes the distribution.
- Our data is only a realization of X X X, not Z Z Z, and therefore Z Z Z remains unobserved (i.e. latent).
There are two predominant tasks that we may be interested in accomplishing:
- Given some fixed value for θ \theta θ, compute the posterior distribution p ( Z ∣ X ; θ ) p(Z|X;\theta) p(Z∣X;θ) [Can be solved by variational inference]
- Given that θ \theta θ is unknown, find the maximum likelihood estimate of θ \theta θ [Can be solved by EM]
Both variational inference and EM rely on the ELBO.
The Evidence Lower Bound (ELBO)
What is the ELBO?
To understand the evidence lower bound, we must first understand what we mean by “evidence”: it is just a name given to the likelihood function evaluated at a fixed
θ
\theta
θ
evidence
:
=
log
p
(
x
;
θ
)
(ELBO.1)
\text{evidence}:=\log p(x;\theta) \tag{ELBO.1}
evidence:=logp(x;θ)(ELBO.1)
Why is this quantity called the “evidence”?
Intuitively, if we have chosen the right model p p p and θ \theta θ, then we would expect that the marginal probability of our observed data x x x, would be high. Thus, a higher value of log p ( x ; θ ) \logp(x;θ) logp(x;θ) indicates, in some sense, that we may be on the right track with the model p p p and parameters θ \theta θ that we have chosen. That is, this quantity is “evidence” that we have chosen the right model for the data.
If we happen to know (or posit) that
Z
Z
Z follows some distribution denoted by
q
q
q, s.t.
p
(
x
,
z
;
θ
)
:
=
p
(
x
∣
z
;
θ
)
q
(
z
)
(ELBO.2)
p(x,z;\theta):=p(x|z;\theta)q(z) \tag{ELBO.2}
p(x,z;θ):=p(x∣z;θ)q(z)(ELBO.2)
Then the evidence lower bound is just a lower bound on the evidence that makes use of the known
q
q
q. Specifically,
log
p
(
x
;
θ
)
≥
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
(ELBO.3)
\log p(x;\theta)\ge E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.3}
logp(x;θ)≥EZ∼q[logq(Z)p(x,Z;θ)](ELBO.3)
where the ELBO is simply the right-hand side of the above equation:
E
L
B
O
:
=
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
(ELBO.4)
ELBO:=E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.4}
ELBO:=EZ∼q[logq(Z)p(x,Z;θ)](ELBO.4)
Derivation
Jensen’s Inequality: if
X
X
X is a random variable and
φ
\varphi
φ is a convex (concave) function, then
φ
(
E
[
X
]
)
≤
(
≥
)
E
[
φ
(
X
)
]
\varphi(E[X])\le (\ge) E[\varphi(X)]
φ(E[X])≤(≥)E[φ(X)]
Since
log
(
⋅
)
\log (\cdot)
log(⋅) is a concave function,
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
≤
log
[
E
Z
∼
q
(
p
(
x
,
Z
;
θ
)
q
(
Z
)
)
]
=
log
[
∫
q
(
Z
)
p
(
x
,
Z
;
θ
)
q
(
Z
)
d
z
]
=
log
p
(
x
;
θ
)
\begin{aligned} E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right]&\le \log \left[E_{Z\sim q}\left( \frac{p(x,Z;\theta)}{q(Z)}\right) \right]\\ &=\log \left[\int q(Z)\frac{p(x,Z;\theta)}{q(Z)}dz \right]\\ &=\log p(x;\theta) \end{aligned}
EZ∼q[logq(Z)p(x,Z;θ)]≤log[EZ∼q(q(Z)p(x,Z;θ))]=log[∫q(Z)q(Z)p(x,Z;θ)dz]=logp(x;θ)
The gap between the evidence and the ELBO
It turns out that the gap between the evidence and the ELBO is precisely the Kullback Leibler divergence between
q
(
Z
)
q(Z)
q(Z) and
p
(
Z
∣
x
;
θ
)
p(Z|x;θ)
p(Z∣x;θ).
evidence
−
ELBO
:
=
log
p
(
x
;
θ
)
−
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
=
K
L
(
q
(
Z
)
∥
p
(
Z
∣
x
;
θ
)
)
(ELBO.5)
\text{evidence}-\text{ELBO}:=\log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] =KL(q(Z)\|p(Z|x;\theta)) \tag{ELBO.5}
evidence−ELBO:=logp(x;θ)−EZ∼q[logq(Z)p(x,Z;θ)]=KL(q(Z)∥p(Z∣x;θ))(ELBO.5)
This fact forms the basis of the [variational inference algorithm] for approximate Bayesian inference.
Derivation
log
p
(
x
;
θ
)
−
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
=
∫
q
(
Z
)
log
p
(
x
;
θ
)
d
z
−
∫
q
(
Z
)
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
d
z
=
∫
q
(
Z
)
log
p
(
x
;
θ
)
q
(
Z
)
p
(
x
,
Z
;
θ
)
d
z
=
∫
q
(
Z
)
log
q
(
Z
)
p
(
Z
∣
x
;
θ
)
d
z
=
K
L
(
q
(
Z
)
∥
p
(
Z
∣
x
;
θ
)
)
\begin{aligned} \log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] &= \int q(Z)\log p(x;\theta)dz-\int q(Z)\log \frac{p(x,Z;\theta)}{q(Z)}dz\\ &= \int q(Z) \log \frac{p(x;\theta) q(Z)}{p(x,Z;\theta)}dz\\ &= \int q(Z) \log \frac{q(Z)}{p(Z|x;\theta)}dz\\ &= KL(q(Z)\|p(Z|x;\theta)) \end{aligned}
logp(x;θ)−EZ∼q[logq(Z)p(x,Z;θ)]=∫q(Z)logp(x;θ)dz−∫q(Z)logq(Z)p(x,Z;θ)dz=∫q(Z)logp(x,Z;θ)p(x;θ)q(Z)dz=∫q(Z)logp(Z∣x;θ)q(Z)dz=KL(q(Z)∥p(Z∣x;θ))
Variational Inference
Why variational inference?
Variational inference is a paradigm for estimating a posterior distribution when computing it explicitly is intractable.
Assume that we have a model that involves hidden random variables Z Z Z, observed random variables X X X, and some posited probabilistic model over the hidden and the observed random variables P ( X , Z ) P(X,Z) P(X,Z). The goal is to compute the posterior distribution P ( Z ∣ X ) P(Z|X) P(Z∣X).
Ideally, we would do so by using Bayes theorem:
p
(
z
∣
x
)
=
p
(
x
∣
z
)
p
(
z
)
p
(
x
)
p(z|x)=\frac{p(x|z)p(z)}{p(x)}
p(z∣x)=p(x)p(x∣z)p(z)
In practice, it is often difficult to compute
p
(
z
∣
x
)
p(z|x)
p(z∣x) via Bayes theorem because the denominator
p
(
x
)
p(x)
p(x) does not have a closed form. Usually, the denominator
p
(
x
)
p(x)
p(x) can be only be expressed as an integral that marginalizes over
z
z
z:
p
(
x
)
=
∫
p
(
x
,
z
)
d
z
p(x)=\int p(x,z) dz
p(x)=∫p(x,z)dz. In such scenarios, we’re often forced to approximate
p
(
z
∣
x
)
p(z|x)
p(z∣x) rather than compute it directly. Variational inference is one such approximation technique.
Details
Instead of computing
p
(
z
∣
x
)
p(z|x)
p(z∣x) exactly via Bayes theorem, variational inference attempts to find another distribution
q
(
z
)
q(z)
q(z) that is ‘close’ to
p
(
z
∣
x
)
p(z|x)
p(z∣x), where the ‘closeness’ is measured by the KL-divergence.
K
L
(
q
(
Z
)
∥
p
(
Z
∣
x
)
)
=
∫
q
(
Z
)
log
q
(
Z
)
p
(
Z
∣
x
)
d
z
=
E
Z
∼
q
[
log
q
(
Z
)
p
(
Z
∣
x
)
]
(VI.1)
KL(q(Z)\|p(Z|x))=\int q(Z) \log \frac{q(Z)}{p(Z|x)}dz=E_{Z\sim q}\left[ \log \frac{q(Z)}{p(Z|x)}\right] \tag{VI.1}
KL(q(Z)∥p(Z∣x))=∫q(Z)logp(Z∣x)q(Z)dz=EZ∼q[logp(Z∣x)q(Z)](VI.1)
We restrict the search for
q
(
z
)
q(z)
q(z) to a family of surrogate distributions over
Z
Z
Z, called the variational distribution family, denoted by the set of distributions
Q
\mathcal Q
Q. Each member of
Q
\mathcal Q
Q is characterized by the values of a set of variational parameters
ϕ
\phi
ϕ.
Then we can formulate the optimization problem as:
q
^
:
=
q
(
z
∣
ϕ
^
)
=
arg
min
q
K
L
(
q
(
Z
)
∥
p
(
Z
∣
x
)
)
(VI.2)
\hat q :=q(z|\hat \phi)=\arg\min_q KL(q(Z)\|p(Z|x)) \tag{VI.2}
q^:=q(z∣ϕ^)=argqminKL(q(Z)∥p(Z∣x))(VI.2)
From
(
E
L
B
O
.
5
)
(ELBO.5)
(ELBO.5), we know that minimizing the KL-divergence is equivalent to maximizing the ELBO:
q
^
:
=
q
(
z
∣
ϕ
^
)
=
arg
max
q
E
L
B
O
(
q
)
(VI.3)
\hat q :=q(z|\hat \phi)=\arg\max_q ELBO(q) \tag{VI.3}
q^:=q(z∣ϕ^)=argqmaxELBO(q)(VI.3)
where
E
L
B
O
(
q
)
:
=
E
Z
∼
q
[
log
p
(
x
,
Z
)
]
−
E
Z
∼
q
[
log
p
(
Z
)
]
ELBO(q):=E_{Z\sim q}[\log p(x,Z)]-E_{Z\sim q}[\log p(Z)]
ELBO(q):=EZ∼q[logp(x,Z)]−EZ∼q[logp(Z)].
Why is the method called ‘variational’ inference?
The term “variational” in “variational inference” comes from the mathematical area of the calculus of variations. The calculus of variations is all about optimization problems that optimize functions of functions (called functionals).
More specifically, let’s say we have some set of functions F \mathcal{F} F where each f ∈ F f \in \mathcal{F} f∈F maps items from some set A A A to some set B B B. That is,
f : A → B f: A \rightarrow B f:A→B
Let’s say we have some function g g g that maps functions in F \mathcal{F} F to real numbers R \mathbb{R} R. That is,
g : F → R g: \mathcal{F} \rightarrow \mathbb{R} g:F→R
Then, we may wish to solve an optimization problem of the form:
arg max f ∈ F g ( f ) \arg \max _{f \in \mathcal{F}} g(f) argf∈Fmaxg(f)
This is precisely the problem addressed in the calculus of variations. In the case of variational inference, the functional, g g g, that we are optimizing is the ELBO. The set of functions, F \mathcal{F} F, that we are searching over is the set of measurable functions in the variational family, Q \mathcal{Q} Q.
Expectation-Maximization (EM)
Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a [latent variable model](# Latent variable model).
Despite not observing
z
z
z, we wish to find the value for
θ
\theta
θ that makes
x
x
x most likely under our model. Since we have not observed
z
z
z, our likelihood function must marginalize over
z
z
z:
l
(
θ
)
:
=
log
p
(
x
;
θ
)
=
log
∫
p
(
x
,
z
;
θ
)
d
z
(EM.1)
l(\theta):=\log p(x;\theta)=\log \int p(x,z;\theta)dz \tag{EM.1}
l(θ):=logp(x;θ)=log∫p(x,z;θ)dz(EM.1)
In practice, maximizing this function over
θ
\theta
θ may be difficult to do analytically due to this integral. For such situations, the EM algorithm may provide a method for computing a local maximum of this function with respect to
θ
\theta
θ.
Description of EM
The EM algorithm alternates between two steps: an expectation-step (E-step) and a maximization-step (M-step). Throughout execution, the algorithm maintains an estimate of the parameters θ \theta θ that is updated on each iteration. As the algorithm progresses, this estimate of θ \theta θ converges to a local maximum of l ( θ ) l(\theta) l(θ).
Let t t t denote a given step of the algorithm. On the t t t-th step, we’ll denote θ t \theta_t θt as the t t t-th estimate of θ \theta θ.
-
E-Step
-
Compute the conditional probability p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(z∣x;θt)
-
Then formulate the Q-function:
Q t ( θ ) : = E Z ∣ x , θ t [ log p ( x , z ; θ ) ] = ∫ p ( z ∣ x ; θ t ) log p ( x , z ; θ ) d z (EM.2) Q_t(\theta):=E_{Z|x,\theta_t}[\log p(x,z;\theta)]=\int p(z|x;\theta_t)\log p(x,z;\theta)dz \tag{EM.2} Qt(θ):=EZ∣x,θt[logp(x,z;θ)]=∫p(z∣x;θt)logp(x,z;θ)dz(EM.2)
-
-
M-Step
Assign θ t + 1 \theta_{t+1} θt+1 to be the value that maximizes the Q-function that was formulated on the previous E-step:
θ t + 1 : = arg max θ Q t ( θ ) (EM.3) \theta_{t+1}:=\arg \max _\theta Q_t(\theta) \tag{EM.3} θt+1:=argθmaxQt(θ)(EM.3)
Note:
The core of formulating an EM algorithm for a particular model is calculating p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(z∣x;θt). If the model admits an easy calculation of this function, then the EM algorithm may be a good a choice.
We repeat the E-Step and M-Step until the difference between θ t \theta_t θt and θ t + 1 \theta_{t+1} θt+1 is small, indicating convergence. We also note that θ 0 \theta_0 θ0 can be set to an arbitrary value – usually set randomly. Since EM finds a local maximum of l ( θ ) l(\theta) l(θ), it is often wise to try many different θ 0 \theta_0 θ0’s.
EM is coordinate ascent on the evidence lower bound
What is coordinate ascent?
Given a function f ( a , b ) f(a,b) f(a,b) that takes two arguments a a a and b b b, the idea of coordinate ascent is that we will iteratively fix a a a to some value a ^ \hat a a^ and then choose for b b b the value b ^ \hat b b^ that maximizes f ( a ^ , b ) f(\hat a,b) f(a^,b). Given b ^ \hat b b^, we then re-assign to a a a the value that maximizes f ( a , b ^ ) f(a,\hat b) f(a,b^). This procedure is repeated until convergence.
In our setting, we do not yet know which value to use for θ \theta θ, nor do we know the distribution q q q for Z Z Z. Thus, the ELBO becomes a function of both q q q and θ \theta θ:
F
(
q
,
θ
)
:
=
E
Z
∼
q
[
log
p
(
x
,
Z
;
θ
)
q
(
Z
)
]
(EM.4)
F(q,\theta):=E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{EM.4}
F(q,θ):=EZ∼q[logq(Z)p(x,Z;θ)](EM.4)
EM is just coordinate ascent on the ELBO. In the E-step, we fix
θ
\theta
θ and solve for
q
q
q. In the M-step, we fix
q
q
q and solve for
θ
\theta
θ.
-
Fix θ \theta θ to θ t \theta_t θt (the current value for θ \theta θ) and solve for q q q:
argmax q F ( q , θ t ) = argmax q E z ∼ q [ log p ( Z , x ; θ ) q ( Z ) ] = argmin q K L ( q ( z ) ∥ p ( z ∣ x ; θ t ) ) = p ( z ∣ x ; θ t ) \begin{aligned} \operatorname{argmax}_{q} F\left(q, \theta_{t}\right) &=\operatorname{argmax}_{q} E_{z \sim q}\left[\log \frac{p(Z, x ; \theta)}{q(Z)}\right] \\ &=\operatorname{argmin}_{q} \mathrm{KL}\left(q(z) \| p\left(z | x ; \theta_{t}\right)\right) \\ &=p\left(z | x ; \theta_{t}\right) \end{aligned} argmaxqF(q,θt)=argmaxqEz∼q[logq(Z)p(Z,x;θ)]=argminqKL(q(z)∥p(z∣x;θt))=p(z∣x;θt)
[[Make use of Eq. ( E L B O . 5 ) (ELBO.5) (ELBO.5)]](# The gap between the evidence and the ELBO)Notably, computing p ( z ∣ x ; θ t ) p\left(z | x ; \theta_{t}\right) p(z∣x;θt) is exactly what is needed to formulate the Q-function in the E-Step.
-
Next, if we hold q q q fixed to q t : = p ( z ∣ x ; θ t ) q_{t}:=p\left(z | x ; \theta_{t}\right) qt:=p(z∣x;θt), we see that maximizing F ( q t , θ ) F\left(q_{t}, \theta\right) F(qt,θ) with respect to θ \theta θ is equivalent to maximizing the Q-function:
argmax θ F ( q t , θ ) = argmax θ E z ∼ q [ log p ( Z , x ; θ ) p ( z ∣ x ; θ t ) ] = argmax θ E Z ∣ x , θ t [ log p ( x , z ; θ ) ] \begin{aligned} \operatorname{argmax}_{\theta} F\left(q_{t}, \theta\right) &=\operatorname{argmax}_{\theta} E_{z \sim q}\left[\log \frac{p(Z, x ; \theta)}{p\left(z | x ; \theta_{t}\right)}\right] \\ &=\operatorname{argmax}_{\theta} E_{Z | x, \theta_{t}}[\log p(x, z ; \theta)] \end{aligned} argmaxθF(qt,θ)=argmaxθEz∼q[logp(z∣x;θt)p(Z,x;θ)]=argmaxθEZ∣x,θt[logp(x,z;θ)]
The last line in the above derivation is simply maximizing the Q-function - exactly the M-Step.
Visualization
-
Coordinate ascent on the evidence lower bound
-
EM algorithm - overview
We can observe that EM is in line with a more general class of optimization algorithms, known as minorize-maximization (MM) methods. A surrogate function that minorizes the cost function (lower bound), which is easier to be maximized, is employed. Then maximizing this lower bound iteratively pushes the cost function to a local maximum.
Intuition behind the Q-function
-
A generalization of the likelihood function
In the case in which Z Z Z is a discrete random variable, the Q-function can be interpreted as a “generalized” version of the complete data likelihood function.
Let us say that both X X X and Z Z Z are observed where X = x X=x X=x and Z = z Z=z Z=z. Then, we could write the complete data likelihood as
log p ( x , z ; θ ) : = ∑ z ′ ∈ Z I ( z = z ′ ) log p ( x , z ′ ; θ ) \log p(x, z ; \theta):=\sum_{z^{\prime} \in \mathcal Z} \mathbb{I}\left(z=z^{\prime}\right) \log p\left(x, z^{\prime} ; \theta\right) logp(x,z;θ):=z′∈Z∑I(z=z′)logp(x,z′;θ)
where Z \mathcal Z Z is the support of Z Z Z's probability mass function and I ( z = z ′ ) \mathbb{I}\left(z=z^{\prime}\right) I(z=z′) is an indicator that equals one when z ′ = z z'=z z′=z and equals zero otherwise.Now recall the Q-function:
Q t ( θ ) : = ∑ z ′ p ( z ′ ∣ x ; θ t ) log p ( x , z ′ ; θ ) Q_t(\theta):=\sum_{z'} p(z'|x;\theta_t)\log p(x,z';\theta) Qt(θ):=z′∑p(z′∣x;θt)logp(x,z′;θ)
We note that the difference between these two functions is that the indicator variable in the complete data likelihood I ( z = z ′ ) \mathbb{I}\left(z=z^{\prime}\right) I(z=z′) is replaced by the probability p ( z ′ ∣ x ; θ t ) p(z'|x;\theta_t) p(z′∣x;θt). This leads us to viewing the Q-function as a sort of generalization of the complete data likelihood. That is, p ( z ′ ∣ x ; θ t ) p(z'|x;\theta_t) p(z′∣x;θt) acts as a weight for its corresponding term in the summation, and measures our current certainty that the hidden data is equal to z ′ z' z′. -
A likelihood function over a hypothetical dataset
Another way to view the Q-function is as a complete data likelihood function over a hypothetical dataset where this hypothetical dataset is generated from a distribution that depends on our current best guess of θ \theta θ (i.e. θ t \theta_t θt at the t t t-th iteration).
That is, let us assume that we generate a very large set of values for Z Z Z drawn i.i.d. from the distribution specified by p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(z∣x;θt):
z 1 ′ , z 2 ′ , ⋯ , z n ′ ∼ p ( z ∣ x ; θ t ) z_1',z_2',\cdots,z_n' \sim p(z|x;\theta_t) z1′,z2′,⋯,zn′∼p(z∣x;θt)
For each z i ′ z_i' zi′, we create a new ‘sub’-dataset ( x , z i ′ ) (x,z_i') (x,zi′). Note that the observed data x x x is duplicated in each of these sub-datasets. We then merge all of the datasets ( x , z 1 ′ ) , ⋯ , ( x , z n ′ ) (x,z_1'),\cdots,(x,z_n') (x,z1′),⋯,(x,zn′) to form our new hypothetical dataset. Then, the likelihood over these data is
l ′ ( θ ) : = ∏ i = 1 n p ( x , z i ′ ; θ ) l'(\theta):=\prod_{i=1}^n p(x,z_i';\theta) l′(θ):=i=1∏np(x,zi′;θ)
Now we will show that maximizing the Q-function is equivalent to maximizing this likelihood function over the hypothetical data.Let Z ′ : = { z 1 ′ , z 2 ′ , ⋯ , z n ′ } Z':=\{z_1',z_2',\cdots,z_n'\} Z′:={z1′,z2′,⋯,zn′}. Now let c ( z ) c(z) c(z) be the cardinality of { z ′ ∈ Z ′ ∣ z ′ = z } \{z'\in Z'|z'=z\} {z′∈Z′∣z′=z}. That is, c ( z ) c(z) c(z) is the number of items in Z ′ Z' Z′ that equal z z z. As n n n approaches infinity then
c ( z ) ∑ z ∗ c ( z ∗ ) = p ( z ∣ x ; θ t ) \frac{c(z)}{\sum_{z^*}c(z^*)}=p(z|x;\theta_t) ∑z∗c(z∗)c(z)=p(z∣x;θt)
Finally we can formulate our maximum likelihood problem on this hypothetical data:
argmax θ l ′ ( θ ) : = argmax θ ∏ i = 1 n p ( x , z i ′ ; θ ) = argmax θ ∏ z ∈ Z p ( x , z ; θ ) c ( z ) = argmax θ ∑ z ∈ Z c ( z ) ∑ z ∗ c ( z ∗ ) log p ( x , z ; θ ) = argmax θ ∑ z ∈ Z p ( z ∣ x ; θ t ) log p ( x , z ; θ ) \begin{aligned} \operatorname{argmax}_{\theta} l^{\prime}(\theta) &:=\operatorname{argmax}_{\theta} \prod_{i=1}^{n} p\left(x, z_{i}^{\prime} ; \theta\right) \\ &=\operatorname{argmax}_{\theta} \prod_{z \in \mathscr{Z}} p(x, z ; \theta)^{c(z)} \\ &=\operatorname{argmax}_{\theta} \sum_{z \in Z} \frac{c(z)}{\sum_{z^{*}} c\left(z^{*}\right)} \log p(x, z ; \theta) \\ &=\operatorname{argmax}_{\theta} \sum_{z \in Z} p\left(z | x ; \theta_{t}\right) \log p(x, z ; \theta) \end{aligned} argmaxθl′(θ):=argmaxθi=1∏np(x,zi′;θ)=argmaxθz∈Z∏p(x,z;θ)c(z)=argmaxθz∈Z∑∑z∗c(z∗)c(z)logp(x,z;θ)=argmaxθz∈Z∑p(z∣x;θt)logp(x,z;θ)
The final line in the above calculation is the Q-function!
Gaussian Mixture Model (GMM)
Definition
Mixture modelling: Model the unknown PDF,
p
(
x
)
p(\mathbf x)
p(x), as a linear combination of different distributions, i.e.,
p
(
x
)
=
∑
k
=
1
K
α
k
p
(
x
∣
k
)
(GMM.1)
p(\mathbf x)=\sum_{k=1}^K \alpha_k p(\mathbf x|k) \tag{GMM.1}
p(x)=k=1∑Kαkp(x∣k)(GMM.1)
where
α
k
\alpha_k
αk is the parameter weighting the specific contributing PDF,
p
(
x
∣
k
)
p(\mathbf x|k)
p(x∣k). To guarantee that
p
(
x
)
p(\mathbf x)
p(x) is a PDF, the weighting parameters must be nonnegative and add to one (
∑
k
=
1
K
α
k
=
1
\sum_{k=1}^K \alpha_k=1
∑k=1Kαk=1).
The physical interpretation of Eq. ( G M M . 1 ) (GMM.1) (GMM.1) is that we are given a set of K K K distributions, p ( x ∣ k ) , k = 1 , 2 , ⋯ , K p(\mathbf x|k),k=1,2,\cdots,K p(x∣k),k=1,2,⋯,K. Each observation x n , n = 1 , 2 , ⋯ , N \mathbf x_n,n=1,2,\cdots,N xn,n=1,2,⋯,N, is drawn from one of these K K K distributions, but we are not told from which one. All we know is a set of parameters, α k , k = 1 , 2 , ⋯ , K \alpha_k,k=1,2,\cdots,K αk,k=1,2,⋯,K, each one providing the probability that a sample has been drawn from the corresponding PDF, p ( x ∣ k ) p(\mathbf x|k) p(x∣k).
It can be shown that for a large enough number of mixtures, K K K, and appropriate choice of the involved parameters, one can approximate arbitrarily close any continuous PDF.
Mixture modelling is a typical task involving latent variables; that is, the labels
k
k
k of the PDF from which an obtained observation has originated. In practice, each
p
(
x
∣
k
)
p(\mathbf x|k)
p(x∣k) is chosen from a known PDF family, parameterized via a set of parameters, and
(
G
M
M
.
1
)
(GMM.1)
(GMM.1) can be rewritten as
p
(
x
;
Θ
)
=
∑
k
=
1
K
α
k
p
(
x
∣
k
;
Θ
k
)
(GMM.2)
p(\mathbf x;\mathbf \Theta)=\sum_{k=1}^K \alpha_k p(\mathbf x|k;\mathbf \Theta_k) \tag{GMM.2}
p(x;Θ)=k=1∑Kαkp(x∣k;Θk)(GMM.2)
and the task is to estimate
(
α
k
,
Θ
k
)
,
k
=
1
,
2
,
⋯
,
K
(\alpha _k,\mathbf \Theta_k), k=1,2,\cdots,K
(αk,Θk),k=1,2,⋯,K, based on a set of observations
x
n
,
n
=
1
,
2
,
⋯
,
N
\mathbf x_n, n=1,2,\cdots,N
xn,n=1,2,⋯,N.
If we choose the PDFs
p
(
x
∣
k
;
Θ
k
)
p(\mathbf x|k;\mathbf \Theta_k)
p(x∣k;Θk) to be Gaussian, then we obtain the GMM:
p
(
x
;
Θ
)
=
∑
k
=
1
K
α
k
ϕ
(
x
;
μ
k
,
Σ
k
)
(GMM.3)
p(\mathbf x;\mathbf \Theta)=\sum_{k=1}^K \alpha_k \phi(\mathbf x;\boldsymbol \mu_k,\boldsymbol \Sigma_k) \tag{GMM.3}
p(x;Θ)=k=1∑Kαkϕ(x;μk,Σk)(GMM.3)
where
ϕ
\phi
ϕ is the PDF of the multivariate Gaussian distribution:
ϕ
(
x
;
μ
,
Σ
)
:
=
1
(
2
π
)
N
2
det
(
Σ
)
1
2
exp
[
−
1
2
(
x
−
μ
)
T
Σ
−
1
(
x
−
μ
)
]
\phi(\boldsymbol{x} ; \boldsymbol{\mu}, \boldsymbol{\Sigma}):=\frac{1}{(2 \pi)^{\frac{N}{2}} \operatorname{det}(\boldsymbol{\Sigma})^{\frac{1}{2}}} \exp \left[-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right]
ϕ(x;μ,Σ):=(2π)2Ndet(Σ)211exp[−21(x−μ)TΣ−1(x−μ)]
and
Θ
:
=
{
μ
1
,
⋯
,
μ
K
,
Σ
1
,
⋯
,
Σ
K
,
α
1
,
⋯
,
α
K
}
\mathbf \Theta:=\{\boldsymbol \mu_1,\cdots,\boldsymbol \mu_K,\boldsymbol \Sigma_1,\cdots,\boldsymbol \Sigma_K,\alpha_1,\cdots,\alpha_K \}
Θ:={μ1,⋯,μK,Σ1,⋯,ΣK,α1,⋯,αK}
Here’s an example density function for a two-dimensional GMM with three Gaussians (i.e.
K
=
3
K=3
K=3):
A model for data clustering
GMM’s provide a framework for finding clusters within data points x 1 , ⋯ , x N \mathbf x_1,\cdots,\mathbf x_N x1,⋯,xN. To cluster the data, we make a very strong assumption that our data points were samples from a GMM with K K K Gaussians. Unfortunately, we do not know the GMM’s parameters Θ \mathbf \Theta Θ, nor do we know which Gaussian generated each data point.
If we denote z 1 , ⋯ , z N z_1,\cdots,z_N z1,⋯,zN to be the label of x 1 , ⋯ , x N \mathbf x_1,\cdots,\mathbf x_N x1,⋯,xN, then x 1 , ⋯ , x N \mathbf x_1,\cdots,\mathbf x_N x1,⋯,xN constitutes the observed data and z 1 , ⋯ , z N z_1,\cdots,z_N z1,⋯,zN constitutes the latent data. The situation we are facing can be depicted in the figure below:
Parameter estimation for such a problem naturally lends itself to be treated via the [EM algorithm](# Expectation-Maximization (EM)).
Maximum-likelihood estimation for GMM’s via the EM algorithm
-
E-Step
-
Compute the conditional probability p ( Z n = k ∣ x n ; Θ k , t ) p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t}) p(Zn=k∣xn;Θk,t)
p ( Z n = k ∣ x n ; Θ k , t ) = p ( x n ∣ Z n = k ; Θ k , t ) p ( Z n = k ) ∑ k = 1 K p ( x n ∣ Z n = k ; Θ k , t ) p ( Z n = k ) = ϕ ( x n ∣ μ k , t , Σ k , t ) α k , t ∑ k = 1 K ϕ ( x n ∣ μ k , t , Σ k , t ) α k , t (GMM.4) \begin{aligned} p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t})&=\frac{p(\mathbf x_n|Z_n=k;{\mathbf \Theta}_{k,t})p(Z_n=k)}{\sum_{k=1}^K p(\mathbf x_n|Z_n=k;{\mathbf \Theta}_{k,t})p(Z_n=k)}\\ &=\frac{\phi(\mathbf x_n|{\boldsymbol \mu}_{k,t},{\boldsymbol \Sigma}_{k,t}) \alpha_{k,t}}{\sum_{k=1}^K \phi(\mathbf x_n|{\boldsymbol \mu}_{k,t},{\boldsymbol \Sigma}_{k,t}) \alpha_{k,t}} \end{aligned} \tag{GMM.4} p(Zn=k∣xn;Θk,t)=∑k=1Kp(xn∣Zn=k;Θk,t)p(Zn=k)p(xn∣Zn=k;Θk,t)p(Zn=k)=∑k=1Kϕ(xn∣μk,t,Σk,t)αk,tϕ(xn∣μk,t,Σk,t)αk,t(GMM.4)
For simplicity, denote γ n , k , t ≜ p ( Z n = k ∣ x n ; Θ k , t ) \gamma_{n,k,t}\triangleq p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t}) γn,k,t≜p(Zn=k∣xn;Θk,t). -
Then formulate the Q-function from ( G M M . 3 ) (GMM.3) (GMM.3) and ( G M M . 4 ) (GMM.4) (GMM.4):
Q t ( Θ ) : = ∑ n = 1 N ∑ k = 1 K γ n , k , t log [ α k ϕ ( x n ; μ k , Σ k ) ] (GMM.5) Q_t(\mathbf \Theta):=\sum_{n=1}^N \sum_{k=1}^K\gamma_{n,k,t}\log[\alpha_k\phi(\mathbf x_n;{\boldsymbol \mu}_{k},{\boldsymbol \Sigma}_{k}) ] \tag{GMM.5} Qt(Θ):=n=1∑Nk=1∑Kγn,k,tlog[αkϕ(xn;μk,Σk)](GMM.5)
-
-
M-Step
Assign Θ t + 1 {\mathbf \Theta}_{t+1} Θt+1 to be the value that maximizes the Q-function that was formulated on the previous E-step:
Θ t + 1 : = arg max Θ Q t ( Θ ) s.t. ∑ k = 1 K α k , t = 1 (GMM.6) {\mathbf \Theta}_{t+1}:=\arg \max _{\mathbf \Theta} Q_t(\mathbf \Theta)\quad \text{s.t. }\sum_{k=1}^K \alpha_{k,t}=1 \tag{GMM.6} Θt+1:=argΘmaxQt(Θ)s.t. k=1∑Kαk,t=1(GMM.6)The solution to this optimization problem is given by
α k , t + 1 : = 1 n ∑ i = 1 n γ n , k , t μ k , t + 1 : = 1 ∑ n = 1 N γ n , k , t ∑ n = 1 N γ n , k , t x n Σ k , t + 1 : = 1 ∑ n = 1 N γ n , k , t ∑ i = 1 n γ n , k , t ( x n − μ k , t ) ( x n − μ k , t ) T (GMM.7) \begin{aligned} \alpha_{ k,t+1} &:=\frac{1}{n} \sum_{i=1}^{n} \gamma_{n,k,t} \\ \boldsymbol{\mu}_{ k,t+1} &:=\frac{1}{\sum_{n=1}^{N} \gamma_{n,k,t}} \sum_{n=1}^{N} \gamma_{n,k,t} \boldsymbol{x}_{n} \\ \boldsymbol{\Sigma}_{ k,t+1} &:=\frac{1}{\sum_{n=1}^{N} \gamma_{n,k,t}} \sum_{i=1}^{n} \gamma_{n,k,t}\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{ k,t}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{ k,t}\right)^{T} \end{aligned}\tag{GMM.7} αk,t+1μk,t+1Σk,t+1:=n1i=1∑nγn,k,t:=∑n=1Nγn,k,t1n=1∑Nγn,k,txn:=∑n=1Nγn,k,t1i=1∑nγn,k,t(xn−μk,t)(xn−μk,t)T(GMM.7)
Note:- α k , t + 1 \alpha_{ k,t+1} αk,t+1 can be solved using Lagrange multipliers. The value of λ \lambda λ can be determined by making use of ∑ k = 1 K γ n , k , t = 1 \sum_{k=1}^K \gamma_{n,k,t}=1 ∑k=1Kγn,k,t=1.
- ∂ ∂ s ( x − s ) T W ( x − s ) = − 2 W − 1 ( x − s ) \frac{\partial}{\partial s}(\boldsymbol{x}-\boldsymbol{s})^{T} \boldsymbol{W}(\boldsymbol{x}-\boldsymbol{s})=-2 \boldsymbol{W}^{-1}(\boldsymbol{x}-\boldsymbol{s}) ∂s∂(x−s)TW(x−s)=−2W−1(x−s)
- ∂ ∂ X log ∣ det ( X ) ∣ = ( X T ) − 1 \frac{\partial}{\partial \boldsymbol{X}} \log |\operatorname{det}(\boldsymbol{X})|=\left(\boldsymbol{X}^{T}\right)^{-1} ∂X∂log∣det(X)∣=(XT)−1
- ∂ ∂ X a T X − 1 b = − ( X − 1 ) T a b T ( X − 1 ) T \frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^{T} \boldsymbol{X}^{-1} \boldsymbol{b}=-\left(\boldsymbol{X}^{-1}\right)^{T} \boldsymbol{a} \boldsymbol{b}^{T}\left(\boldsymbol{X}^{-1}\right)^{T} ∂X∂aTX−1b=−(X−1)TabT(X−1)T
The complete derivation can be found here.
Visualization: