Bayesian Learning_ELBO, Variational Inference, and EM Algorithm

Reference:

https://mbernste.github.io/posts/elbo/

https://mbernste.github.io/posts/variational_inference/

https://mbernste.github.io/posts/em/

https://mbernste.github.io/posts/gmm_em/

Theodoridis S. Machine learning: a Bayesian and optimization perspective[M]. Academic press, 2015.

Latent variable model

  • We posit that our observed data x x x is a realization from some random variable X X X.
  • We posit the existence of another random variable Z Z Z where X X X and Z Z Z are distributed according to a joint distribution p ( X , Z ; θ ) p(X,Z;\theta) p(X,Z;θ) where θ \theta θ parameterizes the distribution.
  • Our data is only a realization of X X X, not Z Z Z, and therefore Z Z Z remains unobserved (i.e. latent).

There are two predominant tasks that we may be interested in accomplishing:

  1. Given some fixed value for θ \theta θ, compute the posterior distribution p ( Z ∣ X ; θ ) p(Z|X;\theta) p(ZX;θ) [Can be solved by variational inference]
  2. Given that θ \theta θ is unknown, find the maximum likelihood estimate of θ \theta θ [Can be solved by EM]

Both variational inference and EM rely on the ELBO.

The Evidence Lower Bound (ELBO)

What is the ELBO?

To understand the evidence lower bound, we must first understand what we mean by “evidence”: it is just a name given to the likelihood function evaluated at a fixed θ \theta θ
evidence : = log ⁡ p ( x ; θ ) (ELBO.1) \text{evidence}:=\log p(x;\theta) \tag{ELBO.1} evidence:=logp(x;θ)(ELBO.1)

Why is this quantity called the “evidence”?

Intuitively, if we have chosen the right model p p p and θ \theta θ, then we would expect that the marginal probability of our observed data x x x, would be high. Thus, a higher value of log ⁡ ⁡ p ( x ; θ ) \log⁡p(x;θ) logp(x;θ) indicates, in some sense, that we may be on the right track with the model p p p and parameters θ \theta θ that we have chosen. That is, this quantity is “evidence” that we have chosen the right model for the data.

If we happen to know (or posit) that Z Z Z follows some distribution denoted by q q q, s.t.
p ( x , z ; θ ) : = p ( x ∣ z ; θ ) q ( z ) (ELBO.2) p(x,z;\theta):=p(x|z;\theta)q(z) \tag{ELBO.2} p(x,z;θ):=p(xz;θ)q(z)(ELBO.2)
Then the evidence lower bound is just a lower bound on the evidence that makes use of the known q q q. Specifically,
log ⁡ p ( x ; θ ) ≥ E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] (ELBO.3) \log p(x;\theta)\ge E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.3} logp(x;θ)EZq[logq(Z)p(x,Z;θ)](ELBO.3)
where the ELBO is simply the right-hand side of the above equation:
E L B O : = E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] (ELBO.4) ELBO:=E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{ELBO.4} ELBO:=EZq[logq(Z)p(x,Z;θ)](ELBO.4)

Derivation

Jensen’s Inequality: if X X X is a random variable and φ \varphi φ is a convex (concave) function, then
φ ( E [ X ] ) ≤ ( ≥ ) E [ φ ( X ) ] \varphi(E[X])\le (\ge) E[\varphi(X)] φ(E[X])()E[φ(X)]
Since log ⁡ ( ⋅ ) \log (\cdot) log() is a concave function,
E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] ≤ log ⁡ [ E Z ∼ q ( p ( x , Z ; θ ) q ( Z ) ) ] = log ⁡ [ ∫ q ( Z ) p ( x , Z ; θ ) q ( Z ) d z ] = log ⁡ p ( x ; θ ) \begin{aligned} E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right]&\le \log \left[E_{Z\sim q}\left( \frac{p(x,Z;\theta)}{q(Z)}\right) \right]\\ &=\log \left[\int q(Z)\frac{p(x,Z;\theta)}{q(Z)}dz \right]\\ &=\log p(x;\theta) \end{aligned} EZq[logq(Z)p(x,Z;θ)]log[EZq(q(Z)p(x,Z;θ))]=log[q(Z)q(Z)p(x,Z;θ)dz]=logp(x;θ)

The gap between the evidence and the ELBO

It turns out that the gap between the evidence and the ELBO is precisely the Kullback Leibler divergence between q ( Z ) q(Z) q(Z) and p ( Z ∣ x ; θ ) p(Z|x;θ) p(Zx;θ).
evidence − ELBO : = log ⁡ p ( x ; θ ) − E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] = K L ( q ( Z ) ∥ p ( Z ∣ x ; θ ) ) (ELBO.5) \text{evidence}-\text{ELBO}:=\log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] =KL(q(Z)\|p(Z|x;\theta)) \tag{ELBO.5} evidenceELBO:=logp(x;θ)EZq[logq(Z)p(x,Z;θ)]=KL(q(Z)p(Zx;θ))(ELBO.5)
This fact forms the basis of the [variational inference algorithm] for approximate Bayesian inference.

在这里插入图片描述

Derivation
log ⁡ p ( x ; θ ) − E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] = ∫ q ( Z ) log ⁡ p ( x ; θ ) d z − ∫ q ( Z ) log ⁡ p ( x , Z ; θ ) q ( Z ) d z = ∫ q ( Z ) log ⁡ p ( x ; θ ) q ( Z ) p ( x , Z ; θ ) d z = ∫ q ( Z ) log ⁡ q ( Z ) p ( Z ∣ x ; θ ) d z = K L ( q ( Z ) ∥ p ( Z ∣ x ; θ ) ) \begin{aligned} \log p(x;\theta)-E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] &= \int q(Z)\log p(x;\theta)dz-\int q(Z)\log \frac{p(x,Z;\theta)}{q(Z)}dz\\ &= \int q(Z) \log \frac{p(x;\theta) q(Z)}{p(x,Z;\theta)}dz\\ &= \int q(Z) \log \frac{q(Z)}{p(Z|x;\theta)}dz\\ &= KL(q(Z)\|p(Z|x;\theta)) \end{aligned} logp(x;θ)EZq[logq(Z)p(x,Z;θ)]=q(Z)logp(x;θ)dzq(Z)logq(Z)p(x,Z;θ)dz=q(Z)logp(x,Z;θ)p(x;θ)q(Z)dz=q(Z)logp(Zx;θ)q(Z)dz=KL(q(Z)p(Zx;θ))

Variational Inference

Why variational inference?

Variational inference is a paradigm for estimating a posterior distribution when computing it explicitly is intractable.

Assume that we have a model that involves hidden random variables Z Z Z, observed random variables X X X, and some posited probabilistic model over the hidden and the observed random variables P ( X , Z ) P(X,Z) P(X,Z). The goal is to compute the posterior distribution P ( Z ∣ X ) P(Z|X) P(ZX).

Ideally, we would do so by using Bayes theorem:
p ( z ∣ x ) = p ( x ∣ z ) p ( z ) p ( x ) p(z|x)=\frac{p(x|z)p(z)}{p(x)} p(zx)=p(x)p(xz)p(z)
In practice, it is often difficult to compute p ( z ∣ x ) p(z|x) p(zx) via Bayes theorem because the denominator p ( x ) p(x) p(x) does not have a closed form. Usually, the denominator p ( x ) p(x) p(x) can be only be expressed as an integral that marginalizes over z z z: p ( x ) = ∫ p ( x , z ) d z p(x)=\int p(x,z) dz p(x)=p(x,z)dz. In such scenarios, we’re often forced to approximate p ( z ∣ x ) p(z|x) p(zx) rather than compute it directly. Variational inference is one such approximation technique.

Details

Instead of computing p ( z ∣ x ) p(z|x) p(zx) exactly via Bayes theorem, variational inference attempts to find another distribution q ( z ) q(z) q(z) that is ‘close’ to p ( z ∣ x ) p(z|x) p(zx), where the ‘closeness’ is measured by the KL-divergence.
K L ( q ( Z ) ∥ p ( Z ∣ x ) ) = ∫ q ( Z ) log ⁡ q ( Z ) p ( Z ∣ x ) d z = E Z ∼ q [ log ⁡ q ( Z ) p ( Z ∣ x ) ] (VI.1) KL(q(Z)\|p(Z|x))=\int q(Z) \log \frac{q(Z)}{p(Z|x)}dz=E_{Z\sim q}\left[ \log \frac{q(Z)}{p(Z|x)}\right] \tag{VI.1} KL(q(Z)p(Zx))=q(Z)logp(Zx)q(Z)dz=EZq[logp(Zx)q(Z)](VI.1)
We restrict the search for q ( z ) q(z) q(z) to a family of surrogate distributions over Z Z Z, called the variational distribution family, denoted by the set of distributions Q \mathcal Q Q. Each member of Q \mathcal Q Q is characterized by the values of a set of variational parameters ϕ \phi ϕ.

Then we can formulate the optimization problem as:
q ^ : = q ( z ∣ ϕ ^ ) = arg ⁡ min ⁡ q K L ( q ( Z ) ∥ p ( Z ∣ x ) ) (VI.2) \hat q :=q(z|\hat \phi)=\arg\min_q KL(q(Z)\|p(Z|x)) \tag{VI.2} q^:=q(zϕ^)=argqminKL(q(Z)p(Zx))(VI.2)
From ( E L B O . 5 ) (ELBO.5) (ELBO.5), we know that minimizing the KL-divergence is equivalent to maximizing the ELBO:
q ^ : = q ( z ∣ ϕ ^ ) = arg ⁡ max ⁡ q E L B O ( q ) (VI.3) \hat q :=q(z|\hat \phi)=\arg\max_q ELBO(q) \tag{VI.3} q^:=q(zϕ^)=argqmaxELBO(q)(VI.3)
where E L B O ( q ) : = E Z ∼ q [ log ⁡ p ( x , Z ) ] − E Z ∼ q [ log ⁡ p ( Z ) ] ELBO(q):=E_{Z\sim q}[\log p(x,Z)]-E_{Z\sim q}[\log p(Z)] ELBO(q):=EZq[logp(x,Z)]EZq[logp(Z)].

Why is the method called ‘variational’ inference?

The term “variational” in “variational inference” comes from the mathematical area of the calculus of variations. The calculus of variations is all about optimization problems that optimize functions of functions (called functionals).

More specifically, let’s say we have some set of functions F \mathcal{F} F where each f ∈ F f \in \mathcal{F} fF maps items from some set A A A to some set B B B. That is,
f : A → B f: A \rightarrow B f:AB
Let’s say we have some function g g g that maps functions in F \mathcal{F} F to real numbers R \mathbb{R} R. That is,
g : F → R g: \mathcal{F} \rightarrow \mathbb{R} g:FR
Then, we may wish to solve an optimization problem of the form:
arg ⁡ max ⁡ f ∈ F g ( f ) \arg \max _{f \in \mathcal{F}} g(f) argfFmaxg(f)
This is precisely the problem addressed in the calculus of variations. In the case of variational inference, the functional, g g g, that we are optimizing is the ELBO. The set of functions, F \mathcal{F} F, that we are searching over is the set of measurable functions in the variational family, Q \mathcal{Q} Q.

Expectation-Maximization (EM)

Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a [latent variable model](# Latent variable model).

Despite not observing z z z, we wish to find the value for θ \theta θ that makes x x x most likely under our model. Since we have not observed z z z, our likelihood function must marginalize over z z z:
l ( θ ) : = log ⁡ p ( x ; θ ) = log ⁡ ∫ p ( x , z ; θ ) d z (EM.1) l(\theta):=\log p(x;\theta)=\log \int p(x,z;\theta)dz \tag{EM.1} l(θ):=logp(x;θ)=logp(x,z;θ)dz(EM.1)
In practice, maximizing this function over θ \theta θ may be difficult to do analytically due to this integral. For such situations, the EM algorithm may provide a method for computing a local maximum of this function with respect to θ \theta θ.

Description of EM

The EM algorithm alternates between two steps: an expectation-step (E-step) and a maximization-step (M-step). Throughout execution, the algorithm maintains an estimate of the parameters θ \theta θ that is updated on each iteration. As the algorithm progresses, this estimate of θ \theta θ converges to a local maximum of l ( θ ) l(\theta) l(θ).

Let t t t denote a given step of the algorithm. On the t t t-th step, we’ll denote θ t \theta_t θt as the t t t-th estimate of θ \theta θ.

  • E-Step

    • Compute the conditional probability p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(zx;θt)

    • Then formulate the Q-function:
      Q t ( θ ) : = E Z ∣ x , θ t [ log ⁡ p ( x , z ; θ ) ] = ∫ p ( z ∣ x ; θ t ) log ⁡ p ( x , z ; θ ) d z (EM.2) Q_t(\theta):=E_{Z|x,\theta_t}[\log p(x,z;\theta)]=\int p(z|x;\theta_t)\log p(x,z;\theta)dz \tag{EM.2} Qt(θ):=EZx,θt[logp(x,z;θ)]=p(zx;θt)logp(x,z;θ)dz(EM.2)

  • M-Step

    Assign θ t + 1 \theta_{t+1} θt+1 to be the value that maximizes the Q-function that was formulated on the previous E-step:
    θ t + 1 : = arg ⁡ max ⁡ θ Q t ( θ ) (EM.3) \theta_{t+1}:=\arg \max _\theta Q_t(\theta) \tag{EM.3} θt+1:=argθmaxQt(θ)(EM.3)

Note:

The core of formulating an EM algorithm for a particular model is calculating p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(zx;θt). If the model admits an easy calculation of this function, then the EM algorithm may be a good a choice.

We repeat the E-Step and M-Step until the difference between θ t \theta_t θt and θ t + 1 \theta_{t+1} θt+1 is small, indicating convergence. We also note that θ 0 \theta_0 θ0 can be set to an arbitrary value – usually set randomly. Since EM finds a local maximum of l ( θ ) l(\theta) l(θ), it is often wise to try many different θ 0 \theta_0 θ0’s.

EM is coordinate ascent on the evidence lower bound

What is coordinate ascent?

Given a function f ( a , b ) f(a,b) f(a,b) that takes two arguments a a a and b b b, the idea of coordinate ascent is that we will iteratively fix a a a to some value a ^ \hat a a^ and then choose for b b b the value b ^ \hat b b^ that maximizes f ( a ^ , b ) f(\hat a,b) f(a^,b). Given b ^ \hat b b^, we then re-assign to a a a the value that maximizes f ( a , b ^ ) f(a,\hat b) f(a,b^). This procedure is repeated until convergence.

In our setting, we do not yet know which value to use for θ \theta θ, nor do we know the distribution q q q for Z Z Z. Thus, the ELBO becomes a function of both q q q and θ \theta θ:

F ( q , θ ) : = E Z ∼ q [ log ⁡ p ( x , Z ; θ ) q ( Z ) ] (EM.4) F(q,\theta):=E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)} \right] \tag{EM.4} F(q,θ):=EZq[logq(Z)p(x,Z;θ)](EM.4)
EM is just coordinate ascent on the ELBO. In the E-step, we fix θ \theta θ and solve for q q q. In the M-step, we fix q q q and solve for θ \theta θ.

  • Fix θ \theta θ to θ t \theta_t θt (the current value for θ \theta θ) and solve for q q q:
    argmax ⁡ q F ( q , θ t ) = argmax ⁡ q E z ∼ q [ log ⁡ p ( Z , x ; θ ) q ( Z ) ] = argmin ⁡ q K L ( q ( z ) ∥ p ( z ∣ x ; θ t ) ) = p ( z ∣ x ; θ t ) \begin{aligned} \operatorname{argmax}_{q} F\left(q, \theta_{t}\right) &=\operatorname{argmax}_{q} E_{z \sim q}\left[\log \frac{p(Z, x ; \theta)}{q(Z)}\right] \\ &=\operatorname{argmin}_{q} \mathrm{KL}\left(q(z) \| p\left(z | x ; \theta_{t}\right)\right) \\ &=p\left(z | x ; \theta_{t}\right) \end{aligned} argmaxqF(q,θt)=argmaxqEzq[logq(Z)p(Z,x;θ)]=argminqKL(q(z)p(zx;θt))=p(zx;θt)
    [[Make use of Eq. ( E L B O . 5 ) (ELBO.5) (ELBO.5)]](# The gap between the evidence and the ELBO)

    Notably, computing p ( z ∣ x ; θ t ) p\left(z | x ; \theta_{t}\right) p(zx;θt) is exactly what is needed to formulate the Q-function in the E-Step.

  • Next, if we hold q q q fixed to q t : = p ( z ∣ x ; θ t ) q_{t}:=p\left(z | x ; \theta_{t}\right) qt:=p(zx;θt), we see that maximizing F ( q t , θ ) F\left(q_{t}, \theta\right) F(qt,θ) with respect to θ \theta θ is equivalent to maximizing the Q-function:
    argmax ⁡ θ F ( q t , θ ) = argmax ⁡ θ E z ∼ q [ log ⁡ p ( Z , x ; θ ) p ( z ∣ x ; θ t ) ] = argmax ⁡ θ E Z ∣ x , θ t [ log ⁡ p ( x , z ; θ ) ] \begin{aligned} \operatorname{argmax}_{\theta} F\left(q_{t}, \theta\right) &=\operatorname{argmax}_{\theta} E_{z \sim q}\left[\log \frac{p(Z, x ; \theta)}{p\left(z | x ; \theta_{t}\right)}\right] \\ &=\operatorname{argmax}_{\theta} E_{Z | x, \theta_{t}}[\log p(x, z ; \theta)] \end{aligned} argmaxθF(qt,θ)=argmaxθEzq[logp(zx;θt)p(Z,x;θ)]=argmaxθEZx,θt[logp(x,z;θ)]
    The last line in the above derivation is simply maximizing the Q-function - exactly the M-Step.

Visualization

  • Coordinate ascent on the evidence lower bound
    请添加图片描述

  • EM algorithm - overview

在这里插入图片描述

We can observe that EM is in line with a more general class of optimization algorithms, known as minorize-maximization (MM) methods. A surrogate function that minorizes the cost function (lower bound), which is easier to be maximized, is employed. Then maximizing this lower bound iteratively pushes the cost function to a local maximum.

在这里插入图片描述

Intuition behind the Q-function

  • A generalization of the likelihood function

    In the case in which Z Z Z is a discrete random variable, the Q-function can be interpreted as a “generalized” version of the complete data likelihood function.

    Let us say that both X X X and Z Z Z are observed where X = x X=x X=x and Z = z Z=z Z=z. Then, we could write the complete data likelihood as
    log ⁡ p ( x , z ; θ ) : = ∑ z ′ ∈ Z I ( z = z ′ ) log ⁡ p ( x , z ′ ; θ ) \log p(x, z ; \theta):=\sum_{z^{\prime} \in \mathcal Z} \mathbb{I}\left(z=z^{\prime}\right) \log p\left(x, z^{\prime} ; \theta\right) logp(x,z;θ):=zZI(z=z)logp(x,z;θ)
    where Z \mathcal Z Z is the support of Z Z Z's probability mass function and I ( z = z ′ ) \mathbb{I}\left(z=z^{\prime}\right) I(z=z) is an indicator that equals one when z ′ = z z'=z z=z and equals zero otherwise.

    Now recall the Q-function:
    Q t ( θ ) : = ∑ z ′ p ( z ′ ∣ x ; θ t ) log ⁡ p ( x , z ′ ; θ ) Q_t(\theta):=\sum_{z'} p(z'|x;\theta_t)\log p(x,z';\theta) Qt(θ):=zp(zx;θt)logp(x,z;θ)
    We note that the difference between these two functions is that the indicator variable in the complete data likelihood I ( z = z ′ ) \mathbb{I}\left(z=z^{\prime}\right) I(z=z) is replaced by the probability p ( z ′ ∣ x ; θ t ) p(z'|x;\theta_t) p(zx;θt). This leads us to viewing the Q-function as a sort of generalization of the complete data likelihood. That is, p ( z ′ ∣ x ; θ t ) p(z'|x;\theta_t) p(zx;θt) acts as a weight for its corresponding term in the summation, and measures our current certainty that the hidden data is equal to z ′ z' z.

  • A likelihood function over a hypothetical dataset

    Another way to view the Q-function is as a complete data likelihood function over a hypothetical dataset where this hypothetical dataset is generated from a distribution that depends on our current best guess of θ \theta θ (i.e. θ t \theta_t θt at the t t t-th iteration).

    That is, let us assume that we generate a very large set of values for Z Z Z drawn i.i.d. from the distribution specified by p ( z ∣ x ; θ t ) p(z|x;\theta_t) p(zx;θt):
    z 1 ′ , z 2 ′ , ⋯   , z n ′ ∼ p ( z ∣ x ; θ t ) z_1',z_2',\cdots,z_n' \sim p(z|x;\theta_t) z1,z2,,znp(zx;θt)
    For each z i ′ z_i' zi, we create a new ‘sub’-dataset ( x , z i ′ ) (x,z_i') (x,zi). Note that the observed data x x x is duplicated in each of these sub-datasets. We then merge all of the datasets ( x , z 1 ′ ) , ⋯   , ( x , z n ′ ) (x,z_1'),\cdots,(x,z_n') (x,z1),,(x,zn) to form our new hypothetical dataset. Then, the likelihood over these data is
    l ′ ( θ ) : = ∏ i = 1 n p ( x , z i ′ ; θ ) l'(\theta):=\prod_{i=1}^n p(x,z_i';\theta) l(θ):=i=1np(x,zi;θ)
    Now we will show that maximizing the Q-function is equivalent to maximizing this likelihood function over the hypothetical data.

    Let Z ′ : = { z 1 ′ , z 2 ′ , ⋯   , z n ′ } Z':=\{z_1',z_2',\cdots,z_n'\} Z:={z1,z2,,zn}. Now let c ( z ) c(z) c(z) be the cardinality of { z ′ ∈ Z ′ ∣ z ′ = z } \{z'\in Z'|z'=z\} {zZz=z}. That is, c ( z ) c(z) c(z) is the number of items in Z ′ Z' Z that equal z z z. As n n n approaches infinity then
    c ( z ) ∑ z ∗ c ( z ∗ ) = p ( z ∣ x ; θ t ) \frac{c(z)}{\sum_{z^*}c(z^*)}=p(z|x;\theta_t) zc(z)c(z)=p(zx;θt)
    Finally we can formulate our maximum likelihood problem on this hypothetical data:
    argmax ⁡ θ l ′ ( θ ) : = argmax ⁡ θ ∏ i = 1 n p ( x , z i ′ ; θ ) = argmax ⁡ θ ∏ z ∈ Z p ( x , z ; θ ) c ( z ) = argmax ⁡ θ ∑ z ∈ Z c ( z ) ∑ z ∗ c ( z ∗ ) log ⁡ p ( x , z ; θ ) = argmax ⁡ θ ∑ z ∈ Z p ( z ∣ x ; θ t ) log ⁡ p ( x , z ; θ ) \begin{aligned} \operatorname{argmax}_{\theta} l^{\prime}(\theta) &:=\operatorname{argmax}_{\theta} \prod_{i=1}^{n} p\left(x, z_{i}^{\prime} ; \theta\right) \\ &=\operatorname{argmax}_{\theta} \prod_{z \in \mathscr{Z}} p(x, z ; \theta)^{c(z)} \\ &=\operatorname{argmax}_{\theta} \sum_{z \in Z} \frac{c(z)}{\sum_{z^{*}} c\left(z^{*}\right)} \log p(x, z ; \theta) \\ &=\operatorname{argmax}_{\theta} \sum_{z \in Z} p\left(z | x ; \theta_{t}\right) \log p(x, z ; \theta) \end{aligned} argmaxθl(θ):=argmaxθi=1np(x,zi;θ)=argmaxθzZp(x,z;θ)c(z)=argmaxθzZzc(z)c(z)logp(x,z;θ)=argmaxθzZp(zx;θt)logp(x,z;θ)
    The final line in the above calculation is the Q-function!

Gaussian Mixture Model (GMM)

Definition

Mixture modelling: Model the unknown PDF, p ( x ) p(\mathbf x) p(x), as a linear combination of different distributions, i.e.,
p ( x ) = ∑ k = 1 K α k p ( x ∣ k ) (GMM.1) p(\mathbf x)=\sum_{k=1}^K \alpha_k p(\mathbf x|k) \tag{GMM.1} p(x)=k=1Kαkp(xk)(GMM.1)
where α k \alpha_k αk is the parameter weighting the specific contributing PDF, p ( x ∣ k ) p(\mathbf x|k) p(xk). To guarantee that p ( x ) p(\mathbf x) p(x) is a PDF, the weighting parameters must be nonnegative and add to one ( ∑ k = 1 K α k = 1 \sum_{k=1}^K \alpha_k=1 k=1Kαk=1).

The physical interpretation of Eq. ( G M M . 1 ) (GMM.1) (GMM.1) is that we are given a set of K K K distributions, p ( x ∣ k ) , k = 1 , 2 , ⋯   , K p(\mathbf x|k),k=1,2,\cdots,K p(xk),k=1,2,,K. Each observation x n , n = 1 , 2 , ⋯   , N \mathbf x_n,n=1,2,\cdots,N xn,n=1,2,,N, is drawn from one of these K K K distributions, but we are not told from which one. All we know is a set of parameters, α k , k = 1 , 2 , ⋯   , K \alpha_k,k=1,2,\cdots,K αk,k=1,2,,K, each one providing the probability that a sample has been drawn from the corresponding PDF, p ( x ∣ k ) p(\mathbf x|k) p(xk).

It can be shown that for a large enough number of mixtures, K K K, and appropriate choice of the involved parameters, one can approximate arbitrarily close any continuous PDF.

Mixture modelling is a typical task involving latent variables; that is, the labels k k k of the PDF from which an obtained observation has originated. In practice, each p ( x ∣ k ) p(\mathbf x|k) p(xk) is chosen from a known PDF family, parameterized via a set of parameters, and ( G M M . 1 ) (GMM.1) (GMM.1) can be rewritten as
p ( x ; Θ ) = ∑ k = 1 K α k p ( x ∣ k ; Θ k ) (GMM.2) p(\mathbf x;\mathbf \Theta)=\sum_{k=1}^K \alpha_k p(\mathbf x|k;\mathbf \Theta_k) \tag{GMM.2} p(x;Θ)=k=1Kαkp(xk;Θk)(GMM.2)
and the task is to estimate ( α k , Θ k ) , k = 1 , 2 , ⋯   , K (\alpha _k,\mathbf \Theta_k), k=1,2,\cdots,K (αk,Θk),k=1,2,,K, based on a set of observations x n , n = 1 , 2 , ⋯   , N \mathbf x_n, n=1,2,\cdots,N xn,n=1,2,,N.

If we choose the PDFs p ( x ∣ k ; Θ k ) p(\mathbf x|k;\mathbf \Theta_k) p(xk;Θk) to be Gaussian, then we obtain the GMM:
p ( x ; Θ ) = ∑ k = 1 K α k ϕ ( x ; μ k , Σ k ) (GMM.3) p(\mathbf x;\mathbf \Theta)=\sum_{k=1}^K \alpha_k \phi(\mathbf x;\boldsymbol \mu_k,\boldsymbol \Sigma_k) \tag{GMM.3} p(x;Θ)=k=1Kαkϕ(x;μk,Σk)(GMM.3)
where ϕ \phi ϕ is the PDF of the multivariate Gaussian distribution:
ϕ ( x ; μ , Σ ) : = 1 ( 2 π ) N 2 det ⁡ ( Σ ) 1 2 exp ⁡ [ − 1 2 ( x − μ ) T Σ − 1 ( x − μ ) ] \phi(\boldsymbol{x} ; \boldsymbol{\mu}, \boldsymbol{\Sigma}):=\frac{1}{(2 \pi)^{\frac{N}{2}} \operatorname{det}(\boldsymbol{\Sigma})^{\frac{1}{2}}} \exp \left[-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right] ϕ(x;μ,Σ):=(2π)2Ndet(Σ)211exp[21(xμ)TΣ1(xμ)]
and
Θ : = { μ 1 , ⋯   , μ K , Σ 1 , ⋯   , Σ K , α 1 , ⋯   , α K } \mathbf \Theta:=\{\boldsymbol \mu_1,\cdots,\boldsymbol \mu_K,\boldsymbol \Sigma_1,\cdots,\boldsymbol \Sigma_K,\alpha_1,\cdots,\alpha_K \} Θ:={μ1,,μK,Σ1,,ΣK,α1,,αK}
Here’s an example density function for a two-dimensional GMM with three Gaussians (i.e. K = 3 K=3 K=3):

在这里插入图片描述

A model for data clustering

GMM’s provide a framework for finding clusters within data points x 1 , ⋯   , x N \mathbf x_1,\cdots,\mathbf x_N x1,,xN. To cluster the data, we make a very strong assumption that our data points were samples from a GMM with K K K Gaussians. Unfortunately, we do not know the GMM’s parameters Θ \mathbf \Theta Θ, nor do we know which Gaussian generated each data point.

If we denote z 1 , ⋯   , z N z_1,\cdots,z_N z1,,zN to be the label of x 1 , ⋯   , x N \mathbf x_1,\cdots,\mathbf x_N x1,,xN, then x 1 , ⋯   , x N \mathbf x_1,\cdots,\mathbf x_N x1,,xN constitutes the observed data and z 1 , ⋯   , z N z_1,\cdots,z_N z1,,zN constitutes the latent data. The situation we are facing can be depicted in the figure below:

在这里插入图片描述

Parameter estimation for such a problem naturally lends itself to be treated via the [EM algorithm](# Expectation-Maximization (EM)).

Maximum-likelihood estimation for GMM’s via the EM algorithm

  • E-Step

    • Compute the conditional probability p ( Z n = k ∣ x n ; Θ k , t ) p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t}) p(Zn=kxn;Θk,t)
      p ( Z n = k ∣ x n ; Θ k , t ) = p ( x n ∣ Z n = k ; Θ k , t ) p ( Z n = k ) ∑ k = 1 K p ( x n ∣ Z n = k ; Θ k , t ) p ( Z n = k ) = ϕ ( x n ∣ μ k , t , Σ k , t ) α k , t ∑ k = 1 K ϕ ( x n ∣ μ k , t , Σ k , t ) α k , t (GMM.4) \begin{aligned} p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t})&=\frac{p(\mathbf x_n|Z_n=k;{\mathbf \Theta}_{k,t})p(Z_n=k)}{\sum_{k=1}^K p(\mathbf x_n|Z_n=k;{\mathbf \Theta}_{k,t})p(Z_n=k)}\\ &=\frac{\phi(\mathbf x_n|{\boldsymbol \mu}_{k,t},{\boldsymbol \Sigma}_{k,t}) \alpha_{k,t}}{\sum_{k=1}^K \phi(\mathbf x_n|{\boldsymbol \mu}_{k,t},{\boldsymbol \Sigma}_{k,t}) \alpha_{k,t}} \end{aligned} \tag{GMM.4} p(Zn=kxn;Θk,t)=k=1Kp(xnZn=k;Θk,t)p(Zn=k)p(xnZn=k;Θk,t)p(Zn=k)=k=1Kϕ(xnμk,t,Σk,t)αk,tϕ(xnμk,t,Σk,t)αk,t(GMM.4)
      For simplicity, denote γ n , k , t ≜ p ( Z n = k ∣ x n ; Θ k , t ) \gamma_{n,k,t}\triangleq p(Z_n=k|\mathbf x_n;{\mathbf \Theta}_{k,t}) γn,k,tp(Zn=kxn;Θk,t).

    • Then formulate the Q-function from ( G M M . 3 ) (GMM.3) (GMM.3) and ( G M M . 4 ) (GMM.4) (GMM.4):
      Q t ( Θ ) : = ∑ n = 1 N ∑ k = 1 K γ n , k , t log ⁡ [ α k ϕ ( x n ; μ k , Σ k ) ] (GMM.5) Q_t(\mathbf \Theta):=\sum_{n=1}^N \sum_{k=1}^K\gamma_{n,k,t}\log[\alpha_k\phi(\mathbf x_n;{\boldsymbol \mu}_{k},{\boldsymbol \Sigma}_{k}) ] \tag{GMM.5} Qt(Θ):=n=1Nk=1Kγn,k,tlog[αkϕ(xn;μk,Σk)](GMM.5)

  • M-Step

    Assign Θ t + 1 {\mathbf \Theta}_{t+1} Θt+1 to be the value that maximizes the Q-function that was formulated on the previous E-step:
    Θ t + 1 : = arg ⁡ max ⁡ Θ Q t ( Θ ) s.t.  ∑ k = 1 K α k , t = 1 (GMM.6) {\mathbf \Theta}_{t+1}:=\arg \max _{\mathbf \Theta} Q_t(\mathbf \Theta)\quad \text{s.t. }\sum_{k=1}^K \alpha_{k,t}=1 \tag{GMM.6} Θt+1:=argΘmaxQt(Θ)s.t. k=1Kαk,t=1(GMM.6)

    The solution to this optimization problem is given by
    α k , t + 1 : = 1 n ∑ i = 1 n γ n , k , t μ k , t + 1 : = 1 ∑ n = 1 N γ n , k , t ∑ n = 1 N γ n , k , t x n Σ k , t + 1 : = 1 ∑ n = 1 N γ n , k , t ∑ i = 1 n γ n , k , t ( x n − μ k , t ) ( x n − μ k , t ) T (GMM.7) \begin{aligned} \alpha_{ k,t+1} &:=\frac{1}{n} \sum_{i=1}^{n} \gamma_{n,k,t} \\ \boldsymbol{\mu}_{ k,t+1} &:=\frac{1}{\sum_{n=1}^{N} \gamma_{n,k,t}} \sum_{n=1}^{N} \gamma_{n,k,t} \boldsymbol{x}_{n} \\ \boldsymbol{\Sigma}_{ k,t+1} &:=\frac{1}{\sum_{n=1}^{N} \gamma_{n,k,t}} \sum_{i=1}^{n} \gamma_{n,k,t}\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{ k,t}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{ k,t}\right)^{T} \end{aligned}\tag{GMM.7} αk,t+1μk,t+1Σk,t+1:=n1i=1nγn,k,t:=n=1Nγn,k,t1n=1Nγn,k,txn:=n=1Nγn,k,t1i=1nγn,k,t(xnμk,t)(xnμk,t)T(GMM.7)
    Note:

    • α k , t + 1 \alpha_{ k,t+1} αk,t+1 can be solved using Lagrange multipliers. The value of λ \lambda λ can be determined by making use of ∑ k = 1 K γ n , k , t = 1 \sum_{k=1}^K \gamma_{n,k,t}=1 k=1Kγn,k,t=1.
    • ∂ ∂ s ( x − s ) T W ( x − s ) = − 2 W − 1 ( x − s ) \frac{\partial}{\partial s}(\boldsymbol{x}-\boldsymbol{s})^{T} \boldsymbol{W}(\boldsymbol{x}-\boldsymbol{s})=-2 \boldsymbol{W}^{-1}(\boldsymbol{x}-\boldsymbol{s}) s(xs)TW(xs)=2W1(xs)
    • ∂ ∂ X log ⁡ ∣ det ⁡ ( X ) ∣ = ( X T ) − 1 \frac{\partial}{\partial \boldsymbol{X}} \log |\operatorname{det}(\boldsymbol{X})|=\left(\boldsymbol{X}^{T}\right)^{-1} Xlogdet(X)=(XT)1
    • ∂ ∂ X a T X − 1 b = − ( X − 1 ) T a b T ( X − 1 ) T \frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^{T} \boldsymbol{X}^{-1} \boldsymbol{b}=-\left(\boldsymbol{X}^{-1}\right)^{T} \boldsymbol{a} \boldsymbol{b}^{T}\left(\boldsymbol{X}^{-1}\right)^{T} XaTX1b=(X1)TabT(X1)T

    The complete derivation can be found here.

Visualization:

在这里插入图片描述

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值