Personal blog: www.qiuyun-blog.cn
Notations:
- $\text{Diag}(\boldsymbol{a})$: a diagonal matrix with $\boldsymbol{a}$ as its diagonal elements.
- $\text{diag}(\mathbf{A})$: the vector formed from the diagonal elements of $\mathbf{A}$.
- $\boldsymbol{a}\odot\boldsymbol{b}$: componentwise multiplication.
- $\boldsymbol{a}\oslash\boldsymbol{b}$: componentwise division.
Recap of Variational Inference
In [1], we introduced variational inference and its application to Bayesian linear regression. In this post, we focus on a variational inference perspective on expectation propagation (EP).
In the signal processing regime, the posterior distribution is of primary interest. However, it is often difficult to obtain owing to the many high-dimensional integrals involved. For example, consider the linear Gaussian model
$$\mathbf{y}=\mathbf{Hx}+\mathbf{w}$$
Its posterior distribution is given by
$$p(\mathbf{x}|\mathbf{y})=\frac{p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}{\int p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\,\text{d}\mathbf{x}}$$
where $p(\mathbf{y}|\mathbf{x})=p_{\mathbf{w}}(\mathbf{y}-\mathbf{Hx})$. Unless both $p(\mathbf{y}|\mathbf{x})$ and $p(\mathbf{x})$ are Gaussian, we cannot obtain the closed form of $p(\mathbf{x}|\mathbf{y})$ directly. Hence, approximations are necessary.
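In the all-Gaussian case the posterior is available in closed form. A minimal scalar sketch (all numeric values are assumptions chosen for illustration) of $y = h\,x + w$ with $x\sim\mathcal{N}(0,v_x)$ and $w\sim\mathcal{N}(0,v_w)$ cross-checks the standard Gaussian posterior formulas against direct numerical integration of Bayes' rule:

```python
import math

# Scalar linear Gaussian model y = h*x + w, x ~ N(0, vx), w ~ N(0, vw).
# Closed-form posterior: 1/v_post = 1/vx + h^2/vw, m_post = v_post*h*y/vw.
h, vx, vw, y = 2.0, 1.5, 0.5, 3.0   # illustrative assumptions

v_post = 1.0 / (1.0 / vx + h * h / vw)
m_post = v_post * h * y / vw

# Cross-check against Bayes' rule by numerical integration.
def unnorm_post(x):
    lik = math.exp(-(y - h * x) ** 2 / (2 * vw))   # p(y|x) up to a constant
    pri = math.exp(-x * x / (2 * vx))              # p(x) up to a constant
    return lik * pri

def integrate(f, lo=-15.0, hi=15.0, n=30000):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

z = integrate(unnorm_post)
mean_num = integrate(lambda x: x * unnorm_post(x)) / z
var_num = integrate(lambda x: x * x * unnorm_post(x)) / z - mean_num ** 2

print(m_post, mean_num)   # closed form and numerical mean agree
print(v_post, var_num)    # closed form and numerical variance agree
```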
To this end, we use $q(\mathbf{x})$ to approximate the posterior distribution, and the KL divergence to measure the difference between $q(\mathbf{x})$ and $p(\mathbf{x}|\mathbf{y})$. For tractability, we generally restrict $q(\mathbf{x})$ to a distribution family $\mathcal{S}$, i.e.,
$$q(\mathbf{x})=\underset{q(\mathbf{x})\in\mathcal{S}}{\arg\min}\ \mathcal{D}_{\text{KL}}(p\|q)$$
Obviously, a distribution family with good analytical properties greatly reduces the amount of computation. Fortunately, the exponential family is one such family.
Exponential Family
The exponential family over $\mathbf{x}$ parameterized by $\boldsymbol{\eta}$ is defined by
$$p(\mathbf{x};\boldsymbol{\eta})=h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)$$
where $g(\boldsymbol{\eta})$ is the normalization constant, satisfying
$$g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}=1$$
Taking the gradient of both sides of the above w.r.t. $\boldsymbol{\eta}$, we get
$$\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}+g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\boldsymbol{u}(\mathbf{x})\,\text{d}\mathbf{x}=\mathbf{0}$$
Rearranging yields
$$\begin{aligned}-\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta})&=g(\boldsymbol{\eta})\int\boldsymbol{u}(\mathbf{x})h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}\\&=\frac{\int\boldsymbol{u}(\mathbf{x})h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}}{\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}}\\&=\mathbb{E}[\boldsymbol{u}(\mathbf{x})]\end{aligned}$$
Using the fact $\nabla\log g(\boldsymbol{\eta})=\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta})$, we have
$$-\nabla\log g(\boldsymbol{\eta})=\mathbb{E}[\boldsymbol{u}(\mathbf{x})]\qquad\cdots\qquad(*1)$$
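The identity $(*1)$ can be checked numerically. Below is a sketch for a 1-D Gaussian written in exponential-family form with $h(x)=1$ and $\boldsymbol{u}(x)=(x,x^2)$, so $\eta_1=\mu/\sigma^2$ and $\eta_2=-1/(2\sigma^2)$; the chosen natural parameters, grid bounds, and step sizes are assumptions for illustration. We compute $-\nabla\log g(\boldsymbol{\eta})$ by finite differences and compare it with the moments $\mathbb{E}[x]$ and $\mathbb{E}[x^2]$:

```python
import math

# Check  -∇log g(η) = E[u(x)]  for p(x; η) = g(η)·exp(η1·x + η2·x²),
# i.e. a 1-D Gaussian with u(x) = (x, x²) and h(x) = 1.

def integrate(f, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

def log_g(eta1, eta2):
    # g(η)·∫ exp(η^T u(x)) dx = 1  ⇒  log g(η) = -log ∫ exp(η1·x + η2·x²) dx
    return -math.log(integrate(lambda x: math.exp(eta1 * x + eta2 * x * x)))

def moments(eta1, eta2):
    z = integrate(lambda x: math.exp(eta1 * x + eta2 * x * x))
    e_x = integrate(lambda x: x * math.exp(eta1 * x + eta2 * x * x)) / z
    e_x2 = integrate(lambda x: x * x * math.exp(eta1 * x + eta2 * x * x)) / z
    return e_x, e_x2

eta1, eta2 = 1.0, -0.5   # corresponds to mean 1, variance 1
step = 1e-5              # finite-difference step for the gradient
grad1 = (log_g(eta1 + step, eta2) - log_g(eta1 - step, eta2)) / (2 * step)
grad2 = (log_g(eta1, eta2 + step) - log_g(eta1, eta2 - step)) / (2 * step)
e_x, e_x2 = moments(eta1, eta2)

print(-grad1, e_x)    # both ≈ 1.0  (E[x] = μ)
print(-grad2, e_x2)   # both ≈ 2.0  (E[x²] = σ² + μ²)
```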
A Variational Inference Perspective on EP
For the distribution $q(\mathbf{x})$ in variational inference, we take an exponential family distribution into account:
$$q(\mathbf{x})=h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)$$
Substituting this into $\mathcal{D}_{\text{KL}}(p\|q)=\int p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})}\,\text{d}\mathbf{x}$ and collecting the terms independent of $\boldsymbol{\eta}$ into a constant, we write $\mathcal{D}_{\text{KL}}(p\|q)$ as
$$\mathcal{D}_{\text{KL}}(p\|q)=-\log g(\boldsymbol{\eta})-\boldsymbol{\eta}^T\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]+\text{const}$$
Setting the gradient of the above w.r.t. $\boldsymbol{\eta}$ to zero yields
$$-\nabla\log g(\boldsymbol{\eta})=\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]$$
Since $(*1)$ identifies the left-hand side as $\mathbb{E}_{q(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]$, we then get
$$\mathbb{E}_{q(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]=\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]$$
That is, minimizing $\mathcal{D}_{\text{KL}}(p\|q)$ over an exponential family reduces to matching the expected sufficient statistics (moments) of $p$, which is the moment-matching step at the core of expectation propagation.
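The moment-matching conclusion can also be illustrated numerically. In the sketch below (the two-component Gaussian-mixture target and all numeric values are assumptions for illustration), the Gaussian $q$ has sufficient statistics $u(x)=(x,x^2)$, so the result says the KL-optimal $q$ must match the mean and variance of $p$; we verify that perturbing either matched moment increases $\mathcal{D}_{\text{KL}}(p\|q)$:

```python
import math

# Target p: a two-component Gaussian mixture (illustrative assumption).
def p(x):
    n = lambda x, m, s: math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    return 0.5 * n(x, -2.0, 1.0) + 0.5 * n(x, 2.0, 0.5)

def integrate(f, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

def kl_p_q(mu, var):
    """D_KL(p || N(mu, var)) by numerical integration."""
    def log_q(x):
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return integrate(lambda x: p(x) * (math.log(p(x)) - log_q(x)))

# Moment matching: E_q[x] = E_p[x] and E_q[x²] = E_p[x²].
mean = integrate(lambda x: x * p(x))
var = integrate(lambda x: x * x * p(x)) - mean ** 2

best = kl_p_q(mean, var)
# Any perturbation of the matched moments increases the divergence.
assert best < kl_p_q(mean + 0.1, var)
assert best < kl_p_q(mean, var * 1.1)
```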