The Expectation Propagation Algorithm and Its Derivation

Personal blog: www.qiuyun-blog.cn

Notations:

  1. $\text{Diag}(\boldsymbol{a})$: a diagonal matrix with $\boldsymbol{a}$ on its diagonal.
  2. $\text{diag}(\mathbf{A})$: a vector formed from the diagonal elements of $\mathbf{A}$.
  3. $\boldsymbol{a}\odot \boldsymbol{b}$: componentwise multiplication.
  4. $\boldsymbol{a}\oslash \boldsymbol{b}$: componentwise division.

Recap of Variational Inference

In [1], we introduced variational inference and its application to Bayesian linear regression. In this post, we take a variational inference perspective on expectation propagation (EP).

In the signal processing regime, the posterior distribution is the quantity of interest. However, it is often intractable owing to high-dimensional integrals. As an example, consider the linear model
$$\mathbf{y}=\mathbf{Hx}+\mathbf{w}$$
Its posterior distribution is given by Bayes' rule:
$$p(\mathbf{x}|\mathbf{y})=\frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{\int p(\mathbf{y}|\mathbf{x})p(\mathbf{x})\,\text{d}\mathbf{x}}$$
where $p(\mathbf{y}|\mathbf{x})=p_{\mathbf{w}}(\mathbf{y}-\mathbf{Hx})$. Unless both $p(\mathbf{y}|\mathbf{x})$ and $p(\mathbf{x})$ are Gaussian, we cannot obtain $p(\mathbf{x}|\mathbf{y})$ in closed form, so some approximation is necessary.
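In the fully Gaussian case the posterior is available in closed form, which makes a useful reference point. Below is a minimal NumPy sketch of that case; the dimensions, prior, and noise covariance are illustrative choices of mine, not from the original post.

```python
import numpy as np

# Minimal sketch: the one tractable case of y = Hx + w, where both the
# prior p(x) = N(mu0, S0) and the noise w ~ N(0, Sw) are Gaussian.
rng = np.random.default_rng(0)
n, m = 4, 6                        # dim of x, dim of y (illustrative)
H = rng.standard_normal((m, n))
mu0, S0 = np.zeros(n), np.eye(n)   # Gaussian prior on x
Sw = 0.1 * np.eye(m)               # Gaussian noise covariance

x_true = rng.standard_normal(n)
y = H @ x_true + rng.multivariate_normal(np.zeros(m), Sw)

# Closed-form Gaussian posterior p(x|y) = N(mu_post, S_post):
#   S_post  = (S0^{-1} + H^T Sw^{-1} H)^{-1}
#   mu_post = S_post (S0^{-1} mu0 + H^T Sw^{-1} y)
S0_inv, Sw_inv = np.linalg.inv(S0), np.linalg.inv(Sw)
S_post = np.linalg.inv(S0_inv + H.T @ Sw_inv @ H)
mu_post = S_post @ (S0_inv @ mu0 + H.T @ Sw_inv @ y)
print(mu_post)   # posterior mean, close to x_true for low noise
print(x_true)
```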

To this end, we use $q(\mathbf{x})$ to approximate the posterior distribution, and the KL divergence to measure the difference between $q(\mathbf{x})$ and $p(\mathbf{x}|\mathbf{y})$. For tractability, we restrict $q(\mathbf{x})$ to a distribution family $\mathcal{S}$, i.e.,
$$q(\mathbf{x})=\underset{q(\mathbf{x})\in \mathcal{S}}{\arg\min}\ \mathcal{D}_{\text{KL}}(p\,\|\,q)$$
Obviously, a distribution family with good analytical properties greatly reduces the computation. Fortunately, the exponential family is one such family.

Exponential Family

The exponential family over $\mathbf{x}$, parameterized by the natural parameter $\boldsymbol{\eta}$, is defined by
$$p(\mathbf{x};\boldsymbol{\eta})=h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)$$
where $\boldsymbol{u}(\mathbf{x})$ is the vector of sufficient statistics and $g(\boldsymbol{\eta})$ is the normalization constant satisfying
$$g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}=1$$
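For a concrete instance (added here for illustration), the scalar Gaussian $\mathcal{N}(x;\mu,\sigma^2)$ takes this form with
$$h(x)=\frac{1}{\sqrt{2\pi}},\quad \boldsymbol{u}(x)=\begin{pmatrix}x\\ x^2\end{pmatrix},\quad \boldsymbol{\eta}=\begin{pmatrix}\mu/\sigma^2\\ -1/(2\sigma^2)\end{pmatrix},\quad g(\boldsymbol{\eta})=\sqrt{-2\eta_2}\,\exp\left(\frac{\eta_1^2}{4\eta_2}\right)$$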
Taking the gradient of both sides of the normalization condition w.r.t. $\boldsymbol{\eta}$, we get
$$\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}+g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\boldsymbol{u}(\mathbf{x})\,\text{d}\mathbf{x}=\boldsymbol{0}$$
Rearranging, and using $\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}=1/g(\boldsymbol{\eta})$, yields
$$-\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta})=g(\boldsymbol{\eta})\int \boldsymbol{u}(\mathbf{x})h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}=\frac{\int \boldsymbol{u}(\mathbf{x})h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}}{\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)\text{d}\mathbf{x}}=\mathbb{E}[\boldsymbol{u}(\mathbf{x})]$$
Using the fact that $\nabla \log g(\boldsymbol{\eta})=\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta})$, we have
$$-\nabla \log g(\boldsymbol{\eta})=\mathbb{E}[\boldsymbol{u}(\mathbf{x})] \qquad \cdots\cdots \qquad (*1)$$
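Identity $(*1)$ is easy to verify numerically. The sketch below checks it for the scalar Gaussian above, using $\log g(\boldsymbol{\eta})=\eta_1^2/(4\eta_2)+\tfrac{1}{2}\log(-2\eta_2)$, which follows from the expression for $g(\boldsymbol{\eta})$; the test values of $\mu$ and $\sigma^2$ are arbitrary choices of mine.

```python
import numpy as np

# Numerical check of (*1) for the scalar Gaussian in exponential-family
# form: u(x) = (x, x^2), eta = (mu/s2, -1/(2*s2)).
def log_g(eta):
    e1, e2 = eta
    return e1**2 / (4.0 * e2) + 0.5 * np.log(-2.0 * e2)

mu, s2 = 1.3, 0.7                             # arbitrary test values
eta = np.array([mu / s2, -1.0 / (2.0 * s2)])

# grad log g(eta) via central finite differences
eps = 1e-6
grad = np.array([(log_g(eta + eps * d) - log_g(eta - eps * d)) / (2 * eps)
                 for d in np.eye(2)])

moments = np.array([mu, mu**2 + s2])          # E[x], E[x^2]
print(-grad)      # ~ [1.3, 2.39], i.e. -grad log g(eta)
print(moments)    #   [1.3, 2.39], i.e. E[u(x)]
```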

A Variational Inference Perspective on EP

For the distribution $q(\mathbf{x})$ in variational inference, we now take an exponential family distribution
$$q(\mathbf{x})=h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\boldsymbol{u}(\mathbf{x})\right)$$
Since $\mathcal{D}_{\text{KL}}(p\,\|\,q)=\mathbb{E}_{p(\mathbf{x})}[\log p(\mathbf{x})]-\mathbb{E}_{p(\mathbf{x})}[\log q(\mathbf{x})]$ and only the second term depends on $\boldsymbol{\eta}$, collecting all $\boldsymbol{\eta}$-independent terms (including $\mathbb{E}_{p}[\log h(\mathbf{x})]$) into a constant gives
$$\mathcal{D}_{\text{KL}}(p\,\|\,q)=-\log g(\boldsymbol{\eta})-\boldsymbol{\eta}^T\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]+\text{const}$$
Setting the gradient of the above w.r.t. $\boldsymbol{\eta}$ to zero yields
$$-\nabla \log g(\boldsymbol{\eta})=\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]$$
Combining this with $(*1)$, we arrive at the moment-matching condition
$$\mathbb{E}_{q(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]=\mathbb{E}_{p(\mathbf{x})}[\boldsymbol{u}(\mathbf{x})]$$
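For a Gaussian family, this condition says the KL-optimal $q$ simply copies the mean and variance of $p$. Below is a minimal sketch of that projection on a grid; the target $p$, a two-component Gaussian mixture, is an illustrative choice of mine, not from the post.

```python
import numpy as np

# Project a non-Gaussian p onto a Gaussian q by moment matching:
# with u(x) = (x, x^2), E_q[u] = E_p[u] fixes q's mean and variance.
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]

# Illustrative target: unnormalized two-component Gaussian mixture.
p = 0.3 * np.exp(-0.5 * (xs - 2.0)**2) \
  + 0.7 * np.exp(-0.5 * (xs + 1.0)**2 / 0.5)
p /= p.sum() * dx                        # normalize on the grid

m1 = (xs * p).sum() * dx                 # E_p[x]
m2 = (xs**2 * p).sum() * dx              # E_p[x^2]
mu_q, var_q = m1, m2 - m1**2             # matched q(x) = N(mu_q, var_q)
print(mu_q, var_q)
```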
