Exponential Family Distributions and Variational Inference

Exponential family distributions

  1. The pdf / pmf of an exponential family distribution can be written as:

$$
p(x \mid \eta) = h(x)\exp\left(T(x)^T \eta - A(\eta)\right)
$$

Here $T(x)$ and $h(x)$ are functions of $x$ only, and $A(\eta)$ is a function of $\eta$ only. $T(x)$ is called the sufficient statistic and $A(\eta)$ is called the log-normalizer. $A(\eta)$ plays an important role in variational inference. Because the density must integrate to 1, $A(\eta)$ is fully determined by $h$ and $T$:

$$
\frac{\int h(x)\exp\left(T(x)^T\eta\right)dx}{\exp\left(A(\eta)\right)} = 1
\quad\Rightarrow\quad
A(\eta) = \log\int h(x)\exp\left(T(x)^T\eta\right)dx
$$

  1. Many of the familiar distributions belong to the exponential family, for example:

Normal, beta, Poisson, gamma, Bernoulli, chi-squared, geometric, exponential, categorical…

  1. Take the Gaussian distribution as an example:

$$
p(x \mid \theta) = p(x \mid \mu, \sigma^2) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

  1. Example: how to write the Gaussian in exponential family form, that is, how to replace the two parameters (mean and variance) with $\eta_1, \eta_2$:

$$
\begin{aligned}
N(x \mid \mu,\sigma^2) &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\
&= \exp\left(-\frac{x^2-2x\mu+\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)\\
&= \exp\left(\begin{bmatrix} x \\ x^2 \end{bmatrix}^T\begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right)
\end{aligned}
$$

From this we read off the following, taking $h(x)=\frac{1}{\sqrt{2\pi}}$ so that the constant $-\frac{1}{2}\ln(2\pi)$ is absorbed into $h$ rather than $A$:

$$
\begin{aligned}
T(x) &= \begin{bmatrix} x \\ x^2 \end{bmatrix}\\
\eta &= \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}\\
\theta &= \begin{bmatrix} \mu\\ \sigma^2 \end{bmatrix} = \begin{bmatrix} \frac{-\eta_1}{2\eta_2}\\ \frac{-1}{2\eta_2} \end{bmatrix}\\
A(\eta) &= -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\ln(-2\eta_2)
\end{aligned}
$$

The mean and variance are recovered from the natural parameters as follows:

$$
\eta_2 = -\frac{1}{2\sigma^2} \Rightarrow \sigma^2 = -\frac{1}{2\eta_2}\\
\mu = \eta_1\sigma^2 = \eta_1\cdot\frac{-1}{2\eta_2} = -\frac{\eta_1}{2\eta_2}
$$
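The conversion between $(\mu, \sigma^2)$ and $(\eta_1, \eta_2)$ can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and it assumes $h(x) = 1/\sqrt{2\pi}$ so that $A(\eta)$ has the form $-\eta_1^2/(4\eta_2) - \frac{1}{2}\ln(-2\eta_2)$.

```python
import math

def to_natural(mu, sigma2):
    """Map (mu, sigma^2) to the natural parameters (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_mean_var(eta1, eta2):
    """Inverse map: recover (mu, sigma^2) from (eta1, eta2)."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def log_normalizer(eta1, eta2):
    """A(eta) for the Gaussian, assuming h(x) = 1/sqrt(2*pi)."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

def pdf_expfam(x, eta1, eta2):
    """Density in exponential family form: h(x) * exp(T(x)^T eta - A(eta))."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(x * eta1 + x ** 2 * eta2 - log_normalizer(eta1, eta2))

def pdf_standard(x, mu, sigma2):
    """Density in the usual (mu, sigma^2) form, for comparison."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

eta1, eta2 = to_natural(1.0, 4.0)
print(to_mean_var(eta1, eta2))                             # (1.0, 4.0)
print(pdf_expfam(0.5, eta1, eta2), pdf_standard(0.5, 1.0, 4.0))
```

The two densities printed on the last line should agree up to floating-point rounding, which confirms that the two parameterisations describe the same distribution.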

  1. What is the benefit of the exponential family?
  • If a conditional probability can be written in the form above, many problems become much easier to solve.
  • For example, maximum likelihood estimation: solve $\underset{\eta}{\arg\max}\,[\log p(X \mid \eta)]$ for $N$ i.i.d. observations $X = \{x_1, \dots, x_N\}$.

$$
\begin{aligned}
\underset{\eta}{\arg\max}\,[\log p(X \mid \eta)] &= \underset{\eta}{\arg\max}\left[\log \prod_{i=1}^{N} p(x_i \mid \eta)\right]\\
&= \underset{\eta}{\arg\max}\sum_{i=1}^{N}\left[\log h(x_i)+T(x_i)^T\eta-A(\eta)\right]\\
&= \underset{\eta}{\arg\max}\left[\sum_{i=1}^{N}T(x_i)^T\eta-NA(\eta)\right]
\end{aligned}
$$

The $\log h(x_i)$ terms are dropped in the last step because they do not depend on $\eta$. Denote the final expression by $L(\eta)$ and set its derivative to zero:

$$
\frac{\partial L(\eta)}{\partial \eta} = \sum_{i=1}^{N}T(x_i)-NA'(\eta) = 0
$$

That is:

$$
A'(\eta) = \frac{1}{N}\sum_{i=1}^{N}T(x_i)
$$
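For the Gaussian, the condition $A'(\eta) = \frac{1}{N}\sum_i T(x_i)$ can be solved in closed form. Differentiating the Gaussian $A(\eta)$ gives $A'(\eta) = (\mu,\ \mu^2 + \sigma^2)$, so the estimate simply matches the first two empirical moments. A minimal sketch, assuming NumPy; the function name is hypothetical:

```python
import numpy as np

def gaussian_mle_natural(x):
    """Solve A'(eta) = mean of T(x_i) for a Gaussian with T(x) = [x, x^2]."""
    x = np.asarray(x, dtype=float)
    # Empirical mean of the sufficient statistics: (1/N) * sum_i T(x_i).
    t_mean = np.array([x.mean(), (x ** 2).mean()])
    # For the Gaussian, A'(eta) = [mu, mu^2 + sigma^2]; invert that map.
    mu = t_mean[0]
    sigma2 = t_mean[1] - mu ** 2
    # Convert back to natural parameters.
    eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
    return mu, sigma2, eta

mu, sigma2, eta = gaussian_mle_natural([1.0, 2.0, 3.0, 4.0])
print(mu, sigma2, eta)
```

Note that the data enter only through the averaged sufficient statistics, which is why $T(x)$ is all that needs to be stored. The resulting variance is the usual maximum likelihood estimate (divided by $N$, not $N-1$).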

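  1. Conjugacy:

$$
p(\beta \mid x) \propto p(x \mid \beta)\,p(\beta)
$$

If the likelihood and the prior are conjugate, the posterior belongs to the same family of distributions as the prior.

If the likelihood is an exponential family distribution, then in theory a conjugate prior can always be found, and that prior is itself an exponential family distribution.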
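A standard textbook instance of conjugacy, not taken from the derivation here, is the Beta prior with a Bernoulli likelihood: the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts. A minimal sketch with a hypothetical function name:

```python
def beta_bernoulli_posterior(a, b, data):
    """Prior Beta(a, b) with Bernoulli observations (0/1 values).

    Returns the parameters of the Beta posterior.
    """
    ones = sum(data)            # number of observations equal to 1
    zeros = len(data) - ones    # number of observations equal to 0
    return a + ones, b + zeros

print(beta_bernoulli_posterior(2, 2, [1, 0, 1, 1]))  # (5, 3)
```

The update only adds counts to the prior parameters, so the posterior can serve as the prior for the next batch of data without changing its functional form.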

  1. A useful result: $A_l'(\beta) = E_{p(x \mid \beta)}[T(x)]$

Proof. Write the likelihood as

$$
p(x \mid \beta) = h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)
$$

Since $\int p(x \mid \beta)\,dx = 1$, the derivative of this integral with respect to $\beta$ is zero:

$$
\begin{aligned}
0 = \frac{\partial}{\partial \beta}\int p(x \mid \beta)\,dx
&= \int \frac{\partial}{\partial \beta}\left[h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\right]dx\\
&= \int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)\left(T(x) - A_l'(\beta)\right)dx\\
&= \int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)T(x)\,dx - A_l'(\beta)\int h(x)\exp\left(T(x)^T\beta - A_l(\beta)\right)dx\\
&= E_{p(x \mid \beta)}[T(x)] - A_l'(\beta)
\end{aligned}
$$

Hence $A_l'(\beta) = E_{p(x \mid \beta)}[T(x)]$.
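The identity can be sanity-checked numerically. The sketch below uses the Poisson distribution as an assumed example (it is a pmf, so the integral becomes a sum): $h(x) = 1/x!$, $T(x) = x$, and the natural parameter is the log of the rate. The log-normalizer is computed directly from its definition with a truncated sum, and its finite-difference derivative is compared with $E[T(x)]$. All names and the truncation point `x_max` are illustrative choices.

```python
import math

def log_normalizer(eta, x_max=200):
    """A(eta) = log sum_x h(x) exp(x * eta), with h(x) = 1/x! (truncated sum)."""
    terms = [x * eta - math.lgamma(x + 1) for x in range(x_max)]
    m = max(terms)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def expected_T(eta, x_max=200):
    """E[T(x)] = E[x] under the pmf h(x) exp(x * eta - A(eta))."""
    a = log_normalizer(eta, x_max)
    return sum(x * math.exp(x * eta - math.lgamma(x + 1) - a) for x in range(x_max))

def dA_deta(eta, eps=1e-5):
    """Central finite-difference approximation of A'(eta)."""
    return (log_normalizer(eta + eps) - log_normalizer(eta - eps)) / (2.0 * eps)

eta = math.log(3.0)  # Poisson rate 3
print(dA_deta(eta), expected_T(eta))
```

Both printed values should be close to 3, the Poisson mean, up to finite-difference and truncation error.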

  1. Let $X$ be the observed data, $Z$ the set of latent variables, and $\beta$ the set of parameters.

The posterior distribution factorises in two ways:

$$
\begin{aligned}
p(\beta, Z \mid X) &= p(\beta \mid Z, X)\,p(Z \mid X)\\
&= p(Z \mid \beta, X)\,p(\beta \mid X)
\end{aligned}
$$

Assume that the two conditional posteriors, $p(\beta \mid Z, X)$ and $p(Z \mid \beta, X)$, are both exponential family distributions.

Then:

$$
p(\beta \mid Z, X) = h(\beta)\exp\left(T(\beta)^T\eta(Z,X) - A_g(\eta(Z,X))\right)
$$

In variational inference we approximate $p(\beta \mid Z, X)$ with a distribution $q(\beta \mid \lambda)$ from the same family:

$$
p(\beta \mid Z, X) \approx q(\beta \mid \lambda) = h(\beta)\exp\left(T(\beta)^T\lambda - A_g(\lambda)\right)
$$

The task is then to adjust $\lambda$ repeatedly so that $q(\beta \mid \lambda)$ moves closer and closer to $p(\beta \mid Z, X)$, which is equivalent to increasing the ELBO (evidence lower bound).

The same applies to $p(Z \mid \beta, X)$:

$$
\begin{aligned}
p(Z \mid \beta, X) &= h(Z)\exp\left(T(Z)^T\eta(\beta,X) - A_l(\eta(\beta,X))\right)\\
&\approx q(Z \mid \phi) = h(Z)\exp\left(T(Z)^T\phi - A_l(\phi)\right)
\end{aligned}
$$

The ELBO is:

$$
L(q) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]
$$

With $q$ parameterised by $\lambda$ and $\phi$, the ELBO becomes a function of these two parameters:

$$
L(\lambda, \phi) = E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]
$$

Goal: find $\lambda$ and $\phi$ that maximise the ELBO.

Method:

  • Fix one parameter and optimise the other, alternating between the two (coordinate ascent).

In detail:

  • Fix $\phi$ and optimise $\lambda$.

  • Using the mean-field factorisation $q(Z,\beta) = q(\beta \mid \lambda)\,q(Z \mid \phi)$ and $p(X,Z,\beta) = p(\beta \mid X,Z)\,p(Z \mid X)\,p(X)$, and collecting every term that does not depend on $\lambda$ into a constant:

$$
\begin{aligned}
L(\lambda, \phi) &= E_{q(Z,\beta)}[\log p(X,Z,\beta)] - E_{q(Z,\beta)}[\log q(Z,\beta)]\\
&= E_{q(Z,\beta)}[\log p(\beta \mid X,Z) + \log p(Z \mid X) + \log p(X)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] - E_{q(Z,\beta)}[\log q(Z \mid \phi)]\\
&= E_{q(Z,\beta)}[\log p(\beta \mid X,Z)] - E_{q(Z,\beta)}[\log q(\beta \mid \lambda)] + \text{const}
\end{aligned}
$$

  • Substitute the exponential family forms of $p(\beta \mid Z,X)$ and $q(\beta \mid \lambda)$ into this expression, then use $E_{q(\beta \mid \lambda)}[T(\beta)] = A_g'(\lambda)$:

$$
\begin{aligned}
L(\lambda, \phi) &= E_{q(Z,\beta)}[\log h(\beta)] + E_{q(Z,\beta)}[T(\beta)^T\eta(Z,X)] - E_{q(Z,\beta)}[A_g(\eta(Z,X))]\\
&\quad - E_{q(Z,\beta)}[\log h(\beta)] - E_{q(Z,\beta)}[T(\beta)^T\lambda] + E_{q(Z,\beta)}[A_g(\lambda)] + \text{const}\\
&= E_{q(\beta)}[T(\beta)]^T E_{q(Z)}[\eta(Z,X)] - E_{q(Z)}[A_g(\eta(Z,X))] - E_{q(\beta)}[T(\beta)]^T\lambda + A_g(\lambda) + \text{const}\\
&= A_g'(\lambda)^T E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda)^T\lambda + A_g(\lambda) + \text{const}
\end{aligned}
$$

In the last line, $E_{q(Z)}[A_g(\eta(Z,X))]$ has been absorbed into the constant because it does not depend on $\lambda$.

  • Differentiate with respect to $\lambda$ and set the result to zero. Here $A_g''(\lambda)$ is the (symmetric) Hessian of $A_g$:

$$
\begin{aligned}
\frac{\partial L(\lambda, \phi)}{\partial \lambda} &= A_g''(\lambda)\,E_{q(Z)}[\eta(Z,X)] - A_g'(\lambda) - A_g''(\lambda)\,\lambda + A_g'(\lambda)\\
&= A_g''(\lambda)\left(E_{q(Z)}[\eta(Z,X)] - \lambda\right) = 0
\end{aligned}
$$

  • If $A_g''(\lambda)$ is invertible (in the scalar case, $A_g''(\lambda) \neq 0$), then
    $$\lambda = E_{q(Z \mid \phi)}[\eta(Z,X)]$$
    Similarly, fixing $\lambda$ and optimising $\phi$ gives
    $$\phi = E_{q(\beta \mid \lambda)}[\eta(X,\beta)]$$
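The two updates can be alternated until convergence. As a rough illustration only, the sketch below applies them to a toy model that is not part of the derivation above: a 1-D Gaussian mixture with known variance `sigma2`, uniform mixing weights, and a prior $\mu_k \sim N(0, \texttt{prior\_var})$ on each component mean. Here $\beta$ is the set of means, $Z$ the cluster assignments, and every name (`cavi_gmm`, `resp`, `lam1`, `lam2`) is hypothetical.

```python
import numpy as np

def cavi_gmm(x, K=2, sigma2=1.0, prior_var=10.0, n_iter=50):
    """Coordinate ascent VI for a toy 1-D Gaussian mixture (assumed model).

    q(mu_k) = N(m_k, s2_k) approximates the means (the role of beta);
    q(z_i) = Categorical(resp_i) approximates the assignments (the role of Z).
    """
    x = np.asarray(x, dtype=float)
    m = np.linspace(x.min(), x.max(), K)   # deterministic initial means
    s2 = np.full(K, prior_var)             # initial variances of q(mu_k)
    for _ in range(n_iter):
        # Local step: phi = E_{q(mu)}[eta(x_i, mu)]. For a categorical these
        # natural parameters are logits, using E[mu_k^2] = m_k^2 + s2_k.
        logits = (np.outer(x, m) - 0.5 * (m ** 2 + s2)) / sigma2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # Global step: lambda = E_{q(Z)}[eta(Z, X)]. The complete conditional of
        # mu_k is Gaussian, so z_ik is replaced by its expectation resp_ik.
        lam1 = resp.T @ x / sigma2
        lam2 = -0.5 * (1.0 / prior_var + resp.sum(axis=0) / sigma2)
        # Convert the natural parameters back to mean and variance.
        s2 = -1.0 / (2.0 * lam2)
        m = lam1 * s2
    return m, s2, resp

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
means, variances, resp = cavi_gmm(data)
print(np.sort(means))
```

Each step replaces the conditioning variables in the natural parameter of the complete conditional by their expectations under the other factor, which is exactly the form of the two fixed-point equations. In general, coordinate ascent converges only to a local optimum, so the result can depend on the initialisation.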
