Variational Bayesian inference
Log-likelihood and Evidence Lower Bound (ELBO)
The following identity always holds (it is the product rule $p(X,Z) = p(Z\mid X)\,p(X)$ written in log form):
$$\ln(p(X)) = \ln(p(X,Z)) - \ln(p(Z\mid X))$$
Therefore the following also holds for any distribution $q(Z)$, because the two $\ln(q(Z))$ terms cancel:
$$\ln(p(X)) = \left[\ln(p(X,Z))-\ln(q(Z))\right] - \left[\ln(p(Z\mid X))-\ln(q(Z))\right]$$
So we now have
$$\ln(p(X)) = \ln\left(\frac{p(X,Z)}{q(Z)}\right) - \ln\left(\frac{p(Z\mid X)}{q(Z)}\right)$$
Take the expectation of both sides with respect to $q(Z)$. The left-hand side does not depend on $Z$, so it is unchanged:
$$
\begin{aligned}
\ln (p(X)) &= \int q(Z) \ln \left(\frac{p(X, Z)}{q(Z)}\right) \mathrm{d} Z - \int q(Z) \ln \left(\frac{p(Z \mid X)}{q(Z)}\right) \mathrm{d} Z \\
&= \underbrace{\int q(Z) \ln (p(X, Z)) \,\mathrm{d} Z - \int q(Z) \ln (q(Z)) \,\mathrm{d} Z}_{\mathcal{L}(q)} + \underbrace{\left(-\int q(Z) \ln \left(\frac{p(Z \mid X)}{q(Z)}\right) \mathrm{d} Z\right)}_{\mathbb{KL}(q \| p)} \\
&= \mathcal{L}(q) + \mathbb{KL}(q \| p)
\end{aligned}
$$
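As a quick numerical sanity check of the decomposition $\ln p(X) = \mathcal{L}(q) + \mathbb{KL}(q\|p)$, here is a minimal sketch on a discrete toy model. The joint probabilities and the choice of `q` are made-up values for illustration only, and the integrals become sums because $Z$ is assumed to take just three values.

```python
import math

# Hypothetical toy model: Z takes 3 values, X is fixed at its observed value.
joint = [0.10, 0.25, 0.05]                  # p(X = x, Z = z) for z = 0, 1, 2
evidence = sum(joint)                       # p(X = x)
posterior = [j / evidence for j in joint]   # p(Z = z | X = x)

q = [0.2, 0.5, 0.3]                         # an arbitrary distribution over Z

# L(q) = sum_z q(z) ln( p(x, z) / q(z) )
elbo = sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))
# KL(q || p) = -sum_z q(z) ln( p(z | x) / q(z) )
kl = -sum(qz * math.log(pz / qz) for qz, pz in zip(q, posterior))

# The two printed numbers should agree up to floating-point rounding.
print(math.log(evidence), elbo + kl)
```

Changing `q` to any other strictly positive distribution moves value between `elbo` and `kl`, but their sum should stay at `math.log(evidence)`.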
The KL divergence is commonly used to measure how far apart two probability distributions are. It is not symmetric in its arguments, so it is not a distance in the strict sense. It is defined as:
$$\mathbb{KL}[p(X)\,\|\,q(X)] = \sum_{x\in X}\left[p(x)\ln\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x\sim p(x)}\left[\ln\frac{p(x)}{q(x)}\right]$$
Our goal is to find a simple distribution $q(Z)$ that is as close as possible to the posterior $p(Z\mid X)$.
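The discrete KL definition above can be sketched in a few lines. The helper name `kl_divergence` and the two example distributions are assumptions made for illustration; the sketch also assumes that `q` is strictly positive wherever `p` is.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)   # KL(p || q)
kl_qp = kl_divergence(q, p)   # KL(q || p), generally a different number

print(kl_pq, kl_qp, kl_divergence(p, p))
```

Swapping the arguments changes the value, which is why the direction $\mathbb{KL}(q\|p)$ used in variational inference has to be stated explicitly.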
Alternative Evidence Lower Bound (ELBO)
Here is another way to derive it:
$$
\begin{aligned}
\ln (p(X)) &= \ln \int_{Z} p(X, Z) \,\mathrm{d} Z \\
&= \ln \int_{Z} p(X, Z) \frac{q(Z)}{q(Z)} \,\mathrm{d} Z \\
&= \ln \left(\mathbb{E}_{q}\left[\frac{p(X, Z)}{q(Z)}\right]\right) \\
&\geq \mathbb{E}_{q}\left[\ln \left(\frac{p(X, Z)}{q(Z)}\right)\right] \quad \text{using Jensen's inequality} \\
&= \mathbb{E}_{q}[\ln (p(X, Z))] - \mathbb{E}_{q}[\ln (q(Z))] \\
&\triangleq \mathcal{L}(q)
\end{aligned}
$$
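The Jensen bound can be probed empirically: for any valid $q$, the lower bound should never exceed the log evidence. The sketch below draws random distributions `q` over a three-valued $Z$ and records the gap. The joint table, the seed, and the number of trials are arbitrary assumptions.

```python
import math
import random

joint = [0.10, 0.25, 0.05]            # hypothetical p(X = x, Z = z)
log_evidence = math.log(sum(joint))   # ln p(X = x)

rng = random.Random(0)                # fixed seed so the run is repeatable
gaps = []
for _ in range(1000):
    # Draw a random strictly positive distribution q over the 3 values of Z.
    raw = [rng.random() + 1e-9 for _ in joint]
    total = sum(raw)
    q = [r / total for r in raw]
    # Lower bound: E_q[ ln( p(x, z) / q(z) ) ]
    bound = sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))
    gaps.append(log_evidence - bound)

# Every gap should be non-negative (up to rounding).
print(min(gaps), max(gaps))
```

The gap recorded here is exactly the KL term from the first derivation, which is why it cannot go below zero.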
Maximize the Evidence Lower Bound (ELBO)
We give each part a name:
$$
\begin{array}{ll}
\text{Evidence Lower Bound (ELBO):} & \mathcal{L}(q) = \int q(Z) \ln (p(X, Z)) \,\mathrm{d} Z - \int q(Z) \ln (q(Z)) \,\mathrm{d} Z \\
\text{KL divergence:} & \mathbb{KL}(q \| p) = -\int q(Z) \ln \left(\frac{p(Z \mid X)}{q(Z)}\right) \mathrm{d} Z
\end{array}
$$
- Note that $p(X)$ is fixed regardless of the choice of $q(Z)$. We want to choose a $q(Z)$ that minimizes the KL divergence, so that $q(Z)$ gets closer and closer to $p(Z\mid X)$. It is easy to verify that the KL divergence is $0$ when $q(Z)=p(Z\mid X)$.
- We know that $\ln p(X) = \mathcal{L}(q)+\mathbb{KL}(q\| p)$. Because the left-hand side does not depend on $q$, minimizing $\mathbb{KL}(q\| p)$ is equivalent to maximizing $\mathcal{L}(q)$.
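To illustrate that maximizing $\mathcal{L}(q)$ recovers the posterior when $q$ is unrestricted, the following sketch scans a one-parameter family $q(Z=1)=a$ for a binary $Z$. The joint values and the grid resolution are assumptions chosen so that the true posterior lies on the grid.

```python
import math

joint = [0.3, 0.1]                        # hypothetical p(X = x, Z = 0), p(X = x, Z = 1)
evidence = sum(joint)
posterior_1 = joint[1] / evidence         # p(Z = 1 | X = x)

def elbo(a):
    """L(q) for q(Z = 1) = a and q(Z = 0) = 1 - a, with 0 < a < 1."""
    q = [1.0 - a, a]
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint))

# Brute-force search over a grid of candidate values for a.
grid = [k / 1000 for k in range(1, 1000)]
best_a = max(grid, key=elbo)
best_elbo = elbo(best_a)

print(best_a, posterior_1)                # maximizer versus the true posterior
print(best_elbo, math.log(evidence))      # the bound is tight at the maximizer
```

A grid search is used only to keep the sketch transparent; in practice the maximization is done analytically or with gradient-based methods.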
We can choose $q(Z)$ so that it factorizes as
$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$
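where $M$ is the dimension of $Z$. In other words, the components of $Z$ are independent under $q(Z)$; this is called mean-field variational Bayes.
Note that $q(Z)$ is fitted as an approximation to the joint posterior density $p(Z\mid X)$, but an individual factor $q_i(Z_i)$ is not necessarily a good approximation of the marginal $p(Z_i\mid X)$.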
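The remark about marginals can be explored with a small experiment: find the best factorized $q$ for a correlated two-variable posterior by brute force, then compare its factors with the true marginals. The posterior table and the grid resolution below are made-up values, and whether the factors coincide with the marginals depends on the posterior chosen.

```python
import math

# Hypothetical posterior p(Z1, Z2 | X) over two binary variables; rows index Z1.
post = [[0.5, 0.1],
        [0.1, 0.3]]

def kl_factorized(a, b):
    """KL(q || p) for q(Z1, Z2) = q1(Z1) q2(Z2), with q1(1) = a and q2(1) = b."""
    q1, q2 = [1.0 - a, a], [1.0 - b, b]
    return sum(q1[i] * q2[j] * math.log(q1[i] * q2[j] / post[i][j])
               for i in range(2) for j in range(2))

# Brute-force search over all factorized q on a 99 x 99 grid.
grid = [k / 100 for k in range(1, 100)]
best_kl, best_a, best_b = min((kl_factorized(a, b), a, b)
                              for a in grid for b in grid)

# Product of the true marginals, for comparison.
marg_a = post[1][0] + post[1][1]          # p(Z1 = 1 | X)
marg_b = post[0][1] + post[1][1]          # p(Z2 = 1 | X)
kl_marginals = kl_factorized(marg_a, marg_b)

print("best factorized q:", best_a, best_b, best_kl)
print("product of marginals:", marg_a, marg_b, kl_marginals)
```

Because the posterior is correlated, no factorized `q` reaches a KL of zero here; the search only finds the best member of the mean-field family.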
Substituting this factorization into $\mathcal{L}(q)$:
$$
\begin{aligned}
\mathcal{L}(q) &= \int q(Z) \ln (p(X, Z)) \,\mathrm{d} Z - \int q(Z) \ln (q(Z)) \,\mathrm{d} Z \\
&= \underbrace{\int \prod_{i=1}^{M} q_{i}(Z_{i}) \ln (p(X, Z)) \,\mathrm{d} Z}_{\text{part (1)}} - \underbrace{\int \prod_{i=1}^{M} q_{i}(Z_{i}) \sum_{i=1}^{M} \ln (q_{i}(Z_{i})) \,\mathrm{d} Z}_{\text{part (2)}}
\end{aligned}
$$
Look at Part 1 first. Suppose we are only interested in $Z_j$; pulling its factor out of the product gives:
$$(\text{Part 1}) = \int_{Z_{j}} q_{j}(Z_{j}) \left( \int_{Z_{i \neq j}} \cdots \int \prod_{i \neq j}^{M} q_{i}(Z_{i}) \ln (p(X, Z)) \prod_{i \neq j}^{M} \mathrm{d} Z_{i} \right) \mathrm{d} Z_{j}$$
Or, written more compactly:
$$(\text{Part 1}) = \int_{Z_{j}} q_{j}(Z_{j}) \left( \int_{Z_{i \neq j}} \cdots \int \ln (p(X, Z)) \prod_{i \neq j}^{M} q_{i}(Z_{i}) \,\mathrm{d} Z_{i} \right) \mathrm{d} Z_{j}$$
Or, to make it more meaningful, the inner integral can be written as an expectation over all factors other than $q_j$:
$$(\text{Part 1}) = \int_{Z_{j}} q_{j}(Z_{j}) \left[ \mathbb{E}_{i \neq j}[\ln (p(X, Z))] \right] \mathrm{d} Z_{j}$$
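The rewriting of Part 1 as an outer integral over $Z_j$ of an inner expectation can be checked numerically in the discrete case. The sketch below takes $M = 2$ and $j = 1$; the joint table and the two factors are hypothetical values.

```python
import math

joint = [[0.10, 0.05, 0.05],
         [0.02, 0.08, 0.10]]      # hypothetical p(X = x, Z1, Z2); rows index Z1
q1 = [0.6, 0.4]                   # factor q_1(Z1)
q2 = [0.2, 0.3, 0.5]              # factor q_2(Z2)

# Part 1 computed directly as a sum over all joint configurations of (Z1, Z2).
part1_direct = sum(q1[a] * q2[b] * math.log(joint[a][b])
                   for a in range(2) for b in range(3))

# With j = 1: the inner expectation E_{i != j}[ln p(X, Z)] averages over Z2 only
# and is therefore a function of Z1.
inner = [sum(q2[b] * math.log(joint[a][b]) for b in range(3)) for a in range(2)]
part1_nested = sum(q1[a] * inner[a] for a in range(2))

print(part1_direct, part1_nested)
```

The list `inner` plays the role of $\mathbb{E}_{i \neq j}[\ln (p(X, Z))]$: one number per value of $Z_j$.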
Now look at Part 2:
$$(\text{Part 2}) = \int \prod_{i=1}^{M} q_{i}(Z_{i}) \sum_{i=1}^{M} \ln (q_{i}(Z_{i})) \,\mathrm{d} Z$$
Simplifying, and using the fact that every factor $q_k$ with $k \neq i$ integrates to $1$:
$$
\begin{aligned}
(\text{Part 2}) &= \int q(Z) \sum_{i=1}^{M} \ln (q_i(Z_i)) \,\mathrm{d} Z \\
&= \sum_{i=1}^{M} \int_{Z} q(Z_1, \cdots, Z_M) \ln (q_i(Z_i)) \,\mathrm{d} Z \\
&= \sum_{i=1}^{M} \int_{Z_i} q_i(Z_i) \ln (q_i(Z_i)) \,\mathrm{d} Z_i
\end{aligned}
$$
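The Part 2 simplification says that the expected log of a factorized density splits into one term per factor. A minimal discrete check, assuming three made-up factors of different sizes:

```python
import math
from itertools import product

# Hypothetical factors q_1, q_2, q_3 of a mean-field distribution.
factors = [[0.6, 0.4], [0.2, 0.3, 0.5], [0.9, 0.1]]

# Left-hand side: sum over every joint configuration z of q(z) * sum_i ln q_i(z_i).
part2_full = 0.0
for z in product(*(range(len(f)) for f in factors)):
    qz = math.prod(f[zi] for f, zi in zip(factors, z))
    part2_full += qz * sum(math.log(f[zi]) for f, zi in zip(factors, z))

# Right-hand side: each factor contributes its own sum of q_i ln q_i.
part2_sum = sum(sum(p * math.log(p) for p in f) for f in factors)

print(part2_full, part2_sum)
```

The left-hand side visits all 12 joint configurations, while the right-hand side only touches 7 numbers, which hints at why the factorized form is cheaper to work with.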