Everyone should be familiar with Bayes' rule:

$$P(Z|X)=\frac{p(X,Z)}{\int_z p(X,Z=z)dz}$$
We call $P(Z|X)$ the posterior distribution. Computing the posterior is usually very difficult. Why?
Suppose $Z$ is a high-dimensional random variable. To evaluate $P(Z=z|X=x)$ we cannot avoid computing $\int_z p(X=x,Z=z)dz$, and because $Z$ is high-dimensional, this integral is very hard to compute. Variational inference is a method for approximating the posterior distribution.
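To see why the integral blows up, here is a toy sketch (the binary latent space and the `log_joint` function below are hypothetical, chosen only to make the counting argument concrete): brute-forcing the evidence means summing the joint over every configuration of $Z$, which for $d$ binary components is $2^d$ terms.

```python
# Toy illustration: brute-force evidence for d binary latent variables costs 2^d terms.
import itertools
import math

def log_joint(z):
    # hypothetical unnormalized log p(X, Z=z); any function of z works for the point
    return -0.5 * sum(z)

def evidence(d):
    # sum over all 2^d binary configurations of Z
    return sum(math.exp(log_joint(z)) for z in itertools.product([0, 1], repeat=d))

print(evidence(10))   # 1024 terms: fine
# evidence(40) would need ~10^12 terms; a 40-dim binary Z is already out of reach
```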
core idea
The core idea of variational inference consists of two steps:
- Posit a family of distributions $q(z;\lambda)$ (one we can actually handle; an intractable family would defeat the purpose)
- Adjust the parameters $\lambda$ so that $q(z;\lambda)$ gets close to $p(z|x)$
Summed up in one sentence: fit the complicated distribution $p(z|x)$ with a simple distribution $q(z;\lambda)$.
This strategy turns the problem of computing $p(z|x)$ into an optimization problem:

$$\lambda^* = \arg\min_{\lambda}~\text{divergence}(p(z|x),\,q(z;\lambda))$$
After convergence, $q(z;\lambda)$ can be used in place of $p(z|x)$.
Derivation
Take the logarithm of the marginal probability:
$$\begin{aligned} \log P(x) &= \log P(x,z)-\log P(z|x) \\ &=\log\frac{P(x,z)}{q(z;\lambda)}-\log\frac{P(z|x)}{q(z;\lambda)} \end{aligned}$$
Taking the expectation of both sides with respect to the distribution $q(z;\lambda)$ gives
$$\begin{aligned} \mathbb E_{q(z;\lambda)}\log P(x) &= \mathbb E_{q(z;\lambda)}\log P(x,z)-\mathbb E_{q(z;\lambda)}\log P(z|x) \\ \log P(x)&=\mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)}-\mathbb E_{q(z;\lambda)}\log\frac{p(z|x)}{q(z;\lambda)} \\ &=KL(q(z;\lambda)\,||\,p(z|x))+\mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)} \end{aligned}$$

The left-hand side simplifies because $\log P(x)$ does not depend on $z$, so $\mathbb E_{q(z;\lambda)}\log P(x)=\log P(x)$.
Our goal is to make $q(z;\lambda)$ close to $p(z|x)$, i.e. to solve $\min_\lambda KL(q(z;\lambda)\,||\,p(z|x))$. But $KL(q(z;\lambda)\,||\,p(z|x))$ contains $p(z|x)$, which is exactly the term we cannot compute. Treating $\lambda$ as the variable, $\log P(x)$ is a constant, so $\min_\lambda KL(q(z;\lambda)\,||\,p(z|x))$ is equivalent to $\max_\lambda \mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)}$.
$\mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$ is called the Evidence Lower Bound (ELBO).
The goal of variational inference now becomes

$$\max_\lambda \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$$
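As a concrete illustration of this objective, here is a minimal black-box sketch (the conjugate Gaussian model, the reparameterized Monte Carlo estimator, and the use of `scipy.optimize.minimize` are all assumptions for the example, not part of the derivation above): we fit $q(z;\lambda)=\mathcal N(\mu,\sigma^2)$ by maximizing a sampled estimate of the ELBO, and since the toy model is conjugate we can check the result against the exact posterior.

```python
# A minimal sketch of maximizing the ELBO for q(z; lambda) = N(mu, sigma^2).
# Assumed toy model: z ~ N(0, 1), x_i | z ~ N(z, 1), so the exact posterior is known.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=20)                 # observed data

def log_joint(z):
    # log p(x, z) evaluated at an array of z samples
    log_prior = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(-0.5 * (x[:, None] - z)**2 - 0.5 * np.log(2 * np.pi), axis=0)
    return log_prior + log_lik

eps = rng.normal(size=2000)                       # fixed noise -> deterministic objective

def neg_elbo(lam):
    mu, log_sigma = lam
    z = mu + np.exp(log_sigma) * eps              # reparameterized samples from q(z; lambda)
    energy = np.mean(log_joint(z))                # Monte Carlo E_q[log p(x, z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_sigma   # H(q) for a Gaussian
    return -(energy + entropy)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
n = len(x)
print(mu_hat, sigma_hat)                             # fitted q(z; lambda)
print(n * x.mean() / (n + 1), (1.0 / (n + 1))**0.5)  # exact posterior mean and std
```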
Why is it called the ELBO? $p(x)$ is usually called the evidence, and since $KL(q\,||\,p)\geq 0$, we have

$$\log p(x)\geq \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$$

so this quantity is a lower bound on the log evidence: hence Evidence Lower Bound.
ELBO
Let's take a closer look at the ELBO:
$$\begin{aligned} ELBO(\lambda) &= \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)] \\ &= \mathbb E_{q(z;\lambda)}\log p(x,z) -\mathbb E_{q(z;\lambda)}\log q(z;\lambda)\\ &= \mathbb E_{q(z;\lambda)}\log p(x,z) + H(q) \end{aligned}$$
The first term represents an energy. The energy encourages $q$ to focus probability mass where the model puts high probability, $p(\mathbf{x}, \mathbf{z})$. The entropy encourages $q$ to spread probability mass to avoid concentrating to one location.
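A quick numeric check of this trade-off (assuming, for illustration, the target $p(z)=\mathcal N(0,1)$ and $q=\mathcal N(0,\sigma^2)$): shrinking $\sigma$ raises the energy term but destroys the entropy, and the ELBO peaks in between.

```python
# Energy/entropy trade-off for q = N(0, sigma^2) against target p(z) = N(0, 1):
# the ELBO peaks at sigma = 1, where it equals log p = 0 for this normalized toy.
import numpy as np

z = np.random.default_rng(0).normal(size=100_000)
for sigma in [0.1, 0.5, 1.0, 2.0, 3.0]:
    zs = sigma * z                                            # samples from q
    energy = np.mean(-0.5 * zs**2 - 0.5 * np.log(2 * np.pi))  # E_q[log p(z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)       # H(q)
    print(f"sigma={sigma}: energy={energy:.3f}, "
          f"entropy={entropy:.3f}, elbo={energy + entropy:.3f}")
```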
q(Z)
Suppose $Z$ consists of $K$ random variables (each of which may itself be multivariate). We assume:
$$q(Z;\lambda) = \prod_{k=1}^{K}q_k(Z_k;\lambda_k)$$
This factorization is called the mean field approximation; for more background, see https://metacademy.org/graphs/concepts/mean_field
The ELBO then becomes
$$\begin{aligned} ELBO(\lambda) &= \mathbb E_{q(Z;\lambda)}\log p(X,Z) -\mathbb E_{q(Z;\lambda)}\log q(Z;\lambda) \\ &= \int q(Z;\lambda)\log p(X,Z)dZ-\int q(Z;\lambda)\log q(Z;\lambda)dZ\\ &=\int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \log p(X,Z)dZ-\int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \log q(Z;\lambda)dZ \end{aligned}$$
The first term is the energy; the second term is the entropy $H(q)$.
energy
Notation:

$$Z = \{Z_j,\overline Z_j \}, \quad \overline Z_j=Z\backslash Z_j$$

$$\lambda=\{\lambda_j, \overline\lambda_j\}, \quad \overline \lambda_j=\lambda\backslash\lambda_j$$
First, deal with the energy term:
$$\begin{aligned} \int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \log p(X,Z)dZ &= \int_{Z_j}q_j(Z_j;\lambda_j)\int_{ \overline Z_j}\Bigl[\prod_{k \neq j}^K q_k(Z_k;\lambda_k)\Bigr]\log p(X,Z)\,d \overline Z_j\,dZ_j \\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\Bigl[E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\Bigr]dZ_j\\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\Bigl\{\log \exp\Bigl[E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\Bigr]\Bigr\}dZ_j\\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\Bigl[\log q_j^* (Z_j;\lambda_j)+\log C\Bigr]dZ_j \end{aligned}$$
where

$$q_j^* (Z_j;\lambda_j)=\frac{1}{C}\exp\Bigl[E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\Bigr]$$

and $C=\int_{Z_j}\exp\bigl[E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\bigr]dZ_j$ is the normalizing constant that makes $q_j^* (Z_j;\lambda_j)$ a proper distribution. Note that $C$ depends on the variational parameters $\overline \lambda_j$ but not on $Z$ or $\lambda_j$.
H(q)
Now deal with the entropy term:
$$\begin{aligned} \int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \log q(Z;\lambda)dZ &= \int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \sum_{n=1}^K\log q_n(Z_n;\lambda_n)dZ \\ &= \sum_j\int \Bigl[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Bigr] \log q_j(Z_j;\lambda_j)dZ\\ &= \sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)dZ_j\int \Bigl[\prod_{k\neq j}^{K}q_k(Z_k;\lambda_k)\Bigr]d\overline Z_j\\ &= \sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)dZ_j \end{aligned}$$
A second look at the ELBO
After the manipulations above, singling out one factor $q_i$ (the entropies of the remaining factors are collected in $H(q(\overline Z_i;\overline \lambda_i))$), the ELBO becomes
$$\begin{aligned} ELBO &= \int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^* (Z_i;\lambda_i)dZ_i-\sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)dZ_j+\log C\\ &=\Bigl\{\int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^* (Z_i;\lambda_i)dZ_i-\int_{Z_i} q_i(Z_i;\lambda_i)\log q_i(Z_i;\lambda_i)dZ_i\Bigr\} +H(q(\overline Z_i;\overline \lambda_i))+\log C \end{aligned}$$
Now look at the term in braces above:
$$\int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^* (Z_i;\lambda_i)dZ_i-\int_{Z_i} q_i(Z_i;\lambda_i)\log q_i(Z_i;\lambda_i)dZ_i = -KL(q_i(Z_i;\lambda_i)\,||\,q_i^* (Z_i;\lambda_i))$$
So the ELBO can be rewritten as:
$$ELBO=-KL(q_i(Z_i;\lambda_i)\,||\,q_i^* (Z_i;\lambda_i))+H(q(\overline Z_i;\overline \lambda_i))+\log C$$
We want to maximize the ELBO. How should we update $q_i(Z_i;\lambda_i)$?
From

$$ELBO=-KL(q_i(Z_i;\lambda_i)\,||\,q_i^* (Z_i;\lambda_i))+H(q(\overline Z_i;\overline \lambda_i))+\log C$$
we can see that when $q_i(Z_i;\lambda_i)=q_i^* (Z_i;\lambda_i)$, the term $KL(q_i(Z_i;\lambda_i)\,||\,q_i^* (Z_i;\lambda_i))=0$, and the ELBO attains its maximum with respect to $q_i$ (holding the other factors fixed).
So the parameter update strategy becomes a coordinate ascent: cycle through the factors, setting each to its optimum in turn (a code sketch follows below):

$$\begin{aligned} &q_1(Z_1;\lambda_1)=q_1^* (Z_1;\lambda_1)\\ &q_2(Z_2;\lambda_2)=q_2^* (Z_2;\lambda_2)\\ &q_3(Z_3;\lambda_3)=q_3^* (Z_3;\lambda_3)\\ &\dots \end{aligned}$$
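Here is what these coordinate updates look like on the textbook conjugate example (the model, priors, and closed-form $q^*$ updates below are standard results assumed for illustration; they are not derived in this post): data $x_i\sim\mathcal N(\mu,\tau^{-1})$ with $\mu\sim\mathcal N(\mu_0,(\lambda_0\tau)^{-1})$, $\tau\sim\text{Gamma}(a_0,b_0)$, and mean-field family $q(\mu,\tau)=q(\mu)\,q(\tau)$.

```python
# A sketch of the q_1 <- q_1*, q_2 <- q_2*, ... coordinate updates for the assumed
# conjugate model above; each q_k* has a closed form (cf. the formula for q* below).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=100)            # synthetic observations
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0        # prior hyperparameters (assumed)

E_tau = a0 / b0                               # initial guess for E_q[tau]
for _ in range(50):
    # q(mu) <- q*(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) <- q*(tau) = Gamma(a_N, b_N), holding q(mu) fixed
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(mu_N, np.sqrt(b_N / a_N))               # approx posterior mean of mu and scale 1/sqrt(E[tau])
```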
Concerning $q_i^* (Z_i;\lambda_i)$, the update reads:
$$\begin{aligned} q_i(Z_i;\lambda_i)&=q_i^* (Z_i;\lambda_i)\\ &=\frac{1}{C}\exp\Bigl[E_{q(\overline Z_i;\overline \lambda_i)}\log p(X,Z)\Bigr]\\ &=\frac{1}{C}\exp\Bigl[E_{q(\overline Z_i;\overline \lambda_i)}\log p(X,Z_i,\overline Z_i)\Bigr] \end{aligned}$$
Here $q_i$ is the node being updated and $X$ is the observed data. Thanks to the Markov blanket property (introduced below), the update formula reduces to:
$$\log q_i(Z_i;\lambda_i)=\int q(mb(Z_i))\log p(Z_i,mb(Z_i),X)\,d\,mb(Z_i)$$
Because all terms in the expression that do not involve $Z_i$ are integrated out (up to an additive constant absorbed into the normalizer), the expectation only needs to range over the Markov blanket of $Z_i$, which is why the update can be written in this form.
Markov Blanket
In machine learning, the Markov blanket for a node $A$ in a Bayesian network is the set of nodes $mb(A)$ composed of $A$'s parents, its children, and its children's other parents. In a Markov random field, the Markov blanket of a node is its set of neighboring nodes.
Every set of nodes in the network is conditionally independent of $A$ when conditioned on the set $mb(A)$, that is, when conditioned on the Markov blanket of the node $A$. The probability has the Markov property; formally, for distinct nodes $A$ and $B$:
$$Pr(A\mid mb(A),B)=Pr(A\mid mb(A))$$
The Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node.
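As a small illustration (the `{node: parents}` DAG encoding and the helper name `markov_blanket` are my own for this sketch, not from any particular library), the blanket of a node in a Bayesian network can be read off the graph structure directly:

```python
# Compute the Markov blanket of a node in a DAG given as {node: list_of_parents}:
# parents + children + the children's other parents.
def markov_blanket(node, parents):
    children = [c for c, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | set(children) | co_parents

# toy network: A -> C <- B, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(markov_blanket("A", parents))   # {'B', 'C'}: child C and co-parent B
```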
References
https://en.wikipedia.org/wiki/Markov_blanket
http://edwardlib.org/tutorials/inference
http://edwardlib.org/tutorials/variational-inference