Course notes: Predictive Analytics, Spring 2021
Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.
In this class, we'll cover topics in machine learning from a probabilistic view.
We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.
Chapter 4 Bayesian statistics: an introduction
$$p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{\int p\left(\boldsymbol{\theta}^{\prime}\right) p\left(\mathcal{D} \mid \boldsymbol{\theta}^{\prime}\right) d \boldsymbol{\theta}^{\prime}}$$
$p(\mathcal{D})$ is the marginal likelihood (the normalizing constant). It is the marginal distribution of the data and does not depend on any particular value of $\boldsymbol{\theta}$, because $\boldsymbol{\theta}$ has been integrated out.
We will cover the following parts:
- summarizing the posterior
- posterior computation with conjugate priors
- Bayesian model comparison
- approximating the posterior
Summarizing the posterior
Point estimates
e.g., the posterior mean or the posterior median.
Credible intervals
A credible interval measures confidence in our parameter estimates. (A small sample size or low-quality data may lead to larger uncertainty.)
We use a $100(1-\alpha)\%$ credible interval, which contains $1-\alpha$ of the posterior probability mass. That is, we look for a lower endpoint $l$ and an upper endpoint $u$ such that
$$P(l \le \theta \le u \mid \mathcal{D})=1-\alpha$$
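As a rough illustration of a credible interval, the sketch below estimates a central $100(1-\alpha)\%$ interval from posterior samples by cutting off $\alpha/2$ of the mass in each tail. The Beta(8, 4) posterior, the sample size, and the function name `central_credible_interval` are assumptions made for this example only; they do not come from the notes.

```python
import random


def central_credible_interval(samples, alpha=0.05):
    """Return (l, u) such that roughly 1 - alpha of the samples lie in [l, u]."""
    ordered = sorted(samples)
    n = len(ordered)
    lower_index = int((alpha / 2) * n)          # cut alpha/2 of the mass on the left
    upper_index = int((1 - alpha / 2) * n) - 1  # cut alpha/2 of the mass on the right
    return ordered[lower_index], ordered[upper_index]


random.seed(0)
# Hypothetical posterior: Beta(8, 4), e.g. a uniform prior plus 7 successes and 3 failures.
posterior_samples = [random.betavariate(8, 4) for _ in range(20000)]

l, u = central_credible_interval(posterior_samples, alpha=0.05)
coverage = sum(l <= s <= u for s in posterior_samples) / len(posterior_samples)
print(f"95% credible interval: [{l:.3f}, {u:.3f}], empirical mass {coverage:.3f}")
```

A central interval is only one possible choice; any pair $(l, u)$ that encloses $1-\alpha$ of the posterior mass satisfies the definition.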
Posterior computation with conjugate priors
Conjugate priors
In this section, we consider a set of (prior, likelihood) pairs for which we can compute the posterior in closed form.
In particular, we will use priors that are "conjugate" to the likelihood.
We say that a prior $p(\boldsymbol{\theta}) \in \mathcal{F}$ is a conjugate prior for a likelihood function $p(\mathcal{D} \mid \boldsymbol{\theta})$ if the posterior is in the same parameterized family as the prior, i.e., $p(\boldsymbol{\theta} \mid \mathcal{D}) \in \mathcal{F}$.
In other words, $\mathcal{F}$ is closed under Bayesian updating. If the family $\mathcal{F}$ corresponds to the exponential family, then the computations can be performed in closed form.
The Dirichlet-multinomial model: categorical likelihood, Dirichlet prior
- Likelihood:
Let $Y \sim \operatorname{Cat}(\boldsymbol{\theta})$ be a discrete random variable drawn from a categorical distribution. The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\theta})=\prod_{n=1}^{N} \operatorname{Cat}\left(y_{n} \mid \boldsymbol{\theta}\right)=\prod_{n=1}^{N} \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}\left(y_{n}=c\right)}=\prod_{c=1}^{C} \theta_{c}^{N_{c}}$$
where $N_{c}=\sum_{n} \mathbb{I}\left(y_{n}=c\right)$ is the number of times category $c$ appears in the data.
Example: for a six-sided die ($C=6$) rolled $N$ times, $N_c$ is the number of rolls on which face $c$ came up.
- Prior:
The conjugate prior for a categorical distribution is the Dirichlet distribution, which is a multivariate generalization of the beta distribution. This has support over the probability simplex, defined by
$$S_{K}=\left\{\boldsymbol{\theta}: 0 \leq \theta_{k} \leq 1, \sum_{k=1}^{K} \theta_{k}=1\right\}$$
The pdf of the Dirichlet is defined as follows:
$$\operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \triangleq \frac{1}{B(\breve{\boldsymbol{\alpha}})} \prod_{k=1}^{K} \theta_{k}^{\breve{\alpha}_{k}-1} \mathbb{I}\left(\boldsymbol{\theta} \in S_{K}\right)$$
where $B(\breve{\boldsymbol{\alpha}})$ is the multivariate beta function,
$$B(\breve{\boldsymbol{\alpha}}) \triangleq \frac{\prod_{k=1}^{K} \Gamma\left(\breve{\alpha}_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} \breve{\alpha}_{k}\right)}$$
$\breve{\boldsymbol{\alpha}}$ is a hyperparameter: it is fixed in advance and is used to infer the posterior distribution of the parameters. Here the number of components $K$ equals the number of categories $C$.
The Dirichlet distribution generalizes the beta distribution: the beta is a distribution over a single probability, while the Dirichlet is a distribution over a vector of probabilities that sum to one.
- Posterior:
We can combine the multinomial likelihood and the Dirichlet prior to compute the posterior, as follows:
$$\begin{aligned} p(\boldsymbol{\theta} \mid \mathcal{D}) & \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \\ & \propto\left[\prod_{k} \theta_{k}^{N_{k}}\right]\left[\prod_{k} \theta_{k}^{\breve{\alpha}_{k}-1}\right] \\ & \propto \operatorname{Dir}\left(\boldsymbol{\theta} \mid \breve{\alpha}_{1}+N_{1}, \ldots, \breve{\alpha}_{K}+N_{K}\right) \\ &=\operatorname{Dir}(\boldsymbol{\theta} \mid \widehat{\boldsymbol{\alpha}}) \end{aligned}$$
where $\widehat{\alpha}_{k}=\breve{\alpha}_{k}+N_{k}$ are the parameters of the posterior. Since the posterior and $\operatorname{Dir}(\boldsymbol{\theta} \mid \widehat{\boldsymbol{\alpha}})$ are both normalized densities, proportionality implies that they are equal.
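The Dirichlet update amounts to adding the observed counts to the prior parameters. The following minimal sketch illustrates this for a six-sided die; the data, the uniform Dir(1, ..., 1) prior, and the helper names `dirichlet_posterior` and `dirichlet_mean` are hypothetical choices for illustration.

```python
from collections import Counter


def dirichlet_posterior(prior_alpha, observations):
    """Return alpha_hat[k] = prior_alpha[k] + N_k.

    Categories are assumed to be labelled 0, ..., K-1.
    """
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(prior_alpha)]


def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical data: 12 rolls of a six-sided die, faces labelled 0..5.
rolls = [0, 2, 2, 5, 1, 2, 3, 5, 5, 2, 4, 0]
prior_alpha = [1.0] * 6  # uniform prior over the simplex (an assumption)

alpha_hat = dirichlet_posterior(prior_alpha, rolls)
posterior_mean = dirichlet_mean(alpha_hat)
print(alpha_hat)
print(posterior_mean)
```

With a uniform prior, each prior parameter acts like one pseudo-observation of the corresponding face, so faces that never appear still get nonzero posterior mean.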
The Gaussian-Gaussian model: Gaussian likelihood, inverse Gamma prior
We only discuss the posterior of $\sigma^{2}$ when $\mu$ is known.
Likelihood:
If $\mu$ is a known constant, the likelihood for $\sigma^{2}$ has the form
$$p\left(\mathcal{D} \mid \sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-N / 2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right)$$
Prior:
Here we can no longer ignore the $1 /\left(\sigma^{2}\right)$ term in front of the likelihood. The standard conjugate prior is the inverse Gamma distribution (the distribution of the reciprocal of a Gamma-distributed variable), given by
$$\mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right)=\frac{\breve{b}^{\breve{a}}}{\Gamma(\breve{a})}\left(\sigma^{2}\right)^{-(\breve{a}+1)} \exp \left(-\frac{\breve{b}}{\sigma^{2}}\right)$$
Posterior:
$$\begin{aligned} p\left(\sigma^{2} \mid \mathcal{D}\right) & \propto p\left(\mathcal{D} \mid \sigma^{2}\right) p\left(\sigma^{2} \mid \breve{a}, \breve{b}\right) \\ & \propto\left(\sigma^{2}\right)^{-\left(\frac{N}{2}+\breve{a}+1\right)} \exp \left(-\frac{\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}}{\sigma^{2}}\right) \end{aligned}$$
i.e., the unnormalized density of
$$\mathrm{IG}\left(\sigma^{2} \,\Big|\, \frac{N}{2}+\breve{a},\ \breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right)$$
Multiplying the likelihood and the prior, we see that the posterior is also IG:
$$\begin{aligned} p\left(\sigma^{2} \mid \mu, \mathcal{D}\right) &=\mathrm{IG}\left(\sigma^{2} \mid \widehat{a}, \widehat{b}\right) \\ \widehat{a} &=\breve{a}+N / 2 \\ \widehat{b} &=\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2} \end{aligned}$$
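The inverse Gamma update can be written in a few lines. This is a minimal sketch under assumed values: the data, the known mean `mu`, and the prior parameters `a_prior` and `b_prior` are hypothetical, and the posterior mean uses the standard formula $b/(a-1)$, valid for $a > 1$.

```python
def inverse_gamma_posterior(a_prior, b_prior, data, mu):
    """Return (a_hat, b_hat) for the variance of a Gaussian with known mean mu."""
    n = len(data)
    sum_sq = sum((y - mu) ** 2 for y in data)
    a_hat = a_prior + n / 2
    b_hat = b_prior + 0.5 * sum_sq
    return a_hat, b_hat


def inverse_gamma_mean(a, b):
    """Mean of IG(a, b); it exists only for a > 1."""
    if a <= 1:
        raise ValueError("the mean of IG(a, b) is undefined for a <= 1")
    return b / (a - 1)


# Hypothetical observations with known mean mu = 2.5.
data = [1.0, 3.0, 2.0, 4.0]
mu = 2.5
a_prior, b_prior = 2.0, 1.0  # hypothetical prior hyperparameters

a_hat, b_hat = inverse_gamma_posterior(a_prior, b_prior, data, mu)
print(a_hat, b_hat, inverse_gamma_mean(a_hat, b_hat))
```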
In general, we do not have a closed-form posterior, so we have to use approximate inference methods.
Bayesian model comparison
"All models are wrong, but some are useful." (George Box)
We assume we have a set of models $\mathcal{M}$.
Objective: we want to choose the best model from the set $\mathcal{M}$.
We select the model that is most probable given the observed data:
$$\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}}\ p(m \mid \mathcal{D})$$
where
$$p(m \mid \mathcal{D})=\frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m \in \mathcal{M}} p(\mathcal{D} \mid m)\, p(m)}$$
Here $m$ denotes a model and $\mathcal{D}$ the data. $p(\mathcal{D} \mid m)$ is the marginal likelihood of the data once model $m$ has been chosen; it plays the role of $p(\mathcal{D})$ in $p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}$.
If the prior over models is uniform, $p(m)=1 /|\mathcal{M}|$ (without additional information, we treat every model as equally likely), then the MAP model is given by
$$\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}}\ p(\mathcal{D} \mid m)$$
That is, we select the model under which the observed data are most probable.
The quantity $p(\mathcal{D} \mid m)$ is given by
$$p(\mathcal{D} \mid m)=\int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d \boldsymbol{\theta}$$
where $\boldsymbol{\theta}$ denotes the parameters of model $m$.
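To make the model posterior concrete, the sketch below compares two hypothetical models for a sequence of coin flips: $m_0$, a fair coin with $\theta = 0.5$ fixed, and $m_1$, a coin with unknown $\theta$ and a Beta(1, 1) prior. For $m_1$ the integral has the closed form $B(a+N_1, b+N_0)/B(a, b)$. The models, the data, and all function names are assumptions chosen for illustration.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_fair_coin(n_heads, n_tails):
    """log p(D | m0): theta is fixed at 0.5, so there is nothing to integrate."""
    return (n_heads + n_tails) * math.log(0.5)


def log_marginal_beta_coin(n_heads, n_tails, a=1.0, b=1.0):
    """log p(D | m1) = log B(a + N1, b + N0) - log B(a, b) under a Beta(a, b) prior."""
    return log_beta(a + n_heads, b + n_tails) - log_beta(a, b)


def model_posterior(log_marginals, priors):
    """Normalize p(D | m) p(m) over the model set, working in log space."""
    log_joint = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_joint)  # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data: a specific sequence with 9 heads and 1 tail.
n_heads, n_tails = 9, 1
log_marginals = [
    log_marginal_fair_coin(n_heads, n_tails),
    log_marginal_beta_coin(n_heads, n_tails),
]
posterior = model_posterior(log_marginals, priors=[0.5, 0.5])
best = max(range(len(posterior)), key=posterior.__getitem__)
print(posterior, "selected model index:", best)
```

Both marginal likelihoods are for the same ordered sequence, so no binomial coefficient is needed; it would cancel in the ratio anyway.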
Bayes model averaging
If our goal is to perform prediction, we can get better results if we marginalize out over all models, by computing
$$p(y \mid \mathbf{x}, \mathcal{D})=\sum_{m \in \mathcal{M}} p(y \mid \mathbf{x}, m)\, p(m \mid \mathcal{D})$$
Here $\mathcal{D}$ is the training data and $(\mathbf{x}, y)$ is a new data point; the goal is to predict $y$.
In other words, the prediction is a weighted average over all models, where the weight of each model is its posterior probability given the data.
Disadvantage: this is computationally very expensive.
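The averaging step itself is a weighted sum. A minimal sketch, assuming the per-model predictive probabilities $p(y \mid \mathbf{x}, m)$ and the model posteriors $p(m \mid \mathcal{D})$ have already been computed; the numbers below are made up for illustration.

```python
def bayes_model_average(predictive_probs, model_posteriors):
    """Return sum_m p(y | x, m) * p(m | D)."""
    if len(predictive_probs) != len(model_posteriors):
        raise ValueError("need one predictive probability per model")
    return sum(p * w for p, w in zip(predictive_probs, model_posteriors))


# Hypothetical values for three models.
predictive_probs = [0.9, 0.6, 0.2]   # p(y = 1 | x, m) for each model
model_posteriors = [0.5, 0.3, 0.2]   # p(m | D), sums to one

p_y = bayes_model_average(predictive_probs, model_posteriors)
print(p_y)
```

The expensive part in practice is obtaining the inputs: every model in $\mathcal{M}$ must be fitted and its marginal likelihood evaluated.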
Week 3
Variational inference
Objective: we want to approximate $p(z \mid x)$, the posterior distribution of the latent variables (parameters) $z$ given the data $x$.
In variational inference, we pick a family of distributions over the latent variables (the parameters of interest), indexed by its own variational parameters.
Our goal is to find the member $q$ of this family that minimizes the distance between $q$ and $p(z \mid x)$.
- The measure of distance is the KL divergence, written $\mathbb{KL}(q \| p)$, and we choose
$$q^{*}=\underset{q}{\arg \min }\ \mathbb{KL}(q \| p)$$
Some background on the KL divergence
Entropy
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability.
For example, suppose we observe a sequence of symbols $x_n \sim p$ generated from distribution $p$. If $p$ has high entropy, it will be hard to predict the value of each observation $x_n$.
- The entropy of a discrete random variable $X$ with distribution $p$ over $K$ states is defined by
$$\mathbb{H}(X) \triangleq-\sum_{k=1}^{K} p(X=k) \log _{2} p(X=k)=-\mathbb{E}_{X}\left[\log _{2} p(X)\right]$$
The discrete distribution with maximum entropy is the uniform distribution. Hence for a $K$-ary random variable, the entropy is maximized if $p(x=k)=1 / K$; in this case, $\mathbb{H}(X)=\log _{2} K$.
Conversely, the distribution with minimum entropy (which is zero) is any delta function that puts all its mass on one state. Such a distribution has no uncertainty.
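The two extreme cases can be checked numerically. A minimal sketch, where `entropy_bits` is a hypothetical helper that follows the usual convention $0 \log 0 = 0$:

```python
import math


def entropy_bits(probs):
    """Entropy in bits of a discrete distribution; terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]    # maximum entropy over K = 4 states
delta = [1.0, 0.0, 0.0, 0.0]          # all mass on one state
skewed = [0.5, 0.25, 0.125, 0.125]    # somewhere in between

print(entropy_bits(uniform))  # equals log2(4)
print(entropy_bits(delta))
print(entropy_bits(skewed))
```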
- Differential entropy for continuous random variables
If $X$ is a continuous random variable with pdf $p(x)$, we define the differential entropy as
$$h(X) \triangleq-\int_{\mathcal{X}} d x\, p(x) \log p(x)$$
assuming this integral exists.
Entropy of the uniform distribution
For example, suppose $X \sim U(0, a)$. Then
$$h(X)=-\int_{0}^{a} d x\, \frac{1}{a} \log \frac{1}{a}=\log a$$
Note that, unlike the discrete case, differential entropy can be negative. This is because pdfs can be bigger than 1. For example, if $X \sim U(0,1 / 8)$, we have $h(X)=\log _{2}(1 / 8)=-3$.
Entropy of the Gaussian distribution
The entropy of a $d$-dimensional Gaussian is
$$h(\mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}))=\frac{1}{2} \ln |2 \pi e \mathbf{\Sigma}|=\frac{1}{2} \ln \left[(2 \pi e)^{d}|\mathbf{\Sigma}|\right]=\frac{d}{2}+\frac{d}{2} \ln (2 \pi)+\frac{1}{2} \ln |\mathbf{\Sigma}|$$
In the 1d case, this becomes
$$h\left(\mathcal{N}\left(\mu, \sigma^{2}\right)\right)=\frac{1}{2} \ln \left[2 \pi e \sigma^{2}\right]$$
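As a sanity check on the 1d formula, the sketch below compares the closed form with a midpoint-rule approximation of $-\int p(x) \ln p(x)\, dx$. The integration range of $\pm 12\sigma$, the step count, and the helper names are arbitrary choices for this example.

```python
import math


def gaussian_entropy_nats(sigma):
    """Closed form: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def numerical_entropy_nats(pdf, lower, upper, steps=20000):
    """Midpoint-rule approximation of -integral p(x) ln p(x) dx on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lower + (i + 0.5) * width)
        if p > 0:  # skip points where the density underflows to zero
            total -= p * math.log(p) * width
    return total


mu, sigma = 0.0, 2.0
closed_form = gaussian_entropy_nats(sigma)
numeric = numerical_entropy_nats(
    lambda x: gaussian_pdf(x, mu, sigma), mu - 12 * sigma, mu + 12 * sigma
)
narrow = gaussian_entropy_nats(0.1)  # a narrow Gaussian has negative entropy

print(closed_form, numeric, narrow)
```

The last value illustrates that differential entropy becomes negative once the density is concentrated enough.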
Cross entropy
The cross entropy between distributions $p$ and $q$ is defined by
$$\mathbb{H}(p, q) \triangleq-\sum_{k=1}^{K} p_{k} \log q_{k}$$
One can show that the cross entropy is the expected number of bits needed to compress some data samples drawn from distribution $p$ using a code based on distribution $q$. This can be minimized by setting $q=p$.
Relative entropy (KL divergence)
Given two distributions $p$ and $q$, it is often useful to define a distance metric to measure how "close" or "similar" they are.
More generally, we consider a divergence measure $D(p, q)$, which quantifies how far $q$ is from $p$ without requiring $D$ to be a metric. Here we focus on the Kullback-Leibler (KL) divergence.
- Definition of the KL divergence:
For discrete distributions, the KL divergence is defined as follows:
$$\mathbb{KL}(p \| q) \triangleq \sum_{k=1}^{K} p_{k} \log \frac{p_{k}}{q_{k}}$$
This naturally extends to continuous distributions as well:
$$\mathbb{KL}(p \| q) \triangleq \int d x\, p(x) \log \frac{p(x)}{q(x)}$$
- Interpretation:
We can rewrite the KL as follows:
$$\mathbb{KL}(p \| q)=\underbrace{\sum_{k=1}^{K} p_{k} \log p_{k}}_{-\mathbb{H}(p)} \underbrace{-\sum_{k=1}^{K} p_{k} \log q_{k}}_{\mathbb{H}(p, q)}$$
We recognize the first term as the negative entropy, and the second term as the cross entropy. Thus we can interpret the KL divergence as the "extra number of bits" you need to pay when compressing data samples from $p$ using the incorrect distribution $q$ as the basis of your coding scheme.
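The decomposition into cross entropy minus entropy can be verified on a small example. This is a minimal sketch with two made-up distributions; it assumes $q_k > 0$ wherever $p_k > 0$, since otherwise the KL divergence is infinite.

```python
import math


def entropy(p):
    """H(p) in bits."""
    return sum(-pk * math.log2(pk) for pk in p if pk > 0)


def cross_entropy(p, q):
    """H(p, q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(-pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


# Hypothetical distributions over three symbols.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

extra_bits = cross_entropy(p, q) - entropy(p)
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q), extra_bits)
```

Coding samples from `p` with a code built for `q` costs `cross_entropy(p, q)` bits per symbol on average, against `entropy(p)` bits for the ideal code; the difference is the KL divergence.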
- For example, one can show that the KL divergence between two multivariate Gaussian distributions is given by
$$\begin{aligned} &\mathbb{KL}\left(\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}_{1}\right) \| \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}_{2}\right)\right) \\ &\quad=\frac{1}{2}\left[\operatorname{tr}\left(\mathbf{\Sigma}_{2}^{-1} \mathbf{\Sigma}_{1}\right)+\left(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1}\right)^{\top} \mathbf{\Sigma}_{2}^{-1}\left(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1}\right)-D+\log \left(\frac{\operatorname{det}\left(\mathbf{\Sigma}_{2}\right)}{\operatorname{det}\left(\mathbf{\Sigma}_{1}\right)}\right)\right] \end{aligned}$$
where $D$ here denotes the dimensionality of $\mathbf{x}$ (not the data set).
In the scalar case, this becomes
$$\mathbb{KL}\left(\mathcal{N}\left(x \mid \mu_{1}, \sigma_{1}\right) \| \mathcal{N}\left(x \mid \mu_{2}, \sigma_{2}\right)\right)=\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}-\frac{1}{2}$$
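The scalar formula is easy to evaluate directly. The sketch below, with made-up parameter values, checks that the divergence is zero for identical Gaussians and that swapping the arguments changes the result; `sigma1` and `sigma2` are treated as standard deviations, matching the formula.

```python
import math


def kl_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats; sigmas are standard deviations."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
        - 0.5
    )


same = kl_gaussian_1d(0.0, 1.0, 0.0, 1.0)      # identical distributions
forward = kl_gaussian_1d(0.0, 1.0, 1.0, 2.0)   # KL(p || q)
reverse = kl_gaussian_1d(1.0, 2.0, 0.0, 1.0)   # KL(q || p)

print(same, forward, reverse)
```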
- Theorem 6.2.1. (Information inequality) $\mathbb{KL}(p \| q) \geq 0$ with equality iff $p=q$.
Proof. We now prove the theorem following [CT06, p28]. Let $A=\{x: p(x)>0\}$ be the support of $p(x)$. Using the concavity of the log function and Jensen's inequality, we have that
$$\begin{aligned} -\mathbb{KL}(p \| q) &=-\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}=\sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \\ & \leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}=\log \sum_{x \in A} q(x) \\ & \leq \log \sum_{x \in \mathcal{X}} q(x)=\log 1=0 \end{aligned}$$
Two properties of the KL divergence: it is nonnegative, and it is not symmetric, i.e., in general $\mathbb{KL}(p \| q) \neq \mathbb{KL}(q \| p)$.
Variational inference
$$\mathbb{KL}(q \| p)=\mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z \mid x)}\right]$$
$q(z)$ is a distribution that approximates $p(z \mid x)$; it is the unknown that we optimize over.
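To illustrate the idea of choosing $q$ by minimizing $\mathbb{KL}(q \| p)$, here is a toy sketch with a discrete latent variable $z \in \{0, 1, 2\}$. The target posterior, the one-parameter Binomial(2, $\pi$) variational family, and the grid search are all assumptions made for this example; practical variational inference uses richer families and gradient-based optimization, and usually cannot evaluate $p(z \mid x)$ directly.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) in nats for discrete distributions; assumes p_k > 0 wherever q_k > 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


def binomial2(pi):
    """Variational family: Binomial(2, pi) over z in {0, 1, 2}."""
    return [(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2]


# Hypothetical target posterior p(z | x) over three latent states.
posterior = [0.1, 0.3, 0.6]

# Grid search over the single variational parameter pi in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_pi = min(grid, key=lambda pi: kl_divergence(binomial2(pi), posterior))
best_q = binomial2(best_pi)
best_kl = kl_divergence(best_q, posterior)

print(best_pi, best_q, best_kl)
```

The target is not a member of the family, so the minimum KL divergence stays strictly positive: the best $q$ is only an approximation of the posterior.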
Week 4