Predictive Analytics Notes: Chapter 4

Course notes: Predictive Analytics, Spring 2021

Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.

In this class, we'll cover topics in machine learning from a probabilistic view.

We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.

Chapter 4: Bayesian Statistics

$$p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{\int p\left(\boldsymbol{\theta}^{\prime}\right) p\left(\mathcal{D} \mid \boldsymbol{\theta}^{\prime}\right) d \boldsymbol{\theta}^{\prime}}$$

$p(\mathcal{D})$: the marginal likelihood / normalizing constant. It does not depend on $\boldsymbol{\theta}$: it is the marginal distribution of the data, independent of the parameter value or the parameter's distribution.

We will cover the following parts:

  • summarizing the posterior
  • posterior computation with conjugate priors
  • Bayesian model comparison
  • approximate posterior inference

Summarizing the posterior

Point estimates

e.g., the posterior mean or the posterior median.

Credible intervals

Credible intervals measure the confidence in our parameter estimates (a small sample size or low-quality data may lead to larger uncertainty).

We use the $100(1-\alpha)\%$ credible interval, which contains $1-\alpha$ of the posterior probability mass:
$$P(l \leq \theta \leq u \mid \mathcal{D})=1-\alpha$$
where $l$ and $u$ are the lower and upper endpoints of the interval.
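As a quick numerical illustration (a minimal sketch; the Beta(4, 8) posterior and the 95% level are made-up choices for the example), we can read off an equal-tailed credible interval from posterior quantiles:

```python
import numpy as np
from scipy import stats

# Example posterior: Beta(4, 8), e.g. from a Beta(1, 1) prior and 3 successes / 7 failures.
posterior = stats.beta(4, 8)

alpha = 0.05  # for a 95% credible interval
# Equal-tailed interval: cut off alpha/2 of the posterior mass on each side.
l, u = posterior.ppf([alpha / 2, 1 - alpha / 2])
print(f"95% credible interval: [{l:.3f}, {u:.3f}]")

# The same idea with posterior samples (as produced, e.g., by MCMC):
samples = posterior.rvs(size=100_000, random_state=0)
print(np.quantile(samples, [alpha / 2, 1 - alpha / 2]))
```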

Posterior computation with conjugate priors

Conjugate priors

In this section, we consider a set of (prior, likelihood) pairs for which we can compute the posterior in closed form.


In particular, we will use priors that are “conjugate” to the likelihood.

We say that a prior $p(\boldsymbol{\theta}) \in \mathcal{F}$ is a conjugate prior for a likelihood function $p(\mathcal{D} \mid \boldsymbol{\theta})$ if the posterior is in the same parameterized family as the prior, i.e., $p(\boldsymbol{\theta} \mid \mathcal{D}) \in \mathcal{F}$.


In other words, $\mathcal{F}$ is closed under Bayesian updating. If the family $\mathcal{F}$ corresponds to the exponential family, then the computations can be performed in closed form.


The Dirichlet-multinomial model: $Y \sim \operatorname{Cat}(\boldsymbol{\theta})$, prior: Dirichlet distribution

  • Likelihood:

Let $Y \sim \operatorname{Cat}(\boldsymbol{\theta})$ be a discrete random variable drawn from a categorical distribution. The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\theta})=\prod_{n=1}^{N} \operatorname{Cat}\left(y_{n} \mid \boldsymbol{\theta}\right)=\prod_{n=1}^{N} \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}\left(y_{n}=c\right)}=\prod_{c=1}^{C} \theta_{c}^{N_{c}}$$
where $N_{c}=\sum_{n} \mathbb{I}\left(y_{n}=c\right)$ is the number of times class $c$ appears.

e.g., for a six-sided die ($C=6$, $N$ observations), $N_c$ is the number of times face $c$ appears among the $N$ rolls.

  • Prior:

The conjugate prior for a categorical distribution is the Dirichlet distribution, which is a multivariate generalization of the beta distribution. It has support over the probability simplex, defined by
$$S_{K}=\left\{\boldsymbol{\theta}: 0 \leq \theta_{k} \leq 1, \sum_{k=1}^{K} \theta_{k}=1\right\}$$

The pdf of the Dirichlet is defined as follows:
$$\operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \triangleq \frac{1}{B(\breve{\boldsymbol{\alpha}})} \prod_{k=1}^{K} \theta_{k}^{\breve{\alpha}_{k}-1}\, \mathbb{I}\left(\boldsymbol{\theta} \in S_{K}\right)$$
where $B(\breve{\boldsymbol{\alpha}})$ is the multivariate beta function,
$$B(\breve{\boldsymbol{\alpha}}) \triangleq \frac{\prod_{k=1}^{K} \Gamma\left(\breve{\alpha}_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} \breve{\alpha}_{k}\right)}$$
Here $\breve{\boldsymbol{\alpha}}$ is the vector of hyperparameters; it is given, and it is used to infer the posterior distribution of the parameters.

The Dirichlet distribution generalizes the beta distribution: it models a distribution over a vector of probabilities, whereas the beta models a single probability.

  • Posterior:

We can combine the multinomial likelihood and the Dirichlet prior to compute the posterior, as follows:
$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \propto\left[\prod_{k} \theta_{k}^{N_{k}}\right]\left[\prod_{k} \theta_{k}^{\breve{\alpha}_{k}-1}\right]=\prod_{k} \theta_{k}^{\breve{\alpha}_{k}+N_{k}-1}$$
so the posterior is
$$p(\boldsymbol{\theta} \mid \mathcal{D})=\operatorname{Dir}\left(\boldsymbol{\theta} \mid \breve{\alpha}_{1}+N_{1}, \ldots, \breve{\alpha}_{K}+N_{K}\right)=\operatorname{Dir}(\boldsymbol{\theta} \mid \widehat{\boldsymbol{\alpha}})$$
where $\widehat{\alpha}_{k}=\breve{\alpha}_{k}+N_{k}$ are the parameters of the posterior.
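A minimal sketch of this conjugate update for the die example above (the uniform prior $\breve{\boldsymbol{\alpha}}=(1,\ldots,1)$ and the simulated rolls are assumptions made for illustration):

```python
import numpy as np
from scipy import stats

C = 6                                  # six-sided die
alpha_prior = np.ones(C)               # Dir(1,...,1): uniform prior on the simplex

rng = np.random.default_rng(0)
y = rng.integers(low=1, high=C + 1, size=100)   # N=100 simulated rolls, faces 1..6
N_c = np.bincount(y, minlength=C + 1)[1:]       # counts N_c for faces 1..C

alpha_post = alpha_prior + N_c         # conjugate update: alpha_hat_k = alpha_k + N_k
posterior = stats.dirichlet(alpha_post)

print("posterior parameters:", alpha_post)
print("posterior mean of theta:", posterior.mean())   # alpha_hat / sum(alpha_hat)
```

Note that the update is just "add the counts to the hyperparameters"; no integration is needed, which is exactly the point of conjugacy.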

The Gaussian model with known mean: $Y$ Gaussian, prior on $\sigma^2$: inverse Gamma distribution

We only discuss the distribution of the variance $\sigma^2$ when $\mu$ is known.

Likelihood:

If $\mu$ is a known constant, the likelihood for $\sigma^{2}$ has the form
$$p\left(\mathcal{D} \mid \sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-N / 2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right)$$
Note that we can no longer ignore the $(\sigma^{2})^{-N/2}$ term in front.

Prior:

The standard conjugate prior is the inverse Gamma distribution (the distribution of $1/X$ when $X$ is Gamma-distributed), given by
$$\mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right)=\frac{\breve{b}^{\breve{a}}}{\Gamma(\breve{a})}\left(\sigma^{2}\right)^{-(\breve{a}+1)} \exp \left(-\frac{\breve{b}}{\sigma^{2}}\right)$$

Posterior:
$$\begin{aligned} p(\sigma^{2} \mid \mathcal{D}) & \propto p(\mathcal{D} \mid \sigma^{2})\, p(\sigma^{2} \mid \breve{a}, \breve{b}) \\ & \propto (\sigma^{2})^{-\left(\frac{N}{2}+\breve{a}+1\right)} \exp \left(-\frac{\breve{b}+\frac{1}{2}\sum_{n=1}^{N}(y_{n}-\mu)^{2}}{\sigma^{2}}\right) \end{aligned}$$
i.e. $\mathrm{IG}\left(\frac{N}{2}+\breve{a},\; \breve{b}+\frac{1}{2}\sum_{n=1}^{N}(y_{n}-\mu)^{2}\right)$.

Multiplying the likelihood and the prior, we see that the posterior is also IG:
$$\begin{aligned} p\left(\sigma^{2} \mid \mu, \mathcal{D}\right) &=\mathrm{IG}\left(\sigma^{2} \mid \widehat{a}, \widehat{b}\right) \\ \widehat{a} &=\breve{a}+N / 2 \\ \widehat{b} &=\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2} \end{aligned}$$
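A minimal sketch of this update (the known mean, the IG(2, 1) prior, and the simulated data are made-up assumptions for the example; scipy parameterizes $\mathrm{IG}(a, b)$ as `invgamma(a, scale=b)`):

```python
import numpy as np
from scipy import stats

mu = 2.0                     # known mean
a_prior, b_prior = 2.0, 1.0  # IG(a, b) prior on sigma^2

rng = np.random.default_rng(0)
y = rng.normal(loc=mu, scale=1.5, size=50)   # N=50 observations, true sigma^2 = 2.25

# Conjugate update: a_hat = a + N/2, b_hat = b + (1/2) * sum (y_n - mu)^2
a_post = a_prior + len(y) / 2
b_post = b_prior + 0.5 * np.sum((y - mu) ** 2)

posterior = stats.invgamma(a_post, scale=b_post)
print("posterior mean of sigma^2:", posterior.mean())   # b_hat / (a_hat - 1)
```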
In general, however, we do not have a closed-form posterior, so we have to use approximate inference methods.

Bayesian model comparison

"All models are wrong, but some are useful." (George Box)


We assume we have a set of candidate models $\mathcal{M}$.

Objective: we want to choose the best model from the set $\mathcal{M}$.

We pick the model with the largest posterior probability given the data:
$$\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}}\, p(m \mid \mathcal{D})$$
where
$$p(m \mid \mathcal{D})=\frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m^{\prime} \in \mathcal{M}} p(\mathcal{D} \mid m^{\prime})\, p(m^{\prime})}$$
Here $m$ is a model and $\mathcal{D}$ is the data. The quantity $p(\mathcal{D} \mid m)$ is the marginal likelihood of the data under model $m$; it plays the same role as $p(\mathcal{D})$ in $p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}$.

If the prior over models is uniform, $p(m)=1 /|\mathcal{M}|$ (in the absence of additional information, we treat every model as equally likely), then the MAP model is given by
$$\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}}\, p(\mathcal{D} \mid m)$$
That is, we pick the model under which the observed data are most probable.

The quantity $p(\mathcal{D} \mid m)$ is given by
$$p(\mathcal{D} \mid m)=\int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d \boldsymbol{\theta}$$
where $\boldsymbol{\theta}$ denotes the parameters of model $m$.

Bayesian model averaging

If our goal is to perform prediction, we can get better results if we marginalize over all models, by computing
$$p(y \mid \mathbf{x}, \mathcal{D})=\sum_{m \in \mathcal{M}} p(y \mid \mathbf{x}, m)\, p(m \mid \mathcal{D})$$
Here $\mathcal{D}$ is the training data, $(\mathbf{x}, y)$ is a new data point, and the goal is to predict $y$.

To get better predictions, we can thus average over all models, weighting each model by its posterior probability given the data.

Disadvantage: computationally very expensive.
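A minimal sketch of model comparison and averaging for two hypothetical models of a coin, a fixed fair coin versus a coin with a Beta(1, 1) prior, where both marginal likelihoods are available in closed form (the counts are made up for illustration):

```python
import numpy as np
from scipy.special import betaln

# Data: N coin flips with N1 heads (made-up numbers for illustration).
N, N1 = 20, 15
N0 = N - N1

# Model 1: fair coin, theta fixed at 0.5  ->  p(D|m1) = 0.5^N
log_ml_m1 = N * np.log(0.5)

# Model 2: theta ~ Beta(1, 1)  ->  p(D|m2) = B(N1+1, N0+1) / B(1, 1)
log_ml_m2 = betaln(N1 + 1, N0 + 1) - betaln(1, 1)

# Uniform prior over models: p(m|D) is proportional to p(D|m)
log_ml = np.array([log_ml_m1, log_ml_m2])
post_m = np.exp(log_ml - log_ml.max())   # subtract max for numerical stability
post_m /= post_m.sum()
print("p(m|D):", post_m)

# Bayesian model averaging for the next flip:
# p(y=1|m1,D) = 0.5; p(y=1|m2,D) = posterior mean (N1+1)/(N+2)
p_next = np.array([0.5, (N1 + 1) / (N + 2)])
print("BMA p(y=1|D):", np.dot(post_m, p_next))
```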

Week 3

Variational Inference

Objective: we want to approximate $p(z \mid x)$, i.e., the posterior distribution of the latent variables (parameters) given the data.

In variational inference, we pick a family of distributions over the latent variables (the parameters of interest), each member indexed by its own variational parameters.

Our goal is to find the $q$ that minimizes the distance between $q$ and $p(z \mid x)$.

  • The measure of distance: the KL divergence, written as $\mathbb{KL}(q \| p)$. We then choose

    $$q^{*}=\underset{q}{\arg \min }\, \mathbb{KL}(q \| p)$$

Some background on the KL divergence

Entropy

The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability.

For example, suppose we observe a sequence of symbols $x_{n} \sim p$ generated from distribution $p$. If $p$ has high entropy, it will be hard to predict the value of each observation $x_{n}$.

  • The entropy of a discrete random variable $X$ with distribution $p$ over $K$ states is defined by

$$\mathbb{H}(X) \triangleq-\sum_{k=1}^{K} p(X=k) \log _{2} p(X=k)=-\mathbb{E}_{X}[\log p(X)]$$

The discrete distribution with maximum entropy is the uniform distribution. Hence for a $K$-ary random variable, the entropy is maximized if $p(x=k)=1 / K$; in this case, $\mathbb{H}(X)=\log _{2} K$.

Conversely, the distribution with minimum entropy (which is zero) is any delta-function that puts all its mass on one state. Such a distribution has no uncertainty.
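A minimal sketch verifying these two extremes numerically (the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

K = 4
print(entropy(np.full(K, 1 / K)))     # uniform: log2(K) = 2.0 bits (maximum)
print(entropy([1.0, 0.0, 0.0, 0.0]))  # delta function: 0 bits (minimum)
print(entropy([0.7, 0.1, 0.1, 0.1]))  # something in between
```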

  • Differential entropy for continuous random variables

If $X$ is a continuous random variable with pdf $p(x)$, we define the differential entropy as
$$h(X) \triangleq-\int_{\mathcal{X}} p(x) \log p(x)\, d x$$
assuming this integral exists.

Entropy of the uniform distribution

For example, suppose $X \sim U(0, a)$. Then
$$h(X)=-\int_{0}^{a} \frac{1}{a} \log \frac{1}{a}\, d x=\log a$$
Note that, unlike the discrete case, differential entropy can be negative. This is because pdfs can be bigger than 1. For example, if $X \sim U(0,1 / 8)$, we have $h(X)=\log _{2}(1 / 8)=-3$.

Entropy of the Gaussian distribution

The entropy of a $d$-dimensional Gaussian is
$$h(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}))=\frac{1}{2} \ln |2 \pi e \boldsymbol{\Sigma}|=\frac{1}{2} \ln \left[(2 \pi e)^{d}|\boldsymbol{\Sigma}|\right]=\frac{d}{2}+\frac{d}{2} \ln (2 \pi)+\frac{1}{2} \ln |\boldsymbol{\Sigma}|$$
In the 1d case, this becomes
$$h\left(\mathcal{N}\left(\mu, \sigma^{2}\right)\right)=\frac{1}{2} \ln \left[2 \pi e \sigma^{2}\right]$$
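A minimal sketch checking these formulas against scipy's built-in entropy (both in nats; the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

# 1d check: closed form (1/2) ln(2*pi*e*sigma^2) vs scipy
sigma = 1.5
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))
print(stats.norm(0, sigma).entropy())          # same value

# d-dimensional check: (1/2) ln((2*pi*e)^d |Sigma|)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
d = Sigma.shape[0]
print(0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma)))
print(stats.multivariate_normal(np.zeros(d), Sigma).entropy())
```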

Cross entropy

The cross entropy between distributions $p$ and $q$ is defined by
$$\mathbb{H}(p, q) \triangleq-\sum_{k=1}^{K} p_{k} \log q_{k}$$
One can show that the cross entropy is the expected number of bits needed to compress data samples drawn from distribution $p$ using a code based on distribution $q$. It is minimized by setting $q=p$.

Relative entropy (KL divergence)

Given two distributions $p$ and $q$, it is often useful to define a measure of how "close" or "similar" they are.

We consider a divergence measure $D(p, q)$, which quantifies how far $q$ is from $p$; here we focus on the Kullback-Leibler (KL) divergence.

  • Definition of the KL divergence:

For discrete distributions, the KL divergence is defined as follows:
$$\mathbb{KL}(p \| q) \triangleq \sum_{k=1}^{K} p_{k} \log \frac{p_{k}}{q_{k}}$$
This naturally extends to continuous distributions as well:
$$\mathbb{KL}(p \| q) \triangleq \int p(x) \log \frac{p(x)}{q(x)}\, d x$$

  • Interpretation:

We can rewrite the KL as follows:
$$\mathbb{KL}(p \| q)=\underbrace{\sum_{k=1}^{K} p_{k} \log p_{k}}_{-\mathbb{H}(p)} \underbrace{-\sum_{k=1}^{K} p_{k} \log q_{k}}_{\mathbb{H}(p, q)}$$
We recognize the first term as the negative entropy, and the second term as the cross entropy. Thus we can interpret the KL divergence as the "extra number of bits" you need to pay when compressing data samples from $p$ using the incorrect distribution $q$ as the basis of your coding scheme.

  • For example, one can show that the KL divergence between two multivariate Gaussian distributions is given by

$$\begin{aligned} &\mathbb{KL}\left(\mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1}\right) \| \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{2}\right)\right) \\ &=\frac{1}{2}\left[\operatorname{tr}\left(\boldsymbol{\Sigma}_{2}^{-1} \boldsymbol{\Sigma}_{1}\right)+\left(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1}\right)^{\top} \boldsymbol{\Sigma}_{2}^{-1}\left(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1}\right)-D+\log \left(\frac{\operatorname{det}\left(\boldsymbol{\Sigma}_{2}\right)}{\operatorname{det}\left(\boldsymbol{\Sigma}_{1}\right)}\right)\right] \end{aligned}$$

In the scalar case, this becomes
$$\mathbb{KL}\left(\mathcal{N}\left(x \mid \mu_{1}, \sigma_{1}\right) \| \mathcal{N}\left(x \mid \mu_{2}, \sigma_{2}\right)\right)=\log \frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}-\frac{1}{2}$$
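A minimal sketch checking the scalar formula against a Monte Carlo estimate of $\mathbb{E}_{p}[\log p(x)-\log q(x)]$ (the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

def kl_gauss_1d(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, from the closed form above."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
print(kl_gauss_1d(mu1, s1, mu2, s2))

# Monte Carlo check: KL(p||q) = E_p[log p(x) - log q(x)]
rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, size=1_000_000)
mc = np.mean(stats.norm(mu1, s1).logpdf(x) - stats.norm(mu2, s2).logpdf(x))
print(mc)   # agrees up to Monte Carlo error
```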

  • Theorem 6.2.1 (Information inequality). $\mathbb{KL}(p \| q) \geq 0$ with equality iff $p=q$.

    Proof. We follow [CT06, p.28]. Let $A=\{x: p(x)>0\}$ be the support of $p(x)$. Using the concavity of the log function and Jensen's inequality, we have
$$\begin{aligned} -\mathbb{KL}(p \| q) &=-\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}=\sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \\ & \leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}=\log \sum_{x \in A} q(x) \\ & \leq \log \sum_{x \in \mathcal{X}} q(x)=\log 1=0 \end{aligned}$$

  • Two properties of the KL divergence: it is nonnegative, and it is asymmetric ($\mathbb{KL}(p \| q) \neq \mathbb{KL}(q \| p)$ in general), so it is not a true distance metric.

Variational inference

$$\mathbb{KL}(q \| p)=\mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z \mid x)}\right]$$

$q(z)$: a distribution that approximates $p(z \mid x)$; $q(z)$ is unknown and is what we optimize over.
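A minimal sketch of this idea in one dimension, under made-up assumptions: the unnormalized target $\tilde{p}(z) \propto p(z \mid x)$ is an arbitrary two-component Gaussian mixture, $q$ is restricted to a Gaussian family, and the expectation is computed by brute force on a grid rather than by the stochastic optimization used in practice:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Unnormalized target p~(z), standing in for p(z|x) (made up for illustration).
def log_p_tilde(z):
    return np.logaddexp(stats.norm(-1.0, 0.5).logpdf(z) + np.log(0.7),
                        stats.norm(1.5, 0.8).logpdf(z) + np.log(0.3))

z = np.linspace(-6, 6, 2001)           # grid for numerical integration
dz = z[1] - z[0]

def objective(params):
    m, log_s = params                  # variational parameters of q = N(m, s^2)
    log_q = stats.norm(m, np.exp(log_s)).logpdf(z)
    q = np.exp(log_q)
    # E_q[log q(z) - log p~(z)] differs from KL(q||p) only by the constant
    # log Z, so it has the same minimizer (this is the negative ELBO).
    return np.sum(q * (log_q - log_p_tilde(z))) * dz

res = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(f"best Gaussian q: mean={m_opt:.3f}, sd={s_opt:.3f}")
```

Because we minimize the reverse KL $\mathbb{KL}(q \| p)$, the fitted Gaussian tends to lock onto one mode of the mixture rather than cover both.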

Week 4
