About the KL Distance (KL Divergence)

Author: 覃含章
Link: https://www.zhihu.com/question/29980971/answer/103807952
Source: Zhihu
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.

KL divergence was originally introduced in information theory; since the question asks about its use in ML, I will not go into much detail on that side. In brief: given the true probability distribution P and an approximating distribution Q, D(P||Q) is the number of extra bits needed, per sample drawn from P, when the sample is stored using an optimal compression scheme designed for Q, compared with storing it using an optimal compression scheme designed for P. This coding interpretation comes from the Kraft-McMillan theorem.
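In formula form (my own rendering, not part of the original answer), this "extra bits" reading is just the identity "cross-entropy minus entropy": the average code length for samples from P under a code built for Q, minus the average code length under a code built for P, equals the KL divergence in bits.

```latex
D_{\mathrm{KL}}(P \,\|\, Q)
= \Big(-\sum_x p(x)\log_2 q(x)\Big) - \Big(-\sum_x p(x)\log_2 p(x)\Big)
= \sum_x p(x)\log_2 \frac{p(x)}{q(x)} .
```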

It is therefore natural to use it as a statistical distance, because of this built-in probabilistic meaning. However, for exactly the same reason, the asymmetry the asker mentions is unavoidable: D(P||Q) and D(Q||P) answer "distance" questions under two different compression schemes.

As for statistical distances in general: they are not fundamentally different from one another. More broadly, KL divergence can be viewed as a special case of the phi-divergences, obtained by taking φ(t) = t log t. Note that the definition below is stated for discrete probability distributions, but replacing the sum with an integral gives the continuous version in the obvious way.
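For completeness, here is my rendering of the standard discrete-case definition (following the convention of the Bayraksan-Love reference cited below):

```latex
D_{\phi}(P \,\|\, Q) \;=\; \sum_{x} q(x)\,\phi\!\left(\frac{p(x)}{q(x)}\right),
\qquad \phi \text{ convex},\ \phi(1)=0,
```

and taking φ(t) = t log t recovers D_φ(P||Q) = Σ_x p(x) log(p(x)/q(x)) = KL(P||Q).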

In principle there is no essential difference in working with other divergences, as long as φ is convex and closed. They all carry similar probabilistic meanings; for example, Pinsker's inequality guarantees that the total variation metric is tightly controlled by the KL divergence, and other divergences admit similar bounds, differing at most in the order and the constants. Moreover, minimization problems defined with any of these divergences are convex; the computational performance can differ in practice, though, which is one reason KL remains the most widely used.
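As a quick sanity check of the Pinsker bound mentioned above, the sketch below (my own, with arbitrary random discrete distributions) verifies TV(P, Q) ≤ sqrt(KL(P||Q)/2):

```python
# Numeric check of Pinsker's inequality: TV(P,Q) <= sqrt(KL(P||Q)/2).
# The distributions are random and purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    kl = np.sum(p * np.log(p / q))      # KL(P||Q) in nats
    tv = 0.5 * np.sum(np.abs(p - q))    # total variation distance
    assert tv <= np.sqrt(kl / 2)
    print(f"TV = {tv:.3f}  <=  sqrt(KL/2) = {np.sqrt(kl / 2):.3f}")
```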

Reference: Bayraksan G, Love DK. Data-Driven Stochastic Programming Using Phi-Divergences.

Author: Zhihu user
Link: https://www.zhihu.com/question/29980971/answer/93489660
Source: Zhihu
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.

KL divergence KL(p||q), in the context of information theory, measures the number of extra bits (or nats) needed to describe samples from the distribution p with a coding scheme based on q instead of on p itself. From the Kraft-McMillan theorem, we know that a coding scheme for values from a set X can be represented as q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits.
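A small numeric illustration of this coding interpretation (my own sketch; the two distributions are arbitrary): with idealized code lengths l_i = -log2 q(x_i), the expected extra code length for samples drawn from p equals KL(p||q) measured in bits.

```python
# Extra-bits reading of KL: encoding samples from p with an optimal code
# built for q costs, on average, KL(p||q) extra bits per sample.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # distribution the code was built for

l_q = -np.log2(q)   # idealized code lengths from q(x_i) = 2^(-l_i)
l_p = -np.log2(p)   # idealized code lengths of the optimal code for p

expected_extra_bits = np.sum(p * (l_q - l_p))   # E_p[l_q(x) - l_p(x)]
kl_bits = np.sum(p * np.log2(p / q))            # KL(p||q) in bits

print(expected_extra_bits, kl_bits)             # both equal 0.25 bits here
```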

We know that the KL divergence is also the relative entropy between two distributions, and that gives some intuition as to why it's used in variational methods. Variational methods use functionals in their objective functions (e.g., the entropy of a distribution is a functional: it takes in a distribution and returns a scalar quantity). The KL divergence is interpreted as the "loss of information" when one distribution is used to approximate another, which is desirable in machine learning because, in models that involve dimensionality reduction, we would like to preserve as much information about the original input as possible. This is most obvious in VAEs, which use the KL divergence between the posterior q and the prior p over the latent variable z. Likewise, you can refer to EM, where we decompose

ln p(X) = L(q) + KL(q||p)

Here L(q) is a lower bound on ln p(X); we maximize this bound by minimizing the KL divergence, which becomes 0 when q(Z) = p(Z|X). In many cases, however, we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so that we can optimize with respect to w.
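The decomposition can be checked numerically on a toy model. The sketch below (my own, with an arbitrary 3-state latent variable Z and a 2-valued observation X) verifies that ln p(X) = L(q) + KL(q||p(Z|X)) holds for any choice of q:

```python
# Numeric check of ln p(X) = L(q) + KL(q || p(Z|X)) on a tiny discrete model.
import numpy as np

rng = np.random.default_rng(0)

p_z = rng.dirichlet(np.ones(3))                   # prior p(Z)
p_x_given_z = rng.dirichlet(np.ones(2), size=3)   # likelihood p(X|Z)

x_obs = 0                                # observed value of X
joint = p_z * p_x_given_z[:, x_obs]      # p(X=x_obs, Z)
p_x = joint.sum()                        # evidence p(X=x_obs)
post = joint / p_x                       # true posterior p(Z|X=x_obs)

q = rng.dirichlet(np.ones(3))            # an arbitrary variational q(Z)

L_q = np.sum(q * np.log(joint / q))      # ELBO: E_q[ln p(X,Z) - ln q(Z)]
kl_q_post = np.sum(q * np.log(q / post)) # KL(q || p(Z|X))

print(np.log(p_x), L_q + kl_q_post)      # the two numbers agree
```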

Note that KL(p||q) = - \sum p(Z) ln (q(Z) / p(Z)), so KL(p||q) is different from KL(q||p). This asymmetry can be exploited: when we want to learn a distribution q that covers (over-compensates for) p, we minimize KL(p||q); conversely, when we want q to pick out just the main component of p, we minimize KL(q||p). The example in Bishop's PRML illustrates this well, and the sketch below reproduces the same behavior numerically.
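The following sketch is my own (not the figure from Bishop's book): fitting a single Gaussian q to a bimodal target p by grid search, the forward KL(p||q) solution is broad and mass-covering, while the reverse KL(q||p) solution is narrow and locks onto one mode.

```python
# Forward vs. reverse KL when approximating a bimodal p with a single Gaussian q.
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: mixture of two well-separated Gaussians.
p = 0.5 * gauss(x, -4.0, 1.0) + 0.5 * gauss(x, 4.0, 1.0)
p /= p.sum() * dx

def kl(a, b):
    eps = 1e-12                                   # guard against division by zero
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

best_fwd, best_rev = None, None
for mu in np.linspace(-6, 6, 61):                 # crude grid search over (mu, sigma)
    for sigma in np.linspace(0.5, 6, 56):
        q = gauss(x, mu, sigma)
        q /= q.sum() * dx
        f, r = kl(p, q), kl(q, p)
        if best_fwd is None or f < best_fwd[0]:
            best_fwd = (f, mu, sigma)
        if best_rev is None or r < best_rev[0]:
            best_rev = (r, mu, sigma)

# Forward KL is mass-covering: mu near 0, large sigma spanning both modes.
# Reverse KL is mode-seeking: mu near +/-4, sigma near 1 (a single mode).
print("argmin KL(p||q):", best_fwd)
print("argmin KL(q||p):", best_rev)
```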

KL divergence belongs to an alpha-family of divergences, in which the forward and reverse KL arise as opposite limits of the parameter alpha. When alpha = 0 the divergence becomes symmetric and is closely related to the (squared) Hellinger distance. There are other symmetric divergences, such as the Cauchy-Schwarz divergence, but in machine learning settings, where the goal is to learn simpler, tractable parameterizations of distributions that approximate a target, they may not be as useful as KL.
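For reference (my addition; conventions differ across authors), in the convention used in Bishop's PRML the alpha-family can be written as

```latex
D_{\alpha}(p \,\|\, q) \;=\; \frac{4}{1-\alpha^{2}}
\left( 1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\, dx \right),
```

with KL(p||q) and KL(q||p) recovered in the limits α → 1 and α → -1, and α = 0 giving a symmetric quantity proportional to the squared Hellinger distance.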

KL divergence (KL散度) is a measure of the difference between two probability distributions. It is always greater than or equal to 0, and it equals 0 if and only if the two distributions are identical. It is asymmetric, i.e. DKL(P||Q) ≠ DKL(Q||P), and it does not satisfy the triangle inequality, so it is not a true distance metric.

Its formula is DKL(P||Q) = Σ P(x) * log(P(x) / Q(x)), where P and Q are the two probability distributions and x ranges over the events of the distribution. The quantity can be read as the information loss incurred, i.e. the extra cost paid, when Q is used to approximate P.

In machine learning, KL divergence is frequently used to measure the discrepancy between two probability distributions, for example in probabilistic generative models and in information retrieval. In PyTorch it can be computed with F.kl_div(); the function's signature is F.kl_div(input, target, size_average=None, reduce=None, reduction='mean'), where input and target are the input and target tensors (input is expected to hold log-probabilities).

In summary, KL divergence measures the difference between two probability distributions; it is asymmetric and is not a distance metric. In machine learning it is commonly used to measure the discrepancy between a model's output distribution and the true distribution.
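A minimal usage sketch of F.kl_div (mine, not from the quoted post). F.kl_div expects input to contain log-probabilities and target to contain probabilities (unless log_target=True), and it computes KL(target || input); reduction='batchmean' matches the mathematical definition averaged over the batch.

```python
# Minimal PyTorch example of computing KL(P || Q) with F.kl_div.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)                     # e.g. raw model outputs
log_q = F.log_softmax(logits, dim=-1)          # approximate distribution Q, in log-space
p = torch.softmax(torch.randn(4, 5), dim=-1)   # target distribution P

kl = F.kl_div(log_q, p, reduction='batchmean') # KL(P || Q), averaged over the batch
print(kl)
```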
