In Ian Goodfellow's Deep Learning Book there is the following passage:
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
http://www.deeplearningbook.org/contents/mlp.html p.175
The first time I read this I was puzzled: I felt I had never seen a negative cross-entropy. We know that cross-entropy can be written as the sum of the entropy and the KL divergence:

$$H(p, q) = H(p) + KL(p\|q)$$
where KL(p||q) is non-negative. The proof is simple:

$$KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum_x p(x)\log\frac{q(x)}{p(x)} \ge -\log\left(\sum_x p(x)\,\frac{q(x)}{p(x)}\right) = -\log\sum_x q(x) = 0$$

The ≥ in the middle comes from Jensen's inequality, since log is concave.
As for H(p), I vaguely remembered that entropy should also be non-negative:

$$H(p) = -\sum_x p(x)\log p(x) = \sum_x p(x)\log\frac{1}{p(x)}$$

Since p is a probability, every p(x) lies in [0, 1], so log(1/p(x)) is non-negative, and therefore H(p) ≥ 0.
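To make both facts concrete, here is a small numerical check (the two distributions are arbitrary examples generated with NumPy): for discrete distributions p and q, both H(p) and KL(p||q) come out non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over 5 outcomes.
p = rng.random(5)
p /= p.sum()
q = rng.random(5)
q /= q.sum()

# H(p) = sum_x p(x) * log(1 / p(x)) >= 0, since every p(x) <= 1.
H_p = np.sum(p * np.log(1.0 / p))

# KL(p||q) = sum_x p(x) * log(p(x) / q(x)) >= 0, by Jensen's inequality.
KL_pq = np.sum(p * np.log(p / q))

print(H_p, KL_pq)  # both non-negative
```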
By this reasoning the cross-entropy must be non-negative, so how could it possibly be negative infinity?
Let me reread the sentence above that mentions negative infinity.
For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
The key turns out to be real-valued output variables. For a continuous random variable, the entropy has to be written in integral form:

$$H(p) = -\int p(x)\log p(x)\,dx$$

Here p becomes a probability density, whose range is [0, +∞), so this integral no longer has a guaranteed lower bound. In the extreme case where p(x) is a Dirac delta function, the integral is negative infinity.
Therefore, for continuous random variables, the entropy can be negative.
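A quick sanity check: the differential entropy of a Gaussian N(μ, σ²) has the closed form ½ log(2πeσ²), which turns negative once σ drops below 1/√(2πe) ≈ 0.242. A minimal sketch:

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(gaussian_entropy(1.0))   # ~ 1.419, positive
print(gaussian_entropy(0.01))  # ~ -3.186, negative: a sharply peaked density
```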
Now look at the definition of cross-entropy:

$$H(p, q) = -\int p(x)\log q(x)\,dx$$

If we likewise let q(x) be a Dirac delta function, this cross-entropy also becomes negative infinity.
This is exactly what the second half of the passage above is saying:
it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
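This can be seen numerically with a sketch of the scenario the book describes (the data and σ values here are arbitrary illustrations): a Gaussian output model centers its mean on each training target, a perfect fit, and then learns an ever-smaller variance. The average negative log-likelihood, which is what the cross-entropy loss estimates, drops without bound as σ → 0.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # arbitrary real-valued training targets

def gaussian_nll(x, mu, sigma):
    """-log q(x) for q = N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

# The model predicts each target exactly (mu = y) and shrinks sigma, so the
# loss reduces to 0.5 * log(2*pi*sigma^2), which tends to negative infinity.
for sigma in [1.0, 0.1, 1e-4]:
    print(sigma, gaussian_nll(y, y, sigma).mean())
```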
So our conclusions are:
- For discrete random variables, the cross-entropy is non-negative. If your classification setup uses softmax + cross_entropy_loss and produces a negative loss, something was definitely computed wrong.
- For continuous random variables, the cross-entropy can be negative.
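As a final check on the first bullet, a minimal softmax + cross-entropy implementation (a sketch, not any particular framework's API) can never return a negative loss, because the softmax probability of the true class lies in (0, 1):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """-log softmax(logits)[label]; non-negative since the probability <= 1."""
    z = logits - logits.max()            # shift logits for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([2.0, -1.0, 0.5])
for label in range(3):
    print(label, softmax_cross_entropy(logits, label))  # all >= 0
```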