In Ian Goodfellow's Deep Learning Book there is the following passage:
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
http://www.deeplearningbook.org/contents/mlp.html p.175
The first time I read this I was puzzled: I felt I had never seen a negative cross-entropy. We know that cross-entropy can be written as the sum of the entropy and the KL divergence:

$$H(p, q) = H(p) + KL(p\|q)$$
where KL(p||q) is non-negative. The proof is simple:

$$KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum_x p(x)\log\frac{q(x)}{p(x)} \ge -\log\left(\sum_x p(x)\,\frac{q(x)}{p(x)}\right) = -\log\sum_x q(x) = 0$$

The ≥ in the middle comes from Jensen's inequality, since log is concave.
As for H(p), I vaguely remembered that entropy should also be non-negative:

$$H(p) = -\sum_x p(x)\log p(x) = \sum_x p(x)\log\frac{1}{p(x)}$$

Since p is a probability, every p(x) lies in [0, 1], so log(1/p(x)) is non-negative, and therefore H(p) ≥ 0.
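To make both facts concrete, here is a small numerical check (the two distributions are arbitrary examples generated with NumPy): for discrete distributions p and q, both H(p) and KL(p||q) come out non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over 5 outcomes.
p = rng.random(5)
p /= p.sum()
q = rng.random(5)
q /= q.sum()

# H(p) = sum_x p(x) * log(1 / p(x)) >= 0, since every p(x) <= 1.
H_p = np.sum(p * np.log(1.0 / p))

# KL(p||q) = sum_x p(x) * log(p(x) / q(x)) >= 0, by Jensen's inequality.
KL_pq = np.sum(p * np.log(p / q))

print(H_p, KL_pq)  # both non-negative
```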
By this reasoning the cross-entropy must be non-negative, so how could it possibly be negative infinity?
Let me reread the sentence above that mentions negative infinity.
For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
The key turns out to be real-valued output variables. For a continuous random variable, the entropy has to be written in integral form:

$$H(p) = -\int p(x)\log p(x)\,dx$$

Here p becomes a probability density, whose range is [0, +∞), so this integral no longer has a guaranteed lower bound. In the extreme case where p(x) is a Dirac delta function, the integral is negative infinity.
Therefore, for continuous random variables, the entropy can be negative.
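A quick sanity check: the differential entropy of a Gaussian N(μ, σ²) has the closed form ½ log(2πeσ²), which turns negative once σ drops below 1/√(2πe) ≈ 0.242. A minimal sketch:

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(gaussian_entropy(1.0))   # ~ 1.419, positive
print(gaussian_entropy(0.01))  # ~ -3.186, negative: a sharply peaked density
```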
Now look at the definition of cross-entropy:

$$H(p, q) = -\int p(x)\log q(x)\,dx$$

If we likewise let q(x) be a Dirac delta function, this cross-entropy also becomes negative infinity.
This is exactly what the second half of the passage above is saying:
it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
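This can be seen numerically with a sketch of the scenario the book describes (the data and σ values here are arbitrary illustrations): a Gaussian output model centers its mean on each training target, a perfect fit, and then learns an ever-smaller variance. The average negative log-likelihood, which is what the cross-entropy loss estimates, drops without bound as σ → 0.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # arbitrary real-valued training targets

def gaussian_nll(x, mu, sigma):
    """-log q(x) for q = N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

# The model predicts each target exactly (mu = y) and shrinks sigma, so the
# loss reduces to 0.5 * log(2*pi*sigma^2), which tends to negative infinity.
for sigma in [1.0, 0.1, 1e-4]:
    print(sigma, gaussian_nll(y, y, sigma).mean())
```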
So our conclusions are:
- For discrete random variables, the cross-entropy is non-negative. If your classification setup uses softmax + cross_entropy_loss and produces a negative loss, something was definitely computed wrong.
- For continuous random variables, the cross-entropy can be negative.
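As a final check on the first bullet, a minimal softmax + cross-entropy implementation (a sketch, not any particular framework's API) can never return a negative loss, because the softmax probability of the true class lies in (0, 1):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """-log softmax(logits)[label]; non-negative since the probability <= 1."""
    z = logits - logits.max()            # shift logits for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([2.0, -1.0, 0.5])
for label in range(3):
    print(label, softmax_cross_entropy(logits, label))  # all >= 0
```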